Corpora of Computer-Mediated Communication

This page provides supplementary information to the following article:

Michael Beißwenger & Angelika Storrer (in press): Corpora of Computer-Mediated Communication. To appear in: Anke Lüdeling & Merja Kytö (eds): Corpus Linguistics. An International Handbook. (Series: Handbücher zur Sprach- und Kommunikationswissenschaft/Handbooks of Linguistics and Communication Science). Mouton de Gruyter, Berlin. (PDF preprint)

Types of CMC corpora

1) Project-related corpora of raw data (examples)

2) Corpora of raw data for general use (examples)

3) Project-related annotated corpora (examples)

4) Annotated corpora for general use (examples)

Types of CMC corpora

  • Project-related corpora were compiled as an empirical basis for the research questions of a particular project.
  • Corpora for general use do not pertain directly to a particular project; they were established as a data pool for investigating a range of potential research questions.
  • Corpora of raw data have been left in the condition in which they were originally acquired from the Internet.
  • In annotated corpora, the data have been subjected to annotation processes (e.g. an SGML/XML-based annotation of data segments that may be relevant for purposes of analysis).
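The difference between raw and annotated data can be illustrated with a minimal sketch. The chat line, the element names (message, emoticon), the nickname attribute and the emoticon pattern below are all hypothetical and are not taken from any of the corpora listed on this page; the sketch only shows what it can mean to add XML markup to segments that may be relevant for analysis.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical raw chat line, as it might appear in an unprocessed log file.
RAW_LINE = "<anna> hello bob :-) see you later ;-)"

# Deliberately tiny, hypothetical emoticon pattern; real schemes cover far more.
EMOTICON = re.compile(r"[:;]-?[()DP]")


def annotate(raw_line: str) -> ET.Element:
    """Wrap a raw '<nick> text' chat line in illustrative XML markup."""
    match = re.fullmatch(r"<(\w+)>\s*(.*)", raw_line)
    if match is None:
        raise ValueError(f"unrecognised chat line: {raw_line!r}")
    nick, text = match.groups()

    message = ET.Element("message", {"nickname": nick})
    pos = 0
    last = None  # most recently added <emoticon> element, if any
    for m in EMOTICON.finditer(text):
        chunk = text[pos:m.start()]
        if last is None:
            # Text before the first emoticon belongs to the parent element.
            message.text = chunk
        else:
            # Text between emoticons is stored as the tail of the previous one.
            last.tail = chunk
        last = ET.SubElement(message, "emoticon")
        last.text = m.group()
        pos = m.end()

    rest = text[pos:]
    if last is None:
        message.text = rest
    else:
        last.tail = rest
    return message


annotated = annotate(RAW_LINE)
print(ET.tostring(annotated, encoding="unicode"))
```

With markup of this kind, a query such as annotated.findall("emoticon") retrieves every marked segment directly, whereas a corpus of raw data would have to be searched with patterns each time.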

1) Project-related corpora of raw data (examples)

  • CoSy:50 Corpus (Simeon Yates): 50 submissions from 152 computer conferences (see Yates, Simeon J. 1996, Oral and written linguistic aspects of computer conferencing, in: Herring, Susan C. (ed), Computer-Mediated Communication. Linguistic, Social and Cross-Cultural Perspectives. Amsterdam/Philadelphia, pp. 29-46)
  • Swiss German webchat corpora (Beat Siebenhaar) (see e.g. Siebenhaar, Beat, Die dialektale Verankerung regionaler Chats in der deutschsprachigen Schweiz, in: Eggers, Eckhard/Stellmacher, Dieter/Schmidt, Jürgen Erich (eds): Tagungsband IGDD-Kongress Marburg. Stuttgart)
  • Contrastive German-Swedish IRC-Corpus (Christiane Pankow) (see e.g. Pankow, Christiane 2003, Zur Darstellung nonverbalen Verhaltens in deutschen und schwedischen IRC-Chats. Eine Korpusuntersuchung, in: Linguistik online 15)

2) Corpora of raw data for general use (examples)

  • The Netscan Usenet database: http://netscan.research.microsoft.com/
  • The Enron Email Dataset: http://www-2.cs.cmu.edu/~enron/ (> 0.5 million business-related e-mail messages)
  • The SpamAssassin Public Corpus: http://spamassassin.apache.org/publiccorpus/ (approx. 6,000 e-mail messages; Apache SpamAssassin Project)
  • Korpus deutschsprachiger Newsgroups (see Feldweg, Helmut/Kibiger, Ralf/Thielen, Christine 1995, Zum Sprachgebrauch in deutschen Newsgruppen, in: Schmitz, Ulrich (ed), Neue Medien (Osnabrücker Beiträge zur Sprachtheorie 50), 143-154)
  • WWE-2006 weblog dataset: this corpus of weblog posts was temporarily available to the participants of the 3rd Annual Workshop on the Weblogging Ecosystem (see http://www.blogpulse.com/www2006-workshop/)

3) Project-related annotated corpora (examples)

  • E-Mail corpus from the COSMA project: 160 e-mail messages with appointment arrangements (see Declerck, Thierry/Klein, Judith 1997, Ein Email-Korpus zur Entwicklung und Evaluierung der Analysekomponente eines Terminvereinbarungssystems)
  • Website corpus from the Hypnotic project (see Rehm, Georg, 2001, Hypertextsorten: Definition, Struktur, Klassifikation)

4) Annotated corpora for general use (examples)

  • Düsseldorf CMC Corpus: corpus project at Düsseldorf University (Dieter Stein, http://www.phil-fak.uni-duesseldorf.de/anglist3/); no online access
  • Dortmunder Chat-Korpus: http://www.chatkorpus.uni-dortmund.de (with online access to a selection of XML-annotated data)