8 June, 2005 at 12:53
Statistical natural language processing and corpus-based computational linguistics: An annotated list of resources
Contents
Tools: Taggers, Parsers, NER, NP chunking, Language models, Concordances, Summarization, Other
Corpora: Large collections, Particular languages, Treebanks, Discourse, WSD, Literature, Acquisition
SGML/XML
Dictionaries
Lexical/morphological resources
Courses, Syllabi, and other Educational Resources
Mailing lists
Other stuff on the Web: General, IR, IE/Wrappers, People, Societies
Tools
Part of Speech Taggers
Freely downloadable
Stanford POS tagger
- Loglinear tagger in Java (by Kristina Toutanova)
TreeTagger
- A decision tree based tagger from the University of Stuttgart (Helmut Schmid). It’s language independent, but comes complete with parameter files for English, German, French (and Old French), and Italian. (Solaris and Linux versions.) Usable online here. Used at VISL.
SVMTool
- POS Tagger based on SVMs (uses SVMlight). LGPL.
Maximum Entropy part of speech tagger
- By Adwait Ratnaparkhi. Java version downloadable. A sentence boundary detector is also included. Now works with JDK1.3+. Class files, not source.
ACOPOST (formerly ICOPOST)
- Open source C taggers originally written by Ingo Schröder. Implements maximum entropy, HMM trigram, and transformation-based learning. C source available under GNU public license.
fnTBL
- A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.
mu-TBL
- An implementation of a Transformation-based Learner (a la Brill), usable for POS tagging and other things by Torbjörn Lager. Web demo also available. Prolog.
YamCha
- SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
QTAG Part of speech tagger
- An HMM-based Java POS tagger from Birmingham U. (Oliver Mason). English and German parameter files. [Java class files, not source.]
The TOSCA/LOB tagger.
- Currently available for MS-DOS only. But the decision to make this famous system available is very interesting from an historical perspective, and for software sharing in academia more generally. LOB tag set.
Brill’s Transformation-based learning Tagger
- A symbolic C tagger.
Original Xerox Tagger
- A Common Lisp HMM tagger available by ftp.
Lingua-EN-Tagger
- Perl POS tagger by Maciej Ceglowski and Aaron Coburn. Version 0.11. (A bigram HMM tagger.)
Free, but require registration
LT POS and LT TTT
- Edinburgh Language Technology Group tagger and text tokenizer (and sentence splitter). Binary only for Solaris. Doesn’t allow you to train your own taggers.
TATOO, The ISSCO tagger.
- HMM tagger. Need to register to download.
PoSTech Korean morphological analyzer and tagger
- Online registration.
TnT – A Statistical Part-of-Speech Tagger
- Trainable for various languages, comes with English and German pre-compiled models. Runs on Solaris and Linux.
Usable by email or on the web, but not distributed freely
Memory-based tagger
- From ILK group, Catholic University Brabant (Jakub Zavrel/Walter Daelemans). Does Dutch, English, Spanish, Swedish, Slovene. Other MBL demos are also available.
Birmingham tagger
- Accepts only plain ASCII email message contents. The tagset used is similar to the Brown/LOB/Penn set.
CLAWS tagger
- The UCREL CLAWS tagger is available for trial use on the web. (It’s limited to 300 words though — this site is more of an advertisement for licensing the real thing — available as software for Suns or as a paid service.) You can also find info on CLAWS tagsets, though that page doesn’t seem to link to the C7 tagset.
The AMALGAM tagger
- The AMALGAM Project also has various other useful resources, in particular a web guide to different tag sets in common use. The tagging is actually done by a (retrained) version of the Brill tagger (q.v.).
Xerox XRCE MLTT Part Of Speech Taggers
- Tags any of 14 languages (European and Arabic), online on the web.
Portuguese taggers on the web: Projecto Natura and a QTAG adaptation.
Not free
Lingsoft
- Lingsoft in Finland has (symbolic) analysis tools for many European languages. More information can be obtained by emailing info@lingsoft.fi. There is an online demo.
Conexor
- Conexor in Finland has demonstrations of EngCG-style taggers and parsers, for English, Swedish, and Spanish.
Xerox
- Xerox has morphological analyzers and taggers for many languages. There are demos of some of their tools on the web. More information can be obtained by contacting Daniella Russo.
Infogistics
- Infogistics, an Edinburgh spinoff, has a tagging and NP/Verb group chunker available commercially, including an evaluation version.
Parsers
Information on available probabilistic parsers can be found on the FSNLP: probabilistic parsing links page.
Named Entity Recognition
Downloadable
LingPipe
- Tools include statistical named-entity recognition, a heuristic sentence boundary detector, and a heuristic within-document coreference resolution engine. Java. GPL. By Bob Carpenter, Breck Baldwin and co.
YamCha
- SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
NP chunking
Downloadable
YamCha
- SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
fnTBL
- A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.
Language modeling toolkits
Downloadable
Downloadable, but requires registration
The SRI Language Modeling toolkit
- by Andreas Stolcke is another good system for building language models, freely available for research purposes.
Not yet classified
Lextools is a package of tools for creating weighted finite-state transducers (WFST) from high-level linguistic descriptions. Lextools binaries are available free for non-commercial use at: http://www.research.att.com/sw/tools/lextools/. Supported platforms are: linux (i686), sgi (mips2) and sun4. Lextools is built on top of, and requires, the AT&T WFST toolkit (version 3.6), available free for non-commercial use from: http://www.research.att.com/sw/tools/fsm/
Friendly concordancing and text analysis tools
Wordsmith Tools (Mike Scott)
- The thing to get if you are working in the Windows world.
Text summarization tools
A prototype Java Summarisation applet (System Quirk)
MEAD
- A public domain portable multi-document summarization system. (Dragomir Radev and others.)
Other
Downloadable
NLTK
- An open source Python package for NLP application development (Ed Loper and Steven Bird).
Ted Pedersen’s code
- Ngram Statistics Package: Perl code that implements: Fisher’s exact test, the likelihood ratio, Pearson’s chi squared test, the Dice Coefficient, and Mutual Information; Duluth Senseval-2 word sense disambiguation systems; Senseval-1 data in Senseval-2 format; various other WSD datasets in Senseval formats, and semantic distances derived via WordNet.
ISIP tools
- The main aim is a publicly available speech recognition system (alpha release available), but along the way there are also toolkits for discrete HMMs and statistical decision trees, and for various aspects of signal processing.
Mem. A Perl implementation of Generalized and Improved Iterative Scaling
- by Hugo WL ter Doest.
Automorphology
- A system (for Windows) for automatically learning the morphological forms of words in a corpus by John Goldsmith.
Wordnet
- Wordnet is available by ftp, compiled for a variety of machine types. For money, one can also get EuroWordNet for various European languages, an Italian/English/Spanish MultiWordNet and there’s now a site for Global Wordnet. (See also Mappings between WordNet versions and Perl WordNet-Similarity module by Ted Pedersen, and WordNet Domains (coarse-grained sense topic classifications).)
Penn XTAG project
- A wide-coverage tree-adjoining grammar written in a mixture of C and Common Lisp. Also includes a large coverage morphological analyzer. Now includes more tools such as TCL/Tk tree viewer.
Dan Melamed’s Tools
- A collection of tools including a simulated annealing program, a post-processor for English stemming for the Penn XTAG morphology system, Good-Turing smoothing software, general text processing tools, text statistics tools and bitext geometry tools (mainly written in Perl 5).
MULTEXT
- Constructing corpora and tools for processing multilingual corpora. Contact: Jean Veronis (veronis@univ-aix.fr). Some stuff including a multilingual text editor is downloadable. MULTEXT EAST has parallel versions of Orwell’s 1984 available free (upon registration) for a number of Central European languages.
Naive Bayes algorithm
- Software from the Rainbow/Libbow software package that implements several algorithms for text categorization, including naive Bayes, TF.IDF, and probabilistic algorithms. Accompanies Tom Mitchell’s ML text.
HDDI
- Text Data Mining API from Lehigh University.
Emdros: a text database engine for linguistic analysis and research
Chasen
- Japanese morphological analyzer. Descendant of JUMAN.
Free, but require registration
Stuttgart’s IMS Corpus Workbench (CWB)
- A workbench for full-text retrieval from large corpora (with a query language and corpus indexing). Includes the Corpus Query Processor (CQP) and xkwic. Available free for research groups (currently only as Solaris 1/2 or Linux binaries), on signing a license agreement.
Gate
- University of Sheffield’s General Architecture for Text Engineering. Primarily an Information Extraction system.
MITRE’s Alembic Workbench
- A workbench for the development of tagged corpora. Includes a tagger based on Brill’s TBL approach.
SNoW
- SNoW is a learning program that can be used as a general purpose multi-class classifier and is specifically tailored for learning in the presence of a very large number of features. The learning architecture is a sparse network of linear units over a pre-defined or incrementally acquired feature space (Dan Roth).
Tilburg University’s TiMBL
- Tilburg’s Memory Based Learner. A general near-neighbour-based machine learning package, but optimized for statistical NLP applications. Follow the “Software” link.
Unsure
INTEX
- A finite-state transducer analysis system for English, French, and Italian that runs under NextStep. Contact: Max Silberztein (silberz@ladl.jussieu.fr).
The PennTools page collects information on a variety of NLP systems, many of which are available externally.
Corpora
Large collections aimed at the NLP community
LDC (Linguistic Data Consortium)
- Email: ldc@ldc.upenn.edu. Provides the largest range of corpora on CD-ROM. Cost ranges from cheap (e.g., ACL-DCI disk) to pricey. CDs can be purchased individually; institutions can become members and receive discounts on CDs. Their catalog and some other info is available by ftp. There’s an LDC Online service for searches over the web (mainly intended for members, but there are samplers available).
ACL/DCI (Association for Computational Linguistics Data Collection Initiative)
- Email: fel@unagi.cis.upenn.edu. Results are obtainable through LDC.
European Language Resources Association
- Distribution agency is ELDA. Rapidly growing collection of materials in European languages.
ICAME (International Computer Archive of Modern English)
- Sells various corpora (including Brown and London-Lund). Information on corpora on the web, by sending the message “help” to fileserv@nora.hd.uib.no, by ftp to nora.hd.uib.no. Also, manuals for these corpora.
TRACTOR
- TELRI Research Archive of Computational Tools and Resources. Corpora, many multilingual, in European community languages. Small fee for joining in order to be able to get corpora (unless you have contributed corpora).
CLR (Consortium for Lexical Research)
- Email: lexical@nmsu.edu. Focuses more on language processing tools and lexicons, but does have some corpora. As of Feb 1996, you can get most of their stuff by anonymous ftp to clr.nmsu.edu. Their catalog is available as a postscript file.
OTA (Oxford Text Archive)
- Provides mainly literary texts. Has a bright new web site. Email: info@ota.ahds.ac.uk. Most materials are available on the web or by anonymous ftp to ota.ox.ac.uk. Some require negotiations with the providers.
BNC (British National Corpus)
- A 100 million word corpus of British English. Now available to people outside the European Union! You can search it online from their simple web interface or via View, a much better interface by Mark Davies, and there is an index to genres by David Lee.
European Corpus Initiative Multilingual Corpus I (ECI/MCI)
- A 98 million word corpus, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, and Malay. Cheap. You need to sign a license agreement, available at the WWW site. Also available from the LDC.
Survey of English Usage
- At the Department of English Language and Literature at University College London. Includes the British part of ICE, the International Corpus of English project. Now available tagged, and parsed for function. 83,419 sentences. Includes ICECUP, dedicated retrieval software. ICE-NZ. ICE-HK. ICE-East Africa version 1 is on the ICAME-2 CD; version 2 is available. ICE-Singapore, ICE-India, and ICE-Philippines are also available.
Corpora held by Lancaster University
- This link provides its own annotations.
The European Language Activity Network
- Promises a uniform query language for accessing corpora in all EU languages — but isn’t quite there yet.
Talkbank.
- Rich video and transcripts.
Particular languages
English
English language corpora available from the sites above are not repeated here.
Corpora by Geoffrey Sampson’s team
- The SUSANNE corpus and the CHRISTINE corpus (SUSANNE markup of a speech corpus).
Michigan Corpus of Academic Spoken English (MICASE). 1.7 million words from 1997-2001.
Penn-Helsinki Parsed Corpus of Middle English
- A syntactically annotated corpus of the Middle English prose samples in the Helsinki Corpus of Historical English, with additions. 1.3 million words. $200.
Corpus of Professional, Spoken American-English (CPSA)
- 2 million words from faculty and committee meetings and White House press conferences (50K word sample free on internet).
Lancaster Parsed Corpus
Dialogue Diversity Corpus (Bill Mann)
Chinese
Chinese language corpora available from the sites above are not repeated here.
The Lancaster Corpus of Mandarin Chinese (LCMC)
- By Tony McEnery and Richard Xiao. Distinguished by being a balanced corpus, and freely available.
Multilingual
EMILLE/CIIL
- Monolingual written corpus data for 14 South Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu). Orthographically transcribed spoken data and parallel corpus data for five South Asian languages (Bengali, Gujarati, Hindi, Punjabi and Urdu). In addition, the parallel corpus contains the English originals from which the translations stored in the corpus were derived. All data in the corpus is CES and Unicode compliant. The EMILLE corpus totals some 94 million words. Downloadable.
OPUS
- An open source parallel corpus, aligned, in many languages, based on free Linux etc. manuals.
World Health Organization Computer Assisted Translation page.
- Also includes a good selection of links on Computer Assisted Translation. (See also the copyright page.)
Searchable Canadian Hansard French-English parallel texts (1986-1993)
- From the Laboratoire de Recherche Appliquée en Linguistique Informatique, Université de Montréal
European Union web server
- Parallel text in all EU languages. (In particular try European legislation.)
TELRI CD-ROMs
- Parallel and other text in Central and Eastern European languages.
Bosnian
Czech
Parallel Czech-English
- Literature translations in Czech and English
Czech National Corpus project: SYN2000
- 100 million words of contemporary Czech.
French
Association des Bibliophiles Universels
- Various French literary works.
American and French Research on the Treasury of the French Language (ARTFL)
- 150 million word corpus of various genres of French. You have to be a member to use it (but membership is fairly cheap).
German
COSMAS Corpus
- Large (over a billion words!) online-searchable German and Austrian corpora. This is the publicly available part of the 1.85 billion word Mannheimer Corpus Collection.
NEGRA Corpus
- Saarland University Syntactically Annotated Corpus of German Newspaper Texts. Available free of charge to academics. 20,000 sentences, tagged, and with syntactic structures.
Russian
Library of Russian Internet Libraries
- Various literary works.
Slovene
Slovene-English parallel corpus
- 1 M words, free to download + on-line concordances.
Coming soon: Slovene reference corpus of 100 M words
Spanish and Portuguese
TychoBrahe Parsed Corpus of Historical Portuguese
- Over a million words of Portuguese from different historical periods, some of it morphologically analyzed/tagged. Free.
Information about Mark Davies’ collection of (mainly historical) Spanish and Portuguese corpora.
- It’s not clear what their availability is.
The CUMBRE corpus. Contact Professor Aquilino Sánchez
The CRATER Spanish corpus
- Morphosyntactically tagged telecommunication manuals. Available by ftp.
Corpus resources for Portuguese
- In total about 70 million words, available free, from various sources (newswire, etc.)
Folha de S. Paulo newspaper
- 4 annual CDROMs with full text.
COMPARA
- Portuguese-English parallel corpus. (In general, there are various resources at the Linguateca site.)
See also under ELRA, above.
Swedish
Spraakdata, Department of Swedish, Göteborgs University.
- Has various searchable part-of-speech tagged Swedish corpora (Parole, Bank of Swedish, etc.), and some material in Zimbabwean languages.
Treebanks
Name | Language | Size | Availability | Comments |
---|---|---|---|---|
Penn Treebank | US English | 2 million + words | Available (distributed by LDC) | 1 million WSJ, 1 million speech, surface syntax (1970s TG) |
BLLIP WSJ corpus | US English | 30 million words | Available (distributed by LDC) | WSJ newswire. Automatically parsed, not hand checked. Same structure as Penn Treebank, except for some additional coreference marking |
ICE-GB | UK English | 1 million words (83,394 sentences) | Available; c. 500 pounds | British part of ICE, the International Corpus of English project. Tagged and parsed for function. Half spoken material. |
NEGRA Corpus | German | 20,000 sentences | Available free of charge to academics on completion of license agreement. | Saarland University Syntactically Annotated Corpus of German Newspaper Texts. Tagged, and with syntactic structures. |
TIGER corpus | German | 700,000 words | Available free of charge for research purposes on completion of license agreement. | German newspaper text (Frankfurter Rundschau). Semi-automatically parsed. They also have a good treebank search tool, TIGERSearch. |
Alpino Dependency Treebank | Dutch | 150,000 words | Freely downloadable | Assorted subcorpora. By far the largest is the full cdbl (newspaper) part of the Eindhoven corpus. |
The Prague Dependency Treebank 1.0 | Czech | 500,000 words | Free on completion of license agreement (available through LDC). | Analyzed at the levels of parts of speech and syntactic functions (and, in the future, semantic roles) in a dependency framework. Text from newspapers and weekly magazines. |
Bulgarian Treebank | Bulgarian | n/a | Only some POS-tagged texts available so far (free on the web) | A Bulgarian HPSG treebank, currently under construction. |
Penn Chinese Treebank | Chinese | 100,000 words | Available (LDC) | Based on Xinhua news articles. 1980s-style GB syntax. |
Danish Dependency Treebank 1.0 | Danish | 100,000 words | Available free under the GPL. | Built on a portion of the Parole corpus. |
Floresta Sintá(c)tica | Portuguese | 168,000 words hand-corrected; 1,000,000 words automatically parsed | Hand corrected part is free web download; automatically parsed part available through email contact | Text from CETEMPúblico corpus. Phrase structure and dependency representations. Available in several formats, including Penn Treebank format. |
Verbmobil Tübingen: under construction treebanked corpus of German, English, and Japanese sentences from Verbmobil (appointment scheduling) data
Syntactic Spanish Database (SDB) University of Santiago de Compostela. 160,000 clauses / 1.5 million words.
CKIP Chinese Treebank (Taiwan). Based on Academia Sinica corpus. (There’s also a 100 sentence Chinese treebank at U. Maryland.)
LDC Korean Treebank.
Dublin-Essex Treebank project
- Deriving Linguistic Resources from Treebanks.
Discourse
CSTBank: Cross-document Structure Theory: marking sentence functional relationships across related documents.
Resources for Word Sense Disambiguation
The Senseval web site
- Has a comprehensive selection of resources for WSD, including a good list of WSD data resources, but not yet the new SEMCOR.
Ted Pedersen’s code
- Includes various WSD systems.
SenseClusters
- Open source package for unsupervised discovery of word senses by clustering together instances of a word (or words) that are used in similar contexts in raw text, supporting a wide range of clustering techniques based on both context vectors and similarity matrices, and including links to SVDPACKC and CLUTO. Ted Pedersen and Amruta Purandare.
Literature
There are now quite large collections of online literature, available in various languages (though the majority are in English, of course). Below are pointers to some of the main collections:
Entirely or mainly English
Alex: A Catalogue of Electronic Texts on the Internet
- Seems to have one of the largest collections. Searching and browsing facilities through gopher menus. Many languages.
Wiretap Electronic Text Archive
- Extensive and good quality. Still in the gopher age, though.
The On-line Books Page
- The index here only covers books in English, but there are lots of links to other collections of material in all languages.
Project Gutenberg
- The oldest and largest project to get out of copyright literature online, freely available. (Or see the mirror, Sailor’s Project Gutenberg site.)
The Electronic Text Center of the University of Virginia
- Large collection of SGML text, mainly in English, but also in other major languages.
Center for Electronic Texts in the Humanities
- Princeton/Rutgers collaboration. They didn’t have it together with their web site when I stopped by, but they may soon.
Oxford Electronic Text Library Editions
- Available from Oxford University Press, 200 Madison Ave, NY, NY 10016 212-679-7300. The Complete Works of Jane Austen is $95.00, and is reviewed in Computers and the Humanities, 28:4-5 (Aug/Oct, 1994), 317-321.
Coreference annotated texts
- From University of Wolverhampton (R. Mitkov, C. Barbu et al.).
Acquisition data
CHILDES database.
- Database of child language transcriptions in English and many other languages. Texts are also available by ftp. Certain usage requirements. Manuals and programs for accessing the data (the CLAN concordancer) are also available online. Now in Unicode XML.
SGML/XML
Robin Cover’s SGML/XML Web Page
- This is a wonderful compendium of information on SGML and XML, including information on the Text Encoding Initiative (TEI). This document is also a guide to many text collections (ones using SGML).
Information about the Text Encoding Initiative (TEI). (The Pizza Chef acts as a TEI tag set selector.)
Xaira
- XML Aware Indexing and Retrieval Application. The successor of SARA.
Microsoft’s XML page
W3C XML page.
The Corpus Encoding Standard.
- An SGML instance designed for language engineering applications. Also the XML version.
Dictionaries
Dictionaries of subcategorization frames
The following dictionaries all list surface subcategorization frames (each with a different annotation scheme). They are also all available in electronic form from the publishers (not free).
COBUILD
- Collins Cobuild English Language Dictionary. London: Collins, 1987. The COBUILD web site lets you search their Bank of English corpus (but you need to pay to get more than a trial).
LDOCE
- Longman Dictionary of Contemporary English. Burnt Mill, Essex: Longman, 1978.
OALD
- Oxford Advanced Learner’s Dictionary of Current English. Oxford: Oxford University Press, Fourth Edition, 1989. The third edition also had information on subcategorization frames, although in a different incompatible format. However, a partial version of the third edition (with this information) is available free online from the Oxford Text Archive.
Not exactly a dictionary, but other popular sources are:
Levin (1993)
- Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. Chicago. Discusses linguistic distinctions (like unergative/unaccusative verbs, dative shift, etc.) not made by the above dictionaries. The index of verbs is online.
English subcategorization evaluation resources
- Gold standard data, from Cambridge University (Anna Korhonen)
See also COMLEX and CELEX available from the LDC.
Dictionaries of assorted languages on the web
The old version of Robert Beard’s Web of Online Dictionaries long ago mutated into YourDictionary.com. I’m told the IPO has been delayed. Nevertheless, it’s the most comprehensive index of dictionaries available on the web.
Names
U.S. names with frequency information are available from the Census Bureau.
SGML structured dictionaries
Cambridge International Dictionary of English and other products in SGML.
Lexical/morphological resources
English SENSEVAL Resources
- Dictionary entries and tagged examples for 35 words.
ARIES Natural Language Tools
- Lexicons and morphological analysis for Spanish. There is a free Prolog demonstrator, but the real lexicons and C/C++ access tools cost money.
Courses, Syllabi, and other Educational Resources
“Techie”
Foundations of Statistical Natural Language Processing
- Some information about, and sample chapters from, Christopher Manning and Hinrich Schütze’s new textbook, published in June 1999 by MIT Press. Read about courses using this book.
Corpus-based Linguistics
- Christopher Manning’s Fall 1994 CMU course syllabus (a postscript file).
Statistical NLP: Theory and Practice
- Christopher Manning’s Spring 1996 CMU course materials.
John Lafferty and Roni Rosenfeld’s Spring 1997 CMU course Language and Statistics.
Boston University (John D. Burger and Lynette Hirschman)
- A good course and web site, by the looks!
Draft of Data-Intensive Linguistics
- By Chris Brew and Marc Moens.
Statistical Natural Language Processing course
- By Joakim Nivre. ELSNET supported.
Short Course: Statistical Methods in NLP
- By Philip Resnik
Linguist’s Guide to Statistics by Brigitte Krenn and Christer Samuelsson.
Statistical and Corpora Based Methods for Processing Natural Languages
- By Alon Itai, Technion Computer Science Department. (Don’t read those old drafts of mine though … get the real thing!)
CS 241 Statistical Models in Natural-Language Processing
- Eugene Charniak, Brown University.
Michael Littman, Duke: 1997, 1998.
“Corpus Linguistics”
A tutorial on concordances and corpora by Cathy Ball
Web material accompanying McEnery and Wilson’s book on Corpus Linguistics
Tony Berber Sardinha’s Corpus Linguistics course
- Powerpoint slides in an interesting mixture of English and Portuguese (plus the rest of his homepage!)
Concordancing and corpus linguistics
- Notes prepared by Phil Benson, Hong Kong University.
Computational Approaches to Collocations
- Discussion of all the measures that have been used, and software for calculating them. By Evert and Krenn.
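As a rough illustration of the kind of association measures such collocation software computes, here is a minimal Python sketch that scores one bigram from its corpus counts using the Dice coefficient, pointwise mutual information, and Pearson’s chi-squared. The function name, argument names, and counts are hypothetical and chosen only for illustration; this is not the code or interface of Evert and Krenn’s software or of any other package listed on this page.

```python
import math


def association_scores(n11, n1p, np1, npp):
    """Score a bigram (w1, w2) from counts; all marginals must be positive.

    n11: number of times the bigram w1 w2 occurs
    n1p: number of bigrams whose first word is w1
    np1: number of bigrams whose second word is w2
    npp: total number of bigrams in the corpus
    """
    # Dice coefficient: twice the joint count over the sum of the marginals.
    dice = 2.0 * n11 / (n1p + np1)

    # Pointwise mutual information: log2 of observed over expected joint count.
    pmi = math.log2((n11 * npp) / (n1p * np1))

    # Remaining cells of the 2x2 contingency table.
    n12 = n1p - n11              # w1 followed by something other than w2
    n21 = np1 - n11              # w2 preceded by something other than w1
    n22 = npp - n1p - np1 + n11  # neither w1 first nor w2 second

    # Pearson's chi-squared for a 2x2 table (closed form).
    chi2 = (npp * (n11 * n22 - n12 * n21) ** 2
            / (n1p * np1 * (npp - n1p) * (npp - np1)))

    return {"dice": dice, "pmi": pmi, "chi2": chi2}


if __name__ == "__main__":
    # Made-up counts: the bigram occurs 8 times in 1024 bigrams.
    print(association_scores(8, 16, 32, 1024))
```

With these made-up counts the observed bigram frequency is 16 times what independence would predict, so the PMI comes out at 4 bits. Measures such as Fisher’s exact test or the log-likelihood ratio, which tend to be preferred for sparse counts, are left out of this sketch.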
Mailing lists
Mailing lists that have information on these topics include:
Corpora
- The main mailing list for info on corpus-based linguistics. Subscribe by sending the message “subscribe corpora” to listserv@uib.no. Or if you want to subscribe with a different email address, send “subscribe corpora email-address”. (Note that you’re now speaking to a Majordomo server, not a listserv, so you don’t send your name!) Or you can subscribe on the web.
Empiricist
- The empiricist list appears to be defunct now. You used to send a “subscribe” message to empiricists-request@unagi.cis.upenn.edu.
Other stuff on the Web
General resources
NIST Human Language Technology programs
- Including: TREC, TIDES, ACE, ….
Text summarization
- Tons of resources (tutorials, bibliographies, and software) for document summarization, maintained by Dragomir Radev.
PropositionBank @ UPenn
Bookmarks for Corpus-based Linguists
- An extensive annotated collection by David Lee, aimed at linguistics more than NLP (includes web-searchable corpora and concordancing options).
HLTCentral
- European site aiming to increase transfer of language technologies to the commercial market. News, etc.
Linguistic annotation
- A description of formats for linguistic annotation by Steven Bird.
CTI Textual Studies, University of Oxford, Guide to Digital Resources
- Lists text analysis tools, corpora, and other stuff.
U. Essex W3-Corpora
- Lots of teaching material, links, and online corpora.
Computational Linguistics and NLP (Kenji Kita, Tokushima U.)
- A good well organized list of CL references, concentrating on corpus-based and statistical NLP methods. See also Software tools for NLP.
HLT Central
- European Human Language Technology site
Survey of the State of the Art in Human Language Technology
ACL SIGLEX list of Lexical Resources
Online materials for a course on Learning Dynamical Systems at Brown University.
- Lots of neat info.
Expert Advisory Group for Language Engineering Standards (EAGLES) home page
- European standards organization.
Materials prepared for Michael Barlow’s Corpus Linguistics course
Corpus Linguistics University of Birmingham
Chris Brew’s Teaching Materials for statistical NLP
- Not much there last time I looked; you might also try his home page.
Edinburgh LTG HelpDesk’s FAQ
- Many of the questions in the FAQ concern issues related to corpora and tagging.
Content Analysis Resources
- Qualitative Text Analysis, Concordances, etc.
Information Retrieval
The SMART IR system
ACM SIGIR
Managing Gigabytes
TREC conference
Text-based Intelligent Systems (Bruce Croft)
Information Extraction/Wrapper Induction
Introduction to Information Extraction Technology. A tutorial by Douglas E. Appelt and David Israel.
IE data sets
- Updated versions (i.e., now well-formed XML) of classic IE data sets: Seminar Announcements and Corporate Acquisitions.
Web -> KB. CMU World Wide Knowledge Base project (Tom Mitchell). Has a lot of the best recent probabilistic model IE work, and links to data sets.
RISE: Repository of Online Information Sources Used in Information Extraction Tasks, including links to people, papers, and many widely used data sets, etc. (Ion Muslea). Appears to not have been updated since 1999.
Message Understanding Conference (MUC) information. A US government funded information extraction exercise (from the 1990s).
Web IR and IE (Einat Amitay). Various links on IR and IE on the web.
Web question answering system (University of Michigan)
GATE: General Architecture for Text Engineering (Sheffield)
Genia Project. Biomedical text information extraction corpus (Tsujii lab). And IE tutorial slides.
People’s homepages
Home pages with something useful on them.
University of Texas at Austin Machine Learning Research Group
Steven Abney (until 1997)
Adam Berger
- Various stuff on statistical MT and maximum entropy models


Societies/Journals
International Quantitative Linguistics Association/Journal of Quantitative Linguistics
- Not very hip.
Association for Computational Linguistics/Computational Linguistics
- Hipper
Still under construction…

http://nlp.stanford.edu/links/statnlp.html
Christopher Manning — <manning@cs.stanford.edu> — Last modified: Tue Jan 4 10:21:09 PDT 2005