Corpus development at the Institute for Dutch Lexicology INL
The Institute for Dutch Lexicology INL is a research institute subsidized by the Dutch and Belgian governments. Corpus development at the INL dates from the mid-seventies. Up to 1990, the INL text corpora were developed mainly for lexicographical purposes. At present, they are used for a broad variety of research and applications. INL text corpora of present-day Dutch include two linguistically annotated corpora which can be consulted via the Internet: the 5 Million Words Corpus 1994, which covers a variety of topics and text types, and the 27 Million Words Newspaper Corpus 1995. The retrieval program developed for the latter will be demonstrated.
Characteristics of the 27 Million Words Newspaper Corpus 1995
The newspaper texts, dating from 1994 and 1995, were obtained in machine-readable form under a contract with the publishing company, which specifies the conditions of use. The texts served as input for automatic linguistic encoding: a lemmatizer/POS-tagger developed by the INL automatically assigned a part of speech (POS) and a headword to the word forms in the electronic texts. Most of the data have not been corrected, either at the level of the text itself or at the level of POS and headword. The linguistically encoded texts were loaded into an on-line retrieval system developed by the INL. Queries may address the whole corpus, or a subcorpus that the user defines by year and month of publication. The system allows the user to search for single words or word patterns, including some predefined, still rather primitive, syntactic patterns which the user can revise. Search definitions may refer to word forms, POS and headwords, either separately or combined by means of Boolean operators and proximity searches. Output is usually a list of items or a series of concordances with a user-defined context size. Subject to copyright limitations, search output can be transferred to the user's computer by e-mail; complete texts or substantial text fragments may not be transferred. Other facilities include wild cards and various sorting options.
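The kind of query described above can be illustrated with a minimal Python sketch. It is not the INL retrieval system: the Token record, the tag labels (ART, N, V, ...), the sample sentence and all function names are assumptions made for illustration only. The sketch shows how conditions on word form (with wild cards), POS and headword could be combined with Boolean operators and a proximity constraint, and how hits could be printed as concordances with a user-defined context size.

```python
import fnmatch
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Token:
    form: str      # word form as it appears in the text
    pos: str       # part-of-speech tag (hypothetical tag set)
    headword: str  # headword (lemma) assigned by a tagger


Predicate = Callable[[Token], bool]


def form_is(pattern: str) -> Predicate:
    """Match a word form, case-insensitively; '*' and '?' act as wild cards."""
    return lambda t: fnmatch.fnmatchcase(t.form.lower(), pattern.lower())


def pos_is(tag: str) -> Predicate:
    return lambda t: t.pos == tag


def headword_is(headword: str) -> Predicate:
    return lambda t: t.headword == headword


def both(p: Predicate, q: Predicate) -> Predicate:
    """Boolean AND of two token conditions."""
    return lambda t: p(t) and q(t)


def either(p: Predicate, q: Predicate) -> Predicate:
    """Boolean OR of two token conditions."""
    return lambda t: p(t) or q(t)


def near(tokens: List[Token], p: Predicate, q: Predicate,
         max_distance: int) -> List[int]:
    """Positions where p holds and q holds for another token
    at most max_distance tokens away (proximity search)."""
    hits = []
    for i, token in enumerate(tokens):
        if not p(token):
            continue
        lo = max(0, i - max_distance)
        hi = min(len(tokens), i + max_distance + 1)
        if any(q(tokens[j]) for j in range(lo, hi) if j != i):
            hits.append(i)
    return hits


def concordance(tokens: List[Token], positions: List[int],
                context: int) -> List[str]:
    """One keyword-in-context line per hit, with `context` tokens per side."""
    lines = []
    for i in positions:
        left = " ".join(t.form for t in tokens[max(0, i - context):i])
        right = " ".join(t.form for t in tokens[i + 1:i + 1 + context])
        lines.append(f"{left} [{tokens[i].form}] {right}".strip())
    return lines


# Invented sample sentence with invented annotations.
SAMPLE = [
    Token("De", "ART", "de"),
    Token("minister", "N", "minister"),
    Token("bezocht", "V", "bezoeken"),
    Token("gisteren", "ADV", "gisteren"),
    Token("de", "ART", "de"),
    Token("nieuwe", "ADJ", "nieuw"),
    Token("fabriek", "N", "fabriek"),
]

if __name__ == "__main__":
    # Verb forms of the headword 'bezoeken' within two tokens of a noun.
    query = both(pos_is("V"), headword_is("bezoeken"))
    hits = near(SAMPLE, query, pos_is("N"), max_distance=2)
    for line in concordance(SAMPLE, hits, context=2):
        print(line)
```

Representing each condition as a function on a single token keeps the Boolean combinations trivial to compose; a retrieval system for a corpus of this size would presumably rely on indexes rather than a linear scan as done here.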
Access to the 27 Million Words Newspaper Corpus 1995
Consultation of the corpus is free for non-commercial purposes. Please contact the director of the INL, Prof. dr. P.G.J. van Sterkenburg, about the conditions for commercial applications. To get access to the corpus, each user has to sign an individual user agreement. An electronic user agreement form can be obtained from our mailserver Mailserv@Rulxho.Leidenuniv.NL. Type in the body of your e-mail message: SEND [27MLN95]AGREEMNT.USE. Please print the agreement form, sign it, keep a copy for yourself, and return a signed copy to: Institute for Dutch Lexicology INL, P.O. Box 9515, 2300 RA Leiden. After receipt of the signed user agreement, you will be sent your username and password. Use of a VT 220 (or higher) terminal, or an appropriate terminal emulator (e.g. Kermit), is recommended. If you need additional information, please send an e-mail message to Helpdesk@Rulxho.Leidenuniv.NL, or send a fax to Mrs. dr. J.G. Kruyt (31 71 27 2115).