About the Project

Extensive text data is the foundation for all modern methods in natural language processing (NLP). It underpins a variety of applications for information extraction, the creation of powerful large language models (LLMs) and other machine learning methods. The performance and quality of word and document embeddings and of modern transformer models depend directly on the scope and quality of the text resources used.

The Wortschatz Leipzig or Leipzig Corpora Collection project has been making information on the German language available since the mid-1990s. To this end, freely available documents are collected and processed regularly, usually once a year. The results are corpora and corpus-based dictionaries in which each word has a page with statistical information, example sentences and links to related words. Because the underlying data comprises several hundred million sentences, information can be found for almost all words. This makes the project one of the most comprehensive information systems on the German language.

Since the project began, its focus has evolved continuously. New text sources have been tapped, new methods developed and new applications made available. In particular, the resulting tools and the expertise in processing and analysing text data have been extended to more and more languages. At the centre of the work is the open provision of the data obtained; the project's web services alone have now answered billions of requests.

With up to several hundred million sentences per language, the project's resources contain statistical data for almost all words and linguistic phenomena. Data is now available for more than 250 languages, and additional languages will continue to be added. Most of these languages can be accessed online via web portals and web services, or downloaded as standardised corpora as part of the Leipzig Corpora Collection (LCC).

To respect copyright and data protection requirements, the text corpora are usually made available as randomised lists of sentences from which the original full texts cannot be reconstructed. Via metadata, all the documents contained can still be assigned to their respective original article.
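
The idea of a randomised sentence list with source metadata can be illustrated with a minimal Python sketch. It is not the project's actual processing pipeline: the function name `build_sentence_corpus`, the input layout (a mapping from a source identifier to already segmented sentences) and the record fields are assumptions made purely for illustration.

```python
import random


def build_sentence_corpus(documents, seed=None):
    """Turn documents into a shuffled list of sentence records.

    `documents` is a hypothetical input format: a dict mapping a source
    identifier (e.g. a URL) to the list of sentences of that document.
    """
    rng = random.Random(seed)

    # Flatten all documents into (sentence, source_id) pairs.
    pairs = [
        (sentence, source_id)
        for source_id, sentences in documents.items()
        for sentence in sentences
    ]

    # Shuffle across all documents so that the original sentence order,
    # and with it the running text, is no longer present in the list.
    rng.shuffle(pairs)

    # Number the sentences only after shuffling, so the ids carry no
    # information about the position in the original document.
    return [
        {"id": i, "sentence": sentence, "source": source_id}
        for i, (sentence, source_id) in enumerate(pairs, start=1)
    ]


if __name__ == "__main__":
    example = {
        "https://example.org/article-1": ["First sentence.", "Second sentence."],
        "https://example.org/article-2": ["Another sentence."],
    }
    for record in build_sentence_corpus(example, seed=42):
        print(record["id"], record["sentence"], record["source"], sep="\t")
```

In this sketch each record keeps a reference to its source, while the order in which sentences appeared in that source is discarded by the shuffle.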