FAQ – Data

All words are included and presented as they were found in the underlying documents. For that reason, misspellings (like "goverment" instead of "government"), words in outdated spelling (like "thou") or dialectal variations may be contained in a corpus. The use of randomly chosen Web pages as source material can also lead to the inclusion of sentences or words that can be considered racist, sexist or problematic in other ways.

Besides issues related to the source material, errors in our processing pipeline may also lead to errors in our data (e.g. the extraction of word fragments like "ing" by our tokenizer). In most cases, the frequency of an ill-formed word is significantly smaller than the frequency of its correct version. In the case of outdated spelling, we sometimes provide a link to the correct word. If you find a systematic error in our data, we always appreciate a short note.

The Wortschatz Leipzig / Leipzig Corpora Collection creates corpora mostly based on documents from the Internet, which are processed automatically by our toolchain. If a specific word does not occur in the source documents, it is not contained in the resulting corpus. We do not select documents manually for inclusion in a corpus (except in some cases of domain-specific corpora).

Information about the downloads can be found here or at the repository of the Saxon Academy of Sciences and Humanities in Leipzig.

The project mostly uses documents from the Internet for the creation of its corpora. As this material is subject to copyright law, every text is split into its sentences, and those sentences are randomly shuffled to destroy the original document structure. After this processing step, the original documents are deleted and can no longer be provided.
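The scrambling step described above can be sketched as follows. This is a minimal illustration, not the project's actual code; it assumes sentences are split naively on ". ", whereas the real pipeline uses a dedicated sentence segmenter.

```python
import random

def scramble(documents):
    """Split documents into sentences and shuffle them so that the
    original document structure cannot be reconstructed."""
    sentences = []
    for doc in documents:
        # Naive segmentation on '. ' for illustration only; the real
        # pipeline uses a proper sentence segmenter.
        sentences.extend(s.strip() for s in doc.split(". ") if s.strip())
    random.shuffle(sentences)
    return sentences

docs = ["First sentence. Second sentence.", "Another document. Its last sentence."]
print(scramble(docs))
```

After shuffling, only the bag of sentences remains; which sentences belonged to which document, and in what order, is lost by design.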

We use corpus names that encode the most relevant information about the source material used. All corpus names follow this structure:
LANGUAGE_GENRE_DATE
where:
  • Language – Information about the language of the source material based on ISO 639-3, optionally extended by the country of origin using ISO 3166
  • Genre – Information about the kind of source material. Typical values are "web", "wikipedia", "news" (news material, often via RSS feeds) or "newscrawl" (news material, crawled from websites)
  • Date – Information about the timespan in which the source material was acquired
Examples of corpus names:
  • deu_news_2023 – German news material from 2023
  • deu-at_news_2023 – German news material from Austria, 2023
  • deu-at_web_2021-2024 – German web text from Austria, 2021–2024
  • deu_wikipedia_2024 – German Wikipedia texts from 2024
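The naming scheme above is regular enough to be parsed mechanically. The following sketch (a hypothetical helper, not part of the project's tooling) splits a corpus name into its components:

```python
def parse_corpus_name(name):
    """Split a corpus name of the form LANGUAGE_GENRE_DATE into its parts.
    The language part may carry an ISO 3166 country suffix, e.g. 'deu-at'."""
    language, genre, date = name.split("_")
    country = None
    if "-" in language:
        language, country = language.split("-")
    return {"language": language, "country": country, "genre": genre, "date": date}

print(parse_corpus_name("deu-at_web_2021-2024"))
```

Note that the date part may itself contain a hyphen for a timespan ("2021-2024"), which is why the name is split on underscores first.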

The project utilises a complex process chain for corpus and dictionary creation that is continuously being developed.
It comprises the following steps:

  • Webcrawling
  • Removal of HTML markup (XML markup for Wikipedia, etc.)
  • Document-based language identification
  • Sentence segmentation
  • Removal of sentence duplicates
  • Pattern-based sentence cleaning
  • Sentence-based language identification
  • Corpus creation
    • Tokenisation and word indexing
    • Calculation of word frequencies
    • Calculation of word co-occurrences
  • Optional postprocessing (depending on the availability of the corresponding tools)
    • POS tagging (Assignment of part of speech tags to words)
    • Lemmatisation
    • Detection and removal of near-duplicate sentences
    • Word similarity based on co-occurrences
    • Word similarity based on string similarity (Levenshtein)
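The last postprocessing step, string similarity via Levenshtein distance, can be illustrated with a minimal sketch. This is the textbook dynamic-programming algorithm with a simple length-based normalisation; the project's actual implementation and normalisation may differ.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def string_similarity(a, b):
    """Normalise edit distance to a similarity score in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

# The misspelling "goverment" is one edit away from "government",
# so the two words come out as highly similar.
print(string_similarity("goverment", "government"))
```

This kind of measure is what allows a misspelled form to be linked to its correct version, as mentioned above for errors in the source material.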