All words are included and presented as they were found in the underlying documents. For that reason, misspellings (like "goverment" instead of "government"), words in outdated spelling (like "thou") or dialectal variations may be contained in a corpus. The use of randomly chosen Web pages as source material can also lead to the inclusion of sentences or words that can be considered racist, sexist or problematic in other ways.
Besides issues related to the source material, errors in our processing pipeline may also lead to errors in our data (e.g. the extraction of word fragments like "ing" by our tokenizer). In most cases, the frequency of an ill-formed word is significantly lower than the frequency of its correct version. In the case of outdated spelling we sometimes provide a link to the correct word. If you find a systematic error in our data, we would appreciate a short note.
The Wortschatz Leipzig / Leipzig Corpora Collection creates corpora mostly based on documents from the Internet, which are processed automatically by our toolchain. If a specific word does not occur in the source documents, it is not contained in the resulting corpus. We do not select documents manually for inclusion in a corpus (except in some cases of domain-specific corpora).
Information about the downloads can be found here or at the repository of the Saxon Academy of Sciences and Humanities in Leipzig.
The project mostly uses documents from the Internet for the creation of its corpora. As this material is subject to copyright law, every text is split into sentences, and those sentences are randomly reordered to destroy the original document structure. After this processing step, the original documents are deleted and can no longer be provided.
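The splitting and reordering step can be illustrated with a minimal Python sketch. It is not the project's actual toolchain: the function names split_sentences and scramble_documents are hypothetical, and the regular-expression splitter is a deliberately naive stand-in for a real sentence segmenter.

```python
import random
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def scramble_documents(documents, seed=None):
    """Pool the sentences of all documents and shuffle them.

    After shuffling, neither the document boundaries nor the
    original sentence order can be recovered from the result.
    """
    sentences = []
    for doc in documents:
        sentences.extend(split_sentences(doc))
    rng = random.Random(seed)  # seed only for reproducible demos
    rng.shuffle(sentences)
    return sentences


# Hypothetical example input, not taken from any real corpus.
docs = [
    "This is the first sentence. Here is a second one!",
    "Another document starts here. Does it end with a question?",
]
scrambled = scramble_documents(docs, seed=0)
for sentence in scrambled:
    print(sentence)
```

Every sentence is kept, but the output contains no information about which document a sentence came from or where it stood in that document.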
The project utilises a complex process chain for corpus and dictionary creation that is continuously being developed.
It comprises the following steps: