Wissensrohstoff Text – Eine Einführung in das Text Mining

Book Content

Most of the world's knowledge is contained in digitally available texts. These texts represent a significant source of knowledge, but how can this knowledge be extracted? In this updated and expanded new edition of the first German textbook on this topic, you will learn how digital text can be prepared, processed and used in applications with the help of text mining.

Authors

Chris Biemann
Professor Dr. Chris Biemann is Scientific Director of the Hub of Computing and Data Science and heads the Language Technology Group, both at the University of Hamburg.
Gerhard Heyer
Professor Dr. Gerhard Heyer was head of the Natural Language Processing Group at the Institute of Computer Science at Leipzig University.
Uwe Quasthoff
Professor Dr. Uwe Quasthoff headed the Deutscher Wortschatz project in the Natural Language Processing Group at Leipzig University.

Glossary

The glossary for the book is available here: Download (German)

Data

Here you will find various resources that are used or referenced in the book. These include the text data and the ASV Online Toolbox, in which you can try out procedures explained in the book directly in your browser.

The corpora used in the book can be downloaded here in various sizes (measured in number of sentences). The format of the downloads is explained here.

German news corpus (Germany) 2019
German web corpus (Germany) 2019

Tools

Some of the procedures described in the book can be tested directly via our toolbox. We recommend using the Online Toolbox; however, the older ASV Toolbox is still available for download.

ASV Online Toolbox

The Online Toolbox is a modular collection of different tools for analysing written language and allows you to test many of the methods presented directly in the browser.


To the toolbox…

ASV Toolbox

The ASV Toolbox is a collection of different tools for analysing written language. It was developed by the Natural Language Processing Group (Automatische Sprachverarbeitung, ASV) at Leipzig University and is no longer being developed further. It can be downloaded from the Language Technology Group at the University of Hamburg.


To the ASV Toolbox download…