
Showing posts from September, 2016

Text mining - Document classification

The statistical approach to text mining consists of transforming a collection of text documents into a matrix of numeric values to which we can apply machine learning algorithms. Text documents are often described as "unstructured documents". This does not mean that they lack any organization (titles, chapters, paragraphs, questions and answers, etc.). It means, first of all, that the collection cannot be expressed directly as the kind of data table usually handled in data mining. To obtain this representation, a preprocessing phase is needed, after which we extract relevant features to define the data table. These steps can heavily influence the relevance of the results. In this tutorial, I use an exercise that I work through with my students in my text mining course at the University. We perform the whole analysis in R with dedicated packages such as “XML” or “tm”. The issue here is to perform ...
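To make the idea of turning documents into a numeric matrix concrete, here is a minimal sketch in plain Python. It is not the R code of the tutorial: the toy corpus, the tiny stop-word list, and the helper names `tokenize` and `document_term_matrix` are all assumptions made up for illustration. It only shows the general principle: preprocess each document, then count terms to fill one row per document and one column per term.

```python
import re
from collections import Counter

# Hypothetical toy corpus; the tutorial's real documents are not shown here.
DOCS = [
    "The cat sat on the mat.",
    "The dog chased the cat!",
    "Dogs and cats are pets.",
]

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOPWORDS = {"the", "on", "and", "are"}


def tokenize(text):
    """Lowercase the text, keep runs of letters, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def document_term_matrix(docs):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    counts = [Counter(tokenize(d)) for d in docs]
    # The vocabulary is the sorted set of all terms seen in the corpus.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for missing terms, so absent words get a zero cell.
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


vocabulary, matrix = document_term_matrix(DOCS)
print(vocabulary)
for row in matrix:
    print(row)
```

Note that, without stemming, "cat" and "cats" end up in two different columns. This is a small illustration of how preprocessing choices shape the data table, and therefore the results obtained from it.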