Document Classification With Natural Language Processing

Table of Contents

1. Introduction

2. Preprocessing & Feature Extraction

3. A Basic Naive Bayes Model

4. Scikit-learn Pipelines

5. An Improved Naive Bayes Model

6. Conclusions


Natural language processing, or NLP, is a hot topic in data science and machine learning. While research in NLP dates back to the 1950s, the real revolution in this domain came in the 1980s and 1990s with the introduction of statistical models and fast computational power. Before this, most language processing tasks made use of hand-coded rules, which were generally not very robust.

The span of topics in Natural Language Processing is immense, and I'll just be touching the tip of the iceberg by going over the topic of document classification. I will be working with the Scikit-learn library and using one of its provided datasets, the 20 News Groups, which is a collection of almost 20,000 articles on 20 different topics or 'newsgroups'. We can obtain the training and testing sets directly with the following commands:

In [1]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)

We can then view the total number of articles:

In [2]:
len(twenty_train.data) + len(twenty_test.data)

All the documents within the data set belong to the following 20 topics,

In [3]:
twenty_train.target_names

We can look at the actual message within the documents,

In [4]:
print(twenty_train.data[0])
From: (where's my thing)
Subject: WHAT car is this!?
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

- IL
   ---- brought to you by your neighborhood Lerxst ----

But if we look at the target classes they are actually encoded using a categorical variable,

In [5]:
twenty_train.target[0]

In order to view the human-readable version of the target we create a dictionary to map the categorical variables (numbers) to labels:

In [6]:
codings = list(range(len(twenty_train.target_names)))
target_dic = dict(zip(codings,twenty_train.target_names))

Now we can see the actual document's topic:

In [7]:
target_dic[twenty_train.target[0]]

One piece of terminology that is used frequently in NLP is the text **corpus**. A text corpus is a large and structured set of texts, and the term can be used to describe what the twenty_train.data and twenty_test.data datasets are.

Now that we have an idea of what kind of data we are working with, we can start to do some machine learning on it. As with all datasets, there is some required preprocessing before machine learning. With numerical data there needs to be some cleaning and scaling of the features. However, in Natural Language Processing there is a much more substantial preprocessing stage that we'll go over next.

Preprocessing & Feature Extraction

The first thing we need to go over is the concept of the **bag-of-words model**. In the bag-of-words model, a text (such as a sentence or a document) is represented as a "bag" or list of its words, disregarding grammar and even word order, but keeping the multiplicity of the words. A two-document example is:

  • D1: Hi, I am Mike and I like Boston.

  • D2: Boston is a city and people in Boston like the Red Sox.

From these two documents, a list, or 'bag-of-words', is constructed:

bag = ['Hi', 'I', 'am', 'Mike', 'and', 'like', 'Boston', 'is',
       'a', 'city', 'people', 'in', 'the', 'Red', 'Sox']

Notice how in our bag-of-words we have dropped repetitions of the words 'I', 'and', 'like' and 'Boston'. We will show how the multiplicity of words enters into our model next.

The bag-of-words model is mainly used as a tool for feature generation. After transforming the text into a "bag of words", we can calculate various measures to characterize the document. In order to do so we have to generate a vector for each document that represents the number of times each entry in the bag of words appears in the text. The order of entries in the vector corresponds to the order of the entries in the bag-of-words list. For example, document D1 would have the vector,

[1, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

while the second document, D2, would have the vector,

[0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1]
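The construction of a bag-of-words list and its count vectors can be sketched in plain Python. This is only an illustration: the helper names `tokenize` and `count_vector` are made up for this example, and the simple regular-expression tokenizer (letters only, case preserved) is an assumption, not what Scikit-learn does internally.

```python
import re
from collections import Counter

docs = {
    "D1": "Hi, I am Mike and I like Boston.",
    "D2": "Boston is a city and people in Boston like the Red Sox.",
}

def tokenize(text):
    # Keep runs of letters only; punctuation is discarded, case is preserved.
    return re.findall(r"[A-Za-z]+", text)

# Build the bag in order of first appearance, without repetitions.
bag = []
for text in docs.values():
    for word in tokenize(text):
        if word not in bag:
            bag.append(word)

def count_vector(text):
    # One count per bag entry, in the same order as the bag.
    counts = Counter(tokenize(text))
    return [counts[word] for word in bag]

vectors = {name: count_vector(text) for name, text in docs.items()}

print(bag)
for name, vec in vectors.items():
    print(name, vec)
```

Each vector has one entry per word in the bag, so both documents are described by vectors of the same length even though the documents themselves differ in length.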

Each entry of these lists is the frequency or count of the corresponding entry in the bag-of-words list. When we stack a collection of (row) vectors into a matrix, where each row corresponds to a document (vector) and each column corresponds to a word in the bag-of-words list, we obtain our term-frequency ($\text{tf}$) document matrix. The general formula for an entry in the $\text{tf}$ matrix is,

$$\text{tf}(d,t) \, = \, f_{t,d}$$

where $f_{t,d}$ is the number of times the term $t$ occurs in document $d \in \mathcal{D}$, where $\mathcal{D}$ is our text corpus. We can create a term-frequency matrix using Scikit-learn's CountVectorizer class:

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()

X_train_tf = count_vect.fit_transform(twenty_train.data)

print("Shape of term-frequency matrix:", X_train_tf.shape)
Shape of term-frequency matrix: (11314, 130107)

The term-frequency matrix is a sparse matrix where each row is a document in our training corpus ($\mathcal{D}$) and each column corresponds to a term/word in the bag-of-words list. This can be confirmed by comparing the number of rows in the term-frequency matrix to the number of documents in the training set:

In [9]:
print("Number of training documents: ", len(twenty_train.data))
Number of training documents:  11314
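To get a feel for what CountVectorizer produces, here is a small sketch that applies it to the two toy documents from the bag-of-words example. The names `toy_docs`, `toy_vect` and `toy_tf` are illustrative. Note that with its default settings CountVectorizer lowercases the text and only keeps tokens of two or more alphanumeric characters, so 'I' and 'a' do not appear in its vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "Hi, I am Mike and I like Boston.",
    "Boston is a city and people in Boston like the Red Sox.",
]

toy_vect = CountVectorizer()
toy_tf = toy_vect.fit_transform(toy_docs)   # SciPy sparse matrix, one row per document

# vocabulary_ maps each term to its column index; sort terms by that index.
terms = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)

print(terms)
print(toy_tf.toarray())
```

The dense array is only printed here because the example is tiny; for the full corpus the matrix should stay sparse.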

Most often term-frequency alone is not a good measure of the importance of a word/term to a document's topic. Very common words like "the", "a", "to" are almost always the terms with the highest frequency in the text. Thus, having a high raw count of the number of times a term appears in a document does not necessarily mean that the corresponding word is more important. Furthermore, longer documents could have a high frequency of terms that do not correlate with the document topic, but instead occur in high numbers solely due to the length of the document.

To circumvent the limitations of term-frequency, we often normalize it by the inverse document frequency (idf). This results in the term frequency-inverse document frequency (tf-idf) matrix. The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents in the corpus. We can give a formal definition of the inverse document frequency by letting $\mathcal{D}$ be the corpus, or the set of all documents, $N_{\mathcal{D}}$ be the number of documents in the corpus, and $N_{t,\mathcal{D}}$ be the number of documents that contain the term $t$. Then,

$$idf(t,\mathcal{D}) \, = \, \log\left(\frac{N_{\mathcal{D}}}{1 + N_{t,\mathcal{D}}}\right) \, = \, - \log\left(\frac{1 + N_{t,\mathcal{D}}}{N_{\mathcal{D}}}\right) $$

The $1$ in the denominator is there for smoothing. Without it, if a term/word did not appear in any training document, its inverse document frequency would be $idf(t,\mathcal{D}) = \infty$. With the $1$ present, it instead takes the finite value $idf(t,\mathcal{D}) = \log\left(N_{\mathcal{D}}\right)$.

Now we can formally define the term frequency-inverse document frequency as a normalized version of term-frequency,

$$\text{tf-idf}(t,d) \, = \, tf(t,d) \cdot idf(t,\mathcal{D}) $$
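The two formulas above can be sketched in a few lines of plain Python. The tiny tokenized corpus and the function names `tf`, `idf` and `tf_idf` are assumptions made purely for illustration; they follow the definitions in this post, not Scikit-learn's implementation.

```python
import math

# A hypothetical, already tokenized corpus of four short documents.
corpus = [
    ["the", "car", "engine"],
    ["the", "car", "race"],
    ["the", "rocket", "engine"],
    ["a", "space", "rocket"],
]

def tf(term, doc):
    # Number of times the term occurs in the document.
    return doc.count(term)

def idf(term, docs):
    # log(N / (1 + N_t)), with N_t the number of documents containing the term.
    n_docs = len(docs)
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(n_docs / (1 + n_containing))

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ["the", "car", "race", "boat"]:
    print(term, round(idf(term, corpus), 3))
print("tf-idf of 'car' in the first document:", round(tf_idf("car", corpus[0], corpus), 3))
```

The common word "the" (three of four documents) receives an idf of zero, the rarer "race" receives $\log 2$, and a term that never occurs, such as "boat", still receives a finite value thanks to the $1$ in the denominator.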

Like the term-frequency, the term frequency-inverse document frequency is a sparse matrix, where again, each row is a document in our training corpus ($\mathcal{D}$) and each column corresponds to a term/word in the bag-of-words list. The $\text{tf-idf}$ matrix can be constructed using the sklearn TfidfTransformer class:

In [10]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_tf)
X_train_tfidf.shape
(11314, 130107)

We should note that the $\text{tf-idf}$ matrix is the same shape as the $\text{tf}$ matrix, but the two have different values.
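It is worth knowing that TfidfTransformer does not use exactly the idf formula written earlier. With its default settings (smooth_idf=True, norm='l2') it computes $\ln\left(\frac{1 + N}{1 + N_{t}}\right) + 1$ for each term and then rescales every row to unit Euclidean length. The sketch below checks this on a small, made-up count matrix; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical term counts: 3 documents (rows) by 3 terms (columns).
counts = np.array([[2, 1, 0],
                   [1, 0, 1],
                   [1, 0, 0]])

toy_tfidf = TfidfTransformer()          # defaults: smooth_idf=True, norm='l2'
sk_values = toy_tfidf.fit_transform(counts).toarray()

# Reproduce the same numbers by hand.
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                     # documents containing each term
idf_manual = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed idf
raw = counts * idf_manual
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)  # L2-normalize each row

print(np.allclose(sk_values, manual))
```

The row normalization is what compensates for document length, so long documents do not dominate simply because they contain more words.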

Now that we have built our $\text{tf-idf}$ matrix we can start to look at which terms/words are most associated with document topics and then build a predictive model to classify the documents' topics. Let's first go over the model we will be using for prediction.

A Basic Naive Bayes Model

One of the most basic models for text classification is the Naive Bayes model. The Naive Bayes classification model predicts the document topic, $y \in \{C_{1},C_{2},\ldots, C_{20}\}$, where $C_{k}$ is the class or topic, based on the document features $\textbf{x} \in \mathbb{N}^{p}$, where $p$ is the number of terms in our bag-of-words list. The feature vector,

$$\textbf{x} \, = \, \left[ x_{1}, x_{2}, \ldots , x_{p} \right] $$

contains the value $x_{i}$ (a count or $\text{tf-idf}$ weight) of the $i$-th term in our bag-of-words list. Using Bayes' theorem we can develop a model to predict the topic class ($C_{k}$) of a document from its feature vector $\textbf{x}$,

$$P\left(C_{k} \, \vert \, x_{1}, \ldots , x_{p} \right) \; = \; \frac{P\left(x_{1}, \ldots, x_{p} \, \vert \, C_{k} \right)P(C_{k})}{P\left(x_{1}, \ldots, x_{p} \right)}$$

The Naive Bayes model makes the "naive" assumption that, given the class, the probability of each term's $\text{tf-idf}$ value is conditionally independent of every other term's. This reduces our conditional probability function to the product,

$$ P\left(x_{1}, \ldots, x_{p} \, \vert \, C_{k} \right) \; = \; \Pi_{i=1}^{p} P\left(x_{i} \, \vert \, C_{k} \right)$$

Subsequently Bayes' theorem for our classification problem becomes,

$$P\left(C_{k} \, \vert \, x_{1}, \ldots , x_{p} \right) \; = \; \frac{ P(C_{k}) \, \Pi_{i=1}^{p} P\left(x_{i} \, \vert \, C_{k} \right)}{P\left(x_{1}, \ldots, x_{p} \right)}$$

Since the denominator is independent of the class ($C_{k}$) we can use a Maximum A Posteriori method to estimate the document topic,

$$ \hat{y} \, = \, \text{arg max}_{k}\; P(C_{k}) \, \Pi_{i=1}^{p} P\left(x_{i} \, \vert \, C_{k} \right)$$

The prior, $P(C_{k}),$ is often taken to be the relative frequency of the class in the training corpus, while the form of the conditional distribution $P\left(x_{i} \, \vert \, C_{k} \right)$ is a choice of the modeler and determines the type of Naive Bayes classifier.
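In practice a product of many small probabilities quickly underflows in floating-point arithmetic, so the estimate is usually computed with logarithms. Since the logarithm is monotonically increasing, it leaves the arg max unchanged:

$$ \hat{y} \, = \, \text{arg max}_{k}\; \left[ \log P(C_{k}) \, + \, \sum_{i=1}^{p} \log P\left(x_{i} \, \vert \, C_{k} \right) \right]$$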

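We will use a multinomial Naive Bayes model, which is designed for discrete features such as the word counts in our $\text{tf}$ matrix. Strictly speaking the multinomial distribution requires integer counts, but in practice fractional weights such as those in our $\text{tf-idf}$ matrix often work as well. In the multinomial Naive Bayes model the conditional probability takes the form,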

$$ P\left(x_{1}, \ldots, x_{p} \, \vert \, C_{k} \right) \, = \, \frac{\left(\sum_{i=1}^{p} x_{i}\right)!}{\Pi_{i=1}^{p} x_{i}!} \Pi_{i=1}^{p} p_{k,i}^{x_{i}}$$

where $p_{k,i}$ is the probability that the $k$-th class will have the $i$-th bag-of-words term in its feature vector. This leads to our posterior distribution having the functional form,

$$P\left(C_{k} \, \vert \, x_{1}, \ldots , x_{p} \right) \; = \; \frac{ P(C_{k})}{P\left(x_{1}, \ldots, x_{p} \right)} \, \frac{\left(\sum_{i=1}^{p} x_{i}\right)!}{\Pi_{i=1}^{p} x_{i}!} \Pi_{i=1}^{p} p_{k,i}^{x_{i}}$$
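To make the formulas concrete, here is a minimal from-scratch sketch of a multinomial Naive Bayes classifier on a made-up, already tokenized training set. Everything here (the data, the names `nb_vocab`, `log_p` and `predict`) is illustrative and is not how Scikit-learn is implemented. The multinomial coefficient is dropped because it does not depend on the class, the probabilities $p_{k,i}$ are estimated from smoothed term counts, and the prediction is the arg max of the log posterior.

```python
import math
from collections import Counter, defaultdict

# Hypothetical tokenized training documents with their topics.
train = [
    (["car", "engine", "car"], "autos"),
    (["engine", "wheel"], "autos"),
    (["rocket", "orbit"], "space"),
    (["orbit", "rocket", "engine"], "space"),
]
nb_vocab = sorted({word for doc, _ in train for word in doc})
alpha = 1.0  # additive (Laplace) smoothing

class_docs = Counter(label for _, label in train)   # documents per class
term_counts = defaultdict(Counter)                  # term counts per class
for doc, label in train:
    term_counts[label].update(doc)

def log_prior(label):
    # Relative frequency of the class in the training set.
    return math.log(class_docs[label] / len(train))

def log_p(term, label):
    # Smoothed estimate of p_{k,i}: (count + alpha) / (total + alpha * p).
    total = sum(term_counts[label].values())
    return math.log((term_counts[label][term] + alpha) / (total + alpha * len(nb_vocab)))

def predict(doc):
    # Terms outside the vocabulary are ignored.
    counts = Counter(word for word in doc if word in nb_vocab)
    scores = {
        label: log_prior(label) + sum(n * log_p(term, label) for term, n in counts.items())
        for label in class_docs
    }
    return max(scores, key=scores.get)

print(predict(["car", "wheel"]))
print(predict(["rocket", "engine"]))
```

Without the smoothing term, a word never seen in a class would give that class a probability of exactly zero, no matter how strongly the other words point to it.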

We can instantiate a multinomial Naive Bayes classifier using the Scikit-learn library and fit it to our $\text{tf-idf}$ matrix using the commands,

In [14]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

mod = MultinomialNB()
mod.fit(X_train_tfidf, twenty_train.target)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

The term alpha=1 means we are using Laplace smoothing. We can now look at the accuracy of our classifier using Scikit-learn's accuracy_score function:

In [15]:
X_test_tf = count_vect.transform(twenty_test.data)
X_test_tfidf = tfidf_transformer.transform(X_test_tf)

predicted = mod.predict(X_test_tfidf)
print("Accuracy:", accuracy_score(twenty_test.target, predicted))
Accuracy: 0.77389803505

Since the topics within the training corpus are pretty well balanced, the accuracy is pretty representative of the performance of our model. However, we can get a more detailed view of the performance of our classifier by using the Scikit-learn library's classification report function,

In [16]:
from sklearn.metrics import classification_report
print(classification_report(twenty_test.target, predicted,
                            target_names=twenty_test.target_names))
                          precision    recall  f1-score   support

             alt.atheism       0.80      0.52      0.63       319
           comp.graphics       0.81      0.65      0.72       389
 comp.os.ms-windows.misc       0.82      0.65      0.73       394
comp.sys.ibm.pc.hardware       0.67      0.78      0.72       392
   comp.sys.mac.hardware       0.86      0.77      0.81       385
          comp.windows.x       0.89      0.75      0.82       395
            misc.forsale       0.93      0.69      0.80       390
               rec.autos       0.85      0.92      0.88       396
         rec.motorcycles       0.94      0.93      0.93       398
      rec.sport.baseball       0.92      0.90      0.91       397
        rec.sport.hockey       0.89      0.97      0.93       399
               sci.crypt       0.59      0.97      0.74       396
         sci.electronics       0.84      0.60      0.70       393
                 sci.med       0.92      0.74      0.82       396
               sci.space       0.84      0.89      0.87       394
  soc.religion.christian       0.44      0.98      0.61       398
      talk.politics.guns       0.64      0.94      0.76       364
   talk.politics.mideast       0.93      0.91      0.92       376
      talk.politics.misc       0.96      0.42      0.58       310
      talk.religion.misc       0.97      0.14      0.24       251

             avg / total       0.82      0.77      0.77      7532
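For readers less familiar with the columns of this report, here is a small sketch of how precision, recall and the f1-score are computed for a single class. The function name `precision_recall_f1` and the toy labels are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # The f1-score is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: the classifier is cautious about "a" and generous with "b".
y_true = ["a", "a", "a", "a", "b", "b"]
y_pred = ["a", "a", "b", "b", "b", "b"]

print(precision_recall_f1(y_true, y_pred, "a"))
print(precision_recall_f1(y_true, y_pred, "b"))
```

Class "a" has perfect precision but only 50% recall, which is the same pattern as the talk.religion.misc row: a precision of 0.97 with a recall of 0.14 gives an f1-score of $2 \cdot 0.97 \cdot 0.14 / 1.11 \approx 0.24$.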

We can see that the average precision is around 82%, but the average recall and f1-score are 77%. We can try to improve this using some more advanced preprocessing techniques. This will be made much easier by using the concept of Scikit-learn's pipeline utility.

Scikit-learn Pipelines

In Scikit-learn, a pipeline is a sequence of transforms followed by a final estimator. Intermediate steps within the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The CountVectorizer and TfidfTransformer were used as transformers in our example above. The final estimator of a pipeline only needs to implement the fit method. We can see the simplicity of pipelines by using one to re-implement our above analysis with the Naive Bayes model:

In [24]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('model', MultinomialNB()),])

mod = pipe.fit(twenty_train.data, twenty_train.target)

predicted = mod.predict(twenty_test.data)
print("Accuracy:", accuracy_score(twenty_test.target, predicted))
Accuracy: 0.77389803505

Notice how when using pipelines we passed twenty_train.data and twenty_test.data into the fit and predict methods instead of the X_train_tfidf and X_test_tfidf objects as we did in the previous section. The transformations occur under the hood using the Scikit-learn pipeline functionality.
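The same idea can be tried on a corpus small enough to read in full. The sketch below uses four made-up sentences and hypothetical names (`toy_text`, `toy_labels`, `toy_pipe`); it only illustrates that raw strings go in and predicted labels come out.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

toy_text = [
    "the car needs a new engine",
    "engine and wheels of the car",
    "the rocket reached orbit",
    "a rocket in orbit around the moon",
]
toy_labels = ["autos", "autos", "space", "space"]

toy_pipe = Pipeline([("vect", CountVectorizer()),
                     ("tfidf", TfidfTransformer()),
                     ("model", MultinomialNB())])

# fit() runs fit_transform on each transformer in turn, then fits the model.
toy_pipe.fit(toy_text, toy_labels)

# predict() applies transform on each transformer, then the model's predict.
print(toy_pipe.predict(["my car engine", "rocket orbit"]))
print(list(toy_pipe.named_steps))
```

The individual steps remain accessible by name through named_steps, which is handy for inspecting the fitted vocabulary or the model afterwards.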

An Improved Naive Bayes Model

We can look to improve our model by removing stop words, which are common words in the English language that add little information to the text. These include words such as "the", "at", "is", etc. We can remove them in the CountVectorizer constructor call,

In [18]:
pipe = Pipeline([('vect', CountVectorizer(stop_words='english')),
                  ('tfidf', TfidfTransformer()),
                  ('model', MultinomialNB()),])

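mod = pipe.fit(twenty_train.data, twenty_train.target)

predicted = mod.predict(twenty_test.data)

print(classification_report(twenty_test.target, predicted,
                            target_names=twenty_test.target_names))
print("Accuracy:", accuracy_score(twenty_test.target, predicted))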
                          precision    recall  f1-score   support

             alt.atheism       0.80      0.69      0.74       319
           comp.graphics       0.78      0.72      0.75       389
 comp.os.ms-windows.misc       0.79      0.72      0.75       394
comp.sys.ibm.pc.hardware       0.68      0.81      0.74       392
   comp.sys.mac.hardware       0.86      0.81      0.84       385
          comp.windows.x       0.87      0.78      0.82       395
            misc.forsale       0.87      0.80      0.83       390
               rec.autos       0.88      0.91      0.90       396
         rec.motorcycles       0.93      0.96      0.95       398
      rec.sport.baseball       0.91      0.92      0.92       397
        rec.sport.hockey       0.88      0.98      0.93       399
               sci.crypt       0.75      0.96      0.84       396
         sci.electronics       0.84      0.65      0.74       393
                 sci.med       0.92      0.79      0.85       396
               sci.space       0.82      0.94      0.88       394
  soc.religion.christian       0.62      0.96      0.76       398
      talk.politics.guns       0.66      0.95      0.78       364
   talk.politics.mideast       0.95      0.94      0.94       376
      talk.politics.misc       0.94      0.52      0.67       310
      talk.religion.misc       0.95      0.24      0.38       251

             avg / total       0.83      0.82      0.81      7532

Accuracy: 0.816914498141

We can see that removing stop words gives us a decent improvement in our recall and f1-score!
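To see what the stop_words='english' option actually does, here is a small sketch on a single made-up sentence. The variable names are illustrative; ENGLISH_STOP_WORDS is the built-in word list that Scikit-learn uses for this option.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

sentence = ["The car is at the garage"]

with_stops = CountVectorizer().fit(sentence)
without_stops = CountVectorizer(stop_words="english").fit(sentence)

print(sorted(with_stops.vocabulary_))
print(sorted(without_stops.vocabulary_))

# The three dropped words are all members of the built-in list.
print(all(word in ENGLISH_STOP_WORDS for word in ["the", "is", "at"]))
```

Only the content-bearing words survive, which shrinks the vocabulary and keeps very frequent function words from influencing the classifier.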

Not only do pipelines make it much easier to swap out our model (say, replacing our Naive Bayes classifier with a support vector machine), but they also allow us to assemble several steps that can be cross-validated together while setting different parameters. To do this, pipelines enable setting parameters of the various steps using the step name and the parameter name separated by a ‘__’.

In [19]:
from sklearn.model_selection import GridSearchCV
parameters = {'tfidf__use_idf': (True, False),
              'model__alpha': (1e1, 1e-3),
              'model__fit_prior': (True,False)}
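The step-name and parameter-name convention can be checked directly on a pipeline, without running a grid search. The sketch below builds an unfitted pipeline with the hypothetical name `demo_pipe` and uses the get_params and set_params methods that every Scikit-learn estimator provides.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

demo_pipe = Pipeline([("vect", CountVectorizer()),
                      ("tfidf", TfidfTransformer()),
                      ("model", MultinomialNB())])

# Every nested parameter is exposed as <step name>__<parameter name>.
nested = [name for name in demo_pipe.get_params() if "__" in name]
print("model__alpha" in nested, "tfidf__use_idf" in nested)

# set_params routes each value to the step named before the double underscore.
demo_pipe.set_params(model__alpha=0.001, tfidf__use_idf=False)
print(demo_pipe.named_steps["model"].alpha)
print(demo_pipe.named_steps["tfidf"].use_idf)
```

A grid search does essentially this for every combination in the parameter dictionary before fitting and scoring the pipeline.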

Notice that when we wish to vary the model parameter alpha (which is a smoothing regularizer), we refer to it through the pipeline step name "model" and not through the mod object. We can perform the grid search in parallel by setting n_jobs=-1:

In [20]:
grid_search = GridSearchCV(pipe, parameters, n_jobs=-1)

and then fit the data to perform the actual grid search,

In [21]:
gs_model = grid_search.fit(twenty_train.data, twenty_train.target)

Our model gs_model uses the parameters which had the best cross-validated score. We can see a full description of the grid search,

In [22]:
gs_model.cv_results_
{'mean_fit_time': array([ 8.15868568,  7.01430893,  6.70327282,  6.73731907,  7.27285361,
         6.670681  ,  7.26607998,  8.0114274 ]),
 'mean_score_time': array([ 5.09360393,  3.90251939,  3.3038861 ,  2.89747898,  2.99467278,
         3.0746096 ,  3.55534299,  3.57902328]),
 'mean_test_score': array([ 0.81014672,  0.74509457,  0.84567792,  0.7722291 ,  0.89826763,
         0.90374757,  0.89800247,  0.90321725]),
 'mean_train_score': array([ 0.8885449 ,  0.81337281,  0.92668332,  0.84505932,  0.99898358,
         0.99818811,  0.99898358,  0.9982323 ]),
 'param_model__alpha': masked_array(data = [10.0 10.0 10.0 10.0 0.001 0.001 0.001 0.001],
              mask = [False False False False False False False False],
        fill_value = ?),
 'param_model__fit_prior': masked_array(data = [True True False False True True False False],
              mask = [False False False False False False False False],
        fill_value = ?),
 'param_tfidf__use_idf': masked_array(data = [True False True False True False True False],
              mask = [False False False False False False False False],
        fill_value = ?),
 'params': [{'model__alpha': 10.0,
   'model__fit_prior': True,
   'tfidf__use_idf': True},
  {'model__alpha': 10.0, 'model__fit_prior': True, 'tfidf__use_idf': False},
  {'model__alpha': 10.0, 'model__fit_prior': False, 'tfidf__use_idf': True},
  {'model__alpha': 10.0, 'model__fit_prior': False, 'tfidf__use_idf': False},
  {'model__alpha': 0.001, 'model__fit_prior': True, 'tfidf__use_idf': True},
  {'model__alpha': 0.001, 'model__fit_prior': True, 'tfidf__use_idf': False},
  {'model__alpha': 0.001, 'model__fit_prior': False, 'tfidf__use_idf': True},
  {'model__alpha': 0.001, 'model__fit_prior': False, 'tfidf__use_idf': False}],
 'rank_test_score': array([6, 8, 5, 7, 3, 1, 4, 2], dtype=int32),
 'split0_test_score': array([ 0.80953642,  0.74410596,  0.84211921,  0.77271523,  0.90092715,
         0.90596026,  0.90119205,  0.90516556]),
 'split0_train_score': array([ 0.88751824,  0.81177875,  0.92426051,  0.84454172,  0.99907149,
         0.99827563,  0.99907149,  0.99814299]),
 'split1_test_score': array([ 0.81129075,  0.74688577,  0.8470713 ,  0.77232971,  0.89398357,
         0.90087464,  0.89424861,  0.90087464]),
 'split1_train_score': array([ 0.88900676,  0.8154091 ,  0.92839146,  0.84604164,  0.99893913,
         0.99814348,  0.99893913,  0.9984087 ]),
 'split2_test_score': array([ 0.80961232,  0.74429102,  0.84784918,  0.771641  ,  0.89989379,
         0.90440786,  0.89856612,  0.90361126]),
 'split2_train_score': array([ 0.8891097 ,  0.81293058,  0.92739799,  0.84459459,  0.99894012,
         0.9981452 ,  0.99894012,  0.9981452 ]),
 'std_fit_time': array([ 0.12689986,  0.82615212,  0.17525967,  0.1984766 ,  0.14209814,
         0.19749698,  0.51194187,  0.33546722]),
 'std_score_time': array([ 0.10623418,  0.81094472,  0.11457724,  0.08389665,  0.13548952,
         0.26187784,  0.26293552,  0.16763378]),
 'std_test_score': array([ 0.00080981,  0.00126924,  0.00253816,  0.00044424,  0.00305951,
         0.00212866,  0.00286346,  0.00177435]),
 'std_train_score': array([  7.27174443e-04,   1.51471226e-03,   1.76053862e-03,
          6.94942174e-04,   6.21655197e-05,   6.18947902e-05,
          6.21655197e-05,   1.24737951e-04])}
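The raw dictionary is hard to read, so here is a short sketch that picks out the winning combination from the numbers printed above. The values of mean_test_score and params are copied from that output; the variable names are illustrative. A fitted GridSearchCV object also exposes this summary directly through its best_params_ and best_score_ attributes.

```python
# Values copied from the cv_results_ output above.
mean_test_score = [0.81014672, 0.74509457, 0.84567792, 0.7722291,
                   0.89826763, 0.90374757, 0.89800247, 0.90321725]
params = [
    {"model__alpha": 10.0, "model__fit_prior": True, "tfidf__use_idf": True},
    {"model__alpha": 10.0, "model__fit_prior": True, "tfidf__use_idf": False},
    {"model__alpha": 10.0, "model__fit_prior": False, "tfidf__use_idf": True},
    {"model__alpha": 10.0, "model__fit_prior": False, "tfidf__use_idf": False},
    {"model__alpha": 0.001, "model__fit_prior": True, "tfidf__use_idf": True},
    {"model__alpha": 0.001, "model__fit_prior": True, "tfidf__use_idf": False},
    {"model__alpha": 0.001, "model__fit_prior": False, "tfidf__use_idf": True},
    {"model__alpha": 0.001, "model__fit_prior": False, "tfidf__use_idf": False},
]

# The best combination is the one with the highest mean cross-validated score.
best_index = max(range(len(params)), key=lambda i: mean_test_score[i])

print(best_index, mean_test_score[best_index])
print(params[best_index])
```

All four combinations with alpha=0.001 score around 0.90, well ahead of those with alpha=10, so the smoothing strength matters far more here than either of the other two settings.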

We can also see how our model performs on the test set,

In [23]:
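predicted = gs_model.predict(twenty_test.data)

print(classification_report(twenty_test.target, predicted,
                            target_names=twenty_test.target_names))
print("Accuracy: ", accuracy_score(twenty_test.target, predicted))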
                          precision    recall  f1-score   support

             alt.atheism       0.86      0.81      0.83       319
           comp.graphics       0.66      0.74      0.70       389
 comp.os.ms-windows.misc       0.72      0.63      0.67       394
comp.sys.ibm.pc.hardware       0.65      0.72      0.68       392
   comp.sys.mac.hardware       0.83      0.82      0.82       385
          comp.windows.x       0.83      0.76      0.80       395
            misc.forsale       0.80      0.82      0.81       390
               rec.autos       0.89      0.89      0.89       396
         rec.motorcycles       0.93      0.96      0.95       398
      rec.sport.baseball       0.96      0.93      0.94       397
        rec.sport.hockey       0.94      0.97      0.96       399
               sci.crypt       0.89      0.94      0.91       396
         sci.electronics       0.79      0.74      0.76       393
                 sci.med       0.90      0.83      0.86       396
               sci.space       0.86      0.91      0.89       394
  soc.religion.christian       0.86      0.94      0.90       398
      talk.politics.guns       0.77      0.90      0.83       364
   talk.politics.mideast       0.97      0.93      0.95       376
      talk.politics.misc       0.76      0.65      0.70       310
      talk.religion.misc       0.72      0.64      0.68       251

             avg / total       0.83      0.83      0.83      7532

Accuracy:  0.831651619756

A small improvement, but an improvement nonetheless!

The Naive Bayes classifier can be fast compared to more sophisticated methods due to the decoupling of the class conditional feature distributions, i.e.

$$ P\left(x_{1}, \ldots, x_{p} \, \vert \, C_{k} \right) \; = \; \Pi_{i=1}^{p} P\left(x_{i} \, \vert \, C_{k} \right)$$

The decoupling of the class conditional distributions allows each distribution to be independently estimated as a one-dimensional distribution and helps to alleviate problems with the curse of dimensionality.


In this blog post we covered document classification using Scikit-learn and the 20 News Groups dataset. We went over the basics of term frequency-inverse document frequency, pipelines and the Naive Bayes classifier. While we were able to improve the accuracy of the basic Naive Bayes classifier from 77% to 83%, there are more sophisticated models, like support vector machines, which can achieve higher performance.

One thing that we did not address was the topic of stemming and lemmatisation, both of which reduce a word to its base form and, when used, can improve the performance of text classification models. Lemmatisation differs from stemming because it depends on identifying the intended part of speech and meaning of a word in a sentence. The Natural Language Toolkit (NLTK) provides both stemmers and lemmatizers, and spaCy provides lemmatization, but that will have to wait for another day!