
Natural language processing, or NLP, is a hot topic in data science and machine learning. While research in NLP dates back to the 1950s, the real revolution in this domain came in the 1980s and 1990s with the introduction of statistical models and fast computational power. Before this, most language processing tasks made use of hand-coded rules, which were generally not very robust.

The span of topics in Natural Language Processing is immense, and I'll just be getting to the tip of the iceberg by going over the topic of document classification. I will be working with the Scikit-learn library and using its provided dataset, the 20 News Groups, which is a collection of almost 20,000 articles on 20 different topics or 'newsgroups'. We can obtain the training and testing sets directly with the following commands:

In [1]:

```
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
```

We can then view the total number of articles:

In [2]:

```
len(twenty_train.data) + len(twenty_test.data)
```

Out[2]:

All the documents within the data set belong to the following 20 topics,

In [3]:

```
twenty_train.target_names
```

Out[3]:

We can look at the actual message within the documents,

In [4]:

```
print(twenty_train.data[0])
```

If we look at the target classes, however, we see that they are actually encoded as integer categories,

In [5]:

```
twenty_train.target[0]
```

Out[5]:

In [6]:

```
codings = list(range(len(twenty_train.target_names)))
target_dic = dict(zip(codings, twenty_train.target_names))
```

Now we can see the actual document's topic:

In [7]:

```
target_dic[twenty_train.target[0]]
```

Out[7]:

One piece of terminology that is used frequently in NLP is the text **corpus**. A text corpus is a large and structured set of texts, and the term describes exactly what the `twenty_train.data` and `twenty_test.data` datasets are.

Now that we have an idea of what kind of data we are working with, we can start to do some machine learning on it. As with all datasets, some preprocessing is required before machine learning. With numerical data this means some cleaning and scaling of the features. However, in Natural Language Processing there is a much more substantial preprocessing stage, which we'll go over next.

The first thing we need to go over is the concept of the **bag-of-words model**. *In the bag-of-words model, a text (such as a sentence or a document) is represented as a "bag" or list of its words, disregarding grammar and even word order, but keeping the multiplicity of the words.* A two-document example is:

**D1:** Hi, I am Mike and I like Boston.

**D2:** Boston is a city and people in Boston like the Red Sox.

From these two documents, a list, or 'bag-of-words', is constructed:

```
bag = ['Hi', 'I', 'am', 'Mike', 'and', 'like', 'Boston', 'is',
       'a', 'city', 'people', 'in', 'the', 'Red', 'Sox']
```

Notice how in our bag-of-words we have dropped repetitions of the words 'I', 'and', 'like' and 'Boston'; we will show how the multiplicity of words enters into our model next.

The bag-of-words model is mainly used as a tool of feature generation. After transforming the text into a "bag of words", we can calculate various measures to characterize the document. In order to do so we have to generate a vector for each document that represents the number of times each entry in the bag-of-words list appears in the text. The order of entries in the vector corresponds to the order of the entries in the bag-of-words list. For example, document D1 would have the vector,

```
[1, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

while the second document, D2, would have the vector,

```
[0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1]
```

Each entry of these vectors is the frequency, or count, of the corresponding entry in the bag-of-words list. When we have a stacked collection of (row) vectors, or matrix, where each row corresponds to a document (vector) and each column corresponds to a word in the bag-of-words list, this is known as our **term-frequency ($\text{tf}$) document matrix**. The general formula for an entry in the $\text{tf}$ matrix is,

$$\text{tf}(t,d) \, = \, f_{t,d}$$

where $f_{t,d}$ is the number of times the term $t$ occurs in document $d \in \mathcal{D}$, where $\mathcal{D}$ is our text corpus. We can create a term-frequency matrix using Scikit-learn's `CountVectorizer` class:
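As a quick sketch before applying it to the full corpus, `CountVectorizer` can reproduce the toy example above. Note that its default tokenizer lowercases the text and drops single-character tokens such as 'I' and 'a', so the learned vocabulary differs slightly from our hand-built bag-of-words list:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Hi, I am Mike and I like Boston.",
        "Boston is a city and people in Boston like the Red Sox."]

vect = CountVectorizer()         # default: lowercase, ignore 1-character tokens
X = vect.fit_transform(docs)     # sparse document-term (tf) matrix

print(sorted(vect.vocabulary_))  # the learned bag-of-words list
print(X.toarray())               # one row of term counts per document
```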

In [8]:

```
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_tf = count_vect.fit_transform(twenty_train.data)
print("Shape of term-frequency matrix:", X_train_tf.shape)
```

In [9]:

```
print("Number of training documents: ", len(twenty_train.data))
```

Most often, term frequency alone is not a good measure of the importance of a word/term to a document's topic. Very common words like "the", "a" and "to" are almost always the terms with the highest frequency in a text. Thus, having a high raw count of the number of times a term appears in a document does not necessarily mean that the corresponding word is more important. Furthermore, longer documents can have high frequencies of terms that do not correlate with the document's topic, but instead occur in high numbers solely due to the length of the document.

To circumvent the limitation of term frequency, we often normalize it by the **inverse document frequency (idf)**. This results in the **term frequency-inverse document frequency (tf-idf)** matrix. The *inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents in the corpus*. We can give a formal definition of the inverse document frequency by letting $\mathcal{D}$ be the corpus, or the set of all documents, $N$ be the number of documents in the corpus, and $N_{t,D}$ be the number of documents that contain the term $t$. Then,

$$\text{idf}(t,\mathcal{D}) \; = \; \log\left(\frac{N}{1 + N_{t,D}}\right)$$

The reason for the presence of the $1$ is smoothing. Without it, a term/word that did not appear in any training document would cause a division by zero and an inverse document frequency of $\text{idf}(t,\mathcal{D}) = \infty$. With the $1$ present, the value stays finite.
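Here is a minimal sketch of this smoothed formula applied to the two toy documents from earlier (the `idf` function below is a hypothetical helper, not part of Scikit-learn):

```python
import math

def idf(term, docs):
    """Smoothed inverse document frequency: log(N / (1 + N_t))."""
    N = len(docs)
    n_t = sum(term in doc for doc in docs)  # number of documents containing the term
    return math.log(N / (1 + n_t))

corpus = [{'hi', 'i', 'am', 'mike', 'and', 'like', 'boston'},
          {'boston', 'is', 'a', 'city', 'and', 'people', 'in', 'like', 'the', 'red', 'sox'}]

print(idf('hi', corpus))     # in one of two documents -> log(2/2) = 0
print(idf('zebra', corpus))  # in no documents -> log(2/1), finite thanks to the 1
```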

Now we can formally define the term frequency-inverse document frequency as a normalized version of the term frequency,

$$\text{tf-idf}(t,d) \, = \, \text{tf}(t,d) \cdot \text{idf}(t,\mathcal{D}) $$

Like the term-frequency matrix, the term frequency-inverse document frequency matrix is a sparse matrix where, again, each row is a document in our training corpus ($\mathcal{D}$) and each column corresponds to a term/word in the bag-of-words list. The $\text{tf-idf}$ matrix can be constructed using the Scikit-learn `TfidfTransformer` class:

In [10]:

```
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_tf)
X_train_tfidf.shape
```

Out[10]:

We should note that the $\text{tf-idf}$ matrix is the same shape as the $\text{tf}$ matrix, but the two have different values.

Now that we have built our $\text{tf-idf}$ matrix, we can start to look at which terms/words are most associated with document topics and then build a predictive model to classify a document's topic. Let's first go over the model we will be using for prediction.

One of the most basic models for text classification is the Naive Bayes model. The Naive Bayes classification model predicts the document topic $y \in \{C_{1},C_{2},\ldots, C_{20}\}$, where $C_{k}$ is the class or topic, based on the document features $\textbf{x} \in \mathbb{R}^{p}$, where $p$ is the number of terms in our bag-of-words list. The feature vector,

$$\textbf{x} \, = \, \left[ x_{1}, x_{2}, \ldots , x_{p} \right] $$

contains the $\text{tf-idf}$ value $x_{i}$ of the $i$-th term in our bag-of-words list. Using Bayes' theorem we can develop a model to predict the topic class ($C_{k}$) of a document from its feature vector $\textbf{x}$,

$$P\left(C_{k} \, \vert \, x_{1}, \ldots , x_{p} \right) \; = \; \frac{P\left(x_{1}, \ldots, x_{p} \, \vert \, C_{k} \right)P(C_{k})}{P\left(x_{1}, \ldots, x_{p} \right)}$$

The Naive Bayes model makes the "naive" assumption that the probability of each term's $\text{tf-idf}$ value is **conditionally independent** of every other term. This reduces our **conditional probability function** to the product,

$$P\left(x_{1}, \ldots, x_{p} \, \vert \, C_{k} \right) \; = \; \Pi_{i=1}^{p} P\left(x_{i} \, \vert \, C_{k} \right)$$

Subsequently Bayes' theorem for our classification problem becomes,

$$P\left(C_{k} \, \vert \, x_{1}, \ldots , x_{p} \right) \; = \; \frac{ P(C_{k}) \, \Pi_{i=1}^{p} P\left(x_{i} \, \vert \, C_{k} \right)}{P\left(x_{1}, \ldots, x_{p} \right)}$$

Since the denominator is independent of the class ($C_{k}$), we can use a Maximum A Posteriori (MAP) method to estimate the document topic,

$$ \hat{y} \, = \, \text{arg max}_{k}\; P(C_{k}) \, \Pi_{i=1}^{p} P\left(x_{i} \, \vert \, C_{k} \right)$$

The **prior**, $P(C_{k})$, is often taken to be the relative frequency of the class in the training corpus, while the form of the conditional distribution $P\left(x_{i} \, \vert \, C_{k} \right)$ is a choice of the modeler and determines the type of Naive Bayes classifier.

We will use a multinomial Naive Bayes model which works well when our features are discrete variables such as those in our $\text{tf-idf}$ matrix. In the multinomial Naive Bayes model the conditional probability takes the form,

$$ P\left(x_{1}, \ldots, x_{p} \, \vert \, C_{k} \right) \, = \, \frac{\left(\sum_{i=1}^{p} x_{i}\right)!}{\Pi_{i=1}^{p} x_{i}!} \Pi_{i=1}^{p} p_{k,i}^{x_{i}}$$

where $p_{k,i}$ is the probability that the $k$-th class will have the $i$-th bag-of-words term in its feature vector. This leads to our **posterior distribution** having the functional form,

$$ P\left(C_{k} \, \vert \, x_{1}, \ldots , x_{p} \right) \, \propto \, P(C_{k}) \, \Pi_{i=1}^{p} p_{k,i}^{x_{i}}$$
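As a toy numeric illustration of the MAP rule, with made-up priors and made-up conditional probabilities $p_{k,i}$ (the multinomial coefficient is dropped since it does not depend on the class, and we work in log space to avoid numerical underflow):

```python
import numpy as np

prior = np.array([0.6, 0.4])     # P(C_1), P(C_2) -- made-up values
p = np.array([[0.2, 0.5, 0.3],   # p_{1,i}: P(term_i | C_1)
              [0.7, 0.1, 0.2]])  # p_{2,i}: P(term_i | C_2)
x = np.array([3, 0, 1])          # term counts for one document

# log P(C_k) + sum_i x_i * log p_{k,i}, evaluated for every class k
scores = np.log(prior) + (x * np.log(p)).sum(axis=1)
print(scores.argmax())  # index of the MAP class
```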

We can instantiate a multinomial Naive Bayes classifier using the Scikit-learn library and fit it to our $\text{tf-idf}$ matrix using the commands,

In [14]:

```
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
mod = MultinomialNB()
mod.fit(X_train_tfidf, twenty_train.target)
```

Out[14]:

The default `alpha=1` shown in the output means we are using Laplace smoothing; that is, each term count is effectively incremented by one when estimating $P\left(x_{i} \, \vert \, C_{k} \right)$, so terms unseen in a class do not zero out the entire posterior. We can now look at the accuracy of our classifier using Scikit-learn's `accuracy_score` function:

In [15]:

```
X_test_tf = count_vect.transform(twenty_test.data)
X_test_tfidf = tfidf_transformer.transform(X_test_tf)
predicted = mod.predict(X_test_tfidf)
print("Accuracy:", accuracy_score(twenty_test.target, predicted))
```

In [16]:

```
from sklearn.metrics import classification_report
print(classification_report(twenty_test.target,
                            predicted,
                            target_names=twenty_test.target_names))
```

In Scikit-learn, pipelines are a sequence of transforms followed by a final estimator. Intermediate steps within the pipeline must be 'transforms', that is, they must implement fit and transform methods. The `CountVectorizer` and `TfidfTransformer` are used as transformers in our above example. The final estimator of a pipeline only needs to implement the fit method. We can see the simplicity of pipelines by using them to re-implement our above analysis with the Naive Bayes model:

In [24]:

```
from sklearn.pipeline import Pipeline
pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('model', MultinomialNB())])
mod = pipe.fit(twenty_train.data, twenty_train.target)
predicted = mod.predict(twenty_test.data)
print("Accuracy:", accuracy_score(twenty_test.target, predicted))
```

Notice how when using pipelines we passed `twenty_train.data` and `twenty_test.data` into the `fit` and `predict` methods instead of the `X_train_tfidf` and `X_test_tfidf` objects as we did in the previous section. The transformations occur under the hood through the Scikit-learn pipeline functionality.
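A minimal sketch of this equivalence on made-up toy documents and labels: calling `predict` on a fitted pipeline gives the same result as chaining the fitted steps by hand.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

docs = ["the cat sat on the mat", "dogs bark loudly",
        "a cat and a kitten", "loud barking dogs outside"]
labels = [0, 1, 0, 1]

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('model', MultinomialNB())])
pipe.fit(docs, labels)

# manually chain the fitted transformers before the final estimator
X = pipe.named_steps['vect'].transform(["barking dogs"])
X = pipe.named_steps['tfidf'].transform(X)
manual = pipe.named_steps['model'].predict(X)

print(manual == pipe.predict(["barking dogs"]))  # identical predictions
```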

We can look to improve our model by removing **stop words**, which are common words in the English language that do not add any information to the text. These include words such as "the", "at", "is", etc. We can remove them in the `CountVectorizer` constructor call:

In [18]:

```
pipe = Pipeline([('vect', CountVectorizer(stop_words='english')),
                 ('tfidf', TfidfTransformer()),
                 ('model', MultinomialNB())])
mod = pipe.fit(twenty_train.data, twenty_train.target)
predicted = mod.predict(twenty_test.data)
print(classification_report(twenty_test.target,
                            predicted,
                            target_names=twenty_test.target_names))
print("Accuracy:", accuracy_score(twenty_test.target, predicted))
```

We can see that removing stop words gives us a decent improvement in our recall and f1-score!

Not only do pipelines allow us to swap out our model much more easily (say, replacing our Naive Bayes classifier with a support vector machine), but they also allow us to assemble several steps that can be cross-validated together while setting different parameters. To do this, pipelines enable setting the parameters of the various steps using their names and the parameter name separated by a '__'.

In [19]:

```
from sklearn.model_selection import GridSearchCV
parameters = {'tfidf__use_idf': (True, False),
              'model__alpha': (1e1, 1e-3),
              'model__fit_prior': (True, False)}
```

Notice how we address the parameters of the classifier, such as `alpha` (which is a smoothing regularizer), by prefixing them with the pipeline step name "model" rather than referring to the `mod` object. We can perform the grid search in parallel by setting `n_jobs=-1`:

In [20]:

```
grid_search = GridSearchCV(pipe, parameters, n_jobs=-1)
```

and then fit the data to perform the actual grid search,

In [21]:

```
gs_model = grid_search.fit(twenty_train.data, twenty_train.target)
```

The resulting `gs_model` uses the parameters which had the best cross-validated score. We can see a full description of the grid search,

In [22]:

```
gs_model.cv_results_
```

Out[22]:

We can also see how our model performs on the test set,

In [23]:

```
predicted = gs_model.predict(twenty_test.data)
print(classification_report(twenty_test.target,
                            predicted,
                            target_names=twenty_test.target_names))
print("Accuracy: ", accuracy_score(twenty_test.target, predicted))
```

A small improvement, but an improvement nonetheless!

The Naive Bayes classifier can be fast compared to more sophisticated methods due to the decoupling of the class conditional feature distributions, i.e.

$$ P\left(x_{1}, \ldots, x_{p} \, \vert \, C_{k} \right) \; = \; \Pi_{i=1}^{p} P\left(x_{i} \, \vert \, C_{k} \right)$$

The decoupling of the class conditional distributions allows each distribution to be independently estimated as a one-dimensional distribution, which helps to alleviate problems with the curse of dimensionality.
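A minimal sketch of this decoupling with made-up counts: each class-conditional probability $p_{k,i}$ is estimated from per-class term counts one feature at a time (here with $\alpha = 1$ Laplace smoothing, as in `MultinomialNB`), rather than from a joint distribution over all $p$ terms:

```python
import numpy as np

# made-up term-count matrix: rows = documents, columns = terms
X = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 3, 1]])
y = np.array([0, 0, 1])  # class label of each document

alpha = 1.0  # Laplace smoothing
for k in np.unique(y):
    counts = X[y == k].sum(axis=0)  # N_{k,i}: total count of term i in class k
    p_ki = (counts + alpha) / (counts.sum() + alpha * X.shape[1])
    print(k, p_ki)  # estimated P(term_i | C_k), sums to 1 per class
```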

In this blog post we covered document classification using Scikit-learn and the 20 News Groups dataset. We went over the basics of term frequency-inverse document frequency, pipelines and the Naive Bayes classifier. While we were able to improve the accuracy of the basic Naive Bayes classifier from 77% to 83%, there are much more sophisticated models, like support vector machines, which can achieve higher performance.

One thing that we did not address was the topic of stemming and lemmatisation, both of which have to do with reducing a word down to its base form and, when used, can improve the performance of text classification models. Lemmatisation differs from stemming because it depends on identifying the intended part of speech and meaning of a word in a sentence. Stemmers and lemmatizers are both provided by the Natural Language Toolkit (NLTK) and spaCy libraries, but that will have to wait for another day!