Text Classification 2: Natural Language Toolkit

Table of Contents

1. Where We Left Off

2. NLTK: Stop Words, Stemming & Lemmatization

3. Hyperparameter Tuning With GridSearchCV

4. Conclusion

Where We Left Off

In the last blog post we covered text classification using Scikit-learn and Imbalanced-Learn on summaries of papers from arXiv. We went over the basics of term frequency-inverse document frequency, Naive Bayes and Support Vector Machines. We also discussed techniques for handling imbalanced data at both the data level and the algorithm level. In this post we'll pick up where we left off and cover uses of the Natural Language Toolkit (NLTK) and hyperparameter tuning. Specifically, we will discuss the effect of stop words, stemming and lemmatization on the previously mentioned classifiers.

The first thing we need to do is connect to our MongoDB database:

In [1]:
import pymongo
conn = pymongo.MongoClient('mongodb://mongodb:27017')
db   = conn.db_arxiv

Then get the data in the Pandas dataframe format again:

In [2]:
import pandas as pd

# projection for subselecting only `text` and `category` fields
project = {"_id":0,"text":1,"category":1}

# get the training set
train_df = pd.DataFrame(db.train_cs_papers.find({},project))

# get the testing set
test_df = pd.DataFrame(db.test_cs_papers.find({},project))

Let's relabel our target variable. We create the mapping between the target and the text category, as well as the y_test matrix for one-vs-rest classification, which we need to make the ROC and precision/recall curves:

In [3]:
from sklearn.preprocessing import LabelEncoder, label_binarize
import numpy as np

labeler            = LabelEncoder()
train_df["target"] = labeler.fit_transform(train_df["category"])
test_df["target"]  = labeler.transform(test_df["category"])

# get the number of classes
n_classes = len(train_df["target"].unique())

# classes = [0,1,2,3]
classes   = np.sort(train_df["target"].unique())

# relabel the test set
y_test = label_binarize(test_df["target"], classes=classes)

mapping = dict(zip(labeler.classes_, range(len(labeler.classes_))))
{'ai': 0, 'cv': 1, 'ml': 2, 'ro': 3}
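
To make the one-vs-rest relabeling concrete, here is a minimal pure-Python sketch of the kind of indicator matrix that label_binarize returns when there are more than two classes: one row per document and one column per class. The helper `binarize_labels` and the toy targets are hypothetical and only mimic the behaviour; this is not Scikit-learn's implementation.

```python
def binarize_labels(targets, classes):
    """Return one indicator row per target: 1 in the column of its class, 0 elsewhere."""
    return [[1 if target == cls else 0 for cls in classes] for target in targets]

# the mapping printed above
mapping = {'ai': 0, 'cv': 1, 'ml': 2, 'ro': 3}
classes = sorted(mapping.values())

# hypothetical targets for three test documents: 'ml', 'ai' and 'ro'
toy_targets = [mapping['ml'], mapping['ai'], mapping['ro']]

toy_y = binarize_labels(toy_targets, classes)
for row in toy_y:
    print(row)
```

Each column can then be treated as its own binary problem ("this class" versus "the rest") when drawing a curve per class.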

Let's remind ourselves where we left off with the modeling by revisiting the weighted Support Vector Classifier:

In [4]:
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer)

from sklearn.metrics        import balanced_accuracy_score
from sklearn.pipeline       import Pipeline 

from sklearn.svm import LinearSVC

svm_pipe = Pipeline([('vect',    CountVectorizer()),
                     ('tfidf',   TfidfTransformer()),
                     ('model',   LinearSVC(class_weight='balanced',

We wrap our model evaluator function with functools.partial so that we only have to pass in a different pipeline each time we call it:

In [5]:
from utils.evaluator import evaluate_model
from functools import partial

evaluate_pipeline = partial(evaluate_model,

              precision    recall  f1-score   support

          ai       0.88      0.91      0.89       500
          cv       0.94      0.92      0.93       500
          ml       0.87      0.89      0.88       500
          ro       0.89      0.75      0.81        75

    accuracy                           0.90      1575
   macro avg       0.89      0.86      0.88      1575
weighted avg       0.90      0.90      0.90      1575

balanced_accuracy 0.8636666666666667
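
As a reminder of what partial does, here is a minimal sketch with a hypothetical stand-in function. The name `evaluate`, its arguments and the numbers passed to it are assumptions for illustration only; they are not the signature of evaluate_model.

```python
from functools import partial

def evaluate(pipeline_name, train_size, test_size):
    """Hypothetical stand-in for an evaluation function that takes several arguments."""
    return f"{pipeline_name}: trained on {train_size} docs, tested on {test_size} docs"

# freeze the arguments that never change between calls
evaluate_fixed = partial(evaluate, train_size=1000, test_size=1575)

# now only the argument that varies has to be supplied
report = evaluate_fixed("svm_pipe")
print(report)
```

The frozen arguments are stored on the partial object, so every later call only needs the one argument that changes.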

The ROC and precision/recall curves for this model are:

In [6]:
from utils.Plot_ROC_PR_Curve import plot_roc_pr

y_pred = svm_pipe.decision_function(test_df["text"])

plot_roc_pr(y_pred = y_pred, y_test = y_test)

Now let's improve our models using the Natural Language Toolkit!

NLTK: Stop Words, Stemming, & Lemmatization

Stop Words

We can try to improve our model by removing stop words, which are common words in the English language that add little information to the text. These include words such as "the", "at" and "is". Let's look at an example using the Natural Language Toolkit (NLTK). First we pick an example document so that we can see the effect of removing its stop words.

In [6]:
# example document
doc = train_df["text"][242]
E-RES is a system that implements the Language E, a logic for reasoning about
narratives of action occurrences and observations. E's semantics is
model-theoretic, but this implementation is based on a sound and complete
reformulation of E in terms of argumentation, and uses general computational
techniques of argumentation frameworks. The system derives sceptical
non-monotonic consequences of a given reformulated theory which exactly
correspond to consequences entailed by E's model-theory. The computation relies
on a complimentary ability of the system to derive credulous non-monotonic
consequences together with a set of supporting assumptions which is sufficient
for the (credulous) conclusion to hold. E-RES allows theories to contain
general action laws, statements about action occurrences, observations and
statements of ramifications (or universal laws). It is able to derive
consequences both forward and backward in time. This paper gives a short
overview of the theoretical basis of E-RES and illustrates its use on a variety
of examples. Currently, E-RES is being extended so that the system can be used
for planning.

We import the nltk package and download the data required for stopwords.

In [7]:
import nltk

for package in ['stopwords','punkt','wordnet']:
    nltk.download(package)
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Unzipping tokenizers/
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/

Now we collect the stop words as a set called stop_words. To see the impact of removing stop words we tokenize the example document above, filter it for stop words, and use the join method to make it a string again:

In [8]:
# collect the stopwords
stop_words    = set(stopwords.words('english')) 

# tokenize the words 
tokens  = word_tokenize(doc.replace("\n", " "))

# drop stop words, strip punctuation, lowercase, and discard single characters
import re 
pattern          = re.compile(r'[\W_]+',re.UNICODE)
filtered_tokens  = filter(lambda x : len(x) > 1, (pattern.sub("",token).lower() 
                                                  for token in tokens if token not in stop_words))

print(" ".join((filtered_tokens)))
eres system implements language logic reasoning narratives action occurrences observations semantics modeltheoretic implementation based sound complete reformulation terms argumentation uses general computational techniques argumentation frameworks the system derives sceptical nonmonotonic consequences given reformulated theory exactly correspond consequences entailed modeltheory the computation relies complimentary ability system derive credulous nonmonotonic consequences together set supporting assumptions sufficient credulous conclusion hold eres allows theories contain general action laws statements action occurrences observations statements ramifications universal laws it able derive consequences forward backward time this paper gives short overview theoretical basis eres illustrates use variety examples currently eres extended system used planning

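You may notice that we removed stop words and punctuation and converted the characters to lowercase. A few stop words such as "the", "it" and "this" remain because the stop-word check is case sensitive and runs before the tokens are lowercased.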
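
Here is a minimal sketch of a variant that lowercases each token before checking it against the stop words, so that capitalized stop words are removed too. It uses a plain whitespace split instead of word_tokenize and a small hypothetical stop-word set (`toy_stop_words`) so that it runs without any NLTK data; both are assumptions for illustration only.

```python
import re

# small hypothetical stop-word set; NLTK's English list is much longer
toy_stop_words = {"the", "is", "a", "of", "it", "this", "to", "in", "and"}

# matches runs of non-alphanumeric characters and underscores
pattern = re.compile(r"[\W_]+", re.UNICODE)

def remove_stop_words(text, stop_words):
    """Clean and lowercase each whitespace-separated token, then drop stop words."""
    cleaned = (pattern.sub("", token).lower() for token in text.split())
    return [token for token in cleaned if len(token) > 1 and token not in stop_words]

sentence = "The system derives consequences. It is able to derive consequences in time."
filtered = remove_stop_words(sentence, toy_stop_words)
print(" ".join(filtered))
```

Because the comparison happens after lowercasing, "The" and "It" at the start of each sentence are dropped along with their lowercase forms.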

The CountVectorizer class can also remove stop words through the stop_words argument of its constructor. We could use this approach, but instead let's create our own tokenizer that removes stop words, so that we can add stop words beyond those predefined by Scikit-learn if needed:

In [9]:
class StopWordTokenizer(object):
    """
    StopWordTokenizer tokenizes words and removes stopwords that are 
    passed in through the constructor.
    """
    def __init__(self, stop_words):
        import re
        self.stop_words = stop_words
        self.pattern    = re.compile(r'[\W_]+',re.UNICODE)

    def __call__(self, doc):
        # use the stop words stored on the instance rather than the global variable
        unfiltered_tokens = (self.pattern.sub("",token) for token in word_tokenize(doc.replace("\n", " ")) 
                             if token not in self.stop_words)
        return list(filter(lambda x : len(x) > 1, unfiltered_tokens))

Let's now see the impact this has on our SVC model:

In [10]:
svm_pipe2  = Pipeline([('vect',    CountVectorizer(tokenizer=StopWordTokenizer(stop_words))),
                       ('tfidf',   TfidfTransformer()),
                       ('model',   LinearSVC(class_weight='balanced',

              precision    recall  f1-score   support

          ai       0.88      0.90      0.89       500
          cv       0.94      0.92      0.93       500
          ml       0.86      0.88      0.87       500
          ro       0.90      0.75      0.82        75

    accuracy                           0.89      1575
   macro avg       0.90      0.86      0.88      1575
weighted avg       0.89      0.89      0.89      1575

balanced_accuracy 0.8616666666666667

This gives a slight improvement in the precision for Robotics (0.89 to 0.90), but a slight overall decline in the balanced accuracy (0.864 to 0.862).
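
Balanced accuracy is the unweighted mean of the per-class recall, which is why the 75-document Robotics class influences it as much as each 500-document class. Here is a minimal pure-Python sketch on hypothetical toy labels. The helper `balanced_accuracy` is illustrative only and is not Scikit-learn's implementation.

```python
from collections import Counter

def balanced_accuracy(y_true, y_pred):
    """Average the recall of each class so that every class counts equally."""
    support = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [correct[label] / support[label] for label in support]
    return sum(recalls) / len(recalls)

# hypothetical toy labels: 8 'ml' documents and only 2 'ro' documents
y_true = ["ml"] * 8 + ["ro"] * 2
# every 'ml' document is predicted correctly, but one 'ro' document is missed
y_pred = ["ml"] * 8 + ["ml", "ro"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)

print(plain_accuracy)
print(balanced)
```

A single missed document in the small class barely moves plain accuracy but pulls balanced accuracy down sharply, so small shifts in the Robotics recall show up clearly in this metric.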

Let's visualize the TF-IDF matrix and the top tokens/words as we did in the first post:

In [14]:
from utils.feature_plots import plot_tfidf

plot_tfidf(pipe    = svm_pipe2,
           labeler = labeler,
           X       = train_df["text"],
           y       = train_df["target"],
           vect    = "vect",
           tfidf   = "tfidf",
           top_n   = 25)