2. NLTK: Stop Words, Stemming & Lemmatization
3. Hyperparameter Tuning With GridSearchCV
In the last blog post we covered text classification using Scikit-learn and Imbalanced-Learn on summaries of papers from arXiv. We went over the basics of term frequency-inverse document frequency (TF-IDF), Naive Bayes, and Support Vector Machines, and we discussed techniques for handling imbalanced data at both the data level and the algorithm level. In this post we'll pick up where we left off and cover uses of the Natural Language Toolkit (NLTK) and hyperparameter tuning. Specifically, we will look at the effect of stop words, stemming, and lemmatization on the previously mentioned classifiers.
The first thing we need to do is connect to our MongoDB database:
import pymongo
conn = pymongo.MongoClient('mongodb://mongodb:27017')
db = conn.db_arxiv
Then we load the data into Pandas DataFrames again:
import pandas as pd
# projection for subselecting only `text` and `category` fields
project = {"_id":0,"text":1,"category":1}
# get the training set
train_df = pd.DataFrame(db.train_cs_papers.find({},project))
# get the testing set
test_df = pd.DataFrame(db.test_cs_papers.find({},project))
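Since the classes are imbalanced (the focus of the first post), a quick look at the class counts doesn't hurt. This is just an inspection step, not part of the original pipeline:
# quick sanity check on the size and class balance of the two splits
print(train_df.shape, test_df.shape)
print(train_df["category"].value_counts())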
Let's relabel our target variable. We create the mapping between the numeric target and the text category, as well as the binarized y_test matrix needed for one-vs-rest classification when making the ROC and precision/recall curves:
from sklearn.preprocessing import LabelEncoder, label_binarize
import numpy as np
labeler = LabelEncoder()
train_df["target"] = labeler.fit_transform(train_df["category"])
test_df["target"] = labeler.transform(test_df["category"])
# get the number of classes
n_classes = len(train_df["target"].unique())
# classes = [0,1,2,3]
classes = np.sort(train_df["target"].unique())
# relabel the test set
y_test = label_binarize(test_df["target"],
classes=classes)
mapping = dict(zip(labeler.classes_, range(len(labeler.classes_))))
print(mapping)
Let's remind ourselves of where we left off with the modeling by looking at the weighted Support Vector Classifier from the previous post:
from sklearn.feature_extraction.text import (CountVectorizer,
TfidfTransformer)
from sklearn.metrics import balanced_accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
svm_pipe = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('model', LinearSVC(class_weight='balanced',
random_state=50))])
We use our model evaluator function together with functools.partial so that we only have to pass in a different pipeline each time we call it:
from utils.evaluator import evaluate_model
from functools import partial
evaluate_pipeline = partial(evaluate_model,
train_df,
test_df,
mapping)
evaluate_pipeline(svm_pipe)
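The evaluate_model helper comes from the accompanying utils module and isn't shown here. As a rough sketch only (the real implementation may differ), a function with this signature could fit the pipeline on the training text, predict on the test text, and print a classification report along with the balanced accuracy; evaluate_model_sketch below is a hypothetical stand-in:
from sklearn.metrics import classification_report, balanced_accuracy_score
# hypothetical sketch of the evaluate_model helper -- not the actual utils implementation
def evaluate_model_sketch(train_df, test_df, mapping, pipe):
    # fit the full pipeline on the raw training text
    pipe.fit(train_df["text"], train_df["target"])
    # predict on the held-out test text
    y_pred = pipe.predict(test_df["text"])
    # per-class metrics with the readable category names from the mapping
    print(classification_report(test_df["target"], y_pred,
                                target_names=list(mapping.keys())))
    print("balanced accuracy:",
          balanced_accuracy_score(test_df["target"], y_pred))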
The ROC and precision/recall curves for this model can be plotted as follows:
from utils.Plot_ROC_PR_Curve import plot_roc_pr
y_pred = svm_pipe.decision_function(test_df["text"])
plot_roc_pr(y_pred = y_pred, y_test = y_test)
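Like evaluate_model, plot_roc_pr lives in the utils module. Roughly speaking (this is a sketch under the assumption that y_pred holds the per-class decision scores and y_test the binarized labels, not the actual helper), it draws one-vs-rest ROC and precision/recall curves for each class:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, precision_recall_curve, auc
# hypothetical sketch of the plot_roc_pr helper -- not the actual utils implementation
def plot_roc_pr_sketch(y_pred, y_test):
    fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(12, 5))
    for i in range(y_test.shape[1]):
        # ROC curve for class i (one-vs-rest)
        fpr, tpr, _ = roc_curve(y_test[:, i], y_pred[:, i])
        ax_roc.plot(fpr, tpr, label=f"class {i} (AUC = {auc(fpr, tpr):.2f})")
        # precision/recall curve for class i
        prec, rec, _ = precision_recall_curve(y_test[:, i], y_pred[:, i])
        ax_pr.plot(rec, prec, label=f"class {i}")
    ax_roc.plot([0, 1], [0, 1], linestyle="--", color="grey")
    ax_roc.set(xlabel="False positive rate", ylabel="True positive rate", title="ROC curves")
    ax_pr.set(xlabel="Recall", ylabel="Precision", title="Precision/Recall curves")
    ax_roc.legend()
    ax_pr.legend()
    plt.show()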
Now let's improve our models using the Natural Language Toolkit!
We can look to improve our model by removing stop words: common words in the English language that do not add any information to the text, such as "the", "at", "is", etc. Let's look at an example using the Natural Language Toolkit (NLTK). First we grab an example document so we can show the effect that removing stop words has:
# example document
doc = train_df["text"][242]
print(doc)
We import the nltk package and download the data required for stop words, tokenization, and lemmatization:
import nltk
for package in ['stopwords', 'punkt', 'wordnet']:
    nltk.download(package)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
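If you're curious what the English stop-word list actually contains, you can peek at a few entries (the exact contents can vary between NLTK versions, and english_stops here is just a throwaway name):
# peek at a handful of NLTK's English stop words
english_stops = stopwords.words('english')
print(len(english_stops), english_stops[:10])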
Now we collect the stop words as a set called stop_words. To see the impact of removing stop words, we tokenize the example document above, filter out the stop words and punctuation, and use the join method to turn it back into a string:
# collect the stopwords
stop_words = set(stopwords.words('english'))
# tokenize the words
tokens = word_tokenize(doc.replace("\n", " "))
# remove stop words from each line/list
import re
pattern = re.compile(r'[\W_]+', re.UNICODE)
filtered_tokens = filter(lambda x : len(x) > 1, (pattern.sub("",token).lower()
for token in tokens if token not in stop_words))
print(" ".join((filtered_tokens)))
You may notice that we removed stop words and punctuation, as well as converting all characters to lowercase.
The CountVectorizer class can also remove stop words if you pass them to its constructor. We could use that approach, but instead let's create our own tokenizer that removes stop words, so that we can add stop words beyond those predefined by Scikit-Learn if needed:
class StopWordTokenizer(object):
    """
    StopWordTokenizer tokenizes documents and removes the stop words
    that are passed in through the constructor.
    """
    def __init__(self, stop_words):
        self.stop_words = stop_words
        # strip punctuation/underscores from each token
        self.pattern = re.compile(r'[\W_]+', re.UNICODE)

    def __call__(self, doc):
        unfiltered_tokens = (self.pattern.sub("", token)
                             for token in word_tokenize(doc.replace("\n", " "))
                             if token not in self.stop_words)
        # drop empty and single-character tokens
        return list(filter(lambda x: len(x) > 1, unfiltered_tokens))
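As a quick check, we can call the tokenizer on the example document directly. Note that inside CountVectorizer the text is lowercased by the preprocessor before a custom tokenizer runs, so this direct call keeps the original casing:
# apply the custom tokenizer to the example document from earlier
tokenizer = StopWordTokenizer(stop_words)
print(" ".join(tokenizer(doc)))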
Let's now see the impact this has on our SVC model:
svm_pipe2 = Pipeline([('vect', CountVectorizer(tokenizer=StopWordTokenizer(stop_words))),
('tfidf', TfidfTransformer()),
('model', LinearSVC(class_weight='balanced',
random_state=50))])
evaluate_pipeline(svm_pipe2)
This gives an improvement in the precision of Robotics, but an overall slight decline in the balanced accuracy.
Let's visualize the TF-IDF matrix and the top tokens/words as we did in the first post:
from utils.feature_plots import plot_tfidf
plot_tfidf(pipe = svm_pipe2,
labeler = labeler,
X = train_df["text"],
y = train_df["target"],
vect = "vect",
tfidf = "tfidf",
top_n = 25)