2. The Dataset: Creating, Storing and Exploring
3. TF-IDF: Preprocessing & Feature Extraction
5. Imbalanced-Learn: Fixing Imbalanced Data
6. Weighted Support Vector Machines
Natural language processing (NLP) is a hot topic in data science and machine learning. While research in NLP dates back to the 1950s, the real revolution in this domain came in the 1980s and 1990s with the introduction of statistical models and fast computing. Before this, most language processing tasks made use of hand-coded rules, which were generally not very robust.
The span of topics in Natural Language Processing is immense, and I'll just be scratching the surface here with the topic of document classification, also known as text classification. I will be working with the Scikit-learn library and an imbalanced dataset (corpus) that I will create from summaries of papers published on arXiv. The topic of each paper is already labeled as its category, alleviating the need for me to label the dataset. The imbalance in the dataset will be caused by the imbalance in the number of samples in each of the categories we are trying to predict. Imbalanced data occurs quite frequently in classification problems and makes developing a good model more challenging. Oftentimes it is too expensive, or simply not possible, to get more data on the classes that have too few samples. Developing strategies for dealing with imbalanced data is therefore paramount for creating a good classification model. We will cover some of the basics of dealing with imbalanced data using the Imbalanced-Learn library along with the Naive Bayes and Support Vector classifiers from Scikit-learn. In the next post we'll dive more into NLP with the Natural Language Toolkit (NLTK).
Let's dive into the data!
The first thing I wanted to do was create a database to store the data between working sessions on this project. Storing the dataset means that I wouldn't have to re-download the data each time I run the notebook. I chose to create a MongoDB database using Docker and Docker Compose since it's easy to use and I have experience with it and PyMongo.
We can create a client with pymongo and create a new database using the commands:
import pymongo
conn = pymongo.MongoClient('mongodb://mongodb:27017')
db = conn.db_arxiv
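Here `mongodb` is the hostname of the MongoDB service defined in the Docker Compose file. As a quick sanity check (assuming the container is up and listening on port 27017), we can list the databases on the server; the call will raise an error if the connection fails:
# lists the databases on the server; errors out if the connection is broken
print(conn.list_database_names())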
We will use the arxiv Python package to query the arXiv API:
import arxiv
We will focus on only four subtopics of Computer Science in this blog post:
- Artificial Intelligence (cs.AI)
- Machine Learning (cs.LG)
- Computer Vision (cs.CV)
- Robotics (cs.RO)
These topics are somewhat related, but different enough that I expect there to be significant differences in the words used in the summaries of the papers. The only exception is that machine learning and artificial intelligence are used somewhat interchangeably, so I expect some overlap in their words. All in all, I expect pretty good performance from our future model.
Let's query the arXiv API and create a list of dictionaries for each topic. Each dictionary will correspond to a paper on arXiv and will include the link to the paper, its arXiv primary category code, our category label, and the summary text.
Let's collect 2,000 articles each on artificial intelligence, machine learning and computer vision, as well as 300 articles on robotics:
# artificial intelligence
ai_results = [{'link'     : doc['id'],
               'code'     : doc['arxiv_primary_category']['term'],
               'category' : 'ai',
               'text'     : doc['summary']}
              for doc in arxiv.query(query='cat:cs.AI', max_results=2000)]

# machine learning
ml_results = [{'link'     : doc['id'],
               'code'     : doc['arxiv_primary_category']['term'],
               'category' : 'ml',
               'text'     : doc['summary']}
              for doc in arxiv.query(query='cat:cs.LG', max_results=2000)]

# computer vision
cv_results = [{'link'     : doc['id'],
               'code'     : doc['arxiv_primary_category']['term'],
               'category' : 'cv',
               'text'     : doc['summary']}
              for doc in arxiv.query(query='cat:cs.CV', max_results=2000)]

# robotics
ro_results = [{'link'     : doc['id'],
               'code'     : doc['arxiv_primary_category']['term'],
               'category' : 'ro',
               'text'     : doc['summary']}
              for doc in arxiv.query(query='cat:cs.RO', max_results=300)]
We can look at some of the results:
cv_results[89]
We can see that there are some '\n' characters sprinkled throughout the summary text; removing these is actually taken care of by the libraries that we use.
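If we wanted to strip them out manually anyway, a one-liner like the following would do (purely optional, shown here just for illustration):
# replace newline characters in a single summary with spaces
clean_text = cv_results[89]['text'].replace('\n', ' ')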
The four topics we are trying to predict are also called the target classes, or simply the classes. We refer to artificial intelligence, machine learning and computer vision as the "majority classes" and robotics as the "minority class" due to the discrepancies in the number of datapoints. Let's break our dataset into 75% training and 25% testing. We create the training set:
maj_train = 1500
min_train = 225

train = (ai_results[0:maj_train] + ml_results[0:maj_train] +
         cv_results[0:maj_train] + ro_results[0:min_train])
And then insert them into the MongoDB database as the training collection:
# insert them into the training collection
db.train_cs_papers.insert_many(train, ordered=False)
We then do the same for the test set:
test = (ai_results[maj_train:] + ml_results[maj_train:] +
        cv_results[maj_train:] + ro_results[min_train:])
# insert them into the test collection
db.test_cs_papers.insert_many(test, ordered=False)
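As a quick sanity check (optional; the exact numbers depend on how many results the arXiv API actually returned), we can count the documents in each collection:
print("Training documents:", db.train_cs_papers.count_documents({}))
print("Testing documents:", db.test_cs_papers.count_documents({}))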
Now that we have stored the datasets, let's read them back and convert them to Pandas dataframes:
import pandas as pd
# projection for subselecting only `text` and `category` fields
project = {"_id":0,"text":1,"category":1}
# get the training set
train_df = pd.DataFrame(db.train_cs_papers.find({},project))
# get the testing set
test_df = pd.DataFrame(db.test_cs_papers.find({},project))
Let's take a look at what the dataframes look like:
train_df.head()
We can see the total number of papers we have:
train_df.shape[0] + test_df.shape[0]
In order for our model to predict the topic of each summary we need to convert the category into a number. For that we use the Scikit-learn LabelEncoder class. We append a new column to the dataframe called target that will be the numerical value of the class:
from sklearn.preprocessing import LabelEncoder
labeler = LabelEncoder()
train_df["target"] = labeler.fit_transform(train_df["category"])
test_df["target"] = labeler.transform(test_df["category"])
In order to view the human-readable version of the target, we create a dictionary that maps the class labels to their numerical values:
mapping = dict(zip(labeler.classes_, range(len(labeler.classes_))))
print(mapping)
Let's take a look at the number of samples in the training and test set:
train_cts = train_df.groupby("target").size()
test_cts = test_df.groupby("target").size()
import matplotlib.pyplot as plt
%matplotlib inline
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,5), sharey=True)
train_cts.plot(kind='bar', ax=ax1, rot=0)
test_cts.plot(kind='bar', ax=ax2, rot=0)
ax1.set_title('Train Set')
ax2.set_title('Test Set')
ax1.set_ylabel("Counts")
In terms of percentages, the training and test sets have exactly the same distribution of target classes:
percents = 100 * train_df.groupby("target").size() / train_df.shape[0]
percents.plot(kind='bar', title='Target Class Distributions', rot=0)
plt.ylabel("%")
We can see that classes 0, 1 and 2 are pretty even, but class 3 is much smaller than the others. This imbalance in the number of samples per class is what we mean by imbalanced data. We can look at an example article summary and its target value:
print("Category:\n{}\n".format(train_df["category"][5]))
print("Target: {}\n".format(train_df["target"][5]))
print("Text:\n", train_df["text"][5])
Now let's take a look at the most commonly used words in each target class using a word cloud:
from wordcloud import WordCloud, STOPWORDS

def plot_wordcloud(df: pd.DataFrame, category: str, target: int) -> None:
    # join all the summaries belonging to this target class into one string
    words = " ".join(df[df["target"] == target]["text"].values)
    plt.rcParams['figure.figsize'] = 10, 20
    wordcloud = WordCloud(stopwords=STOPWORDS,
                          background_color="white",
                          max_words=1000).generate(words)
    plt.title("WordCloud For {}".format(category))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()

for category, target in mapping.items():
    plot_wordcloud(train_df, category, target)
We can see that some of the most common words in each topic (e.g. "problem", "model", "algorithm") occur frequently across all topics. If we run into issues with our model's performance, we may consider adding these common words to the stop word list.
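If we did decide to go that route, one option (a sketch, assuming we settle on those three extra words) would be to extend Scikit-learn's built-in English stop word list and pass it to the vectorizer we introduce below:
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# hypothetical extra stop words suggested by the word clouds above
extra_stop_words = {"problem", "model", "algorithm"}
custom_stop_words = list(ENGLISH_STOP_WORDS.union(extra_stop_words))

# a vectorizer built this way would simply ignore those words
vectorizer = CountVectorizer(stop_words=custom_stop_words)
For now, though, we'll leave the vectorizer at its defaults.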
Now that we have an idea of what kind of data we are working with, we can start to do some machine learning on it. As with all datasets, there is some required preprocessing before machine learning. With numerical data there needs to be some cleaning and scaling of the features; however, in Natural Language Processing there is a much more substantial preprocessing stage, which we'll go over next.
The first thing we need to go over is the concept of the **bag-of-words model**. In the bag-of-words model, a text (such as a sentence or a document) is represented as a "bag", or list, of its words, disregarding grammar and even word order, but keeping the multiplicity of the words. A two-document example is:
D1: Hi, I am Mike and I like Boston.
D2: Boston is a city and people in Boston like the Red Sox.
From these two documents, a list, or 'bag-of-words', is constructed:
bag = ['Hi', 'I', 'am', 'Mike', 'and', 'like', 'Boston', 'is',
       'a', 'city', 'people', 'in', 'the', 'Red', 'Sox']
Notice how in our bag-of-words we have dropped repetitions of the words 'I', 'and', 'like' and 'Boston'; we will show how the multiplicity of words enters into our model next.
The bag-of-words model is mainly used as a tool of feature generation. After transforming the text into a "bag of words", we can calculate various measures to characterize the document. In order to do so we have to generate a vector for each document that represents the number of times each entry in the bag of words appears in the text. The order of entries in the vector corresponds to the order of the entries in the bag-of-words list. For example, document D1 would have a vector,
[1, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
while the second document, D2, would have the vector,
[0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1]
Each entry of these lists refers to the frequency, or count, of the corresponding entry in the bag-of-words list. When we stack the (row) vectors into a matrix, where each row corresponds to a document (vector) and each column corresponds to a word in the bag-of-words list, the result is known as our term-frequency ($\text{tf}$) document matrix. The general formula for an entry in the $\text{tf}$ matrix is,
$$\text{tf}(t,d) \, = \, f_{t,d}$$
where $f_{t,d}$ is the number of times the term $t$ occurs in document $d \in \mathcal{D}$, and $\mathcal{D}$ is our text corpus. We can create a term-frequency matrix for the above example using Scikit-learn's CountVectorizer class:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = ["Hi, I am Mike and I like Boston.",
"Boston is a city and people in Boston like the Red Sox."]
# get the term frequency
tf = vectorizer.fit_transform(corpus)
# get the words in bag-of-words
print(vectorizer.get_feature_names())
# get the counts!
print(tf.toarray())
The order of the words is different, and CountVectorizer converts everything to lowercase and drops single-letter words and punctuation, but otherwise the counts are the same!
Let's try CountVectorizer on our dataset:
count_vect = CountVectorizer()
X_train_tf = count_vect.fit_transform(train_df["text"])
print("Shape of term-frequency matrix:", X_train_tf.shape)
The term-frequency is a sparse matrix where each row is a document in our training corpus ($\mathcal{D}$) and each column corresponds to a term/word in the bag-of-words list. This can be confirmed by comparing the number of rows in the term-frequency matrix to the number of documents in the training set:
print("Number of training documents: ", train_df.shape[0])
Most often, term-frequency alone is not a good measure of the importance of a word/term to a document's topic. Very common words like "the", "a" and "to" are almost always the terms with the highest frequency in the text. Thus, having a high raw count of the number of times a term appears in a document does not necessarily mean that the corresponding word is more important. Furthermore, longer documents could have high frequencies of terms that do not correlate with the document's topic, but instead occur in high numbers solely due to the length of the document.
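We could check this claim on our own corpus by summing the columns of the term-frequency matrix; a rough sketch (the top terms will vary with the corpus, but we'd expect words like "the" and "of" to dominate):
import numpy as np

# total count of each word across all training documents
word_counts = np.asarray(X_train_tf.sum(axis=0)).ravel()
words = np.array(count_vect.get_feature_names())

# the ten most frequent terms in the corpus
top_idx = word_counts.argsort()[::-1][:10]
print(list(zip(words[top_idx], word_counts[top_idx])))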
To circumvent this limitation of term-frequency, we often normalize it by the inverse document frequency (idf). This results in the term frequency-inverse document frequency (tf-idf) matrix. The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents in the corpus. We can give a formal definition of the inverse document frequency by letting $\mathcal{D}$ be the corpus, or set of all documents, $N_{\mathcal{D}}$ be the number of documents in the corpus, and $N_{t,\mathcal{D}}$ be the number of documents that contain the term $t$. Then,
$$\text{idf}(t,\mathcal{D}) \, = \, \log\left(\frac{N_{\mathcal{D}}}{1 + N_{t,\mathcal{D}}}\right) \, = \, - \log\left(\frac{1 + N_{t,\mathcal{D}}}{N_{\mathcal{D}}}\right)$$
The reason for the presence of the $1$ in the denominator is smoothing. Without it, if a term/word did not appear in any training documents, its inverse document frequency would be $\text{idf}(t,\mathcal{D}) = \infty$. With the $1$ present, it instead takes the finite value $\log(N_{\mathcal{D}})$.
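To get a feel for the formula, here is a small numerical sketch using a hypothetical corpus of 100 documents, with one term appearing in 90 of them and another appearing in only 2:
import numpy as np

N_D = 100                     # documents in a hypothetical corpus
N_common, N_rare = 90, 2      # document frequencies of a common and a rare term

idf_common = np.log(N_D / (1 + N_common))   # log(100/91) ~ 0.09
idf_rare = np.log(N_D / (1 + N_rare))       # log(100/3)  ~ 3.51
print(idf_common, idf_rare)
The common term gets an inverse document frequency close to zero, while the rare term gets a much larger weight.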
Now we can formally define the term frequency-inverse document frequency as a normalized version of the term frequency,
$$\text{tf-idf}(t,d) \, = \, \text{tf}(t,d) \cdot \text{idf}(t,\mathcal{D})$$
Like the term-frequency matrix, the term frequency-inverse document frequency matrix is sparse, where again each row is a document in our training corpus ($\mathcal{D}$) and each column corresponds to a term/word in the bag-of-words list. The $\text{tf-idf}$ matrix can be constructed using the Scikit-learn TfidfTransformer class:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_tf)
X_train_tfidf.shape
We should note that the $\text{tf-idf}$ matrix has the same shape as the $\text{tf}$ matrix, but the two have different values. In either case, the matrices $\text{tf-idf} \in \mathbb{R}^{n \times p}$ and $\text{tf} \in \mathbb{R}^{n \times p}$ have $p \gg n$, i.e. there are more features than datapoints. This is called a high-dimensional problem and causes issues for classifiers, as we will discuss.
Now that we have built our $\text{tf-idf}$ matrix, we can start to look at which terms/words are most associated with document topics and then build a predictive model to classify the documents' topics. We can visualize the most important words for each class by taking the mean of each column in the $\text{tf-idf}$ matrix using the functions developed here. I adapted the code to be faster by using SciPy's built-in sparse matrix methods and Scikit-learn Pipelines (to be discussed later), and to use Plotly instead of Matplotlib:
from utils.feature_plots import plot_tfidf
from sklearn.pipeline import Pipeline
plot_tfidf(pipe=Pipeline([("vect", count_vect), ("tfidf", tfidf_transformer)]),
           labeler=labeler,
           X=train_df["text"],
           y=train_df["target"],
           vect="vect",
           tfidf="tfidf",
           top_n=25)