Classification Models For Cancer Detection


Contents


1. Introduction

2. Classes

3. Data Cleaning

4. Visualizing The Data

5. Modeling

6. Conclusion

The source code for this project can be found here


Introduction


The purpose of this blog post/notebook is to practice using classification models for cancer detection. We will be working with breast tissue data from the University of California, Irvine (UCI) Machine Learning Repository, which can be found here. The dataset contains 699 patient cell samples, each with the following features:

  1. Sample code number: id number
  2. Clump Thickness: 1 - 10
  3. Uniformity of Cell Size: 1 - 10
  4. Uniformity of Cell Shape: 1 - 10
  5. Marginal Adhesion: 1 - 10
  6. Single Epithelial Cell Size: 1 - 10
  7. Bare Nuclei: 1 - 10
  8. Bland Chromatin: 1 - 10
  9. Normal Nucleoli: 1 - 10
  10. Mitoses: 1 - 10
  11. Class: (2 for benign, 4 for malignant)

We will use features 2-10 to determine the class of the sample cell, i.e. whether the cells are normal or cancerous. We drop the id number since it will not be predictive of the cell class.

The process of developing a model that can classify cancer cells involves the following steps:

  • Cleaning the data.
  • Visualizing the data.
  • Training, testing and evaluating different models for cancer detection.

We will mostly be using the pandas, seaborn and scikit-learn libraries, which we import below:

In [31]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import requests

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import roc_curve
from sklearn.metrics import auc

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV

I built three classes for this project that encapsulate all the methods and functionality we use to analyze the data and develop classification models. Building these classes was also worthwhile because they can easily be adapted for other projects, which will save development time in the future.

The cancer_cell_data class obtains the cancer cell data from the UCI website and cleans it. It stores the cleaned data as a pandas dataframe (df) in a member variable of the cancer_cell_data object. The class also overloads a few of the pandas dataframe methods for convenience.


Classes


In [32]:
class cancer_cell_data(object):
    """This class is used to create an object which will hold the 
    cancer cell data in a pandas data frame.  It will overload some
    features of the dataframe just for convience.  The constructor
    will fetch the data from the url.
    
    self.df - Pandas dataframe."""
    
    col_names = ['ID','thickness', 'unif. size', 'unif. shape',\
                 'adhesion','epithelial size', 'nuclei',\
                 'chromatin', 'nucleoli', 'mitosis',
                 'class']
    
    url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/'
    url+= 'breast-cancer-wisconsin/breast-cancer-wisconsin.data'
    
    def __init__(self):
        """Reads in the data from the url and assigns the column names
        to be names in headers. Also sets the index to be the ID. """
        self.df = pd.read_csv(self.url, names=self.col_names,)
        self.df.set_index('ID', inplace=True)
        self.means = None
        self.stds  = None
        
    def head(self):
        """Returns the first five lines of the data frame."""
        return self.df.head()
    
    def rows(self):
        """Returns the number of rows in the pandas data frame."""
        return self.df.shape[0]
        
    def cols(self):
        """Returns the number of columns in the pandas data frame."""
        return self.df.shape[1]
    
    def describe(self):
        """Returns the descriptive statistics of the dataset without 
        suppressing the NaNs if there are any."""
        return self.df.describe(include='all')
    
    def clean(self):
        """Replace the ? entries with NaNs, print the number of NaNs
        to show there aren't many, and then delete those rows."""
        
        # replace the ? with NaNs
        self.df.replace('?', np.nan, inplace=True)
        
        # convert the nuclei values to numbers since they are
        # actually stored as strings
        self.df['nuclei'] = pd.to_numeric(self.df['nuclei'])
        
        # print the number of occurrences of each value
        print("Number of different values in nuclei column.")
        print(self.df['nuclei'].value_counts(dropna=False))
        self.df.dropna(inplace=True)
        
        # drop the ID's
        self.df.reset_index(inplace=True)
        self.df.drop('ID', axis=1, inplace=True)
        
        # recode the class labels: benign (2) = 0, malignant (4) = 1
        self.df['class'] = (self.df['class'] == 4).astype(int)
                
                
    def scale(self):
        """Normalize the data so that the columns have zero mean
        and unit variance. Stores the column means and stds so
        that you can rescale back."""
        self.means = self.df.mean()
        self.stds  = self.df.std()
        self.df = (self.df - self.means ) / self.stds
        

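As a quick standalone check of the z-score logic in scale(), here is a sketch with a tiny toy frame (hypothetical values, not the real dataset) showing that the stored means and standard deviations let us undo the transformation:

```python
import pandas as pd

# Toy frame standing in for the cleaned cancer data (made-up values).
df = pd.DataFrame({'thickness': [1.0, 5.0, 9.0],
                   'adhesion':  [2.0, 4.0, 6.0]})

# Same logic as cancer_cell_data.scale(): store the column means and
# standard deviations so the transformation can be reversed later.
means, stds = df.mean(), df.std()
scaled = (df - means) / stds

# Each column now has zero mean and unit (sample) standard deviation.
print(scaled.mean().round(10).tolist())  # [0.0, 0.0]
print(scaled.std().round(10).tolist())   # [1.0, 1.0]

# Rescaling back recovers the original values.
recovered = scaled * stds + means
print(recovered.equals(df))              # True
```

Storing the means and standard deviations on the object is what makes the rescaling possible after the fact.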
The data_visualizer class contains the visualization methods that we will use throughout this project. The data_visualizer object does not store any of the data; instead, each time you call a member function you must pass in a cancer_cell_data object containing the cleaned cancer cell data.

In [33]:
class data_visualizer(object):
    ''' This class is just a container for different visualizations
    of the data stored in cancer_cell_data object.'''
    
    def __init__(self):
        """Empty constructor."""
    
    def dist_of_cells(self, data):
        """Makes a histogram of the number of cancer cells and
        normal cells."""
        plt.figure(figsize=(6,4))
        data.df['class'].hist()
        plt.ylabel('Number of Targets',fontsize=12)
        plt.xlabel('Class of cell',fontsize=12)
        plt.title('Histogram of number of cells that are cancerous',
                  fontsize=12)
        
    def corr(self, data):
        """Makes a heatmap of the correlation matrix."""
        sns.heatmap(data.df.corr())
        
    def dist_of_features(self, data):
        """Plots the histograms and kernel density
        estimates of the different features."""
      
        sns.set(style="white", palette="muted", color_codes=True)
       
        # Set up the matplotlib figure
        f, axes = plt.subplots(3,3, figsize=(7, 7), sharex=True)
        sns.despine(left=True)
        
        # get the column names
        col_names = data.df.columns.tolist()
        
        # loop through the features and map them to a place in the 
        # figure
        for i in range(0, 9):
            row = i // 3
            col = i % 3
            sns.distplot(data.df[col_names[i]],
                         kde=True, color="b", ax=axes[row, col])
        # drop the y-axis label
        plt.setp(axes, yticks=[])
        plt.tight_layout()
        
    def feature_violin(self, data):
        """Produces violin plot of the features with respect to 
        the cell class."""
        for i, col_name in enumerate(data.df.columns[:-1]):
            plt.figure(i)
            sns.factorplot(x="class",col=col_name,
                           data=data.df,kind="violin")
            

The Classifier class contains all the classification methods that we will use to develop predictive models for cancer cell detection. A Classifier object is instantiated by passing the cancer_cell_data object through the constructor. The constructor then breaks up the data from the cancer_cell_data object into training and test set data and stores them as member variables of the Classifier object. We do this initially so each classifier method uses the same training and test set data.
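The reason for splitting once up front is reproducibility: every model is trained and evaluated on identical rows. What a fixed random seed buys you (the role played by random_state=1 in scikit-learn's train_test_split, with its default 25% test fraction) can be sketched with the standard library alone; the function below is a hypothetical illustration, not the notebook's API:

```python
import random

def split_indices(n, test_frac=0.25, seed=1):
    """Shuffle row indices with a fixed seed and split off a test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_frac)
    return idx[n_test:], idx[:n_test]  # train, test

# Two calls with the same seed produce identical splits, so every
# classifier built later is scored on the same held-out rows.
train_a, test_a = split_indices(683)
train_b, test_b = split_indices(683)

print(train_a == train_b and test_a == test_b)  # True
print(len(test_a))                              # 170
```

This is why the split lives in the constructor-adjacent train_test_split method rather than inside each modeling method.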

In [34]:
class Classifier(object):
    """This class aggregrates classifier models as well as general
    functions for measuring the performance of the classifiers."""
    
    def __init__(self, data):
        """Sets the features and targets of the data set."""
        self.X = data.df[data.df.columns[0 : data.df.shape[1]-1]]
        self.Y = data.df[data.df.columns[data.df.shape[1]-1]]
        self.feature_names = data.df.columns[0:data.df.shape[1]-1]
        self.X_train = None
        self.X_test = None
        self.Y_train = None
        self.Y_test = None
        self.reg = None
        self.Y_pred = None
        
    def train_test_split(self):
        """Splits the features and target in to training and testing
        set."""
        self.X_train, self.X_test, self.Y_train, self.Y_test = \
                    train_test_split(self.X ,self.Y, #test_size=0.3,
                                     random_state=1)
            
    def logistic_reg_model(self):
        """Use logistic regression to model the data."""
        self.reg = LogisticRegression()
        # perform 10-fold cross validation and output the score.
        scores = cross_val_score(self.reg, self.X_train, 
                                 self.Y_train, cv=10)
        
        print("Cross val. accuracy = " + str(scores.mean()) + '\n')
        
        # fit the model to the data
        self.reg.fit(self.X_train, self.Y_train)
        
        # print the coefficients for the model features
        print("Coefficient prediction strength")
        coef_prob = [(self.X.columns[i], self.reg.coef_[0][i]) 
                     for i in range(len(self.reg.coef_[0]))]
        
        for coef in coef_prob:
            print(coef)
            
    def logistic_reg_L1_penalty(self):
        """Use logistic regression with L1 regularization
        to model the data."""
        
        # perform 10-fold cross validation with a grid search
        # over the L1 penalty strength and output the score.
        self.reg = LogisticRegressionCV(cv=10, 
                                        penalty='l1',
                                        solver='liblinear')
        
        # fit the model to the data
        self.reg.fit(self.X_train, self.Y_train)
        
        print("Cross val. accuracy = " +
              str(self.reg.score(self.X_train, self.Y_train)) + '\n')
        
        # print the coefficients for the model features
        print("Coefficient prediction strength")
        coef_prob = [(self.X.columns[i], self.reg.coef_[0][i]) 
                     for i in range(len(self.reg.coef_[0]))]
        
        for coef in coef_prob:
            print(coef)
            
        
    def get_test_stats(self):
        # get the test accuracy
        self.Y_pred = self.reg.predict(self.X_test)
        
        print("Test accuracy = " + str(
                accuracy_score(self.Y_test, self.Y_pred)))
        
        precision, recall, f1, other = precision_recall_fscore_support(
                                            self.Y_test, self.Y_pred,
                                            average='binary')
        
        print("Test Precision = " + str(precision))
        print("Test Recall = " + str(recall))
        print("Test F1 Score = " + str(f1))
        
        
    def ROC(self):
        plt.rcParams['figure.figsize'] = (8, 6)
        plt.rcParams['font.size'] = 14
        
        # use the predicted probabilities rather than hard labels
        # so that the ROC curve has more than one operating point
        Y_score = self.reg.predict_proba(self.X_test)[:, 1]
        fpr, tpr, thresholds = roc_curve(self.Y_test, Y_score)
        roc_auc = auc(fpr, tpr)
    
        # plot the ROC curve
        plt.title('Receiver Operating Characteristic')
        plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
        plt.legend(loc='lower right')
        plt.plot([0, 1], [0, 1], 'r--')
        plt.xlim([0, 1])
        plt.ylim([0, 1])
        plt.ylabel('True Positive Rate')
        plt.xlabel('False Positive Rate')
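The ROC machinery above can be sanity-checked by hand. For a handful of scored examples (made-up scores, not output of the notebook's models), sweeping the threshold from high to low and integrating with the trapezoid rule reproduces what roc_curve and auc report:

```python
def roc_auc_by_hand(y_true, scores):
    """Compute ROC points by sweeping thresholds, then trapezoid AUC.
    Assumes distinct scores (tied scores need grouped steps)."""
    pairs = sorted(zip(scores, y_true), reverse=True)
    P = sum(y_true)              # number of positives
    N = len(y_true) - P          # number of negatives
    tps = fps = 0
    points = [(0.0, 0.0)]        # (false pos. rate, true pos. rate)
    for score, label in pairs:
        if label == 1:
            tps += 1
        else:
            fps += 1
        points.append((fps / N, tps / P))
    # trapezoid rule over the ROC points
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# A perfect ranking (all positives scored above all negatives) gives
# AUC = 1; the worst possible ranking gives AUC = 0.
print(roc_auc_by_hand([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # 1.0
print(roc_auc_by_hand([0, 0, 1, 1], [0.9, 0.8, 0.3, 0.1]))  # 0.0
```

This also shows why the ROC needs continuous scores: with only hard 0/1 predictions there is a single threshold, and the "curve" collapses to one point joined to the corners.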

Cleaning


Now we can get to the process of data wrangling and cleaning. We first instantiate a cancer_cell_data object, whose constructor fetches the data and sets the id number column as the index.

In [35]:
data = cancer_cell_data()

Let's first see what we can learn about the 10 columns of our dataset. The info method reports, for each column, its name, the number of non-null rows, and its datatype:

In [36]:
data.df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 699 entries, 1000025 to 897471
Data columns (total 10 columns):
thickness          699 non-null int64
unif. size         699 non-null int64
unif. shape        699 non-null int64
adhesion           699 non-null int64
epithelial size    699 non-null int64
nuclei             699 non-null object
chromatin          699 non-null int64
nucleoli           699 non-null int64
mitosis            699 non-null int64
class              699 non-null int64
dtypes: int64(9), object(1)
memory usage: 60.1+ KB

We see that, as discussed in the introduction, the features are mostly integers, except for the nuclei column, which has dtype object. This is a little strange, since feature values are supposed to be either integers or real numbers. Let's see what else we can learn about the features by using the pandas dataframe describe method, which provides descriptive statistics for each of the feature columns:

In [37]:
data.describe()
Out[37]:
thickness unif. size unif. shape adhesion epithelial size nuclei chromatin nucleoli mitosis class
count 699.000000 699.000000 699.000000 699.000000 699.000000 699 699.000000 699.000000 699.000000 699.000000
unique NaN NaN NaN NaN NaN 11 NaN NaN NaN NaN
top NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN
freq NaN NaN NaN NaN NaN 402 NaN NaN NaN NaN
mean 4.417740 3.134478 3.207439 2.806867 3.216023 NaN 3.437768 2.866953 1.589413 2.689557
std 2.815741 3.051459 2.971913 2.855379 2.214300 NaN 2.438364 3.053634 1.715078 0.951273
min 1.000000 1.000000 1.000000 1.000000 1.000000 NaN 1.000000 1.000000 1.000000 2.000000
25% 2.000000 1.000000 1.000000 1.000000 2.000000 NaN 2.000000 1.000000 1.000000 2.000000
50% 4.000000 1.000000 1.000000 1.000000 2.000000 NaN 3.000000 1.000000 1.000000 2.000000
75% 6.000000 5.000000 5.000000 4.000000 4.000000 NaN 5.000000 4.000000 1.000000 4.000000
max 10.000000 10.000000 10.000000 10.000000 10.000000 NaN 10.000000 10.000000 10.000000 4.000000
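Most of the NaNs in this table are an artifact of calling describe(include='all') on a frame with mixed dtypes: object columns get the unique/top/freq statistics, numeric columns get mean/std/quantiles, and each side shows NaN for the other's rows. A toy illustration with made-up values:

```python
import pandas as pd

# One numeric column and one object column, like our nuclei column.
df = pd.DataFrame({'numeric': [1, 2, 3],
                   'stringy': ['1', '?', '1']})

desc = df.describe(include='all')

# The object column has no mean; the numeric column has no 'unique'.
print(pd.isna(desc.loc['mean', 'stringy']))    # True
print(pd.isna(desc.loc['unique', 'numeric']))  # True
print(desc.loc['top', 'stringy'])              # '1' (most frequent value)
```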

The nuclei column is the odd one out, since all of its numeric statistics are NaNs, so let's look at its unique values:

In [38]:
data.df.nuclei.unique()
Out[38]:
array(['1', '10', '2', '4', '3', '9', '7', '?', '5', '8', '6'], dtype=object)

You can see there is a '?' character in it! Let's get rid of that using the clean function, which converts each '?' to NaN. The other values in this column are actually stored as strings, so it also converts them to their numerical values. Lastly, it recodes the class column values from 2 and 4 to 0 and 1 to represent normal and cancerous cells. Let's clean the data and look at the counts of the different values in the nuclei column to see how many NaNs there are:

In [39]:
data.clean()
Number of different values in nuclei column.
 1.0     402
 10.0    132
 5.0      30
 2.0      30
 3.0      28
 8.0      21
 4.0      19
NaN       16
 9.0       9
 7.0       8
 6.0       4
Name: nuclei, dtype: int64
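The cleaning steps that just ran can be reproduced on a toy frame (hypothetical values) to show exactly what each call does:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'nuclei': ['1', '?', '10', '3'],
                   'class':  [2, 4, 4, 2]})

# 1. Replace the '?' placeholder with a real missing value.
df.replace('?', np.nan, inplace=True)

# 2. The remaining nuclei entries are strings, so coerce to numbers.
df['nuclei'] = pd.to_numeric(df['nuclei'])

# 3. Drop the rows that are missing data.
df.dropna(inplace=True)

# 4. Recode the class labels: benign 2 -> 0, malignant 4 -> 1.
df['class'] = (df['class'] == 4).astype(int)

print(df['nuclei'].tolist())  # [1.0, 10.0, 3.0]
print(df['class'].tolist())   # [0, 1, 0]
```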

Since the number of NaNs (16 rows, a little over 2% of the 699 samples) is small compared to the total amount of data, it's probably safe to just drop those rows. The dropping was already performed by the clean function call above, so there is no need to do it now. Let's see if we can get a better picture of the data set by calling the describe function again:

In [43]:
data.describe()
Out[43]:
thickness unif. size unif. shape adhesion epithelial size nuclei chromatin nucleoli mitosis class
count 683.000000 683.000000 683.000000 683.000000 683.000000 683.000000 683.000000 683.000000 683.000000 683.000000
mean 4.442167 3.150805 3.215227 2.830161 3.234261 3.544656 3.445095 2.869693 1.603221 0.349927
std 2.820761 3.065145 2.988581 2.864562 2.223085 3.643857 2.449697 3.052666 1.732674 0.477296
min 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000
25% 2.000000 1.000000 1.000000 1.000000 2.000000 1.000000 2.000000 1.000000 1.000000 0.000000
50% 4.000000 1.000000 1.000000 1.000000 2.000000 1.000000 3.000000 1.000000 1.000000 0.000000
75% 6.000000 5.000000 5.000000 4.000000 4.000000 6.000000 5.000000 4.000000 1.000000 1.000000
max 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000 1.000000

It seems now that all our rows and columns have values within the ranges described in the introduction and we are happy that our data is cleaned and in the correct format. We can now move on to see what we can learn from the dataset through visualization.


Visualization


Let's instantiate our visualization object:

In [24]:
viz = data_visualizer()

Let's get a basic histogram of how many samples are cancerous and how many are not, remembering that cells with class 0 are normal and cells with class 1 are cancerous:

In [25]:
viz.dist_of_cells(data)

Now let's look at the underlying distributions of the features within each class:

In [26]:
viz.feature_violin(data)
<matplotlib.figure.Figure at 0x1143f6810>