
Exploring And Forecasting Crime Rates In NYC


Introduction


This post is about the project I completed at Insight Data Science. The goal of the project was to develop a data-driven strategy for reducing crime in New York City using historical data provided by the NYC Open Data website. Analyzing the crime data, it became evident that different types of crimes affect different neighborhoods at different times of the year. I believe that with a predictive model of monthly crime rates at the neighborhood level, police will be able to distribute their resources to the right neighborhoods at the appropriate time, and hopefully this will lead to a reduction in crime rates. This led me to create a web application, CrimeTime, where users can perform a similar analysis for their neighborhood anywhere in New York City.

The purpose of this post is therefore two-fold:

  1. Discuss the analysis of the historical crime data as well as the predictive models to forecast crime rates in the future.
  2. Show how to use the tools I developed; this will be helpful to those trying to extend the codebase of CrimeTime.online

Analysis and modeling

In this post I will cover a few topics that involve learning from data and developing a predictive model for crime rates. Specifically, I will:

  • Perform exploratory analysis on temporal and geospatial crime data to find where and when specific crimes happen the most.

We will see that different crimes affect different areas of Manhattan and that monthly crime rates peak at different points of the year depending on the type of crime. We then,

  • Develop a time series model that uses historical data to predict the monthly number of crimes in the two neighborhoods with the highest crime rates in Manhattan.

Once we can predict future monthly crime rates on a neighborhood level, police can distribute their resources to the right neighborhoods at the most appropriate time.

About the web application

Throughout this post I'll be using the classes and methods I developed for the web application CrimeTime. Taking an object-oriented approach was useful because it gave me the flexibility to extend the code easily and to reuse it in a different environment, like this post. The web application was written in Python with Flask and was deployed to Amazon Web Services. Users are prompted to enter an address, and I use the geopy library to get the latitude and longitude of that address. Once the latitude and longitude are known, I use the shapely library to find out which police precinct the address falls in and obtain the data for that precinct. A sketch of this lookup is shown below.
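For concreteness, here is a minimal sketch of that address-to-precinct lookup, assuming the Nominatim geocoder from geopy and the Manhattan precinct GeoJSON file used for the maps later in this post (the 'Precinct' property name comes from that file). It illustrates the pattern, not the actual CrimeTime code:

import json
from geopy.geocoders import Nominatim
from shapely.geometry import Point, shape

# Geocode the user-supplied address (Nominatim is one of several geocoders
# geopy exposes; the user_agent string is just a label for the request)
geolocator = Nominatim(user_agent="crime_time_example")
location = geolocator.geocode("162 E 102nd St, Manhattan, NY")
point = Point(location.longitude, location.latitude)  # shapely uses (x, y) = (lon, lat)

# Load the precinct polygons; the file path and the 'Precinct' property name
# come from the GeoJSON file used for the choropleth maps below
with open("./manhattan_precincts.geojson") as f:
    features = json.load(f)["features"]

# Report the precinct whose polygon contains the geocoded point
for feature in features:
    if shape(feature["geometry"]).contains(point):
        print("Precinct:", feature["properties"]["Precinct"])
        break

A point-in-polygon test like this is fast enough for a web request since Manhattan only has a couple dozen precinct polygons.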

The information on police precincts was obtained by scraping the NYPD's website with the BeautifulSoup library, along with this specific database on the NYC Open Data website. The crime data itself came from the NYC Open Data website and was cleaned using Pandas and GeoPandas, then stored in a SQLite database. Forecasted crime rates were predicted with a seasonal ARIMA model via the Python library StatsModels, using a grid search to find the model parameters that minimize the validation error.

The source code for this project can be found here, along with instructions on how to build a local copy of the Sphinx-based documentation.


Exploratory Data Analysis


After downloading the crime data (in CSV format) from the NYC Open Data website and cleaning it, I stored it in a SQLite database. I put all of these steps into a method of the PreProcessor class from CrimeTime (in the backend directory). To run the application on your local machine, all you have to do is pass the PreProcessor object the location and name of the database you want to create during instantiation:

In [1]:
import sys
import warnings
warnings.filterwarnings('ignore')
sys.path.append("../backend/")
In [3]:
from PreProcessor import PreProcessor 

# Instantiate the preprocessor object with location and name of database
PP = PreProcessor("../data/CrimeTime.db")

The actual CSV file should be located in the same directory as the database ("../data/"). We can clean this file and create the database called "CrimeTime.db", with the crime data stored in a table called "NYC_Crime", by running:

In [2]:
PP.make_NYC_Crime_database("NYC_Crime")
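To give a sense of what this method does, here is a minimal sketch of the clean-and-load step using Pandas and sqlite3. The raw CSV filename and the raw column names are assumptions for illustration, not the actual NYC Open Data schema; only the cleaned column names match the dataframes shown later in this post:

import sqlite3
import pandas as pd

# Illustrative only: the raw filename and the raw column names referenced on
# the right-hand side below are assumptions, not the real NYC Open Data schema
raw = pd.read_csv("../data/nyc_crime_raw.csv")
raw["DATE"] = pd.to_datetime(raw["Occurrence Date"], errors="coerce")
raw = raw.dropna(subset=["DATE", "Latitude", "Longitude"])

# Keep only the fields used later in this post, in the cleaned schema
crime = pd.DataFrame({
    "DATE": raw["DATE"],
    "WEEKDAY": raw["DATE"].dt.strftime("%A"),
    "MONTH": raw["DATE"].dt.strftime("%b"),
    "YEAR": raw["DATE"].dt.year,
    "OFFENSE": raw["Offense"],
    "PRECINCT": raw["Precinct"],
    "BOROUGH": raw["Borough"],
    "LATITUDE": raw["Latitude"],
    "LONGITUDE": raw["Longitude"],
})

# Write the cleaned frame into the SQLite database used throughout the post
conn = sqlite3.connect("../data/CrimeTime.db")
crime.to_sql("NYC_Crime", conn, if_exists="replace", index=False)
conn.close()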

The preprocessor object can also scrape the NYPD's website to obtain the address and telephone number of each precinct's police station and store them as a separate table in the "CrimeTime.db" database. This is done through the command:

In [3]:
PP.make_NYC_precinct_database()
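The scraping itself follows the usual requests + BeautifulSoup pattern. The sketch below is illustrative only: the URL and the table markup are placeholders, since the actual NYPD page layout may differ:

import requests
from bs4 import BeautifulSoup

# Placeholder URL and markup: the actual NYPD precinct page has its own
# layout, so treat this purely as a sketch of the requests + BeautifulSoup pattern
url = "https://www1.nyc.gov/site/nypd/bureaus/patrol/precincts-landing.page"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

precinct_info = []
for row in soup.find_all("tr"):
    # Assume each precinct is a table row with name, address, and phone cells
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) >= 3:
        precinct_info.append({"precinct": cells[0],
                              "address": cells[1],
                              "phone": cells[2]})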

Note: If NYC Open Data no longer provides the crime data, email me and I can send you the database; you will then no longer need to run the above commands to build it.

Now that our data is cleaned, we can access it easily with the sqlite3 library and read it into a Pandas dataframe. We obtain all the crime data for Manhattan with the commands:

In [2]:
import sqlite3
import pandas as pd

# Connect to the database
conn = sqlite3.connect('../data/CrimeTime.db')

# SQL Command
sql_command = 'SELECT * FROM NYC_Crime WHERE BOROUGH = \'MANHATTAN\''

# Now query the database using the above command
crime_df = pd.read_sql(sql_command, conn)

# close the connection to the database
conn.close()

You'll notice I only have data from 2006 to 2015. The data for 2016 was not available at the time, and the data before 2006 is rather sparse, so I decided to neglect it. Let's take a look at how crime in Manhattan has evolved over the last 10 years.

We will plot the yearly crime data, but first we must group all the crimes by their year. We do this in Pandas with the command:

In [3]:
CRIMES_PER_YEAR = crime_df.groupby('YEAR').size()

This creates a Pandas Series, where the index is the year and the values are the number of crimes that occurred in that year. We can plot this yearly data directly using its .plot() method, which calls the Matplotlib library under the hood:

In [4]:
import matplotlib.pyplot as plt
%matplotlib inline
fig = plt.figure(figsize=(6, 3))

CRIMES_PER_YEAR.plot(title='Number of crimes per year all of Manhattan',
                     fontsize=13)

plt.xlabel('Year',fontsize=13)
plt.ylabel('Number of Crimes',fontsize=13)
plt.ticklabel_format(useOffset=False)

One thing to notice is that crime in the last 10 years has been going down!

Let's just look at the crime data for larceny and assault. We can build new dataframes for these specific crimes with the commands:

In [5]:
LARCENY = crime_df[crime_df.OFFENSE == 'GRAND LARCENY']
ASSAULT = crime_df[crime_df.OFFENSE == 'FELONY ASSAULT']

In order to visualize the geospatial distribution of crimes in Manhattan, let's break down our data into the number of crimes in 2015 that occur within each precinct. We can do this in Pandas using the groupby command:

In [6]:
CRIMES_2015 = ASSAULT[ASSAULT.YEAR == 2015]
CRIMES_2015 = CRIMES_2015.groupby('PRECINCT').size()

Now that we have the number of crimes per precinct and the precinct geometries (note: I downloaded them as a GeoJSON file and edited out non-Manhattan police precincts by hand), we can visualize the number of assaults per precinct using Folium. Folium is a Python package with very nice geospatial visualization capabilities. Using it, we can read in the GeoJSON file directly and then plot the number of crimes per precinct using the choropleth function. Let's visualize the number of assaults in 2015:

In [7]:
import folium

# Central lat/long values of NYC
nyc_coor = [40.81,-73.9999]

# instantiate a folium map object with the above coordinate at center
nyc_crime_map = folium.Map(location=nyc_coor,zoom_start=11)

# the path to the geojson file of the manhattan precincts
pathgeo = './manhattan_precincts.geojson'

# make the choropleth map
nyc_crime_map.choropleth(geo_data=pathgeo,
                        data=CRIMES_2015,
                        key_on='feature.properties.Precinct',
                        fill_color='BuPu', 
                        fill_opacity=0.7, 
                        line_opacity=0.2,
                        legend_name='MANHATTAN')
# show the map
nyc_crime_map
Out[7]:
[Choropleth map of the number of assaults per Manhattan precinct in 2015]

We can look at the other crimes too if we want; for the purposes of this post we'll just look at larceny:

In [8]:
CRIMES_2015 = LARCENY[LARCENY.YEAR == 2015]
CRIMES_2015 = CRIMES_2015.groupby('PRECINCT').size()

# instantiate the object
nyc_crime_map = folium.Map(location=nyc_coor,zoom_start=11)

# path to geojson precinct file
pathgeo = './manhattan_precincts.geojson'

# make the map
nyc_crime_map.choropleth(geo_data=pathgeo,
                        data=CRIMES_2015,
                        key_on='feature.properties.Precinct',
                        fill_color='BuPu',
                        fill_opacity=0.7,
                        line_opacity=0.2,
                        legend_name='MANHATTAN')
# show the map
nyc_crime_map
Out[8]:
[Choropleth map of the number of larcenies per Manhattan precinct in 2015]

We can see that larceny appears to be highest in Midtown, while assault appears to be highest in East Harlem. We'll focus on these two neighborhoods, analyzing larceny in Midtown and assault in East Harlem, to see what we can learn from the historical data to help us predict crimes in the future. Once we are able to predict crime rates in different parts of the city, police might be able to redistribute their resources to the appropriate place and time of year.

We'll be using the "CrimeMapper" class from CrimeTime (in the backend directory) to house our data and make our graphics.

In [9]:
from CrimeMapper import CrimeMapper

Now we instantiate the CrimeMapper object with the boolean False to declare that we are not in production mode (production mode is used when running through the Flask app). We can then find the precinct for our address using the find_precinct(address) method:

In [10]:
EastHarlem = CrimeMapper(production_mode=False)
EastHarlem.find_precinct("162 E 102nd St, Manhattan, NY")

We can now get all the crime data for this precinct and the following crimes:

  • Assault
  • Larceny
  • Robbery
  • Burglary
  • Car Theft

We find the crime data for assault for just our precinct and store it in a pandas dataframe titled crime_df in the CrimeMapper object:

In [11]:
EastHarlem.get_crime_data("Assault")
EastHarlem.crime_df.head()
Out[11]:
DATE WEEKDAY MONTH DAY YEAR HOUR OFFENSE PRECINCT BOROUGH LATITUDE LONGITUDE
0 01/09/2006 01:50:00 AM Monday Jan 9.0 2006 1.0 FELONY ASSAULT 23 MANHATTAN 40.791261 -73.947968
1 01/09/2006 06:26:00 PM Monday Jan 9.0 2006 18.0 FELONY ASSAULT 23 MANHATTAN 40.795695 -73.938737
2 01/10/2006 03:30:00 PM Tuesday Jan 10.0 2006 15.0 FELONY ASSAULT 23 MANHATTAN 40.793842 -73.944070
3 02/11/2006 01:45:00 AM Saturday Feb 11.0 2006 1.0 FELONY ASSAULT 23 MANHATTAN 40.789959 -73.945857
4 02/11/2006 02:20:00 AM Saturday Feb 11.0 2006 2.0 FELONY ASSAULT 23 MANHATTAN 40.797300 -73.948928

The dataframe crime_df holds the raw incident-level assault data. We can turn it into a time series that records the monthly number of assaults in this precinct by using the make_time_series() method:

In [12]:
EastHarlem.make_time_series()
EastHarlem.ts.head()
Out[12]:
Crimes
Date
2006-01-31 18
2006-02-28 11
2006-03-31 13
2006-04-30 11
2006-05-31 15

Note: We actually still store this time series as a Pandas DataFrame rather than as a Pandas Series.
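For reference, here is a minimal sketch of how such monthly counts can be produced with Pandas resampling; this illustrates the idea, not necessarily the exact code inside make_time_series():

import pandas as pd

# Count incidents per calendar month; crime_df is the per-precinct assault
# dataframe shown above and DATE is its timestamp column
monthly = (EastHarlem.crime_df
             .assign(Date=pd.to_datetime(EastHarlem.crime_df["DATE"]))
             .set_index("Date")
             .resample("M")
             .size()
             .to_frame("Crimes"))
monthly.head()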

We can now decompose our time series data into its trend and seasonal components using the plot_decompose() method. Under the hood of the CrimeMapper class this calls the StatsModels seasonal_decompose() function with a frequency of 12, i.e., a yearly seasonality.
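Roughly, that call looks like the following sketch (written against recent StatsModels, where the keyword is period rather than freq):

from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

# Decompose the monthly counts into trend, seasonal, and residual components.
# period=12 encodes the yearly seasonality (older StatsModels releases call
# this keyword freq instead of period).
decomposition = seasonal_decompose(EastHarlem.ts["Crimes"], model="additive", period=12)
decomposition.plot()
plt.show()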

In [13]:
EastHarlem.plot_decompose()

The trend of monthly assaults in East Harlem has been steadily rising, around 4%-5% each year on average, but shot up to around a 10% increase last year. The seasonality tells us when assaults are at their highest and lowest in the year. We can see that assaults in East Harlem peak in the summer and are at their lowest at the beginning of the year. This makes sense, since these are the times of year when people are most and least likely to be outside, respectively.

Let's repeat the same process for Midtown:

In [14]:
# Instantiate the Midtown CrimeMapper object
Midtown = CrimeMapper(production_mode=False)

# Find the Midtown south precinct
Midtown.find_precinct("357 West 35th Street, New York, NY")

# Get the crime data for Larceny in this precinct
Midtown.get_crime_data("Larceny")

# Make the data into a time series
Midtown.make_time_series()

# Decompose the time series into trend and seasonality and plot it
Midtown.plot_decompose()

We can see that the overall number of crimes involving larceny has been steadily decreasing over the years. We can also see that larceny in Midtown is very seasonal: it is lowest at the beginning of the year and highest at the end of the year. Sadly, this makes some sense; since larceny includes shoplifting, many people might be stealing presents for the holidays.

Since the two crimes in the two neighborhoods are out of phase with one another (occur at different times of the year), it may be possible to redirect police efforts to the different areas depending on the time of year. Let's develop a predictive model to estimate the number of monthly crimes involving grand larceny in Midtown and the number of assaults in East Harlem using the time series data. Being able to predict the number of incidents on a month-to-month basis would be useful for police as they might be able to better prepare for increases and decreases in crime within the city and reallocate resources to specific neighborhoods at the appropriate time of year.


TIME SERIES ANALYSIS


There are a lot of great resources on time series analysis. For this project I used the following blogs:

A tutorial on time series models is a good place to start, but it's in R, so I turned to Time series forecasting in python to learn how to do time series analysis in Python. Another good resource is Tom Augspurger's blog. Finally, for seasonal ARIMA I found the following blog post very helpful.

I won't go over the details of modeling with ARIMA, but more information can be found in the links above. It's worth noting that R has much better time series capabilities than Python; however, one can still use the StatsModels library to perform ARIMA modeling in Python.

ARIMA: Autoregressive Integrated Moving Average

ARIMA is a standard technique for modeling and forecasting in time series analysis. The general model for a stationary process is given by the formula,

\begin{equation} C_{T} = \mu + \sum_{i=1}^{p} \gamma_{i} C_{T-i} +\sum_{i=0}^{q} \theta_{i} \epsilon_{T-i}. \end{equation}

Here $C_{T}$ is related to the number of crimes that occurred in month $T$ and $\epsilon_{T}$ is a normally distributed random error. We can see that ARIMA balances a linear combination of previous months' crime counts against a linear combination of this month's and previous months' random errors. An ARIMA model, ARIMA($p,d,q$), has three parameters: $p$, $d$, and $q$. $p$ and $q$ correspond to the maximum number of previous months' crimes and random errors that we incorporate into our model. The number $d$ corresponds to the number of differences of the monthly number of crimes we have to take before our data is stationary.

Seasonal ARIMA

Seasonal ARIMA is a little more complicated than the standard ARIMA model. It takes the seasonality of the data into account and fits a model labeled SARIMA$(P,D,Q)_{m}$, where $m$ is the seasonality; in our case it will be 12, since the seasonality is yearly. The seasonal ARIMA model therefore has seven parameters: $(p,d,q)$ for the non-seasonal component of the model and $(P,D,Q)_{m}$ for the seasonal component.
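In StatsModels these seven parameters map onto the SARIMAX interface as a non-seasonal order tuple and a seasonal order tuple. The following is a minimal illustrative fit with placeholder parameter values, not the call made inside my forecasting class:

import statsmodels.api as sm

# Non-seasonal order (p, d, q) and seasonal order (P, D, Q, m), with m = 12
# months. The values below are placeholders; the ones actually used are chosen
# by the grid search described below.
model = sm.tsa.statespace.SARIMAX(EastHarlem.ts["Crimes"],
                                  order=(1, 1, 1),
                                  seasonal_order=(1, 1, 1, 12),
                                  enforce_stationarity=True,
                                  enforce_invertibility=True)
results = model.fit(disp=False)
print(results.aic)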

The seasonal ARIMA class I wrote, Seasonal_ARIMA, uses the SARIMAX model from the StatsModels library. We import the Seasonal_ARIMA class from the Forecasting file (in the backend directory):

In [15]:
from Forecasting import Seasonal_ARIMA

We perform some preliminary analysis on the assault data using the Dickey-Fuller test in StatsModels. The Dickey-Fuller test is a test for stationarity in time series data; stationarity means that our time series has constant mean and constant variance.

Our analysis involves taking differences and testing whether those differences transform the data into a stationary series. We first instantiate a model on the East Harlem time series and then use the Dickey-Fuller test to check whether the first difference and the seasonal first difference are stationary:
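The kind of check these two methods perform can also be reproduced directly with adfuller from StatsModels. The sketch below assumes the seasonal first difference is the first difference additionally differenced at lag 12, which is one common convention:

from statsmodels.tsa.stattools import adfuller

ts = EastHarlem.ts["Crimes"]

# First difference: month-over-month change in the number of assaults
first_diff = ts.diff(1).dropna()

# Seasonal first difference: the first difference additionally differenced at
# lag 12 (i.e., relative to the same month one year earlier)
seasonal_first_diff = ts.diff(1).diff(12).dropna()

for name, series in [("First difference", first_diff),
                     ("Seasonal first difference", seasonal_first_diff)]:
    stat, pvalue, nlags, nobs, crit, _ = adfuller(series, autolag="AIC")
    print(f"{name}: test statistic = {stat:.3f}, p-value = {pvalue:.2e}")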

In [16]:
# Instantiate East Harlem Seasonal Arima Model
Forecast_Harlem = Seasonal_ARIMA(EastHarlem)

# Test the stationarity of first difference
Forecast_Harlem.first_diff()

# Test the stationarity of the seasonal first difference
Forecast_Harlem.seasonal_first_diff()
First Difference:

Results of Dickey-Fuller Test:
Test Statistic                -7.662762e+00
p-value                        1.671661e-11
#Lags Used                     1.100000e+01
Number of Observations Used    1.070000e+02
Critical Value (1%)           -3.492996e+00
Critical Value (5%)           -2.888955e+00
Critical Value (10%)          -2.581393e+00
dtype: float64

Seasonal First Difference:

Results of Dickey-Fuller Test:
Test Statistic                 -5.046806
p-value                         0.000018
#Lags Used                     13.000000
Number of Observations Used    93.000000
Critical Value (1%)            -3.502705
Critical Value (5%)            -2.893158
Critical Value (10%)           -2.583637
dtype: float64

Since the rolling mean of the first difference and of the seasonal first difference is constant (and the Dickey-Fuller tests above reject non-stationarity for both), we take $d=1$ and $D=1$ with $m=12$ in our seasonal ARIMA model. However, we still need to figure out the values of $p, q, P$ and $Q$. One way to find them is to create the ACF and PACF plots of the first difference and seasonal first difference; a good example of how to do this in Python is the blog post here, and a quick sketch is shown below. However, this process is hands-on and does not lend itself to automation, which is necessary for my web application, so I decided to do a grid search over the remaining seasonal ARIMA parameters. Using a grid search to find the right parameter values can be time consuming, so in order to keep the runtime down I limited their ranges. There is always a trade-off between speed and accuracy, and while using the ACF and PACF plots can be more accurate, I had to sacrifice some accuracy for speed and automation.
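If you do want to inspect the plots by hand, StatsModels provides them directly; here is a quick sketch (the lag range of 36 is an arbitrary choice):

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

# Autocorrelation structure of the seasonal first difference: spikes at low
# lags suggest values of p and q, spikes at multiples of 12 suggest P and Q
seasonal_first_diff = EastHarlem.ts["Crimes"].diff(1).diff(12).dropna()

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(seasonal_first_diff, lags=36, ax=axes[0])
plot_pacf(seasonal_first_diff, lags=36, ax=axes[1])
plt.tight_layout()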

To find the values for $p, q, P$ and $Q$ with a grid search, we loop through all the candidate values and fit the SARIMAX model to the training data (1/2006 - 12/2014). We then compute the validation error by dynamically forecasting crimes over the validation year (1/2014 - 12/2014) and finding the root mean square error between the predicted and recorded number of monthly crimes. We then choose the parameter values that minimize the validation error. After the model has been fitted again on the training data, we forecast the test set year (1/2015 - 12/2015) and compute the error between the predicted and recorded values. We store the test set error and once again re-train the model, this time on ALL of the data (1/2006 - 12/2015).

Note: We enforce stationarity and invertibility of the model through the use of a try-except block. The code tries to fit the model for each set of parameter values, and if SARIMAX throws an error during the fit, we do not record those values or their validation error and simply move on to the next parameter values. A condensed sketch of this grid search is shown below.
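The sketch below outlines the procedure just described with a small, assumed parameter grid; it is an illustration of the idea, not the actual code of the fit() method:

import itertools
import numpy as np
import statsmodels.api as sm

# Monthly counts for East Harlem; the training data runs 1/2006 - 12/2014 and
# the last 12 months of it (2014) serve as the validation year
ts = EastHarlem.ts["Crimes"]
train = ts[:"2014-12-31"]
validate = train[-12:]

best_rmse, best_params = np.inf, None
# A small grid keeps the runtime down; d = D = 1 and m = 12 are fixed
for p, q, P, Q in itertools.product(range(2), range(2), range(3), range(2)):
    try:
        model = sm.tsa.statespace.SARIMAX(train,
                                          order=(p, 1, q),
                                          seasonal_order=(P, 1, Q, 12))
        results = model.fit(disp=False)
        # Dynamic forecast over the validation year: from 1/2014 onward the
        # predictions are built from earlier predictions, not recorded values
        pred = results.get_prediction(start=len(train) - 12, dynamic=True).predicted_mean
        rmse = np.sqrt(np.mean((pred.values - validate.values) ** 2))
        if rmse < best_rmse:
            best_rmse, best_params = rmse, [p, 1, q, P, 1, Q]
    except Exception:
        # SARIMAX raises an error for parameter values that give a
        # non-stationary or non-invertible model; skip those and move on
        continue

print(best_params, best_rmse)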

We can now fit our model to the data using the fit() method:

In [17]:
# Fit it to training data. Choose appropriate parameter values
# by doing a grid search and choosing the values with the smallest
# RMSE on the validation set
Forecast_Harlem.fit()

# Plot the dynamic forecast on the test set
Forecast_Harlem.plot_test()

Using the plot_test() method as we did above, we can see how our model's forecast compares to the test set data. Overall we do fairly well, except at the end of the year, where our model fails to capture the upswing in the number of assaults in December. This could be because of the increase in the trend of assaults in 2015 mentioned previously. We can see the $[p,d,q,P,D,Q]$ values for our model by typing,

In [18]:
print(Forecast_Harlem.params)
[1, 1, 1, 2, 1, 1]

We can repeat the same procedure for the Midtown larceny data. We will assume $d=D=1$ in all our models and enforce stationarity through the try-except block:

In [19]:
# Instantiate the Midtown Seasonal Arima model object
Forecast_Midtown = Seasonal_ARIMA(Midtown)

# Fit the model to the data
Forecast_Midtown.fit()

# Plot the predicted and recorded data of the test set
Forecast_Midtown.plot_test()
print(Forecast_Midtown.params)
[0, 1, 0, 1, 1, 1]

We can see that overall our model does very well on the test set, except that it overestimates the number of crimes involving larceny in the month of December.

We can now forecast our monthly crime rates into the future using the forecast() method and plot them using the plot_forecast() method:

In [20]:
# Forecast two years into the future
Forecast_Harlem.forecast()

# Plot the forecasted crime rates
Forecast_Harlem.plot_forecast()
plt.title("Forcaste Monthly Assault Rate In East Harlem")
Out[20]:
Text(0.5, 1.0, 'Forcaste Monthly Assault Rate In East Harlem')
In [21]:
# Forecast two years into the future
Forecast_Midtown.forecast()

# Plot the forecasted crime rates
Forecast_Midtown.plot_forecast()
plt.title("Forcaste Monthly Larceny Rate In Midtown")
Out[21]:
Text(0.5, 1.0, 'Forcaste Monthly Larceny Rate In Midtown')
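If you want to reproduce a forecast like the East Harlem one without the CrimeTime classes, the same result can be sketched directly with StatsModels using the parameter values found above. The 24-month horizon and 95% confidence band are assumptions about what forecast() and plot_forecast() do, and the month-end index is taken from the time series shown earlier:

import matplotlib.pyplot as plt
import statsmodels.api as sm

# Refit on all of the East Harlem data (1/2006 - 12/2015) with the parameter
# values found above; asfreq("M") just makes sure the month-end frequency is
# attached to the index so the forecast gets dated correctly
series = EastHarlem.ts["Crimes"].asfreq("M")
model = sm.tsa.statespace.SARIMAX(series, order=(1, 1, 1),
                                  seasonal_order=(2, 1, 1, 12))
results = model.fit(disp=False)

# Forecast 24 months ahead with a 95% confidence interval
forecast = results.get_forecast(steps=24)
mean = forecast.predicted_mean
conf = forecast.conf_int()

ax = series.plot(figsize=(8, 4), label="Recorded")
mean.plot(ax=ax, label="Forecast")
ax.fill_between(conf.index, conf.iloc[:, 0], conf.iloc[:, 1], alpha=0.2)
ax.legend()
plt.title("Forecasted Monthly Assault Rate In East Harlem")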

Conclusion


In this post we used exploratory data analysis to come up with a data-driven strategy to reduce crime in Manhattan. We saw that larceny and assault are two crimes that affect different parts of Manhattan and peak at different times of the year: larceny affects Midtown and peaks at the end of the year, while assault affects East Harlem and peaks in the summer. We then used time series analysis to model monthly crime rates in both Midtown and East Harlem and to predict monthly crime rates into the future. This is just a small example with two neighborhoods, but if police are better able to predict monthly crime rates on a neighborhood level, they can more effectively distribute their resources at the appropriate place and time. Lastly, I demonstrated how to use the tools I developed; this will be helpful to those trying to extend the codebase of CrimeTime.online.

Hopefully this post has been both interesting and informative!