This post is about the project I completed at Insight Data Science. The goal of the project was to develop a data-driven strategy for reducing crime in New York City using historical data provided by the NYC Open Data website. Analyzing the crime data, it became evident that different types of crimes affect different neighborhoods at different times of the year. I believe that by developing a predictive model of monthly crime rates at the neighborhood level, police will be able to distribute their resources to the right neighborhoods at the appropriate time, hopefully leading to a reduction in crime rates. This led me to create a web application, CrimeTime, where users can perform a similar analysis for their neighborhood anywhere in New York City.
The purpose of this post is therefore two-fold: to walk through an analysis of New York City's historical crime data, and to show how a predictive model of monthly crime rates can be built from it.

In this post I will be covering a few topics that involve learning from data and developing a predictive model for crime rates. We will see that different crimes affect different areas of Manhattan and that monthly crime rates peak at different points of the year depending on the type of crime. Once we can predict future monthly crime rates on a neighborhood level, police can distribute their resources to the right neighborhoods at the most appropriate time.
Throughout this post I'll be using the classes and methods I developed in the web application CrimeTime. Taking an object-oriented approach was useful because it gave me the flexibility to extend the code easily and to reuse it in a different environment, like this post. The web application was written in Python and Flask and was deployed to Amazon Web Services. Users are prompted to enter an address, and I then use the geopy library to get the latitude and longitude of that address. Once the latitude and longitude are known, I use the shapely library to find out which police precinct the address is in and obtain the data on that police precinct.
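As a rough sketch of that precinct lookup (using a made-up square polygon in place of a real precinct geometry, and a hard-coded point in place of geopy's geocoder output):

```python
from shapely.geometry import Point, shape

# Hypothetical precinct boundary (a simple square); in CrimeTime the real
# geometries come from the precinct geojson file.
precinct_poly = shape({
    "type": "Polygon",
    "coordinates": [[[-74.00, 40.70], [-73.90, 40.70],
                     [-73.90, 40.80], [-74.00, 40.80],
                     [-74.00, 40.70]]],
})

# In the app this point would come from geocoding the user's address
# with geopy; here it is hard-coded for illustration.
address_point = Point(-73.95, 40.75)

# The point-in-polygon test tells us which precinct the address falls in
print(precinct_poly.contains(address_point))  # True
```

In the app this test is simply repeated over every precinct polygon until one contains the address.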
The info for the police precincts was obtained by scraping the NYPD's website with the BeautifulSoup library, along with a specific database on the NYC Open Data website. The crime data was obtained from the NYC Open Data website, and cleaning was completed using Pandas and GeoPandas. The data was then stored in a SQLite database. Future crime rates were forecasted using a seasonal ARIMA model through the Python library StatsModels. I used a grid search to obtain the model parameters that minimize the validation error.
After downloading the crime data (in CSV format) from the NYC Open Data website and cleaning it, I decided to store it in a SQLite database. I put all these steps into a method in the PreProcessor class from CrimeTime (in the backend directory). If you want to run the application on your local machine, all you have to do is pass the PreProcessor object the location and name of the database you want to make during instantiation:
```python
import sys
import warnings
warnings.filterwarnings('ignore')
sys.path.append("../backend/")

from PreProcessor import PreProcessor

# Instantiate the preprocessor object with the location and name of the database
PP = PreProcessor("../data/CrimeTime.db")
```
The actual CSV file should be located in the same directory ("/data/"). We can clean this file and create the database called "CrimeTime.db", with the crime data stored in a table called "NYC_Crime", through a single call to the appropriate PreProcessor method.
The preprocessor object can also scrape the NYPD's website, obtain the address and telephone number of each precinct's police station, and store them in the "CrimeTime.db" database as the table "NYC_CRIME". This is done through another PreProcessor method.
Note: If NYC Open Data no longer provides the crime data, email me and I can send you the database; you will then no longer need to run the above commands to build it.
```python
import sqlite3
import pandas as pd

# Connect to the database
conn = sqlite3.connect('../data/CrimeTime.db')

# SQL command to select all crimes in Manhattan
sql_command = 'SELECT * FROM NYC_Crime WHERE BOROUGH = \'MANHATTAN\''

# Query the database using the above command
crime_df = pd.read_sql(sql_command, conn)

# Close the connection to the database
conn.close()
```
You'll notice I only have data from 2006 to 2015. The data for 2016 was not available at the time, and the data before 2006 is rather sparse, so I decided to exclude it. Let's take a look at how crime in Manhattan has evolved over the last 10 years.
We will plot the yearly crime data, but first we must group all the crimes by their year. We do this with the Pandas command:
```python
CRIMES_PER_YEAR = crime_df.groupby('YEAR').size()
```
This creates a Pandas Series, where the index is the year and the values are the number of crimes that occurred in that year. We can plot this yearly data directly using the .plot() method on the Series. This calls the Matplotlib library under the hood:
```python
import matplotlib.pyplot as plt
%matplotlib inline

fig = plt.figure(figsize=(6, 3))
CRIMES_PER_YEAR.plot(title='Number of crimes per year all of Manhattan',
                     fontsize=13)
plt.xlabel('Year', fontsize=13)
plt.ylabel('Number of Crimes', fontsize=13)
plt.ticklabel_format(useOffset=False)
```
One thing to notice is that crime in the last 10 years has been going down!
Let's just look at the crime data for larceny and assault. We can build new dataframes for these specific crimes with the commands:
```python
LARCENY = crime_df[crime_df.OFFENSE == 'GRAND LARCENY']
ASSAULT = crime_df[crime_df.OFFENSE == 'FELONY ASSAULT']
```
In order to visualize the geospatial distribution of crimes in Manhattan, let's break our data down into the number of crimes (in 2015) that occur within each precinct. We can do this in Pandas using the groupby method:
```python
CRIMES_2015 = ASSAULT[ASSAULT.YEAR == 2015]
CRIMES_2015 = CRIMES_2015.groupby('PRECINCT').size()
```
Now that we have the number of crimes per precinct and the precinct geometries (note: I downloaded them as a geojson file and edited out non-Manhattan police precincts by hand), we can visualize the number of assaults per precinct using Folium. Folium is a Python package with very nice geospatial visualization capabilities. Using it, we can directly read in the geojson file and then plot the number of crimes that occur in each precinct using the choropleth function. Let's visualize the number of assaults in 2015:
```python
import folium

# Central lat/long values of NYC
nyc_coor = [40.81, -73.9999]

# Instantiate a folium map object centered at the above coordinates
nyc_crime_map = folium.Map(location=nyc_coor, zoom_start=11)

# The path to the geojson file of the Manhattan precincts
pathgeo = './manhattan_precincts.geojson'

# Make the choropleth map
nyc_crime_map.choropleth(geo_data=pathgeo,
                         data=CRIMES_2015,
                         key_on='feature.properties.Precinct',
                         fill_color='BuPu',
                         fill_opacity=0.7,
                         line_opacity=0.2,
                         legend_name='MANHATTAN')

# Show the map
nyc_crime_map
```
We can look at the other crimes too if we want, but for the purposes of this post we'll just look at larceny:
```python
CRIMES_2015 = LARCENY[LARCENY.YEAR == 2015]
CRIMES_2015 = CRIMES_2015.groupby('PRECINCT').size()

# Instantiate the map object
nyc_crime_map = folium.Map(location=nyc_coor, zoom_start=11)

# Path to the geojson precinct file
pathgeo = './manhattan_precincts.geojson'

# Make the map
nyc_crime_map.choropleth(geo_data=pathgeo,
                         data=CRIMES_2015,
                         key_on='feature.properties.Precinct',
                         fill_color='BuPu',
                         fill_opacity=0.7,
                         line_opacity=0.2,
                         legend_name='MANHATTAN')

# Show the map
nyc_crime_map
```