Exploring And Forecasting Crime Rates In NYC


This post is about the project I completed at Insight Data Science. The goal of the project was to develop a data-driven strategy for reducing crime in New York City using historical data provided by the NYC Open Data Website. While analyzing the crime data, it became evident that different types of crimes affect different neighborhoods at different times of the year. I believe that with a predictive model of monthly crime rates at the neighborhood level, police will be able to distribute their resources to the right neighborhoods at the appropriate time, and hopefully this will lead to a reduction in crime rates. This led me to create a web application, CrimeTime, where users can perform a similar analysis for their own neighborhood anywhere in New York City.

The purpose of this post is therefore two-fold:

  1. Discuss the analysis of the historical crime data as well as the predictive models to forecast crime rates in the future.
  2. Show how to use the tools I developed; this will be helpful to those trying to extend the codebase of CrimeTime.online.

Analysis and modeling

In this post I will cover a few topics that involve learning from data and developing a predictive model for crime rates. Specifically, I will:

  • Perform exploratory analysis on temporal and geospatial crime data to find where and when specific crimes happen the most.

We will see that different crimes affect different areas of Manhattan and that monthly crime rates peak at different points of the year depending on the type of crime. We then:

  • Develop a time series model that uses historical data to predict the monthly number of crimes in the two neighborhoods that have the highest crime rates in Manhattan.

Once we can predict future monthly crime rates at the neighborhood level, police can distribute their resources to the right neighborhoods at the most appropriate time.

About the web application

Throughout this post I'll be using the classes and methods I developed in the web application CrimeTime. Taking an object-oriented approach was useful because it gives me the flexibility to extend the code easily and to reuse it in a different environment, like this post. The web application was written in Python and Flask and deployed to Amazon Web Services. Users are prompted to enter an address, and I use the geopy library to get the latitude and longitude of that address. Once the latitude and longitude are known, I use the shapely library to find out which police precinct the address is in and obtain the data on that precinct.
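
To make the precinct lookup more concrete, here is a minimal sketch of the point-in-polygon test behind it. The application uses shapely for this step; the sketch swaps in a plain-Python ray-casting check so that it runs without any dependencies. The precinct numbers, the rectangular outlines, and the helper names `point_in_polygon` and `find_precinct` are made up for illustration and are not part of the CrimeTime codebase.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: count how many edges a horizontal ray from the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_precinct(point: Point, precincts: Dict[int, List[Point]]) -> Optional[int]:
    """Return the number of the first precinct containing the point, or None."""
    for number, outline in precincts.items():
        if point_in_polygon(point, outline):
            return number
    return None


# Hypothetical rectangular "precincts"; real outlines are far more irregular.
PRECINCTS = {
    1: [(-74.02, 40.70), (-73.99, 40.70), (-73.99, 40.73), (-74.02, 40.73)],
    2: [(-73.99, 40.70), (-73.96, 40.70), (-73.96, 40.73), (-73.99, 40.73)],
}

print(find_precinct((-74.00, 40.71), PRECINCTS))
print(find_precinct((-73.90, 40.80), PRECINCTS))
```

In this sketch an address that falls outside every outline simply returns None, which is a case the application would need to handle before querying the precinct data.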

The information on police precincts was obtained by scraping the NYPD's website using the BeautifulSoup library, and also from this specific database on the NYC Open Data Website. The crime data was obtained from the NYC Open Data Website and cleaned using Pandas and GeoPandas. The data was then stored in a SQLite database. Crime rates were forecast using a seasonal ARIMA model through the Python library StatsModels. I used a grid search to obtain the model parameters that minimize the validation error.
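
The grid search over seasonal ARIMA parameters can be sketched roughly as follows. This is not the code used in CrimeTime: the parameter range (each term 0 or 1), the 12-month season, the use of RMSE as the validation error, and the function names are all assumptions made for illustration. The statsmodels import sits inside the scoring function so the grid itself can be inspected without statsmodels installed.

```python
import itertools
import math


def candidate_orders(max_order=1, season_length=12):
    """Yield every (order, seasonal_order) pair with each term in 0..max_order."""
    terms = range(max_order + 1)
    for p, d, q, P, D, Q in itertools.product(terms, repeat=6):
        yield (p, d, q), (P, D, Q, season_length)


def validation_rmse(train, valid, order, seasonal_order):
    """Fit a seasonal ARIMA on `train` and score its forecast against `valid`."""
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    model = SARIMAX(train, order=order, seasonal_order=seasonal_order,
                    enforce_stationarity=False, enforce_invertibility=False)
    fitted = model.fit(disp=False)
    forecast = fitted.forecast(steps=len(valid))
    errors = [(f - v) ** 2 for f, v in zip(forecast, valid)]
    return math.sqrt(sum(errors) / len(errors))


def grid_search(train, valid, max_order=1, season_length=12):
    """Return (best_rmse, order, seasonal_order) over the whole grid."""
    best = (math.inf, None, None)
    for order, seasonal_order in candidate_orders(max_order, season_length):
        try:
            score = validation_rmse(train, valid, order, seasonal_order)
        except Exception:
            continue  # some combinations fail to fit; skip them
        if score < best[0]:
            best = (score, order, seasonal_order)
    return best


# With each of the six terms limited to 0 or 1 there are 2**6 candidates.
print(len(list(candidate_orders())))
```

Here `train` and `valid` would be consecutive slices of one monthly crime-count series, for example the last 12 months held out for validation.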

The source code for this project can be found here, along with instructions on how to build a local copy of the Sphinx-based documentation.

Exploratory Data Analysis

After downloading the crime data (in CSV format) from the NYC Open Data Website and cleaning it, I decided to store it in a SQLite database. I put all these steps into a method in the PreProcessor class from CrimeTime (in the backend directory). To run the application on your local machine, all you have to do is pass the PreProcessor object the location and name of the database you want to create during instantiation:

In [1]:
import sys
import os.path
import os
import warnings

from PreProcessor import PreProcessor 

# Instantiate the preprocessor object with location and name of database
PP = PreProcessor("../data/CrimeTime.db")

The actual CSV file should be located in the same directory ("/data/"). We can clean this file and create the database called "CrimeTime.db" with the crime data stored in a table called "NYC_Crime" by the following command:

In [2]:

The preprocessor object can also scrape the NYPD's website to obtain the address and telephone number of each precinct police station and store them in the "CrimeTime.db" database as the table "NYC_CRIME." This is done through the command:

In [3]:

Note: If NYC Open Data no longer provides the crime data, email me and I can send you the database. You will then no longer need to run the above commands to build it.

Now that our data is cleaned, we can access it very easily using the sqlite3 database library and read it into a pandas dataframe. We obtain all the crime data on Manhattan with the commands:

In [4]:
import sqlite3
import pandas as pd

# Connect to the database
conn = sqlite3.connect('../data/CrimeTime.db')

# SQL Command

# Now query the database using the above command
crime_df = pd.read_sql(sql_command, conn)

# Close the connection to the database
conn.close()
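
As a self-contained sketch of the same pattern, the snippet below builds a tiny in-memory stand-in for the database and reads the Manhattan rows into a dataframe. The table name NYC_Crime comes from this post, but the BOROUGH column, the rows, and the query text are assumptions for illustration only; the real schema may differ.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for CrimeTime.db; BOROUGH and all rows are made up.
toy_conn = sqlite3.connect(":memory:")
toy_conn.execute(
    "CREATE TABLE NYC_Crime "
    "(YEAR INTEGER, OFFENSE TEXT, PRECINCT INTEGER, BOROUGH TEXT)"
)
toy_conn.executemany(
    "INSERT INTO NYC_Crime VALUES (?, ?, ?, ?)",
    [
        (2014, "GRAND LARCENY", 14, "MANHATTAN"),
        (2015, "FELONY ASSAULT", 25, "MANHATTAN"),
        (2015, "GRAND LARCENY", 75, "BROOKLYN"),
    ],
)

# A parameterized query keeps the filter value out of the SQL string.
toy_sql = "SELECT * FROM NYC_Crime WHERE BOROUGH = ?"
toy_df = pd.read_sql(toy_sql, toy_conn, params=("MANHATTAN",))

# Close the connection once the data is in memory.
toy_conn.close()

print(toy_df)
```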

You'll notice I only have data from 2006 to 2015. The data for 2016 was not available at the time, and the data before 2006 is rather sparse, so I decided to neglect it. Let's take a look at how crime in Manhattan has evolved over the last 10 years.

We will plot the yearly crime data, but first we must group all the crimes by year. We do this in Pandas with the command:

In [5]:
CRIMES_PER_YEAR = crime_df.groupby('YEAR').size()

This creates a Pandas series, where the index is the year and the values are the number of crimes that occurred in that year. We can plot this yearly data directly using the .plot() method on the series. This calls the matplotlib library under the hood:

In [6]:
import matplotlib.pyplot as plt
%matplotlib inline
fig = plt.figure(figsize=(6, 3))

CRIMES_PER_YEAR.plot(title='Number of crimes per year all of Manhattan')

plt.ylabel('Number of Crimes',fontsize=13)

One thing to notice is that crime in the last 10 years has been going down!
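
To put a number on a trend like this, one option is to compute the year-over-year change from the grouped series. The sketch below does this on a handful of made-up rows, so the values it prints say nothing about the real data; only the groupby and pct_change pattern is the point.

```python
import pandas as pd

# Made-up rows: one row per recorded crime, as in crime_df.
toy_df = pd.DataFrame({
    "YEAR": [2013, 2013, 2013, 2014, 2014, 2015],
    "OFFENSE": ["GRAND LARCENY", "FELONY ASSAULT", "GRAND LARCENY",
                "GRAND LARCENY", "FELONY ASSAULT", "GRAND LARCENY"],
})

# Number of rows (crimes) in each year, indexed by year.
per_year = toy_df.groupby("YEAR").size()

# Percentage change relative to the previous year (NaN for the first year).
yearly_change = per_year.pct_change() * 100

print(per_year)
print(yearly_change.round(1))
```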

Let's just look at the crime data for larceny and assault. We can build new dataframes for just these specific crimes with the commands:

In [7]:
LARCENY = crime_df[crime_df.OFFENSE == 'GRAND LARCENY']
ASSAULT = crime_df[crime_df.OFFENSE == 'FELONY ASSAULT']

In order to visualize the geospatial distribution of crimes in Manhattan, let's break down our data into the number of crimes per month (in 2015) that occur within a precinct. We can do this in pandas using the groupby command:

In [8]:
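# Keep only the 2015 assaults (assumes YEAR is stored as an integer)
CRIMES_2015 = ASSAULT[ASSAULT.YEAR == 2015]
CRIMES_2015 = CRIMES_2015.groupby('PRECINCT').size()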
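
As a minimal sketch of the filtering and grouping steps on a small made-up table: the YEAR, OFFENSE and PRECINCT columns appear in this post, while the MONTH column and every row below are assumptions for illustration. Grouping on two columns gives the per-precinct, per-month counts.

```python
import pandas as pd

# Made-up rows; MONTH is a hypothetical column.
toy_df = pd.DataFrame({
    "YEAR": [2014, 2015, 2015, 2015, 2015],
    "MONTH": [12, 1, 1, 2, 2],
    "OFFENSE": ["FELONY ASSAULT", "FELONY ASSAULT", "GRAND LARCENY",
                "FELONY ASSAULT", "FELONY ASSAULT"],
    "PRECINCT": [14, 14, 14, 14, 25],
})

# Boolean mask: assaults that happened in 2015.
mask = (toy_df.OFFENSE == "FELONY ASSAULT") & (toy_df.YEAR == 2015)
assaults_2015 = toy_df[mask]

# Yearly total per precinct.
per_precinct = assaults_2015.groupby("PRECINCT").size()

# Count per precinct and month (a series with a two-level index).
per_precinct_month = assaults_2015.groupby(["PRECINCT", "MONTH"]).size()

print(per_precinct)
print(per_precinct_month)
```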

Now that we have the number of crimes per precinct and the precinct geometries (Note: I downloaded them as a geojson file and edited out non-Manhattan police precincts by hand), we can visualize the number of assaults per precinct using Folium. Folium is a Python package that has very nice geospatial visualization capabilities. Using it, we can directly read in the geojson file and then plot the number of crimes that occur each month in a precinct using the choropleth function. Let's visualize the number of assaults in 2015:

In [9]:
import folium

# Central lat/long values of NYC
nyc_coor = [40.81,-73.9999]

# instantiate a folium map object with the above coordinate at center
nyc_crime_map = folium.Map(location=nyc_coor,zoom_start=11)

# the path to the geojson file of the manhattan precincts
pathgeo = './manhattan_precincts.geojson'

# make the choropleth map
# show the map

We can look at the other crimes too if we want; for the purposes of this blog we'll just look at larceny:

In [10]:
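# Keep only the 2015 larcenies (assumes YEAR is stored as an integer)
CRIMES_2015 = LARCENY[LARCENY.YEAR == 2015]
CRIMES_2015 = CRIMES_2015.groupby('PRECINCT').size()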

# instantiate the object
nyc_crime_map = folium.Map(location=nyc_coor,zoom_start=11)

# path to geojson precinct file
pathgeo = './manhattan_precincts.geojson'

# make the map
# show the map