5. Removing Outliers With Isolation Forests
6. Recommendations & Next Steps
I started this project a while back with the goal of taking the 2016 NYC Benchmarking Law building energy usage data and doing something interesting with it. I originally attempted to clean and analyze this data set to find ways to reduce buildings' energy usage and, subsequently, their greenhouse gas emissions. After a few iterations I thought it might be interesting to see if I could predict buildings' greenhouse gas emissions from their age, energy and water consumption, and other energy consumption metrics. This is a somewhat difficult task since the data is very messy, and in this first blog post I will cover the initial data cleaning and exploratory analysis.
Since I will be completing this project over multiple days and using Google Cloud, I will go over the basics of using BigQuery for storing the datasets so I won't have to start all over again each time I work on it. At the end of this blog post I will summarize the findings and give some specific recommendations for reducing multifamily and office building energy usage. The source code for this project can be found here.
The NYC Benchmarking Law requires owners of large buildings to annually measure their energy and water consumption in a process called benchmarking. The law standardizes this process by requiring building owners to enter their annual energy and water use in the U.S. Environmental Protection Agency's (EPA) online tool, ENERGY STAR Portfolio Manager® and use the tool to submit data to the City. This data gives building owners information about a building's energy and water consumption compared to similar buildings, and tracks progress year over year to help in energy efficiency planning.
Benchmarking data is also disclosed publicly and can be found here. I analyzed the 2016 data; my summary of the findings and recommendations for reducing energy consumption in New York City buildings is discussed in the conclusions post.
The data comes from the year 2016 and is quite messy, and a lot of cleaning is necessary before it can be analyzed. The cleaning process was made more difficult because the data was stored as strings with multiple non-numeric values, which made converting the data to its proper types a more involved process. One thing to keep in mind throughout this post is that this is self-reported data, meaning it is likely biased and contains outliers. I will therefore go over a few techniques for removing outliers later in the post.
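As a preview of what that will look like, below is a minimal sketch using scikit-learn's IsolationForest (an assumption on my part; the dataframe and column names here are purely illustrative, not the actual pipeline):

```python
# Hypothetical preview of isolation-forest outlier removal; `df`, 'Site_EUI' and
# 'GHGI' are illustrative stand-ins for the cleaned data used later in the post.
from sklearn.ensemble import IsolationForest

features = df[['Site_EUI', 'GHGI']].dropna()
iso = IsolationForest(contamination=0.01, random_state=42)
labels = iso.fit_predict(features)   # -1 marks an outlier, 1 an inlier
inliers = features[labels == 1]      # keep only the inliers
```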
Since the cleaning was rather tedious, I created external functions to handle this process. In addition, I also created functions to handle transforming and plotting the data. I kept these functions in the separate files CleaningFunctions.py and PlottingFunctions.py, respectively, so as not to clutter the post. We import these functions along with other basic libraries (Pandas, Matplotlib and Seaborn) and then read in the data file below:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()
from utilities.CleaningFunctions import (
initial_clean,
convert_GeoPandas_to_Bokeh_format,
group_property_types
)
from utilities.PlottingFunctions import (
plot_years_built,
make_interactive_choropleth_map
)
Here we specify a few datatypes as integers while reading in the Excel file using the read_excel function from Pandas:
df_2016 = pd.read_excel("../data/nyc_benchmarking_disclosure_data_reported_in_2016.xlsx",
                        converters={'Street Number': int,
                                    'Zip Code': int,
                                    'Year Built': int,
                                    'ENERGY STAR Score': int})
There are 13,233 buildings with different types of energy usage, emissions and other information. I'll drop a bunch of these features and only keep the following:
The terms in the square brackets are the column names used in the dataframe. In addition, we must do some basic feature engineering. The reported data gives us the metrics (Nat_Gas, Elec_Use, GHG, Water_Use) in terms of total volume. Comparing buildings of different sizes using these raw totals is not fair, so we standardize each metric by dividing by the building's square footage, giving us that metric's intensity. This yields the additional features NGI, EI, WI and GHGI.
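As a rough sketch (not the actual implementation), the standardization amounts to something like the following, where 'Gross_Floor_Area' is a hypothetical name for the square-footage column; the real renaming and computation happen inside initial_clean():

```python
# Illustrative sketch: divide each total-volume metric by the building's square
# footage to get its intensity ('Gross_Floor_Area' is a hypothetical column name).
for total_col, intensity_col in [('Nat_Gas', 'NGI'), ('Elec_Use', 'EI'),
                                 ('Water_Use', 'WI'), ('GHG', 'GHGI')]:
    df_2016[intensity_col] = df_2016[total_col] / df_2016['Gross_Floor_Area']
```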
I wrote a basic function called initial_clean() to clean the data and create these additional features. We call it on our dataset and then get some basic statistics about the data:
df_2016_2 = initial_clean(df_2016)
temp_cols_to_drop = ['BBL','Street_Number','Zip_Code','Occupancy']
df_2016_2.drop(temp_cols_to_drop, axis=1)\
.describe()
| | Energy_Star | Site_EUI | Nat_Gas | Elec_Use | GHG | Water_Use | NGI | EI | WI | GHGI | OPSFT | Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9535.000000 | 11439.000000 | 1.008700e+04 | 1.142500e+04 | 1.147800e+04 | 7.265000e+03 | 9870.000000 | 11206.000000 | 7261.000000 | 11258.000000 | 11311.000000 | 11531.000000 |
| mean | 57.735711 | 525.733377 | 2.520461e+07 | 8.201496e+06 | 6.952577e+03 | 2.579751e+04 | 137.705639 | 54.266179 | 0.161268 | 0.031272 | 0.001065 | 67.857168 |
| std | 30.143817 | 10120.105154 | 1.194068e+09 | 1.214643e+08 | 1.692231e+05 | 5.860239e+05 | 7512.527146 | 1210.530111 | 2.053453 | 0.571378 | 0.000536 | 30.263637 |
| min | 1.000000 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -2.000000 |
| 25% | 34.000000 | 65.300000 | 8.915501e+05 | 1.045702e+06 | 3.420250e+02 | 2.661700e+03 | 7.324853 | 13.682696 | 0.028523 | 0.004308 | 0.000629 | 50.000000 |
| 50% | 63.000000 | 82.400000 | 4.067600e+06 | 1.885996e+06 | 5.198000e+02 | 4.745600e+03 | 46.268145 | 18.482229 | 0.046098 | 0.005455 | 0.001075 | 76.000000 |
| 75% | 83.000000 | 103.000000 | 6.919267e+06 | 4.513704e+06 | 9.394500e+02 | 8.057900e+03 | 68.036285 | 30.716894 | 0.073287 | 0.007003 | 0.001525 | 90.000000 |
| max | 100.000000 | 801504.700000 | 1.101676e+11 | 1.047620e+10 | 1.501468e+07 | 4.385740e+07 | 737791.764249 | 84461.681703 | 98.340480 | 39.190314 | 0.001999 | 417.000000 |
The above table is only a summary of the numerical data in the dataframe. Just looking at the count column we can immediately see that there are a lot of missing values in this data. This tells me the data will be rather messy, with many columns having NaNs or missing values.
It also looks like there is a lot of variation within this dataset. Just looking at the Site_EUI statistic, the 75th percentile is 103 (kBtu/ft²), but the max is 801,504.7 (kBtu/ft²). This is probably due to the number of different types of buildings in the city, as well as the fact that the data contains outliers.
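To make both points concrete, a quick (hypothetical) sanity check of the missing values and of the upper tail of Site_EUI could look like this:

```python
# How many values are missing in each column, and how heavy is the Site_EUI tail?
print(df_2016_2.isnull().sum().sort_values(ascending=False).head(10))
print(df_2016_2['Site_EUI'].quantile([0.50, 0.75, 0.95, 0.99, 1.00]))
```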
The next thing I would like to see is how many buildings in NYC are in compliance with the benchmarking submission requirement:
plt.figure(figsize=(10,4))
df_2016_2['Benchmarking_Status'].value_counts()\
.plot(kind='bar',
fontsize=12,
rot=0)
plt.title('DOF Benchmarking Submission Status',fontsize=14)
plt.ylabel('count',fontsize=12)
Most buildings are in compliance with the Department of Finance Benchmarking standards. Let's take a look at the violators:
Violators = df_2016_2.query("Benchmarking_Status == 'In Violation' ")
Violators.head()
| | BBL | BINs | Street_Number | Street_Name | Zip_Code | Borough | Benchmarking_Status | Property_Type | Year_Built | Occupancy | ... | Nat_Gas | Elec_Use | GHG | Water_Use | NGI | EI | WI | GHGI | OPSFT | Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11978 | 2.051410e+09 | NaN | 300 | BAYCHESTER AVENUE | 10475 | Bronx | In Violation | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 11979 | 3.088400e+09 | NaN | 3939 | SHORE PARKWAY | 11235 | Brooklyn | In Violation | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 11980 | 3.088420e+09 | NaN | 2824 | PLUMB 3 STREET | 11235 | Brooklyn | In Violation | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 11981 | 2.051411e+09 | NaN | 2100 | BARTOW AVENUE | 10475 | Bronx | In Violation | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 11982 | 2.051410e+09 | NaN | 312 | BAYCHESTER AVENUE | 10475 | Bronx | In Violation | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 22 columns
There's not much we can learn from this directly, but we can look to see if certain zip codes have more buildings in violation than others. The first thing we do is group by zip code and count, to get the number of violations per zip code:
zips_df = Violators.groupby('Zip_Code')['Zip_Code'].size()\
.reset_index(name='counts')
Now we want to visualize the number of violators per zip code. To make things interesting we will create an interactive choropleth map using the Bokeh library. Bokeh is a great visualization tool that I have used in the past. We get the shapes for New York City zip codes as a GeoJSON file from this site. The GeoJSON file can be read into a dataframe using GeoPandas.
import geopandas as gpd
gdf = gpd.read_file("../data/nyc-zip-code-tabulation-areas-polygons.geojson")
# GeoPandas doesn't allow users to convert the datatype while reading it in so we do it here
gdf["postalCode"] = gdf["postalCode"].astype(int)
We can see the basic contents of the GeoPandas dataframe:
gdf.head(2)
| | OBJECTID | postalCode | PO_NAME | STATE | borough | ST_FIPS | CTY_FIPS | BLDGpostal | @id | longitude | latitude | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 11372 | Jackson Heights | NY | Queens | 36 | 081 | 0 | http://nyc.pediacities.com/Resource/PostalCode... | -73.883573 | 40.751662 | POLYGON ((-73.86942457284177 40.74915687096788... |
| 1 | 2 | 11004 | Glen Oaks | NY | Queens | 36 | 081 | 0 | http://nyc.pediacities.com/Resource/PostalCode... | -73.711608 | 40.745366 | POLYGON ((-73.71132911125308 40.74947450816085... |
I noticed that only a few of the zip codes had actual neighborhood names, so I wrote a script (GetNeighborhoodNames.py) to scrape this website and obtain each neighborhood's name. I pickled the results so we can use them here:
zip_names = pd.read_pickle("../data/neighborhoods.pkl")
We can attach them to our GeoPandas dataframe by joining on zip code:
gdf = gdf.drop(['PO_NAME'],axis=1)\
.merge(zip_names, on="postalCode",how="left")\
.fillna("")
Next, we'll left join our count of violators per zip code (zips_df) to the above dataframe and fill in the zip codes that have no violations with zeros:
gdf = gdf.merge(zips_df, how="left", left_on="postalCode", right_on="Zip_Code")\
.drop(["OBJECTID","Zip_Code"], axis=1)\
.fillna(0)
gdf.head(2)
| | postalCode | STATE | borough | ST_FIPS | CTY_FIPS | BLDGpostal | @id | longitude | latitude | geometry | PO_NAME | counts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11372 | NY | Queens | 36 | 081 | 0 | http://nyc.pediacities.com/Resource/PostalCode... | -73.883573 | 40.751662 | POLYGON ((-73.86942457284177 40.74915687096788... | West Queens | 5.0 |
| 1 | 11004 | NY | Queens | 36 | 081 | 0 | http://nyc.pediacities.com/Resource/PostalCode... | -73.711608 | 40.745366 | POLYGON ((-73.71132911125308 40.74947450816085... | Southeast Queens | 0.0 |
Now, before we can use Bokeh to visualize our data, we must first convert the GeoPandas dataframe to a format that Bokeh can work with. Since I already covered this in a previous blog post I won't go over the details, but here I used a slightly modified version of the function from that post:
bokeh_source = convert_GeoPandas_to_Bokeh_format(gdf)
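For reference, a heavily simplified sketch of the kind of conversion this function performs is shown below; it assumes every geometry is a plain Polygon without holes, whereas the real convert_GeoPandas_to_Bokeh_format handles more cases:

```python
# Simplified illustration: unpack each polygon's exterior coordinates into the
# x/y lists that Bokeh's patches glyph expects, plus the columns we want to plot.
from bokeh.models import ColumnDataSource

def gdf_to_bokeh_source(gdf):
    xs, ys = [], []
    for geom in gdf['geometry']:
        x, y = geom.exterior.coords.xy   # assumes simple Polygons (no MultiPolygons)
        xs.append(list(x))
        ys.append(list(y))
    return ColumnDataSource(dict(x=xs, y=ys,
                                 PO_NAME=gdf['PO_NAME'].tolist(),
                                 counts=gdf['counts'].tolist()))
```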
Next we set the Bokeh io module to output to the notebook and use the function I wrote, make_interactive_choropleth_map, to create the in-notebook zip code choropleth map:
from bokeh.io import output_notebook, show
output_notebook()
# We get the min and max of the number of violations to give the choropleth a scale.
max_num_violations = zips_df['counts'].max()
min_num_violations = zips_df['counts'].min()
fig = make_interactive_choropleth_map(bokeh_source,
                                      count_var="Number Of Violations",
                                      min_ct=min_num_violations,
                                      max_ct=max_num_violations)
show(fig)
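For readers who don't want to dig through PlottingFunctions.py, here is a rough sketch of what a function like make_interactive_choropleth_map might do internally; the field names and styling are assumptions on my part, not the actual implementation:

```python
# Illustrative choropleth builder: color each zip-code patch by its violation count
# and add a hover tool. Field names ('x', 'y', 'counts', 'PO_NAME') are assumed.
from bokeh.models import LinearColorMapper, ColorBar, HoverTool
from bokeh.palettes import Viridis256
from bokeh.plotting import figure

def simple_choropleth(source, count_var, min_ct, max_ct):
    color_mapper = LinearColorMapper(palette=Viridis256, low=min_ct, high=max_ct)
    fig = figure(title=count_var, toolbar_location=None,
                 x_axis_location=None, y_axis_location=None)
    fig.patches('x', 'y', source=source,
                fill_color={'field': 'counts', 'transform': color_mapper},
                line_color='black', line_width=0.5)
    fig.add_tools(HoverTool(tooltips=[('Neighborhood', '@PO_NAME'),
                                      ('Violations', '@counts')]))
    fig.add_layout(ColorBar(color_mapper=color_mapper), 'right')
    return fig
```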