Green Buildings

Analyzing Energy Use & Predicting Emissions Of Buildings In NYC


1. Introduction

2. Exploratory Analysis

3. Analysis Of Multifamily Buildings

4. Predictive Models For Green House Gas Emission

5. Conclusions And Recommendations


I started this project a while back with the goal of taking the 2016 NYC Benchmarking Law data on building energy usage and doing something interesting with it. I originally attempted to clean and analyze this data set to find ways to reduce buildings' energy usage and, in turn, their greenhouse gas emissions. After a few iterations I thought it might also be interesting to see whether we could predict buildings' greenhouse gas emissions from their age, energy and water consumption, and other energy consumption metrics. In the modeling section we look at three different models for predicting greenhouse gas emissions:

  1. Linear Regression

  2. Generalized Additive Models

  3. Gradient Boosted Regression Trees

In the conclusion section I not only summarize the findings but also give some specific recommendations for reducing multifamily buildings' energy usage.


The NYC Benchmarking Law requires owners of large buildings to annually measure their energy and water consumption in a process called benchmarking. The law standardizes this process by requiring building owners to enter their annual energy and water use in the U.S. Environmental Protection Agency's (EPA) online tool, ENERGY STAR Portfolio Manager®, and to use the tool to submit data to the City. This data gives building owners information about a building's energy and water consumption compared to similar buildings, and tracks progress year over year to help in energy efficiency planning. In this blog post we will analyze how buildings in New York City use energy and water, make recommendations on how to improve their performance, and model their greenhouse gas emissions. The source code for this project can be found here.

Benchmarking data is also disclosed publicly and can be found here. I analyzed the 2016 data and my summary of the findings and recommendations for reducing energy consumption in New York City buildings are discussed in the conclusions section.

The 2016 data is very messy and a lot of cleaning was necessary before I could analyze it. One thing to note is that this is self-reported data, meaning the performance data wasn't collected by an unbiased third party but by the building owners themselves. This means the data may be biased, and we will keep this in mind while performing our analysis.

There are about 13,223 buildings recorded in this dataset and many of them have missing data values. While there are many different techniques for imputing missing values, there was a sufficient number of buildings with complete data that I did not need to impute anything. The cleaning process was made more difficult because the data was stored as strings containing various non-numeric values, which made converting each column to its proper type a more involved process.
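As a rough illustration of that conversion step (not the project's actual cleaning code), string columns that mix numbers with placeholder text can be coerced using pandas' to_numeric. The toy values and placeholder strings below are assumptions made for the example:

```python
import pandas as pd

# Toy data: the placeholder strings are made up for illustration.
raw = pd.DataFrame({"Site_EUI": ["65.3", "Not Available", "103", "See Primary BBL"]})

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
clean = raw.assign(Site_EUI=pd.to_numeric(raw["Site_EUI"], errors="coerce"))

# Rows with complete data can then be kept without imputing anything.
complete = clean.dropna(subset=["Site_EUI"])
print(complete)
```

Coercing to NaN keeps the bad entries visible as missing values, so the later counts of missing data stay honest.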

Exploratory Analysis

Since the cleaning was fairly tedious, I created external functions to handle this process. In addition, I also created functions to handle transforming and plotting the data. I kept these functions in separate files, Cleaning_Functions.py and Plotting_Functions.py respectively, so as to not clutter the post. We import these functions along with other basic libraries (Pandas, Matplotlib and Seaborn) and read in the data file below:

In [52]:
import warnings

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from Cleaning_Functions import (initial_clean, 
from Plotting_Functions import (plot_years_built, 

# Here we specify a few data types as integers while reading in the Excel file
df_2016 = pd.read_excel("data/nyc_benchmarking_disclosure_data_reported_in_2016.xlsx",
                       converters={'Street Number':int, 
                                   'Zip Code':int,
                                   'Year Built':int,
                                   'ENERGY STAR Score':int})

There are about 13,233 buildings with different types of energy usage, emissions and other information. I'll drop many of these features and keep only the following:

  • Reported NYC Building Identification Numbers : [BINs]
  • NYC Borough, Block and Lot : [BBL]
  • Street Number : [Street_Number]
  • Street Name : [Street_Name]
  • Zip Code : [Zip_Code]
  • Borough : [Borough]
  • Year Built : [Year_Built]
  • DOF Benchmarking Status : [Benchmarking_Status]
  • Site EUI (kBtu/ft$^{2}$) : [Site_EUI]
  • Natural Gas Use (kBtu) : [Nat_Gas]
  • Electricity Use (kBtu) : [Elec_Use]
  • Total GHG Emissions (Metric Tons CO2e) : [GHG]
  • ENERGY STAR Score : [Energy_Star]
  • Water Use (All Water Sources) (kgal) : [Water_Use]

The terms in the square brackets are the column names used in the dataframe. In addition, we must do some basic feature engineering. The reported data gives us the metrics (Nat_Gas, Elec_Use, GHG, Water_Use) as totals for each building. Comparing buildings of different sizes on these totals is not a fair comparison. To compare them fairly we must standardize these metrics by dividing by the square footage of the building, giving us each metric's intensity. We therefore have the following features:

  • Natural Gas Use Intensity (kBtu/ft$^{2}$) : [NGI]
  • Electricity Use Intensity (kBtu/ft$^{2}$) : [EI]
  • Water Use Intensity (kgal/ft$^2$) : [WI]
  • Total GHG Emissions Intensity (Metric Tons CO2e / ft$^2$) : [GHGI]
  • Occupancy Per Square Foot (People / ft$^2$) : [OPSFT]
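The intensity calculation can be sketched as follows. This is only an illustration on toy rows, not the code inside initial_clean; in particular, "Floor_Area" is a hypothetical name for the square-footage column, and treating a zero floor area as missing is my own assumption:

```python
import numpy as np
import pandas as pd

# Toy rows; "Floor_Area" is a hypothetical name for the square-footage column.
df = pd.DataFrame({
    "Nat_Gas":    [4_000_000.0, 900_000.0],
    "Elec_Use":   [2_000_000.0, 1_000_000.0],
    "Floor_Area": [100_000.0, 0.0],
})

# Treat a zero floor area as missing so the division yields NaN instead of inf.
area = df["Floor_Area"].replace(0, np.nan)

df["NGI"] = df["Nat_Gas"] / area    # kBtu per square foot
df["EI"] = df["Elec_Use"] / area    # kBtu per square foot
print(df[["NGI", "EI"]])
```

Dividing by floor area puts a 100,000 ft² building and a 10,000 ft² building on the same footing, which is the point of using intensities rather than totals.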

I wrote a basic function called initial_clean() to clean the data and create the additional features. We call it on our dataset and then get some basic statistics about the data:

In [3]:
df_2016_2 = initial_clean(df_2016)
temp_cols_to_drop = ['BBL','Street_Number','Zip_Code','Occupancy']

df_2016_2.drop(temp_cols_to_drop, axis=1)\
         .describe()
Energy_Star Site_EUI Nat_Gas Elec_Use GHG Water_Use NGI EI WI GHGI OPSFT
count 9535.000000 11439.000000 1.008700e+04 1.142500e+04 1.147800e+04 7.265000e+03 9870.000000 11206.000000 7261.000000 11258.000000 11311.000000
mean 57.735711 525.733377 2.520461e+07 8.201496e+06 6.952577e+03 2.579751e+04 137.705639 54.266179 0.161268 0.031272 0.001065
std 30.143817 10120.105154 1.194068e+09 1.214643e+08 1.692231e+05 5.860239e+05 7512.527146 1210.530111 2.053453 0.571378 0.000536
min 1.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000
25% 34.000000 65.300000 8.915501e+05 1.045702e+06 3.420250e+02 2.661700e+03 7.324853 13.682696 0.028523 0.004308 0.000629
50% 63.000000 82.400000 4.067600e+06 1.885996e+06 5.198000e+02 4.745600e+03 46.268145 18.482229 0.046098 0.005455 0.001075
75% 83.000000 103.000000 6.919267e+06 4.513704e+06 9.394500e+02 8.057900e+03 68.036285 30.716894 0.073287 0.007003 0.001525
max 100.000000 801504.700000 1.101676e+11 1.047620e+10 1.501468e+07 4.385740e+07 737791.764249 84461.681703 98.340480 39.190314 0.001999

The above table is only a summary of the numerical data in the dataframe. Just looking at the count row we can immediately see that there are a lot of missing values in this data. This tells me that the data will be rather messy, with many columns having NaNs or missing values.

It also looks like there is a lot of variation within this dataset. Just looking at the Site_EUI column, the 75th percentile is 103 kBtu/ft², but the max is 801,504.7 kBtu/ft². This is probably due to the number of different types of buildings in the city, as well as to the fact that the data is self-reported and may contain errors.

The next thing I would like to see is how many of the buildings in NYC are in compliance, according to their Benchmarking Submission Status:

In [4]:
plt.title('DOF Benchmarking Submission Status',fontsize=14)
Text(0, 0.5, 'count')

Most buildings are in compliance with the Department of Finance Benchmarking standards. Let's take a look at the violators:

In [5]:
Violators = df_2016_2[df_2016_2.Benchmarking_Status == 'In Violation']
BBL BINs Street_Number Street_Name Zip_Code Borough Benchmarking_Status Property_Type Year_Built Occupancy ... Site_EUI Nat_Gas Elec_Use GHG Water_Use NGI EI WI GHGI OPSFT
11978 2.051410e+09 NaN 300 BAYCHESTER AVENUE 10475 Bronx In Violation NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11979 3.088400e+09 NaN 3939 SHORE PARKWAY 11235 Brooklyn In Violation NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11980 3.088420e+09 NaN 2824 PLUMB 3 STREET 11235 Brooklyn In Violation NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11981 2.051411e+09 NaN 2100 BARTOW AVENUE 10475 Bronx In Violation NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11982 2.051410e+09 NaN 312 BAYCHESTER AVENUE 10475 Bronx In Violation NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 21 columns

There's not much we can learn from this, but we can check whether certain zip codes have more buildings in violation. The first thing we do is group by zip code and count the rows to get the number of violations per zip code:

In [6]:
zips_df = Violators.groupby('Zip_Code')['Zip_Code'].size()\
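For readers who want to reproduce this step, one way to turn a group-wise count into a two-column dataframe is sketched below on toy data. The counts column name matches the one used later in this post, but the exact chain of calls is an assumption rather than the original code:

```python
import pandas as pd

# Toy stand-in for the Violators dataframe.
violators = pd.DataFrame({"Zip_Code": [10475, 11235, 11235, 10475, 10475]})

# size() counts rows per group; reset_index(name=...) turns the resulting
# Series into a dataframe with a named count column.
zips = (violators.groupby("Zip_Code")
                 .size()
                 .reset_index(name="counts"))
print(zips)
```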

Now we want to visualize the number of violators per zip code. To make things interesting we will create an interactive choropleth map using the Bokeh library. Bokeh is a great visualization tool that I have used in the past. We get the shapes for New York City zip codes as a geojson file from this site. The geojson file can be read into a dataframe using GeoPandas.

In [7]:
import geopandas as gpd
gdf = gpd.read_file("data/nyc-zip-code-tabulation-areas-polygons.geojson")

# GeoPandas doesn't allow users to convert the datatype while reading it in so we do it here
gdf["postalCode"] = gdf["postalCode"].astype(int)

We can see the basic contents of the GeoPandas dataframe:

In [8]:
OBJECTID postalCode PO_NAME STATE borough ST_FIPS CTY_FIPS BLDGpostal @id longitude latitude geometry
0 1 11372 Jackson Heights NY Queens 36 081 0 -73.883573 40.751662 POLYGON ((-73.86942457284177 40.74915687096788...
1 2 11004 Glen Oaks NY Queens 36 081 0 -73.711608 40.745366 POLYGON ((-73.71132911125308 40.74947450816085...
2 3 11040 New Hyde Park NY Queens 36 081 0 -73.703443 40.748714 POLYGON ((-73.70098278625547 40.73889569923034...
3 4 11426 Bellerose NY Queens 36 081 0 -73.724004 40.736534 POLYGON ((-73.72270447144122 40.75373371438336...
4 5 11365 Fresh Meadows NY Queens 36 081 0 -73.794626 40.739903 POLYGON ((-73.81088634744756 40.7271718757592,...

I noticed only a few of the zip codes had actual names, so I wrote a script to scrape this website to obtain each neighborhood's name. I pickled the results so we could use them here:

In [9]:
zip_names = pd.read_pickle("data/neighborhoods.pkl")

We can attach them to our GeoPandas dataframe by joining on zip code:

In [10]:
gdf = gdf.drop(['PO_NAME'],axis=1)\
         .merge(zip_names, on="postalCode",how="left")\

Next, we'll left join our count of violators per zip code, zips_df, to the above dataframe and fill in the zip codes that do not have violations with zeros:

In [11]:
gdf= gdf.merge(zips_df, how="left", left_on="postalCode", right_on="Zip_Code")\
         .drop(["OBJECTID","Zip_Code"], axis=1)\

postalCode STATE borough ST_FIPS CTY_FIPS BLDGpostal @id longitude latitude geometry PO_NAME counts
0 11372 NY Queens 36 081 0 -73.883573 40.751662 POLYGON ((-73.86942457284177 40.74915687096788... West Queens 5.0
1 11004 NY Queens 36 081 0 -73.711608 40.745366 POLYGON ((-73.71132911125308 40.74947450816085... Southeast Queens 0.0
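The left join and zero-fill pattern can be sketched on toy data as follows. The frames here are stand-ins for gdf and zips_df, and the chain of calls is an assumption about how the step might be written, not the original cell:

```python
import pandas as pd

# Stand-ins for the zip code shapes and the per-zip violation counts.
zones = pd.DataFrame({"postalCode": [11372, 11004, 10475]})
counts = pd.DataFrame({"Zip_Code": [11372, 10475], "counts": [5, 3]})

# A left join keeps every zip code; those without violations get NaN,
# which fillna then replaces with zero.
merged = (zones.merge(counts, how="left", left_on="postalCode", right_on="Zip_Code")
               .drop(columns="Zip_Code")
               .fillna({"counts": 0}))
print(merged)
```

Note that the count column ends up as floats (5.0, 0.0, ...) because the intermediate NaNs force a float dtype, which matches the output shown above.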

Now before we can use Bokeh to visualize our data we must first convert the GeoPandas dataframe to a format that Bokeh can work with. Since I already covered this in a previous blog post I won't go over the details, but here I used a slightly modified version of the function from that post:

In [12]:
bokeh_source = convert_GeoPandas_to_Bokeh_format(gdf)

Next we set Bokeh's io module to output to the notebook and use the function I wrote, make_interactive_choropleth_map, to create the in-notebook zip code choropleth map:

In [53]:
from bokeh.io import output_notebook, show

# We get the min and max of the number of violations to give the choropleth a scale.
max_num_violations = zips_df['counts'].max()
min_num_violations = zips_df['counts'].min()

fig = make_interactive_choropleth_map(bokeh_source,
                                      count_var = "Number Of Violations",
                                      min_ct    = min_num_violations,
                                      max_ct    = max_num_violations)
Loading BokehJS ...

You can hover your mouse over each zip code and the map will display the neighborhood name and number of violations. From this we can see that the Chelsea, Downtown Brooklyn and Long Island City neighborhoods have the highest numbers of violations.

The fact that different neighborhoods have different numbers of violating buildings raises the suspicion that neighborhood may be correlated with a building's energy usage. One possible explanation is that owners who are in violation own multiple buildings on a single lot or in a single neighborhood.

Now let's move back to analyzing the buildings that are not in violation. First let's see the distribution of all buildings across different ranges of the Energy Star rating:

In [14]:
bins = [0,10,20,30,40,50,60,70,80,90,100]

                            .plot(kind    = 'bar',
                                  rot     = 35,
                                  figsize = (10,4),
                                  title   = 'Frequency of ENERGY STAR Ratings')
plt.xlabel('ENERGY STAR Score')
Text(0.5, 0, 'ENERGY STAR Score')
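The binning behind a plot like this can be sketched with pd.cut. The scores below are made up, and this is only one way to count scores per range, not necessarily how the original cell did it:

```python
import pandas as pd

# Made-up scores for illustration.
scores = pd.Series([1, 8, 10, 45, 63, 83, 100], name="Energy_Star")
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# pd.cut uses right-closed intervals by default, so a score of 10 falls in (0, 10].
binned = pd.cut(scores, bins=bins)

# Sorting by index puts the intervals in bin order rather than by frequency.
frequency = binned.value_counts().sort_index()
print(frequency)
```

Calling .plot(kind='bar') on a Series like frequency would then give one bar per score range.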

We can see that the majority are within the 50-100 range, but almost 1,000 buildings have scores between 0 and 10. Let's take a look at the distribution of building types. We will just take the top 10 most common building types for now.

In [15]:
                          .plot(kind     = 'bar',
                                figsize  = (10,4.5),
                                fontsize = 12,
                                rot      = 60)
plt.title('Frequency of building type', fontsize=13)
plt.xlabel('Building Type', fontsize=13)
plt.ylabel('Frequency', fontsize=13)
Text(0, 0.5, 'Frequency')

The most common buildings in NYC are multifamily housing, then offices, other, hotels and, somewhat surprisingly, non-refrigerated warehouse space. I would have thought that there would be more schools and retail spaces than warehouses or dormitories in New York City, but I don't know what the Primary BBL listing is.

Let's look at the Energy Star ratings of buildings across different building types, but first, how many different building types are there? We can find out:

In [16]:
print("Number of building types are: {}".format(len(df_2016_2['Property_Type'].unique())))
Number of building types are: 54

That is too many building types to visualize the Energy Star score (Energy_Star) of each, so we'll look at just 5 building types, lumping the 54 into the following categories:

  • Residential
  • Office
  • Retail
  • Storage
  • Other

I built a function called group_property_types(...) to group the buildings into the 5 types above, and we use it below to transform the Pandas Series:

In [17]:
Property_Type = df_2016_2.copy()
Property_Type['Property_Type'] = Property_Type['Property_Type'].apply(group_property_types)
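The body of group_property_types is not shown in this post. A hypothetical version, using keyword rules that I made up purely for illustration, might look like this:

```python
def group_property_types_sketch(property_type):
    """Map a detailed property type to one of five coarse groups (illustrative only)."""
    if not isinstance(property_type, str):
        return "Other"          # missing values (NaN) land in the catch-all group
    name = property_type.lower()
    if "multifamily" in name or "residen" in name:
        return "Residential"
    if "office" in name:
        return "Office"
    if "retail" in name or "store" in name:
        return "Retail"
    if "warehouse" in name or "storage" in name:
        return "Storage"
    return "Other"


for label in ["Multifamily Housing", "Office", "Non-Refrigerated Warehouse", "Hotel"]:
    print(label, "->", group_property_types_sketch(label))
```

A function with this shape can be passed straight to Series.apply, as in the cell above; the actual grouping rules used for the plots below may differ.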

Now we can look at the Energy_Star score of each of the grouped building types:

In [18]:
bins2 = [0,20,35,50,65,80,100]

Energy_Star_Scores = Property_Type.groupby(['Property_Type'])['Energy_Star']


plt.title('Frequency of Energy Star Score by building type',fontsize=14)
plt.xlabel('Building Type and Energy Star', fontsize=13)
plt.ylabel('Frequency', fontsize=13)
Text(0, 0.5, 'Frequency')

Overall it looks like residential buildings have a much larger proportion of low-scoring Energy Star buildings than office buildings do. This is probably because there are many more older residential buildings than office spaces in New York City. We'll look at the distribution of the years in which buildings of just two property types, 'Multifamily Housing' and 'Office', were built:

In [19]:

It seems like it's the opposite of what I thought, but the number of residential buildings is much higher, and the majority were built right before and right after World War II, as well as in the 2000s. The same is true of offices, but without the uptick in the early 2000s.

Let's just focus on the multifamily housing and see what we can find out about them since they may offer the best return on investment in terms of improving energy efficiency.

Analysis Of Multifamily Buildings

First let's look at the summary statistics of just the multifamily housing:

In [20]:
Multifamily_Buildings = df_2016_2[df_2016_2['Property_Type'] == 'Multifamily Housing']

Multifamily_Buildings.drop(temp_cols_to_drop, axis=1)\
                     .describe()
Energy_Star Site_EUI Nat_Gas Elec_Use GHG Water_Use NGI EI WI GHGI OPSFT
count 7513.000000 8654.000000 7.914000e+03 8.643000e+03 8.669000e+03 5.499000e+03 7704.000000 8432.000000 5497.000000 8457.000000 8487.000000
mean 56.629842 405.938456 2.991518e+07 4.536019e+06 4.093800e+03 1.193724e+04 164.160356 31.980654 0.107960 0.021024 0.001122
std 30.477155 9520.852245 1.347821e+09 5.199372e+07 8.535570e+04 8.786550e+04 8502.255602 495.657557 0.905276 0.470851 0.000517
min 1.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000
25% 32.000000 67.900000 1.249754e+06 9.817544e+05 3.419000e+02 3.146700e+03 10.052411 13.248499 0.035663 0.004370 0.000696
50% 61.000000 82.900000 4.429401e+06 1.539532e+06 4.928000e+02 5.100200e+03 52.311449 16.720334 0.051203 0.005358 0.001149
75% 83.000000 101.100000 7.227774e+06 3.000909e+06 8.164000e+02 8.046100e+03 70.215899 22.614394 0.077178 0.006656 0.001558
max 100.000000 801504.700000 1.101676e+11 3.729784e+09 5.860677e+06 3.638813e+06 737791.764249 35863.305400 52.143200 39.190314 0.001999

We again see large variations in the energy data, with most values falling in a narrow range and at least one extreme outlier. Comparing multifamily housing to all buildings in NYC (previous table), the mean and standard deviation of most energy, water and emission metrics are lower for multifamily housing. Natural gas (Nat_Gas and NGI) is the exception, with a higher mean and standard deviation than for buildings overall.

Now let's take a look at how the performance metrics relate to one another by plotting the correlation matrix. But first, since we have so much missing data, let's find the total number of multifamily buildings and the number of multifamily buildings without missing data.

In [21]:
cols_to_drop = ['BBL','BINs','Street_Number','Street_Name', 'Occupancy',

X = Multifamily_Buildings.drop(cols_to_drop,axis=1)
X_clean = X.dropna()

print("Total Multifamily Buildings: {}".format(X.shape[0]))
print("Total Multifamily Buildings without missing data: {}".format(X_clean.shape[0]))
Total Multifamily Buildings: 8699
Total Multifamily Buildings without missing data: 4407

About half of the multifamily buildings have missing data, which is significant. Let's plot the correlation matrix to see how correlated our features are across all the multifamily buildings. Note that we first standardize the data.

In [22]:
X_s = (X - X.mean())/X.std()

fig, ax = plt.subplots(figsize=(8,6))  
<matplotlib.axes._subplots.AxesSubplot at 0x1a1aca40b8>

We can see that:

  • Natural gas usage is fairly strongly correlated with greenhouse gas emission rates, which makes sense.
  • Energy usage intensity is strongly correlated with natural gas intensity, which again makes sense, since gas is a primary form of heating.

What doesn't make sense to me is that the Energy Star score is only weakly correlated with any of the measures of energy, water or emissions. This is strange because a higher Energy Star score is supposed to reflect more efficient use of energy and water. Furthermore, the Energy Star score goes up (slightly) when the density of occupants increases, yet the Energy Star rating should be independent of occupancy density since it is a property of the physical building and not the building's tenants.
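As a side note on the standardization used for the correlation matrix above, here is a small sketch on synthetic data (the numbers are generated, not taken from the benchmarking set). It shows the same z-score formula and checks that the Pearson correlations themselves do not change under it, so the scaling mainly matters for putting features on a common footing later on:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ngi = rng.gamma(shape=2.0, scale=25.0, size=200)          # synthetic gas intensity
site_eui = 30.0 + 0.9 * ngi + rng.normal(0.0, 5.0, 200)   # synthetic EUI tied to gas use
X = pd.DataFrame({"NGI": ngi, "Site_EUI": site_eui})

# Column-wise z-scores: subtract the mean and divide by the standard deviation.
X_s = (X - X.mean()) / X.std()

# Pearson correlation is unaffected by shifting and rescaling each column.
print(np.allclose(X.corr(), X_s.corr()))
```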

We can see how the results change when we only use multifamily building data that do not have missing values:

In [23]:
# I don't want to scale the years since I will eventually treat them as a categorical variable
year_built         = X_clean['Year_Built'].to_frame()
X_clean_wo_year    = X_clean.drop('Year_Built',axis=1)

X_s_clean  = (X_clean_wo_year  - X_clean_wo_year.mean())/X_clean_wo_year.std()

# Now get the year built back with a left join
X_s_clean  = X_s_clean.merge(year_built,
                             left_index  = True,
                             right_index = True,
                             how         = "left")

fig, ax = plt.subplots(figsize=(8,6))
<matplotlib.axes._subplots.AxesSubplot at 0x1a1aa732e8>

The previously mentioned correlations are now stronger, but there is still too weak a correlation between energy star score and energy or water usage for my liking. We'll have to dig deeper into the data to see if there are outliers that are affecting our correlation matrix.

Removing Outliers

In this section we'll be looking at scatter plots of various features against the Site_EUI to try to identify outliers in the data.

Let's look at the scatter plot of natural gas intensity and energy usage intensity to see if we can observe outliers that may be affecting our correlation matrix. The reason we are doing so is that we suspect energy usage intensity should be highly correlated to natural gas intensity since natural gas is used for cooking and heating.

In [56]:
<seaborn.axisgrid.PairGrid at 0x1a22d48c50>

We can see that there are some significant outliers in our data. Experimenting with different cutoff values, I was able to remove them and reveal a clearer relationship between natural gas usage and EUI:

In [55]:
X_s_Nat_Gas = X_s_clean[(X_s_clean.NGI < 0.02) & (X_s_clean.Site_EUI < 0.01)]

<seaborn.axisgrid.PairGrid at 0x1a22d14278>
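To see why a handful of extreme points can mask a relationship, here is a small sketch on made-up numbers. The cutoffs are arbitrary and chosen only for this toy example; they are not the thresholds used above:

```python
import pandas as pd

# Four well-behaved rows plus one extreme natural gas value.
data = pd.DataFrame({
    "NGI":      [10.0, 20.0, 30.0, 40.0, 5000.0],
    "Site_EUI": [40.0, 52.0, 58.0, 71.0, 90.0],
})

# Keep only rows below made-up cutoffs; the thresholds are for illustration only.
mask = (data["NGI"] < 100.0) & (data["Site_EUI"] < 1000.0)
trimmed = data[mask]

print(data["NGI"].corr(data["Site_EUI"]))        # correlation with the outlier
print(trimmed["NGI"].corr(trimmed["Site_EUI"]))  # correlation without it
```

In this toy case the correlation rises once the extreme row is dropped, which is the same effect we are looking for in the real data.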