The NYC Benchmarking Law requires owners of large buildings to annually measure their energy and water consumption in a process called benchmarking. The law standardizes this process by requiring building owners to enter their annual energy and water use in the U.S. Environmental Protection Agency's (EPA) online tool, ENERGY STAR Portfolio Manager® and use the tool to submit data to the City. This data gives building owners about a building's energy and water consumption compared to similar buildings, and tracks progress year over year to help in energy efficiency planning. In this blog post we will analyze how buildings in New York City use energy and water and make recommendations on how to improve their performance. The source code for this project can be found here.
Benchmarking data is also disclosed publicly and can be found here. I analyzed the 2016 data and my summary of the findings and recommendations for reducing energy consumption in New York City buildings are discussed below.
The 2016 data is very messy and a lot of cleaning was necessary to do analysis on it. There are about 13,223 buildings recorded in this dataset and many of them have missing data values. The cleaning process was made more difficult because the data was stored as strings with multiple non-numeric values which made converting the data to its proper type a more involved process.
As mentioned before this data was rather messy, I created functions to clean the data and also a plotting function that rather long. I kept these functions in the files
Plotting_Functions.py respecively. We import these functions along with other libraries and read in the data file below:
import pandas as pd import matplotlib.pyplot as plt import seaborn as sns %matplotlib inline from Cleaning_Functions import * from Plotting_Functions import * df_2016 = pd.read_excel("nyc_benchmarking_disclosure_data_reported_in_2016.xlsx")
There are about 13,233 buildings with different types of energy usage, emissions and other information. I'll drop a bunch of these features since there are so many and only keep the following,
Where the terms in the square brackets are the column names in the dataframe. I wrote a basic function to clean the data feature names and drop the columns called
initial_clean(). We call it on our data and then get some basic statistics about the data:
df_2016_2 = initial_clean(df_2016) temp_cols_to_drop = ['BBL','Street_Number','Zip_Code'] df_2016_2.drop(temp_cols_to_drop, axis=1)\ .describe()
The above is only a summary of the numrical data in the dataframe. Just looking at the
count column we can immediately see that there are a lot of missing valus in this data. This tells me that this data will be rather messy with many columns having NaNs or missing values.
It also looks like there is a lot of variation within this dataset. Just looking at the
Site_EUI statistic, the 75th percentile is is 103 (kBtu/ft²), but the max is 801,504.7 (kBtu/ft²). This probably due to the number of different types of buildings in the city.
The next thing I would like to see is how many of the buildings in NYC are passing the Benchmarking Submission Status:
plt.figure(figsize=(5,4)) df_2016_2['Benchmarking_Status'].value_counts()\ .plot(kind='bar') plt.xticks(rotation=0) plt.title('DOF Benchmarking Submission Status')
<matplotlib.text.Text at 0x1104be910>
Most buildings are in compliance with the Department of Finance Benchmarking standards. Let's take a look at the violators:
Violators = df_2016_2[df_2016_2.Benchmarking_Status == 'In Violation'] Violators.head()
|11978||2.051410e+09||NaN||300.0||BAYCHESTER AVENUE||10475.0||Bronx||In Violation||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN|
|11979||3.088400e+09||NaN||3939.0||SHORE PARKWAY||11235.0||Brooklyn||In Violation||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN|
|11980||3.088420e+09||NaN||2824.0||PLUMB 3 STREET||11235.0||Brooklyn||In Violation||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN|
|11981||2.051411e+09||NaN||2100.0||BARTOW AVENUE||10475.0||Bronx||In Violation||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN|
|11982||2.051410e+09||NaN||312.0||BAYCHESTER AVENUE||10475.0||Bronx||In Violation||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN|
There's not much we can learn from this, if we had more time we could look to see if certain zip codes have more buildings in violation.
Let's see the distributution of all buildings that are in different ranges of the
Energy Star ratings:
bins = [0,10,20,30,40,50,60,70,80,90,100] df_2016['ENERGY STAR Score'].value_counts(bins=bins)\ .sort_index()\ .plot(kind='bar', title='Frequency of ENERGY STAR Ratings') plt.ylabel('Frequency') plt.xlabel('ENERGY STAR Score')
<matplotlib.text.Text at 0x10ed58210>
We can see that the majority are within the 50-100 range, but a almost 1000 buildings have scores inbetween 0 and 10. Let's take a look at the distribution of building types. We will just take the top 10 most common building types for now..
df_2016_2['Property_Type'].value_counts()\ .head(10)\ .plot(kind='bar', title='Frequency of building type') plt.xlabel('Building Type') plt.ylabel('Frequency')
<matplotlib.text.Text at 0x10fbd5b50>
The most common buildings in NYC are multifamily housing, then offices, then other, hotels and somewhat suprisingly non-refrigerated warehouse space. I would have thought that there would be more schools and retail spaces than warehouses or dormitorites in New York City, but I don't know what the
Primaty BBL listing is.
Let's look at the Energy Star ratings of buildings across different building types, but first how many different building types are there? We can find this out,
print "Number of building types are: " + str(len(df_2016_2['Property_Type'].unique()))
Number of building types are: 54
This is too many building types to visualize the Energy Star Score (
Energy_Star) of each, we'll just look at just 5 building types, lumping the 54 into the categories into either:
I built a function to group the buildings into the 5 types above called
clean_property_type(...) and we use it below to transform the Pandas Series:
Property_Type = df_2016_2.copy() Property_Type['Property_Type']=Property_Type['Property_Type'].apply(group_property_types)
Now we can look at the
Energy_Star (score) of each of the buildings types:
bins2 = [0,20,35,50,65,80,100] Energy_Star_Scores = Property_Type.groupby(['Property_Type'])['Energy_Star'] Energy_Star_Scores.value_counts(bins=bins2)\ .sort_index()\ .plot(kind='bar', title='Frequency of Energy Star Score by building type') plt.xlabel('Building Type and Energy Star') plt.ylabel('Frequency')
<matplotlib.text.Text at 0x1104f3110>
Overall it looks like residential buildings have a lot more proportion of low Energy Star Scoring buildings when compared to office buildings. This is probably because there are much more older residential buildings than office spaces in New York City. We'll look at the distribution of the years in which builings of just properties of type: 'Multifamily Housing' and 'Office' were built:
It seems like it's the opposite of what I thought, but the number of residential buildings is much higher and the majority were built right before and right after the war as well as the the 2000s. The same is true about offices without the uptick in the early 2000s.
Let's just focus on the multifamily housing and see what we can find out about them since they may offer the best return on investment in terms of improving energy efficiency.
First let's look at the summary statistics of just the mulitfamily housing:
Multifamily_Buildings = df_2016_2[df_2016_2['Property_Type'] == 'Multifamily Housing'] Multifamily_Buildings.drop(temp_cols_to_drop,axis=1) .describe()
We can see again, large variations in the energy data, with most of it being between 0 and some fixed number and then atleast one outlier. Comparing multifamily housing to all buildings in NYC (previous table) we can see that all of the mean value and variation of the energy, water and emission rates are lower for multifamily housing buildings than overall buildings in NYC.
Now let's take a look at how the performance features relate to by plotting the correlation matrix. But first since we have so much missing data, let's take at the total number of multifamily buildings and the number of multifamily buildings without missing data.
cols_to_drop = ['BBL','BINs','Street_Number','Street_Name', 'Zip_Code', 'Borough','Benchmarking_Status', 'Property_Type', 'Occupancy'] X = Multifamily_Buildings.drop(cols_to_drop,axis=1) X_clean = X.dropna() print "Total Multifamily Buildings: " + str(X.shape) print "Total Multifamily Buildings without missing data: " + str(X_clean.shape)
Total Multifamily Buildings: 8699 Total Multifamily Buildings without missing data: 4407
About half of the multifamily buildings have missing data, that's significant. Let's plot the correlation matrix to see how correlated are features are on the all the multifamily buildings. We note that we first have to normalize the data.
fig, ax = plt.subplots(figsize=(6,5)) X_s = (X - X.mean())/X.std() sns.heatmap(X_s.corr())
<matplotlib.axes._subplots.AxesSubplot at 0x1044f8e50>
We can see that natural gas usage is fairly strongly correlated to green house emission rates, which makes sense. We can also see that Energy Usage Intensity is strongly correlated with Natural Gas Intensity. What doesn't make sense is that the energy star score is weakly correlated to any of the measures of energy or water usage. This is because the a higher energy star score is supposed to reflect more efficient use of energy and water.
We can see how the results change when we only use multifamily building data that do not have missing values:
fig, ax = plt.subplots(figsize=(6,5)) X_s_clean = (X_clean - X_clean.mean())/X_clean.std() sns.heatmap(X_s_clean.corr())
<matplotlib.axes._subplots.AxesSubplot at 0x1164d73d0>
The previously mentioned correlations are now stronger, but still too weak a correlation between energy star score and energy or water usage for my liking. Let's look at the scatter plot of natural gas usage and energy usage intensity:
sns.pairplot(X_s, x_vars='Nat_Gas', y_vars='Site_EUI', kind='reg', size=5,dropna=True)
<seaborn.axisgrid.PairGrid at 0x115dac990>
We can see that are some signficant outliers in our data. Experimenting with different values i was able to remove them and a clearer relationship between the natural gas usage and EUI:
X_s_Nat_Gas = X_s[X_s.Nat_Gas < 0.4] X_s_Nat_Gas = X_s_Nat_Gas[X_s_Nat_Gas.Site_EUI < 0.2] sns.pairplot(X_s_Nat_Gas, x_vars='Nat_Gas', y_vars='Site_EUI', kind='reg', size=5, dropna=True)
<seaborn.axisgrid.PairGrid at 0x115a36d90>
I did the same for electricity usage and EUI:
X_s_Elec_Use = X_s[X_s.Elec_Use < 2] X_s_Elec_Use = X_s_Elec_Use[X_s_Elec_Use.Site_EUI < 0.1] sns.pairplot(X_s_Elec_Use, x_vars='Elec_Use', y_vars='Site_EUI', kind='reg', size=5, dropna=True)
<seaborn.axisgrid.PairGrid at 0x115d83910>
And did the same for water usage and EUI. While it might not seem water and energy usage could be correlated with energy usage intensity, often water is used for heating and cooling in buildings:
X_s_Water_Use = X_s[X_s.Water_Use < 1] X_s_Water_Use = X_s_Water_Use[X_s_Water_Use.Site_EUI < 0.04] sns.pairplot(X_s_Water_Use, x_vars='Water_Use', y_vars='Site_EUI', kind='reg', size=5, dropna=True)
<seaborn.axisgrid.PairGrid at 0x116540d10>
Now we make a dataframe that has all the removed outliers and we plot the EUI vs. energy and water usage:
X_outliers_removed = X_s[X_s.Nat_Gas < 0.4] X_outliers_removed = X_outliers_removed[X_outliers_removed.Elec_Use < 2] X_outliers_removed = X_outliers_removed[X_outliers_removed.Site_EUI < 0.08] X_outliers_removed = X_outliers_removed[X_outliers_removed.Water_Use < 1] sns.pairplot(X_outliers_removed, x_vars=['Nat_Gas','Elec_Use','Water_Use'], y_vars='Site_EUI', kind = 'reg', size=5, dropna=True)
<seaborn.axisgrid.PairGrid at 0x1167d0950>
The relationship isn't quite as strong as compared to the EUI vs. energy intensity and water intensity below:
sns.pairplot(X_outliers_removed, x_vars=['NGI','EI','WI'], y_vars='Site_EUI', kind = 'reg', size=5, dropna=True)
<seaborn.axisgrid.PairGrid at 0x116c89110>
It look's like EUI is very strongly correlated to the natural gas usage intensity when comparedd to electricity usage. The relationship between EUI and water usage intensity is a little harder to ascertain.
We can also see from the plot below that the site EUI wasnt very correlated to the year that the buildings were constructed, probably because so many were built around the same time period.
sns.pairplot(X_outliers_removed, x_vars='Year_Built', y_vars='Site_EUI', size=5, dropna=True)
<seaborn.axisgrid.PairGrid at 0x1174bab10>
Now that we have removed some outliers and have multifamily buildings that have very similar characteristis we can visualize the correlation matrix to try to gleam some insights into improving these building energy efficiency.
fig, ax = plt.subplots(figsize=(6,5)) sns.heatmap(X_outliers_removed.corr())
<matplotlib.axes._subplots.AxesSubplot at 0x117341890>
This is a much more belivable correlation matrix than the previous two. We can see that the Energy Star score is very negatively correlated with enery usage intensity, which make sense as the energy star score is a measure of energy efficiceny of the building.
We should note that the number of buildings in this case is,
print "Total Multifamily Buildings in reduced data: " + str(X_outliers_removed.shape)
Total Multifamily Buildings in reduced data: 4967
While our procedure does not have substantially more buildings then the case when we drop all the buildings with any missing data, the data on the buildings in this case is much more informative.
We can see that the Energy Star score is negatively correlated with enery usage intensity. This makes sense as the energy star score is a measure of energy efficiceny of the building. The energy usage intensity has a strong positive correlation to the natural gas usage intensity. We can see that natural gas usage intensity and electricity usage intensity are uncorrlated. This implies we could replace one with the other. Therefore reducing natural gas consumption or replacing it with electricity could be the best answer to reducing energy usage intensity. It should also be noted that year the residence was built did not have any correlation with energy usage intensity. This is probably due to the fact that the majority of residential buildings in New York City were built in a similar time period and before energy efficiency was a priority.
Since natural gas usage intensity is most highly correlated with energy usage intensity reducing it could improve building energy efficiency. Some ways to reduce natural gas consumption include:
With more time I would have like to look into more of the features/indicators of peformance of buildings and their affact on energy efficiency. I also would have liked to look at the historical data of each building and identify which buildings have been improving and which have not. I would have also like to have mapped out the different properties to see if there are areas or property owners that have particular issues with energy usage intensity or compliance with the Department Of Finance Benchmarking. I think this one be useful in helping to have a more focused approach to finding buildings with the highest ROI on energy efficiency investments.