3. Analysis Of Multifamily Buildings
4. Predictive Models For Green House Gas Emission
5. Conclusions And Recommendations
I started this project a while back with a goal of taking the 2016 NYC Benchmarking Law data about building energy usage and do something interesting with it. I originally attmpted to clean and analyze this data set to try to find ways to reduce builings' energy usage and subsequently their green house gas emissions. After a few iterations I thought it might be interesting to see if we could additonally predict the emission of green house gases from buildings by looking at their age, energy and water consumption as well as other energy consumption metrics. In the modeling section we look at three different modes for predicting green house gas emissions:
Linear Regression
Generalized Additive Models
Gradient Boosted Regression Trees
In the conclusion section I not only summarize the findings, but give some specific recommendations to reduce the multi-family buildings energy usage.
The NYC Benchmarking Law requires owners of large buildings to annually measure their energy and water consumption in a process called benchmarking. The law standardizes this process by requiring building owners to enter their annual energy and water use in the U.S. Environmental Protection Agency's (EPA) online tool, ENERGY STAR Portfolio Manager® and use the tool to submit data to the City. This data gives building owners about a building's energy and water consumption compared to similar buildings, and tracks progress year over year to help in energy efficiency planning. In this blog post we will analyze how buildings in New York City use energy and water, make recommendations on how to improve their performance and also model their green house gas emissions. The source code for this project can be found here.
Benchmarking data is also disclosed publicly and can be found here. I analyzed the 2016 data and my summary of the findings and recommendations for reducing energy consumption in New York City buildings are discussed in the conclusions section.
The 2016 data is very messy and a lot of cleaning was necessary to do analysis on it. One thing to note is that this is self-reported data, meaning the performance data wasn't collected by a non-bias third party but by the building owners. This means our data is biased and we will keep this in mind while performing our analysis.
There are about 13,223 buildings recorded in this dataset and many of them have missing data values. While there are many different techniques for imputing missing values, there was sufficient number of buildings with all their values that I did not need to impute missing values. The cleaning process was made more difficult because the data was stored as strings with multiple non-numeric values which made converting the data to its proper type a more involved process.
Since the cleaning was more tedious I created external functions do handle this processes. In addition, I also a created functions to handle transforming and plotting the data. I kept these functions in seperate files Cleaning_Functions.py
and Plotting_Functions.py
respecively so as to not clutter the post. We import these functions along with other basic libraries (Pandas, Matplotlib and Seaborn) as well as read in the data file below:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()
from Cleaning_Functions import (initial_clean,
convert_GeoPandas_to_Bokeh_format,
group_property_types)
from Plotting_Functions import (plot_years_built,
make_interactive_choropleth_map)
# Here we specifify a few data types as integers while reading in the Excel
df_2016 = pd.read_excel("data/nyc_benchmarking_disclosure_data_reported_in_2016.xlsx",
converters={'Street Number':int,
'Zip Code':int,
'Year Build':int,
'ENERGY STAR Score':int})
There are about 13,233 buildings with different types of energy usage, emissions and other information. I'll drop a bunch of these features and only keep the following,
The terms in the square brackets are the column names used in the dataframe. In addition, we must do some basic feature engineering. The reported data gives us the metrics (NAT_Gas
, Elec_Use
, GHG
, Water_Use
) in terms of total volume. Using these metrics in comparing buildings of different sizes is not a fair comparison. In order to compare them fairly we must standardize these metrics by dividing by the square footage of the buildings giving us each metrics' intensity. We therefore have the following features,
I wrote a basic function called initial_clean()
. to clean the data create the additional features. We call it on our dataset and then get some basic statistics about the data:
df_2016_2 = initial_clean(df_2016)
temp_cols_to_drop = ['BBL','Street_Number','Zip_Code','Occupancy']
df_2016_2.drop(temp_cols_to_drop, axis=1)\
.describe()
The above table is only a summary of the numrical data in the dataframe. Just looking at the count
column we can immediately see that there are a lot of missing valus in this data. This tells me that this data will be rather messy with many columns having NaNs or missing values.
It also looks like there is a lot of variation within this dataset. Just looking at the Site_EUI
statistic, the 75th percentile is is 103 (kBtu/ft²), but the max is 801,504.7 (kBtu/ft²). This probably due to the number of different types of buildings in the city, as well as the fact that the data is biased due to the fact it is self-reported
The next thing I would like to see is how many of the buildings in NYC are passing the Benchmarking Submission Status:
plt.figure(figsize=(10,4))
df_2016_2['Benchmarking_Status'].value_counts()\
.plot(kind='bar',
fontsize=12,
rot=0)
plt.title('DOF Benchmarking Submission Status',fontsize=14)
plt.ylabel('count',fontsize=12)
Most buildings are in compliance with the Department of Finance Benchmarking standards. Let's take a look at the violators:
Violators = df_2016_2[df_2016_2.Benchmarking_Status == 'In Violation']
Violators.head()
There's not much we can learn from this, if we can look to see if certain zip codes have more buildings in violation. First thing we do is group by the zip codes and count them to get the number of violations per zip code:
zips_df = Violators.groupby('Zip_Code')['Zip_Code'].size()\
.reset_index(name='counts')
Now we want to visualize the the number of violators per zip code. To make things interesting we will create an interactive choropleth map using the Bokeh Library. Bokeh is a great vizualization tool that I have used in the past. We get the shapes for New York City zip codes as a geojson file from this site. The geojson file can be read into a dataframe using GeoPandas.
import geopandas as gpd
gdf = gpd.read_file("data/nyc-zip-code-tabulation-areas-polygons.geojson")
# GeoPandas doesn't allow users to convert the datatype while reading it in so we do it here
gdf["postalCode"] = gdf["postalCode"].astype(int)
We can see the basic contents of the GeoPandas dataframe:
gdf.head()
I noticed only a few of the zipcodes had actual names, so I wrote a script (getNeighborhoodNames.py
) to scrape this website to obtain each neighborhood's name. I pickled the results so we could use them here:
zip_names = pd.read_pickle("data/neighborhoods.pkl")
We can attach them to our GeoPandas dataframe by joining them (on zip code),
gdf = gdf.drop(['PO_NAME'],axis=1)\
.merge(zip_names, on="postalCode",how="left")\
.fillna("")
Next, we'll left join our count of violators-per-zipcode zips_df
to above dataframe and fill in the zip codes that do not have violations with zeros:
gdf= gdf.merge(zips_df, how="left", left_on="postalCode", right_on="Zip_Code")\
.drop(["OBJECTID","Zip_Code"], axis=1)\
.fillna(0)
gdf.head(2)
Now before we can use Bokeh to visualize our data we must first convert the GeoPandas dataframe to a format that Bokeh can work with. Since I already covered this in a previous blog post I won't go over the details, but here I used a slightly modified version of the function from that post:
bokeh_source = convert_GeoPandas_to_Bokeh_format(gdf)
Next we set bokeh io
module to be in the notebook and use the function I wrote make_interactive_choropleth_map
to create the in-notebook zipcode choropleth map:
from bokeh.io import output_notebook, show
output_notebook()
# We get the min and max of the number of violations to give the cloropleth a scale.
max_num_violations = zips_df['counts'].max()
min_num_violations = zips_df['counts'].min()
fig = make_interactive_choropleth_map(bokeh_source,
count_var = "Number Of Violations",
min_ct = min_num_violations,
max_ct = max_num_violations)
show(fig)
You can hover your mouse over the each of the zipcode and the map will display the neighborhood name and number of violations. From this we can see that Chelsea, Downtown Brooklyn and Long Island City neighborhood have the highes number of violations.
The fact that different neighborhoods have different numbers of violating buildings gives us the suspicion that the neighborhood may be correlated with the buildings energy usage, this could be because of building owners that are in voliation owning multiple buildings on a single lot or neighrborhood.
Now let's move back to analyzing the buidlings that are not in violation. First let's see the distributution of all buildings that are in different ranges of the Energy Star
ratings:
bins = [0,10,20,30,40,50,60,70,80,90,100]
df_2016_2['Energy_Star'].value_counts(bins=bins)\
.sort_index()\
.plot(kind = 'bar',
rot = 35,
figsize = (10,4),
title = 'Frequency of ENERGY STAR Ratings')
plt.ylabel('Frequency')
plt.xlabel('ENERGY STAR Score')
We can see that the majority are within the 50-100 range, but a almost 1000 buildings have scores inbetween 0 and 10. Let's take a look at the distribution of building types. We will just take the top 10 most common building types for now..
df_2016_2['Property_Type'].value_counts()\
.head(10)\
.plot(kind = 'bar',
figsize = (10,4.5),
fontsize = 12,
rot = 60)
plt.title('Frequency of building type', fontsize=13)
plt.xlabel('Building Type', fontsize=13)
plt.ylabel('Frequency', fontsize=13)
The most common buildings in NYC are multifamily housing, then offices, other, hotels and somewhat suprisingly non-refrigerated warehouse space. I would have thought that there would be more schools and retail spaces than warehouses or dormitorites in New York City, but I don't know what the Primaty BBL
listing is.
Let's look at the Energy Star ratings of buildings across different building types, but first how many different building types are there? We can find this out,
print("Number of building types are: {}".format(len(df_2016_2['Property_Type'].unique())))
This is too many building types to visualize the Energy Star Score (Energy_Star
) of each, we'll just look at just 5 building types, lumping the 54 into the categories into either:
I built a function to group the buildings into the 5 types above called clean_property_type(...)
and we use it below to transform the Pandas Series:
Property_Type = df_2016_2.copy()
Property_Type['Property_Type'] = Property_Type['Property_Type'].apply(group_property_types)
Now we can look at the Energy_Star
(score) of each of the grouped buildings types:
bins2 = [0,20,35,50,65,80,100]
Energy_Star_Scores = Property_Type.groupby(['Property_Type'])['Energy_Star']
Energy_Star_Scores.value_counts(bins=bins2)\
.sort_index()\
.plot(kind='bar',
figsize=(13,6),
fontsize=13)
plt.title('Frequency of Energy Star Score by building type',fontsize=14)
plt.xlabel('Building Type and Energy Star', fontsize=13)
plt.ylabel('Frequency', fontsize=13)
Overall it looks like residential buildings have a lot more proportion of low Energy Star Scoring buildings when compared to office buildings. This is probably because there are much more older residential buildings than office spaces in New York City. We'll look at the distribution of the years in which builings of just properties of type: 'Multifamily Housing' and 'Office' were built:
plot_years_built(df_2016_2)
It seems like it's the opposite of what I thought, but the number of residential buildings is much higher and the majority were built right before and right after World War 2, as well as in the 2000s. The same is true about offices, however, without the uptick in the early 2000s.
Let's just focus on the multifamily housing and see what we can find out about them since they may offer the best return on investment in terms of improving energy efficiency.
First let's look at the summary statistics of just the mulitfamily housing:
Multifamily_Buildings = df_2016_2[df_2016_2['Property_Type'] == 'Multifamily Housing']
Multifamily_Buildings.drop(temp_cols_to_drop, axis=1)\
.describe()
We can see again, large variations in the energy data, with most of it being between 0 and some fixed number and then atleast one outlier. Comparing multifamily housing to all buildings in NYC (previous table) we can see that all of the mean value and variation of the energy, water and emission rates are lower for multifamily housing buildings than overall buildings in NYC.
Now let's take a look at how the performance metrics relate to one another by plotting the correlation matrix. But first, since we have so much missing data, let's find the total number of multifamily buildings and the number of multifamily buildings without missing data.
cols_to_drop = ['BBL','BINs','Street_Number','Street_Name', 'Occupancy',
'Borough','Benchmarking_Status','Property_Type']
X = Multifamily_Buildings.drop(cols_to_drop,axis=1)
X_clean = X.dropna()
print("Total Multifamily Buildings: {}".format(X.shape[0]))
print("Total Multifamily Buildings without missing data: {}".format(X_clean.shape[0]))
About half of the multifamily buildings have missing data, that's significant. Let's plot the correlation matrix to see how correlated are features are on the all the multifamily buildings. Note that we first have to normalize the data.
X_s = (X - X.mean())/X.std()
fig, ax = plt.subplots(figsize=(8,6))
sns.color_palette("BuGn_r",)
sns.heatmap(X_s.corr(),
linewidths=.5,
cmap="RdBu_r")
We can see that,
What doesn't make sense to me is that the energy star score is weakly correlated to any of the measures of energy, water or emissions. This is strange to me because a higher energy star score is supposed to reflect more efficient use of energy and water. Furthermore, the energy star score goes up (slightly) when the density of occupants increases, and the energy star rating should be independent of occupancy density since it is a property of the physical building and not the buildings' tenants.
We can see how the results change when we only use multifamily building data that do not have missing values:
# I don't want to scale the years since I will eventually treat them as a categorial variable
year_built = X_clean['Year_Built'].to_frame()
X_clean_wo_year = X_clean.drop('Year_Built',axis=1)
X_s_clean = (X_clean_wo_year - X_clean_wo_year.mean())/X_clean_wo_year.std()
# Now get the year built back with a left join
X_s_clean = X_s_clean.merge(year_built,
left_index = True,
right_index = True,
how = "left")
fig, ax = plt.subplots(figsize=(8,6))
sns.heatmap(X_s_clean.corr(),
linewidths=.5,
cmap="RdBu_r")
The previously mentioned correlations are now stronger, but there is still too weak a correlation between energy star score and energy or water usage for my liking. We'll have to dig deeper into the data to see if there are outliers that are affecting our correlation matrix.
In this section we'll be looking at scatter plots of various features against the Site_EUI
to try to identify outliers in the data.
Let's look at the scatter plot of natural gas intensity and energy usage intensity to see if we can observe outliers that may be affecting our correlation matrix. The reason we are doing so is that we suspect energy usage intensity should be highly correlated to natural gas intensity since natural gas is used for cooking and heating.
sns.pairplot(X_s_clean,
x_vars='NGI',
y_vars='Site_EUI',
kind='reg',
size=5,
dropna=True)
We can see that are some signficant outliers in our data. Experimenting with different values i was able to remove them and a clearer relationship between the natural gas usage and EUI:
X_s_Nat_Gas = X_s_clean[(X_s_clean.NGI < 0.02) & (X_s_clean.Site_EUI < 0.01)]
sns.pairplot(X_s_Nat_Gas,
x_vars='NGI',
y_vars='Site_EUI',
kind='reg',
size=5,
dropna=True)
We can see that buildings that have higher natural gas usage per sqft have a higher energy intensity and this makes sense! I repeat the same procedure for electricity usage intensity and EUI since again, we would think they should be positively correlated:
X_s_Elec_Use = X_s_clean[(X_s_clean.EI < 0.5) & (X_s_clean.Site_EUI < 0.01)]
sns.pairplot(X_s_Elec_Use,
x_vars='EI',
y_vars='Site_EUI',
kind='reg',
size=5,
dropna=True)
We again see that buildings with higher electricity per square foot have a higher energy usage intensity.
Now we do the same for water usage intensity and EUI. While it might not seem water and energy usage could be correlated with energy usage intensity, often water is used for heating and cooling:
X_s_Water_Use = X_s_clean[(X_s_clean.WI < 0.2) & (X_s_clean.Site_EUI < 0.01)]
sns.pairplot(X_s_Water_Use,
x_vars='WI',
y_vars='Site_EUI',
kind='reg',
size=5,
dropna=True)
Interestingly, here we see a slighly weaker positive correlation, meaning buildings which use more water per square foot only have marginarly higher energy usage intensity. Lastly we repleat the procedure for green house has emission intensity, since you would think that if a building is more energy intensive it has a much larger carbon foot print.
X_s_GHG = X_s_clean[(X_s_clean.GHGI < 0.02) & (X_s_clean.Site_EUI < 0.01)]
sns.pairplot(X_s_GHG,
x_vars='GHGI',
y_vars='Site_EUI',
kind='reg',
size=5,
dropna=True)
Indeed this is true! Green house gas emissions are highly correlated to energy usage intensity; infact its almost linearly related! This means
Now, lets make a dataframe that has all the removed outliers and the EUI vs. natural gas, electricity, water and green house gas emission intensity below:
X_outliers_removed = X_s_clean[(X_s_clean.NGI < 0.02) &
(X_s_clean.EI < 0.5) &
(X_s_clean.WI < 0.2) &
(X_s_clean.GHGI < 0.02) &
(X_s_clean.Site_EUI < 0.01)]
sns.pairplot(X_outliers_removed,
x_vars=['NGI','EI','WI','GHGI'],
y_vars='Site_EUI',
kind = 'reg',
size=5,
dropna=True)
It look's like EUI is very positively correlated to the green house gas emission intensity and natural gas usage intensity. The relationship between EUI is slightly less strongly correlated to electricity usage intensity and even less so with water usage intensity.
We can also see from the plot below that the site EUI was not very correlated to the year that the buildings were constructed. Most likely this is because so many were built around the same time period.
sns.pairplot(X_outliers_removed,
x_vars='Year_Built',
y_vars='Site_EUI',
size=5,
dropna=True)
Now that we have removed some outliers we now have multifamily buildings that have very similar characteristis. We can visualize the correlation matrix to try to gleam some insights into improving these building energy efficiency.
fig, ax = plt.subplots(figsize=(8,6))
sns.heatmap(X_outliers_removed.corr(),
linewidths=.5,
cmap="RdBu_r")
This is a much more belivable correlation matrix than the previous two we looked at!!
We can see that the Energy Star score is very negatively correlated with enery usage intensity, which make sense as the energy star score is a measure of energy efficiceny of the building. Furthermore the Energy Star score is negatively correlated with electricity, water, and natural gas usage intensity as well as green house gas emission intensity which makes sense. Additionally, we see that Energy Star Score is independent of occupancy per square foot which makes sense since it is a fucntion of the building's efficiency and not the people inside of it.
We should note that the number of buildings in this case is,
print("Total Multifamily Buildings in reduced data: {}".format(X_outliers_removed.shape[0]))
We see that removing outliers only removed 94 buildings or about 2% of the buildings, a small price to pay for getting much more informative data!
Since energy usage intensity is almost linearly correlated with green house gas emissions we should be able to predict the green house gas emissions of a bulding from its energy intensity usage. However, the energy usage intensity might not always be readily available and instead we'll see if we can predict the green house gas emissions directly using the other energy consumption characteristics of the buildings.
Now let's build a predictive model for green house gas Emissions by multifamily homes in New York City. We'll do this at first using a simple linear regression model. While not the best in terms of perdictive performance it is often a best first step since it allows one to interpet the effect each feature has on the predicted green house gas emissions.
Before building a model, we first have to prepare our features and scale them appropriately. This involves dictating which are the continuous variabels and the categorical variables. We define them as,
continuous_vars = ["NGI", "EI", "WI", "GHGI", "OPSFT","Site_EUI"]
df_cont = X_clean[continuous_vars]
categorical_vars = ["Year_Built", "Zip_Code", "Energy_Star"]
df_cat = X_clean[categorical_vars]
Now we scale our continuous variables,
df_cont_s = (df_cont - df_cont.mean() ) / df_cont.std()
We now declare make our categorical variables. Note that we defined the year built and energy star rating as ordered categorical variables since there is a natural ordering for them, however, zipcode has no natural ordering.
df_cat["Zip_Code"] = df_cat["Zip_Code"].astype("category", ordered=False)
df_cat["Year_Built"] = df_cat["Year_Built"].astype("category", ordered=True)
df_cat["Energy_Star"] = df_cat["Energy_Star"].astype("category", ordered=True)
Now we can merge the two dataframes back together on their index
df_s = df_cont_s.merge(df_cat,
left_index=True,
right_index=True)
Now we can filter the oultiers and define our training and test set.
Note that we drop the site EUI since it is highly correlated to other features. Although this is most correlated to green house gas emissions, it might be harder to measure, so we remove it for now to see how well we can do without it.
df_s = df_s[(df_s.NGI < 0.02) &
(df_s.EI < 0.5) &
(df_s.WI < 0.2) &
(df_s.GHGI < 0.02) &
(df_s.Site_EUI < 0.01)]
# target variable is green house gas emission
y = df_s["GHGI"]
# features are all the other
X = df_s.drop(["GHGI", "Site_EUI"], axis=1) #,"Zip_Code","OPSFT","Site_EUI"],axis=1)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
As we stated early we'll start out with a Linear Regression Model since it is simple and interpertable. The form of a linear regression the model assumes that the green hour gas emission intensity (GHGI
) labeld $y$, can be expressed a linear combination,
$$ y \; = \; \theta_{0} + \theta_{1} x_{1} + \theta_{2} x_{2} + \ldots + \theta_{p} x_{p} + \epsilon$$
The $x_{i}$'s are the features for our model (WI
, OPSFT
, etc), and the error $\epsilon \sim N(0,\sigma^{2}$) is normally distributed. We then make a model $h_{\theta}(\textbf{x})$,
$$ h_{\theta}(\textbf{x}) \; = \; \theta_{0} + \theta_{1} x_{1} + \theta_{2} x_{2} + \ldots + \theta_{p} x_{p} $$
And estimate the coefficents ($\theta_{i}$) are estimated using ordinary least squares methods. This leads to the expected value of our preditions being unbiased:
$$ E[ h_{\theta}(\textbf{x})] \; = \; \theta_{0} + \theta_{1} x_{1} + \theta_{2} x_{2} + \ldots + \theta_{p} x_{p} $$
We can easily implement a least squares regression model using Scikit-learn:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
reg = LinearRegression(n_jobs=-1)
reg.fit(X_train, y_train)
We can then evaluate the model performance ($R^{2}$-score) on the test set to see how much variance in the model we are able to explain,
y_pred = reg.predict(X_test)
# Explained variance score:
print("Variance score: {}".format(r2_score(y_test, y_pred)))
We can explain 53% of the variance which definitely leaves room for improvement, but we expected that such a simple model would not perform well. Let's take a look at which features are most important.
for coef in zip(X_train.columns, reg.coef_):
print(coef)
For features that have continuous values the model coefficents tell us that an increase on of their in units results in an incrase in the unit of green house emissions tha is equal to the coefficent. We can see that increasing the electricty, water intensity and occupancy increase the green house gas emissions, which makes sense. However, increases in natural gas intensity usage decreases the green house gas emisssions. This is strange, however, it could also be because of multi-collinarity of the features.
Additionally, I'm not sure how green house gas emissions are calculated, if the calculations include the emissions from power sources this could make sense since electricity often comes from coal which releases more emissions that using natural gas. Therefore switching to natural gas instead of electricity could reduce green house gas emissions.
We can measure the p-values for coefficents by using Scikit-Learns's f_regression function.
from sklearn.feature_selection import f_regression
f_stats, pvals = f_regression(X_train, y_train)
for coef in zip(X_train.columns, pvals):
print("{0}'s p-value : {1}".format(coef[0],coef[1]))
We see that even though the coeficients of the regression model are rather small, their small p-values signifiy that they are still signifcant and should be included in our model.
Let's look at some of the residuals in the continuous features to see if we can find any non-linear patterns that might signal ways improve the model.
# compute the residuals
resid = (y_test - y_pred).to_frame()\
.rename(columns={"GHGI":"Residuals"})
# left join them to the original features
x_resid = X_test.merge(resid, left_index=True, right_index=True)
# plot the residuals
sns.pairplot(x_resid,
x_vars=['NGI','EI','WI','OPSFT'],
y_vars='Residuals',
kind = 'scatter',
size=5,
dropna=True)
A slightly parabolic pattern in the WI
's residuals, but nothing to clear in the others. One way we could improve our model's $R^{2}$ value is to tune the regularization parameter to reduce variance, however, let's move on to a more flexible model and hopefully improve our predictions. A more flexible model will reduce the bias (which we can tell is high since the scatter plots do not show a linear relationship) in our model and therefore improve the overall performance.
Linear regression is a specific type of generalized linear model. Generalized lienar models relate the expected outcome of the target variable, $y$, to the model's features $x_{1}, x_{2}, \ldots x_{p}$ through the equation:
$$ g(E[h_{\theta}(\textbf{x})]) \; = \; \theta_{0} + \theta_{1} x_{1} + \theta_{2} x_{2} + \ldots + \theta_{p} x_{p} $$
The function $g$ is called the link function. For linear regresion the link function is just the identity function, while for logistic regression it is the logit function.
Generalized Additive Models are extension of the generalized addative model that assume that the expected outcome of the target variable $y$ to the features $x_{1}, x_{2}, \ldots x_{p}$ through the equation:
$$ g(E[h_{\theta}(\textbf{x})]) \; = \; \theta_{0} + f_{1}(x_{1}) + f_{2}(x_{2}) + \ldots + f_{p}(x_{p}) $$
The basis functions $f_{i}$ can be functions of a parameteric form such as a polynomial, or more general non-parametric functions. Using basis functions allows for smoother, higher order (more accurate) approximations to the model in parameter space when compared to traditional polynomials. GAMs therefore allow for more accurate predictions than generalized linear models, however, they still allowing for the model to be interperable.
We'll use the PyGam library which uses penalized B splines as basis functions. This choice of basis functions allow us to automatically model non-linear relationships without having to manually try out many different transformations on each variable.
We can import the regression GAM model:
from pygam import LinearGAM
We can then fit our model using 10 splines and using a grid search:
gam = LinearGAM(n_splines=10).fit(X_train, y_train)
PyGam gives a summary report of our model similar to the StatsModels library:
print(gam.summary())
Evaluating the performance of our model on the test set we find an $R^{2}$ score of:
y_pred = gam.predict(X_test)
# Explained variance score:
print("Variance score: {}".format(r2_score(y_test, y_pred)))
The GAM model results in an improvement in the $R^2$ score! We can see that using a more flexible model like a GAM using B-splines is better able to explain the variance in our data when compared to a linear regression model.
We can explore the effect that the number of splines used in our model has on the performance by training the model on various numbers of splines and plotting the learning curves. We write a function to train a model and returns 1 - $R^2$ on the training and test set respectively.
def train_gam_full(X_train, y_train, X_test, y_test, n_splines):
"""
Trains a linear GAM model and returns performance on training and test sets.
:parameters:
-----------
X_train (DataFrame) : Training set of features
y_train (Series) : Training set target values
X_test (DataFrame) : Test set of features
y_test (Series) : Test set target values
n_splines (int) : The number of splines to use in the GAM
:returns:
--------
The tupple of (1-R^2) for the training and test set respectively.
"""
model = LinearGAM(n_splines=n_splines).fit(X_train, y_train)
y_pred = model.predict(X_train)
y_pred2 = model.predict(X_test)
return 1-r2_score(y_train, y_pred), 1-r2_score(y_test, y_pred2)
We now get the results for our model's learning curve using ideas from functional programing.
We transform the above function into one that is only dependent on the number of splines using the partial tool to bind the values the training and test set. This will allow us to return a function that returns 1 - $R^2$ on the training and test set and only depends on the number of splines.
from functools import partial
train_gam = partial(train_gam_full, X_train, y_train, X_test, y_test)
Now we can define the number of spines we want to train on and then use the map function to get the performance metrics and plot the resulting learning curves:
# defined the
n_splines = range(5,21)
# get the r2 values
r2_train, r2_test = zip(*map(train_gam, n_splines))
# plot the learning curves
plt.figure(figsize=(10,6))
plt.plot(n_splines, r2_train, label="Training R^2 Score")
plt.plot(n_splines, r2_test, label="Testing R^2 Score")
plt.ylabel("R^2 Scores",fontsize=13)
plt.xlabel("Number Of Splines",fontsize=13)
plt.ylim([0.2,0.4])
plt.title("Learning Curves For Reg. GAMs Using Diff. Num. Of Splines", fontsize=15)
plt.legend(fontsize=13)
We can see that increasing the number of splines doesnt seem to improve the predictive performance of the model past 10 or 12 splines.
So in order to improve the predictive ability we'll need to turn to a different model.
Let's try another model to see if we cant improve our $R^2$ score for our model. One model that performs quite well is the Gradient Boosted Regression Tree. It is a nonlinear model based on decision trees and is one of the best performing machine learning models avaiable. We can use Scikit-learns gradient boosted regressor:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
GBR = GradientBoostingRegressor()
Now let's run a grid search using cross-valiation to find the optimal parameter values for our model:
# define the parameter values
paramters = {"n_estimators":[50,100,125,150,200],
"max_depth" :[3,5,7],
"loss" :["ls", "lad", "huber", "quantile"]
}
# defind the grid search and optimization metric
grid = GridSearchCV(estimator=GBR,
param_grid=paramters,
scoring="r2",
cv=5,
n_jobs=-1)
# perform the grid search
grid.fit(X_train, y_train)
Now that we have the best model from our grid search over the trainin set let see how it performs on the test set:
y_pred = grid.predict(X_test)
# Explained variance score:
print("Variance score: {}".format(r2_score(y_test, y_pred)))
Again a definite improvement in the $R^2$ score! While we have improved our prediction capabilities, one draw back to more complex models like Gradient Boosted Regressors is that they are less interperable. Despite this draw back, Scikit-learn still allows to the find the relative importance of the features on the outcome of the model:
model = grid.best_estimator_
for coef in zip(X_train.columns, model.feature_importances_):
print(coef)
The most important features for predicting green house gas emissions are the energy star score followed by tnatural gas intensity and the electicity intensity. The zipcode and the year built have the least importance, which from our analysis above makes sense.
That's enough modeling for now! Overall we were able to come with a model that explains over 75% of the variance in the test set data. Not bad!
In this blog post we analyzed the energy usage of multifamily buildings and developed a predictive model for these buildings green house gas emissions. From the final heat map of the correlation matrix we can see that the Energy Star score is negatively correlated with enery usage intensity. This makes sense as the energy star score is a measure of energy efficiceny of the building. The energy usage intensity has a strong positive correlation to the natural gas usage intensity. We can see that natural gas usage intensity and electricity usage intensity are uncorrlated. This implies we could replace one with the other. Therefore reducing natural gas consumption or replacing it with electricity could be the best answer to reducing energy usage intensity and green house gas emissions. It should also be noted that year the residence was built did not have any correlation with energy usage intensity. This is probably due to the fact that the majority of residential buildings in New York City were built in a similar time period and before energy efficiency was a priority.
Since natural gas usage intensity is most highly correlated with energy usage intensity reducing it could improve building energy efficiency and reduce green house gas emissions (so long as the buildings' electricity comes from a clean energy source or atleast one with less emissions than natural gas). Some ways to reduce natural gas consumption include:
In addition, we investigated creating a predictive model of green house gas emissions for multi-family buildings in New York City. The highest performing model was a gradient boosted tree model that had $R^2$ value of 0.76. The most important features for predicting green house gas emissions in this model are the natural gas intensity followed by the energy start scored and the electicity intensity. The zipcode and the year built have the least importance, which from our analysis above makes sense.