Contents Link to heading


  1. Introduction

  2. Introduction To Function Approximation

  3. Introduction To Regression

  4. Linear Solvers For Least Squares Regression

    • Cholesky Factorization For Normal Equations

    • Singular Value Decomposition

    • Controlling For Overfitting With Regularization

    • Implementation in Scikit-learn

  5. A Touch Of Recommendation Systems

  6. Where To Go From Here

Introduction Link to heading


In this blog post we’ll go over applications of numerical linear algebra in machine learning, starting with regression and ending with modern recommender systems! Numerical linear algebra (and numerical analysis more generally) was one of those courses that I learned, thought was boring, and never wanted to study again. Only with the maturity that comes with age (and a PhD) was I able to understand and appreciate the true power of numerical linear algebra. In fact, understanding (distributed) linear algebra is probably one of the most important and useful tools I have ever learned. It has allowed me to contribute to open source libraries for scientific computing and to understand how big data and machine learning systems work. The reason numerical linear algebra is so important is that it allows us to approximate functions. In scientific computing and machine learning one is interested in how to approximate a function $f(x)$. Numerical analysis and statistics concern themselves with how good our approximation to $f(x)$ is.

Traditionally, algorithms in numerical linear algebra are taught and learned in MATLAB for ease of learning, but written in Fortran and C/C++ for high-performance production use. The fields of Data Science, Machine Learning and Artificial Intelligence have seen a recent boom and have given new life and new applications to these algorithms. These fields lean toward Python (though R is often used as well) as the primary programming language, but make heavy use of wrappers around BLAS and LAPACK libraries written in Fortran and C/C++.

Many Data Scientists and Machine Learning Researchers also make heavy use of the Jupyter notebook as a tool for rapid prototyping and writing experiments in Python (though it also allows for the use of other languages like R, Julia and Scala). The Jupyter notebook is an interactive environment for computing that runs in your web browser and is very similar to Mathematica’s notebook. The fact that it runs in a web browser is quite convenient, as it allows users to easily work on a remote computer without any complicated setup. See this post about how to get set up with Python & Jupyter notebooks on Google Cloud with all the dependencies installed for this post.

Introduction To Function Approximation Link to heading


Let’s now move on to function approximation. One learns in calculus to represent smooth functions $f : \mathbb{R} \to \mathbb{R}$ about some point $x_0$ using special polynomials called power series or Taylor series:

$$ f(x) \, = \, \sum_{n=0}^{\infty} \, a_n (x - x_0)^{n} $$

We can find the coefficients to the series by the equation,

$$ a_{n} \; = \; \frac{1}{n!} \, f^{(n)}(x_0) $$

We observed that there is a radius of convergence $R \, \in \, \mathbb{R}$ such that within $(x_0 - R, x_0 + R)$ (one should test the end points too) the series converges to $f(x)$, and outside that interval it does not necessarily converge to $f(x)$.

We then learned that we can approximate our function $f(x)$ with the finite degree polynomial,

$$ f(x) \, \simeq \, f_N(x) \, = \, a_0 \, + \, a_1 (x - x_0) \, + \, a_2 (x - x_0)^2 \, + \ldots \, + \, a_N (x - x_0)^N $$

We can then use that approximation to perform calculations on our function that might not have been possible otherwise. The most memorable example for me was learning how to integrate a Gaussian. In practice, though, numerical integration is usually preferred for definite integrals.
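As a quick illustration (a sketch of my own, not part of the original discussion), here is how a truncated Taylor polynomial for $e^{x}$ about $x_0 = 0$, where $a_n = 1/n!$, compares with NumPy's exp as we increase the degree $N$:

import numpy as np
from math import factorial

def taylor_exp(x, N):
    # degree-N Taylor polynomial of exp(x) about x0 = 0, where a_n = 1/n!
    return sum(x**n / factorial(n) for n in range(N + 1))

x = 1.5
for N in (2, 5, 10):
    approx = taylor_exp(x, N)
    print(f"N = {N:2d}: f_N(x) = {approx:.8f}, error = {abs(np.exp(x) - approx):.2e}")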

Power series were originally developed to approximate solutions to differential equations. As science and engineering progressed, the differential equations became more complicated and harder to solve. Other approximation methods were invented, like the Fourier series:

$$ f(x) \, \simeq \, f_N(x) \, = \, \frac{a_0}{2} \, + \sum_{k=1}^{N} \left(a_k \cos \left(\frac{2 \pi k x}{L} \right) + b_k \sin \left(\frac{2 \pi k x}{L} \right) \right) $$

We remark that the functions $\cos\left(\frac{2 \pi k x}{L}\right)$ and $\sin\left(\frac{2 \pi k x}{L}\right)$ form an orthogonal basis for the Hilbert space $L^{2}([-L/2, L/2])$. The coefficients of the approximation are determined by projecting the function onto the basis functions:

$$ a_n \; = \; \frac{2}{L} \, \int_{-L/2}^{L/2} \, f(x) \, \cos \left(\frac{2 \pi n x}{L} \right) \, dx, \qquad b_n \; = \; \frac{2}{L} \, \int_{-L/2}^{L/2} \, f(x) \, \sin \left(\frac{2 \pi n x}{L} \right) \, dx $$

The convergence of the Fourier series is a much more subtle issue. While the Taylor series was guaranteed to converge uniformly, the regularity conditions imposed to achieve this were quite strong. The class of functions in $L^{2}([-L/2, L/2])$ is quite broad, and depending on the regularity of our specific function we will get different modes of convergence.

Let’s derive the second equation for $b_n$ by using a “projection” onto the basis function $\sin \left(\frac{2 \pi n x}{L} \right)$,

$$
\begin{aligned}
\int_{-L/2}^{L/2} \sin \left(\frac{2 \pi n x}{L} \right) \, \left[\frac{a_0}{2} \, + \sum_{k=1}^{N} \left(a_k \cos \left(\frac{2 \pi k x}{L} \right) + b_k \sin\left(\frac{2 \pi k x}{L} \right) \right) \right] \, dx \; &= \; \int_{-L/2}^{L/2} \sin \left(\frac{2 \pi n x}{L} \right) \, f(x) \, dx \\
b_n \, \int_{-L/2}^{L/2} \sin^2 \left(\frac{2 \pi n x}{L} \right)\, dx \; &= \; \int_{-L/2}^{L/2} \sin \left(\frac{2 \pi n x}{L} \right) \, f(x) \, dx \\
b_n \, \frac{L}{2} \; &= \; \int_{-L/2}^{L/2} \, f(x) \, \sin \left(\frac{2 \pi n x}{L} \right) \, dx
\end{aligned}
$$

Which finally becomes,

$$ b_n \; = \; \left(\frac{L}{2} \right)^{-1} \int_{-L/2}^{L/2} \, f(x) \, \sin \left(\frac{2 \pi n x}{L} \right) \, dx $$
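To make the projection step concrete, here is a small sketch (my own addition) that computes a few Fourier coefficients of $f(x) = x^2$ over one period using SciPy's numerical integration routine quad; the sine coefficients come out as (numerically) zero since $x^2$ is even:

import numpy as np
from scipy.integrate import quad

L = 2.0
f = lambda x: x**2

def a(n):
    # a_n = (2/L) * integral of f(x) cos(2 pi n x / L) over one period
    return (2 / L) * quad(lambda x: f(x) * np.cos(2 * np.pi * n * x / L), -L / 2, L / 2)[0]

def b(n):
    # b_n = (2/L) * integral of f(x) sin(2 pi n x / L) over one period
    return (2 / L) * quad(lambda x: f(x) * np.sin(2 * np.pi * n * x / L), -L / 2, L / 2)[0]

print([round(a(n), 4) for n in range(4)])      # cosine coefficients a_0, ..., a_3
print([round(b(n), 4) for n in range(1, 4)])   # sine coefficients b_1, ..., b_3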

The above example illustrates the two steps in our approximation methods:

  1. Projecting the function (or its values) onto a finite dimensional space

  2. Solving for the coefficients corresponding to each of the basis functions

The last step is usually not as easy as in the example above and often requires solving a system of linear equations:

$$ A x \, = \, b $$

The need to solve a system of linear equations occurs most often because the basis for our finite dimensional space is not orthogonal. If the basis were orthogonal, then the matrix would be diagonal and we could invert it by hand, leading to equations like the Fourier coefficient equations.
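For example, here is a small sketch (my own illustration) of projecting $f(x) = e^{x}$ onto the non-orthogonal monomial basis $\{1, x, x^2\}$ sampled on a grid; because the basis is not orthogonal, the Gram matrix is not diagonal and we have to solve a small linear system for the coefficients:

import numpy as np

# evaluate the monomial basis {1, x, x^2} and the target function on a grid
x = np.linspace(0, 1, 200)
B = np.column_stack([np.ones_like(x), x, x**2])
f = np.exp(x)

A = B.T @ B               # Gram matrix: not diagonal since the basis is not orthogonal
b = B.T @ f               # projection of f onto the basis
coeffs = np.linalg.solve(A, b)
print(coeffs)             # coefficients of the quadratic approximation to exp(x)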

Each of these steps in our approximation method introduces its own errors:

  1. The error in your model approximation.

  2. The error in solving the linear system of equations.

The first error is best summarized by the bias-variance trade-off (see Section 2.2 in this book as well). By this I mean that by assuming our function has only $p$ features we introduce a bias into our model, and as we increase the number of features we will most likely increase the variance of our model. The second error is due to rounding errors and numerical stability. We won’t touch on either of these topics too much, but mention them for completeness.

With the advent of the computer and seemingly cheap and unlimited computational resources, numerical methods like finite difference and finite element methods were invented to approximate solutions to differential equations. The finite element method is one that is particularly dear to my heart and has concepts that have proved useful in understanding models in statistics and machine learning, particularly Generalized Additive Models.

In the next section we’ll go into the basics of linear regression models, which are a subset of the larger class of Generalized Additive Models.

Introduction To Regression Link to heading


Linear regression is generally the first statistical modeling technique one learns. It involves relating some target $y$ that is a continuous random variable to $p$ “features” represented as a vector $\textbf{x}_{i} = [1, x_{1}, x_{2}, \ldots, x_{p}] \in \mathbb{R}^{p+1}$ using coefficients $\boldsymbol \omega = [\omega_0, \omega_1, \omega_2, \ldots, \omega_p] \in \mathbb{R}^{p+1}$. The coefficient $\omega_0$ is called the “intercept.” The features can take continuous or non-negative integer values and are generally considered to be non-random. The relationship between the target values and the feature values is described by the model,

$$ y_{i} \; = \; \boldsymbol \omega^{T} \textbf{x}_{i} + \epsilon_{i} $$

where the $\epsilon_{i} \, \sim \, N(0,1)$ are independent and normally distributed with mean 0 and variance 1, and $\boldsymbol \omega \, \in \mathbb{R}^{p+1}$ are the unknown coefficients. The hypothesis function for the linear regression model is then,

$$ h_{\boldsymbol \omega}( \textbf{x}_{i}) \; = \; \boldsymbol \omega^{T} \textbf{x}_{i} $$

Fitting the model then amounts to finding the coefficients $\boldsymbol \omega$ such that $h_{\boldsymbol \omega}(\textbf{x}_{i})$ best approximates $y_i$ for $i=1, \ldots, n$. This model is often called ordinary least squares, and its estimator converges in probability to the true coefficients.

In general we will have many distinct measurements $\textbf{x}_{i}$ corresponding to values $y_i$ that are collected into a training set, $D_{\text{Train}} \, = \, \left\{ (\textbf{x}_{i}, \, y_{i}) \right\}_{i=1}^{n}$. This can be represented in the form $D_{\text{Train}} \, = \, (\textbf{X}_{\text{Train}}, \textbf{y}_{\text{Train}})$, where $\textbf{X}_{\text{Train}} \in \mathbb{R}^{n \times (p+1)}$ is the design matrix and $\textbf{y}_{\text{Train}} \in \mathbb{R}^{n}$ is the corresponding vector of target values. We also have a test set, $D_{\text{Test}} \, = \, \left\{ (\textbf{x}_{i}, \, y_{i}) \right\}_{i=1}^{m}$, which can be represented in the form $D_{\text{Test}} \, = \, (\textbf{X}_{\text{Test}}, \textbf{y}_{\text{Test}})$, where $\textbf{X}_{\text{Test}} \in \mathbb{R}^{m \times (p+1)}$ and $\textbf{y}_{\text{Test}} \in \mathbb{R}^{m}$ is the corresponding vector of target values. The training set is the set of points we use to determine $\boldsymbol \omega$, and the test set is used for evaluating the performance of our model. For ease of notation we will simply drop the $+1$ from $p+1$, but the reader should always remember there is an intercept. Let’s make things concrete using some example data.

The data we will use (data.csv) comes from the NYC Mayor’s Office of Sustainability and contains energy efficiency measurements of multi-family housing units in New York City. We will use some of the measurements to predict the greenhouse gas emissions of the buildings; you can read more about how this data was cleaned here. The file has the following columns:

  • NGI : Natural Gas Use Intensity (kBtu/ft²)
  • EI : Electricity Use Intensity (kBtu/ft²)
  • WI : Water Use Intensity (kgal/ft²)
  • Site_EUI : Site Energy Usage Intensity (kBtu/ft²)
  • GHGI : Greenhouse Gas Emissions Intensity (Metric Tons CO2e/ft²)

We can import the data using Pandas, which gives us a fast way to read in data from CSV files:

import pandas as pd

# columns we care about
cols = ['NGI', 'EI', 'WI', 'Site_EUI', 'GHGI']

# read the data
df = pd.read_csv("data.csv",
                 usecols=cols)

# view the first 5 rows
df.head()

|   | Site_EUI | NGI | EI | WI | GHGI |
|---|----------|-----|----|----|------|
| 0 | -0.037249 | -0.016434 | -0.012849 | -0.104832 | -0.037384 |
| 1 | -0.037921 | -0.017372 | -0.007096 | -0.040264 | -0.037782 |
| 2 | -0.033047 | -0.013440 | 0.019025 | -0.047608 | -0.031731 |
| 3 | -0.034623 | -0.012073 | -0.026548 | -0.118878 | -0.034398 |
| 4 | -0.033804 | -0.013676 | 0.008066 | -0.077564 | -0.032913 |

We split our dataframe into a training and test set:

df_test  = df.sample(frac=0.25, random_state=42)
df_train = df.drop(df_test.index, axis=0)

Our target variable is then GHGI, and the design matrix is made up of the rest of the columns:

import numpy as np

# Design matrix
X = df_train[['NGI','EI','WI','Site_EUI']].values

# Target vector
y = df_train[["GHGI"]].values[:,0]

# Test design matrix
X_test  = df_test[['NGI','EI','WI','Site_EUI']].values

# Test target vector
y_test  = df_test[["GHGI"]].values[:,0]

print("X = \n", X)
print()
print("X.shape = ", X.shape)
print()
print("y = \n", y)
print()
print("X_test.shape = ", X_test.shape)
X = 
 [[-0.0164345  -0.01284852 -0.1048316  -0.0372486 ]
 [-0.01737222 -0.00709637 -0.04026423 -0.03792081]
 [-0.01344039  0.01902526 -0.04760821 -0.0330473 ]
 ...
 [-0.01034947 -0.03023676  0.00994682 -0.03592519]
 [-0.00979158 -0.03166883 -0.00486112 -0.0356311 ]
 [-0.00874449 -0.02223016 -0.05749552 -0.03100967]]

X.shape =  (3702, 4)

y = 
 [-0.03738401 -0.03778169 -0.03173064 ... -0.03306691 -0.03265591
 -0.03082855]

X_test.shape =  (1234, 4)

$X$ and $y$ are now a NumPy matrix and array respectively, and the same goes for $X_{\text{test}}$ and $y_{\text{test}}$. NumPy is a Python library that has highly optimized data structures and algorithms for linear algebra. In fact, Pandas is written on top of NumPy.

We can then look at the relationship, or correlation, between each of the features and the target variable by looking at their scatter plots using a function I wrote:

from plotting import plot_results
%matplotlib inline

plot_results(df_test)

(Figure: scatter plots of GHGI against each of the features.)

These scatter plots show us the correlation between GHGI and the different features. The far right panel shows us that as Site_EUI increases, GHGI increases linearly on average. This makes sense, as the more energy a building uses the more greenhouse gas it produces. The same is true of natural gas (NGI), as it is used for heating and cooking; if it goes up, so should the greenhouse gas emissions. Less intuitive may be electricity usage. We might not expect that as our electricity usage increases our greenhouse gas emissions increase; however, this could be because the electricity is generated from gas or coal. The fact that this is less intuitive could also be why the shape is less linear than for Site_EUI. The amount of water we use should not affect greenhouse gas emissions (except maybe for heating hot water). This is reflected in the scatter plot for WI, which does not show any distinct pattern.

Let’s now return to linear regression. Our goal is to approximate the target variable $y_i$ with a function $h_{\boldsymbol \omega}(\textbf{x}_{i})$. This function is a linear combination of our features,

$$ h_{\boldsymbol \omega}(\textbf{x}_{i}) \, = \, \boldsymbol \omega^{T} \textbf{x}_{i} $$

We find the solution vector, or coefficients, $\boldsymbol \omega$ by minimizing the cost function $J \left( \boldsymbol \omega \right)$,

$$
\begin{aligned}
\hat{\boldsymbol \omega} \; &= \; \min_{\boldsymbol \omega} \, \frac{1}{2} \sum_{i=1}^{n} \left( y_{i} - h_{\boldsymbol \omega}( \textbf{x}_{i}) \right)^{2} \\
&= \; \min_{\boldsymbol \omega} \, \frac{1}{2} \Vert \textbf{y} - h_{\boldsymbol \omega}(\textbf{x}) \Vert^{2} \\
&= \; \min_{\boldsymbol \omega} \, J \left( \boldsymbol \omega \right)
\end{aligned}
$$

Minimizing $J \left( \boldsymbol \omega \right)$ is equivalent to setting $\nabla J \left( \boldsymbol \omega \right) \; = \; 0$. We can expand the cost function,

$$
\begin{aligned}
J(\boldsymbol \omega ) \; &= \; \frac{1}{2} \, \Vert \textbf{y} - \textbf{X} \boldsymbol \omega \Vert^{2} \\
&= \; \frac{1}{2} \, \left( \textbf{y} - \textbf{X} \boldsymbol \omega \right)^{T} \left( \textbf{y} - \textbf{X} \boldsymbol \omega \right) \\
&= \; \frac{1}{2} \, \left( \textbf{y}^{T} \textbf{y} - \textbf{y}^{T} \textbf{X} \boldsymbol \omega - \boldsymbol \omega^{T} \textbf{X}^{T} \textbf{y} + \boldsymbol \omega^{T} \textbf{X}^{T} \textbf{X} \boldsymbol \omega \right)
\end{aligned}
$$

Taking the gradient of both sides we then have (for a review of matrix calculus see here),

$$ \boldsymbol \nabla J(\boldsymbol \omega ) \; = \; -\textbf{X}^{T} \left( \textbf{y} - \textbf{X} \boldsymbol \omega \right) $$

Setting the above equal to zero yields the linear system of equations,

$$ \left(\textbf{X}^{T}\textbf{X}\right) \, \hat{\boldsymbol \omega} \; = \; \textbf{X}^{T} \textbf{y} $$

The above formulation is the so-called “Normal Equations”. We can rewrite them as,

$$ S \, \hat{\boldsymbol \omega} \; = \; \textbf{b} $$

where $\textbf{b} \, \in \mathbb{R}^{p}$ and $S \, = \, \textbf{X}^{T}\textbf{X} \, \in \mathbb{R}^{p \times p}$. When the features have been scaled (as we have assumed), the matrix $S$ is the covariance matrix

$$ \left(\textbf{X}^{T}\textbf{X}\right)_{i,j} \; = \; \text{Cov}(x_{i}, x_{j}) $$

The vector $\textbf{b} \, = \, \textbf{X}^{T} \textbf{y}$ is the projection of the target values onto the finite dimensional space spanned by the features. Defining the residual $\textbf{r}$ as,

$$ \textbf{r} \; = \; \textbf{y} - \textbf{X} \boldsymbol \omega $$

we then note that $\nabla J \left( \hat{\boldsymbol \omega} \right) \; = \; 0$ is the same thing as saying,

$$ X^{T} \, \textbf{r} \; = \; 0 $$

In other words, the solution $\hat{\boldsymbol \omega}$ is the value such that the residual is orthogonal to our feature space, i.e. the residual is statistically uncorrelated with any of the features.

The covariance matrix is symmetric and positive definite, but if two features are highly correlated then two rows in the matrix will be nearly identical and the matrix will be nearly singular. Statistically, having correlated features is bad since it makes it hard to determine the impact any one feature has on the target. Numerically, having correlated features is bad because the condition number of the covariance matrix explodes. The reason is that the condition number is defined as,

$$ \kappa(S) \; = \; \Vert S \Vert_2 \, \Vert S^{-1} \Vert_2 \; = \; \frac{\sigma_{\text{max}}}{\sigma_{\text{min}}} $$

where the $\sigma$ are the singular values from the singular value decomposition. As the matrix $S$ becomes more singular, $\Vert S^{-1} \Vert_2 \, \rightarrow \, \infty$, forcing $\kappa \, \rightarrow \, \infty$. We remark that the condition number affects not only the accuracy of your solution, but also the stability of the solvers, see Lecture 12.
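To see the effect of correlated features on the conditioning, here is a tiny sketch (an added illustration, not part of the original analysis) with two nearly identical columns:

import numpy as np

rng = np.random.default_rng(0)
x1 = np.linspace(0, 1, 100)
x2 = x1 + 1e-6 * rng.normal(size=100)   # nearly identical, i.e. highly correlated, feature

X_corr = np.column_stack([x1, x2])
S_corr = X_corr.T @ X_corr

# the condition number blows up as the two columns become more correlated
print(f"condition number of S: {np.linalg.cond(S_corr):.3e}")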

There are two methods we will go over for solving the over-determined system of equations $X \boldsymbol \omega \, = \, \textbf{y}$:

  1. Solving the Normal Equations using the Cholesky Decomposition

  2. Solving the linear system using the Singular Value Decomposition

However, one can also solve the Normal Equations using an LU factorization or the original system with the QR factorization.

Linear Solvers For Least Squares Regression Link to heading


See Lecture 11 for more detailed discussions.

Linear Solver 1: Cholesky Decomposition Of Normal Equations Link to heading

Let’s first solve the Normal Equations:

$$ S \, \boldsymbol \omega \; = \; X^{T} \textbf{y} $$

Since our covariance matrix $S$ is symmetric positive definite, we can use the Cholesky decomposition:

$$ S = L L^{T} $$

where $L$ is a lower triangular matrix. Solving the system from the normal equations then becomes,

$$
\begin{aligned}
S \, \boldsymbol \omega \; &= \; X^{T} \textbf{y} \\
L L^{T} \, \boldsymbol \omega \; &= \; X^{T} \textbf{y}
\end{aligned}
$$

Which we can rewrite as first solving,

$$ L \textbf{z} \; = \; X^{T} \textbf{y} $$

Then solve,

$$ L^T \boldsymbol \omega \; = \; \textbf{z} $$

As $L$ is a lower triangular matrix, each of these linear systems is simple to solve, and the forward/backward substitutions used to solve them are backward stable.

Let’s try this out by first analyzing the covariance matrix

# Build the covariance matrix
X_t = X.transpose()
S   = X_t.dot(X)

print("S = \n", S)
S = 
 [[ 0.74547278  1.35358196  2.7794211   1.72619886]
 [ 1.35358196  4.09897316  5.16118203  3.39404027]
 [ 2.7794211   5.16118203 16.53112563  6.69851448]
 [ 1.72619886  3.39404027  6.69851448  4.2625346 ]]

Let’s take a look at the eigenvalues to understand the conditioning of our matrix:

eigs = np.linalg.eigvals(S)
print(eigs)
[22.1809954   2.73222512  0.03890524  0.6859804 ]
cond_num = np.sqrt( eigs.max() / eigs.min())
print(f"Condition number = {cond_num}")
Condition number = 23.87736746652693

The eigenvalues are all positive and the condition number is not too big, which is good!

We will now be using many SciPy function calls. SciPy is a sister library to NumPy and supports many scientific algorithms. SciPy’s linear algebra routines expand on NumPy’s and even allow for the use of sparse matrices and vectors. SciPy acts as a wrapper around LAPACK; indeed, the eigenvalue calculator above directly calls LAPACK’s eigenvalue decomposition.

Let’s now perform the Cholesky Decomposition using SciPy:

from scipy import linalg

# Perform the Cholesky factorization
L = linalg.cholesky(S, lower=True)
L
array([[0.86340766, 0.        , 0.        , 0.        ],
       [1.56772058, 1.28110317, 0.        , 0.        ],
       [3.21912954, 0.08936547, 2.48200412, 0.        ],
       [1.99928602, 0.2027303 , 0.0984836 , 0.46324014]])

We can check to make sure that $L L^T \, = \, S$:

L.dot(L.T)
array([[ 0.74547278,  1.35358196,  2.7794211 ,  1.72619886],
       [ 1.35358196,  4.09897316,  5.16118203,  3.39404027],
       [ 2.7794211 ,  5.16118203, 16.53112563,  6.69851448],
       [ 1.72619886,  3.39404027,  6.69851448,  4.2625346 ]])

Then we use the matrix $L$ to solve the normal equations using SciPy’s solve_triangular (which solves a lower or upper triangular linear system). This function directly calls LAPACK’s implementation of forward/backward substitution:

# solve for the coefficients
z     = linalg.solve_triangular(L,   X_t.dot(y), lower=True)
omega = linalg.solve_triangular(L.T, z, lower=False)

print(f"omega : {omega}")
omega : [-0.2376972   0.03247505  0.01313829  1.02531297]

We can now get the predicted values from $\hat{y} \, = \, X \hat{\boldsymbol \omega}$ on the test set and plot the results against the true values y_test:

# Get the predicted values on the test set
df_test["y_ols"] = X_test.dot(omega)

plot_results(df_test, 'y_ols')

(Figure: scatter plots comparing the true GHGI values and the least squares predictions y_ols.)

Not bad! Before moving on, I should note that SciPy also implements a Cholesky solver that directly calls LAPACK’s routine.
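Here is a quick sketch of what that looks like with SciPy's cho_factor and cho_solve helpers (reusing the S, X_t and y defined above; this should reproduce the same coefficients):

from scipy.linalg import cho_factor, cho_solve

# factor S once, then solve L L^T omega = X^T y in a single call
c, low = cho_factor(S, lower=True)
omega_chol = cho_solve((c, low), X_t.dot(y))

print(f"omega : {omega_chol}")

With that, let’s move on to solving the system of equations with the singular value decomposition.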

Linear Solver 2: The Singular Value Decomposition (SVD) Link to heading

Let’s solve the overdetermined system of equations,

$$ X \, \boldsymbol \omega \; = \; \textbf{y} $$

using the Singular Value Decomposition (SVD):

$$ X \; = \; U \Sigma V^{T} $$

where $U \, \in \mathbb{R}^{n \times n}$ and $V \, \in \mathbb{R}^{p \times p}$ are unitary matrices and $\Sigma \, \in \mathbb{R}^{n \times p}$ is a diagonal matrix with the singular values along the diagonal. We can substitute the SVD into our system of equations to obtain,

$$
\begin{aligned}
X \, \boldsymbol \omega \; &= \; \textbf{y} \\
U \Sigma V^{T} \boldsymbol \omega \; &= \; \textbf{y} \\
U^{T} U \Sigma V^{T} \boldsymbol \omega \; &= \; U^{T} \, \textbf{y} \\
\Sigma V^{T} \boldsymbol \omega \; &= \; U^{T} \, \textbf{y}
\end{aligned}
$$

We then split this into two systems of equations:

$$ \Sigma \, \textbf{z} \; = \; U^{T} \, \textbf{y} $$

and

$$ V^{T} \boldsymbol \omega \; = \; \textbf{z} $$

The last system of equations can be transformed using the fact that $V$ is unitary ($V V^{T} \, = \, I$):

$$ \boldsymbol \omega \; = \; V \textbf{z} $$

We can then solve the original system of equations $X \, \boldsymbol \omega \; = \; \textbf{y}$ by:

  1. Finding the SVD of the design matrix $X \; = \; U \Sigma V^{T}$, then

  2. Inverting the diagonal matrix to solve $\textbf{z} \; = \; \Sigma^{-1} \, U^{T} \, \textbf{y}$, then

  3. Performing the matrix-vector multiplication $\boldsymbol \omega \; = \; V \textbf{z}$

We can perform the SVD using SciPy (which to no surprise calls LAPACK’s routine):

U, Sig, Vt = linalg.svd(X, full_matrices=False)

We can then look at the shapes of each of the resulting matrices:

U.shape, Sig.shape, Vt.shape
((3702, 4), (4,), (4, 4))

And then perform the steps described above to find $\boldsymbol \omega$,

z = (1 / Sig) * U.transpose().dot(y)

omega = Vt.transpose().dot(z)

print(f"omega : {omega}")
omega : [-0.2376972   0.03247505  0.01313829  1.02531297]

The Cholesky factorization and the SVD give the same solution!

One very nice thing about linear regression is that it is very easy to interpret: each coefficient tells us by how much the target variable increases when the corresponding feature increases by one unit. Therefore increases in the site’s energy usage intensity (Site_EUI) increase the GHGI the most, which intuitively makes sense. However, the model is also telling us that increases in natural gas usage (NGI) decrease the GHGI, which doesn’t make sense. This most likely happened because the features are correlated (see the off-diagonal values in the covariance matrix $S$) or because the model is over-fitting, i.e. giving too much weight to a specific feature.

Controlling For Overfitting With Regularization Link to heading


Oftentimes a linear regression will give us good results on the training set, but very bad results on an independent (test) dataset. This is an example of overfitting. We can control for overfitting by introducing regularization of the form,

$$ J(\boldsymbol \omega ) \; = \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_{i} - h_{\boldsymbol \omega}(\textbf{x}_{i}) \right)^{2} + \frac{\lambda}{q} \sum_{i=1}^{p} \vert \omega_{i} \vert^{q} $$

where $\lambda \, \geq \, 0$. For $q=2$ this is called Ridge Regression and for $q=1$ it is called Lasso Regression. Ridge Regression serves to control for over-fitting by shrinking the parameters $\omega_{i}$ as $\lambda$ increases. Not every individual parameter must shrink as $\lambda$ increases, but in aggregate $\Vert \boldsymbol \omega \Vert^{2}_{\ell_2}$ decreases. We note that both methods seek to reduce the variance in the model, and subsequently increase the bias.

Remark: In neither method do we consider penalizing the value $\omega_0$.

In the case of Ridge Regression we can view the cost function as,

$$ \hat{\boldsymbol \omega} \; = \; \min_{\boldsymbol \omega} \Vert \textbf{y} - \textbf{X} \boldsymbol \omega \Vert^{2}_{2} + \lambda \Vert \boldsymbol \omega \Vert^{2}_{2} $$

Which after expanding the values and setting the derivative to zero yields,

$$ \hat{\boldsymbol \omega} \; = \; \left(\textbf{X}^{T} \textbf{X} + \lambda I \right)^{-1} \left(\textbf{X}^{T} \textbf{y} \right) $$

We can see that the regularization term acts to remove the singularity from the covariance matrix if there is correlation among features. Using the SVD we arrive at:

$$
\begin{aligned}
\left(\textbf{X}^{T}\textbf{X} + \lambda I \right) \, \boldsymbol \omega \; &= \; \textbf{X}^{T} \textbf{y} \\
\left( V \Sigma^{T} \Sigma V^{T} + \lambda I \right) \, \boldsymbol \omega \; &= \; \left(V \Sigma^{T} U^{T} \right) \, \textbf{y} \\
V \left( \Sigma^{T} \Sigma + \lambda I \right) V^{T} \, \boldsymbol \omega \; &= \; \left(V \Sigma^{T} U^{T} \right) \, \textbf{y} \\
\left( \Sigma^{T} \Sigma + \lambda I \right) V^{T} \, \boldsymbol \omega \; &= \; \Sigma^{T} U^{T} \, \textbf{y}
\end{aligned}
$$

We can transform the last equation to the system of equations:

$$ \left( \Sigma^{T} \Sigma + \lambda I \right) \textbf{z} \; = \; \Sigma^{T} U^{T} \, \textbf{y} $$

with the corresponding equation $V^{T} \, \hat{\boldsymbol \omega} \; = \; \textbf{z}$, which can be rewritten using the unitary property of $V$ as the matrix-vector multiplication:

$$ \hat{\boldsymbol \omega} \; = \; V \textbf{z} $$

The matrix $\left( \Sigma^{T} \Sigma + \lambda I \right) \, \in \mathbb{R}^{p \times p}$ is diagonal and can be inverted by hand. This gives us a way of solving the Ridge Regression problem for a given $\lambda$:

  1. Finding the SVD $X \; = \; U \Sigma V^{T}$, then

  2. “Inverting” the diagonal matrix to obtain $\textbf{z} \; = \; \left( \Sigma^{T} \Sigma + \lambda I \right)^{-1} \, \Sigma^{T} \, U^{T} \, \textbf{y}$, then

  3. Performing the matrix-vector multiplication $\hat{\boldsymbol \omega} \; = \; V \textbf{z}$

We remark that we can combine steps 2 and 3 to obtain the formula:

$$ \hat{\boldsymbol \omega} \; = \; V \, \left( \Sigma^{T} \Sigma + \lambda I \right)^{-1} \, \Sigma^{T} \, U^{T} \, \textbf{y} $$

And then our predictions $\hat{y} \; = \; X \, \hat{\boldsymbol \omega}$ become:

$$
\begin{aligned}
\hat{y} \; &= \; X \, \hat{\boldsymbol \omega} \\
&= \; U \, \Sigma \, \left( \Sigma^{T} \Sigma + \lambda I \right)^{-1} \, \Sigma^{T} \, U^{T} \, \textbf{y} \\
&= \; U \, D \, U^{T} \, \textbf{y}
\end{aligned}
$$

As you may have learned, the matrix $U$ is unitary, so it acts as a change of basis transformation, and the matrix $D \, = \, \Sigma \, \left( \Sigma^{T} \Sigma + \lambda I \right)^{-1} \, \Sigma^{T} \, \in \mathbb{R}^{n \times n}$ is a diagonal matrix that either shrinks or elongates along this new basis as shown below:

From https://upload.wikimedia.org/wikipedia/commons/b/bb/Singular-Value-Decomposition.svg

The values along the diagonal of $D$ take the form,

$$ D_{i,i} \; = \; \frac{\sigma_{i}^{2}}{\sigma_{i}^{2} + \lambda} $$

From this formulation we see that $D$ shrinks the influence of singular vectors with small singular values ($\sigma$ small) more than those with large singular values ($\sigma$ large)!
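As a quick check (an added snippet reusing the Sig array computed earlier), we can look at the shrinkage factors $\sigma_i^2 / (\sigma_i^2 + \lambda)$ for our design matrix at a modest value of $\lambda$:

# shrinkage factors D_{i,i} = sigma_i^2 / (sigma_i^2 + lambda) for lambda = 0.1
lam = 0.1
print(Sig ** 2 / (Sig ** 2 + lam))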

Let’s see how our code will look for solving the Ridge Regression problem using the SVD.

# using function annotations in Python 3 
# (see, https://www.python.org/dev/peps/pep-3107/)
import scipy as sp
def ridge_svd(
    X         : np.ndarray, 
    y         : np.ndarray, 
    reg_param : float = 1.0
) -> np.ndarray:
    """
    Solves the Ridge Regression problem using the SVD.
    
    Params:
    -------
    X : The matrix of features
    y : The target vector
    reg_param : The regularization parameter
    
    Returns:
    --------
    omega : The array of coefficients for the solution
    """
    
    # compute the SVD
    U, Sig, Vt   = sp.linalg.svd(X, full_matrices=False)
    
    # form Sigma^{T} Sigma + lambda I (stored as a 1-D array of its diagonal entries)
    Sig_plus_reg = Sig * Sig + reg_param

    # find z by "inverting" the matrix and then get omega
    z     = (1.0 / Sig_plus_reg) * Sig * U.T.dot(y)
    omega = Vt.T.dot(z)

    return omega

Let’s test this with $\lambda \, = \, 0$, which should give us the least squares regression solution:

print("omega: ", ridge_svd(X, y, 0.0))
omega:  [-0.2376972   0.03247505  0.01313829  1.02531297]

It does! Let’s try adding a little regularization and see what happens.

omega = ridge_svd(X, y, 0.1)
print(f"omega: {omega}")
omega: [0.14161464 0.07955296 0.03040791 0.78857811]

We can see that adding regularization shrank the coefficient of the last feature (Site_EUI) and turned the coefficient for NGI positive, which makes sense! Let’s take a look at the results:

df_test["y_ridge"] = X_test.dot(omega)

plot_results(df_test, 'y_ols', 'y_ridge')

(Figure: scatter plots comparing the true GHGI values with the least squares (y_ols) and ridge (y_ridge) predictions.)

It’s hard to say whether least squares or ridge regression is better from these figures, but the coefficients for Ridge Regression make much more sense. Notice that I have left out how to find the regularization parameter $\lambda$. That topic is out of scope for this post. However, for reference, the best value for $\lambda$ is often chosen via cross-validation, see these slides for more information.

Implementation In Scikit-Learn Link to heading

Scikit-Learn is the industry standard open source machine learning library in Python. As it is open source, we can easily see how the developers wrote their linear solvers by going to the library’s GitHub.

For regression, Scikit-learn mostly acts as a wrapper around SciPy, which in turn acts as a wrapper around BLAS/LAPACK. The regular linear regression implementation in Scikit-learn utilizes SciPy’s least squares solver, which, if the matrix is dense, calls BLAS/LAPACK. If the matrix $X$ is sparse then it calls SciPy’s sparse least squares solver, which uses the LSQR (least squares with QR factorization) algorithm. Note that while $X \, \in \mathbb{R}^{n \times p}$ may be a large sparse matrix, the covariance matrix $S \, \in \mathbb{R}^{p \times p}$ is typically quite small and dense.
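As a sanity check, here is a short sketch of the Scikit-learn equivalent of our ordinary least squares fit (using fit_intercept=False since our hand-rolled solution did not include an intercept column); it should reproduce the coefficients we computed above:

from sklearn.linear_model import LinearRegression

# ordinary least squares via scikit-learn's wrapper around SciPy's least squares solver
lr = LinearRegression(fit_intercept=False)
lr.fit(X, y)
print(lr.coef_)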

For ridge regression, unless the user specifies the linear solver to use, Scikit-learn solves the linear system with the Stochastic Average Gradient method if the intercept is needed; otherwise it uses the Cholesky decomposition for dense matrices and the Conjugate Gradient method for sparse matrices (though all of the algorithms are written to handle dense or sparse matrices). The conjugate gradient method makes use of SciPy’s sparse CG function, which does not need to explicitly form the matrix, but rather only needs its action on a vector, making it extremely efficient for sparse matrices. If any of these methods fail, the least squares QR (LSQR) solver is used.
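Similarly, a sketch of Ridge Regression in Scikit-learn, where alpha plays the role of $\lambda$ and the solver argument lets us pick, for example, the SVD or Cholesky path described above; with fit_intercept=False it should agree with our ridge_svd function:

from sklearn.linear_model import Ridge

# ridge regression with the same regularization strength as our ridge_svd example
ridge = Ridge(alpha=0.1, fit_intercept=False, solver="svd")
ridge.fit(X, y)
print(ridge.coef_)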

A Touch Of Recommendation Systems Link to heading


Recommendation systems have become very popular and are one of the best examples of machine learning that we interact with daily, e.g. on Amazon and Netflix. One popular technique for making recommendations is called Alternating Least Squares.

The idea is that we have some table of ratings of items by users as shown below:

From https://recsysjd.wordpress.com/2016/09/10/matrix-factorization-techniques-for-recommender-systems

However, not all users will have rated all items, and we want to predict how a user may feel about an item they have not rated. This is how recommendations happen: through the prediction of how a user might rate an item they have not seen before!

Let’s take, for example, the idea of recommending movies to users of a web service. The idea is that a given user’s preference for a movie is determined by their preference for the genres that make up the movie. For simplicity let’s say movies are a combination of the genres (so-called latent factors in machine learning):

  • Horror
  • Action
  • Comedy

Then we can write both a user’s preferences and a movie as a linear combination of these genres:

$$
\begin{aligned}
\text{Sally} \; &= \; 1 \times \text{Horror} + 2 \times \text{Action} - 0.2 \times \text{Comedy} \\
\text{Weekend At Bernie's} \; &= \; 0 \times \text{Horror} + 1 \times \text{Action} + 5 \times \text{Comedy} \\
\text{Sally's Rating of Weekend At Bernie's} \; &= \; 1 \times 0 + 2 \times 1 - 0.2 \times 5 \\
&= \; 1 \, \text{Star}
\end{aligned}
$$

The above shows how Sally would rate “Weekend at Bernie’s” given that she likes horror and action films but dislikes comedies, and that “Weekend at Bernie’s” is a little bit of action and a lot of comedy. If we label $R_{u,i}$ as user $u$’s rating of movie $i$, then $R_{u,i}$ is the inner product of the vector of movie coefficients $\textbf{m}_{i}$ and the vector of user preferences $\textbf{p}_{u}$:

$$ R_{u,i} \; = \; \textbf{m}_{i}^{T} \textbf{p}_{u} $$

We can then factorize the review matrix $R$ into a product of movies $M$ and users $P$ as depicted in the diagram above,

$$ R \; \simeq \; M^{T} P $$

If $R \in \mathbb{R}^{n \times k}$ and we have $f$ latent factors that make up a movie, then the movie factors are $M \in \mathbb{R}^{f \times n}$ and the user preferences are $P \in \mathbb{R}^{f \times k}$. How do we find the coefficients for each user and each movie, i.e. how do we find the matrices $M$ and $P$?

The idea is that we can set up a cost function almost like Ridge Regression and minimize with respect to $M$ and $P$:

$$ \min_{M,P} \, J(M,P) \; = \; \frac{1}{2} \Vert R - M^{T} P \Vert^{2}_{F} + \frac{\lambda}{2} \left( \Vert M \Vert^{2}_{F} + \Vert P \Vert^{2}_{F} \right) $$

where $\Vert \cdot \Vert_{F}$ is the Frobenius norm. How do we minimize for both $M$ and $P$? Well, one way is to:

  1. Randomly initialize $M$ and $P$, then

  2. Fix $M$ and solve for $P$ such that $\nabla_{P} J(M,P) \; = \; 0$

  3. Fix $P$ and solve for $M$ such that $\nabla_{M} J(M,P) \; = \; 0$

  4. Repeat steps 2 and 3 until convergence or the maximum number of iterations is reached

At each stage in steps 2 and 3 you are updating the choice of $M$ and $P$ by essentially solving a least squares Ridge Regression problem for $P$ and $M$, as sketched in the code below. The alternating nature of this iterative algorithm gives rise to its name: Alternating Least Squares! Some links to good references for Alternating Least Squares are listed at the end of the post.
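Below is a minimal dense ALS sketch of my own to make the iteration concrete. Note that it treats every entry of $R$ as observed; real recommender implementations only sum over the observed ratings (and often add per-user and per-item weighting), so take this only as an illustration of the alternating ridge solves:

import numpy as np

def als(R, n_factors=3, reg=0.1, n_iters=20, seed=42):
    # factor R (movies x users) as M^T P by alternating ridge solves
    rng = np.random.default_rng(seed)
    n_movies, n_users = R.shape
    M = rng.normal(size=(n_factors, n_movies))   # movie factors
    P = rng.normal(size=(n_factors, n_users))    # user factors
    I = reg * np.eye(n_factors)
    for _ in range(n_iters):
        # fix M, solve the ridge problem for P:  (M M^T + reg I) P = M R
        P = np.linalg.solve(M @ M.T + I, M @ R)
        # fix P, solve the ridge problem for M:  (P P^T + reg I) M = P R^T
        M = np.linalg.solve(P @ P.T + I, P @ R.T)
    return M, P

# toy example: 5 movies rated by 4 users
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5],
              [3, 3, 3, 3]], dtype=float)

M, P = als(R)
print(np.round(M.T @ P, 1))   # reconstructed ratings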

Where To Go From Here Link to heading


We’ve gone over a few topics in this blogpost and I want to supply the reader with some links for more information on the topics discussed.

Numerical Linear Algebra Link to heading

Regression & Machine Learning Link to heading

Recommendation Systems Link to heading