September 27, 2021

Ridge, Lasso, and Polynomial Linear Regression

Ridge Regression
Ridge regression learns w and b using the same least-squares criterion, but adds a penalty for large variations in the w parameters.

The addition of this penalty term is called regularization. Regularization is an important concept in machine learning: it is a way to prevent overfitting, and thereby improve the likely generalization performance of a model, by restricting the model’s possible parameter settings and thus reducing model complexity.
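
As a concrete sketch (not code from the course), the quantity ridge regression minimizes can be written as the ordinary sum of squared errors plus an L2 penalty on the weight vector w, scaled by the regularization parameter alpha:

import numpy as np

def ridge_cost(w, b, X, y, alpha):
    # Ordinary least-squares term: sum of squared prediction errors
    residuals = y - (X @ w + b)
    squared_error = np.sum(residuals ** 2)
    # L2 penalty: alpha times the sum of squared weights
    # (the intercept b is not penalized)
    l2_penalty = alpha * np.sum(w ** 2)
    return squared_error + l2_penalty

Larger values of alpha put more weight on keeping the coefficients small; alpha = 0 recovers ordinary least-squares regression.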

The practical effect of using ridge regression is to find feature weights, w, that fit the data well and also set many of the feature weights to small values. On a regression problem with dozens or hundreds of features, the resulting improvement in accuracy can be significant.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

X_train, X_test, y_train, y_test = train_test_split(X_crime, y_crime,
                                                   random_state = 0)

linridge = Ridge(alpha=20.0).fit(X_train, y_train)

print('Crime dataset')
print('ridge regression linear model intercept: {}'
     .format(linridge.intercept_))
print('ridge regression linear model coeff:\n{}'
     .format(linridge.coef_))
print('R-squared score (training): {:.3f}'
     .format(linridge.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'
     .format(linridge.score(X_test, y_test)))
print('Number of non-zero features: {}'
     .format(np.sum(linridge.coef_ != 0)))
Crime dataset
ridge regression linear model intercept: -3352.4230358463437
ridge regression linear model coeff:
[ 1.95091438e-03  2.19322667e+01  9.56286607e+00 -3.59178973e+01
  6.36465325e+00 -1.96885471e+01 -2.80715856e-03  1.66254486e+00
 -6.61426604e-03 -6.95450680e+00  1.71944731e+01 -5.62819154e+00
  8.83525114e+00  6.79085746e-01 -7.33614221e+00  6.70389803e-03
  9.78505502e-04  5.01202169e-03 -4.89870524e+00 -1.79270062e+01
  9.17572382e+00 -1.24454193e+00  1.21845360e+00  1.03233089e+01
 -3.78037278e+00 -3.73428973e+00  4.74595305e+00  8.42696855e+00
  3.09250005e+01  1.18644167e+01 -2.05183675e+00 -3.82210450e+01
  1.85081589e+01  1.52510829e+00 -2.20086608e+01  2.46283912e+00
  3.29328703e-01  4.02228467e+00 -1.12903533e+01 -4.69567413e-03
  4.27046505e+01 -1.22507167e-03  1.40795790e+00  9.35041855e-01
 -3.00464253e+00  1.12390514e+00 -1.82487653e+01 -1.54653407e+01
  2.41917002e+01 -1.32497562e+01 -4.20113118e-01 -3.59710660e+01
  1.29786751e+01 -2.80765995e+01  4.38513476e+01  3.86590044e+01
 -6.46024046e+01 -1.63714023e+01  2.90397330e+01  4.15472907e+00
  5.34033563e+01  1.98773191e-02 -5.47413979e-01  1.23883518e+01
  1.03526583e+01 -1.57238894e+00  3.15887097e+00  8.77757987e+00
 -2.94724962e+01 -2.33454302e-04  3.13528914e-04 -4.13169509e-04
 -1.80309962e-04 -5.74054525e-01 -5.17742507e-01 -4.20670933e-01
  1.53383596e-01  1.32725423e+00  3.84863158e+00  3.03024594e+00
 -3.77692644e+01  1.37933464e-01  3.07676522e-01  1.57128807e+01
  3.31418306e-01  3.35994414e+00  1.61265911e-01 -2.67619878e+00]
R-squared score (training): 0.671
R-squared score (test): 0.494
Number of non-zero features: 88

Feature Preprocessing and Normalization

The effect of increasing alpha is to shrink the coefficients toward zero and toward each other. But if the features have very different scales, then they will also have very different contributions to the penalty. So, transforming the input features so that they are all on the same scale means the ridge penalty is applied more “fairly” to all features, without unduly weighting some more than others just due to a difference in scales.

This was reviewed in our previous post about feature normalization.
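
As a sketch of how this applies to the ridge model above (the alpha value is simply carried over from that example; the resulting scores are not shown here), fit a MinMaxScaler on the training split only and apply the same transformation to the test split:

from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Ridge

# Fit the scaler on the training data only, then apply that same
# transformation to the test data, to avoid leaking test-set information.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

linridge_scaled = Ridge(alpha=20.0).fit(X_train_scaled, y_train)
print('R-squared score (test): {:.3f}'
      .format(linridge_scaled.score(X_test_scaled, y_test)))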

Lasso Regression

Lasso regression is another form of regularized linear regression that uses an L1 regularization penalty for training, instead of the L2 regularization penalty used by Ridge regression.

This has the effect of setting the parameter weights in w to zero for the least influential variables, producing what is called a “sparse solution.”
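
Relative to the ridge cost sketched earlier, the only change is the form of the penalty term, which uses absolute values instead of squares (scikit-learn’s Lasso objective also rescales the squared-error term by the number of samples, but the idea is the same):

import numpy as np

def l1_penalty(w, alpha):
    # L1 penalty: alpha times the sum of absolute weights.
    # Minimizing with this term drives many weights exactly to zero,
    # whereas the L2 penalty only shrinks them toward zero.
    return alpha * np.sum(np.abs(w))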

When to use ridge versus lasso regression:

Use ridge when there are many variables, each with small or medium-sized effects.
Use lasso when there are only a few variables with medium or large effects.

from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

X_train, X_test, y_train, y_test = train_test_split(X_crime, y_crime,
                                                    random_state = 0)

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

linlasso = Lasso(alpha=2.0, max_iter = 10000).fit(X_train_scaled, y_train)

print('Crime dataset')
print('lasso regression linear model intercept: {}'
     .format(linlasso.intercept_))
print('lasso regression linear model coeff:\n{}'
     .format(linlasso.coef_))
print('Non-zero features: {}'
     .format(np.sum(linlasso.coef_ != 0)))
print('R-squared score (training): {:.3f}'
     .format(linlasso.score(X_train_scaled, y_train)))
print('R-squared score (test): {:.3f}\n'
     .format(linlasso.score(X_test_scaled, y_test)))
print('Top Features with non-zero weight:')

# list(X_crime) gives the feature (column) names when X_crime is a DataFrame
for e in sorted(zip(list(X_crime), linlasso.coef_),
                key = lambda e: -abs(e[1])):
    if e[1] != 0:
        print('\t{}, {:.3f}'.format(e[0], e[1]))
Crime dataset
lasso regression linear model intercept: 1186.612061998579
lasso regression linear model coeff:
[    0.             0.            -0.          -168.18346054
    -0.            -0.             0.           119.6938194
     0.            -0.             0.          -169.67564456
    -0.             0.            -0.             0.
     0.             0.            -0.            -0.
     0.            -0.             0.             0.
   -57.52991966    -0.            -0.             0.
   259.32889226    -0.             0.             0.
     0.            -0.         -1188.7396867     -0.
    -0.            -0.          -231.42347299     0.
  1488.36512229     0.            -0.            -0.
    -0.             0.             0.             0.
     0.             0.            -0.             0.
    20.14419415     0.             0.             0.
     0.             0.           339.04468804     0.
     0.           459.53799903    -0.             0.
   122.69221826    -0.            91.41202242     0.
    -0.             0.             0.            73.14365856
     0.            -0.             0.             0.
    86.35600042     0.             0.             0.
  -104.57143405   264.93206555     0.            23.4488645
   -49.39355188     0.             5.19775369     0.        ]
Non-zero features: 20
R-squared score (training): 0.631
R-squared score (test): 0.624

Top Features with non-zero weight:
    PctKidsBornNeverMar, 1488.365
    PctKids2Par, -1188.740
    HousVacant, 459.538
    PctPersDenseHous, 339.045
    NumInShelters, 264.932
    MalePctDivorce, 259.329
    PctWorkMom, -231.423
    pctWInvInc, -169.676
    agePct12t29, -168.183
    PctVacantBoarded, 122.692
    pctUrban, 119.694
    MedOwnCostPctIncNoMtg, -104.571
    MedYrHousBuilt, 91.412
    RentQrange, 86.356
    OwnOccHiQuart, 73.144
    PctEmplManu, -57.530
    PctBornSameState, -49.394
    PctForeignBorn, 23.449
    PctLargHouseFam, 20.144
    PctSameCity85, 5.198

So, 20 out of 88 features have non-zero weight in this example. The five features with the strongest relationship to the target in this dataset are:

PctKidsBornNeverMar, the percentage of kids born to people who never married,
PctKids2Par, the percentage of kids in family housing with two parents,
HousVacant, the number of vacant houses,
PctPersDenseHous, the percentage of persons in dense housing (more than 1 person per room), and
NumInShelters, the number of people in homeless shelters.

for alpha in [0.5, 1, 2, 3, 5, 10, 20, 50]:
    linlasso = Lasso(alpha, max_iter = 10000).fit(X_train_scaled, y_train)
    r2_train = linlasso.score(X_train_scaled, y_train)
    r2_test = linlasso.score(X_test_scaled, y_test)
    
    print('Alpha = {}:\tFeatures kept: {}\tr-squared training, test:\t{:.2f}, {:.2f}'
          .format(alpha, np.sum(linlasso.coef_ != 0), r2_train, r2_test))
Alpha = 0.5: Features kept: 35  r-squared training, test:   0.65, 0.58
Alpha = 1:  Features kept: 25   r-squared training, test:   0.64, 0.60
Alpha = 2:  Features kept: 20   r-squared training, test:   0.63, 0.62
Alpha = 3:  Features kept: 17   r-squared training, test:   0.62, 0.63
Alpha = 5:  Features kept: 12   r-squared training, test:   0.60, 0.61
Alpha = 10: Features kept: 6    r-squared training, test:   0.57, 0.58
Alpha = 20: Features kept: 2    r-squared training, test:   0.51, 0.50
Alpha = 50: Features kept: 1    r-squared training, test:   0.31, 0.30

As with ridge regression, there is an optimal range of values for alpha that will be different for different datasets, and that also depends on the feature preprocessing method being used.
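
One way to search for that optimal value (a sketch using scikit-learn’s cross-validated variant, with the same alpha grid as the loop above) is LassoCV, which refits the model for each alpha across several folds and keeps the value with the lowest average validation error:

from sklearn.linear_model import LassoCV

# Cross-validate over the same alpha grid used in the loop above
lasso_cv = LassoCV(alphas=[0.5, 1, 2, 3, 5, 10, 20, 50],
                   cv=5, max_iter=10000).fit(X_train_scaled, y_train)
print('Best alpha: {}'.format(lasso_cv.alpha_))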

Polynomial features with Linear Regression

Suppose we have a set of two-dimensional data points with features x0 and x1.

We could transform each data point by adding new features that are the three unique multiplicative combinations of x0 and x1, yielding the following five-dimensional representation: (x0, x1) → (x0, x1, x0², x0·x1, x1²).

The degree of the polynomial specifies how many variables participate at a time in each new feature (above, the degree is 2). Note that the result is still a weighted linear combination of features, so it is still a linear model. Intuitively, though, this allows polynomial functions of the original features to be fit to the training data instead of just a straight line, while still using the same least-squares criterion.
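
A quick way to see the expansion (a small sketch on a single made-up point, not part of the course example) is to run PolynomialFeatures on it directly:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One made-up two-dimensional point: x0 = 2, x1 = 3
point = np.array([[2.0, 3.0]])
print(PolynomialFeatures(degree=2).fit_transform(point))
# [[1. 2. 3. 4. 6. 9.]]  -> constant term, x0, x1, x0^2, x0*x1, x1^2

Note that scikit-learn’s implementation also prepends a constant bias column by default.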

This approach of adding new features, such as polynomial features, can be very effective for regression problems where the target depends on interactions between features. For example, housing prices may vary as a quadratic function of both the lot size and the amount of taxes paid on the property.

It is important to be careful when using polynomial feature expansion with a high degree, because it can lead to complex models that overfit. For this reason, polynomial feature expansion is often combined with a regularized learning method like ridge regression.

from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X_F1, y_F1,
                                                    random_state = 0)
linreg = LinearRegression().fit(X_train, y_train)

print('linear model coeff (w): {}'
     .format(linreg.coef_))
print('linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('R-squared score (training): {:.3f}'
     .format(linreg.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'
     .format(linreg.score(X_test, y_test)))
linear model coeff (w): [ 4.42036739  5.99661447  0.52894712 10.23751345  6.5507973  -2.02082636
 -0.32378811]
linear model intercept (b): 1.543
R-squared score (training): 0.722
R-squared score (test): 0.722
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
X_F1_poly = poly.fit_transform(X_F1)

X_train, X_test, y_train, y_test = train_test_split(X_F1_poly, y_F1,
                                                   random_state = 0)
linreg = LinearRegression().fit(X_train, y_train)

print('(poly deg 2) linear model coeff (w):\n{}'
     .format(linreg.coef_))
print('(poly deg 2) linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('(poly deg 2) R-squared score (training): {:.3f}'
     .format(linreg.score(X_train, y_train)))
print('(poly deg 2) R-squared score (test): {:.3f}\n'
     .format(linreg.score(X_test, y_test)))
(poly deg 2) linear model coeff (w):
[ 3.40951018e-12  1.66452443e+01  2.67285381e+01 -2.21348316e+01
  1.24359227e+01  6.93086826e+00  1.04772675e+00  3.71352773e+00
 -1.33785505e+01 -5.73177185e+00  1.61813184e+00  3.66399592e+00
  5.04513181e+00 -1.45835979e+00  1.95156872e+00 -1.51297378e+01
  4.86762224e+00 -2.97084269e+00 -7.78370522e+00  5.14696078e+00
 -4.65479361e+00  1.84147395e+01 -2.22040650e+00  2.16572630e+00
 -1.27989481e+00  1.87946559e+00  1.52962716e-01  5.62073813e-01
 -8.91697516e-01 -2.18481128e+00  1.37595426e+00 -4.90336041e+00
 -2.23535458e+00  1.38268439e+00 -5.51908208e-01 -1.08795007e+00]
(poly deg 2) linear model intercept (b): -3.206
(poly deg 2) R-squared score (training): 0.969
(poly deg 2) R-squared score (test): 0.805

The polynomial features version appears to have overfit. Note that the R-squared score is nearly 1 on the training data, and only 0.8 on the test data. The addition of many polynomial features often leads to overfitting, so it is common to use polynomial features in combination with regression that has a regularization penalty, like ridge regression.

X_train, X_test, y_train, y_test = train_test_split(X_F1_poly, y_F1,
                                                   random_state = 0)
linreg = Ridge().fit(X_train, y_train)

print('(poly deg 2 + ridge) linear model coeff (w):\n{}'
     .format(linreg.coef_))
print('(poly deg 2 + ridge) linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('(poly deg 2 + ridge) R-squared score (training): {:.3f}'
     .format(linreg.score(X_train, y_train)))
print('(poly deg 2 + ridge) R-squared score (test): {:.3f}'
     .format(linreg.score(X_test, y_test)))
(poly deg 2 + ridge) linear model coeff (w):
[ 0.          2.229281    4.73349734 -3.15432089  3.8585194   1.60970912
 -0.76967054 -0.14956002 -1.75215371  1.5970487   1.37080607  2.51598244
  2.71746523  0.48531538 -1.9356048  -1.62914955  1.51474518  0.88674141
  0.26141199  2.04931775 -1.93025705  3.61850966 -0.71788143  0.63173956
 -3.16429847  1.29161448  3.545085    1.73422041  0.94347654 -0.51207219
  1.70114448 -1.97949067  1.80687548 -0.2173863   2.87585898 -0.89423157]
(poly deg 2 + ridge) linear model intercept (b): 5.418
(poly deg 2 + ridge) R-squared score (training): 0.826
(poly deg 2 + ridge) R-squared score (test): 0.825

Note that this model outperforms both the linear model and the version with polynomial features that was trained using non-regularized regression.
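
The same polynomial-plus-ridge workflow can also be packaged as a single estimator with a scikit-learn pipeline (a sketch using the default Ridge alpha, as in the example above), so the feature expansion is applied consistently to any data passed to fit or score:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

X_train, X_test, y_train, y_test = train_test_split(X_F1, y_F1,
                                                    random_state = 0)
poly_ridge = make_pipeline(PolynomialFeatures(degree=2), Ridge())
poly_ridge.fit(X_train, y_train)
print('R-squared score (test): {:.3f}'
      .format(poly_ridge.score(X_test, y_test)))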

These notes were taken from the Coursera course Applied Machine Learning in Python. The information is presented by Kevyn Collins-Thompson, PhD, an associate professor of Information and Computer Science at the University of Michigan.