September 17, 2021

[Algorithms] - Linear regression Least-Squares

A linear model is a sum of weighted variables that predicts a target output value given an input data instance.

For example: car prices.

A car has different features: year built, horsepower, trunk capacity, etc.

With linear regression, the idea is to find a linear formula that takes those features into account and predicts the car price:

E.g.:

Y(price) = 10000 + 108*(Current Year - Year Built) + 23*Trunk Capacity - NrOfAccidents
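The formula above can be written as a small Python function. Note that the coefficients (10000, 108, 23) are invented purely for illustration; a real model would learn these weights from data.

```python
# A hypothetical linear model for car prices.
# All coefficients here are made up for illustration only.
def predict_price(year_built, trunk_capacity, nr_of_accidents, current_year=2021):
    base = 10000
    age_term = 108 * (current_year - year_built)
    trunk_term = 23 * trunk_capacity
    return base + age_term + trunk_term - nr_of_accidents

price = predict_price(year_built=2015, trunk_capacity=400, nr_of_accidents=1)
print(price)  # → 19847
```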

I try not to explain things with advanced mathematics, just plain English, and I hope the explanation above is sufficient for most readers.

Show me the code

As before, let's use scikit-learn to generate a synthetic dataset.

import matplotlib.pyplot as plt
from sklearn.datasets import make_regression

# synthetic dataset for simple regression:
# one input feature, one target, with Gaussian noise added
X_R1, y_R1 = make_regression(n_samples=1000, n_features=1,
                             n_informative=1, bias=150.0,
                             noise=30, random_state=0)

plt.figure()
plt.title('Sample regression problem with one input variable')
plt.scatter(X_R1, y_R1, marker='o', s=50)
plt.show()

The code above will also plot the values generated.

Now, if we want to create an ML model to predict future values, we use the simple code below.

As always, we split the data into train and test sets, then we use LinearRegression from sklearn.linear_model; its fit method returns the trained model, which exposes some useful information that we will print further down.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_R1, y_R1,
                                                    random_state=0)
linreg = LinearRegression().fit(X_train, y_train)

print('linear model coeff (w): {}'
     .format(linreg.coef_))
print('linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('R-squared score (training): {:.3f}'
     .format(linreg.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'
     .format(linreg.score(X_test, y_test)))

One widely used method for estimating w and b for linear regression problems is called least-squares linear regression, also known as ordinary least-squares.

Least-squares linear regression finds the line through this cloud of points that minimizes what is called the mean squared error of the model.

The mean squared error of the model is essentially the average of the squared differences between the predicted target value and the actual target value over all the points in the training set.

Visually speaking, it is based on the vertical distance between the predicted value and the true value for each point.

So for each training point we compute the squared difference between the predicted and actual value; if we add all these up and divide by the number of training points, i.e. take the average, that will be the mean squared error of the model.


Adding up all the squared values of these differences for all the training points gives the total squared error, and this is what the least-squares solution tries to minimize.
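The mean squared error described above is straightforward to compute by hand. Here is a small sketch, using made-up toy values, that checks the manual computation against scikit-learn's built-in mean_squared_error:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values, invented for illustration
y_true = np.array([3.0, 5.0, 7.0])   # actual target values
y_pred = np.array([2.5, 5.0, 8.0])   # model predictions

# MSE by hand: average of the squared differences
mse_manual = np.mean((y_true - y_pred) ** 2)

# Same thing via scikit-learn
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)
```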

The code above prints out the learned coefficient and intercept, together with the R-squared score, which measures how much of the variation in the target is explained by the model (1.0 is a perfect fit).

Results

linear model coeff (w): [45.71]
linear model intercept (b): 148.446
R-squared score (training): 0.679
R-squared score (test): 0.492
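The R-squared score reported above follows the standard formula 1 - SS_res/SS_tot (residual sum of squares over total sum of squares). Here is a self-contained sketch, regenerating the same synthetic dataset, that verifies the manual formula matches what linreg.score returns:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same synthetic dataset as above
X_R1, y_R1 = make_regression(n_samples=1000, n_features=1,
                             n_informative=1, bias=150.0,
                             noise=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_R1, y_R1,
                                                    random_state=0)
linreg = LinearRegression().fit(X_train, y_train)

# R^2 = 1 - SS_res / SS_tot, computed on the test set
y_pred = linreg.predict(X_test)
ss_res = np.sum((y_test - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(round(r2_manual, 3), round(linreg.score(X_test, y_test), 3))
```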

And now let's plot the linear regression line:

plt.figure(figsize=(5,4))
plt.scatter(X_R1, y_R1, marker= 'o', s=50, alpha=0.8)
plt.plot(X_R1, linreg.coef_ * X_R1 + linreg.intercept_, 'r-')
plt.title('Least-squares linear regression')
plt.xlabel('Feature value (x)')
plt.ylabel('Target value (y)')
plt.show()

And the result is below

The result is a straight line corresponding to a linear formula: given a new input value, the line gives us the predicted target value.
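Entering a new input value is just a call to the model's predict method, which is equivalent to evaluating the learned line w*x + b directly. A self-contained sketch (regenerating the same dataset as above, with arbitrarily chosen new inputs):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same synthetic dataset as above
X_R1, y_R1 = make_regression(n_samples=1000, n_features=1,
                             n_informative=1, bias=150.0,
                             noise=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_R1, y_R1,
                                                    random_state=0)
linreg = LinearRegression().fit(X_train, y_train)

# New, unseen input values (chosen arbitrarily for illustration)
new_x = np.array([[-1.5], [0.0], [2.0]])
predictions = linreg.predict(new_x)
print(predictions)

# predict() is the same as applying the learned line w*x + b
manual = linreg.coef_[0] * new_x.ravel() + linreg.intercept_
```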

Side notes:

  1. The basic linear model does not have any parameters to control the model complexity.
  2. The linear model always uses all of the input variables; with a single feature the result is a straight line, and with more features it is a plane or hyperplane.

Any questions, just write them in the comments section below.