September 20, 2021

Feature normalization

Before writing the next post about algorithms, I thought it was important to first talk about feature normalization, as it will be relevant to almost all the algorithms moving forward.

Some algorithms apply a penalty. For example, in Ridge regression the L2 penalty is the sum of the squares of all the coefficients in the formula. If your input features are on very different scales, for example house price is in millions of dollars but the number of years since the house was built is only in the range of 1-50, this will have a huge impact on the L2 penalty.
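To make this concrete, here is a minimal sketch with synthetic data (the numbers are purely illustrative) showing how the choice of units for a single feature changes its coefficient, and therefore its contribution to the L2 penalty:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
x = rng.uniform(1, 50, size=(200, 1))          # e.g. years since the house was built
y = 3 * x.ravel() + rng.normal(0, 1, 200)      # synthetic target

for scale in (1, 1000):                        # the same feature, two unit choices
    coef = Ridge(alpha=1.0).fit(x * scale, y).coef_[0]
    print(f"scale={scale:>4}: coef={coef:.6f}, squared coef={coef**2:.8f}")

The same underlying signal gets a coefficient about 1000 times smaller when the feature is expressed in the larger units, so its squared contribution to the penalty changes by a factor of about a million.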

With feature normalization, we take the features and rescale them to a common scale, so that when the model is created with the .fit method they are all comparable. Transforming the input features this way means the ridge penalty is applied more fairly to all features, without unduly weighting some more than others just because of a difference in scale.

You will see as we proceed that feature normalization is important for a number of learning algorithms beyond regularized regression, including k-nearest neighbors, support vector machines, neural networks, and others.
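To see why distance-based methods care, here is a tiny sketch (values loosely based on the dataset used below, and hypothetical) showing the Euclidean distance that k-nearest neighbors relies on being dominated by the feature with the largest scale:

import numpy as np

a = np.array([20791.0, 7.3])   # hypothetical [Solids, ph] for one sample
b = np.array([18630.0, 3.7])   # and for another
print(np.linalg.norm(a - b))   # ~2161, driven almost entirely by Solids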

In the example below we are going to apply a widely used form of feature normalization called MinMax scaling. It computes the minimum and maximum values of each feature on the training data, and then uses them to rescale each feature to the [0, 1] range.
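Under the hood the transformation is simple; here is a small sketch (with made-up numbers) of what MinMax scaling computes for each feature:

import numpy as np

# For each feature: x_scaled = (x - min) / (max - min),
# where min and max come from the training data
X_train = np.array([[1.0, 200.0],
                    [5.0, 400.0],
                    [3.0, 300.0]])

mins = X_train.min(axis=0)
maxs = X_train.max(axis=0)
print((X_train - mins) / (maxs - mins))   # every column now lies in [0, 1]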

For this example I am using a water quality dataset which I have stored in my Azure Blob Storage. This dataset has several numeric features such as ph, Hardness, Solids, Chloramines, Sulfate, etc.

As you can see from the code below, each feature has a very different range of values; by scaling them to the same range, some algorithms will produce a better model.

from azureml.core import Workspace, Dataset

# Workspace details ('myid' is a placeholder for the real subscription id)
subscription_id = 'myid'
resource_group = 'mlplayground'
workspace_name = 'mlplayground'

workspace = Workspace(subscription_id, resource_group, workspace_name)

# Load the registered dataset and convert it to a pandas dataframe
dataset = Dataset.get_by_name(workspace, name='WaterQuality')
df = dataset.to_pandas_dataframe()
print(df)
          ph    Hardness        Solids  Chloramines     Sulfate  \
0          NaN  204.890455  20791.318981     7.300212  368.516441   
1     3.716080  129.422921  18630.057858     6.635246         NaN   
2     8.099124  224.236259  19909.541732     9.275884         NaN   
3     8.316766  214.373394  22018.417441     8.059332  356.886136   
4     9.092223  181.101509  17978.986339     6.546600  310.135738   
...        ...         ...           ...          ...         ...   
3271  4.668102  193.681735  47580.991603     7.166639  359.948574   
3272  7.808856  193.553212  17329.802160     8.061362         NaN   
3273  9.419510  175.762646  33155.578218     7.350233         NaN   
3274  5.126763  230.603758  11983.869376     6.303357         NaN   
3275  7.874671  195.102299  17404.177061     7.509306         NaN   

      Conductivity  Organic_carbon  Trihalomethanes  Turbidity  Potability  
0       564.308654       10.379783        86.990970   2.963135           0  
1       592.885359       15.180013        56.329076   4.500656           0  
2       418.606213       16.868637        66.420093   3.055934           0  
3       363.266516       18.436524       100.341674   4.628771           0  
4       398.410813       11.558279        31.997993   4.075075           0  
...            ...             ...              ...        ...         ...  
3271    526.424171       13.894419        66.687695   4.435821           1  
3272    392.449580       19.903225              NaN   2.798243           1  
3273    432.044783       11.039070        69.845400   3.298875           1  
3274    402.883113       11.168946        77.488213   4.708658           1  
3275    327.459760       16.140368        78.698446   2.309149           1  
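For a quick numeric confirmation of how different those ranges are, pandas can summarize the minimum and maximum of each column:

# The min/max rows make the scale differences obvious,
# e.g. Solids in the tens of thousands vs. ph in single digits
print(df.describe().loc[["min", "max"]])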

Now, the fun part: how do we use the MinMaxScaler?

There are several ways, but I prefer to write the results of the scaler back into the same dataframe, assigning the transformed values to the original columns; that way it is at least visually easier for the reader.

  1. We import MinMaxScaler from sklearn.preprocessing.
  2. We instantiate it.
  3. We call the scaler's fit_transform method with the list of feature columns, and assign the result back to the same dataframe columns. This basically means we replace the original values with the scaled ones.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Scale every numeric feature; Potability is the label, so it stays untouched
features = ["ph", "Hardness", "Solids", "Chloramines", "Sulfate",
            "Conductivity", "Organic_carbon", "Trihalomethanes", "Turbidity"]

df[features] = scaler.fit_transform(df[features])

print(df)

The result is self-explanatory: all the values are now in the same range. And we can continue on our journey.

    ph  Hardness    Solids  Chloramines   Sulfate  Conductivity  \
0          NaN  0.571139  0.336096     0.543891  0.680385      0.669439   
1     0.265434  0.297400  0.300611     0.491839       NaN      0.719411   
2     0.578509  0.641311  0.321619     0.698543       NaN      0.414652   
3     0.594055  0.605536  0.356244     0.603314  0.647347      0.317880   
4     0.649445  0.484851  0.289922     0.484900  0.514545      0.379337   
...        ...       ...       ...          ...       ...           ...   
3271  0.333436  0.530482  0.775947     0.533436  0.656047      0.603192   
3272  0.557775  0.530016  0.279263     0.603473       NaN      0.368912   
3273  0.672822  0.465486  0.539101     0.547807       NaN      0.438152   
3274  0.366197  0.664407  0.191490     0.465860       NaN      0.387157   
3275  0.562477  0.535635  0.280484     0.560259       NaN      0.255266   

      Organic_carbon  Trihalomethanes  Turbidity  Potability  
0           0.313402         0.699753   0.286091           0  
1           0.497319         0.450999   0.576793           0  
2           0.562017         0.532866   0.303637           0  
3           0.622089         0.808065   0.601015           0  
4           0.358555         0.253606   0.496327           0  
...              ...              ...        ...         ...  
3271        0.448062         0.535037   0.564534           1  
3272        0.678284              NaN   0.254915           1  
3273        0.338662         0.560655   0.349570           1  
3274        0.343638         0.622659   0.616120           1  
3275        0.534114         0.632478   0.162441           1  

It is critical to apply the same scaling to the training and test sets; otherwise you introduce a more or less arbitrary skew between them, which will invalidate your results. And if you fit the scaler (or any other normalization method) on the test data instead of the training data, you cause a phenomenon called data leakage, where information from the test set leaks into the training phase.
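Here is a minimal sketch of the leakage-safe pattern (in this post we already scaled df in place above, so in practice you would run this on the raw features before any scaling):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Split first, then fit the scaler on the training split only
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["Potability"], random_state=0)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # min/max learned from the training set
X_test_scaled = scaler.transform(X_test)        # the same min/max reused, no leakage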

One downside of performing feature normalization is that the resulting model and the transformed features may be harder to interpret.
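A small mitigation: MinMaxScaler remembers the per-feature minimum and maximum it learned, so when you need to report results in the original units you can map scaled values back with inverse_transform. For example, continuing from the sketch above:

X_train_original = scaler.inverse_transform(X_train_scaled)  # back to the original units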

That's all for now folks, I hope this clarifies what feature normalization is.