
Scaling Data for Deep Learning

from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training data only, then apply that same
# transformation to both the training and test sets.
scaler = MinMaxScaler()
scaler.fit(train_data)
train_data = scaler.transform(train_data)
test_data = scaler.transform(test_data)
By default, MinMaxScaler scales each feature to the interval [0, 1].
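The target interval is also configurable via the scaler's feature_range parameter. A quick sketch, with made-up numbers purely for illustration:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data: min = 0, max = 10.
data = np.array([[0.0], [5.0], [10.0]])

# feature_range picks a different target interval;
# here we scale to [-1, 1] instead of the default [0, 1].
scaler = MinMaxScaler(feature_range=(-1, 1))
print(scaler.fit_transform(data).ravel())  # [-1.  0.  1.]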

This form of scaling is referred to as normalization; that is, the data is rescaled so that all values fall in the interval [0, 1]. Mathematically, each value is transformed as:

y = (x - min) / (max - min)

MinMaxScaler takes care of this for us, but it is important to understand the math at work and its consequences.
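As a quick sanity check, here is a small sketch (again with made-up numbers) that applies the formula by hand and confirms it matches what MinMaxScaler produces:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data: min = 10, max = 50.
x = np.array([[10.0], [30.0], [50.0]])

# Apply y = (x - min) / (max - min) by hand.
manual = (x - x.min()) / (x.max() - x.min())

# Let MinMaxScaler do the same thing.
auto = MinMaxScaler().fit_transform(x)

print(manual.ravel())  # [0.  0.5 1. ]
print(auto.ravel())    # [0.  0.5 1. ]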

One of the key rules in machine learning is that the training process must not be tainted by any information from the test data. There are many ways this can happen, so one must be careful about anything done to the data before it is split, and that includes operations such as scaling. Because the scaling is based on the min and max of the data it is fit on, fitting the scaler on the combined dataset would let the test set's values influence that min and max. Instead, first do the train/test split to establish the two sets, fit the scaler on the training set alone, and then transform both sets with that same scaler. Failure to handle the scaling properly leaks test information into training and will bias your results optimistically.
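Putting it all together, here is a minimal sketch of the correct order of operations, where X is a hypothetical stand-in for your full dataset:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# X stands in for your full feature matrix.
X = np.random.rand(100, 4) * 100

# 1. Split first, so the test set cannot influence the scaler.
train_data, test_data = train_test_split(X, test_size=0.2, random_state=42)

# 2. Fit the scaler on the training data only.
scaler = MinMaxScaler()
scaler.fit(train_data)

# 3. Transform both sets using the training set's min and max.
train_data = scaler.transform(train_data)
test_data = scaler.transform(test_data)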

 
