
Scaling Data for Deep Learning

from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training data only, then apply that same
# transformation to both the training and test sets.
scaler = MinMaxScaler()
scaler.fit(train_data)
train_data = scaler.transform(train_data)
test_data = scaler.transform(test_data)
By default, MinMaxScaler scales each feature to the interval [0, 1].
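The target interval is also configurable via the scaler's feature_range parameter. A quick sketch, with made-up numbers purely for illustration:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data: min = 0, max = 10.
data = np.array([[0.0], [5.0], [10.0]])

# feature_range picks a different target interval;
# here we scale to [-1, 1] instead of the default [0, 1].
scaler = MinMaxScaler(feature_range=(-1, 1))
print(scaler.fit_transform(data).ravel())  # [-1.  0.  1.]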

This form of scaling is referred to as normalization; that is, the data is rescaled so that all values fall in the interval [0, 1]. Mathematically, each value is transformed as:

y = (x - min) / (max - min)

MinMaxScaler takes care of this for us, but it is important to understand the math at work and its consequences.
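As a quick sanity check, here is a small sketch (again with made-up numbers) that applies the formula by hand and confirms it matches what MinMaxScaler produces:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data: min = 10, max = 50.
x = np.array([[10.0], [30.0], [50.0]])

# Apply y = (x - min) / (max - min) by hand.
manual = (x - x.min()) / (x.max() - x.min())

# Let MinMaxScaler do the same thing.
auto = MinMaxScaler().fit_transform(x)

print(manual.ravel())  # [0.  0.5 1. ]
print(auto.ravel())    # [0.  0.5 1. ]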

One of the key rules in machine learning is that the training process must not be tainted by any information from the test data. There are many ways this can happen, so one must be careful about anything done to the data before it is split, and that includes operations such as scaling. Because the scaling is based on the min and max of the data it is fit on, fitting the scaler on the combined dataset would let the test set's values influence that min and max. Instead, first do the train/test split to establish the two sets, fit the scaler on the training set alone, and then transform both sets with that same scaler. Failure to handle the scaling properly leaks test information into training and will bias your results optimistically.
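Putting it all together, here is a minimal sketch of the correct order of operations, where X is a hypothetical stand-in for your full dataset:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# X stands in for your full feature matrix.
X = np.random.rand(100, 4) * 100

# 1. Split first, so the test set cannot influence the scaler.
train_data, test_data = train_test_split(X, test_size=0.2, random_state=42)

# 2. Fit the scaler on the training data only.
scaler = MinMaxScaler()
scaler.fit(train_data)

# 3. Transform both sets using the training set's min and max.
train_data = scaler.transform(train_data)
test_data = scaler.transform(test_data)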

 
