When building deep learning models, it is often beneficial to scale your data. Raw features can span huge, unbounded ranges, and the goal of scaling is to bound those values. The most common activation functions for a neuron are tanh, sigmoid, and ReLU.

In the case of sigmoid, recall that the output is in the interval `[0,1]`, and with tanh the output is in the interval `[-1,1]`. Rectified Linear Unit (ReLU) activations are unbounded. Although ReLU is the common choice these days, scaling the data is still worthwhile.
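The output ranges above are easy to check numerically. A quick sketch (using NumPy, with an arbitrary input range as an illustration):

```python
import numpy as np

x = np.linspace(-10, 10, 1000)  # inputs spanning a wide range

sigmoid = 1 / (1 + np.exp(-x))  # squashed into (0, 1)
tanh = np.tanh(x)               # squashed into (-1, 1)
relu = np.maximum(0, x)         # unbounded above: grows with the input

print(sigmoid.min(), sigmoid.max())  # stays inside [0, 1]
print(tanh.min(), tanh.max())        # stays inside [-1, 1]
print(relu.max())                    # as large as the largest input
```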

We can use Python’s sklearn `MinMaxScaler` to accomplish this.

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(train_data)
train_data = scaler.transform(train_data)
test_data = scaler.transform(test_data)
```

By default, `MinMaxScaler` will scale to the interval `[0,1]`. This form of scaling is referred to as *normalization*: the data is rescaled so that all values fall in the interval `[0,1]`. Mathematically, it works like this:

```
y = (x - min) / (max - min)
```
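To make the formula concrete, here is a quick sketch with made-up numbers, applying it by hand with NumPy:

```python
import numpy as np

x = np.array([5.0, 10.0, 15.0, 20.0])

# y = (x - min) / (max - min): min maps to 0, max maps to 1,
# and everything else lands proportionally in between.
y = (x - x.min()) / (x.max() - x.min())
print(y)  # [0.         0.33333333 0.66666667 1.        ]
```

Fitting a `MinMaxScaler` on the same array and transforming it would produce the same values.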

`MinMaxScaler` takes care of this for us, but it is important to understand the math that is at work and the consequences of it.

One of the key rules in machine learning is that the training process must not be tainted by any of the test data. There are many ways this can happen, so be careful about anything done to the data before it is split, and that includes scaling. Because the scale is based on the min and max of the set, those values would be influenced by the test data if the scaler were fit on the combined dataset. Instead, first do the train/test split to establish the two sets, fit the scale on the training set only, and then transform both sets with that fitted scale. Handling the scaling improperly leaks information from the test set into training and makes your results look better than they really are.
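The correct order of operations can be sketched with a small hypothetical 1-D feature (plain NumPy here, mirroring what `MinMaxScaler` does internally). Note that test values outside the training range scale to values outside `[0,1]`, which is expected and correct; fitting on the combined data to avoid this would be leakage:

```python
import numpy as np

# Hypothetical feature, already split BEFORE any scaling.
train = np.array([[2.0], [4.0], [6.0], [8.0]])
test = np.array([[1.0], [9.0]])  # values outside the training range

# Fit the scale on the training data only.
lo, hi = train.min(), train.max()
train_scaled = (train - lo) / (hi - lo)

# Apply the SAME training-derived scale to the test data.
test_scaled = (test - lo) / (hi - lo)

print(train_scaled.ravel())  # exactly spans [0, 1]
print(test_scaled.ravel())   # falls outside [0, 1] -- and that's fine
```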