Scaling data for Deep Learning
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Fit on the training data only, then apply that same scale to both sets
scaler.fit(train_data)
train_data = scaler.transform(train_data)
test_data = scaler.transform(test_data)
By default, MinMaxScaler will scale each feature to the interval [0, 1].
This form of scaling is referred to as normalization: the data is rescaled so that all values fall in the interval [0, 1]. The way this is done mathematically is:
y = (x - min) / (max - min)

MinMaxScaler takes care of this for us, but it is important to understand the math that is at work and the consequences of it.
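As a quick sanity check, here is a small sketch with made-up numbers showing that applying the formula by hand matches what MinMaxScaler produces (the column values are hypothetical, chosen only to make the arithmetic easy to follow):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training column with min 10 and max 50
train_col = np.array([[10.0], [25.0], [50.0]])

# Manual min-max normalization: y = (x - min) / (max - min)
col_min, col_max = train_col.min(), train_col.max()
manual = (train_col - col_min) / (col_max - col_min)

# The same result via MinMaxScaler
scaler = MinMaxScaler()
sklearn_result = scaler.fit_transform(train_col)

print(manual.ravel())          # [0.    0.375 1.   ]
print(sklearn_result.ravel())  # [0.    0.375 1.   ]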
One of the key rules in machine learning is that the training data must not be tainted by any of the test data. There are many ways this can happen, so be careful about anything done to the data before it is split, and that includes scaling. Because the scaling is based on the min and max of the set, those values would be influenced by the test data if the scaling were done on the combined dataset. Instead, first do the train / test split to establish two sets of data, fit the scale on the training data, and then transform both sets using that scale, as in the sketch below. Failing to handle the scaling this way will optimistically bias your results, making them look better than they would on truly unseen data.
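Putting the pieces together, a minimal end-to-end sketch might look like the following. The feature matrix X, the 80/20 split, and the random seed are assumptions for illustration; only the ordering (split first, fit on train, transform both) is the point.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix; in practice this is your own dataset
X = np.random.rand(100, 4) * 100

# 1. Split first, so the test data cannot influence the scale
train_data, test_data = train_test_split(X, test_size=0.2, random_state=42)

# 2. Fit the scaler on the training data only
scaler = MinMaxScaler()
scaler.fit(train_data)

# 3. Transform both sets with the scale learned from the training data
train_data = scaler.transform(train_data)
test_data = scaler.transform(test_data)

# Test values outside the training min/max will land outside [0, 1];
# that is expected, and preferable to leaking test statistics into training.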