Custom PyTorch Dataset Class for Time Series Sequence Windows
Recently I was working on a problem with time series. Time series can quickly add up to a lot of data, as you are using previous intervals to predict future intervals. What some people do is pre-build one very large dataset of overlapping slices. Typically you will have a number of dates in your time series; say we have 10,000 sequential dates. You then look at a set number of those dates at a time, say two weeks (14 days). This is your "window", and you advance it forward for each analysis. This is common in ARMA, ARIMA, and other analyses where you average over a given period. But given a dataset, this means you have to introduce the windowing logic in code before the data ever reaches your model or calculation. What if we could just have the Dataset do what we want? We can!
What we need is a dataset where each time we make a batch request (via a DataLoader), we get a given number of sequences, each one a window. Each new index advances the window by 1 (or alternatively, we could make it advance by whatever step we choose; see the sketch after the class below).
Here is a Dataset class I created to do just this. You simply tie it to a DataLoader and make requests of it, which totally simplifies your training. First, here is the Dataset class:
import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, data, window, target_cols):
        self.data = torch.Tensor(data)
        self.window = window
        self.target_cols = target_cols
        self.shape = self.__getshape__()
        self.size = self.__getsize__()

    def __getitem__(self, index):
        # x: one window of rows, with every column (targets and features)
        x = self.data[index : index + self.window]
        # y: the same window, restricted to the first target_cols columns
        y = self.data[index : index + self.window, : self.target_cols]
        return x, y

    def __len__(self):
        return len(self.data) - self.window

    def __getshape__(self):
        return (self.__len__(), *self.__getitem__(0)[0].shape)

    def __getsize__(self):
        return (self.__len__())
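As noted above, each new index advances the window by one row. If you would rather advance by a larger step, a small variant works. This is just a sketch of my own; the step parameter is an assumption for illustration, not part of the class above:

class SteppedDataset(Dataset):
    # Sketch of a variant whose window advances by `step` rows per index.
    # `step` is a hypothetical parameter; step=1 reproduces MyDataset above.
    def __init__(self, data, window, target_cols, step=1):
        self.data = torch.Tensor(data)
        self.window = window
        self.target_cols = target_cols
        self.step = step

    def __getitem__(self, index):
        start = index * self.step  # advance the window `step` rows at a time
        x = self.data[start : start + self.window]
        y = self.data[start : start + self.window, : self.target_cols]
        return x, y

    def __len__(self):
        # number of windows; with step=1 this matches len(data) - window above
        return (len(self.data) - self.window) // self.step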
We pair MyDataset with a DataLoader such as below:
batch_size = 20
seq_length = 28
target_cols = 1
pin_memory = True
num_workers = 4
dataset = MyDataset(data_with_features, seq_length, target_cols)
data_load = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                        drop_last=True,
                                        num_workers=num_workers,
                                        pin_memory=pin_memory)
Each batch request from the DataLoader gets batch_size windows, each of length seq_length. So if batch_size is set to 20 and our sequence length is 100, you end up with 20 windows of length 100, each advancing forward by one day. The purpose of target_cols is to let you specify which columns are targets in your prediction. Say the first 10 columns are the actual data you are predicting and the rest of the columns are just features: you can specify target_cols = 10, and when you request data from this Dataset it will give you all of the columns in the data (x) and just the target_cols columns in the target (y).
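To see this in action, here is a quick sanity check on dummy data (the random array and its shapes are just for illustration):

import numpy as np
import torch
from torch.utils.data import DataLoader

# 10,000 rows (days), 12 columns: the first column is the target series,
# the remaining 11 are features.
data_with_features = np.random.randn(10000, 12).astype(np.float32)

dataset = MyDataset(data_with_features, window=28, target_cols=1)
loader = DataLoader(dataset, batch_size=20, drop_last=True)

x, y = next(iter(loader))
print(x.shape)  # torch.Size([20, 28, 12]) -- 20 windows of 28 days, all columns
print(y.shape)  # torch.Size([20, 28, 1])  -- same windows, target column only

# consecutive items advance the window by one day
x0, _ = dataset[0]
x1, _ = dataset[1]
assert torch.equal(x0[1:], x1[:-1])

Note that we leave shuffle at its default of False, so the windows come back in chronological order, which is usually what you want for time series.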