Custom PyTorch Dataset Class for Time Series Sequence Windows
Recently I was working on a problem with time series. Time series can quickly add up to a lot of data, because you use previous intervals to predict future intervals. What some people do is build a very large dataset up front. Typically you have a number of dates in your time series, say 10,000 sequential dates, and you look at a fixed number of them at a time, say two weeks (14 days). This is your "window", and you advance it forward for each analysis. This is common in things like ARMA and ARIMA, and in other analyses where you are averaging over a given period. But with an ordinary dataset, this means you have to write the windowing logic yourself, before the data goes into your model or calculation. What if we could just have the Dataset do what we want? We can!
What we need is a dataset where each batch request (via a DataLoader) returns a given number of sequences. Each item is one window, and each successive item advances the window by 1 (or alternatively, we could make it advance by whatever step we choose).
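To make the windowing idea concrete, here is a minimal sketch in plain Python, independent of PyTorch. The helper name `sliding_windows` and its `step` parameter are illustrative assumptions and not part of the Dataset class in this post.

```python
def sliding_windows(series, window, step=1):
    """Return every full window of length `window`, advancing by `step`."""
    last_start = len(series) - window
    return [series[i:i + window] for i in range(0, last_start + 1, step)]


days = list(range(10))  # stand-in for 10 sequential dates

# Default: each window starts one position after the previous one.
windows = sliding_windows(days, window=4)
print(windows[0])    # [0, 1, 2, 3]
print(windows[1])    # [1, 2, 3, 4]
print(len(windows))  # 7

# Advancing by 2 instead of 1 yields fewer, less overlapping windows.
strided = sliding_windows(days, window=4, step=2)
print(len(strided))  # 4
```

This helper materializes every window at once, which is exactly the "very large dataset" approach. A Dataset that slices on request avoids holding all of the overlapping copies in memory.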
Here is a Dataset class I created to do just this. You simply tie it to a DataLoader and make requests of it, which greatly simplifies your training. First, here is the Dataset class:
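    import torch
    from torch.utils.data import Dataset

    class MyDataset(Dataset):
        def __init__(self, data, window, target_cols):
            self.data = torch.Tensor(data)    # rows are time steps, columns are features
            self.window = window              # number of time steps per sequence
            self.target_cols = target_cols    # number of leading columns used as targets
            self.shape = self.__getshape__()
            self.size = self.__getsize__()

        def __getitem__(self, index):
            # Input: `window` consecutive rows, all columns.
            x = self.data[index:index + self.window]
            # Target: assumed to be the first `target_cols` columns of the row
            # immediately after the window.
            y = self.data[index + self.window, 0:self.target_cols]
            return x, y

        def __len__(self):
            # The last usable window must leave one row after it for the target.
            return len(self.data) - self.window

        def __getshape__(self):
            # (number of windows, window length, number of columns)
            return (self.__len__(), *self.__getitem__(0)[0].shape)

        def __getsize__(self):
            return self.__len__()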
We pair this with a DataLoader, as shown below:
    batch_size = 20
    seq_length = 28
    target_cols = 1
    pin_memory = True
    num_workers = 4

    dataset = MyDataset(data_with_features, seq_length, target_cols)
    data_load = torch.utils.data.DataLoader(dataset,
                                            batch_size=batch_size,
                                            drop_last=True,
                                            num_workers=num_workers,
                                            pin_memory=pin_memory)
Each batch request from the DataLoader returns batch_size windows, each seq_length rows long. So if batch_size is set to 20 and the sequence length is 100, you end up with 20 windows of length 100, each advancing forward by one day from the previous one. The purpose of target_cols is to let you specify which columns are targets in your prediction. Say the first 10 columns are the values you are predicting and the rest of the columns are just features: you can specify target_cols = 10, and when you request data from this Dataset it will give you all of the columns as the input and only the first target_cols columns as the target.
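As a quick sanity check of the shapes described above, here is a small self-contained sketch on synthetic data. The class name `WindowDataset` is a hypothetical stand-in for the Dataset in this post, and its slicing (the window as input, the first target_cols columns of the following row as target) is an assumption based on the description, so treat it as an illustration and not a definitive implementation.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class WindowDataset(Dataset):
    """Hypothetical stand-in for the windowed Dataset described above."""

    def __init__(self, data, window, target_cols):
        self.data = torch.as_tensor(data, dtype=torch.float32)
        self.window = window
        self.target_cols = target_cols

    def __getitem__(self, index):
        # Assumed slicing: `window` rows as input, and the first
        # `target_cols` columns of the next row as the target.
        x = self.data[index:index + self.window]
        y = self.data[index + self.window, :self.target_cols]
        return x, y

    def __len__(self):
        return len(self.data) - self.window


# Synthetic data: 200 time steps, 5 columns; row i is filled with the value i.
data = torch.arange(200, dtype=torch.float32).unsqueeze(1).repeat(1, 5)

dataset = WindowDataset(data, window=28, target_cols=2)
# shuffle defaults to False, so windows arrive in time order.
loader = DataLoader(dataset, batch_size=20, drop_last=True)

x, y = next(iter(loader))
print(x.shape)  # torch.Size([20, 28, 5])
print(y.shape)  # torch.Size([20, 2])

# Consecutive windows in a batch are shifted by exactly one time step.
print(torch.equal(x[1, :-1], x[0, 1:]))  # True
```

Because shuffling is off by default, neighbouring items in a batch overlap heavily. Passing shuffle=True to the DataLoader would mix windows from different parts of the series while leaving each individual window intact.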