Custom Pytorch Dataset Class for Timeseries Sequence Windows

Recently I was working on a time series problem.  Time series data can quickly add up, as you are using previous intervals to predict future intervals.  A common approach is to build one very large dataset.  Typically you will have a number of dates in your time series, say 10,000 sequential dates.  Then you will be looking at a set number of those dates at a time, say two weeks (14 days).  This is your “window”, and you may advance it forward for each analysis.  This is common in methods like ARMA and ARIMA, and in other analyses where you average over a given period.  But this means you have to write extra logic to slice the data before it goes into your model or calculation.  What if we could just have the Dataset do what we want?  We can!

What we need is a dataset where each time we make a batch request (via a Dataloader), we get a given number of sequences: a window.  Each new request advances the window by 1 (or, alternatively, we could make it advance by however much we choose).

Here is a Dataset class I created to do just this.  You simply tie it to a Dataloader and make requests of it, which greatly simplifies your training.  First, here is the Dataset class:
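The original listing is not reproduced here, so below is a minimal sketch of such a class.  The class name TimeseriesDataset is a placeholder, seq_length and target_cols match the parameters discussed below, and data is assumed to be a 2D numpy array or tensor of shape (timesteps, features).

import torch
from torch.utils.data import Dataset

class TimeseriesDataset(Dataset):
    """Serves sliding windows of seq_length rows; index i is the window starting at row i."""

    def __init__(self, data, seq_length, target_cols):
        # data is assumed to be a 2D numpy array or tensor of shape (timesteps, features)
        self.data = torch.as_tensor(data, dtype=torch.float32)
        self.seq_length = seq_length
        self.target_cols = target_cols  # the first target_cols columns are the prediction targets

    def __len__(self):
        # number of complete windows that fit in the data
        return len(self.data) - self.seq_length + 1

    def __getitem__(self, idx):
        window = self.data[idx:idx + self.seq_length]                      # every column
        target = self.data[idx:idx + self.seq_length, :self.target_cols]   # target columns only
        return window, target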

We pair this to a Dataloader such as below:
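A minimal sketch of that pairing, using the seq_length of 100, batch_size of 20, and target_cols of 10 from the example below (data again stands for your (timesteps, features) array):

from torch.utils.data import DataLoader

dataset = TimeseriesDataset(data, seq_length=100, target_cols=10)
loader = DataLoader(dataset, batch_size=20, shuffle=False)  # shuffle off keeps the windows in order

for windows, targets in loader:
    # windows has shape (20, 100, num_features); targets has shape (20, 100, 10)
    ...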

Each batch request from the Dataloader will get a window of seq_length.  So if we have batch_size set to 20 and our sequence length is 100, you will end up with 20 windows of length 100, each advancing forward by one day.  The purpose of target_cols is to let you specify which columns are the targets of your prediction.  Say the first 10 columns are the values you are actually predicting and the rest of the columns are just features; you can specify target_cols = 10, and when you request data from this Dataset it will give you all of the columns as the data and just the first target_cols columns as the target.



Finding the ideal num_workers for Pytorch Dataloaders

One of the biggest bottlenecks in deep learning is loading data.  Having fast drives and quick access to the data is important, especially if you are trying to saturate a GPU or multiple processors.  Pytorch has Dataloaders, which help you manage the task of getting the data into your model.  These can be fantastic to use, especially for large datasets, as they are very powerful and can handle things such as shuffling data, batching data, and even memory management.  Pytorch’s Dataloaders also work in parallel, so you can specify a number of “workers”, with the parameter num_workers, to be loading your data.  Figuring out the correct num_workers can be difficult.  One rule of thumb is to use the number of CPU cores you have available.  In many cases this works well.  Sometimes it’s half that number, or a quarter of it.  There are a lot of factors, such as what else the machine is doing and the type of data you are working with.  The nice thing about Dataloaders is that they can be loading data while your GPU is processing data.  This is one reason why loading data into CPU memory is not a bad idea: it saves valuable GPU memory and allows your computer to make use of the CPU and GPU simultaneously.

The best way to tackle this is to run a basic test.  One thing I can tell you for sure is that it is painfully slow to leave num_workers at its default of 0.  You should absolutely set it to something higher.  Using 0 or 1 workers can take, say, 1-2 minutes to load a batch.  Having it set correctly can get this down to a few seconds!  When you are running a lot of iterations in your training this really adds up, and can be the primary way you speed up your training.

Here is code I used to benchmark and find my ideal num_workers:
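The exact script is not reproduced here; the following is a minimal sketch of this kind of benchmark, assuming torchvision’s CIFAR-10 dataset, a batch size of 64, and two passes over the data (all arbitrary choices):

import time
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

if __name__ == "__main__":
    dataset = datasets.CIFAR10(root="./data", train=True, download=True,
                               transform=transforms.ToTensor())

    for num_workers in range(0, 12, 2):
        loader = DataLoader(dataset, batch_size=64, shuffle=True,
                            num_workers=num_workers, pin_memory=True)
        start = time.time()
        for epoch in range(2):          # two passes over the data per setting
            for data, target in loader:
                pass                    # only measuring load time, no training
        print("num_workers={}: {:.2f} seconds".format(num_workers, time.time() - start))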


Here is an example of the output on CIFAR-10 Data:

Obviously there are a lot of factors that can contribute to the speed at which you load data, and this is just one of them.  But it is an important one.  When you have multiple GPUs it is very important that you can feed them as fast as they can handle, and oftentimes people fall short of this.  You can see from the above output that using at least num_workers=4 is highly beneficial.  I have had datasets where setting this parameter much higher was required for a drastic speedup.  It’s always good to check!


Basic HBase Java Classes and Methods – Part 8: Disable and Delete a Table

In order to delete a table in HBase it must be disabled first.  This forces any data in memory to be flushed to disk.  Because this is an admin operation, we must create an Admin object, similar to how we did when creating a table.  After the Admin object is created, we simply pass the table’s name (a TableName object) to its disableTable method, and then pass the same name to its deleteTable method.  It’s very straightforward and simple.

Full code is below:
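The original listing is not reproduced here; the following is a minimal sketch, with the table name "mytable" as a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class DisableDeleteTable {

    // placeholder table name
    private static final TableName TABLE_NAME = TableName.valueOf("mytable");

    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        Connection connection = ConnectionFactory.createConnection(config);
        Admin admin = null;
        try {
            admin = connection.getAdmin();
            admin.disableTable(TABLE_NAME);  // flushes memory and takes the table offline
            admin.deleteTable(TABLE_NAME);   // a disabled table can now be removed
        } finally {
            if (admin != null) admin.close();
            connection.close();
        }
    }
}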



Basic HBase Java Classes and Methods – Part 7: Delete from a Table

Deleting data from an HBase table is very similar in overall structure to many of our previous operations.  First we have our general skeleton code.  As before we use static variable declarations to make the code look a lot nicer.

We will create a Delete object, passing in the row key 1, which means we are deleting from the row with row key 1.  We will then add the columns we wish to delete using its addColumn method, and finally call the delete method on our Table object, passing in our Delete object as the parameter.

Putting it all together, the complete code is as follows:
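The original listing is not reproduced here; the sketch below assumes a table named "mytable", a column family "cf1", and columns "col1" and "col2", all placeholders (only the row key 1 comes from the text above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteFromTable {

    // static declarations keep the rest of the code tidy, as in the earlier parts
    private static final TableName TABLE_NAME = TableName.valueOf("mytable");
    private static final byte[] CF = Bytes.toBytes("cf1");
    private static final byte[] ROW_KEY = Bytes.toBytes("1");

    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        Connection connection = ConnectionFactory.createConnection(config);
        Table table = null;
        try {
            table = connection.getTable(TABLE_NAME);
            Delete delete = new Delete(ROW_KEY);          // delete from the row with row key 1
            delete.addColumn(CF, Bytes.toBytes("col1"));  // columns to remove from that row
            delete.addColumn(CF, Bytes.toBytes("col2"));
            table.delete(delete);
        } finally {
            if (table != null) table.close();
            connection.close();
        }
    }
}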

The last part of this series will show how to disable and then drop an HBase table using Java.  Next: Basic HBase Java Classes and Methods – Part 8: Disable and Delete a Table


Basic HBase Java Classes and Methods – Part 6: Scan a Table

In HBase a Scan is similar to a Select in SQL.  Again we return to skeleton code which is very similar to what we have seen before.  I will put comments into the three areas we will be addressing:
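The skeleton itself is not reproduced here; a minimal sketch of its shape follows, with comments marking those three areas and "mytable" as a placeholder table name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Table;

public class ScanTable {

    private static final TableName TABLE_NAME = TableName.valueOf("mytable");

    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        Connection connection = ConnectionFactory.createConnection(config);
        Table table = null;
        // 1. Define a ResultScanner

        try {
            table = connection.getTable(TABLE_NAME);
            // 2. Code to Scan the table

        } finally {
            // 3. Close scan resources
            if (table != null) table.close();
            connection.close();
        }
    }
}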

So there are three areas we will cover that are new:

  • Define a ResultScanner
  • Code to Scan the table
  • Close scan resources

Define a ResultScanner

First, just before our try clause, we define a ResultScanner.  We do this outside of the try clause for the same reason we define the table outside of it: we need to close the resource in the finally clause, and if it were defined inside the try clause it would be out of scope there.

Scan the table

We get a Table object from our Connection object, just as we have done before.  This time we call the getScanner method on the Table object, passing in a Scan object that we instantiate.  The results come back as an iterable of Result objects, which we loop through.  For each cell we use the CellUtil utility class, which allows us to access each part of the cell separately.  You can read more about CellUtil in the JavaDoc.

CellUtil.cloneRow() is used to access the key
CellUtil.cloneFamily() is used to access the column family name
CellUtil.cloneQualifier() is used to access the column name
CellUtil.cloneValue() is used to access the column value

Close scan resources

Similar to how we close the connection and the table, we also need to close our ResultScanner.

The final code is as follows:
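The original listing is not reproduced here; the sketch below completes the skeleton above, again with "mytable" as a placeholder table name and a simple println for the output:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanTable {

    private static final TableName TABLE_NAME = TableName.valueOf("mytable");

    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        Connection connection = ConnectionFactory.createConnection(config);
        Table table = null;
        ResultScanner scanner = null;  // defined outside the try so it can be closed in finally
        try {
            table = connection.getTable(TABLE_NAME);
            scanner = table.getScanner(new Scan());
            for (Result result : scanner) {
                for (Cell cell : result.listCells()) {
                    String row = Bytes.toString(CellUtil.cloneRow(cell));           // row key
                    String family = Bytes.toString(CellUtil.cloneFamily(cell));     // column family
                    String column = Bytes.toString(CellUtil.cloneQualifier(cell));  // column name
                    String value = Bytes.toString(CellUtil.cloneValue(cell));       // column value
                    System.out.println(row + " " + family + ":" + column + " = " + value);
                }
            }
        } finally {
            if (scanner != null) scanner.close();
            if (table != null) table.close();
            connection.close();
        }
    }
}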

This produces the following output:

In our next part, we will look at how to delete data from an HBase table.  Next: Basic HBase Java Classes and Methods – Part 7: Delete from a Table
