
Time Series Reshaping for Deep Learning in Databricks

Photo by Jon Tyson on Unsplash

With Databricks SQL, preparing time series data for deep learning training can be straightforward.

Data Overview

To demonstrate this, we’ll use a simple fictitious dataset. Our goal is to prepare this data for deep learning time series forecasting or classification models that use multiple predictive features. The input data looks as follows:

Example Records

For a given machine ID, we may want to predict the next day’s operating hours, failure rate, and so on.
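The example records aren’t reproduced here, but a minimal sketch of the assumed schema (column names taken from the queries below; the rows themselves are hypothetical) could be built like this:

from pyspark.sql import Row

# Hypothetical sample rows; machine_id, cal_date, local_temperature, and
# operating_hours are the columns referenced by the SQL below.
rows = [
    Row(machine_id=1, cal_date="2023-01-01", local_temperature=21.5, operating_hours=7.0),
    Row(machine_id=1, cal_date="2023-01-02", local_temperature=22.1, operating_hours=6.5),
    Row(machine_id=2, cal_date="2023-01-01", local_temperature=19.8, operating_hours=8.0),
]
spark.createDataFrame(rows).write.mode("overwrite").saveAsTable("delta_table_historical")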

Databricks SQL Functions

Time series data used in deep learning models such as LSTMs often needs to be in the shape (batch, timestep, feature).

This requires some additional data preparation, since the desired shape isn’t a simple 2D tabular layout. In our example, for each machine ID we want to group a set of features together for each day.

The critical functions in Databricks SQL are struct, collect_list, and the standard window functions (PARTITION BY/ORDER BY).
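To see concretely what these functions produce, here is a small hypothetical check (using the sample table sketched above): struct packs one day’s features into a single nested value, and collect_list over an ordered window accumulates those values, in order, into an array.

# Toy window over one machine: each output row carries an ordered array
# of (local_temperature, operating_hours) structs for the last two days.
spark.sql("""
    SELECT
         cal_date
        ,collect_list(struct(local_temperature, operating_hours))
             OVER (ORDER BY cal_date
                   ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS last_two_days
    FROM delta_table_historical
    WHERE machine_id = 1
""").show(truncate=False)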

Databricks SQL Setup

Assume our example data is loaded into a table already. Preparing our desired shape with a given number of lookback steps becomes simple:

-- Gather the trailing 90 days plus the current row (91 timesteps) per machine
SELECT
     *
    ,collect_list(struct(
        local_temperature,
        operating_hours
    )) OVER (PARTITION BY machine_id
             ORDER BY cal_date
             ROWS BETWEEN 90 PRECEDING AND CURRENT ROW) AS ts_data
FROM delta_table_historical

This collects the trailing 90 days of history plus the current row (91 timesteps in total) into one new column, partitioned by machine and ordered by date. We can store the result in a variable spark_df (spark_df = spark.sql(…)) for later use. Note: the data should contain no gaps (one row per machine per day), and each partition must hold enough history to fill the window.
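As a sketch, the full call could look like the following; the size() filter at the end is an assumption to keep only rows whose lookback window is complete (91 entries = 90 preceding days plus the current row):

from pyspark.sql import functions as F

spark_df = spark.sql("""
    SELECT
         *
        ,collect_list(struct(local_temperature, operating_hours))
             OVER (PARTITION BY machine_id
                   ORDER BY cal_date
                   ROWS BETWEEN 90 PRECEDING AND CURRENT ROW) AS ts_data
    FROM delta_table_historical
""")

# Drop rows whose window is shorter than the full 91 timesteps.
spark_df = spark_df.where(F.size("ts_data") == 91)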

NumPy Conversion

Once our data is prepared per training instance, we can easily convert it to a NumPy array for deep learning training. This can be accomplished as follows:

import numpy as np
# np.array() over the collected Rows yields shape (batch, 1, timestep, feature);
# squeeze removes the singleton axis, leaving (batch, timestep, feature).
ts_np = np.squeeze(np.array(spark_df.select('ts_data').collect()), axis=1).astype('float32')

At this point our data reshaping is complete, and the array can be used directly in a deep learning model!
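For instance, a minimal Keras sketch (assuming TensorFlow is available; the layer sizes and the label array y are placeholders, not part of the original pipeline) that consumes the (batch, 91, 2) array:

import tensorflow as tf

# ts_np has shape (batch, 91, 2): 91 timesteps, 2 features per step.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(91, 2)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1),  # e.g., next-day operating hours
])
model.compile(optimizer="adam", loss="mse")
# model.fit(ts_np, y, epochs=10)  # y: labels prepared separately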
