Data scientists are waking up to the fact that, when tackling a time series forecasting challenge, combining multiple relatively simple predictive models is often more efficient than reaching for the latest super-duper neural network.
One question remains - how should you combine these models?
Stacking is the popular answer. It is not the best one...
Stacking is a supervised ensemble learning technique. The idea is to combine base predictive models into a higher-level model with lower bias and variance. So far so good, but...
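Before getting to the "but", here is roughly what stacking looks like in practice. This is a minimal sketch assuming scikit-learn; the base models, meta-model and toy data are purely illustrative choices, not a recommended configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge

# Toy data standing in for lagged features of a time series (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)

# Two simple base models...
base_models = [
    ("ridge", Ridge(alpha=1.0)),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
]

# ...combined by a higher-level model trained on their cross-validated predictions.
stack = StackingRegressor(estimators=base_models,
                          final_estimator=LinearRegression(), cv=5)
stack.fit(X, y)
print(stack.predict(X[:3]))
```

Note that the cross-validation step inside the stack treats observations as exchangeable, which is precisely the assumption discussed next.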
Stacking makes the standard machine learning assumption that data is independent and identically distributed ("IID").
While that suits lots of data, it rarely applies to real-life time series, whose observations are autocorrelated and whose statistical properties drift over time.
As a consequence, stacking will be good at replicating seasonal patterns, but will fail when facing truly non-stationary time series.
The way stacking combines base models is also a bit cumbersome: base-model predictions become the training inputs of a higher-level model, which must itself be fitted (typically via cross-validation) on top of them, sometimes across several layers.
That's a lot of calculations, especially when using multiple layers of stacked models, since stacking works by batches: updating the higher-level model requires re-training everything on the entire dataset.
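As an illustration, continuing the sketch above (still with purely illustrative data), folding in each new observation means re-fitting the entire stack on the full history:

```python
# Walk-forward forecasting with a stacked model: every layer is re-trained
# from scratch on all past data at each step (continues the snippet above).
predictions = []
for t in range(100, len(y)):
    stack.fit(X[:t], y[:t])                       # full batch re-fit, every time
    predictions.append(stack.predict(X[t:t + 1])[0])
```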
It is easy to understand that increasing the modeling distance between inputs and outputs reduces interpretability. This well-known problem with deep learning also affects stacking: after a few layers of stacked Random Forests, working out the contribution of a given input to the final prediction becomes a full-time job.
And addressing that problem with a model-agnostic interpretation method such as LIME involves more parameter tuning, thus complicating the productization of the solution.
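For instance, a minimal LIME sketch (assuming the `lime` package and reusing the toy stack above; all parameter values are illustrative) already exposes several knobs:

```python
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X,                         # background data for the perturbations
    mode="regression",
    kernel_width=3.0,          # how "local" the explanation is -- needs tuning
    discretize_continuous=True,
)
explanation = explainer.explain_instance(
    X[-1],                     # the prediction to explain
    stack.predict,
    num_features=5,            # how many features to report
    num_samples=2000,          # size of the perturbation sample -- needs tuning
)
print(explanation.as_list())
```

Each of those parameters affects the explanation and has to be validated and maintained once the model is in production.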
Sequential aggregation is an online alternative, rooted in the theory of prediction with expert advice.
The key difference with stacking is that base models are combined linearly, not via additional layers of models.
Base models receive positive weights summing to one, updated incrementally based on their performance. The following graph is an example of model weights changing over time (on an oil price prediction challenge):
This combination mechanism has two interesting direct consequences: the ensemble adapts continuously as new data arrives, with no batch re-training, and the weights give a direct view of each base model's contribution to the forecast.
Another, less obvious advantage is that sequential aggregation accommodates non-standard prediction errors (in addition to classic classification and regression errors), as long as they are convex. And as we discussed in a previous article, this is key to avoiding production disappointments and transforming nerdy machine learning applications into business-oriented decision management solutions.
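To make the mechanism concrete, here is a minimal sketch of one classic sequential aggregation rule, the exponentially weighted average forecaster. The learning rate, loss and toy models are assumptions for illustration, not Datapred's actual configuration:

```python
import numpy as np

def ewa_update(weights, expert_preds, outcome, eta=0.5,
               loss=lambda p, y: (p - y) ** 2):
    """One step of the exponentially weighted average forecaster.

    Weights stay positive and sum to one; each base model is penalised
    exponentially in its own (convex) loss at this time step.
    """
    losses = np.array([loss(p, outcome) for p in expert_preds])
    new_weights = weights * np.exp(-eta * losses)
    return new_weights / new_weights.sum()

# Toy run with three base "models" of increasing noise.
rng = np.random.default_rng(1)
weights = np.ones(3) / 3
for t in range(50):
    y_t = np.sin(t / 5.0)                              # the value to forecast
    preds = np.array([y_t + rng.normal(scale=s) for s in (0.1, 0.5, 1.0)])
    y_hat = float(weights @ preds)                     # convex combination of base forecasts
    weights = ewa_update(weights, preds, y_t)          # O(number of models) update, no re-training
print(weights)  # most of the weight ends up on the least noisy model
```

Any convex loss can be plugged in place of the squared error above, which is the flexibility mentioned in the previous paragraph.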
***
Datapred automates both stacking and sequential aggregation. Feel free to contact us to discuss these capabilities in the context of your modeling projects. And for additional resources on time series modeling, this page is a good place to start.