Our goal in this post is to discuss our standard strategy (beyond respecting basic time series modeling principles) for building accurate predictive models. We will use the example of commodity procurement optimization.
Machine learning always works better when it targets the real industrial objective rather than a proxy.
If you are managing a grain mill, your real operational question is not: "What will be the price of wheat in four weeks?" It is probably closer to: "How should I plan my wheat purchase orders over the next four weeks?"
To answer the first question, backtesting against a standard L1 or L2 regression error (the absolute or squared difference between predicted and actual prices) is fine. But to answer the second question, you must backtest, and therefore first formalize, the relevant operational loss function over the corresponding period.
These are very different machine learning problems, yielding different solutions, and the second solution is operationally superior to the first. Implementing it requires in-depth discussions with business experts, which is one of the reasons why auto-ML falls short for real industrial applications.
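To make the contrast concrete, here is a minimal sketch of the two backtests. All names and numbers are illustrative assumptions, not Datapred's implementation: a biased forecast can lose badly on a regression metric yet win on the operational loss, here the cost of a toy policy that buys all demand on the forecasted cheapest day.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical four-week horizon of daily wheat prices (20 business days).
prices = 200 + np.cumsum(rng.normal(0, 2, size=20))

def mae(forecast, actual):
    """Standard L1 regression error: how close are predicted prices?"""
    return np.mean(np.abs(forecast - actual))

def procurement_cost(forecast, actual, demand=10.0):
    """Operational loss: total cost of a toy policy that buys the whole
    demand on the day the *forecast* says prices will be lowest."""
    buy_day = int(np.argmin(forecast))
    return demand * actual[buy_day]

# Two hypothetical models: A has low regression error; B is biased by a
# constant, so its L1 error is large but it ranks the days perfectly.
forecast_a = prices + rng.normal(0, 1, size=20)
forecast_b = prices + 5.0

print("MAE  A:", mae(forecast_a, prices), " B:", mae(forecast_b, prices))
print("Cost A:", procurement_cost(forecast_a, prices),
      " Cost B:", procurement_cost(forecast_b, prices))
```

Model B would be rejected by an L1/L2 backtest, yet it answers the purchase-planning question at least as well as model A, which is exactly why the loss function must reflect the operational objective.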
A superior modeling approach also lets you track the influence of explanatory variables and model parameters (e.g. training window size, prediction horizon) over time.
Your commodity procurement optimization solution could use a sequential and linear combination of multiple predictive models, where each model is specific to: (i) a group of homogeneous variables (e.g. commodity prices, weather forecasts), and (ii) a structuring parameter value (e.g. training window = 1 day, 1 week, 2 weeks).
The relative weight of each model in the combination thus stands for the influence of the corresponding group of variables or parameter value, with the following benefits:
You could find, for instance, that a short rolling training window best optimizes your loss function, meaning that recent observations of those variables are the most relevant.
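The combination described above can be sketched with an exponentially weighted aggregation of expert models. The expert names and learning rate below are illustrative assumptions: each expert stands for a group of variables or a parameter value, and experts with larger errors lose weight over time, so the final weights read directly as influence.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 100
actual = 200 + 10 * np.sin(np.linspace(0, 10, T))

# Hypothetical expert forecasts, each specific to a variable group or
# training-window size (names are illustrative, not Datapred's API).
experts = {
    "prices_window_1d":  actual + rng.normal(0, 1, T),
    "prices_window_2w":  actual + rng.normal(0, 3, T),
    "weather_forecasts": actual + rng.normal(0, 5, T),
}

eta = 0.1                                     # learning rate (assumed)
weights = {name: 1.0 / len(experts) for name in experts}

for t in range(T):
    # Linear combination of the experts at time t.
    combined = sum(w * experts[n][t] for n, w in weights.items())
    # Exponentially weighted update: experts with larger squared error
    # lose weight, so the weights track each group's influence over time.
    for name in experts:
        loss = (experts[name][t] - actual[t]) ** 2
        weights[name] *= np.exp(-eta * loss)
    total = sum(weights.values())              # renormalize each step
    weights = {n: w / total for n, w in weights.items()}

print({n: round(w, 3) for n, w in weights.items()})
```

After a few dozen steps, the weight of the most accurate expert dominates, which is the "influence display" benefit: a large weight on the short-window price expert tells you that recent price observations drive the loss.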
You know the famous quote about unknown unknowns:

> There are known knowns; there are things we know we know. We also know there are known unknowns; that is to say, we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know. (Donald Rumsfeld, 2002)
Industrial life is full of unknown unknowns. By definition, they are not in your historical data, so the only way to prepare for them is to watch how your model reacts to extreme values or totally new circumstances.
Practically, this means you should backtest your model with varying variables, parameters and operational costs/constraints.
Datapred data scientists use two types of tests for unknown unknowns: robustness tests and plausibility tests.
For a robustness test, you could assume that purchase orders based on your commodity procurement solution are executed with a significant delay, and check whether model performance holds up. For a plausibility test, you could inject new values for a key explanatory variable, and ask domain experts whether the corresponding results are realistic.
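The two tests can be sketched as follows. The buying policy, the three-day execution delay and the price spike below are all illustrative assumptions: the point is simply to re-run the same backtest under perturbed operational constraints and perturbed inputs, then compare the resulting costs.

```python
import numpy as np

rng = np.random.default_rng(2)
prices = 200 + np.cumsum(rng.normal(0, 2, size=60))

def policy_cost(prices, execution_lag=0, demand=10.0):
    """Average unit cost of a naive policy that buys when today's price
    is below its 20-day trailing mean; `execution_lag` simulates orders
    that are executed several days late."""
    cost, orders = 0.0, 0
    for t in range(20, len(prices) - execution_lag):
        if prices[t] < np.mean(prices[t - 20:t]):
            cost += demand * prices[t + execution_lag]  # executed late
            orders += 1
    return cost / max(orders, 1)

baseline = policy_cost(prices)

# Robustness test: does performance hold up if orders execute 3 days late?
delayed = policy_cost(prices, execution_lag=3)

# Plausibility test: inject an extreme price spike never seen in the
# history, and let domain experts judge whether the behavior is realistic.
shocked = prices.copy()
shocked[40:45] *= 1.5
spiked = policy_cost(shocked)

print("baseline:", baseline, " delayed:", delayed, " spiked:", spiked)
```

The interesting output is not any single number but the gaps between them: a large degradation under delay or an absurd reaction to the spike is exactly the kind of fragility these tests are designed to surface before an unknown unknown does.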
***
Datapred Explore is designed for data scientists with projects requiring foolproof time series modeling. Contact us for more information or a discussion of how Datapred could help. You can also check this page for a list of time series modeling resources.