Published On: February 23, 2024

Introduction

As machine learning models grow in sophistication, their increased complexity often reduces their interpretability. Standard methods of assessing model performance, such as mean squared error and mean absolute error for regression tasks, often overlook the nuanced insights that a more comprehensive evaluation could provide. This is particularly true for complex tasks like time series forecasting, where a single accuracy number rarely captures the full story.

Trusting a model without understanding its inner workings can lead to real-world mistakes and potentially damaging results. Thankfully, the importance of interpretability in machine learning has gained significant attention in recent years, and with it, the need for more robust evaluation methods. In this blog post, we delve into the value of detailed visualization techniques such as plots and graphs for assessing the performance of multivariate time series models. By exploring evaluation metrics designed specifically for time series forecasting, we shed light on local interpretability, allowing for more informed decision-making.

What are we trying to do?

When forecasting occupancy in larger areas, there are several factors to consider. Building a forecasting system means weighing all the external parameters that can (or can’t) be added as variables. There are some typical variables that we can account for …

Some data challenges along the way

Firstly, seasonality plays a significant role with multiple cycles affecting occupancy rates. In public spaces, the number of visitors is closely related to the seasons and weather conditions.

Secondly, the quality and quantity of data can vary depending on the time of day, with some periods potentially having more and better data than others. This is also heavily impacted by the measurement systems that are in place (e.g. computer vision systems might have shortcomings in the dark, while systems based on radio waves might be more robust but less precise).

Lastly, there can be more variability in occupancy at specific parts of the day. The challenge lies in effectively measuring all these aspects to ensure accurate predictions. Unexpected events can also have a big impact on the consistency of the measurements.

To mitigate these effects and improve the accuracy of occupancy predictions, several methods can be employed:

  • Seasonality Adjustment: To handle the impact of seasonality, methods such as time series decomposition can be employed. This involves breaking down a time series into its components (trend, seasonal, and residual). Seasonal adjustment can help in identifying the underlying trends more accurately (a minimal sketch of the first three techniques follows this list).
  • Data Aggregation: For times of the day where data is sparse, one solution could be to aggregate the data over larger time intervals. For example, instead of considering hourly data, one could aggregate data at a daily level. This can help in making the data more robust and less prone to fluctuations.
  • Smoothing Techniques: To handle the variability in specific parts of the day, smoothing techniques such as moving averages or exponential smoothing can be used. These techniques can help in reducing the noise and uncovering the underlying patterns in the data.
  • Utilizing Advanced Forecasting Models: Today there are quite a few tools and frameworks for time series forecasting, ranging from univariate approaches (like ARIMA models) to extended multivariate models that can account for many interaction effects (like custom LSTM models or other architectures).
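To make the first three techniques more concrete, here is a minimal Python sketch using pandas and statsmodels. The occupancy series is synthetic, and the window sizes and daily period are illustrative assumptions, not the values used in our project.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic hourly occupancy with a daily cycle plus noise (placeholder data).
idx = pd.date_range("2023-01-01", periods=24 * 60, freq="h")
rng = np.random.default_rng(0)
occupancy = pd.Series(
    100 + 30 * np.sin(2 * np.pi * np.arange(len(idx)) / 24)
    + rng.normal(0, 5, len(idx)),
    index=idx,
)

# 1. Seasonality adjustment: split into trend, seasonal, and residual parts;
#    period=24 assumes a daily cycle on hourly data.
decomposition = seasonal_decompose(occupancy, model="additive", period=24)
deseasonalized = occupancy - decomposition.seasonal

# 2. Data aggregation: resample sparse hourly data to daily means.
daily = occupancy.resample("D").mean()

# 3. Smoothing: a moving average and exponential smoothing to dampen noise.
moving_avg = occupancy.rolling(window=6, center=True).mean()
exp_smooth = occupancy.ewm(span=6).mean()
```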

So, long story short: there are many possibilities, but the key component remains having proper evaluation methods that assess a model’s usability in the business context in which it is supposed to run.

The model

For the current case, we eventually went with an LSTM model with over 50 input features across 24 time steps. The picture below shows a quick overview of the model input:

This model primarily uses data from occupation metrics in and around the city, such as parking occupancy, high-level telecom information, and other specific occupancy measurement tools. It also considers contextual information about events that may affect occupancy levels.

While these input features do not cover all city activities, they offer a comprehensive overview of the current situation.
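As a rough illustration of this shape, the Keras sketch below builds an LSTM taking 24 timesteps of roughly 50 features. The layer width and forecast horizon are assumptions made for illustration, not the exact architecture we used.

```python
import tensorflow as tf

N_TIMESTEPS = 24   # hours of history fed to the model
N_FEATURES = 50    # parking, telecom, occupancy, event context, ...
HORIZON = 24       # assumed: number of hours predicted ahead

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(N_TIMESTEPS, N_FEATURES)),
    tf.keras.layers.LSTM(64),        # assumed layer width
    tf.keras.layers.Dense(HORIZON),  # one output per forecasted hour
])
model.compile(optimizer="adam", loss="mae")
model.summary()
```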

Seeing is believing

In assessing the complexity of the forecasting model, it’s crucial to examine its performance at different parts of the day. This is particularly important during critical periods such as mornings and around noon, when traffic conditions need to be adapted in time and staffing in shops and restaurants needs to be optimized for lunch (and other shopping hours).

Next to that, we also wanted to know how far into the future we can predict. The number of timesteps we can predict ahead has an immediate impact on the use cases this model can enable.

How far can (and should) we predict?

To compare the accuracy across different forecasting horizons, we trained 24 models, forecasting 1 to 24 timesteps into the future. We stopped training at 20 epochs to get an early indication of where the accuracy was heading. On the test set we used the best_model_weights.

When plotting the Mean Absolute Percentage Error (MAPE) in this manner, we observe a gradually increasing MAPE, which shows that accuracy degrades as the forecasting horizon increases. This makes sense and is what we expected. Nevertheless, we still observed considerable variability in the MAPE, which suggests the model didn’t entirely capture the complexity. This is also strongly related to the fact that we stopped training after 20 epochs, independently of the learning curve of the model.
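A sketch of this horizon experiment is shown below: one model per forecast horizon, each trained for 20 epochs with the best weights restored before evaluation. The make_dataset and build_model functions are hypothetical stand-ins for our real pipeline, here backed by toy synthetic data so the script runs on its own.

```python
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

def mape(y_true, y_pred):
    # Mean Absolute Percentage Error, in percent.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def make_dataset(horizon, n=400, steps=24, feats=50):
    # Hypothetical stand-in for the real feature pipeline: random windows
    # with a toy target that gets noisier as the horizon grows.
    rng = np.random.default_rng(horizon)
    X = rng.normal(size=(n, steps, feats)).astype("float32")
    y = 100 + 10 * X[:, -1, 0] + rng.normal(0, horizon, n).astype("float32")
    split = int(0.8 * n)
    return X[:split], y[:split], X[split:], y[split:]

def build_model(steps=24, feats=50):
    # Hypothetical architecture; the real model is not reproduced here.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(steps, feats)),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mae")
    return model

horizon_mape = []
for horizon in range(1, 25):
    X_train, y_train, X_test, y_test = make_dataset(horizon)
    model = build_model()
    checkpoint = tf.keras.callbacks.ModelCheckpoint(
        "best_model_weights.weights.h5",
        save_weights_only=True, save_best_only=True, monitor="val_loss")
    model.fit(X_train, y_train, epochs=20, validation_split=0.2,
              callbacks=[checkpoint], verbose=0)
    model.load_weights("best_model_weights.weights.h5")
    y_pred = model.predict(X_test, verbose=0).ravel()
    horizon_mape.append(mape(y_test, y_pred))

plt.plot(range(1, 25), horizon_mape, marker="o")
plt.xlabel("Forecast horizon (timesteps ahead)")
plt.ylabel("MAPE (%)")
plt.title("Accuracy per forecasting horizon")
plt.show()
```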

Are we able to predict accurately at all times?

We also wanted to know whether we performed better at some times of the day. In a somewhat different fashion than the tests above, we trained a model to predict the upcoming 24 hours of the day.

To evaluate our performance throughout the day, we combined the hour at which we made the prediction (x-axis), the number of hours into the future we predicted relative to that hour (y-axis), and the Mean Absolute Percentage Error (colour).
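A minimal sketch of how such a heatmap could be produced is shown below. The errors array here is a random placeholder; in practice each cell would hold the test-set MAPE for one (prediction hour, hours ahead) combination.

```python
import numpy as np
import matplotlib.pyplot as plt

hours = np.arange(24)        # hour of day at which the prediction is made
horizons = np.arange(1, 25)  # hours predicted into the future

# Placeholder MAPE values; replace with the real per-cell evaluation results.
rng = np.random.default_rng(42)
errors = rng.uniform(5, 25, size=(horizons.size, hours.size))

fig, ax = plt.subplots(figsize=(8, 5))
im = ax.imshow(errors, origin="lower", aspect="auto",
               extent=[hours[0] - 0.5, hours[-1] + 0.5,
                       horizons[0] - 0.5, horizons[-1] + 0.5])
ax.set_xlabel("Hour of day at prediction time")
ax.set_ylabel("Hours predicted into the future")
fig.colorbar(im, ax=ax, label="MAPE (%)")
plt.show()
```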

The plot above led us to conclude that predictions made at the beginning of the day are somewhat better for the first half of the day. This is quite useful for managing crowdedness in the morning.

Next to that, we observed that predictions made at noon are less accurate for the first hours of the afternoon, with accuracy improving later in the afternoon.

Lessons Learned

Models are not only for predictions

It’s not uncommon for ML engineers to jump straight to model development to solve a problem (yes, let’s admit it). However, exploring different model flavours and set-ups is an important part of capturing and evaluating what is possible with the data at hand. So take your time to develop different model architectures in light of the (business) evaluations you want to do.

No one-size-fits-all

It’s generally known that neural networks are incredibly good at solving well-defined, narrow problems (despite the perceived general problem-solving capabilities of Large Language Models, we’re not quite at the level of AGI yet).

So this also goes for this use case: it’s less effective to have one model that forecasts the upcoming 24 hours and to build multiple use cases on top of it. Instead, we opted to build dedicated models per specific use case (e.g. morning shift predictions, lunch-staffing predictions, …).

Data quality is key

We still observed considerable variance in our test results, which shows that our model(s) didn’t (yet) capture all the complexity. Next to that, we observed systematically lower accuracy of the measurements during the night. This was (partly) resolved by moving to multivariate models (the LSTM with the features shown above), which could combine the strengths of the different measurements at hand. Once again, we learned that data quality is key in building proper data solutions.