Scaling-aware rating of count forecasts

Paper title

Scaling-aware rating of count forecasts

Authors

Tichy, Malte C., Babounikau, Illia, Wolke, Nikolas, Ulbrich, Stefan, Feindt, Michael

Abstract

Forecast quality should be assessed in the context of what is possible in theory and what is reasonable to expect in practice. Often, one can identify an approximate upper bound to a probabilistic forecast's sharpness, which sets a lower, not necessarily achievable, limit to error metrics. In retail forecasting, a simple but often unconquerable sharpness limit is given by the Poisson distribution. When evaluating forecasts using traditional metrics such as Mean Absolute Error, it is hard to judge whether a certain achieved value reflects unavoidable Poisson noise or truly indicates an overdispersed prediction model. Moreover, every evaluation metric suffers from precision scaling: perhaps surprisingly, the metric's value is mostly defined by the selling rate and by the resulting rate-dependent Poisson noise, and only secondarily by the forecast quality. For any metric, comparing two groups of forecasted products often yields "the slow movers are performing worse than the fast movers" or vice versa, which is the naïve scaling trap. To distill the intrinsic quality of a forecast, we stratify predictions into buckets of approximately equal predicted value and evaluate metrics separately per bucket. By comparing the achieved value per bucket to benchmarks, we obtain an intuitive visualization of forecast quality, which can be summarized into a single rating that makes forecast quality comparable among different products or even industries. The resulting scaling-aware forecast rating is applied to forecasting models used on the M5 competition dataset as well as to real-life forecasts provided by Blue Yonder's Demand Edge for Retail solution for grocery products in Sainsbury's supermarkets in the United Kingdom. The results permit a clear interpretation and high-level understanding of model quality by non-experts.
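The bucketing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `bucketed_poisson_rating`, the log-spaced bucket edges, the Monte Carlo Poisson benchmark, and the plain achieved-to-benchmark ratio are all assumptions made here for illustration, and the abstract does not specify the paper's actual benchmarks or rating formula. The sketch assumes every predicted value is strictly positive.

```python
import numpy as np


def bucketed_poisson_rating(predicted, actual, n_buckets=5, n_sim=200, seed=0):
    """Compare achieved MAE per forecast bucket to a simulated Poisson benchmark.

    Hypothetical helper for illustration only; assumes all predicted values > 0.
    """
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    rng = np.random.default_rng(seed)

    # Log-spaced edges, so each bucket holds forecasts of similar magnitude.
    edges = np.geomspace(predicted.min(), predicted.max(), n_buckets + 1)
    # Interior edges only: digitize then returns bucket indices 0..n_buckets-1.
    bucket = np.digitize(predicted, edges[1:-1])

    rows = []
    for b in range(n_buckets):
        mask = bucket == b
        if not mask.any():
            continue
        rates = predicted[mask]
        achieved = np.mean(np.abs(actual[mask] - rates))
        # Benchmark: the MAE that would result if the actuals were pure
        # Poisson draws around the predicted rates (noise only, no model error).
        simulated = rng.poisson(rates, size=(n_sim, rates.size))
        benchmark = np.mean(np.abs(simulated - rates))
        rows.append({
            "bucket": b,
            "mean_prediction": float(rates.mean()),
            "achieved_mae": float(achieved),
            "poisson_mae": float(benchmark),
            "ratio": float(achieved / benchmark),
        })
    return rows


# Synthetic data: predictions spread log-uniformly over slow and fast movers,
# with actual sales drawn as Poisson noise around them.
rng = np.random.default_rng(42)
predicted = np.exp(rng.uniform(np.log(0.1), np.log(50.0), size=5000))
actual = rng.poisson(predicted)

for row in bucketed_poisson_rating(predicted, actual):
    print(row)
```

On such synthetic data the raw MAE grows with the predicted rate even though the forecast is equally good everywhere, which is the scaling effect the abstract warns about, while the per-bucket ratio to the Poisson benchmark stays near one. A ratio well above one in a bucket would point to overdispersion relative to Poisson at that sales rate.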
