Paper Title
探索高质量视频框架插值的运动歧义和对齐
Exploring Motion Ambiguity and Alignment for High-Quality Video Frame Interpolation
Paper Authors
Abstract
For video frame interpolation (VFI), existing deep-learning-based approaches strongly rely on the ground-truth (GT) intermediate frames, which sometimes ignores the non-unique nature of the motion implied by the given adjacent frames. As a result, these methods tend to produce averaged solutions that are not sharp enough. To alleviate this issue, we propose to relax the requirement of reconstructing an intermediate frame as close to the GT as possible. To this end, we develop a texture consistency loss (TCL) based on the assumption that interpolated content should maintain structures similar to those of its counterparts in the given frames. Predictions satisfying this constraint are encouraged, even though they may differ from the pre-defined GT. Without bells and whistles, our plug-and-play TCL improves the performance of existing VFI frameworks. In addition, previous methods usually adopt a cost volume or correlation map to achieve more accurate image/feature warping. However, the O(N^2) computational complexity (where N refers to the pixel count) makes them infeasible for high-resolution cases. In this work, we design a simple, efficient (O(N)), yet powerful cross-scale pyramid alignment (CSPA) module, in which multi-scale information is fully exploited. Extensive experiments demonstrate the efficiency and effectiveness of the proposed strategies.
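To make the texture-consistency idea concrete, the following is a minimal hypothetical sketch (not the paper's implementation; the patch size and L2 matching criterion are assumptions): each patch of the predicted frame is penalised by its distance to the nearest-neighbour patch drawn from the two given input frames, rather than its distance to a fixed GT frame, so any prediction whose textures appear in the inputs incurs low loss.

```python
import numpy as np

def texture_consistency_loss(pred, frame0, frame1, patch=4):
    """Hypothetical sketch of a texture-consistency-style loss.

    For every non-overlapping patch of the predicted intermediate frame,
    find its nearest-neighbour patch (L2 distance) among the patches of
    the two given input frames, and penalise the distance to that best
    match instead of the distance to a pre-defined ground-truth frame.
    Inputs are 2-D grayscale arrays; patch size is an assumed choice.
    """
    def patches(img):
        h, w = img.shape
        return np.array([
            img[i:i + patch, j:j + patch].ravel()
            for i in range(0, h - patch + 1, patch)
            for j in range(0, w - patch + 1, patch)
        ])

    candidates = np.concatenate([patches(frame0), patches(frame1)])
    pred_patches = patches(pred)
    loss = 0.0
    for p in pred_patches:
        # Squared L2 distance from this predicted patch to every candidate;
        # the best match may come from either input frame.
        d = np.sum((candidates - p) ** 2, axis=1)
        loss += d.min()
    return loss / len(pred_patches)
```

Under this loss, a prediction that exactly copies textures from either input frame scores zero even if it differs from the GT, which is the relaxation the abstract describes.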
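The O(N^2) bottleneck of cost-volume-based warping can be illustrated with a toy example (this is only an illustration of the complexity argument, not the CSPA module): correlating every source-pixel feature with every target-pixel feature produces an N x N matrix, which is prohibitive at high resolution (for a 1080p frame, N is about 2 million, so N^2 is on the order of 10^12 entries).

```python
import numpy as np

def full_correlation(feat_a, feat_b):
    """Toy full correlation map between two feature sets.

    feat_a, feat_b: (N, C) arrays of per-pixel feature vectors.
    Returns an (N, N) matrix whose (i, j) entry is the dot product of
    pixel i's feature in feat_a with pixel j's feature in feat_b --
    quadratic in the pixel count N in both memory and compute.
    """
    return feat_a @ feat_b.T

# Small example: 16 "pixels" with 8-dimensional features.
rng = np.random.default_rng(0)
corr = full_correlation(rng.random((16, 8)), rng.random((16, 8)))
# corr.shape == (16, 16): doubling the pixel count quadruples the map.
```

An O(N)-per-scale alternative, as in pyramid-style alignment, restricts matching to a fixed local window at each scale, so the cost grows linearly with the pixel count.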