Article

Fast-MFQE: A Fast Approach for Multi-Frame Quality Enhancement on Compressed Video

Kemi Chen, Jing Chen, Huanqiang Zeng and Xueyuan Shen
1 College of Information Science and Engineering, Huaqiao University, Xiamen 361021, China
2 College of Engineering, Huaqiao University, Quanzhou 362021, China
* Author to whom correspondence should be addressed.
Sensors 2023, 23(16), 7227; https://doi.org/10.3390/s23167227
Submission received: 18 July 2023 / Revised: 11 August 2023 / Accepted: 12 August 2023 / Published: 17 August 2023
(This article belongs to the Special Issue Machine Learning Based 2D/3D Sensors Data Understanding and Analysis)

Abstract

For compressed images and videos, quality enhancement is essential. Although deep learning has brought remarkable progress, deep models are often too large for real-time tasks. Therefore, a fast multi-frame quality enhancement method for compressed video, named Fast-MFQE, is proposed to meet the requirements of real-time video-quality enhancement. The method comprises three main modules. The first is the image pre-processing building (IPPB) module, which reduces redundant information in the input frames. The second is the spatio-temporal fusion attention (STFA) module, which effectively merges the temporal and spatial information of the input video frames. The third is the feature reconstruction network (FRN), which reconstructs and enhances the fused spatio-temporal information. Experimental results demonstrate that the proposed method outperforms state-of-the-art methods in terms of parameter count, inference speed, and quality enhancement performance. Even at a resolution of 1080p, Fast-MFQE achieves an inference speed of over 25 frames per second, while providing an average PSNR increase of 19.6% at QP = 37.

1. Introduction

Nowadays, a surplus of ultra-high-definition (UHD) videos is available for online viewing, imposing substantial strain on communication bandwidth. To transmit videos within limited network bandwidth, video compression is vital for reducing the bit rate. However, highly efficient video coding standards, such as H.264/AVC [1] and H.265/HEVC [2], introduce artifacts through their de-correlation and predictive coding techniques, degrading the quality of the video to some extent [3]. As illustrated in Figure 1, after being transmitted over a low-bandwidth channel, the reconstructed video is of low quality, and artifacts (i.e., blurring, ringing, blocking, and motion distortion) are obvious. When such videos are used for subsequent visual tasks, such as object recognition, object detection, and object tracking, the low quality dramatically degrades performance [4,5]. Therefore, quality enhancement for compressed video is critical for video applications and has emerged as an important area of research.
For image or single-frame quality enhancement, traditional methods [6,7,8,9,10,11] aimed to enhance the quality of compressed JPEG images by optimizing the transform coefficients of a specific compression standard. Specifically, refs. [8,9] proposed Shaped-Adaptive DCT (SADCT) and Regression Tree Fields (RTF) to reduce JPEG image blocking artifacts, respectively. Nevertheless, it is challenging to apply these methods to other compression tasks due to the limited generalization ability. With the advance of the deep learning method, an expanding range of methods have embraced convolutional neural network (CNN) approaches [12,13,14,15,16] to improve the compressed image quality. In [12], a four-layer AR-CNN was first introduced to deal with various artifacts in JPEG images. Based on this, Zhang et al. [13] proposed a deep DnCNN for multi-image restoration. Then, based on residual non-local attention, a method named RNAN [17] was proposed to eliminate the image noise. Subsequent methodologies included the use of recursive units and gate units to remove JPEG artifacts [18], as well as the implementation of a dual-stream multi-path recursive residual network [19]. Later on, Lin et al. [20] proposed a multiscale image fusion approach to remove JPEG artifacts effectively, and achieved exceptional objective quality. But these methods cannot be extended to compressed video directly, since they treat frames independently and thus fail to exploit temporal information.
To enhance the quality of compressed video, a 10-layer CNN-based auto-decoder (DCAD) [21] was the first work to mitigate distortion in compressed videos. In [22], two sub-networks of DS-CNN were introduced to address intra-frame and inter-frame artifacts, respectively. These works mainly enhance the target frame by leveraging the spatial correlation within video frames. There are also many multi-frame compressed video enhancement methods [23,24,25,26]. Yang et al. [23] introduced a multi-frame quality enhancement network, named MFQE1.0, which leveraged adjacent high-quality frames to enhance the target frame; MFQE2.0 [24] is its improved version. QG-ConvLSTM [25] then employed bidirectional recurrent convolution to capture long-range temporal information. Based on deformable convolution (DCN), Deng et al. [26] introduced spatio-temporal deformable convolution (STDF), which extracts temporal information from multiple frames by expanding the input to 7 or even 9 frames. These methods primarily enhance the target frame by exploiting the temporal relationships among multiple video frames. In general, multi-frame compressed video enhancement tends to achieve better results than single-frame enhancement because it utilizes richer temporal and spatial information. However, existing methods for compressed video enhancement face the following challenges:
(1) The network parameters are excessively large, which hampers training efficiency and real-time deployment.
(2) Existing methods tend to prioritize enhancement quality at the expense of inference speed.
Therefore, it is necessary to explore lightweight and high-performance models for compressed video quality enhancement.
The term “lightweight model” refers to compressing the model size to maximize computational speed while preserving accuracy. Researchers have paid increasing attention to developing lightweight models in the field of image classification to enable deployment on mobile devices [14,15,16,27,28]. Among the pioneering efforts, SqueezeNet [29] replaced 3 × 3 convolutions with 1 × 1 convolutions, reducing the parameters to approximately one-fiftieth of AlexNet [30]. Subsequently, Xception [31] further reduced the parameters by decoupling the Inception structure [32]. ResNeXt [33] introduced group convolutions and reduced the parameters by effectively integrating the residual network and the Inception structure [32]. In 2017, Google introduced MobileNet [34], which pioneered depthwise separable convolution (DSC) to effectively reduce the parameters of neural networks. MobileNet V2 [35] then surpassed previous performance benchmarks by adopting inverted residual structures and linear bottleneck layers. After that, ShuffleNet [36,37] reduced the model parameters by employing group convolutions and enabling inter-channel interaction via channel shuffling. More recently, Huang et al. introduced CondenseNet [38], which combines model pruning and group convolutions to effectively reduce the number of model parameters. However, these methods have not been applied to video quality enhancement (VQE).
To enhance the quality of compressed video and achieve superior inference performance, an end-to-end CNN-based method for VQE task, named Fast-MFQE, is proposed. The main contributions of the method are as follows:
(1) A novel IPPB module is designed to reduce multi-frame information redundancy and speed up inference;
(2) The STFA and FRN modules are proposed to effectively extract temporal features and multi-frame correlations.
More intuitively, the parameters and performance of diverse VQE methods are shown in Figure 2. Compared with state-of-the-art VQE methods, the proposed Fast-MFQE has fewer parameters and faster inference, while also improving the quality of the compressed video in terms of ΔPSNR and ΔSSIM (×10⁻²). The structure of the remaining sections is as follows: Section 2 provides a detailed exposition of the proposed method. Section 3 presents the experimental results and analyses the performance of the proposed method. Section 4 concludes the paper.

2. The Proposed Fast-MFQE

The architecture of the proposed Fast-MFQE is shown in Figure 3, where depthwise separable convolution (DSC) [34] is employed in place of traditional convolution to decrease the computational complexity and increase the inference speed of the network. The primary objective of Fast-MFQE is to generate an enhanced frame $\hat{O}_t$ that closely resembles the ground-truth frame in the pixel domain, where the ground truth refers to the original uncompressed video frame. To leverage temporal information from adjacent frames, Fast-MFQE takes the target frame $V_t$ and its neighboring frames $\{V_{t\pm n}\}_{n=1}^{N}$ as the network input. There are three main modules in Fast-MFQE; each is described in the following subsections.
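The paper does not provide reference code; as an illustration of the depthwise separable convolution that replaces standard convolution throughout the network, a minimal PyTorch sketch (layer names and channel widths are our own choices, not taken from Fast-MFQE) is given below.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution as in MobileNet [34]: a per-channel
    (depthwise) 3x3 convolution followed by a 1x1 pointwise convolution
    that mixes channels."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Rough parameter comparison against a standard 3x3 convolution.
std = nn.Conv2d(64, 64, 3, padding=1)
dsc = DepthwiseSeparableConv(64, 64)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(std), count(dsc))  # the DSC variant uses far fewer parameters
```

For a 64-channel 3 × 3 layer, the DSC variant needs roughly an eighth of the parameters of the standard convolution, which is the main source of the size and speed savings exploited here.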

2.1. Image Pre-Processing Building Modules (IPPB)

There are prevalent approaches in compressed video enhancement that utilize multiple video frames as input to effectively incorporate temporal information from diverse frames. However, these networks encounter challenges in achieving rapid inference when processing high-resolution video frames due to the substantial increase in computational complexity. Consequently, pre-processing of the input frames becomes imperative to facilitate fast inference in high-resolution video.
To leverage the information of the adjacent frames $\{V_{t\pm n}\}_{n=1}^{N}$ to enhance the target frame $V_t$, Fast-MFQE uses both $\{V_{t\pm n}\}_{n=1}^{N}$ and $V_t$ as network inputs. To decrease the data volume of the input frames and increase the model's inference speed, Fast-MFQE introduces the Image Pre-Processing Building modules (IPPB), inspired by [34,35]. As depicted in Figure 3, IPPB consists of two primary components: Mean Shift and Pixel Unshuffle.

2.1.1. Mean Shift

In general, video frames exhibit substantial spatial redundancy, and mitigating this redundancy can effectively reduce the input data. In a groundbreaking work, Zhang et al. [39] first introduced the Mean Shift operation to image super-resolution and presented the RCAN network, which achieved remarkable results. The Mean Shift operation normalizes the data by subtracting the statistical mean from each image sample, thereby emphasizing individual differences. Drawing inspiration from [37,39], Fast-MFQE employs the Mean Shift operation to diminish redundant information in images, which speeds up model training and consequently reduces the inference time.
Let Fast-MFQE take the adjacent frames $\{V_{t\pm n}\}_{n=1}^{N}$ and the target frame $V_t$ as the network input (n = 3), with the Mean Shift operation denoted as $MS(\cdot)$. Then, the feature after the Mean Shift operation, $F_{MS}$, can be expressed as:

$F_{MS} = MS\left(\{V_{t\pm n}\}_{n=1}^{N}, V_t\right)$    (1)
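A minimal sketch of such a Mean Shift layer is shown below, assuming (as in RCAN [39]) that it is a frozen 1 × 1 convolution subtracting a fixed per-channel mean; the mean values used here are illustrative, since the statistics used by Fast-MFQE are not specified.

```python
import torch
import torch.nn as nn

class MeanShift(nn.Conv2d):
    """1x1 convolution with frozen weights that subtracts (sign=-1) or adds
    back (sign=+1) a per-channel mean, in the style of RCAN [39]."""
    def __init__(self, rgb_mean=(0.448, 0.437, 0.404), sign=-1):
        super().__init__(3, 3, kernel_size=1)
        self.weight.data = torch.eye(3).view(3, 3, 1, 1)   # identity mixing
        self.bias.data = sign * torch.tensor(rgb_mean)     # assumed mean values
        for p in self.parameters():
            p.requires_grad = False

# F_MS = MS({V_{t±n}}, V_t): apply the same shift to every input frame.
sub_mean = MeanShift(sign=-1)
frames = torch.rand(3, 3, 128, 128)   # a stack of input frames (target plus neighbours)
f_ms = sub_mean(frames)
```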

2.1.2. Pixel Unshuffle

While reducing redundancy in individual video frames through mean shift operations is effective, downsampling the input data is necessary to further alleviate the computational burden on the network.
Inspired by the FFDnet network [40], Fast-MFQE employs a reversible downsample (R-Downsample) operation to divide the input frames into four sub-frames, aiming to reduce the input data volume within the network. This operation decreases the model’s computational cost and inference time while effectively preserving more detailed information. Consequently, it facilitates improved model performance and enhances the generalization ability. It is worth noting that the inverse operation of R-Downsample is denoted as R-Upsample.
Let the four sub-frames generated by the R-Downsample operation be denoted as $I_l$ ($l$ = 1, 2, 3, 4), and the R-Downsample operation as $RD(\cdot)$. Then, $I_l$ can be expressed as:

$I_l = RD(F_{MS}), \quad l = 1, 2, 3, 4$    (2)
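Assuming the R-Downsample matches the standard pixel-unshuffle operation that its name suggests, the following sketch shows how each frame is split into four half-resolution sub-frames and exactly recovered by the inverse (R-Upsample) operation.

```python
import torch
import torch.nn.functional as F

x = torch.rand(1, 3, 128, 128)                  # F_MS after the Mean Shift step

# R-Downsample: rearrange each 2x2 spatial block into the channel dimension,
# giving four half-resolution sub-frames stacked along the channels.
sub = F.pixel_unshuffle(x, downscale_factor=2)  # shape (1, 12, 64, 64)

# View as the four RGB sub-frames I_1..I_4 (one per 2x2 sampling phase).
I_l = sub.view(1, 3, 4, 64, 64).permute(0, 2, 1, 3, 4)   # (1, 4, 3, 64, 64)

# R-Upsample is the exact inverse, so the operation loses no information.
restored = F.pixel_shuffle(sub, upscale_factor=2)
assert torch.equal(restored, x)
```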

2.2. Spatio-Temporal Attention Fusion (STAF)

Spatio-temporal information of video frames is essential for quality enhancement. To better fuse image information from different temporal and spatial contexts, Fast-MFQE introduces the Spatio-Temporal Attention Fusion (STAF) module. This module uses two 3 × 3 convolutions to extract spatial information from the video frames; in parallel, temporal attention extraction captures the temporal characteristics of the frames, as described in Figure 4. The spatial and temporal information is then fused by concatenation along the channel dimension, and the concatenated features pass through a 1 × 1 convolutional layer. This process ensures effective integration of spatial and temporal information for compressed video enhancement. Let the spatial information extracted by the two 3 × 3 convolutions be denoted as $F_S$, the temporal information extracted by temporal attention as $F_T$, and the spatio-temporal feature fused by the 1 × 1 convolution as $F_{ST}$. These features are formulated as follows:

$F_S = Conv_3(Conv_3([I_1, I_2, I_3, I_4]))$    (3)

$F_T = TA([I_1, I_2, I_3, I_4])$    (4)

$F_{ST} = Conv_1([F_S, F_T])$    (5)

where $Conv_3(\cdot)$ and $Conv_1(\cdot)$ are the mapping functions of the 3 × 3 and 1 × 1 convolutions, respectively, $[\cdot, \cdot]$ denotes concatenation, and $TA(\cdot)$ is the mapping function of temporal attention.
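A rough PyTorch sketch of Equations (3)–(5) is given below. The internal structure of the temporal attention TA(·) is not detailed in the text, so a simple per-frame attention derived from globally pooled features stands in for it; channel widths are likewise assumptions.

```python
import torch
import torch.nn as nn

class STAF(nn.Module):
    """Sketch of the spatio-temporal attention fusion of Eqs. (3)-(5).
    The temporal-attention branch (per-frame weights from pooled features)
    is an illustrative stand-in, not the paper's exact design."""
    def __init__(self, num_frames=4, ch_per_frame=3, feat=32):
        super().__init__()
        c_in = num_frames * ch_per_frame
        # Spatial branch: two 3x3 convolutions, Eq. (3).
        self.conv3a = nn.Conv2d(c_in, feat, 3, padding=1)
        self.conv3b = nn.Conv2d(feat, feat, 3, padding=1)
        # Temporal branch: attention weights over the input frames, Eq. (4).
        self.att = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                 nn.Conv2d(c_in, num_frames, 1), nn.Sigmoid())
        self.num_frames, self.ch = num_frames, ch_per_frame
        # Fusion: 1x1 convolution over the concatenated branches, Eq. (5).
        self.fuse = nn.Conv2d(feat + c_in, feat, 1)

    def forward(self, frames):                    # frames: (B, T*C, H, W)
        f_s = self.conv3b(self.conv3a(frames))    # spatial features F_S
        w = self.att(frames)                      # (B, T, 1, 1) frame weights
        b, _, h, wd = frames.shape
        f_t = (frames.view(b, self.num_frames, self.ch, h, wd)
               * w.unsqueeze(2)).view(b, -1, h, wd)      # temporal features F_T
        return self.fuse(torch.cat([f_s, f_t], dim=1))   # fused F_ST

staf = STAF()
out = staf(torch.rand(2, 12, 64, 64))   # e.g. the four sub-frames I_1..I_4
print(out.shape)                        # torch.Size([2, 32, 64, 64])
```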

2.3. Feature Reconstruction Network (FRN)

To achieve precise reconstruction of video frames, the Feature Reconstruction Network (FRN) is introduced in Fast-MFQE. As shown in Figure 5, the FRN consists of dense residual blocks built on two operations: difference learning and residual learning. Difference learning captures high-frequency information in video frames by computing element-wise differences between feature maps, while residual learning learns diverse feature information by computing element-wise additions of feature maps. By incorporating both, relevant features are captured and integrated effectively, enabling the generation of refined video frames. Let $R_t$ denote the reconstructed information generated by the FRN and $V_t$ the target frame; the reconstructed frame, denoted as $R_t^{HQ}$, can be expressed as follows:

$R_t^{HQ} = R_t + V_t$    (6)
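The global residual connection of Equation (6) can be sketched as follows; the body of the network is a simplified stand-in for the dense residual/difference-learning blocks of Figure 5, not the authors' exact design.

```python
import torch
import torch.nn as nn

class FRNSketch(nn.Module):
    """Illustrative stand-in for the FRN: a few convolutional blocks predict
    a residual R_t, which Eq. (6) adds to the compressed target frame V_t."""
    def __init__(self, feat=32, out_ch=3, n_blocks=4):
        super().__init__()
        layers = []
        for _ in range(n_blocks):
            layers += [nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True)]
        self.body = nn.Sequential(*layers)
        self.tail = nn.Conv2d(feat, out_ch, 3, padding=1)

    def forward(self, f_st, v_t):
        r_t = self.tail(self.body(f_st))   # reconstructed residual information R_t
        return r_t + v_t                   # R_t^HQ = R_t + V_t  (Eq. (6))

frn = FRNSketch()
v_t = torch.rand(1, 3, 128, 128)           # compressed target frame
f_st = torch.rand(1, 32, 128, 128)         # fused spatio-temporal features
enhanced = frn(f_st, v_t)
```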

2.4. Loss Function

To encourage the enhanced frame $R_t^{HQ}$ to be as close as possible to the original uncompressed frame $V_{raw}$ in the pixel domain, Fast-MFQE adopts the mean squared error between $R_t^{HQ}$ and $V_{raw}$ as the loss function of the model, which is formulated as follows:

$L(\theta) = \frac{1}{HWC} \left\| R_t^{HQ} - V_{raw} \right\|_2^2$    (7)

where $H$, $W$, and $C$ denote the height, width, and number of channels of the frame under evaluation, respectively, and $\theta$ denotes the network parameters, which are learned via gradient descent [41] to minimize Equation (7). Thus, Fast-MFQE can be trained effectively to enhance the quality of compressed videos.
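In PyTorch, Equation (7) reduces to the standard mean squared error, for example:

```python
import torch
import torch.nn.functional as F

def fast_mfqe_loss(enhanced, raw):
    """MSE of Eq. (7): average squared error over all H*W*C pixel values
    between the enhanced frame R_t^HQ and the uncompressed frame V_raw."""
    return F.mse_loss(enhanced, raw)

enhanced = torch.rand(4, 3, 128, 128, requires_grad=True)
raw = torch.rand(4, 3, 128, 128)
loss = fast_mfqe_loss(enhanced, raw)
loss.backward()
```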

3. Experiments

In this section, the effectiveness of the proposed Fast-MFQE method is demonstrated by extensive experiments. The experimental settings are introduced in Section 3.1, and the performance comparisons of the Fast-MFQE method with state-of-the-art methods for JCT-VC testing sequences [2] are illustrated in Section 3.2.

3.1. Settings

3.1.1. Datasets

The Fast-MFQE model is trained using the dataset introduced in MFQE2.0 [24]. The dataset [24] is divided into two parts. First, 18 sequences from the Joint Collaborative Team on Video Coding (JCT-VC) [2] are commonly used as the test set. Second, the remaining 142 sequences are randomly split into non-overlapping training (106 sequences) and validation (36 sequences) sets. All 160 sequences are compressed using HM16.5 [1], the H.265/HEVC reference encoder, in the Low-Delay configuration with QP set to 27, 32, and 37. The results in Section 3.2 demonstrate the good generalization ability of the proposed method across these QP values.

3.1.2. Quality Enhancement Assessment Metrics

Extensive research [13,42,43,44] has been conducted to develop efficient and accurate methods for assessing the quality of images and video frames. The quality enhancement evaluation metric is used to measure the quality of distorted images or video frames by comparing the corresponding ground truth quantitatively using full-reference evaluation metrics. In this experiment, two widely used evaluation metrics, namely PSNR and SSIM [42], are employed.
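For reference, the ΔPSNR values reported later are the PSNR gain of the enhanced frame over the compressed frame, both measured against the uncompressed ground truth; a minimal helper (assuming 8-bit frames) could look like this:

```python
import numpy as np

def psnr(img, ref, peak=255.0):
    """Peak signal-to-noise ratio between two uint8 frames."""
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def delta_psnr(enhanced, compressed, ground_truth):
    """Delta-PSNR as reported in Table 1: gain of the enhanced frame over the
    compressed frame, both measured against the uncompressed ground truth."""
    return psnr(enhanced, ground_truth) - psnr(compressed, ground_truth)
```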

3.1.3. Parameter Settings

The basic settings and hyperparameters of the experiments are as follows. The Fast-MFQE model is implemented in the PyTorch framework and takes three frames (n = 3) as network input. The number of training iterations is 3 × 10⁵ and the mini-batch size is 32. Training frames are cropped to 128 × 128 patches. The learning rate is set to 1 × 10⁻⁴ and halved every 1 × 10⁵ iterations. Note that the above hyperparameters are tuned on the training set. The parameters of the Fast-MFQE network are updated with the Adam optimizer [41] until the network converges.
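The optimizer and schedule described above can be reproduced with a few lines of PyTorch; the model and data below are placeholders, and only the quoted hyperparameters are taken from the paper.

```python
import torch

# Hyperparameters quoted in Section 3.1.3 (model and data are stand-ins).
model = torch.nn.Conv2d(3, 3, 3, padding=1)          # placeholder for Fast-MFQE
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Halve the learning rate every 1e5 iterations over 3e5 iterations in total.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100_000, gamma=0.5)

for it in range(300_000):
    inputs = torch.rand(32, 3, 128, 128)             # 128x128 crops, mini-batch of 32
    target = torch.rand(32, 3, 128, 128)
    loss = torch.nn.functional.mse_loss(model(inputs), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                  # per-iteration schedule
    break                                             # single step shown for brevity
```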

3.2. Performance Comparison

3.2.1. Quantitative Comparison

In this section, the performance of the Fast-MFQE model is evaluated quantitatively using the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM), two widely used objective quality metrics. The Fast-MFQE model is compared with AR-CNN [12], DnCNN [13], RNAN [17], MFQE1.0 [23], and MFQE2.0 [24]. Among these methods, AR-CNN [12], DnCNN [13], and RNAN [17] enhance the quality of compressed images, MFQE1.0 [23] is the first method for multi-frame compressed video enhancement, and MFQE2.0 [24] is the most advanced method for enhancing the quality of compressed videos. To ensure a fair comparison, all of these methods are trained and tested on the same dataset.
Table 1 reports the ΔPSNR and ΔSSIM results, calculated between the enhanced and compressed frames and averaged over each test sequence. Note that ΔPSNR > 0 and ΔSSIM > 0 indicate an improvement in the objective quality of the compressed video. Compared with the most advanced compressed video enhancement method, MFQE2.0 [24], Fast-MFQE achieves an average increase of 19.6% in ΔPSNR and 8.2% in ΔSSIM at QP = 37, and an average increase of 12.2% in ΔPSNR and 14.2% in ΔSSIM at QP = 27, while offering a markedly faster inference speed. Overall, Fast-MFQE outperforms all compared methods in terms of objective quality enhancement.

3.2.2. Subjective Comparison

In this section, the subjective quality of Fast-MFQE is evaluated. As shown in Figure 6, video frames are enhanced for BasketballDrill at QP = 37, BlowingBubbles at QP = 32, and Cactus and Traffic at QP = 42. It can be observed that the proposed Fast-MFQE method produces sharper edges and more vivid details than the other methods. For example, the basketball in BasketballDrill, the face in BlowingBubbles, the words in Cactus, and the car in Traffic are restored with fine textures by Fast-MFQE, comparable to MFQE2.0 [24].

3.2.3. Comparison of Inference Performance

In this section, the inference capability and the degree of lightweighting of the Fast-MFQE model are quantitatively evaluated in terms of frame rate and number of parameters (Param). The Fast-MFQE model is compared with DnCNN [13], RNAN [17], MFQE1.0 [23], MFQE2.0 [24], and STDF [26]. It should be noted that STDF [26] is currently the most lightweight model used for compressed video enhancement. For fairness, all models are tested on the same configuration: an 11th Gen Intel(R) Core(TM) i7-11800H CPU @ 2.30 GHz, an Nvidia GeForce GTX 1080Ti GPU, and Ubuntu 20.04.
Table 2 reports the number of parameters (Param) and the frame rate of the different models. Fast-MFQE maintains a speed of over 25 frames per second at all tested resolutions; notably, even when processing 1080p video it reaches 25.7 frames per second, ensuring smooth, non-stuttering processing. Additionally, Fast-MFQE reduces the parameters by 33.4% compared to the current lightest model, STDF [26]. Overall, Fast-MFQE outperforms the compared methods in terms of inference speed and model size.
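Frame rates of this kind are usually obtained by timing GPU forward passes with explicit synchronization; the following sketch (not the authors' benchmarking script, and assuming a CUDA device is available) illustrates the procedure.

```python
import time
import torch

def frames_per_second(model, height, width, n_frames=100, device="cuda"):
    """Rough throughput measurement: average forward-pass time on random
    frames of the given resolution, with CUDA synchronization."""
    model = model.to(device).eval()
    x = torch.rand(1, 3, height, width, device=device)
    with torch.no_grad():
        for _ in range(10):                 # warm-up iterations
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(n_frames):
            model(x)
        torch.cuda.synchronize()
    return n_frames / (time.time() - start)

# e.g. frames_per_second(fast_mfqe, 1080, 1920) for the 1080p setting in Table 2.
```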

3.2.4. Ablation Studies

As shown in Table 3, the ablation study was performed at QP = 37. Although Model2 and Model3 significantly improve the inference speed, the enhancement performance, in terms of ΔPSNR and ΔSSIM, declines sharply. Therefore, to better balance enhancement quality and inference speed, Model1 is chosen; that is, all three modules, IPPB, STFA, and FRN, are necessary.

3.2.5. Perceptual Quality Comparison

In this section, the performance of Fast-MFQE is evaluated quantitatively using the learned perceptual image patch similarity (LPIPS) [45] and the perceptual index (PI) [46], two widely used perceptual quality assessment metrics. Table 4 reports the ΔLPIPS and ΔPI results, calculated between the enhanced and compressed frames and averaged over each test sequence. Note that ΔLPIPS < 0 and ΔPI < 0 indicate an improvement in perceptual quality. As shown in the table, Fast-MFQE is superior to all the compared methods in terms of perceptual quality enhancement.
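LPIPS can be computed with the reference implementation released by the authors of [45] (the lpips Python package); a sketch of the ΔLPIPS computation is shown below, with inputs assumed to be RGB tensors scaled to [−1, 1]. The PI metric, which combines Ma's score and NIQE, is omitted here.

```python
import torch
import lpips  # pip install lpips

# LPIPS expects RGB tensors scaled to [-1, 1]; lower scores mean closer
# perceptual similarity, so a negative delta (enhanced minus compressed,
# both measured against the ground truth) indicates an improvement.
loss_fn = lpips.LPIPS(net="alex")

def delta_lpips(enhanced, compressed, ground_truth):
    with torch.no_grad():
        return (loss_fn(enhanced, ground_truth)
                - loss_fn(compressed, ground_truth)).item()
```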

3.2.6. Subjective Quality and Inference Speed at Different Resolutions

This section examines the inference speed and enhancement quality of the proposed Fast-MFQE on videos of different resolutions, all tested at QP = 37. As shown in Figure 7, Figure 8, Figure 9, Figure 10 and Figure 11, the results demonstrate that Fast-MFQE performs fast inference with high quality across different resolutions.

4. Conclusions

This paper presents a fast multi-frame quality enhancement approach, named Fast-MFQE, which facilitates efficient model inference. The Fast-MFQE is the first lightweight model in the field of compressed video enhancement. Extensive experiments demonstrate that the Fast-MFQE outperforms previous methods in terms of its lightweight parameters, fast inference speed, and quality enhancement performance on benchmark datasets. Its remarkable attributes make it an ideal solution for real-time applications such as video streaming, video conferencing, and video surveillance, unlocking a range of possibilities in these domains.

Author Contributions

Conceptualization, K.C. and J.C.; writing—original draft preparation, K.C.; writing—review and editing, J.C., H.Z. and X.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Key R&D Program of China, under Grant 2021YFE0205400, in part by the National Natural Science Foundation of China under Grant 61976098, in part by the Natural Science Foundation for Outstanding Young Scholars of Fujian Province under Grant 2022J06023, in part by the Natural Science Foundation of Fujian Province under Grant 2022J01294, in part by the Key Science and Technology Project of Xiamen City under Grant 3502Z20231005, and in part by the Collaborative Innovation Platform Project of Fuzhou-Xiamen-Quanzhou National Independent Innovation Demonstration Zone under Grant 2021FX03.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Acknowledgments

We sincerely appreciate the anonymous reviewers’ critical comments and valuable suggestions for improving the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
IPPB   Image pre-processing building module
STFA   Spatio-temporal fusion attention
FRN    Feature reconstruction network
UHD    Ultra-high definition
QoE    Quality of experience
CNN    Convolutional neural network
PSNR   Peak signal-to-noise ratio
SSIM   Structural similarity index measure
DSC    Depthwise separable convolution

References

  1. Sullivan, G.J.; Ohm, J.-R.; Han, W.-J.; Wiegand, T. Overview of the High Efficiency Video Coding (HEVC) Standard. IEEE Trans. Circuits Syst. Video Technol. 2012, 12, 1649–1668. [Google Scholar] [CrossRef]
  2. Ohm, J.-R.; Sullivan, G.J.; Schwarz, H.; Tan, T.K.; Wiegand, T. Comparison of the coding efficiency of video coding standards—including high efficiency video coding (hevc). IEEE Trans. Circuits Syst. Video Technol. 2012, 12, 1669–1684. [Google Scholar]
  3. Li, S.; Xu, M.; Deng, X.; Wang, Z. Weight-based R-λ rate control for perceptual high efficiency video coding on conversational videos. Signal Process. Image Commun. 2015, 10, 127–140. [Google Scholar] [CrossRef]
  4. Lu, G.; Ouyang, W.; Xu, D.; Zhang, X.; Cai, C.; Gao, Z. An end-to-end deep video compression framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10998–11007. [Google Scholar]
  5. Galteri, L.; Seidenari, L.; Bertini, M.; Bimbo, A.D. Deep generative adversarial compression artifact removal. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4836–4845. [Google Scholar]
  6. Foi, A.; Katkovnik, V.; Egiazarian, K. Pointwise Shape-Adaptive DCT for High-Quality Denoising and Deblocking of Grayscale and Color Images. IEEE Trans. Image Process. 2007, 5, 1395–1411. [Google Scholar] [CrossRef]
  7. Zhang, X.; Xiong, R.; Fan, X.; Ma, S.; Gao, W. Compression artifact reduction by overlapped-block transform coefficient estimation with block similarity. IEEE Trans. Image Process. 2013, 12, 4613–4626. [Google Scholar] [CrossRef]
  8. Sheikh, H.R.; Bovik, A.C.; de Veciana, G. An information fidelity criterion for image quality assessment using natural scene statistics. IEEE Trans. Image Process. 2005, 11, 2117–2128. [Google Scholar] [CrossRef]
  9. Jancsary, J.; Nowozin, S.; Rother, C. Loss-specific training of non-parametric image restoration models: A new state of the art. In Proceedings of the 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 112–125. [Google Scholar]
  10. Jung, C.; Jiao, L.; Qi, H.; Sun, T. Image deblocking via sparse representation. Signal Process. Image Commun. 2012, 3, 663–677. [Google Scholar] [CrossRef]
  11. Chang, H.; Ng, M.K.; Zeng, T. Reducing artifacts in JPEG decompression via a learned dictionary. IEEE Trans. Signal Process. 2014, 2, 718–728. [Google Scholar] [CrossRef]
  12. Dong, C.; Deng, Y.; Loy, C.C.; Tang, X. Compression Artifacts Reduction by a Deep Convolutional Network. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 576–584. [Google Scholar]
  13. Zhang, K.; Zuo, W.; Chen, Y.; Meng, D.; Zhang, L. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 2017, 7, 3142–3155. [Google Scholar] [CrossRef]
  14. Han, W.; Zhao, B.; Luo, J. Towards Smaller and Stronger: An Edge-Aware Lightweight Segmentation Approach for Unmanned Surface Vehicles in Water Scenarios. Sensors 2023, 23, 4789. [Google Scholar] [CrossRef]
  15. Coates, W.; Wahlström, J. LEAN: Real-Time Analysis of Resistance Training Using Wearable Computing. Sensors 2023, 23, 4602. [Google Scholar] [CrossRef]
  16. Xiao, S.; Liu, Z.; Yan, Z.; Wang, M. Grad-MobileNet: A Gradient-Based Unsupervised Learning Method for Laser Welding Surface Defect Classification. Sensors 2023, 23, 4563. [Google Scholar] [CrossRef]
  17. Zhang, Y.; Li, K.; Li, K.; Zhong, B.; Fu, Y. Residual non-local attention networks for image restoration. arXiv 2019, arXiv:1903.10082. [Google Scholar]
  18. Tai, Y.; Yang, J.; Liu, X.; Xu, C. MemNet: A persistent memory network for image restoration. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4549–4557. [Google Scholar]
  19. Jin, Z.; Iqbal, M.Z.; Zou, W.; Li, X.; Steinbach, E. Dual-Stream Multi-Path Recursive Residual Network for JPEG Image Compression Artifacts Reduction. IEEE Trans. Circuits Syst. Video Technol. 2021, 2, 467–479. [Google Scholar] [CrossRef]
  20. Lin, M.-H.; Yeh, C.-H.; Lin, C.-H.; Huang, C.-H.; Kang, L.-W. Deep Multi-Scale Residual Learning-based Blocking Artifacts Reduction for Compressed Images. In Proceedings of the IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), Hsinchu, Taiwan, 18–20 March 2019; pp. 18–19. [Google Scholar]
  21. Wang, T.; Chen, M.; Chao, H. A novel deep learning-based method of improving coding efficiency from the decoder-end for high efficiency video coding. In Proceedings of the Data Compression Conference (DCC), Snowbird, UT, USA, 4–7 April 2017; pp. 410–419. [Google Scholar]
  22. Yang, R.; Xu, M.; Wang, Z. Decoder-side high efficiency video coding quality enhancement with scalable convolutional neural network. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Hong Kong, China, 10–14 July 2017; pp. 817–822. [Google Scholar]
  23. Yang, R.; Xu, M.; Wang, Z.; Li, T. Multi-frame quality enhancement for compressed video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 6664–6673. [Google Scholar]
  24. Guan, Z.; Xing, Q.; Xu, M.; Yang, R.; Liu, T.; Wang, Z. MFQE 2.0: A new approach for multi-frame quality enhancement on compressed video. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 946–963. [Google Scholar] [CrossRef] [PubMed]
  25. Yang, R.; Sun, X.; Xu, M.; Zeng, W. Quality-gated convolutional lstm for enhancing compressed video. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 532–537. [Google Scholar]
  26. Deng, J.; Wang, L.; Pu, S.; Zhuo, C. Spatio-temporal deformable convolution for compressed video quality enhancement. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 2374–3468. [Google Scholar]
  27. Zhang, T.; Zhang, Y.; Xin, M.; Liao, J.; Xie, Q. A Light-Weight Network for Small Insulator and Defect Detection Using UAV Imaging Based on Improved YOLOv5. Sensors 2023, 23, 5249. [Google Scholar] [CrossRef]
  28. Han, N.; Kim, I.-M.; So, J. Lightweight LSTM-Based Adaptive CQI Feedback Scheme for IoT Devices. Sensors 2023, 23, 4929. [Google Scholar] [CrossRef]
  29. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
  30. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  31. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. arXiv 2017, arXiv:1610.02357. [Google Scholar]
  32. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. arXiv 2014, arXiv:1409.4842. [Google Scholar]
  33. Xie, S.; Girshick, R.; Dollar, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5987–5995. [Google Scholar]
  34. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  35. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. arXiv 2019, arXiv:1801.04381. [Google Scholar]
  36. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar]
  37. Ma, N.; Zhang, X.; Zheng, H.-T.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. arXiv 2018, arXiv:1807.11164. [Google Scholar]
  38. Huang, G.; Liu, S.; van der Maaten, L.; Weinberger, K.Q. CondenseNet: An Efficient DenseNet Using Learned Group Convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2752–2761. [Google Scholar]
  39. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image Super-Resolution Using Very Deep Residual Channel Attention Networks. arXiv 2018, arXiv:1807.02758. [Google Scholar]
  40. Zhang, K.; Zuo, W.; Zhang, L. FFDNet: Toward a Fast and Flexible Solution for CNN based Image Denoising. IEEE Trans. Image Process. 2018, 9, 4608–4622. [Google Scholar] [CrossRef]
  41. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  42. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 4, 600–612. [Google Scholar] [CrossRef]
  43. Wang, Z.; Simoncelli, E.P.; Bovik, A.C. Multiscale structural similarity for image quality assessment. In Proceedings of the Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA, 9–12 November 2003; pp. 1398–1402. [Google Scholar]
  44. Sheikh, H.R.; Bovik, A.C. Image information and visual quality. IEEE Trans. Image Process. 2006, 2, 430–444. [Google Scholar] [CrossRef]
  45. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. arXiv 2018, arXiv:1801.03924. [Google Scholar]
  46. Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Change Loy, C.; Qiao, Y.; Tang, X. ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks. arXiv 2018, arXiv:1809.00219. [Google Scholar]
Figure 1. Artifacts of compressed video at 240p and QP = 37.
Figure 2. Inference speed and performance comparison of VQE methods.
Figure 3. Architecture of the proposed Fast-MFQE.
Figure 4. Architecture of Spatio-Temporal Attention Fusion (STAF).
Figure 5. Architecture of Feature Reconstruction Network (FRN).
Figure 6. Subjective quality comparison on BasketballDrill at QP = 37, BlowingBubbles at QP = 32, and Cactus and Traffic at QP = 42. The Fast-MFQE approach achieves a clearer result than the most advanced compressed video enhancement method, MFQE2.0 [24].
Figure 7. Subjective quality and inference speed at 240p.
Figure 8. Subjective quality and inference speed at 480p.
Figure 9. Subjective quality and inference speed at 720p.
Figure 10. Subjective quality and inference speed at 1080p.
Figure 11. Subjective quality and inference speed at 1600p.
Table 1. Overall comparison of ΔPSNR (dB) and ΔSSIM (×10⁻²) over the test sequences at three QPs. Each cell gives ΔPSNR/ΔSSIM; higher is better.

QP | Class | Sequence | AR-CNN [12] | DnCNN [13] | RNAN [17] | MFQE1.0 [23] | MFQE2.0 [24] | Fast-MFQE
37 | A | Traffic | 0.27/0.50 | 0.35/0.64 | 0.40/0.86 | 0.50/0.90 | 0.59/1.02 | 0.61/1.23
37 | A | PeopleOnStreet | 0.37/0.76 | 0.54/0.94 | 0.74/1.30 | 0.80/1.37 | 0.92/1.57 | 0.97/1.67
37 | B | Kimono | 0.20/0.59 | 0.27/0.73 | 0.33/0.98 | 0.50/1.13 | 0.55/1.18 | 0.66/1.23
37 | B | ParkScene | 0.14/0.44 | 0.17/0.52 | 0.20/0.77 | 0.39/1.03 | 0.46/1.23 | 0.53/1.33
37 | B | Cactus | 0.20/0.41 | 0.28/0.53 | 0.35/0.76 | 0.44/0.88 | 0.50/1.00 | 0.64/1.16
37 | B | BQTerrace | 0.23/0.43 | 0.33/0.53 | 0.42/0.84 | 0.27/0.48 | 0.40/0.67 | 0.52/0.86
37 | B | BasketballDrive | 0.23/0.51 | 0.33/0.63 | 0.43/0.92 | 0.41/0.80 | 0.47/0.83 | 0.74/0.91
37 | C | RaceHorses | 0.23/0.49 | 0.31/0.70 | 0.39/0.99 | 0.34/0.55 | 0.39/0.80 | 0.53/0.93
37 | C | BQMall | 0.28/0.69 | 0.38/0.87 | 0.45/1.15 | 0.51/1.03 | 0.62/1.20 | 0.72/1.23
37 | C | PartyScene | 0.14/0.52 | 0.22/0.69 | 0.30/0.98 | 0.22/0.73 | 0.36/1.18 | 0.44/1.31
37 | C | BasketballDrill | 0.23/0.48 | 0.42/0.89 | 0.50/1.07 | 0.48/0.90 | 0.58/1.20 | 0.63/1.26
37 | D | RaceHorses | 0.26/0.59 | 0.34/0.80 | 0.42/1.02 | 0.51/1.13 | 0.59/1.43 | 0.68/1.47
37 | D | BQSquare | 0.21/0.30 | 0.30/0.46 | 0.32/0.63 | -0.01/0.15 | 0.34/0.65 | 0.47/0.68
37 | D | BlowingBubbles | 0.16/0.46 | 0.25/0.76 | 0.31/1.08 | 0.39/1.20 | 0.53/1.70 | 0.61/1.89
37 | D | BasketballPass | 0.26/0.63 | 0.38/0.83 | 0.46/1.08 | 0.63/1.38 | 0.73/1.55 | 0.88/1.67
37 | E | FourPeople | 0.40/0.56 | 0.54/0.73 | 0.70/0.97 | 0.66/0.85 | 0.73/0.95 | 0.87/0.97
37 | E | Johnny | 0.24/0.21 | 0.47/0.54 | 0.56/0.88 | 0.55/0.55 | 0.60/0.68 | 0.71/0.73
37 | E | KristenAndSara | 0.41/0.47 | 0.59/0.62 | 0.63/0.80 | 0.66/0.75 | 0.75/0.85 | 0.86/0.88
37 | – | Average | 0.25/0.50 | 0.36/0.69 | 0.41/0.62 | 0.46/0.88 | 0.56/1.09 | 0.67/1.18
32 | – | Average | 0.19/0.17 | 0.33/0.41 | –/– | 0.43/0.58 | 0.52/0.68 | 0.63/0.69
27 | – | Average | 0.16/0.09 | 0.33/0.26 | –/– | 0.40/0.34 | 0.49/0.42 | 0.55/0.48
Table 2. Inference speed (frames/s) at different resolutions and number of parameters (Param).

Method | 120p | 240p | 480p | 720p | 1080p | Param (k)
DnCNN [13] | 191.8 | 54.7 | 14.1 | 6.1 | 2.6 | 556
RNAN [17] | 5.6 | 3.2 | 1.4 | 0.6 | 0.08 | 8957
MFQE1.0 [23] | 34.3 | 12.6 | 3.8 | 1.6 | 0.7 | 1788
MFQE2.0 [24] | 56.5 | 25.3 | 8.4 | 3.7 | 1.6 | 255
STDF [26] | 13.27 | 36.4 | 9.1 | 3.8 | 1.6 | 365
Fast-MFQE | 162.1 | 60.3 | 43.1 | 32.3 | 25.7 | 243
Table 3. Ablation studies at QP = 37.

Model | Model1 | Model2 | Model3
IPPB | Yes | No | No
STFA | Yes | Yes | No
FRN | Yes | Yes | Yes
ΔPSNR/ΔSSIM (×10⁻²) | 0.68/1.19 | 0.42/0.89 | 0.21/0.45
Inference speed (f/s) | 32.1 | 45.3 | 73.2
Table 4. Perceptual quality comparison. Each cell gives ΔLPIPS/ΔPI; lower is better.

QP | Class | Sequence | AR-CNN [12] | DnCNN [13] | RNAN [17] | MFQE1.0 [23] | MFQE2.0 [24] | Fast-MFQE
37 | A | Traffic | 0.028/0.720 | 0.027/0.653 | 0.026/0.644 | 0.027/0.593 | 0.023/0.572 | 0.018/0.569
37 | A | PeopleOnStreet | 0.029/0.726 | 0.028/0.631 | 0.029/0.643 | 0.026/0.631 | 0.019/0.539 | 0.017/0.520
37 | B | Kimono | 0.030/0.733 | 0.031/0.664 | 0.032/0.657 | 0.029/0.622 | 0.020/0.617 | 0.018/0.639
37 | B | ParkScene | 0.028/0.728 | 0.032/0.635 | 0.031/0.624 | 0.028/0.572 | 0.024/0.691 | 0.021/0.701
37 | B | Cactus | 0.031/0.699 | 0.026/0.586 | 0.025/0.590 | 0.027/0.630 | 0.030/0.616 | 0.027/0.593
37 | B | BQTerrace | 0.029/0.746 | 0.027/0.614 | 0.028/0.573 | 0.027/0.621 | 0.026/0.593 | 0.028/0.582
37 | B | BasketballDrive | 0.030/0.732 | 0.031/0.720 | 0.029/0.680 | 0.030/0.675 | 0.025/0.623 | 0.021/0.641
37 | C | RaceHorses | 0.027/0.751 | 0.026/0.709 | 0.025/0.716 | 0.031/0.695 | 0.018/0.641 | 0.019/0.611
37 | C | BQMall | 0.032/0.638 | 0.031/0.682 | 0.032/0.644 | 0.024/0.632 | 0.017/0.671 | 0.028/0.576
37 | C | PartyScene | 0.027/0.699 | 0.028/0.627 | 0.027/0.609 | 0.025/0.594 | 0.026/0.617 | 0.030/0.548
37 | C | BasketballDrill | 0.032/0.721 | 0.031/0.662 | 0.032/0.614 | 0.030/0.601 | 0.027/0.614 | 0.021/0.509
37 | D | RaceHorses | 0.030/0.758 | 0.029/0.631 | 0.028/0.629 | 0.027/0.622 | 0.030/0.601 | 0.019/0.627
37 | D | BQSquare | 0.029/0.771 | 0.029/0.691 | 0.027/0.678 | 0.025/0.631 | 0.029/0.597 | 0.020/0.561
37 | D | BlowingBubbles | 0.032/0.712 | 0.031/0.786 | 0.032/0.645 | 0.031/0.591 | 0.022/0.596 | 0.023/0.558
37 | D | BasketballPass | 0.031/0.733 | 0.027/0.673 | 0.028/0.593 | 0.032/0.573 | 0.023/0.610 | 0.026/0.606
37 | E | FourPeople | 0.026/0.726 | 0.028/0.765 | 0.027/0.712 | 0.026/0.670 | 0.022/0.632 | 0.021/0.617
37 | E | Johnny | 0.028/0.761 | 0.027/0.668 | 0.027/0.623 | 0.028/0.640 | 0.024/0.621 | 0.023/0.618
37 | E | KristenAndSara | 0.031/0.715 | 0.030/0.719 | 0.031/0.639 | 0.030/0.627 | 0.026/0.615 | 0.024/0.597
37 | – | Average | 0.029/0.726 | 0.028/0.673 | 0.028/0.639 | 0.027/0.623 | 0.023/0.614 | 0.022/0.592
32 | – | Average | 0.028/0.564 | 0.026/0.533 | 0.023/0.515 | 0.023/0.501 | 0.020/0.495 | 0.018/0.493
27 | – | Average | 0.026/0.377 | 0.024/0.345 | 0.022/0.386 | 0.021/0.374 | 0.019/0.326 | 0.017/0.314
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
