Article

A Method for In-Loop Video Coding Restoration

by Carlos Salazar 1,*, Maria Trujillo 2 and John W. Branch-Bedoya 3

1 Amazon Web Services, Denver, CO 80202, USA
2 Multimedia and Computer Vision Group, University of Valle, Cali 760042, Colombia
3 Department of Computing and Decision Sciences, National University, Medellin 050034, Colombia
* Author to whom correspondence should be addressed.
Electronics 2024, 13(12), 2422; https://doi.org/10.3390/electronics13122422
Submission received: 22 May 2024 / Revised: 11 June 2024 / Accepted: 14 June 2024 / Published: 20 June 2024
(This article belongs to the Special Issue Image and Video Processing Based on Deep Learning)

Abstract

In-loop restoration is a post-processing task that aims to reduce the losses introduced by the quantization and inverse quantization phases of a video coding process. Emerging in-loop restoration methods, most of them based on deep learning, have reported higher quality gains than classical filters. However, complexity at the decoder side remains a challenge. The Sparse Restoration Method (SRM) is presented as a low-complexity method that utilizes sparse representation and Natural Scene Statistics metrics to enhance visual quality at the block level. Our method shows potential restoration benefits when applied to synthetic video sequences.

1. Introduction

Video applications such as Ultra-High Definition (UHD), Virtual Reality (VR), online gaming, and social networks are growing rapidly; they accounted for 65% of total Internet traffic in 2023 [1]. This trend drives current and emerging video codec standards, such as High Efficiency Video Coding (HEVC) [2], AOMedia Video 1 (AV1) [3], Versatile Video Coding (VVC) [4], and AOMedia Video 2 (AV2), the evolution of AV1 currently under standardization (https://gitlab.com/AOMediaCodec/avm (accessed on 1 December 2023)), to keep pushing the boundaries of compression. All of those codecs rely on the well-known hybrid video coding scheme [5], which combines block-based prediction, transform, and quantization coding. In addition, HEVC, AV1, VVC, and AV2 integrate an in-loop restoration stage that aims to mitigate visual artifacts, such as blocking, ringing, and blurring, inevitably introduced during compression, which degrade both coding efficiency and visual quality. Each codec implements in-loop restoration differently. AV1 evaluates three options: the Wiener filter, the self-guided filter, and passthrough. HEVC implements a Sample Adaptive Offset (SAO) filter and a deblocking filter (DBF). VVC and AV2, as future video codecs, are still evaluating new algorithms that increase coding efficiency and visual quality, for instance, the Convolutional Neural Network (CNN) methods introduced by Ding et al. [6] and Kong et al. [7]. Both approaches learn a mapping from distorted to clean frames, assisted by the quantization level, and obtain restoration performance superior to classic algorithms in terms of the Bjontegaard Delta-Rate (BD-Rate) [8]. However, those methods add a complexity factor of about 30 at the decoder side, which is still beyond the target of future standards: a factor of at most 10.
In this context, we propose the Sparse Restoration Method (SRM), a low-complexity in-loop restoration approach for AV2 that aims to increase coding efficiency and visual quality while keeping decoder complexity within a factor of 10. SRM relies on the sparsity of residuals to predict a sparse vector at the decoder, which is then used to compensate each decoded (distorted) block. SRM models the decoding residual (DR), calculated as the difference between a reference frame and a decoded frame, and uses the Discrete Cosine Transform (DCT) as a dictionary to exploit the statistical correlation of the sparse nonzero coefficient magnitudes across quantization levels. Additionally, a variable of the Natural Scene Statistics (NSS) model [9] is used as guiding information for blind in-loop restoration at the block level. SRM is threefold: (1) sparse decoding residuals, (2) a sparse coefficient magnitude estimator, and (3) a sparse position estimator. We leveraged the official, open-access AOM raw video dataset to compare SRM against the anchor AV2 codec using three objective visual quality metrics: the Peak Signal-to-Noise Ratio (PSNR), the Structural Similarity Index Measure (SSIM) [10], and the Video Multi-Method Assessment Fusion (VMAF) (https://github.com/Netflix/vmaf (accessed on 1 June 2024)). Experimental evaluation shows that SRM achieves a 1–2% BD-Rate gain on synthetic video sequences, indicating potential use with cartoons, computer-animated movies, and video games.

2. Related Works and Theoretical Background

2.1. In-Loop Restoration in Existing AV1 Video Codec

The reference AV1 video codec supports three post-processing filters that can be enabled independently. The post-processing stage starts with a block resulting from the addition of a prediction block ($X_{pre}$) and a residual block ($\hat{r}$). That block then fills the inter-prediction buffer and is passed to the display picture module. Before that, a post-processing filter may be applied with two main objectives: (1) improve inter prediction, yielding a more accurate reconstructed frame in the buffer and reducing the overall bitrate, and (2) increase visual output quality. A typical case is super-resolution, where an up-scaling operation before the post-processing filters naturally truncates image details. Figure 1 presents the high-level workflow of the post-processing stage. Each filter is detailed below.

2.1.1. Deblocking Filter

During the encoding process, the residual between a reference frame (X) and a predicted frame ($X_{pre}$) is split into transform blocks (TR). Each TR is transformed using the discrete cosine transform (DCT) or the asymmetric discrete sine transform (ADST) to remove spatial correlation. The resulting coefficients are then quantized into N levels determined by a quantization parameter (QP). The decoder performs the inverse processes, inverse quantization and inverse transform, to recover the residuals, which are finally added to the predicted block ($X_{pre}$) to build the decoded block (Y). In most cases ($QP \neq 0$), inverse quantization and transformation introduce losses and boundary artifacts between adjacent transform blocks (Figure 2). This effect occurs because the encoder processes each block independently (in separate threads) and does not maintain smooth transitions at block boundaries. Therefore, AV1 implements a deblocking filter tool that aims to mitigate these visual effects.
The deblocking algorithm uses vertical and horizontal lowpass FIR filters with 4, 8, or 14 taps for the luma component and 4 or 6 taps for the chroma components. The size of the lowpass FIR filter depends on the minimum transform size between the blocks sharing a boundary. For instance, Figure 3 shows a case where the dimension of the B-block determines the size of the filter. To avoid wrongly blurring natural edges, the deblocking tool applies a sequence of threshold tests to determine whether the filter is applied. The conditions are (1) $|p_1 - p_0| > T_0$, (2) $|q_1 - q_0| > T_0$, (3) $2|p_0 - q_0| + \frac{|p_1 - q_1|}{2} > T_1$, (4) $|p_3 - p_2| > T_0$, and (5) $|q_3 - q_2| > T_0$, where $T_0, T_1 > 0$ control the sensitivity of the algorithm. Conditions 4 and 5 are used only for the 8- and 14-tap filters.
Figure 4 illustrates the positions of the pixels $q_x$ and $p_x$.
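To make the gating logic concrete, the sketch below (a simplified illustration, not the AV1 reference implementation) evaluates the flatness conditions above for one boundary; the threshold values and the decision to filter only when none of the conditions is triggered are assumptions for illustration.

```python
# Simplified sketch of the deblocking decision for one boundary (not the AV1
# reference implementation). p = [p0, p1, p2, p3] and q = [q0, q1, q2, q3] are
# the pixels on each side of the boundary (Figure 4); T0 and T1 control the
# sensitivity. The boundary is filtered only when it looks flat, i.e., when
# none of the edge-detection conditions above is triggered.
def should_deblock(p, q, T0, T1, taps=4):
    flat = (
        abs(p[1] - p[0]) <= T0
        and abs(q[1] - q[0]) <= T0
        and 2 * abs(p[0] - q[0]) + abs(p[1] - q[1]) // 2 <= T1
    )
    if taps in (8, 14):  # longer filters also check the outer samples
        flat = flat and abs(p[3] - p[2]) <= T0 and abs(q[3] - q[2]) <= T0
    return flat

# A sharp natural edge (large jump across the boundary) is left untouched.
print(should_deblock([10, 11, 12, 12], [80, 81, 82, 83], T0=2, T1=6))  # False
```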

2.1.2. Constrained Directional Enhancement Filter (CDEF)

CDEF [11] is designed to remove ringing artifacts around hard edges, as depicted in Figure 5. This is achieved by applying two filters (offset by 45°) to each pixel. The selection of a proper filter direction is based on minimizing Equation (1), as follows:
$E_d^2 = \sum_{k} \sum_{p \in P_{d,k}} (x_p - \mu_{d,k})^2,$
where $P_{d,k}$ is the group of pixels in line k along direction d, as shown in Figure 6, and $\mu_{d,k}$ is the mean of that group.
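The following sketch illustrates the direction search behind Equation (1) under simplified pixel groupings (rows and columns of a toy block stand in for the direction-dependent partitions of Figure 6); it is illustrative only, not the CDEF implementation.

```python
# Illustrative direction search for Equation (1): for each candidate direction,
# sum the squared deviations of the pixels in each group P_{d,k} from the group
# mean, and keep the direction with the smallest error E_d^2.
import numpy as np

def direction_cost(block, groups):
    """groups: list of flat-index arrays, one per line k along direction d."""
    return sum(np.sum((block.flat[idx] - block.flat[idx].mean()) ** 2) for idx in groups)

def best_direction(block, partitions):
    """partitions: dict mapping a direction label to its list of pixel groups."""
    costs = {d: direction_cost(block, groups) for d, groups in partitions.items()}
    return min(costs, key=costs.get)

# Toy 4x4 block whose values change slowly along rows and quickly down columns:
# the horizontal grouping (pixels grouped by row) yields the smallest E_d^2.
block = np.arange(16, dtype=float).reshape(4, 4)
partitions = {
    "horizontal": [np.arange(r * 4, r * 4 + 4) for r in range(4)],
    "vertical": [np.arange(c, 16, 4) for c in range(4)],
}
print(best_direction(block, partitions))  # "horizontal"
```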
In-loop restoration filters are applied to loop restoration units (LRUs), which can be 64 × 64, 128 × 128, or 256 × 256 pixel blocks. Each LRU can independently select one of three possible restoration options.

2.1.3. Self-Guided

The self-guided filter relies on calculating two restored versions ($Y_1$ and $Y_2$) of a decoded block Y. It then projects the mismatch between each version and the reference to obtain a unified restored patch ($Y_r$), as described in Equation (2):
$Y_r = Y + \alpha (Y_1 - Y) + \beta (Y_2 - Y).$
Before that, Y is denoised using Equation (3) on a pixel basis with two pairs of parameters sent by the encoder, $(r_1, e_1)$ and $(r_2, e_2)$, producing $Y_1$ and $Y_2$; the projection then uses $\alpha$ and $\beta$. The self-guided filter therefore requires four parameters to be signaled by the encoder, as follows:
$\hat{y} = \frac{\sigma^2}{\sigma^2 + e}\, y + \frac{e}{\sigma^2 + e}\, \mu,$
where $\hat{y}$ is a denoised pixel, $\mu$ is the mean of the block determined by r, $\sigma^2$ is its variance, and e acts as a noise parameter.
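A minimal sketch of the self-guided idea follows, assuming that $(r_1, e_1)$ and $(r_2, e_2)$ control the window radius and noise parameter of two guided-denoised versions of Y; the window shape and parameter values are illustrative, not the AV1 implementation.

```python
# Sketch of Equations (2) and (3): build two denoised versions of the decoded
# block Y with different (r, e) parameters, then project them back onto Y with
# the weights alpha and beta determined by the encoder.
import numpy as np
from scipy.ndimage import uniform_filter

def guided_denoise(Y, r, e):
    """Per-pixel Equation (3) over a (2r+1)x(2r+1) window: mu is the local mean
    and sigma^2 the local variance; e acts as a noise parameter."""
    mu = uniform_filter(Y, size=2 * r + 1)
    sigma2 = np.maximum(uniform_filter(Y ** 2, size=2 * r + 1) - mu ** 2, 0.0)
    return (sigma2 / (sigma2 + e)) * Y + (e / (sigma2 + e)) * mu

def self_guided_restore(Y, r1, e1, r2, e2, alpha, beta):
    """Equation (2): Y_r = Y + alpha*(Y1 - Y) + beta*(Y2 - Y)."""
    Y1 = guided_denoise(Y, r1, e1)
    Y2 = guided_denoise(Y, r2, e2)
    return Y + alpha * (Y1 - Y) + beta * (Y2 - Y)

Y = np.random.default_rng(0).normal(128.0, 10.0, (64, 64))
Y_r = self_guided_restore(Y, r1=1, e1=25.0, r2=2, e2=100.0, alpha=0.4, beta=0.2)
```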

2.1.4. Wiener Filter

Wiener theory [13] is a well-known restoration technique that has been widely applied to 1-D time-series systems since 1949. It uses the minimum mean square error (MMSE) criterion to predict a signal s(t) corrupted by noise w(t), with both signals assumed to be wide-sense stationary processes. Wiener initially presented two versions: a causal and a non-causal filter. The latter was not physically realizable for time series because it considers past, present, and future samples.
Wiener's theory became relevant but had not been applied to 2-D scenarios such as image filtering, prediction, and smoothing. It was only in 1982 that Ekstrom [14] presented a physically realizable 2-D version of the original Wiener filter and demonstrated that it can also be extended to multiple dimensions. Ekstrom formulated an optimal error criterion to find the parameters that define the filter configuration, as described in Equation (4):
$H(z) = \frac{S_{xy}(z)}{S_{xx}(z)},$
where $S_{xy}$ is the cross-spectral energy between the X (reference) and Y (distorted) images, and $S_{xx}$ is the spectral energy of the autocorrelation of X.
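As a frequency-domain illustration of Equation (4), the sketch below estimates a Wiener transfer function from a reference image and its distorted version and applies it to the distorted image. Because the filter is applied to the distorted image Y here, the denominator uses the spectrum of Y (a common restoration convention); this is a didactic approximation, not the separable filter actually signaled by AV1.

```python
# Didactic frequency-domain Wiener restoration: estimate H from the reference X
# and the distorted Y, then apply it to Y. The regularizer eps avoids division
# by zero at empty frequency bins.
import numpy as np

def wiener_restore(X, Y, eps=1e-8):
    Xf, Yf = np.fft.fft2(X), np.fft.fft2(Y)
    Sxy = Xf * np.conj(Yf)            # cross-spectrum between reference and distorted
    Syy = np.abs(Yf) ** 2 + eps       # spectrum of the distorted image
    H = Sxy / Syy                     # Wiener transfer function
    return np.real(np.fft.ifft2(H * Yf))

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, (64, 64))            # "reference"
Y = X + rng.normal(0.0, 0.2, (64, 64))        # distorted version
X_hat = wiener_restore(X, Y)
print(np.mean((X_hat - X) ** 2) < np.mean((Y - X) ** 2))  # restoration reduces the MSE
```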
The feasibility of applying the Wiener filter to images made it relevant for video restoration problems, since the goal is to find a kernel that linearly relates an original image and its distorted version. In other words, it allows an encoder to determine kernel coefficients that a decoder can use to restore each block. This led to the implementation of Wiener filtering in both AV1 and HEVC, and it is still part of AV2. The first version implemented in AV1 presented outstanding compression efficiency, but its complexity and signaling cost left room for improvement. Under a symmetry assumption, Siekmann et al. [15] proposed a separable filter that reduces the number of additions and multiplications by 33.33% and 50%, respectively, for a 3 × 3 block; for a 9 × 9 block, the reductions reach 77.77% and 80%.
Besides the reduction in operations, signaling information is also drastically reduced. For example, a 9 × 9 block using a non-separable filter requires 81 coefficients; a non-separable symmetric filter requires 41 coefficients; and a separable symmetric filter requires 18 coefficients. In other words, 78% less signaling information has to be sent to the decoder. Non-separable and separable filters were also observed to maintain similar performance in terms of bit savings. The separable version is, in fact, the official implementation of the Wiener filter in the AV1 reference code. However, processing time and complexity remain factors to improve in future approaches. Table 1 presents the performance of the Wiener filter vs. passthrough mode using AV1 over 10 frames (1920 × 1080).
As noted above, in-loop restoration is part of the post-processing stage (Figure 1) and aims to find the coefficients of a filter, either self-guided or Wiener, that increase the visual quality when applied to the distorted block received by the decoder. An in-loop filter is also designed to minimize the signaling bits required to transmit the coefficients from the encoder to the decoder. In the case of the Wiener filter, Siekmann et al. [15] proposed a separable version that requires 78% less signaling information compared with the original non-separable Wiener filter. All post-processing modules, including the in-loop filters, are included in the baseline version of the AV2 video codec, which is still in the standardization process.

2.2. In-Loop Restoration Based on Deep Learning

In-loop restoration approaches based on deep learning architectures are still relatively few and, in most cases, target video coding applications directly. Jia et al. [16] proposed a content-aware CNN-based (CAC) method consisting of block-based model selection and restoration modules incorporated into the HEVC reference code. The first component implements a discriminative network to select the most appropriate CNN [17] per Coding Tree Unit (CTU). The selection relies on prior labeling of the training content into several categories. The discriminative network then chooses the CNN that minimizes, for a specific content, the loss function described in Equations (5) and (6), as follows:
$J_{CTU} = \Delta D_{CTU} + \lambda R,$
$\Delta D_{CTU} = D'_{CTU} - D_{CTU},$
where $J_{CTU}$ refers to the cost of the n-th CNN, $\Delta D_{CTU}$ is the variation in distortion of a CTU before ($D_{CTU}$) and after ($D'_{CTU}$) the restoring process, $\lambda$ is the Lagrange multiplier that controls the trade-off between rate and distortion, and R represents the number of bits required for signaling. The CAC method has been evaluated against the HEVC anchor, obtaining BD-rate improvements between 2% and 4%. Complexity is also compared against Very Deep Super Resolution (VDSR) [18], the Variable-Filter-Size Residue-Learning CNN (VRCNN) [19], and Adaptive Loop Filtering (ALF) [20]. Reported results indicate that ALF, a non-DL-based algorithm embedded in HEVC, is still far more efficient, with a decoder complexity of 123% versus 11,656% for CAC.
Considering these results, it is clear that quality can be enhanced; however, complexity is still far from the level expected for future video codec standards (<10×). Inspired by these facts, Ding et al. [6] proposed SimNet, a CNN-based method that reduces complexity by leveraging a skipping strategy, in which a simple CNN restores selected frames while the remaining frames continue to use traditional non-DL-based restoration. SimNet has been formally evaluated as a candidate for AV2. It applies restoration to Intra and Inter frames; for Inter frames, it accounts for the propagation impact over the group of pictures (GOP). SimNet consists of N cascaded convolutional layers with a ReLU at the end (Figure 7). The depth of the network depends on the QP (Quantization Parameter) of each block.
SimNet reported overall BD-rate improvements of 7.27% and 5.47% for Intra and Inter frames, respectively, against the reference codec AV1. It also showed a processing time reduction of 12.65% compared with AV1. However, the original paper does not report the decoder complexity.
The original SimNet publication highlights an over-filtering problem closely related to Inter frames that are filtered and then used as references for subsequent frames. The effect is propagated noise that reduces the performance of CNN-based restoration methods compared with AV1. SimNet tackles this issue by skipping frames that would potentially introduce errors. Considering this problem, Ding et al. [21] proposed a transfer-learning method for Inter-only frames that feeds reconstructed frames back into the training set of the CNN model. The algorithm achieves higher visual quality than the HEVC reference code but increases processing time by around 4%. The encoder complexity is not reported, but we infer that it may be around 5× due to the progressive training.
Kong et al. [7] present a Guided CNN architecture with similar outcomes: a BD-rate of 1.84–3.06% with 23.79% additional processing time (Figure 8). The restoration relies on a linear combination of the weighted outputs of N CNNs, each of which aims to capture distinct features of an image; however, the paper does not specify which characteristics are captured. The idea is to use a lightweight array of N CNNs and obtain an optimal vector of weights that minimizes the loss function described in Equation (7):
$r_{corr} = a_0 r_0 + a_1 r_1 + \dots + a_{M-1} r_{M-1}.$
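As a sketch of this weight search (assuming the per-network outputs and the target residual are available at the encoder), the optimal combination weights can be obtained by ordinary least squares and then signaled to the decoder:

```python
# Least-squares fit of the combination weights a_0..a_{M-1} in Equation (7).
import numpy as np

def fit_combination_weights(residual_outputs, target_residual):
    """residual_outputs: list of M arrays (one correction per lightweight CNN);
    target_residual: the residual the weighted combination should approximate."""
    A = np.stack([r.ravel() for r in residual_outputs], axis=1)   # (pixels, M)
    b = target_residual.ravel()
    weights, *_ = np.linalg.lstsq(A, b, rcond=None)
    return weights

rng = np.random.default_rng(2)
outputs = [rng.normal(size=(16, 16)) for _ in range(3)]
target = 0.5 * outputs[0] + 0.2 * outputs[1] + rng.normal(scale=0.01, size=(16, 16))
print(fit_combination_weights(outputs, target))   # approximately [0.5, 0.2, 0.0]
```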

3. Sparse Restoration Method (SRM)

Considering the opportunities in the field of in-loop restoration, we introduce SRM, which is threefold: (1) sparse decoding residuals, (2) a sparse coefficient magnitude estimator, and (3) a sparse position estimator.

3.1. Sparse Decoding Residuals

During the video compression process, spatial residuals, the difference between a reference frame and an Inter/Intra-predicted frame, are transformed into the frequency domain. The resulting coefficients, such as DCT coefficients, are mapped to predefined quantization factors, which may vary from 0 to 255, where 0 is lossless. After that, entropy coding is applied to eliminate statistical redundancies; in the particular case of the DCT, this means using more bits to represent the DC and low-frequency coefficients. Finally, a sequence of bits is sent to the decoder. Equation (8) models the frequency-domain residuals after the quantization Q [22] and transform T operations, as follows:
$g = Q[T\{r\}].$
To recover the original frame, the decoder uses the prediction frame and its corresponding residual. The latter is obtained by applying the inverse transform and inverse quantization operators to the entropy-decoded bitstream. However, quantization is a lossy operation that depends on the number of levels (such as 85, 110, or 210), so a trade-off between bitrate and quality is expected. Therefore, the result of the inverse transform is not equal to the residual r (Equation (9)), as follows:
$\hat{r} = T^{-1}[Q^{-1}\{g\}],$
$r \neq \hat{r}.$
Since the quantization error ($e_q$) is linear, we can approximate r as follows:
$r \approx T^{-1}[Q^{-1}\{g\} + e_q],$
$r \approx \hat{r} + T^{-1}[e_q].$
As illustrated in Figure 1, a reference frame X is theoretically expected to be equal to its prediction $X_{pre}$ plus a residual r. Integrating this into Equation (10) leads to the relation between a decoded frame Y and its reference X given in Equation (11):
$X \approx X_{pre} + r,$
$X \approx X_{pre} + \hat{r} + T^{-1}[e_q],$
$X \approx Y + T^{-1}[e_q].$
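A small numerical sketch of Equations (8)–(11) follows; the scalar quantization step stands in for the codec's QP-dependent quantizer and is purely illustrative.

```python
# Quantize the DCT of a residual and invert the process: whatever the inverse
# transform cannot recover is the decoding residual DR = T^{-1}[e_q] = r - r_hat.
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(3)
r = rng.normal(0.0, 8.0, (16, 16))               # spatial residual r
step = 12.0                                      # illustrative quantization step
g = np.round(dctn(r, norm="ortho") / step)       # g = Q[T{r}]
r_hat = idctn(g * step, norm="ortho")            # r_hat = T^{-1}[Q^{-1}{g}]
DR = r - r_hat                                   # decoding residual T^{-1}[e_q]
print(np.abs(DR).mean() > 0)                     # losses appear whenever step > 0
```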
At the decoder, $X_{pre}$ and $\hat{r}$ are always obtained through prediction and inverse quantization/transform operations, respectively. Therefore, SRM models the differential caused by the inverse transform of the quantization error ($T^{-1}[e_q]$); from now on, we call this differential the decoding residual (DR). The reference AV1 video codec implements a series of tools to reduce the effect of quantization on block boundaries or to recover lost information, which is precisely the DR. Those processes require only a few bytes to improve the visual quality of a decoded frame, so the final bitrate is barely impacted. SRM takes this low-bitrate-impact constraint into account by modeling the DR with sparse theory, relying on a few nonzero coefficients expanded over a proper basis, or dictionary.
In our proposal, we use $\gamma$ to represent the DR, a sparse vector $\alpha$, a dictionary $\Phi$, and the $L_0$-norm to state the problem in Equation (12):
$\|\alpha\|_0 \leq z \quad \text{s.t.} \quad \gamma = \alpha \Phi.$
The use of the $L_0$ norm and the Orthogonal Matching Pursuit algorithm [23] follows from our target, which is to minimize the number of nonzeros rather than an error. The solution of the optimization problem in Equation (12) provides a sparse vector $\alpha$ that, together with the dictionary $\Phi$, estimates the residual $\gamma$. The calculation of $\alpha$ runs entirely on the encoder; the resulting few nonzero coefficients and their positions in the vector are sent to the decoder as restoration signaling information. The residual $\gamma$ is split into $n \times n$ blocks (i.e., $8 \times 8$, $16 \times 16$, or $32 \times 32$), which are processed simultaneously to keep the problem tractable in terms of memory usage. An operator $R_{i,j}$ is introduced to extract the $(i,j)$ patch of size $n \times n$. Equation (13) models the $(i,j)$ sparse vector $\alpha_{i,j}$, as follows:
$\|\alpha_{i,j}\|_0 \leq z \quad \text{s.t.} \quad R_{i,j}\gamma = \alpha_{i,j} \Phi.$
The patch-based approach requires blocks to overlap in order to avoid edge reconstruction artifacts. Mairal et al. [24] introduced a weighted-average formulation that we adapt in Equation (14):
$\hat{\gamma} = \left( \sum_{i,j} R_{i,j}^T R_{i,j} \right)^{-1} \sum_{i,j} R_{i,j}^T \Phi \alpha_{i,j}.$
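The sketch below illustrates Equations (12)–(14) with a 2-D DCT dictionary and scikit-learn's OMP solver; the block size, overlap stride, and sparsity target are illustrative choices, not the configuration used in the experiments.

```python
# Patch-wise sparse coding of the decoding residual over a DCT dictionary,
# followed by the weighted average of Equation (14) over overlapping patches.
import numpy as np
from scipy.fft import idctn
from sklearn.linear_model import OrthogonalMatchingPursuit

def dct_dictionary(n):
    """Columns are the n*n two-dimensional DCT basis images (dictionary Phi)."""
    atoms = []
    for k in range(n * n):
        coeff = np.zeros((n, n))
        coeff[np.unravel_index(k, (n, n))] = 1.0
        atoms.append(idctn(coeff, norm="ortho").ravel())
    return np.stack(atoms, axis=1)                       # shape (n*n, n*n)

def sparse_code_residual(gamma, n=8, z=2, stride=4):
    Phi = dct_dictionary(n)
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=z, fit_intercept=False)
    acc = np.zeros_like(gamma)
    weight = np.zeros_like(gamma)
    for i in range(0, gamma.shape[0] - n + 1, stride):
        for j in range(0, gamma.shape[1] - n + 1, stride):
            patch = gamma[i:i + n, j:j + n].ravel()      # R_{i,j} gamma
            alpha = omp.fit(Phi, patch).coef_            # sparse vector alpha_{i,j}
            acc[i:i + n, j:j + n] += (Phi @ alpha).reshape(n, n)
            weight[i:i + n, j:j + n] += 1.0              # Equation (14) averaging
    return acc / np.maximum(weight, 1.0)

gamma = np.random.default_rng(4).normal(0.0, 2.0, (32, 32))   # toy decoding residual
gamma_hat = sparse_code_residual(gamma)
```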

3.2. Sparse Coefficient Magnitude Estimation

The statistical redundancy of the sparse coefficients is exploited by using a DCT basis as the dictionary. A sparse vector is thus interpreted as a truncated version of $T[\gamma]$, where T is the DCT operator, so the nonzero coefficients follow a predictable statistical behavior [25]. In fact, AV1/AV2 model the magnitude of DCT coefficients with the Laplace distribution [26], which permits the estimation of rate distortion for different transform configurations. Our method uses a similar assumption, but with the Gaussian distribution (Equation (15)), since it better fits the empirical distribution of the absolute magnitude of a DC nonzero coefficient (z), and the Gamma distribution (Equation (16)) for the remaining coefficients. The only exception is $QP = 85$, where the Laplace distribution (Equation (17)) better fits the empirical distribution of the absolute value of the AC coefficients.
$f(z) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{z - \mu}{\sigma}\right)^2},$
$f(z \mid a, b) = \frac{z^{a-1} e^{-z/b}}{b^a \Gamma(a)},$
$f(z \mid \mu, b) = \frac{1}{2b}\, e^{-\frac{|z - \mu|}{b}}.$
The parameters of the distributions described in Equations (15)–(17) are correlated with the QP level. Accordingly, we estimate these parameters for each $QP \in \{85, 110, 135, 160, 185, 210\}$, as shown in Table 2.
SRM estimates the nonzero coefficients from these probability distribution functions instead of signaling them to the decoder. The magnitude of a coefficient may vary but still adds gain to a distorted frame, since the method operates on the residual, not on the frame, as illustrated in Figure 9. For instance, at $QP = 135$, the cost of predicting the coefficients is −0.16 dB in terms of PSNR. This means that the decoder gives up 0.16 dB of quality but achieves an approximately 25 KB (2.78%) reduction per 1280 × 720 frame (this example uses 16 × 16 blocks and $\|\alpha\|_0 = 2$).
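A sketch of the magnitude prediction step is shown below: the decoder draws the nonzero magnitudes from the AC model fitted for the current QP (Table 2). The mapping of the Table 2 values onto scipy's loc/scale arguments is an assumption.

```python
# Draw predicted |coefficient| magnitudes from the per-QP AC models of Table 2.
from scipy import stats

AC_MODELS = {
    85:  stats.laplace(loc=6.01, scale=1.08),
    110: stats.norm(loc=10.06, scale=2.77),
    135: stats.norm(loc=17.98, scale=6.94),
    160: stats.norm(loc=33.45, scale=14.68),
    185: stats.norm(loc=61.37, scale=28.53),
    210: stats.norm(loc=110.20, scale=54.29),
}

def predict_magnitudes(qp, count=2, seed=0):
    """Magnitudes for the sparse nonzero positions, no signaling required."""
    return AC_MODELS[qp].rvs(size=count, random_state=seed)

print(predict_magnitudes(135))   # e.g., two AC magnitudes around 18
```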

3.3. Sparse Coefficient Position Estimation

From a frequency-domain perspective, the positions of the nonzero coefficients in $\alpha$ correspond to the most relevant frequencies of the decoded block Y that require compensation. On the decoder side, this information is unknown. However, assuming that quantization errors affect all AC components of Y equally, we expect the most relevant frequencies in Y to also be the ones requiring the most restoration. Experimental validation, obtained by running restoration over patches of raw video sequences at different QP levels, shows that in 80% of the cases (across all QP levels), at least one of the three most relevant coefficients in Y corresponds to the most relevant frequencies in the decoding residual block. Thus, applying a DCT to Y makes this information available at the decoder. However, the sign of a DCT coefficient of Y bears no relation to the sign of the nonzero coefficient at that position. Therefore, we introduce an algorithm that relies on blind image quality assessment in the frequency domain, introduced by Saad et al. [27], to evaluate the most efficient sign combination for the two nonzero coefficients of the sparse vector, whose positions are determined by the location of the most relevant components of the DCT of Y.
Figure 9. Visual and objective comparison of a frame restoration. At the top, sparse coefficients are sent to the decoder; at the bottom, sparse coefficients are predicted by the decoder. The distorted original image (not displayed here) reports a PSNR of 38.81 dB. This example uses $QP = 135$.
We follow a well-established definition of NSS in which natural images are characterized by a Generalized Gaussian Distribution (GGD), and distortions such as JPEG blocking and blur affect the shape of the histogram of the original image. However, we are not interested in evaluating image quality; instead, we use features of the GGD as the criterion for selecting a proper predicted decoding residual block. Specifically, we utilize the GGD constant parameter a described in Equation (18). This allows us to avoid signaling coefficient information between the encoder and the decoder, apart from a single bit per restoration block that guides the restoration process. The complete details of our prediction algorithm are given in Algorithms 1 and 2.
$f_X(x \mid \mu, \sigma^2, \phi) = a\, e^{-\left(b\,|x - \mu|\right)^{\phi}},$
$b = \frac{1}{\sigma}\sqrt{\frac{\Gamma(3/\phi)}{\Gamma(1/\phi)}}, \qquad a = \frac{b\,\phi}{2\,\Gamma(1/\phi)}.$
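For reference, the GGD feature a of Equation (18) can be computed for a block of DCT coefficients as sketched below. The shape parameter is estimated by matching the ratio E[|x|]² / E[x²], a standard moment-matching approach for generalized Gaussians that we assume here; the paper does not prescribe a specific estimator.

```python
# Estimate the GGD shape parameter phi by moment matching, then compute the
# Equation (19) quantities b and a for a block of (AC) DCT coefficients.
import numpy as np
from scipy.special import gamma as G
from scipy.optimize import brentq

def ggd_feature_a(coeffs):
    x = np.asarray(coeffs, dtype=float)
    x = x - x.mean()
    rho = (np.abs(x).mean() ** 2) / (x ** 2).mean()              # observed moment ratio
    f = lambda phi: G(2.0 / phi) ** 2 / (G(1.0 / phi) * G(3.0 / phi)) - rho
    phi = brentq(f, 0.1, 10.0)                                   # solve for the shape
    sigma = x.std()
    b = (1.0 / sigma) * np.sqrt(G(3.0 / phi) / G(1.0 / phi))
    a = b * phi / (2.0 * G(1.0 / phi))
    return a, phi

coeffs = np.random.default_rng(5).laplace(0.0, 4.0, 256)        # heavy-tailed sample
print(ggd_feature_a(coeffs))                                    # phi close to 1 for Laplace data
```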
Algorithm 1 Sparse prediction algorithm at the encoder per block basis.
Input: Reference block: x, Decoded block: y, block size: ( n × n )
Task: Set encoder-flag
  • Apply the DCT to x and y:
    • $x^f = T[x]$, $y^f = T[y]$
  • Calculate the GGD feature a for x and y:
    • $a_x = \frac{b_x \phi}{2\Gamma(1/\phi)}$, $a_y = \frac{b_y \phi}{2\Gamma(1/\phi)}$
  • Set the encoder-flag:
    • $a_x / a_y > 1 \Rightarrow$ encoder-flag = 1,
    • $a_x / a_y < 1 \Rightarrow$ encoder-flag = 0
Algorithm 2 Sparse prediction algorithm at the decoder.
Input: Decoded block: y, block size: ($n \times n$), encoder-flag $\in \{0, 1\}$, QP
Task: Predict the decoding residual block ($\gamma$) at the decoder
  • Apply the DCT to the decoded block:
    • $y^f = T[y]$
  • Identify the positions in $y^f$ of the two coefficients with the largest absolute value:
    • $A = \mathrm{sort}(|y^f|)$ s.t. $\max(A) = A[0]$, $A[0] \geq A[1]$,
    • $p_0 = k_0$ s.t. $A[0] = |y^f_{k_0}|$,
    • $p_1 = k_1$ s.t. $A[1] = |y^f_{k_1}|$
  • Predict the magnitude of the sparse nonzero coefficients:
    • $\{c_0, c_1\} \sim \mathrm{Laplace}(\mu, b)$ if $QP \in \{85\}$,
    • $\{c_0, c_1\} \sim N(\mu, \sigma^2)$ if $QP \in \{110, 135, 160, 185, 210\}$
  • Create four sparse vectors with the possible sign combinations:
    • $\nu_0[i] = 0 \;\forall i \notin \{p_0, p_1\}$, $\nu_0[p_0] = +c_0$, $\nu_0[p_1] = +c_1$,
    • $\nu_1[i] = 0 \;\forall i \notin \{p_0, p_1\}$, $\nu_1[p_0] = -c_0$, $\nu_1[p_1] = +c_1$,
    • $\nu_2[i] = 0 \;\forall i \notin \{p_0, p_1\}$, $\nu_2[p_0] = +c_0$, $\nu_2[p_1] = -c_1$,
    • $\nu_3[i] = 0 \;\forall i \notin \{p_0, p_1\}$, $\nu_3[p_0] = -c_0$, $\nu_3[p_1] = -c_1$
  • Obtain four potential restored blocks in the DCT domain:
    • $y_0^f = y^f + \nu_0$,
    • $y_1^f = y^f + \nu_1$,
    • $y_2^f = y^f + \nu_2$,
    • $y_3^f = y^f + \nu_3$
  • Calculate the GGD feature a for each potential restored block:
    • $b_i = \frac{1}{\sigma_i}\sqrt{\frac{\Gamma(3/\phi)}{\Gamma(1/\phi)}}$, $a_i = \frac{b_i \phi}{2\Gamma(1/\phi)}$,
    • $A = \{a_0, a_1, a_2, a_3\}$
  • Select the restored block that attains the max or min of the GGD feature a:
    • encoder-flag = 0 $\Rightarrow \{a_i = \max(A)\}$,
    • encoder-flag = 1 $\Rightarrow \{a_i = \min(A)\}$,
    • $\gamma = \nu_i \Phi$, $y_{resto} = y + \gamma$
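A compact Python sketch of Algorithm 2 is given below. It assumes 2-D blocks, the per-QP magnitude models of Table 2 (e.g., the AC_MODELS dictionary sketched in Section 3.2), and the ggd_feature_a helper sketched above; the DC position is skipped, following the remark in Section 5 that the DC coefficient is not considered by the GGD feature. Because the dictionary is the DCT basis, the product $\nu_i \Phi$ reduces to an inverse DCT.

```python
# Decoder-side prediction of the decoding residual (sketch of Algorithm 2).
import numpy as np
from scipy.fft import dctn, idctn

def predict_restored_block(y, qp, encoder_flag, magnitude_model, ggd_feature_a):
    yf = dctn(y, norm="ortho")                         # y_f = T[y]
    mags = np.abs(yf).ravel()
    mags[0] = 0.0                                      # skip the DC position
    p0, p1 = np.argsort(mags)[::-1][:2]                # two most relevant frequencies
    c0, c1 = magnitude_model[qp].rvs(size=2)           # predicted magnitudes (Table 2)

    candidates = []
    for s0, s1 in [(+1, +1), (-1, +1), (+1, -1), (-1, -1)]:   # four sign combinations
        nu = np.zeros(yf.size)
        nu[p0], nu[p1] = s0 * c0, s1 * c1
        candidates.append(nu.reshape(yf.shape))

    features = [ggd_feature_a((yf + nu).ravel())[0] for nu in candidates]
    pick = int(np.argmax(features)) if encoder_flag == 0 else int(np.argmin(features))
    gamma = idctn(candidates[pick], norm="ortho")      # gamma = nu_i * Phi
    return y + gamma                                   # restored block y_resto
```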

4. Experiments Design

4.1. Dataset

For the experimentation, we follow the AOM Common Test Conditions v2.0 (https://aomedia.org/docs/CWG-B075o_AV2_CTC_v2.pdf (accessed on 1 December 2023)), which recommend a series of raw video sequences with diverse characteristics, including content type, bit depth, resolution, and color sub-sampling, intended to provide standard scenarios for testing new algorithms against the reference codec. Five test classes were selected to assess the performance of SRM. The selection was based on two criteria: (1) sequences with content suitable for the All Intra (AI) configuration, and (2) due to restrictions in cloud computing services, sequences with resolutions up to 1920 × 1080 and a bit depth of 8. Table 3 presents general details of the selected sequences. In addition, the standardization group defines a set of QPs to evaluate each configuration; in our case, we follow the All Intra (AI) recommendation: $QP = \{85, 110, 135, 160, 185, 210\}$. The raw videos are publicly available and hosted on an open-source platform, Xiph (https://media.xiph.org/video/aomctc/test_set/ (accessed on 1 January 2024)). Figure 10 shows example frames from sub-classes A2–A5 and B1.

4.2. Computing Details

Considering the total number of video sequences (32) and the number of QP levels (6), we conducted 192 executions using the anchor encoder configuration (AV2 + switchable filter) (https://gitlab.com/AOMediaCodec/avm (accessed on 2 February 2024)) and 192 executions using SRM. For each execution, we also performed a post-processing reference quality assessment using PSNR, SSIM, and VMAF. In all cases, the compressed file and quality metrics were automatically uploaded to an AWS S3 bucket as the source for the overall performance evaluation based on the BD-Rate. Each of the 384 tests was performed using AWS Batch (https://aws.amazon.com/batch/ (accessed on 4 February 2024)), leveraging a Docker container and the automation described in the reference architecture for super-resolution [28], as illustrated in Figure 11. We used the EC2 instance type c5.2xlarge as part of our AWS Batch job definition.
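For reproducibility, a hedged sketch of how such a run could be dispatched is shown below: each (sequence, QP, implementation) triple becomes one AWS Batch job. The queue name, job definition, and container command are illustrative placeholders, not the exact configuration used for the paper.

```python
# Submit one encoding/scoring job per (sequence, QP) pair to AWS Batch.
import boto3

batch = boto3.client("batch")

def submit_encode_job(sequence, qp, implementation="avm-srm"):
    return batch.submit_job(
        jobName=f"{implementation}-{sequence}-qp{qp}",
        jobQueue="video-coding-queue",              # placeholder queue name
        jobDefinition="avm-encoder-c5-2xlarge",     # placeholder job definition
        containerOverrides={
            "command": ["encode_and_score.sh", sequence, str(qp), implementation]
        },
    )

for qp in (85, 110, 135, 160, 185, 210):
    submit_encode_job("example_sequence_1080p", qp)   # placeholder sequence name
```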

5. Results

Table 4 presents a comparison of SRM across the video sequence classes A2–A5 and B1. We use the BD-Rate to evaluate performance under different bitrate and quality conditions. Despite the prediction efficiency, regarding both accuracy and the reduction in signaling bits, the encoder still has to inform the decoder whether a block is subject to restoration. SRM is partially blind: it restores a block accurately in most cases (up to 70%); in the remaining 30%, the encoder has to inform the decoder that restoration is not required. A guide bit (encoder-flag) is also added for patches that are subject to restoration. Therefore, a frame from class A2 (1920 × 1080) with a block size of 32 × 32 requires approximately 379 bytes of signaling, assuming that 50% of the blocks are restored. This figure (379 bytes) is double that of the switchable filter.
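The 379-byte figure can be reproduced with the following back-of-the-envelope accounting, under the assumption that signaling amounts to one restore/passthrough decision bit per 32 × 32 block plus one guide bit (encoder-flag) for each restored block:

```python
# Approximate per-frame signaling cost for a 1920x1080 frame with 32x32 blocks
# and 50% of the blocks restored.
width, height, block = 1920, 1080, 32
blocks = (width * height) / (block ** 2)       # 2025 blocks per frame
bits = blocks + 0.5 * blocks                   # decision bit + guide bit for half of them
print(blocks, bits / 8)                        # ~2025 blocks, ~379.7 bytes per frame
```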
Another notable result is that SRM performs better on the SSIM and VMAF metrics for the B1 sequences; according to the MSU Graphics and Media Lab Video Group [29], these metrics correlate with human subjective scores at 90.57% and 93.86%, respectively, surpassing the 87.43% reported for PSNR. This is consistent with NSS, which assesses the structure of images instead of measuring distances between pixels; SSIM and VMAF also build on image structure. Figure 12 presents a subjective example of SRM restoration, where the restored block starts to recover details compared with the decoded block. Similarly, in Figure 13, the bush in the image recovers part of the original shape that was lost during the quantization and inverse quantization processes. Our method compensates the frequencies, in the DCT domain, that are most relevant for an image and provide the largest gain in detail; therefore, we omit the DC coefficient, which is not considered by the GGD feature.
Regarding complexity, SRM requires approximately 110% and 105% of the anchor AV1/AV2 + SF processing time per frame for natural (A2–A5) and synthetic (B1) sequences, respectively (Table 5). These results are in line with the low complexity of sparse analysis compared with the traditional deep learning methods detailed in Section 2.2.

6. Conclusions and Future Works

  • We proposed SRM as a low-complexity method that shows the capability of being used with synthetic video sequences. A large market of digital games, children's movies, and educational and training videos, among others, can benefit from the use of SRM.
  • SRM was able to predict a proper DR (decoding residual) block using the GGD parameter (a) as a quality selection factor at the block level in the DCT space, maximizing the objective visual quality of the restored block. Moreover, SRM achieved a 2.5% BD-rate gain, in terms of VMAF, against the existing switchable restoration filter in AV1/AV2 over synthetic content (class B), while complexity was kept between 105% and 110%. SRM leverages a guiding restoration flag to determine which blocks require restoration and which blocks can pass through. A future improvement for SRM will target predicting this flag at the decoder side, which would yield a significant bitrate reduction (approx. 10–15%).
  • Sparse representation is, without a doubt, an efficient approach to image restoration tasks. In the video coding in-loop restoration scenario, the critical challenge was eliminating the information that must be transferred between the encoder and the decoder to represent the nonzero coefficients. At the same time, moving a highly intensive task to the decoder is not an option, considering the real-time requirements of the decoding process. Therefore, we developed a hybrid approach in which most of the required information is predicted at the decoder, and only a guiding bit (encoder-flag) is required. The reason for the guiding bit is the poor precision of predicting whether a block, e.g., 32 × 32, should be collapsed or expanded in terms of the GGD.
  • Traditional full-reference quality metrics, such as PSNR, SSIM, and VMAF, are not completely consistent for assessing image/video restoration, since they rely on a reference image that is not always artifact-free (i.e., free of noise, blocking, and blurring). Therefore, future work should define computationally efficient, real-time no-reference metrics at the frame and block levels, in order to provide human-correlated data that allow the decoder to perform restoration without requiring context information shared by the encoder. Such a mechanism would improve in-loop restoration efficiency in terms of the required amount of signaling bits.

Author Contributions

Conceptualization and methodology, C.S., M.T. and J.W.B.-B.; investigation, C.S. and M.T.; evaluation tests, C.S.; formal analysis, C.S., M.T. and J.W.B.-B.; writing—original draft preparation, C.S. and M.T.; writing—review and editing, M.T. and J.W.B.-B.; supervision, M.T. and J.W.B.-B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

Author Carlos Salazar was employed by the company Amazon (United States). The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Lyn, C. The Global Internet Phenomena Report; Technical Report; Sandvine: Waterloo, ON, Canada, 2023. [Google Scholar]
  2. Sullivan, G.J.; Ohm, J.R.; Han, W.J.; Wiegand, T. Overview of the High Efficiency Video Coding (HEVC) Standard. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1649–1668. [Google Scholar] [CrossRef]
  3. Han, J.; Li, B.; Mukherjee, D.; Chiang, C.H.; Grange, A.; Chen, C.; Su, H.; Parker, S.; Deng, S.; Joshi, U.; et al. A Technical Overview of AV1. Proc. IEEE 2021, 109, 1435–1462. [Google Scholar] [CrossRef]
  4. Bross, B.; Wang, Y.K.; Ye, Y.; Liu, S.; Chen, J.; Sullivan, G.J.; Ohm, J.R. Overview of the Versatile Video Coding (VVC) Standard and Its Applications. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3736–3764. [Google Scholar] [CrossRef]
  5. Bhojani, D.R.; Dwivedi, V.J.; Thanki, R.M. Comparative Comparison of Standard and Hybrid Video Codec. In Hybrid Video Compression Standard; Springer: Singapore, 2020; pp. 57–58. [Google Scholar] [CrossRef]
  6. Ding, D.; Chen, G.; Mukherjee, D.; Joshi, U.; Chen, Y. A progressive CNN in-loop filtering approach for inter frame coding. Signal Process. Image Commun. 2019, 94, 116201. [Google Scholar] [CrossRef]
  7. Kong, L.; Ding, D.; Liu, F.; Mukherjee, D.; Joshi, U.; Chen, Y. Guided CNN Restoration with Explicitly Signaled Linear Combination. In Proceedings of the 2020 IEEE International Conference on Image Processing, Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 3379–3383. [Google Scholar] [CrossRef]
  8. Barman, N.; Martini, M.G.; Reznik, Y. Revisiting Bjontegaard Delta Bitrate (BD-BR) Computation for Codec Compression Efficiency Comparison. In Proceedings of the 1st Mile-High Video Conference, MHV ’22, New York, NY, USA, 1–3 March 2022; pp. 113–114. [Google Scholar] [CrossRef]
  9. Wang, Z.; Bovik, A. Reduced- and No-Reference Image Quality Assessment. IEEE Signal Process. Mag. 2011, 28, 29–40. [Google Scholar] [CrossRef]
  10. Wang, Z.; Simoncelli, E.; Bovik, A. Multiscale structural similarity for image quality assessment. In Proceedings of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 9–12 November 2003; Volume 2, pp. 1398–1402. [Google Scholar] [CrossRef]
  11. Valin, J.M. The Daala Directional Deringing Filter. arXiv 2016, arXiv:1602.05975. [Google Scholar]
  12. Umnov, A.V.; Krylov, A.S.; Nasonov, A.V. Ringing artifact suppression using sparse representation. Lect. Notes Comput. Sci. 2015, 9386, 35–45. [Google Scholar]
  13. Wiener, N. The Linear Predictor and Filter for Multiple Time Series. In Extrapolation, Interpolation, and Smoothing of Stationary Time Series: With Engineering Applications; MIT Press: Cambridge, MA, USA, 1964; pp. 104–116. [Google Scholar]
  14. Ekstrom, M.P. Realizable Wiener Filtering in Two Dimensions. IEEE Trans. Acoust. Speech Signal Process. 1982, 30, 31–40. [Google Scholar] [CrossRef]
  15. Siekmann, M.; Bosse, S.; Schwarz, H.; Wiegand, T. Separable Wiener filter based adaptive in-loop filter for video coding. In Proceedings of the 28th Picture Coding Symposium, Nagoya, Japan, 8–10 December 2010; pp. 70–73. [Google Scholar] [CrossRef]
  16. Jia, C.; Wang, S.; Zhang, X.; Wang, S.; Liu, J.; Pu, S.; Ma, S. Content-Aware Convolutional Neural Network for In-Loop Filtering in High Efficiency Video Coding. IEEE Trans. Image Process. 2019, 28, 3343–3356. [Google Scholar] [CrossRef] [PubMed]
  17. O’Shea, K.; Nash, R. An Introduction to Convolutional Neural Networks. arXiv 2015, arXiv:1511.08458. [Google Scholar]
  18. Kim, J.; Lee, J.K.; Lee, K.M. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar] [CrossRef]
  19. Dai, Y.; Liu, D.; Wu, F. A Convolutional Neural Network Approach for Post-Processing in HEVC Intra Coding. In Lecture Notes in Computer Science; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 28–39. [Google Scholar] [CrossRef]
  20. Chen, C.Y.; Tsai, C.Y.; Huang, Y.W.; Yamakage, T.; Chong, I.S.; Fu, C.M.; Itoh, T.; Watanabe, T.; Chujoh, T.; Karczewicz, M.; et al. The adaptive loop filtering techniques in the HEVC standard. In Proceedings of the Applications of Digital Image Processing XXXV; Tescher, A.G., Ed.; Society of Photo-Optical Instrumentation Engineers (SPIE): Bellingham, WA, USA, 2012; Volume 8499, p. 849913. [Google Scholar] [CrossRef]
  21. Ding, D.; Kong, L.; Chen, G.; Liu, Z.; Fang, Y. A Switchable Deep Learning Approach for In-Loop Filtering in Video Coding. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 1871–1887. [Google Scholar] [CrossRef]
  22. Segall, C.A.; Katsaggelos, A.K.; Molina, R.; Mateos, J. Super-Resolution from Compressed Video. In Super-Resolution Imaging; Chaudhuri, S., Ed.; Springer: Boston, MA, USA, 2000; pp. 211–242. [Google Scholar] [CrossRef]
  23. Cai, T.T.; Wang, L. Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise. IEEE Trans. Inf. Theory 2011, 57, 4680–4688. [Google Scholar] [CrossRef]
  24. Mairal, J.; Sapiro, G.; Elad, M. Learning multiscale sparse representations for image and video restoration. Multiscale Model. Simul. 2008, 7, 214–241. [Google Scholar] [CrossRef]
  25. Reininger, R.C.; Gibson, J.D. Distributions of the Two-Dimensional DCT Coefficients for Images. IEEE Trans. Commun. 1983, 31, 835–839. [Google Scholar] [CrossRef]
  26. Oxford. A Dictionary of Statistics; Oxford University Press: Oxford, UK, 2014. [Google Scholar] [CrossRef]
  27. Saad, M.A.; Bovik, A.C.; Charrier, C. DCT statistics model-based blind image quality assessment. In Proceedings of the 2011 18th IEEE International Conference on Image Processing, Brussels, Belgium, 11–14 September 2011; pp. 3093–3096. [Google Scholar] [CrossRef]
  28. Salazar, C.; Madan, S.; Bodas, A.V.; Velasco, A.; Bird, C.A.; Barut, O.; Trigui, T.; Liang, X. AWS Compute Video Super-Resolution powered by the Intel® Library for Video Super Resolution. In Proceedings of the 3rd Mile-High Video Conference, Denver, CO, USA, 11–14 February 2024; pp. 124–125. [Google Scholar] [CrossRef]
  29. Antsiferova, A.; Lavrushkin, S.; Smirnov, M.; Gushchin, A.; Vatolin, D.; Kulikov, D. Video compression dataset and benchmark of learning-based video-quality metrics. arXiv 2023, arXiv:2211.12109. [Google Scholar]
Figure 1. Post-processing filters (simplified view).
Figure 2. Illustration of boundary artifacts caused by quantization during AV1 video compression. The original frame is at the top, where the white line marks the boundary before the frame is split into two blocks. The decoded frame is at the bottom, with a green box highlighting the boundary discontinuity caused by the quantization process applied to each block.
Figure 3. Illustration of boundary blocks used to determine the size of a deblocking filter.
Figure 4. Illustration of boundary pixels involved in deblocking filtering.
Figure 5. Illustration of ringing artifacts. The left image is the original frame; the right image shows ringing artifacts around the object edges [12].
Figure 6. CDEF in eight directions, where the dark cube represents the central pixel [11].
Figure 7. Architecture of the CNN in-loop filter [16].
Figure 8. Architecture of the Guided CNN restoration [7].
Figure 10. Illustration of video test sequences A2–A5 and B1.
Figure 11. The AWS compute architecture used for the tests.
Figure 12. Subjective assessment (luma plane) of SRM restoration at the block level, using QP = 210. Left: original block (reference); center: distorted block due to AV2 compression artifacts; right: restored block after executing SRM and compensating it with the predicted DR (decoding residual).
Figure 13. SRM frame restoration using the Y plane (the U and V planes are the same as in the decoded frame), for a class B1 sequence with QP = 210. Top: original block (reference) at the left, distorted block due to AV2 compression artifacts at the center, and restored block with a +1 dB VMAF gain at the right. Bottom: full reference frame.
Table 1. Performance results of Wiener filter (on) vs. passthrough (off) mode in AV1 [3].

Metric | On | Off | Perf. On vs. Off
Average speed (fps) | 15 | 22 | −32%
Total encoding time (ms) | 40,856 | 27,951 | +46%
Table 2. Estimated parameters of the probability distribution functions by QP level.

QP | DC | AC
85 | — | Laplace(μ = 6.01, b = 1.08)
110 | N(a = 13.25, b = 0.70) | N(μ = 10.06, σ = 2.77)
135 | Γ(a = 6.30, b = 2.42) | N(μ = 17.98, σ = 6.94)
160 | Γ(a = 4.09, b = 5.89) | N(μ = 33.45, σ = 14.68)
185 | Γ(a = 3.04, b = 13.11) | N(μ = 61.37, σ = 28.53)
210 | Γ(a = 2.50, b = 27.81) | N(μ = 110.20, σ = 54.29)
Table 3. Selected raw video test sequences.

Class | Sub-Class | Resolution | Total
Natural Videos (A) | A2 | 1920 × 1080 | 10
Natural Videos (A) | A3 | 1280 × 720 | 6
Natural Videos (A) | A4 | 640 × 360 | 6
Natural Videos (A) | A5 | 480 × 270 | 3
Synthetic (B) | B1 | 1920 × 1080 | 7
Table 4. AV2 + SF (switchable filter) vs. AV2 + SRM (our method). BD-Rate (PSNR, SSIM, and VMAF) for Only-Intra mode is assessed on sequences A2–A5 and B1. AV2 + SRM surpasses the baseline method (AV2 + SF) in synthetic videos (B1).

BD-Rate *
Sequence | Implementation | PSNR | SSIM | VMAF
A2 | AV2 + SF | −1.816 | −0.035 | −1.994
A2 | AV2 + SRM | 0.487 | −0.337 | −2.206
A3 | AV2 + SF | −1.813 | −0.183 | −2.310
A3 | AV2 + SRM | 0.794 | 0.763 | −1.642
A4 | AV2 + SF | −2.156 | −1.890 | −0.326
A4 | AV2 + SRM | 1.158 | 1.355 | 0.730
A5 | AV2 + SF | −0.499 | −0.627 | −1.33
A5 | AV2 + SRM | 0.615 | −0.174 | −1.276
B1 | AV2 + SF | −0.337 | 0.025 | −1.603
B1 | AV2 + SRM | 1.047 | −0.949 | −2.585
* Negative is better.
Table 5. Only-Intra computational complexity (seconds/frame).

Sequence | AV2 + SF * | AV2 + SRM * | Time vs. AV2 + SF
A2–A5 | 18.32 | 20.21 | 110.31%
B1 | 14.84 | 15.61 | 105.18%
* Running on an AWS c5.2xlarge instance.
