3.1. Early CU Termination Threshold Selection with TCPOC
As shown in Figure 1, the coding sequence of the CTU adopts the z-scan order, and coding is performed depth level by depth level. The numbers at each coded depth level indicate the coding order of the coding units. This order ensures that, during encoding, each coding unit can obtain the reference CUs above it and to its left. The reference information is the information of the already coded units and is used to predict the coding parameters of the current CU. Naturally, coding units located at the top of the slice or at the left border of the image frame lack some of these references and must be excluded from this reference scheme.
To achieve the best RD-complexity trade-off in the mode decision process, the threshold that yields the lowest processing complexity for a given prediction accuracy should be chosen. The early determination strategy saves computation in the mode decision process by testing the TCPOC thresholds in descending order and stopping either because a better rate-distortion performance cannot be reached or because the minimal cost threshold is attained.
As discussed in Section 2.1, there is a one-to-one correspondence between the coding quadtree structure of a coding CU and its TCPOC value. Therefore, the best coding quadtree structure of the current CU is specified by the optimal TCPOC. If the best coding quadtree structure is reached before the final depth is processed, the remaining smaller-size partition CUs can be skipped without any coding efficiency degradation. It should be pointed out that, on the one hand, the accuracy of intra-prediction increases with greater CTU depth, since the average prediction residual between a predicted sample and the reference sample decreases. On the other hand, the average efficiency of prediction coding typically increases at smaller depth values. Hence, the optimal depth structure feature that allows the skipping of some partition units makes it possible to select a suitable trade-off between complexity reduction and coding efficiency.
Taking into account that partitioning the current CU depth into the next depth provides a more accurate description with finer prediction parameters, the next depth should be processed, in order to reduce the total cost, depending on the TCPOC of the current depth. When the TCPOC is below a pre-set threshold TH corresponding to the optimal RD cost, the non-texture cost takes up a large portion of the total cost, and no further cost reduction at the next depths is to be expected. Based on the coupling relationship between TCPOC and the depth levels from level 0 to level 2 in each CU, an early determination scheme is designed to decide whether the processing of the next depth can be skipped. An appropriate threshold accelerates the CU size decision scheme to the maximum extent while still maintaining high coding efficiency.
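To make the decision concrete, the sketch below shows how such a threshold test might look. The names (tcpoc_of, TH_PER_DEPTH) and the interpretation of TCPOC as the texture-cost proportion of the total RD cost are illustrative assumptions, not the paper's reference implementation, and the threshold values are placeholders.

```python
# Hypothetical per-depth thresholds in descending order (depth levels 0..2);
# actual values would come from the offline statistics described in the paper.
TH_PER_DEPTH = [0.40, 0.30, 0.20]

def tcpoc_of(texture_cost: float, total_cost: float) -> float:
    """Texture-cost proportion of the total RD cost (assumed TCPOC definition)."""
    return texture_cost / total_cost if total_cost > 0 else 0.0

def should_process_next_depth(texture_cost: float, total_cost: float, depth: int) -> bool:
    """Skip the next depth when TCPOC falls below the pre-set threshold TH,
    i.e., when the non-texture cost already dominates the total cost."""
    return tcpoc_of(texture_cost, total_cost) >= TH_PER_DEPTH[depth]
```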
3.2. Design of Early Termination Classification Model Based on Fuzzy Support Vector Machine
The rate-distortion improvement of the HEVC video encoder is mainly due to the high prediction accuracy of the CU partition. A smaller coding unit structure is used for high-texture complexity content, and a larger coding unit structure is used for smoother and low-texture complexity content. Therefore, the texture feature is the key feature that determines the size of the coding unit. In this paper, the spatial texture complexity, directional texture complexity, and coding unit texture content difference complexity are selected as texture features.
The content variance can accurately represent the spatial content complexity. Therefore, the spatial content complexity of each coding unit in this paper is defined as:

$C_s = \frac{1}{N}\sum_{i=1}^{N}\left(p_i - \bar{p}\right)^2, \quad \bar{p} = \frac{1}{N}\sum_{j=1}^{N} p_j$

where $N$ is the number of pixels of the current coding unit CU and $p_i$ is the brightness value of the $i$-th pixel in the current coding unit. The variance accurately describes the global content complexity of the coding unit; describing the complexity of local details requires additional features. Intra-frame coding emphasizes the directionality of coding units and reference blocks. In this paper, the directional texture complexity is defined as:

$C_d = \frac{1}{N}\sum_{i=1}^{N}\left(\left|G_{H}(i)\right| + \left|G_{V}(i)\right| + \left|G_{45^{\circ}}(i)\right| + \left|G_{135^{\circ}}(i)\right|\right)$

where $G_{H}$, $G_{V}$, $G_{45^{\circ}}$, and $G_{135^{\circ}}$ represent the Sobel gradients in the horizontal, vertical, $45^{\circ}$, and $135^{\circ}$ directions, respectively.
The coding units at different depths each have their own content complexity, and the smallest ($8 \times 8$) coding unit is used as the basic unit to extract the content texture difference between the depths of the coding unit. The difference complexity of the coding unit texture content can be expressed as:

$C_t = \frac{1}{M}\sum_{i=1}^{M}\left|C_s(SCU_i) - \bar{C}_s\right|, \quad \bar{C}_s = \frac{1}{M}\sum_{j=1}^{M} C_s(SCU_j)$

where $SCU_i$ is the $i$-th smallest sub-coding unit of the current CU and $M$ is the number of smallest coding units in the current coding unit.
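The three features can be sketched as follows. The diagonal Sobel kernels and the $8 \times 8$ basic-unit size follow the reconstructed formulas above and should be read as assumptions rather than the paper's exact implementation.

```python
import numpy as np
from scipy.ndimage import convolve

SOBEL_H   = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_V   = SOBEL_H.T
SOBEL_45  = np.array([[ 0,  1, 2], [-1, 0, 1], [-2, -1, 0]], dtype=float)
SOBEL_135 = np.array([[-2, -1, 0], [-1, 0, 1], [ 0,  1, 2]], dtype=float)

def spatial_complexity(cu: np.ndarray) -> float:
    """C_s: luma variance of the coding unit."""
    return float(cu.var())

def directional_complexity(cu: np.ndarray) -> float:
    """C_d: mean absolute Sobel response over the four directions."""
    cu = cu.astype(float)
    total = sum(np.abs(convolve(cu, k)).sum()
                for k in (SOBEL_H, SOBEL_V, SOBEL_45, SOBEL_135))
    return float(total / cu.size)

def texture_difference_complexity(cu: np.ndarray, scu: int = 8) -> float:
    """C_t: mean absolute deviation of the 8x8 sub-unit variances."""
    h, w = cu.shape
    variances = np.asarray([cu[r:r + scu, c:c + scu].var()
                            for r in range(0, h, scu)
                            for c in range(0, w, scu)])
    return float(np.abs(variances - variances.mean()).mean())
```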
By combining the above texture content features with non-texture features, including the segmentation structure information of the current CTU, the bit rate of the header, and the quantization step, the problem of predicting the early termination of the CU size decision can be described as a binary classification problem. Many types of classifiers can solve binary classification problems; support vector machines are machine learning classifiers based on statistical learning theory. In a traditional SVM, all sample data have the same importance for the classification hyperplane, and the classifier assigns the same penalty factor to all samples. However, in intra-frame coding unit classification, the sample data are often affected by noise and by the feature distribution, and different samples influence the classification hyperplane to different degrees. To address this, the fuzzy support vector machine introduces a fuzzy membership degree, assigns different membership values to different samples, distinguishes their different degrees of influence on the classification hyperplane, and thereby determines a more accurate classification hyperplane.
For a given training data set $T = \{(x_1, y_1, s_1), \ldots, (x_l, y_l, s_l)\}$, $x_i \in \mathbb{R}^n$ represents the feature vector of each sample; $y_i \in \{-1, +1\}$ represents the two different classes of $x_i$; $s_i$ is a fuzzy membership function indicating the reliability of the sample $x_i$ belonging to the class $y_i$, with $0 < s_i \le 1$. Support vector machines use the feature mapping function $\phi(\cdot)$ to map the training samples into a high-dimensional feature space, namely $x_i \mapsto \phi(x_i)$; the converted training samples $\phi(x_i)$ are obtained, the classification hyperplane is $w \cdot \phi(x) + b = 0$, and the kernel function is $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$. In intra-frame predictive coding, the depth distribution of the coding units is not uniform, so the sample distribution is unbalanced. As shown in Figure 4, for the distribution of the positive samples A and B, if the traditional distance membership degree is used to express the influence relationship, the two sample points receive the same membership degree, but in fact sample point B has distribution attributes similar to those of the negative samples. Therefore, the local distribution attributes of the samples provide better classification characteristics. This paper proposes a membership function combining a distance scale with a local distribution scale based on information entropy to handle the influence of noise and outliers more accurately.
Introducing the imbalance factor of the data set, the general form of the fuzzy support vector machine can be expressed as:

$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C^{+}\sum_{y_i = +1} s_i \xi_i + C^{-}\sum_{y_i = -1} s_i \xi_i$
$\text{s.t.} \;\; y_i\left(w \cdot \phi(x_i) + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, l$

In the formula, $C^{+}$ and $C^{-}$, respectively, represent the penalty factors of the positive and negative sample data, and $\xi_i$ represents the relaxation (slack) factor. The optimal classification hyperplane of Equation (14) is solved by the Lagrangian multiplier method, as follows:

$L = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{l} C_{y_i} s_i \xi_i - \sum_{i=1}^{l}\alpha_i\left[y_i\left(w \cdot \phi(x_i) + b\right) - 1 + \xi_i\right] - \sum_{i=1}^{l}\beta_i \xi_i$

where $C_{y_i}$ denotes $C^{+}$ for positive samples and $C^{-}$ for negative samples. According to the above formula, the dual problem is obtained:

$\max_{\alpha} \; \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i \alpha_j y_i y_j K(x_i, x_j)$
$\text{s.t.} \;\; \sum_{i=1}^{l}\alpha_i y_i = 0, \quad 0 \le \alpha_i \le C_{y_i} s_i$

The final decision function of the optimal classification hyperplane is:

$f(x) = \operatorname{sign}\left(\sum_{i=1}^{l}\alpha_i y_i K(x_i, x) + b\right)$
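As a rough illustration, the per-sample membership weighting above can be approximated with an off-the-shelf kernel SVM by passing the memberships as sample weights and the class-dependent penalties as class weights. This is a sketch of the idea, not the paper's own implementation.

```python
import numpy as np
from sklearn.svm import SVC

def train_fsvm(X, y, membership, c_pos=1.0, c_neg=1.0, gamma="scale"):
    """Approximate FSVM: each sample's effective penalty becomes
    C * class_weight[y_i] * s_i, mirroring the C^+/C^- s_i terms above."""
    clf = SVC(C=1.0, kernel="rbf", gamma=gamma,
              class_weight={+1: c_pos, -1: c_neg})
    clf.fit(X, y, sample_weight=membership)
    return clf

# Usage sketch (+1: terminate early, -1: continue splitting):
# clf = train_fsvm(X_train, y_train, membership=s)
# terminate = clf.predict(x_current.reshape(1, -1))[0] == +1
```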
The centers of the positive and negative classes are:

$\phi_{+} = \frac{1}{l^{+}}\sum_{y_i = +1}\phi(x_i), \qquad \phi_{-} = \frac{1}{l^{-}}\sum_{y_i = -1}\phi(x_i)$

In the formula, the number of positive samples is $l^{+}$ and the number of negative samples is $l^{-}$. Substituting the above formula, the distance-based fuzzy membership is obtained:

$s_i^{dis} = 1 - \frac{\left\|\phi(x_i) - \phi_{y_i}\right\|}{\max_{y_j = y_i}\left\|\phi(x_j) - \phi_{y_i}\right\| + \delta}$

where $\delta$ is a small positive number that avoids a zero denominator and ensures that $s_i^{dis} > 0$.
According to the concept of information entropy, the average amount of uncertainty information of $x_i$ belonging to the positive and negative classes is as follows:

$H(x_i) = -p_i^{+}\log_2 p_i^{+} - p_i^{-}\log_2 p_i^{-}$

where $p_i^{+}$ and $p_i^{-}$, respectively, represent the probability that $x_i$ belongs to the positive and the negative class, and the probabilities are obtained by sampling over random neighbors. That is, the $K$ sample points with the smallest Euclidean distance from the current sample point $x_i$ are selected as the sampling data set, and the numbers of positive and negative samples in this data set, $K^{+}$ and $K^{-}$, are counted. The class probabilities are then $p_i^{+} = K^{+}/K$ and $p_i^{-} = K^{-}/K$. After obtaining the average amount of information of $x_i$, a new fuzzy membership function can be obtained:

$s_i^{ent} = 1 - H(x_i)$

where $0 \le H(x_i) \le 1$, since the entropy of a two-class distribution with base-2 logarithms lies in $[0, 1]$.
Therefore, by fusing Equations (21) and (23), the fuzzy membership function of each sample can be defined as:

$s_i = \rho\, s_i^{dis} + (1 - \rho)\, s_i^{ent}$

where $\rho$ is the control factor, $0 < \rho < 1$, ensuring that $0 < s_i \le 1$.
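A minimal sketch of the fused membership computation, assuming the reconstructed forms above: class-center distances (computed here in input space rather than the kernel feature space, for simplicity), K-nearest-neighbor entropy, and a control factor named rho to avoid the Python keyword lambda.

```python
import numpy as np

def distance_membership(X, y, delta=1e-6):
    """Distance-to-class-center membership, per Eq. (21) reconstruction."""
    s = np.empty(len(X))
    for cls in (+1, -1):
        idx = np.where(y == cls)[0]
        d = np.linalg.norm(X[idx] - X[idx].mean(axis=0), axis=1)
        s[idx] = 1.0 - d / (d.max() + delta)
    return s

def entropy_membership(X, y, K=7):
    """K-NN entropy membership, per Eq. (23) reconstruction."""
    s = np.empty(len(X))
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:K + 1]           # K nearest neighbors, excluding self
        p_pos = np.count_nonzero(y[nn] == +1) / K
        H = -sum(p * np.log2(p) for p in (p_pos, 1.0 - p_pos) if p > 0)
        s[i] = 1.0 - H                        # binary entropy lies in [0, 1]
    return s

def fused_membership(X, y, rho=0.5, K=7, eps=1e-3):
    """Fused membership s_i; clipped to keep memberships strictly positive."""
    s = rho * distance_membership(X, y) + (1 - rho) * entropy_membership(X, y, K)
    return np.clip(s, eps, 1.0)
```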
In order to improve the classification accuracy of the fuzzy support vector machines and to limit the additional computational complexity of classification, the training of the early termination classification model proposed in this paper is divided into two stages: offline learning and online fuzzy membership update. In offline learning, the original video reference encoder directly encodes multiple types of video sequences to generate a training data set; after a sufficiently accurate classification model is obtained, the model is loaded into the video encoder to terminate the mode decision processing of coding units in advance. In the prediction stage, the first encoded frame of each video sequence is processed by the original video reference encoder to obtain accurate fuzzy membership parameters and more accurate imbalance factors, which are then used by the fuzzy support vector machine model for early termination decisions. In addition, to obtain a robust classification model, the first 50 frames of the two video sequences “BQTerrace (1920 × 1080)” and “BasketballDrill (832 × 480)” are encoded with the four quantization parameters “22”, “27”, “32”, and “37” to generate the training data sets. The two video sequences have different content characteristics and spatial resolutions, which makes them suitable for complete training.
3.3. Initial Best Depth Prediction
Video content has a high degree of statistical correlation between spatial units. Therefore, the sizes of neighboring coding units are also highly correlated, and this spatial correlation has been used for coding unit prediction in many research works [6,7]. However, in these works, only the CTU is the target of prediction; the correlation between the inner coding depths of a CTU and the correlation between the sizes of spatially adjacent coding units have not been deeply explored. In this paper, we not only use spatial correlation features but also introduce rate-distortion cost-related features between coding unit depths, design a more accurate fuzzy support vector machine classifier, and decide on coding unit splits in advance.
It can be seen from Table 1 that when the video content is smooth or the quantization parameter is large, the proportion of coding units whose best depth level is “0” is very high, reaching an average of 19.8%; for the video sequence “Kimono”, it is nearly 30% under every quantization parameter. These statistics show that if the best coded depth is directly predicted as “0” at the CTU level, all subsequent depth level calculations can be saved. Therefore, in this paper, fuzzy support vector machine classifier 0 is applied at the depth level “0” stage to predict whether the best depth level is “0”. The main idea is to use the correlation between the rate-distortion cost of the current depth level “0” and the optimal rate-distortion costs of the spatially adjacent CTUs, that is, the upper, left, upper-left, and upper-right CTUs; these costs, combined with the average optimal rate-distortion cost of the coded CTUs, form a candidate set from which the smallest rate-distortion cost is selected and used to design the classifier.
In order to accurately describe the relationship between the optimal rate-distortion cost and the spatially adjacent CTUs, the normalized optimal rate-distortion cost difference rate is defined as:

$RDD = \frac{\left|J_{opt} - J_{min}\right|}{J_{opt}}$

where $J_{opt}$ is the optimal rate-distortion cost of the current CTU and $J_{min}$ is the smallest rate-distortion cost in the candidate set.
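The RDD feature is straightforward to compute once the candidate set is assembled; a minimal sketch follows, in which the collection of candidate costs from the four neighboring CTUs and the coded-CTU average is assumed to happen elsewhere.

```python
def rdd(j_opt_current: float, candidate_costs) -> float:
    """RDD = |J_opt - J_min| / J_opt, with J_min the smallest candidate cost.
    candidate_costs: RD costs of the upper, left, upper-left, and upper-right
    CTUs plus the average optimal cost of the coded CTUs (per the text)."""
    j_min = min(candidate_costs)
    return abs(j_opt_current - j_min) / j_opt_current
```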
The original HM encoder is used to encode three standard video sequences (BQTerrace, BasketballDrill, and Kimono) and to collect the actual data shown in Figure 5. It can be seen from the figure that there is a high degree of correlation between the rate-distortion cost of the current CTU and that of the coded CTUs: nearly 80% of the RDD values are less than 5%, and more than 90% are less than 10%. Therefore, the rate-distortion cost correlation provides high prediction stability.
If classifier 0 determines that depth level “0” is not the optimal depth, classifier 1 is entered, as shown in Figure 6. In the design of classifier 1, based on the information obtained from the actual rate-distortion optimization calculation at coding unit depth “0”, features relating the depths are extracted to determine in advance whether the rate-distortion processing of the coding unit at the current depth level can be skipped. As shown in Figure 7, directly using the information of the coded depth level “$d$”, the rate-distortion cost of the depth level “$d$” is:

$J_d = D_d + \lambda\left(R_{tex} + R_{nontex}\right)$
According to Figure 6, the coding unit at the current depth level is further divided into the four sub-coding units of the next depth level, and the sub-region of the coded depth level “$d$” corresponding to the $i$-th sub-coding unit is defined as $CU_d^i$, where $i \in \{0, 1, 2, 3\}$. The coded area is divided into four parts according to the sub-blocks of the corresponding region, and the cost is decomposed according to the header information and residual information. The corresponding relationship is:

$D_d = \sum_{i=0}^{3} D_i, \qquad R_{tex} = \sum_{i=0}^{3} R_{tex}^i$
where $D_i$ is the coding distortion of each sub-region corresponding to the depth level “$d$”. Similarly, the corresponding texture part code rate is $R_{tex}^i$, the non-texture information obtained from the depth level “$d$” coding is $R_{nontex}$, and the part equally distributed to each sub-region is $R_{nontex}/4$. Then, Equation (27) becomes:

$J_d = \sum_{i=0}^{3} J_d^i, \qquad J_d^i = D_i + \lambda\left(R_{tex}^i + \frac{R_{nontex}}{4}\right)$

where $J_d^i$ is the rate-distortion cost contributed by the $i$-th sub-region.
It can be seen from Section 2.2 that there is a high correlation between the minimum rate-distortion cost of the spatially adjacent CTUs and that of the current CTU. The sub-region costs $J_d^i$ are therefore compared against $J_{min}$ to evaluate the sub-block segmentation complexity of the current coding unit and to determine whether the division of the current coding unit can be omitted, jumping directly to the next depth level, where $J_{min}$ is the smallest CTU RD cost in the candidate set. The threshold for skipping the segmentation process can be defined as a multiple of the minimum candidate cost:

$TH_{skip} = \omega \cdot \frac{J_{min}}{4}$

where $\omega$ is a scaling factor. The original HM encoder is used to encode the three standard video sequences (BQTerrace, BasketballDrill, and Kimono) with the $\omega$ value set to three, and the actual data are collected in Table 4.
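Putting the reconstructed Equations (27), (28), and the skip threshold together, the depth-skip test might be sketched as follows; the per-sub-region form of the threshold and the all-sub-regions condition are assumptions consistent with the reconstruction above.

```python
def subregion_costs(D, R_tex, R_nontex, lam):
    """J_d^i for the four sub-regions of the coded depth level d,
    per the reconstructed Eq. (28)."""
    return [D[i] + lam * (R_tex[i] + R_nontex / 4.0) for i in range(4)]

def skip_current_depth(D, R_tex, R_nontex, lam, j_min, omega=3.0):
    """Skip the split of the current depth when every sub-region cost is
    already small relative to the minimum candidate CTU cost J_min
    (omega = 3, the value used when collecting Table 4)."""
    th = omega * j_min / 4.0   # per-sub-region share of the threshold (assumption)
    return all(j < th for j in subregion_costs(D, R_tex, R_nontex, lam))
```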
It can be seen from Table 4 that, for the different quantization parameters Qp, the accuracy rate is above 92%, with an average of 95.67%; the accuracy is both high and stable across quantization parameters. In particular, for the video sequences BQTerrace and BasketballDrill, most of the optimal depth levels are greater than “2”, so accurate prediction allows the rate-distortion calculations of depth levels “1” and “2” to be skipped. Therefore, the rate-distortion cost correlation decision between depth levels also has high prediction stability.
3.4. Overall Algorithm
As shown in Figure 6, the fast coding unit size decision scheme proposed in this paper comprises three fast algorithms: early decision of the optimal coded depth level “0”, early skipping of coding depth levels, and early termination of the coding unit size decision. The three algorithms are designed based on the texture and non-texture rate-distortion relationship, so that each unit can make an optimal decision based on the spatial context and the internal rate-distortion costs of the CTU. In addition to the coded features, spatial complexity features and non-texture/texture relationship features are also applied in our fast algorithms. Furthermore, according to the characteristics and statistical distribution of the coding unit sizes, a set of fuzzy support vector machine classifiers is designed and used in the three fast algorithms to obtain adaptive feature thresholds under different trade-off requirements, so that RD performance and complexity can be balanced effectively. For each input video sequence, the first three frames are encoded by the original HM encoder, and the actual rate-distortion optimized coding unit size results and the corresponding feature data are obtained and used to train the current FSVM classifiers 0, 1, and 2, respectively.
According to the above analysis process, our proposed coding unit size decision algorithm is summarized as follows:
Step (1): Start encoding from depth level “0”, perform the rate-distortion optimization encoding process, and obtain the code rate and distortion cost of depth level “0”. Based on the coded information and the spatially related coding unit information, collect the feature vectors relevant to the classifier, and input the collected feature vector data into classifier “0” to obtain the classification result. If the result is “Yes”, depth level “0” is the best coding size and the subsequent depth encoding is exited; otherwise, if it is “No”, the subsequent depth level encoding continues; go to step (2).
Step (2): With depth level “0” encoded, enter the early skip classifier to determine whether to perform or omit the encoding calculation of depth level “1”. Using the partition structure shown in Figure 7, extract the rate-distortion costs of the corresponding sub-coding units and combine them with the feature vector obtained in step (1) as input to classifier “1”. If the classification result is “Yes”, the current depth level is omitted and processing proceeds to depth level “2”; go to step (5). Otherwise, if the classification result is “No”, enter the rate-distortion calculation of the current depth level “1”; go to step (3).
Step (3): Execute the rate-distortion optimization process of the depth level “1” coding unit, obtain the feature vector data of the current depth level, and input them into early termination classifier 2. If the early termination result is “Yes”, depth level “1” is the optimal coding depth; otherwise, go to step (4).
Step (4): Perform the depth level “1” early skip classifier processing to determine whether to perform or omit the depth level “2” encoding calculation. Using the partition structure shown in Figure 7, extract the rate-distortion costs of the sub-coding units corresponding to depth level “1”, extract the corresponding feature vector, and input it into classifier “1”. If the classification result is “Yes”, omit the coding calculation of depth level “2” and go to the depth level “3” processing in step (6); otherwise, enter the rate-distortion calculation of depth level “2” and go to step (5).
Step (5): Execute the rate-distortion optimization process of depth level “2”, obtain the feature vector data of the current depth level, and input them into early termination classifier 2. If the early termination result is “Yes”, depth level “2” is the optimal coding depth; otherwise, go to step (6).
Step (6): Execute the rate-distortion calculation of the depth level “3” and go to step (7).
Step (7): Choose the best coding depth level.
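The seven steps can be summarized in the following sketch, where clf0, clf1, and clf2 stand for the three FSVM classifiers, and encode_depth/features are placeholders for the HM rate-distortion routines and feature extraction (assumptions for illustration, not actual HM APIs).

```python
# Placeholders: in a real integration these call into the HM encoder.
def encode_depth(ctu, depth):
    raise NotImplementedError("RDO encoding of `ctu` at `depth` (HM call)")

def features(ctu, depth):
    raise NotImplementedError("classifier feature vector, shape (1, n)")

def decide_cu_size(ctu, clf0, clf1, clf2):
    costs = {0: encode_depth(ctu, 0)}                 # Step 1: RDO at depth 0
    if clf0.predict(features(ctu, 0))[0] == +1:
        return 0                                      # depth 0 is optimal
    if clf1.predict(features(ctu, 0))[0] == +1:       # Step 2: skip depth 1?
        costs[2] = encode_depth(ctu, 2)               # go straight to depth 2
    else:
        costs[1] = encode_depth(ctu, 1)               # Step 3: RDO at depth 1
        if clf2.predict(features(ctu, 1))[0] == +1:
            return 1                                  # early termination
        if clf1.predict(features(ctu, 1))[0] == +1:   # Step 4: skip depth 2?
            costs[3] = encode_depth(ctu, 3)           # jump to depth 3 (Step 6)
            return min(costs, key=costs.get)          # Step 7
        costs[2] = encode_depth(ctu, 2)               # Step 5: RDO at depth 2
    if clf2.predict(features(ctu, 2))[0] == +1:
        return 2                                      # early termination
    costs[3] = encode_depth(ctu, 3)                   # Step 6: RDO at depth 3
    return min(costs, key=costs.get)                  # Step 7: best depth
```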