The flow chart of the SDWBF Algorithm proposed in this paper is shown in
Figure 1. (i) Redundant frame filtering: first, the UAV video is filtered using the pixel differences between frames; then, on the basis of this initial filtering, the frames are filtered again with a weighted fusion of visual and structural similarity to obtain the final dataset. (ii) Target detection: first, the scale of the pre-training dataset is adjusted to improve the performance of the detector. Then, the target detector extracts static semantic features to obtain the target's initial category and visual position. Finally, deep optical flow estimation is performed on the final dataset, the optical flow result is transformed into dynamic bounding boxes, and these are weighted-fused with the static bounding boxes to refine the preliminary detection results.
3.1. Weighted Filtering Algorithm for Redundant Frames of Drone Video
Due to high frame rate photography, many frames in UAV videos are highly similar and therefore highly redundant. If the drone video is detected frame by frame, the computation is enormous and time-consuming. Therefore, this paper first proposes a weighted filtering algorithm for redundant frames. The method first uses the inter-frame pixel difference algorithm to perform preliminary frame filtering and, on this basis, filters again with a weighted fusion of the visual and structural similarity of the frames, reducing the computational complexity of detection. A picture is essentially a two-dimensional signal composed of multiple frequency components [19]. The inter-frame pixel difference algorithm mainly exploits the low-frequency components of the picture. It removes the high-frequency part of the picture to reduce the amount of image information by shrinking the picture and converting it to grayscale, which makes it suitable for a first-pass selection of highly similar pictures. Image similarity measurement fuses visual similarity and structural similarity. Visual similarity is measured by image features (color, shape, texture, etc.) that conform to human vision. Structural similarity compares image quality globally through statistical indicators (entropy, grayscale, etc.). However, the image backgrounds in the experimental dataset are highly similar and the objects move slowly, so extracting image features would require a large amount of computation. The method proposed in this paper does not require heavy image feature extraction but compares frames from a global perspective.
The weighted filtering algorithm for redundant frames proposed in this paper first screens the frames of the UAV video through the inter-frame pixel difference algorithm. The pixel difference value between frames is judged as follows:
In Formula (1), $h_i^k$ and $h_j^k$, respectively, represent the binary value of the $k$th bit of the hash conversion of the $i$th frame and the $j$th frame image, ⊕ represents the exclusive OR operation, and $D(i, j)$ represents the calculated difference value between the $i$th and $j$th frames. In Formula (2), the similarity measurement parameter between two frames is set to 5: when $D(i, j)$ is less than 5, we consider the two frames to be similar and filter this frame; when $D(i, j)$ is greater than 5, we consider them dissimilar.
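A minimal sketch of this inter-frame filtering step is given below. The XOR-based difference and the threshold of 5 follow Formulas (1) and (2); the 8 × 8 average hash, the resizing interpolation, and the grayscale conversion are assumed implementation choices, not details stated in the paper.

```python
# Hedged sketch of the inter-frame pixel difference filter (Formulas (1)-(2)).
# The 8x8 average hash is an assumed concrete choice for the "hash conversion".
import cv2
import numpy as np

def frame_hash(frame, size=8):
    """Shrink the frame, convert it to grayscale, and binarize against the mean."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (size, size), interpolation=cv2.INTER_AREA)
    return (small > small.mean()).astype(np.uint8).flatten()

def frame_difference(frame_i, frame_j):
    """D(i, j): number of differing hash bits (XOR, i.e., Hamming distance)."""
    return int(np.count_nonzero(frame_hash(frame_i) ^ frame_hash(frame_j)))

def is_redundant(frame_i, frame_j, threshold=5):
    """Formula (2): frames with D(i, j) below the threshold are treated as similar."""
    return frame_difference(frame_i, frame_j) < threshold
```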
Then, the frames are filtered again by combining visual and structural image similarity on this basis. EMD [20] is more robust than other methods of measuring visual similarity [15]. The main idea of EMD is to measure the distance between two distributions; the EMD distance between histograms is used as the visual similarity of the images. Compared with other methods of measuring structural similarity, SSIM [21] is a full-reference image quality evaluation index [15]. It combines brightness, contrast, and structure to measure image similarity. To accelerate processing, the RGB image is converted to gray space, and SSIM is calculated in gray space [22]. This algorithm fuses visual and structural similarity by a weight ratio and defines the parameter w as the weight coefficient. The PDCSF (pixel difference cascade similarity fusion) algorithm is as follows.
Here, $\overline{E}$ denotes the normalized EMD distance, $\mu$ denotes the mean of the EMD distances between all original frames, and $\sigma$ denotes their standard deviation. The larger the EMD distance, the less similar the two images. To facilitate the later calculation, the normalized value is defined so that the smaller it is, the more dissimilar the images, which is consistent with the SSIM value. Formula (5) indicates whether to filter the frame: when its value is 1, the frame is judged similar and is filtered.
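The following sketch illustrates one way the PDCSF fusion could be implemented, assuming grayscale histograms for the EMD computation. The histogram size, the filtering threshold, and the sign convention of the normalization are assumptions rather than values from the paper; the normalization constants (mean and standard deviation of the EMD distances over all original frames) are supplied by the caller.

```python
# Hedged sketch of the PDCSF fusion of visual (EMD) and structural (SSIM) similarity.
import numpy as np
from scipy.stats import wasserstein_distance
from skimage.metrics import structural_similarity

def emd_distance(gray_a, gray_b, bins=64):
    """EMD (1-D Wasserstein distance) between normalized grayscale histograms."""
    ha, _ = np.histogram(gray_a, bins=bins, range=(0, 255), density=True)
    hb, _ = np.histogram(gray_b, bins=bins, range=(0, 255), density=True)
    centers = np.arange(bins, dtype=float)
    return wasserstein_distance(centers, centers, ha, hb)

def pdcsf_score(gray_a, gray_b, emd_mean, emd_std, w=0.5):
    """Weighted fusion: normalized, sign-flipped EMD (larger = more similar) and SSIM."""
    emd_sim = -(emd_distance(gray_a, gray_b) - emd_mean) / (emd_std + 1e-8)
    ssim = structural_similarity(gray_a, gray_b, data_range=255)
    return w * emd_sim + (1.0 - w) * ssim

def filter_flag(gray_a, gray_b, emd_mean, emd_std, w=0.5, threshold=0.8):
    """Formula (5): return 1 (filter the frame) when the fused similarity is high."""
    return int(pdcsf_score(gray_a, gray_b, emd_mean, emd_std, w) > threshold)
```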
3.2. Small-Sized Pedestrian Detection Method Based on the Weighted Fusion of Static and Dynamic Bounding Boxes
The ideal detector must achieve high accuracy in localization and recognition as well as high efficiency in terms of time. In recent years, many effective object detection methods have been proposed, such as SSD [11] and Fast R-CNN [23]. This paper selects the recently proposed CNN architecture YOLOv4 [13] as the object detector for extracting static features. Compared with other networks, it achieves an optimal balance between speed and accuracy. However, because the distance between the target and the camera is large from the UAV's perspective, the target size is small, and tiny objects pose a big challenge for feature representation. In addition, the scale gap between the dataset used for network pre-training and the dataset learned by the detector may degrade the feature representation and the detector [9], increasing the risk of false detection in large-scale and complex backgrounds. Therefore, YOLOv4 alone will cause missed or multiple detections of small objects.
To further improve the accuracy of moving object detection, this paper proposes a small-size pedestrian detection method based on the weighted fusion of static and dynamic bounding boxes. First, the pre-training and detector learning datasets are scale-matched to improve the detector's feature representation and performance. Secondly, on this basis, the difference between consecutive frames is used to capture the motion information of moving objects, and the static and motion boxes are fused with weights to improve detection capability. Currently, the most common method of extracting motion information is optical flow [24], and deep learning methods such as FlowNet [25] and PWC-Net [26] have achieved great success in optical flow estimation. The recently proposed LiteFlowNet3 [24] is more accurate than other optical flow estimation networks. Therefore, this paper uses LiteFlowNet3 to generate optical flow images, extracts motion boxes through threshold segmentation, and finally fuses the motion boxes with the static boxes extracted by the object detector using weights. This paper uses the concept of Intersection-over-Union (IoU) [27] to achieve the weighted fusion of static and motion boxes by calculating the IoU between the static boxes generated by YOLOv4 and the motion boxes generated based on LiteFlowNet3. IoU is the ratio of the intersection area of two bounding boxes to their union, where $A$ and $B$ are the areas of the two bounding boxes. The calculation formula is as follows:

$$\mathrm{IoU} = \frac{A \cap B}{A \cup B}$$
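A minimal IoU helper, assuming boxes are given as (x1, y1, x2, y2) corner coordinates, is sketched below; the box format is an assumption for illustration.

```python
# Hedged sketch of the IoU computation used for the box fusion.
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```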
3.2.1. Scale Matching to Reduce the Loss of Detector Features
In this paper, scale matching between the pre-training and detector learning datasets is put forward to improve the feature representation. The static boxes of the object detector and the dynamic boxes based on optical flow estimation are then fused with weights, which makes full use of the motion information of objects to reduce the miss rate and achieve more accurate small-size pedestrian detection. In this paper, the static box collection generated by YOLOv4 is defined as $B_s$, and the motion box collection generated based on LiteFlowNet3 is defined as $B_m$. The object detection method based on the weighted fusion of static and motion boxes mainly includes the following steps.
The pre-training dataset and the detector learning dataset are scale-matched to improve the feature representation and the performance of the detector. Scale matching essentially makes the target scale histogram distribution of the pre-training dataset similar to that of the detector learning dataset. First, we calculate the average size $s_1$ of the label boxes in any picture of the pre-training dataset and select a bin in the scale histogram of the detector learning dataset. Secondly, we determine, from the selected bin, the size $s_2$ of the scale-matched label, and the scale migration ratio $s_2/s_1$ is obtained. Finally, scale matching is performed on the pictures in the pre-training dataset according to the scale migration ratio. The calculation formula for scale matching is shown in Equation (7). Among them, $s_{min}$ and $s_{max}$ represent the minimum and maximum sizes of the object, respectively, $P(s; X)$ represents the probability density function of the scale $s$ of any dataset $X$, $E$ represents the pre-training dataset, and $D$ represents the target training set. The abscissa of the probability density function is the size of the dataset label boxes, and the ordinate is the probability density. The scale matching function $f$ maps the label box size $s$ in the pre-training set $E$ to $\hat{s} = f(s)$.
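The sketch below illustrates this scale-matching step under the definitions above: each pre-training image is resized by the migration ratio $s_2/s_1$, with $s_2$ sampled from the scale histogram of the detector learning dataset. The (x1, y1, x2, y2) box format, the use of sqrt(width × height) as the box size, and the sampling strategy are assumptions for illustration.

```python
# Hedged sketch of the scale-matching step (one image at a time).
import cv2
import numpy as np

def mean_box_size(boxes):
    """s1: average sqrt(width * height) over the label boxes of one image."""
    return float(np.mean([np.sqrt((x2 - x1) * (y2 - y1)) for x1, y1, x2, y2 in boxes]))

def scale_match(image, boxes, target_hist, target_bins, rng=np.random):
    """Resize one pre-training image and its boxes by the migration ratio s2 / s1."""
    s1 = mean_box_size(boxes)
    # Sample a bin of the target scale histogram, then a size s2 inside that bin.
    probs = np.asarray(target_hist, dtype=float)
    bin_idx = rng.choice(len(probs), p=probs / probs.sum())
    s2 = rng.uniform(target_bins[bin_idx], target_bins[bin_idx + 1])
    ratio = s2 / s1
    h, w = image.shape[:2]
    resized = cv2.resize(image, (max(1, int(w * ratio)), max(1, int(h * ratio))))
    scaled_boxes = [tuple(c * ratio for c in box) for box in boxes]
    return resized, scaled_boxes
```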
After the pre-training dataset and the detector learning dataset are scale-matched and the performance of the object detector is improved, the extracted keyframes are input into the detector to extract deep semantic features, and the static bounding box set $B_s$ is obtained.
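One possible way to obtain the static box set $B_s$ from a trained YOLOv4 model is OpenCV's DNN module, sketched below. The configuration and weight file names, the input size, and the thresholds are placeholders; any YOLOv4 inference pipeline that returns (x1, y1, x2, y2) boxes can be substituted.

```python
# Hedged sketch: static box extraction with a trained YOLOv4 model via OpenCV DNN.
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")   # hypothetical paths
out_names = net.getUnconnectedOutLayersNames()

def static_boxes(frame, conf_thr=0.25, nms_thr=0.45, input_size=608):
    """Run YOLOv4 on one keyframe and return the static bounding box set B_s."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (input_size, input_size),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    boxes, scores = [], []
    for out in net.forward(out_names):
        for det in out:                   # det = [cx, cy, bw, bh, objectness, class scores...]
            score = float(det[4] * det[5:].max())
            if score < conf_thr:
                continue
            cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            scores.append(score)
    keep = np.array(cv2.dnn.NMSBoxes(boxes, scores, conf_thr, nms_thr)).flatten()
    return [(x, y, x + bw, y + bh) for x, y, bw, bh in (boxes[i] for i in keep)]
```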
3.2.2. A Weighted Fusion Algorithm for Static and Dynamic Bounding Boxes
The optical flow image is obtained by LiteFlowNet3. The binary motion map is obtained by threshold segmentation of the optical flow image; the connected components of the binary motion image are then analyzed, finally yielding the motion feature box set $B_m$. The overall flow chart is shown in Figure 2.
- (1)
To obtain an RGB optical flow image, we convert the optical flow vector generated by LiteFlowNet3. Different colors in the RGB image represent different directions of motion, and the color depth indicates the speed of motion.
- (2)
We perform threshold segmentation on the optical flow image to obtain a binary motion image. Threshold segmentation methods are divided into global and local threshold methods [28]. The global threshold method uses global information to find the optimal segmentation threshold for the entire image. However, for small-size pedestrian images, it is difficult to separate the objects from the background with a single whole-image threshold because the objects occupy only a small area of the image. Therefore, this paper uses the local threshold method to segment the RGB optical flow image into a binary motion image. Its idea is to adaptively calculate different thresholds according to the brightness distribution of different areas of the image. For an image $P$, the threshold $T(x, y)$ of each pixel $(x, y)$ is calculated through Gaussian filtering, which also denoises the optical flow image to a certain extent, and $C$ is set as a constant; the Gaussian filter function [29] is given in Equation (8).
The binary motion image is then obtained by point-by-point binarization against the $T(x, y)$ value, as in Formula (9):

$$M(x, y) = \begin{cases} 1, & p(x, y) > T(x, y) \\ 0, & \text{otherwise} \end{cases}$$

where $p(x, y)$ is the pixel value of $(x, y)$ in the RGB optical flow image, and $M$ is the binary image of the image $P$.
- (3)
Analyzing the connected components of $M$ [30] yields the motion feature bounding box set $B_m$. Connected component analysis finds continuous subsets of pixels in $M$ and marks them; these marked subsets constitute the motion box set $B_m$ of the objects. A code sketch covering steps (1)-(3) is given after this list.
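The sketch below combines steps (1)-(3): the flow field, assumed to be an H × W × 2 array produced by LiteFlowNet3, is visualized as an HSV/RGB image, binarized with a local Gaussian adaptive threshold, and decomposed into connected components whose bounding rectangles form $B_m$. The block size, the constant C, and the minimum component area are assumptions.

```python
# Hedged sketch of steps (1)-(3): flow visualization, local thresholding, components.
import cv2
import numpy as np

def flow_to_rgb(flow):
    """Step (1): map flow direction to hue and flow magnitude to brightness."""
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((*flow.shape[:2], 3), dtype=np.uint8)
    hsv[..., 0] = (ang * 180 / np.pi / 2).astype(np.uint8)   # hue encodes direction
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)

def motion_boxes(flow, block_size=31, C=5, min_area=25):
    """Steps (2)-(3): adaptive Gaussian threshold, then connected components -> B_m."""
    gray = cv2.cvtColor(flow_to_rgb(flow), cv2.COLOR_BGR2GRAY)
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, block_size, -C)
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    boxes = []
    for i in range(1, n):                                    # label 0 is the background
        x, y, bw, bh, area = stats[i]
        if area >= min_area:
            boxes.append((x, y, x + bw, y + bh))
    return boxes
```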
The static bounding box set $B_s$ generated by YOLOv4 is merged with the motion box set $B_m$ generated based on LiteFlowNet3 by comparing the IoU of static and motion boxes that lie within a certain distance of each other. The static and dynamic bounding boxes are mainly weighted and fused by Formula (10), where box corresponds to the bounding box coordinates and $\sigma^2$ represents the variance.

In addition, we carry out a statistical analysis of the speed of object movement between video frames. The size of pedestrians is 50 px × 50 px (px is pixels), and the size of cyclists is 70 px × 70 px. Because there are far more pedestrians than cyclists in the data, this paper takes the size of pedestrians as the standard. The moving speed between video frames, namely the inter-frame pixel offset of pedestrians and cyclists, is 1 px and 3 px, respectively. Moreover, the moving bounding box usually exceeds the static bounding box, so this paper retains an error margin of about five frames. Therefore, this paper sets the distance r to 55 px ± 10 px according to the object pixel size and its inter-frame pixel offset. A value range of the IoU is also set; when the IoU falls within this range and the weight takes the selected value, the recognition effect is best. The pseudocode of the fusion algorithm is shown in Algorithm 1.
Algorithm 1: A weighted fusion algorithm for static and dynamic bounding boxes.

1: for each static box b_s in B_s do
2:   for each motion box b_m in B_m do
3:     d ← Dist(b_s, b_m)    // Dist is the distance calculation function
4:     if d < r then
5:       if IoU(b_s, b_m) is within the set value range then
6:         b_f ← Fuse(b_s, b_m)    // Fuse is the weighted fusion area function, Formula (10)
7:         add b_f to the fused detection set
8:       end if
9:       if IoU(b_s, b_m) = 0 then
10:        keep b_s and b_m as separate detections
11:      end if
12:      if b_m has been fused then
13:        remove b_m from B_m
14:      else
15:        keep b_m as a candidate detection
16:        remove b_m from B_m
17:      end if
18:    end if
19:  end for
20: end for
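A Python sketch of the fusion loop described by Algorithm 1 is given below. The search radius r = 55 px follows the text; the IoU acceptance range, the fixed coordinate weight that stands in for the variance-based weights of Formula (10), and the handling of unmatched boxes are assumptions.

```python
# Hedged sketch of the static/dynamic box fusion loop of Algorithm 1.
import numpy as np

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes, as in the IoU sketch above."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def center_distance(a, b):
    """Euclidean distance between box centers (the Dist function in Algorithm 1)."""
    return float(np.hypot((a[0] + a[2] - b[0] - b[2]) / 2,
                          (a[1] + a[3] - b[1] - b[3]) / 2))

def fuse_boxes(static_boxes, motion_boxes, r=55, iou_low=0.3, iou_high=1.0, w=0.5):
    """Weighted fusion of B_s and B_m; unmatched boxes are kept unchanged."""
    fused, used = [], set()
    for bs in static_boxes:
        best = None
        for j, bm in enumerate(motion_boxes):
            if j in used or center_distance(bs, bm) >= r:
                continue
            overlap = iou(bs, bm)
            if iou_low <= overlap <= iou_high and (best is None or overlap > best[0]):
                best = (overlap, j)
        if best is None:
            fused.append(bs)                      # no matching motion box: keep b_s
        else:
            used.add(best[1])
            bm = motion_boxes[best[1]]
            # Fixed-weight coordinate average standing in for Formula (10).
            fused.append(tuple(w * s + (1 - w) * m for s, m in zip(bs, bm)))
    # Motion boxes without a static counterpart are kept to recover missed detections.
    fused.extend(bm for j, bm in enumerate(motion_boxes) if j not in used)
    return fused
```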