Occluded Pedestrian Detection Techniques by Deformable Attention-Guided Network (DAGN)
Abstract
1. Introduction
- First, we have designed a deformable convolution with attention module (DCAM) that generates an attention feature map corresponding to the deformable receptive field. Deformable convolution enables the network to adapt to the diverse poses of pedestrians and to occluded instances, while a non-local (NL) attention block captures effective contextual dependency information among different positions.
- Second, we have optimized the detection localization by using an improved loss function. The traditional smooth-L1 loss has been replaced with the complete-IoU (CIoU) loss [13] for regression. Regression with the CIoU loss, instead of the commonly used ℓn-norm, facilitates prediction with more accurate localization, as shown in Figure 2.
- Third, effective techniques for pedestrian detection in diverse traffic scenes have been explored in our work. Distance-IoU-based (DIoU) NMS was adopted to refine the prediction boxes, improving the detection performance on occluded instances. Preprocessing with adaptive local tone mapping (ALTM) based on the Retinex [15] algorithm was implemented to enhance detection accuracy under poor illumination.
- Finally, experiments on three well-known traffic-scene pedestrian benchmarks, the Caltech [8], CityPersons [9], and EuroCity Persons (ECP) [14] datasets, demonstrated that the proposed method leads to notable improvement in the detection of heavily occluded pedestrians. Compared with the best published results, our proposed method achieved significant improvements of 12.44%, 5.3%, and 5.0% in miss rate (MR) on the heavily occluded sets of the Caltech [8], CityPersons [9], and ECP [14] datasets, respectively.
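As a rough illustration of the NL attention idea used inside the DCAM, the snippet below implements a toy, embedding-free non-local block over a flattened N×C feature matrix. This is a sketch under our own simplifications (the paper's block uses learned projections and sits alongside deformable convolution; the function name and shapes are our choices, not the authors' code):

```python
import numpy as np

def nonlocal_attention(x):
    """Toy non-local attention over an N x C feature matrix: each position is
    augmented with a similarity-weighted sum of all positions, capturing
    long-range contextual dependencies across the feature map."""
    # Pairwise similarities between all positions (N x N).
    sim = x @ x.T
    # Row-wise softmax turns similarities into attention weights.
    w = np.exp(sim - sim.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    # Aggregate features from every position; the residual
    # connection preserves the original signal.
    return x + w @ x
```

In a real detector the N positions would be the H×W locations of a convolutional feature map, flattened before attention and reshaped afterwards.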
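The DIoU-based NMS refinement can be sketched as a greedy loop that suppresses a box only when its DIoU (IoU minus the normalized center-distance penalty) with a kept box exceeds the threshold. This is our own minimal rendering of the idea, not the paper's code; function names and the threshold default are illustrative:

```python
def diou(a, b):
    """Distance-IoU between two (x1, y1, x2, y2) boxes: IoU minus the squared
    center distance normalized by the squared enclosing-box diagonal."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4.0
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    return iou - rho2 / (cw ** 2 + ch ** 2)

def diou_nms(boxes, scores, threshold=0.5):
    """Greedy NMS that suppresses a candidate only when its DIoU with an
    already kept, higher-scoring box exceeds the threshold. Overlapping boxes
    with distant centers (e.g., occluded pedestrian pairs) are retained."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if diou(boxes[best], boxes[i]) <= threshold]
    return keep
```

Because the center-distance penalty lowers the suppression score for box pairs whose centers are far apart, DIoU-NMS is less likely than plain IoU-based NMS to delete the true box of a pedestrian standing behind another.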
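For the illumination preprocessing, a minimal sketch of the global adaptation step of Retinex-style ALTM [15] is shown below: a logarithmic curve anchored at the scene's log-average luminance compresses the dynamic range. The full ALTM also performs local contrast enhancement, which is omitted here; the function name and the `eps` guard are our own assumptions:

```python
import numpy as np

def altm_global(luminance, eps=1e-6):
    """Global adaptation step of Retinex-based adaptive local tone mapping:
    log-compress a luminance map, normalized so its maximum maps to 1."""
    L = np.asarray(luminance, dtype=np.float64)
    # Log-average ("key") luminance of the scene.
    L_avg = np.exp(np.mean(np.log(L + eps)))
    # Logarithmic compression anchored at the log-average luminance.
    return np.log(L / L_avg + 1.0) / np.log(L.max() / L_avg + 1.0)
```

Applied before inference, such a curve lifts dark regions more than bright ones, which is why it helps detection under poor illumination.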
2. Related Works
2.1. Deep-Learning-Based Pedestrian Detection Methods
2.2. Occluded Pedestrian Detection Methods
2.3. Attention- and Deformable-Convolution-Related Methods
3. Deformable Attention-Guided Network (DAGN)
3.1. Deformable Convolution with Attention Module (DCAM)
3.2. Target Optimization
3.2.1. Loss Function
3.2.2. Non-Maximum Suppression for Prediction
3.3. Illumination Preprocessing for Testing
4. Experimental Results
4.1. Experimental Setup and Evaluation Metrics
4.2. Caltech Pedestrian Dataset
4.2.1. Training Configuration on Caltech Dataset
4.2.2. Ablation Experiments on Caltech Pedestrian Dataset
4.2.3. Comparison with the State-of-the-Art Methods on the Caltech Pedestrian Dataset
4.3. CityPersons Dataset
4.3.1. Training Configuration
4.3.2. Comparison with the State-of-the-Art Methods on CityPersons
4.4. EuroCity Persons (ECP) Dataset
4.4.1. Training Configuration
4.4.2. Comparison with the State-of-the-Art Methods on the ECP Dataset
4.5. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Liu, W.; Liao, S.; Ren, W.; Hu, W.; Yu, Y. High-level semantic feature detection: A new perspective for pedestrian detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5182–5191. [Google Scholar]
- Brazil, G.; Yin, X.; Liu, X. Illuminating Pedestrians via Simultaneous Detection and Segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4960–4969. [Google Scholar]
- Ahn, H.; Keum, B.; Kim, D.; Lee, H.S. Adaptive local tone mapping based on retinex for high dynamic range images. In Proceedings of the IEEE International Conference on Consumer Electronics, Las Vegas, NV, USA, 11–14 January 2013; pp. 153–156. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European conference on computer vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar] [CrossRef] [Green Version]
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
- Drago, F.; Myszkowski, K.; Annen, T.; Chiba, N. Adaptive Logarithmic Mapping for Displaying High Contrast Scenes. Comput. Graph. Forum 2003, 22, 419–426. [Google Scholar] [CrossRef]
- Ma, Y.; Yu, D.; Wu, T.; Wang, H. PaddlePaddle: An Open-Source Deep Learning Platform from Industrial Practice. Front. Data Comput. 2019, 1, 105–115. [Google Scholar] [CrossRef]
- PaddleDetection, v2.0.0-rc0. Available online: https://github.com/PaddlePaddle/PaddleDetection (accessed on 23 February 2021).
- Zhang, S.; Benenson, R.; Omran, M.; Hosang, J.; Schiele, B. How far are we from solving pedestrian detection? In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1259–1267. [Google Scholar]
- Hasan, I.; Liao, S.; Li, J.; Akram, S.U.; Shao, L. Pedestrian detection: The elephant in the room. arXiv 2020, arXiv:2003.08799. [Google Scholar]
- Song, T.; Sun, L.; Xie, D.; Sun, H.; Pu, S. Small-scale pedestrian detection based on topological line localization and temporal feature aggregation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 536–551. [Google Scholar] [CrossRef]
| Dataset | Case | Height of Pedestrian (pixels) | Occlusion Area | Visibility |
|---|---|---|---|---|
| Caltech [8] | Reasonable | >50 | <35% | >0.65 |
| | Heavy occlusion | >50 | 35–80% | 0.2–0.65 |
| | All | >20 | - | >0.2 |
| CityPersons [9] | Reasonable | >50 | - | >0.65 |
| | Heavy occlusion | >50 | - | 0.2–0.65 |
| | Partial occlusion | >50 | - | 0.65–0.9 |
| | Bare | >50 | - | >0.9 |
| | Small | 50–75 | - | >0.65 |
| | Medium | 75–100 | - | >0.65 |
| | Large scale | >100 | - | >0.65 |
| EuroCity Persons [14] | Reasonable | >40 | <40% | - |
| | Small | 30–60 | <40% | - |
| | Occluded | >40 | 40–80% | - |
| | All | >30 | - | - |
| Cascade R-CNN+FPN | DCAM (−8.51) | CIoU Loss (−4.17) | DIoU-NMS (−5.48) | ALTM (−3.92) | MR (%) |
|---|---|---|---|---|---|
| √ | | | | | 55.30 |
| √ | √ | | | | 46.79 |
| √ | √ | √ | | | 42.62 |
| √ | √ | √ | √ | | 37.14 |
| √ | √ | √ | √ | √ | 33.22 |
| Method | Miss Rate R (%) | Miss Rate HO (%) | Miss Rate A (%) | Hardware | Scale | Run Time (s/img) |
|---|---|---|---|---|---|---|
| MS-CNN [5] | 9.54 | 48.60 | 55.77 | Titan GPU | ×1 | 0.067 |
| RPN + BF [4] | 7.28 | 54.60 | 59.88 | Tesla K40 GPU | ×1.5 | 0.5 |
| SDS-RCNN [2] | 6.43 | 38.70 | 56.77 | Titan X GPU | ×1.5 | 0.21 |
| ALF [6] | 6.07 | 50.98 | 59.06 | GTX 1080 Ti GPU | ×1 | 0.05 |
| CSP [1] | 4.54 | 45.81 | 56.94 | GTX 1080 Ti GPU | ×1 | 0.058 |
| Pedestron [42] | 1.48 | 22.11 | 25.48 | NVIDIA Tesla V100 | ×1 | - |
| DAGN (ours) | 6.03 | 33.22 | 46.83 | Titan X GPU | ×1 | 0.11 |
| DAGN++ (ours) | 1.84 | 9.67 | 17.68 | Titan X GPU | ×1 | 0.11 |
Method | Backbone | R | H | P | B | S | M | L | Run Time (s/img) |
---|---|---|---|---|---|---|---|---|---|
Faster R-CNN [18] | VGG-16 | 15.4 | - | - | - | 25.6 | 7.2 | 7.9 | - |
TLL [43] | ResNet-50 | 15.5 | 53.6 | 17.2 | 10.0 | - | - | - | - |
RepLoss [25] | ResNet-50 | 13.2 | 56.9 | 16.8 | 7.6 | - | - | - | - |
OR-CNN [26] | VGG-16 | 12.8 | 55.7 | 15.3 | 6.7 | - | - | - | - |
ALF [6] | ResNet-50 | 12.0 | 51.9 | 11.4 | 8.4 | 19.0 | 5.7 | 6.6 | 0.27 |
CSP [1] | ResNet-50 | 11.0 | 49.3 | 10.4 | 7.3 | 16.0 | 3.7 | 6.5 | 0.33 |
APD [28] | ResNet-50 | 10.6 | 49.8 | 9.5 | 7.1 | - | - | - | 0.12 |
APD [28] | DLA-34 | 8.8 | 46.6 | 8.3 | 5.8 | - | - | - | 0.16 |
Pedestron [42] | HRNet | 7.5 | 33.9 | 5.7 | 6.2 | 8.0 | 3.0 | 4.3 | 0.33 |
DAGN (ours) | ResNet-50 | 11.9 | 43.9 | 12.1 | 7.6 | 18.7 | 5.8 | 5.9 | 0.22 |
DAGN++ (ours) | ResNet-50 | 8.4 | 28.6 | 7.0 | 5.6 | 9.2 | 2.4 | 5.6 | 0.22 |
Method | Reasonable | Small | Occluded | All |
---|---|---|---|---|
SSD [16] | 13.1 | 23.5 | 46.0 | 29.6 |
Faster R-CNN [18] | 10.1 | 19.6 | 38.1 | 25.1 |
YOLOv3 [17] | 9.7 | 18.6 | 40.1 | 24.2 |
Cascade R-CNN [31] | 6.6 | 13.6 | 31.3 | 19.3 |
DAGN (ours) | 5.9 | 14.2 | 26.3 | 17.5 |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Xie, H.; Zheng, W.; Shin, H. Occluded Pedestrian Detection Techniques by Deformable Attention-Guided Network (DAGN). Appl. Sci. 2021, 11, 6025. https://doi.org/10.3390/app11136025