1. Introduction
According to surveys, around 1.35 million fatalities and 50 million injuries occur worldwide each year due to car accidents [1,2]. The quality of vehicle hardware is undoubtedly the foundation of safe driving. However, now that the technology for producing vehicle hardware is mature and reliable, driver inexperience and inattention have become the leading causes of car accidents [3]. Automated driving, by contrast, is not subject to inattention, fatigued driving, drunk driving, or operating errors. The development of autonomous driving to reduce collisions has therefore become a focus of discussion in many fields. As one of the core hardware components for realizing autonomous driving, vehicle sensors can provide rich environment-sensing information, much like the human sensory system [4], and can thus be used to reduce traffic accidents caused by drivers. Radar, lidar, and cameras, the sensors most widely used in self-driving cars, are often combined to provide more comprehensive target information and more accurate detection results than any single sensor [5]. Mainstream approaches combine information from radar and cameras, lidar and cameras, or all three [6]. Lidar has two significant drawbacks: it costs more than other sensors, and it has low reliability in complex environments [7]. Even though the price of lidar has trended downward, it remains susceptible to factors such as lighting conditions, extreme weather, and electromagnetic disturbance [8,9,10]. Cameras enable object detection and semantic understanding, which are impossible with lidar alone; for instance, camera data can be used to identify traffic signs and gestures [11,12]. Radar can accurately measure the range, direction, and velocity of an object [13,14], which are challenging to obtain with a camera. Approaches that integrate radar and camera data have therefore become prevalent in both research and commercial applications [15].
In decision-level information fusion, the regions of interest (ROIs) of different objects often overlap, which may lead to false alarms. In [16], when the radar ROIs of the same target overlap, the authors take the largest box as the preferred one and then compute the average ROI from two ROIs with similar sizes and bottom edges. However, this method only addresses cases where the ROIs of targets detected by the same sensor overlap. Additionally, radar often produces false detections, so that irrelevant targets are treated as valid [17]. To eliminate spurious detections, the density-based spatial clustering of applications with noise (DBSCAN) method has been employed to group radar point clouds [18]. Experiments have shown that many radar misdetections can be eliminated with this additional clustering step.
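The clutter-rejection idea behind the DBSCAN step can be illustrated with a minimal density filter that applies DBSCAN's core noise-rejection rule: a point with too few neighbours within a given radius is treated as clutter. The coordinates, `eps`, and `min_samples` values below are purely illustrative, not parameters from the cited work.

```python
import numpy as np

def density_filter(points, eps=0.5, min_samples=3):
    """Keep points with at least `min_samples` neighbours (including
    themselves) within radius `eps` -- DBSCAN's noise-rejection rule."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    counts = (dists <= eps).sum(axis=1)     # each point counts itself
    return points[counts >= min_samples]

# Four tightly grouped returns (a real target) plus one isolated return
# (likely clutter), in range/cross-range metres.
detections = np.array([
    [10.1, 2.0], [10.3, 2.1], [10.2, 1.9], [10.0, 2.2],
    [35.0, -7.0],
])
kept = density_filter(detections)
print(len(kept))   # -> 4, the isolated return is discarded
```

Full DBSCAN additionally links neighbouring core points into clusters, so each surviving group can then be reported as a single radar target.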
The visual target detection components of the above fusion algorithms use traditional hand-designed features. This approach usually requires a priori knowledge to recognize targets on the road [19]. Nevertheless, the complexity and variability of real scenarios make target detection based on a priori appearance models challenging. Unlike traditional manual feature design, deep learning methods automatically learn target features, unify feature extraction and classifier learning in a single framework, and are capable of end-to-end learning. They have therefore received extensive attention in recent years [20,21,22].
The visual detection component in [23] uses Faster R-CNN [24] to detect targets. Although this method achieves high accuracy without prior knowledge, the two-stage detection algorithm is too slow for practical engineering applications. In [25], an object detection technique combining millimeter-wave radar and optical sensors is introduced, in which the YOLOv4 algorithm provides real-time object detection. However, the architecture of YOLOv4 is intricate and its model footprint is substantial, so it still places high demands on computing capability and device storage resources.
Compared to single-sensor target detection, all of the fusion methods mentioned above achieve better detection performance. However, they still suffer from low detection accuracy, insufficient system robustness, and high computational cost. In particular, when the ROIs of different targets detected by different sensors overlap, decision-level information fusion methods suffer from a low target matching success rate and imprecise multi-target detection accuracy. To address these problems, this paper proposes a robust target detection algorithm based on the fusion of frequency-modulated continuous wave (FMCW) radar and a monocular camera.
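Decision-level association commonly scores the overlap between a radar-projected ROI and a camera bounding box with intersection-over-union (IoU), matching a pair when the score exceeds a threshold. The sketch below uses hypothetical box coordinates; the exact matching rule in the methods discussed above may differ.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

radar_roi  = (100, 120, 200, 220)   # hypothetical radar-projected ROI (pixels)
camera_box = (110, 130, 210, 230)   # hypothetical camera detection box
print(round(iou(radar_roi, camera_box), 3))   # -> 0.681
```

When several such overlaps occur between neighbouring targets, a simple greedy threshold can pick the wrong pair, which is precisely the matching ambiguity the lane-based grouping in this paper aims to reduce.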
Firstly, the method employs a reliable lane marker-based detection algorithm [26] to process the image and extract lane information. Then, the two-dimensional fast Fourier transform (2D-FFT), constant false alarm rate (CFAR) detection, angle-FFT, and DBSCAN algorithms are applied to pre-process the radar measurements. Valid targets are then selected from the radar detections by excluding null targets and applying lane-boundary thresholds. Next, YOLOv5 [27] is used to extract targets from the visual data. The targets identified as valid by both vision and radar are subsequently assigned to their respective lanes. Finally, data from the radar and camera are fused within the same lane by aligning their spatial and temporal dimensions, which is achieved through camera calibration, coordinate-system transformation, and synchronization of the sensor sampling instants. The experimental results indicate that this technique not only alleviates the constraints of relying on a single sensor for sensing, but also improves the precision of information fusion. The primary contributions of the proposed methodology can be summarized as follows:
(1) We introduce a lane detection algorithm to filter out interfering targets outside the lanes, reducing the amount of data processing in the subsequent fusion algorithm;
(2) The positions of the detected lane lines are used as the effective detection ranges of targets instead of the traditional manual setting method, which improves the universality of the algorithm;
(3) We use the YOLOv5s model to detect targets in images and fuse information from the different sensors only for targets assigned to the same lane. This reduces the mutual interference between targets in neighboring lanes, improving the fusion algorithm’s accuracy.
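The spatial-alignment and lane-assignment steps outlined above can be sketched as follows: a radar detection is projected into the image plane through an assumed pinhole camera model, and targets are bucketed into lanes by lateral offset. The intrinsic matrix, the radar-to-camera rotation, and the lane-boundary positions are all hypothetical values for illustration, not calibration results from this work.

```python
import numpy as np
from bisect import bisect_right

# Hypothetical pinhole intrinsics (fx = fy = 800 px, principal point at 640, 360).
K = np.array([[800.0,   0.0, 640.0],
              [  0.0, 800.0, 360.0],
              [  0.0,   0.0,   1.0]])
# Assumed extrinsics: radar frame (x forward, y left, z up) to camera
# frame (x right, y down, z forward), with zero translation for simplicity.
R = np.array([[0.0, -1.0,  0.0],
              [0.0,  0.0, -1.0],
              [1.0,  0.0,  0.0]])

def radar_to_pixel(p_radar):
    """Project a 3-D radar point into image pixel coordinates."""
    p_cam = R @ p_radar                # coordinate-system transformation
    u, v, w = K @ p_cam                # pinhole projection
    return float(u / w), float(v / w)

def assign_lane(y_lateral, boundaries):
    """Lane index for a target's lateral offset (metres, sorted boundaries);
    -1 means the target lies outside the detected lanes and is discarded."""
    if not boundaries[0] <= y_lateral <= boundaries[-1]:
        return -1
    return bisect_right(boundaries, y_lateral) - 1

lane_edges = [-5.25, -1.75, 1.75, 5.25]      # three hypothetical lanes
target = np.array([20.0, 2.0, 0.0])          # 20 m ahead, 2 m to the left
print(radar_to_pixel(target))                # -> (560.0, 360.0)
print(assign_lane(target[1], lane_edges))    # -> 2
```

With both sensors' targets expressed in image coordinates and tagged with a lane index, fusion and matching can then be restricted to detections sharing the same lane, which is the grouping used in contribution (3).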
The remainder of this paper is organized as follows. Section 2 describes the information fusion framework. Section 3 presents the critical steps and algorithms for decision-level information fusion. Section 4 validates the proposed algorithm with data collected from realistic environments. Section 5 draws conclusions from the experimental results.