Article

UAV Multi-Dynamic Target Interception: A Hybrid Intelligent Method Using Deep Reinforcement Learning and Fuzzy Logic

by
Bingze Xia
1,*,
Iraj Mantegh
2 and
Wenfang Xie
1
1
Department of Mechanical, Industrial and Aerospace Engineering, Concordia University, Montreal, QC H3G 1M8, Canada
2
Aerospace Research Centre, National Research Council of Canada, Montreal, QC H3T 2B2, Canada
*
Author to whom correspondence should be addressed.
Drones 2024, 8(6), 226; https://doi.org/10.3390/drones8060226
Submission received: 25 April 2024 / Revised: 22 May 2024 / Accepted: 27 May 2024 / Published: 29 May 2024

Abstract

With the rapid development of Artificial Intelligence, AI-enabled Uncrewed Aerial Vehicles have garnered extensive attention since they offer an accessible and cost-effective solution for executing tasks in unknown or complex environments. However, developing secure and effective AI-based algorithms that empower agents to learn, adapt, and make precise decisions in dynamic situations continues to be an intriguing area of study. This paper proposes a hybrid intelligent control framework that integrates an enhanced Soft Actor–Critic method with a fuzzy inference system, incorporating pre-defined expert experience to streamline the learning process. Additionally, several practical algorithms and approaches within this control system are developed. With the synergy of these innovations, the proposed method achieves effective real-time path planning in unpredictable environments under a model-free setting. Crucially, it addresses two significant challenges in RL: dynamic-environment problems and multi-target problems. Diverse scenarios incorporating actual UAV dynamics were designed and simulated to validate the performance in tracking multiple mobile intruder aircraft. A comprehensive analysis and comparison of methods relying solely on RL and other influencing factors, as well as a controller feasibility assessment for real-world flight tests, are conducted, highlighting the advantages of the proposed hybrid architecture. Overall, this research advances the development of AI-driven approaches for UAV safe autonomous navigation under demanding airspace conditions and provides a viable learning-based control solution for different types of robots.

1. Introduction

Uncrewed Aerial Vehicles (UAVs), with their strong mobility and high flexibility, offer a cost-effective option for many tasks such as agricultural inspections [1] and the exploration of hazardous areas [2], enhancing task accessibility and efficiency. Nowadays, the rapid advancement in Artificial Intelligence (AI) and a corresponding exponential increase in computing power have unlocked new possibilities [3]. This synergy enables UAVs equipped with AI abilities to perform complex tasks such as real-time path planning and obstacle avoidance more adeptly than traditional models reliant on pre-programmed algorithms [4,5]. Unlike their predecessors, these AI-driven UAVs can improve over time with extended training and repeated trials, acquiring a generalized capability to navigate safely and efficiently even in unknown or dynamic environments based on sensor data. This innovative direction not only expands the range of tasks that UAVs can undertake but also enhances their safety, opening up a new dimension in various sectors of human activity by leveraging the evolving capabilities of intelligent systems.
As a pivotal branch of AI, reinforcement learning (RL) represents a distinct approach [6], focusing on how agents should take actions in an environment to maximize the cumulative reward. Unlike other machine learning methods, such as classification models [7,8] or clustering models [9,10], that typically involve learning from a fixed dataset, RL is about learning from interaction with an environment, making decisions, evaluating the outcomes, and learning from successes and failures.
RL has been widely integrated into the field of UAV autonomous navigation, including missions such as target tracking [11], swarm control [12], collision avoidance [13,14], etc. Different RL methods are suitable for different tasks. Q-Learning [15] and Deep Q-Networks (DQNs) [16] are two classic value-based RL methods; they are appropriate for tasks with discrete action spaces [17,18]. In contrast, policy-based methods [19] directly learn the policy that maps states to actions without requiring a value function, which is particularly effective for problems with high-dimensional or continuous action spaces. However, policy-based methods may encounter stability and convergence issues, hindering the optimization process and potentially resulting in suboptimal policies. Combining the strengths of value-based and policy-based RL, actor–critic methods incorporate two models: one for the policy (actor) and another for the value function (critic). These are suitable for complex tasks that benefit from the stability of value-based methods and the flexibility of policy-based approaches.
Within the actor–critic framework, neural networks are typically employed for learning and decision-making, which is referred to as deep reinforcement learning (DRL). Among various approaches, the Deep Deterministic Policy Gradient (DDPG) method employs experience replay and target networks [20] and can effectively learn a deterministic policy. In [21], Li et al. introduced a vision-based DDPG algorithm to enable UAVs to perform obstacle avoidance and destination tracking. This algorithm empowers the RL agent to assess the sizes of and distances between obstacles approximately, facilitating the generation of a collision-free path in real time. However, large Q estimates during the learning process might hinder the ability to find the optimal policy. Twin Delayed DDPG (TD3) [22] improves on this issue by employing two separate Q-networks and using the smaller Q-value to calculate gradients. Additionally, TD3 typically updates its policy after the value function is updated several times. This delayed update strategy helps prevent inaccuracies in value function estimation, thereby increasing the stability of the algorithm. Our previous work [23] demonstrated that an improved TD3 method enabled UAVs, in both single-UAV and multi-UAV settings, to avoid obstacles and reach destinations in evolving contexts.
In addition, it is important to note that deterministic policy-based methods may lose effectiveness in cases filled with uncertainties. The Proximal Policy Optimization (PPO) method [24] introduces randomness into action selection by computing a probability distribution, promoting exploration and robustness in unpredictable environments. This relatively simple setting makes PPO more suitable for discrete action space problems [25], as the balance between exploration (trying new actions) and exploitation (utilizing known actions) is generally easier to achieve. In continuous spaces, gradient estimation can become more challenging due to the continuity and high dimensionality of actions. The Soft Actor–Critic (SAC) method is specifically designed for such cases, employing an entropy regularization process that improves the robustness against noise and optimizes the randomness (entropy) of the policy. Moreover, high sample efficiency is essential in real-world UAV applications where data collection can be time-consuming and costly. SAC can learn effective policies with fewer interactions and directly outputs the parameters of the action’s probability distribution in the policy network [26], which leads to smoother and more adaptable control in autonomous navigation.
Although DRL holds significant advantages in tackling complex decision-making and control problems, such as its capacity for autonomous learning and ongoing performance enhancement, processing of high-dimensional data, and adaptation to changes in the environment, its algorithmic design inevitably leads to several challenges. These include low sample efficiency [27] (as DRL typically requires a large amount of data samples to learn effective policies, which can be a limiting factor in real-world applications), issues with training stability and convergence [28], limited generalization capability [29], and high computational resource demands [30]. To address these issues, recent research has explored the integration of other methods to offset these weaknesses. For example, ref. [31] proposes a combination of DRL with Adaptive Control (AC) for a quadrotor landing task. A DRL strategy was selected for the outer loop to maintain optimality for the dynamics, while AC was utilized in the inner loop for real-time adjustment of the closed-loop dynamics towards a stable trajectory delineated by the RL model. In [32], the authors utilized the DDPG method to automate PID system tuning, simplifying mobile robot control by eliminating manual adjustment of multiple parameters. The deep neural network (NN) based agent also removes the need for state discretization. Furthermore, ref. [33] used a novel Convolutional Neural Network (CNN) approach to train the object recognition capabilities of the onboard RGB-D (red-green-blue-depth) camera of a UAV and incorporated the results and relative distances as inputs to a Markov Decision Process (MDP) based system, achieving effective obstacle avoidance and path planning.
The aforementioned integrated methods capitalize on the advantages of DRL while mitigating its limitations, enhancing system performance. However, these methods still carry their own inherent drawbacks. For instance, adaptive control systems, despite their capability for online parameter estimation and adjustment, require precise mathematical models, which are impractical to construct for tasks performed in unknown environments. Similarly, while simple and effective, PID control may lack the flexibility to handle nonlinear systems or adapt to complex dynamics. Other machine learning algorithms, such as supervised learning, typically require extensive data for training and may not be directly applicable to real-time control challenges.
In summary, to achieve dynamic path planning for UAVs in uncharted scenarios, the learning and self-improvement capabilities of DRL are essential, but they must be complemented and augmented by another effective method, thereby creating a synergy that exceeds the sum of its parts. After an in-depth process of literature review, simulation, and analysis, the fuzzy inference system (FIS) was identified as an excellent option. Unlike the methods discussed above, fuzzy logic offers an intuitive, flexible, and computationally efficient solution [34]. Particularly when integrated with DRL, fuzzy logic can provide auxiliary decision-making support, reduce unnecessary exploration, accelerate the learning process, and enhance the system’s transparency and interpretability. Further details will be elaborated on in the next section.
The remainder of this article is organized as follows: Section 2 outlines the formulation of the problem, reviews recent related literature, and highlights the main contributions. Section 3 elaborates on the proposed hybrid control framework and the other associated algorithms involved in the system. Additionally, it presents the model design for the UAVs and the onboard sensors. Section 4 illustrates the numerical and simulation results, including a comparison between the performance of the proposed method and an RL-only approach. An analysis of the feasibility of the control system applied to physical drones is also included. Lastly, discussions and concluding remarks are provided in Section 5.

2. Problem Definition

This paper explores the challenges of dynamic settings through tracking cases of multiple mobile intruder aircraft. The focus is on developing a versatile control framework capable of processing high-dimensional data to accurately navigate toward and intercept moving targets. It utilizes a simulated three-dimensional environment, incorporating flight dynamics, sensor capabilities, and factors such as wind and gravity, to provide a rigorous testing ground for the methodologies.
The complexity of this problem is multi-fold: firstly, the intruder aerial platforms exhibit varied speed and movement patterns, necessitating adaptive tracking capabilities. Secondly, the presence of multiple intruder aircraft introduces additional challenges for an RL problem. Also, each nontargeted platform becomes a potential mobile obstacle, complicating the tracking process. Lastly, the operation is time-constrained, emphasizing the need for efficiency and prompt action.
Improving UAV dynamic tracking has significant practical implications. Enhanced techniques enable drones to autonomously detect and respond to unauthorized activities over large areas, revolutionizing security and surveillance and reducing the need for human intervention [35]. Additionally, in search-and-rescue missions [36], AI-driven UAVs with advanced navigation systems can drastically reduce the time to locate individuals in danger, potentially saving lives.
This study seeks to advance the field of UAV safe autonomous navigation, offering insights and solutions for real-world applications. The characteristics of fuzzy logic-based approaches are discussed in the following subsection, and the main contributions of this paper are also detailed in this section.

2.1. Related Prior Work

Fuzzy logic (FL) is a soft computing method that, diverging from conventional binary logic methods that strictly categorize statements as true or false, introduces a more nuanced approach by handling imprecision and accommodating degrees of truthfulness. Such a capability is practical in problems where information is ambiguous or incomplete. As a result, fuzzy logic finds extensive applications across multiple domains. It is instrumental in developing control systems that require adaptive response mechanisms [37], enhancing decision-making processes by accounting for uncertainties [38], and improving pattern recognition techniques [39] by allowing for flexible criteria.
Within the domain of path planning and obstacle avoidance, algorithms developed based on fuzzy logic exploit input variables from sensors or data from machine learning models to generate continuous outputs. Hu et al. [40] proposed a fuzzy multi-objective path-planning method based on distributed predictive control for cooperative searching and tracking ground-moving targets by a UAV swarm in urban environments. This approach integrates an extended Kalman filter and probability estimation to predict the states of the ground targets, with path planning that accounts for obstructed view and energy consumption. Then, fuzzy logic is employed to prioritize objectives based on their importance levels. Berisha et al. [39] demonstrated a fuzzy controller utilizing readings from a stereo camera and range finders as input and generating two control commands for the robot’s wheels, enabling flexible collision avoidance, as their controller can react dynamically in testing. Our team introduced a Fuzzy-Kinodynamic RRT method [41], which employs the traditional rapidly exploring Random Tree algorithm for global path planning and presents a set of heuristic fuzzy rules for de-confliction in 3D spaces. The velocity obstacle (VO) approaches proposed by [42,43], which are designed especially for evading mobile intruders, offer a constructive foundation for the potential combination with fuzzy logic in the future. These algorithms calculate a VO cone for each moving object, where the cone’s direction and width depend on the obstacle’s position, velocity, and relative velocity to the ownship UAV. FIS can be subsequently employed to ascertain the optimal action (e.g., select an appropriate speed vector that does not intersect with any cones) and facilitate evasion based on the current state of the UAV and the environmental conditions, ensuring the ownship avoids all identified intruders safely.
From these applications, it is found that fuzzy logic can describe and handle uncertainties and vagueness using concepts from natural language [44], such as “large”, “medium”, “small”, etc. This approach closely mirrors how people evaluate situations and make decisions in daily life. Fuzzy logic also enhances system robustness in the face of input data variations or noises by using partial truth values and fuzzy sets. Additionally, FL does not require precise mathematical models, significantly simplifying problem descriptions. However, there are challenges in using FL, such as difficulties in establishing accurate rules for problems requiring precise control or lacking relevant experience. The process of designing membership functions and rules inevitably involves a degree of subjectivity, leading to potential inconsistencies and imperfections in performance. Fortunately, DRL methods can largely mitigate these issues. For instance, in our scenario, since the environment is unknown, it is insufficient to navigate a UAV to the target while avoiding potential obstacles by merely establishing a fuzzy logic system. The integration of DRL fills this gap effectively, enabling the drone to learn and find the optimal path gradually and autonomously. In the same way, the drone does not need to start learning from scratch. We can impart some basic movement or de-confliction rules to it through a fuzzy inference system, helping it avoid taking some incorrect or illogical actions in training.

2.2. Contributions

Based on the discussions, we developed a hybrid control scheme that utilizes DRL and FIS. By leveraging the capability of DRL to perform beyond expectations after training and the ability of fuzzy logic to incorporate specific expert experience and human cognition, the integration of these two methods allows for mutual enhancement and compensation. The main contributions are summarized as follows:
  • A hybrid intelligent controller is developed for a six-degree-of-freedom underactuated quadrotor to autonomously navigate and intercept multiple intruder aircraft in three-dimensional spaces within a limited time. This approach utilizes innovative learning-based interaction mechanisms between SAC and FIS to improve learning efficiency and enable real-time path planning in unpredictable environments under a model-free context.
  • An innovative target selection algorithm and a refined approach to handling the observation space are established to tackle the RL multi-target challenge. The algorithm effectively mitigates the exponential growth of state–action pairs and prevents reward signal interference by focusing on one target at a time, thereby addressing a common cause of failure in RL multi-target problems: the ’Agent Confusion’ phenomenon caused by task repetition.
  • A creative reset function is developed to enhance the generalization capabilities of the trained agent. This function regenerates the states of the ownship and the intruders at the start of each training episode. It allows the agent to adapt to new scenarios without re-training, increasing utility and effectiveness.
  • A practical reward function is designed, with exponential functions introduced in each component to address the challenge of sparse rewards in multi-target RL cases. This approach ensures high sensitivity to slight state changes and provides proportional adjustment of each component based on the importance of different objectives (e.g., target tracking and speed control).
  • A method involving nonuniform discretization of the observation space is applied to reduce the dimensionality of the state space and improve training efficiency. For each observation variable, larger-step discretization is used in ranges where sensitivity to state changes is low, while more precise discretization is applied in sensitive ranges, reducing training time and minimizing the risk of failing to converge.

3. Methodology

The use cases in this study involve scenarios with one ownship UAV and multiple intruder aircraft, where the objective of the ownship is to intercept each intruder while avoiding collisions during the flights.
Figure 1 illustrates the architecture of the proposed hybrid control system. The system leverages both DRL and fuzzy logic models to propel the ownship, processing its state information at each timestep for the observation space and reward functions. Concurrently, intruder aircraft follow distinct, predefined trajectories using PID controllers, outputting state data on position and velocities.
To address the multiple dynamic target interception problem, a suite of additional algorithms was developed in the SAC-FIS framework, which encompasses a target selection algorithm, methods to assess successful target interception, criteria for episode termination, and a reset function, among others. The rest of this section will provide a comprehensive breakdown of each component within this system.

3.1. The RL Agent Design

For the proposed case, the state–action relationship is multifaceted and unpredictable. The RL agent’s neural network design, which includes layers of varying densities and specific branch structures to compute mean and standard deviation (SD) separately, effectively encodes the uncertainties. Moreover, a concatenationLayer is utilized within the critic network to merge action and state information, enhancing the understanding of their dependencies and improving value estimations. Key training parameters (e.g., learning rate, entropy regularization coefficient, and optimization algorithm) have been fine-tuned through extensive testing, refining the trade-off between exploration and exploitation and enhancing adaptability. The specific network design is as follows:
  • Critic Network Architecture
The critic networks are defined as $Q_i(s, a; \Phi_i)$, $i \in \{1, 2\}$, where $\Phi_i$ includes layers for the state ($s$) and action ($a$) paths:
  • State Path (Spath) Layers:
    InputLayer(numObservations) → FullyConnected(256) → ReLU → FullyConnected(256) → ReLU
  • Action Path (Apath) Layers:
    InputLayer(numActions) → FullyConnected(256)
  • Concatenation and Subsequent Layers:
    Concat(S_path, A_path) → FullyConnected(128) → ReLU → FullyConnected(128) → ReLU → FullyConnected(64) → ReLU → FullyConnected(1)
  • Actor Network Architecture
The actor network is defined as $\pi(s; \Theta)$, where $\Theta$ includes the following layers:
InputLayer(numObservations) → FullyConnected(256) → ReLU → FullyConnected(256) → ReLU → FullyConnected(128) → ReLU → {Mean Path: FullyConnected(64) → ReLU → FullyConnected(numActions); SD Path: FullyConnected(numActions) → ReLU → Softplus}.
Based on the described neural network designs, the actor network π ( s ; Θ ) processes the input state s to generate the parameters of the action distribution, specifically the mean and standard deviation. This enables the agent to select actions in a continuous action space. Meanwhile, the critic networks, parameterized by Φ 1 and Φ 2 , estimate the value of action–state pairs, guiding the agent towards actions that maximize the expected return. Prior to the commencement of the training process, it is imperative to initialize the following parameters.
  • Initial entropy coefficient: $\alpha = 1$
  • Target entropy: TargetEntropy = −numActions
  • Length of the replay buffer B: $1 \times 10^{6}$ transitions
  • Learning rates: $\lambda_Q = \lambda_\pi = 1 \times 10^{-4}$
  • Discount factor: $\gamma = 0.99$
  • Target smoothing coefficient: $\tau = 1 \times 10^{-3}$
  • Mini-batch size: 256
  • Number of warm start steps: 1000
  • Optimizer algorithm for both actor and critics: Adam
  • Gradient threshold: 1
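The authors implement the agent in MATLAB (the simulation platform described in Section 4). For readers who prefer a concrete picture of the layer structure and initialization listed above, the following is a minimal PyTorch sketch; the class names, the action sampling at the bottom, and the tanh squashing are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch of the critic and actor architectures described above (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Critic(nn.Module):
    def __init__(self, num_obs: int, num_actions: int):
        super().__init__()
        # State path: FC(256) -> ReLU -> FC(256) -> ReLU
        self.s1 = nn.Linear(num_obs, 256)
        self.s2 = nn.Linear(256, 256)
        # Action path: single FC(256)
        self.a1 = nn.Linear(num_actions, 256)
        # After concatenation: FC(128) -> FC(128) -> FC(64) -> FC(1)
        self.c1 = nn.Linear(512, 128)
        self.c2 = nn.Linear(128, 128)
        self.c3 = nn.Linear(128, 64)
        self.q = nn.Linear(64, 1)

    def forward(self, s, a):
        hs = F.relu(self.s2(F.relu(self.s1(s))))
        ha = self.a1(a)
        h = torch.cat([hs, ha], dim=-1)      # merge state and action information
        h = F.relu(self.c3(F.relu(self.c2(F.relu(self.c1(h))))))
        return self.q(h)

class Actor(nn.Module):
    def __init__(self, num_obs: int, num_actions: int):
        super().__init__()
        self.f1 = nn.Linear(num_obs, 256)
        self.f2 = nn.Linear(256, 256)
        self.f3 = nn.Linear(256, 128)
        self.m1 = nn.Linear(128, 64)          # mean path: FC(64) -> ReLU -> FC(num_actions)
        self.mean = nn.Linear(64, num_actions)
        self.sd = nn.Linear(128, num_actions) # SD path: FC(num_actions) -> ReLU -> Softplus

    def forward(self, s):
        h = F.relu(self.f3(F.relu(self.f2(F.relu(self.f1(s))))))
        mu = self.mean(F.relu(self.m1(h)))
        sigma = F.softplus(F.relu(self.sd(h)))  # following the text literally: ReLU then Softplus
        return mu, sigma

# Two critics (clipped double-Q) and one stochastic actor, 28 observations and 4 actions.
num_obs, num_actions = 28, 4
critics = [Critic(num_obs, num_actions) for _ in range(2)]
actor = Actor(num_obs, num_actions)

# Sampling an action from the learned distribution (squashing is an assumption):
obs = torch.randn(1, num_obs)
mu, sigma = actor(obs)
action = torch.tanh(torch.distributions.Normal(mu, sigma).rsample())
```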
A general training procedure for the proposed SAC-FIS method is delineated in Algorithm 1. Details regarding the specific control tasks raised in this paper, including the RL environment and the FIS, are discussed in the following subsections.

3.2. RL Environment Design

In this segment, the UAV and sensor model, along with the functions of the RL environment and other related algorithm designs, will be detailed.

3.2.1. The UAV and Sensor Model Design

The UAV platform selected in this paper for both ownship and intruder aircraft is the “+” type quadrotor, assumed to be center symmetric. Although this study employs a model-free SAC method, meaning the algorithm learns the optimal policy through interaction with the environment without relying on a predefined dynamic model of the environment, understanding the UAV model (including both dynamic and kinematic models) as referenced from [45,46] is still beneficial for effective training and control, as it provides critical physical constraints. Based on the rotation matrix, the dynamic model equations of this type of quadrotor are as shown in Equation (1). Additionally, a 3D LiDAR sensor is mounted on the ownship. The sensor’s azimuth ($\theta_{az}$) and elevation ($\theta_{el}$) limits are defined as $\theta_{az} \in [-179^{\circ}, 179^{\circ}]$ and $\theta_{el} \in [-15^{\circ}, 15^{\circ}]$, respectively, with a maximum detection range of 20 m. Each intruder UAV executes reciprocating flights at a specified speed along a predefined route.
Algorithm 1 The hybrid intelligent SAC-FIS control framework.
 1: Analyze the dynamic model of the controlled robot, and then determine the specific control variables (outputs) governed by the FIS based on the specific control tasks and related requirements.
 2: Define input variables of the FIS that can effectively incorporate certain universal expert experiences without compromising the robot’s maneuverability.
 3: Define membership functions for each input variable, drawing from both extensive debugging results and thorough data analysis.
 4: Establish a comprehensive set of fuzzy rules that capture expert knowledge and observed data behavior.
 5: Load the FIS.
 6: Initialize actor network $\Theta$ and critic networks $\Phi_1$, $\Phi_2$ with predefined parameters.
 7: Initialize target critic networks $\Phi_{\text{target}_1}$, $\Phi_{\text{target}_2}$ with parameters from $\Phi_1$, $\Phi_2$.
 8: Initialize replay buffer B.
 9: for each iteration do
10:    Observe state S. {S includes both inputs and outputs of the FIS.}
11:    Compute the output of the FIS, $FIS_A$, and, in parallel, select an action from the SAC model, $SAC_A \sim \pi(\cdot \mid S; \Theta)$, using the actor network with distribution parameters.
12:    Concatenate $FIS_A$ and $SAC_A$ to form the complete action A.
13:    Execute action A in the environment; observe reward R, next state $S'$, and termination signal D.
14:    Store transition $(S, A, R, S', D)$ in B.
15:    if B’s size ≥ mini-batch size AND iteration > number of warm start steps then
16:       Sample a mini-batch of transitions $(S_j, A_j, R_j, S'_j, D_j)$ from B.
17:       For each sampled transition, compute the target value $y_j$:
          $y_j = R_j + \gamma (1 - D_j) \min_{i = 1, 2} Q(S'_j, \pi(S'_j; \Theta); \Phi_{\text{target}_i})$
18:       Update critic networks $\Phi_1$, $\Phi_2$ by minimizing the loss:
          $\Phi_i \leftarrow \Phi_i - \lambda_Q \nabla_{\Phi_i} \frac{1}{\text{batch size}} \sum_j \left( Q(S_j, A_j; \Phi_i) - y_j \right)^2$
          {This update process allows the RL agent to incrementally understand and adapt to the logic of the FIS, collaborating with the FIS outputs to optimize the overall decision-making.}
19:       Update actor network $\Theta$ using the policy gradient:
          $\Theta \leftarrow \Theta + \lambda_\pi \nabla_{\Theta} \frac{1}{\text{batch size}} \sum_j \left[ Q(S_j, A_j; \Phi_1) - \alpha \log \pi(A_j \mid S_j; \Theta) \right]$
20:       Adaptively adjust $\alpha$ towards TargetEntropy.
21:       Softly update the target networks:
          $\Phi_{\text{target}_i} \leftarrow \tau \Phi_i + (1 - \tau) \Phi_{\text{target}_i}$, for $i \in \{1, 2\}$
22:    end if
23: end for
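To make the hybrid step of Algorithm 1 concrete, the following is a minimal NumPy sketch of the action composition (steps 10–14) and of the target-value and soft-update rules (steps 17 and 21). The actor and FIS are placeholders, the action ordering and tanh squashing are assumptions, and the actor is assumed here to output only the three RL-controlled channels, with yaw supplied by the FIS as discussed later in Section 3.2.2.

```python
# Sketch of the hybrid SAC-FIS step: FIS commands yaw, SAC samples the other channels,
# and the concatenated action is what gets executed and stored in the replay buffer.
import numpy as np

GAMMA, TAU = 0.99, 1e-3   # discount factor and target smoothing coefficient from Section 3.1

def hybrid_action(state, actor, fis_yaw):
    """Return the full action [roll, pitch, yaw, thrust] (ordering assumed)."""
    mu, sigma = actor(state)                          # distribution parameters from the policy
    roll, pitch, thrust = np.tanh(np.random.normal(mu, sigma))
    yaw = fis_yaw(state)                              # u_psi from the fuzzy inference system
    return np.array([roll, pitch, yaw, thrust])

def td_target(r, done, q_t1, q_t2):
    """Step 17: y_j = R_j + gamma * (1 - D_j) * min_i Q_target_i(S'_j, pi(S'_j))."""
    return r + GAMMA * (1.0 - done) * np.minimum(q_t1, q_t2)

def soft_update(phi_target, phi):
    """Step 21: Phi_target <- tau * Phi + (1 - tau) * Phi_target."""
    return TAU * phi + (1.0 - TAU) * phi_target
```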
As depicted in Figure 2, the axes and sign convention of the selected UAV platform are provided. Also, the LiDAR sensor mounted at the center of the ownship is capable of providing 360-degree environmental detection surrounding the ownship, encompassing both obstacles and intruder aircraft. The 3D point cloud generated by the LiDAR will be sampled and used as the input for FIS; the detailed methods and discussion are elaborated in Section 3.3.
$$\ddot{x} = \frac{T}{m}\left(\cos\phi\,\sin\theta\,\cos\psi + \sin\phi\,\sin\psi\right)$$
$$\ddot{y} = \frac{T}{m}\left(\cos\phi\,\sin\theta\,\sin\psi - \sin\phi\,\cos\psi\right)$$
$$\ddot{z} = \frac{T}{m}\left(\cos\theta\,\cos\phi\right) - g \qquad (1)$$
where:
  • $\ddot{x}$, $\ddot{y}$, and $\ddot{z}$ are the accelerations in the x, y, and z directions, respectively.
  • $T$ is the total thrust of the UAV.
  • $\phi$, $\theta$, and $\psi$ are the roll, pitch, and yaw angles, respectively.
  • $m$ is the mass of the UAV.
  • $g$ is the gravitational acceleration.
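Equation (1) transcribes directly into a small function that can be used to sanity-check simulated accelerations; the sketch below is such a transcription, with the 0.1 kg mass taken from Section 4.3 for the hover check.

```python
# Translational accelerations of the "+" quadrotor, Equation (1).  Angles in radians,
# thrust in newtons, mass in kilograms.
import numpy as np

def translational_acceleration(T, m, phi, theta, psi, g=9.81):
    x_dd = (T / m) * (np.cos(phi) * np.sin(theta) * np.cos(psi) + np.sin(phi) * np.sin(psi))
    y_dd = (T / m) * (np.cos(phi) * np.sin(theta) * np.sin(psi) - np.sin(phi) * np.cos(psi))
    z_dd = (T / m) * (np.cos(theta) * np.cos(phi)) - g
    return x_dd, y_dd, z_dd

# Hover check: with phi = theta = 0 and T = m * g, all accelerations are ~0.
print(translational_acceleration(T=0.1 * 9.81, m=0.1, phi=0.0, theta=0.0, psi=0.0))
```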

3.2.2. Action and Observation Spaces (A & S)

Considering the actual quadrotor’s dynamics, the control inputs comprise Roll, Pitch, Yaw, and Total Thrust, defining the action space as a 4 × 1 matrix. Referring to Section 3.1, numActions is thus set to 4, which establishes the target entropy $\alpha_T = -4$. This entropy target prevents the agent from converging prematurely to local optima while exploration is gradually reduced, favoring policy optimization and stabilization. Notably, in this case, Yaw is exclusively governed by the FIS, leaving only three control variables for the RL agent. Further details regarding Yaw control are discussed in Section 3.3.
Meanwhile, in the chasing and interception problem, the observation space (S) is characterized by high dimensionality, resulting in prolonged training times and potential challenges in achieving convergence. To address this problem, we developed a target selection algorithm. Given that the positions and orientations of the ownship and intruder aircraft within the 3D environment are randomized at the beginning of each training episode, the algorithm initially selects the nearest intruder to the ownship as the target, treating other intruders as dynamic obstacles. It then continuously selects a new target based on proximity, only after the current target has been successfully intercepted. The intercepted intruders will be stopped and remain hovering at their captured location. This approach simplifies S by considering only the parameters of the selected target and ignoring those of other intruders. The algorithm, detailed in Appendix A.1, produces an output ranging from 1 to n, representing the nth intruder ($n \geq 1$) selected as the target. It updates S with the target ship’s relative position to the ownship and its velocities. Upon target update, a switch replaces these parameters with those of the new target.
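A minimal sketch of this proximity-based selection is given below; the full algorithm, including the persistent capture bookkeeping, is provided in Appendix A.1, and the function and variable names here are illustrative only.

```python
# Nearest non-captured intruder becomes the current target; the caller only invokes
# this when no target is active or the previous target has just been intercepted.
import numpy as np

def select_target(q_own, intruder_positions, captured):
    """Return the 1-based index of the nearest non-captured intruder, or None."""
    best_idx, best_dist = None, np.inf
    for i, r in enumerate(intruder_positions):
        if captured[i]:
            continue                         # intercepted intruders hover in place and are skipped
        d = np.linalg.norm(np.asarray(r) - np.asarray(q_own))
        if d < best_dist:
            best_idx, best_dist = i + 1, d   # 1-based index, matching the 1..n output in the paper
    return best_idx

# Example: intruder 2 is the closest of the three and not yet captured.
print(select_target([0, 0, 5], [[10, 0, 5], [3, 1, 5], [8, 8, 5]], [False, False, False]))
```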
Furthermore, a counter n captured (ranging from 0 to n) was introduced to represent the number of captured intruders. This element plays a crucial role in S, eliminating the ’Agent Confusion’ phenomenon in most multi-target scenarios, where the agent might repetitively complete identical tasks. Overall, S can be described as follows:
$$S = \left[\, d_{ot};\; \phi_{ZYX};\; v_o;\; \omega_o;\; \Delta x_{ot};\; v_t;\; \omega_{xy,o};\; a_{\text{input}};\; u_{\text{FIS}};\; p_{\parallel};\; p_{\perp};\; n_{\text{captured}} \,\right]$$
These items constitute a 28 × 1 matrix; each element is detailed below with its description and respective size:
  • d ot : The Euclidean distance between the ownship and the selected target ship ( 1 × 1 ).
  • ϕ Z Y X : Ownship’s Euler angles in ZYX order, represented as ψ , θ , ϕ ( 3 × 1 ).
  • v o : Ownship’s linear velocities. v o = [ v x ; v y ; v z ] ( 3 × 1 ).
  • ω o : Ownship’s angular velocities, represented as p, q, r ( 3 × 1 ).
  • Δ x ot : Coordinate differences between ownship and the selected target ship in the XYZ directions, represented as x e , y e , and  z e ( 3 × 1 ).
  • v t : The selected target ship’s linear velocities ( 3 × 1 ).
  • ω xy , o : Ownship’s angular velocities about the world coordinate system’s X and Y axes, represented by ω X and ω Y ( 2 × 1 ).
  • a input : Actions of roll ( u ϕ ), pitch ( u θ ), yaw ( u ψ ), thrust ( u T h r u s t ) ( 4 × 1 ).
  • u FIS : Inputs of the FIS, represented as ownship–target angle ( θ target ), front distance ( d front ), and lateral distance error ( d left-right ) ( 3 × 1 ).
  • $p_{\parallel}$: Speed vector’s projection onto the direction (vector) formed between the ownship and the selected target ship ( 1 × 1 ).
  • $p_{\perp}$: The component of the speed vector that is perpendicular to the vector formed between the ownship and the selected target ship ( 1 × 1 ).
  • n captured : Number of successfully intercepted intruders ( 1 × 1 ).
Most observations in S are continuous, leading to a theoretically infinite observation space. To tackle this, we discretized each observation factor. For instance, heterogeneous discretization was implemented for $d_{ot}$: states beyond 5 m from the target were rounded to the nearest 0.5 m, while states within 5 m were rounded to the nearest 0.1 m. The precision is up to one decimal place for velocities, position errors, and control inputs. Through such discretization, we successfully reduced the size of S by a factor of approximately $10^{31}$, significantly reducing the training time and assisting the agent in avoiding convergence failures.
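The discretization rule for $d_{ot}$ and the one-decimal rounding of the remaining observations can be sketched as follows; the bin edges are taken from the text, while the rounding mode is an assumption, since the authors do not specify it.

```python
# Nonuniform discretization: coarse 0.5 m bins beyond 5 m from the target, fine 0.1 m
# bins inside 5 m; other observations are rounded to one decimal place.
def discretize_distance(d_ot):
    step = 0.1 if d_ot < 5.0 else 0.5
    return round(d_ot / step) * step

def discretize_generic(x):
    return round(x, 1)   # velocities, position errors, and control inputs

print(discretize_distance(12.34))   # -> 12.5
print(discretize_distance(3.27))    # -> 3.3 (up to floating-point representation)
print(discretize_generic(-0.7345))  # -> -0.7
```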

3.2.3. Reward Function (R)

This part represents the core of the proposed hybrid control system, delineating the RL environment’s feedback mechanism in response to the ownship’s actions and guiding the agent to learn how to make optimal decisions. Crucially, if the reward function (R) is sparse or biased, the agent may fail to achieve the desired outcomes. For complex and high-dimensional state-space tasks, the reward function must be sufficiently sensitive to reflect minor state changes. Furthermore, a task often encompasses multiple objectives, such as target tracking, obstacle avoidance, speed control, etc. This necessitates that R can adjust the weights of the specified terms corresponding to each objective based on their importance and priority. Therefore, exponential terms are employed to ensure the sensitivity of each component within R. Moreover, the exponential function enables bounded outputs by defining the inputs’ range, which facilitates simple adjustment of the coefficients of different terms to satisfy the specific needs of the task. The following explains the reward function designed for the hybrid control scheme:
$$r_t = \alpha_1 \cdot e^{-(\alpha_2 x_e^2 + \alpha_3 y_e^2 + \alpha_4 z_e^2)} + \alpha_5 \cdot \left(1 - e^{-\alpha_6 \cdot p_{\parallel}}\right) + \alpha_7 \cdot \left(1 - e^{-\alpha_8 \cdot p_{\perp}}\right) + \alpha_9 \cdot \left(1 - e^{-\alpha_{10} \cdot (\alpha_{11} \cdot u_{\phi}^2 + \alpha_{12} \cdot u_{\theta}^2)}\right) + \alpha_{13} \cdot \left(1 - e^{-\alpha_{14} \cdot \omega_X^2}\right) + \alpha_{13} \cdot \left(1 - e^{-\alpha_{15} \cdot \omega_Y^2}\right) + \beta_1 \cdot \Delta n_{\text{captured}} + \beta_2 \cdot g_t$$
where α 1 to α 15 and β 1 to β 2 are weight coefficients. They were designed as follows:
$\alpha_1 = 0.4$, $\alpha_2 = \alpha_3 = \alpha_4 = 0.0105$, $\alpha_5 = 0.4$ if $p_{\parallel} \geq 0$ and $\alpha_5 = -0.4$ if $p_{\parallel} < 0$, $\alpha_6 = 0.3$ if $p_{\parallel} \geq 0$ and $\alpha_6 = -0.3$ if $p_{\parallel} < 0$, $\alpha_7 = 0.25$, $\alpha_8 = 0.4$ if $p_{\perp} \geq 0$ and $\alpha_8 = -0.4$ if $p_{\perp} < 0$, $\alpha_9 = 0.2$, $\alpha_{10} = 8$, $\alpha_{11} = 2$, $\alpha_{12} = 3$, $\alpha_{13} = 0.2$, $\alpha_{14} = \alpha_{15} = 0.03$.
and,
β 1 = 80 , β 2 = 30 .
The reward function of the agent comprises several components: the initial term adopts the relative positions (instead of absolute positions) to the current target, encouraging the ownship to minimize this distance and helping to avoid ’Agent Confusion’ when the target changes. The second and third components incentivize the agent to navigate toward the target, imposing penalties for movements in other directions. Subsequent penalizing components associated with control signals ( u ϕ & u θ ) and angular velocities ( ω X & ω Y ) are designed to prevent oscillations and mitigate large variations, thus enhancing safety and energy efficiency. The term with Δ n captured rewards the agent for each successful interception of an intruder. Additionally, the function also integrates a condition g t , where g t = 1 signals the end of an unsuccessful episode, triggering a corresponding penalty. These aspects are elaborated further in Section 3.2.5.
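As an illustration of the bounded-exponential shaping, the sketch below transcribes the distance, velocity-projection, and interception-bonus terms with the coefficients listed above; the control-effort and angular-rate terms follow the same pattern, and the sign conventions of the projection coefficients follow the reconstruction given earlier, so this should be read as a sketch rather than the authors' exact implementation.

```python
# Partial transcription of the reward function: distance, projection, and bonus terms.
import numpy as np

def tracking_reward(xe, ye, ze, p_par, p_perp, delta_captured):
    a1, a2 = 0.4, 0.0105                                    # a2 = a3 = a4
    a5, a6 = (0.4, 0.3) if p_par >= 0 else (-0.4, -0.3)     # sign-switching coefficients
    a7 = 0.25
    a8 = 0.4 if p_perp >= 0 else -0.4
    b1 = 80.0

    r = a1 * np.exp(-(a2 * xe**2 + a2 * ye**2 + a2 * ze**2))   # bounded in (0, 0.4]
    r += a5 * (1.0 - np.exp(-a6 * p_par))                      # in (-0.4, 0.4): positive when closing
    r += a7 * (1.0 - np.exp(-a8 * p_perp))                     # perpendicular-velocity component
    r += b1 * delta_captured                                   # bonus for each new interception
    return r

# Closing on the target at 2 m/s from about 10 m away, no interception this step:
print(tracking_reward(6.0, 6.0, 5.0, p_par=2.0, p_perp=0.5, delta_captured=0))
```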

3.2.4. Reset Function

To enhance the agent’s generalization capability, a flexible and efficient reset function is indispensable. Algorithm 2 is developed to ensure that, at the beginning of each training episode, the position of each intruder UAV randomly appears on its predefined path. Subsequently, the position of the ownship ($q$) is determined randomly based on the relative distances to the intruders ($r_n$), ensuring that the ownship is within a range of 2.5 m to 20 m of every intruder UAV ($2.5\ \text{m} \leq \| q - r_n \| \leq 20\ \text{m}$). Similarly, the initial orientation and velocities ($v_x; v_y; v_z$) of the ownship are also generated randomly.
Algorithm 2 Reset mechanisms for ownship and intruder aircraft during training.
Require: Number of intruder UAVs N, array of waypoint sets $\{Waypoints_n\}$ for each intruder n, where $n \geq 1$.
Ensure: Initial states for intruder aircraft are updated based on their paths; the ownship’s initial state is updated based on the relative distance to intruders.
 1: Initialize Scenario with update rate (100 Hz) and reference location.
Step 1: Reset initial positions for intruder UAVs $(r_1 \ldots r_N)$.
 2: for $n \leftarrow 1$ to N do
 3:    $numPoints \leftarrow 2000$: number of points for interpolation on the waypoint-based trajectory.
 4:    $Waypoints_n \leftarrow$ define waypoints for the nth intruder.
 5:    $M \leftarrow$ generate a sequence M consisting of numPoints evenly spaced points that span the whole trajectory defined by $Waypoints_n$:
       $M = \left\{ m_i \;\middle|\; m_i = 1 + (i - 1) \cdot \dfrac{\text{Length}(Waypoints_n) - 1}{numPoints - 1}, \; i = 1, \ldots, numPoints \right\}$
 6:    $InterpolatedW_n \leftarrow$ use M for linear interpolation on the trajectory.
 7:    $randIdx \leftarrow$ random integer from 1 to numPoints.
 8:    $InitialPosition_n \leftarrow$ select a random interpolation point from $InterpolatedW_n$ as the initial position for $r_n$.
 9:    Initialize $r_n$ with $InitialPosition_n$.
10: end for
Step 2: Reset ownship’s initial state.
11: Determine the ownship’s initial position based on the relative distances to each intruder UAV.
12: Select the orientation and velocities randomly, e.g., for initial velocities in the x, y, z directions:
    $[v_x; v_y; v_z] = -1.5 + 3 \times \text{rand}(3, 1)$ (m/s)
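A compact Python sketch of Algorithm 2 is shown below. The index-based interpolation follows the equation in Step 5, while the ownship sampling bounds, the retry cap, and the altitude convention of the example waypoints are assumptions made only for illustration.

```python
# Reset sketch: each intruder starts at a random interpolation point on its waypoint
# path, and the ownship is re-sampled until it lies between 2.5 m and 20 m of every intruder.
import numpy as np

def interpolate_path(waypoints, num_points=2000):
    """Linear interpolation at evenly spaced fractional waypoint indices (Step 5)."""
    waypoints = np.asarray(waypoints, dtype=float)
    m = np.linspace(1.0, len(waypoints), num_points)         # m_i = 1 ... Length(Waypoints_n)
    idx = np.arange(1, len(waypoints) + 1, dtype=float)
    return np.column_stack([np.interp(m, idx, waypoints[:, k]) for k in range(3)])

def reset(intruder_waypoints, rng=np.random.default_rng()):
    intruders = []
    for wps in intruder_waypoints:
        pts = interpolate_path(wps)
        intruders.append(pts[rng.integers(len(pts))])         # random initial point on the path
    q = None
    for _ in range(1000):                                     # retry until the spacing constraint holds
        q = rng.uniform([0.0, 0.0, 0.0], [20.0, 20.0, 15.0])  # candidate ownship position (bounds assumed)
        if all(2.5 <= np.linalg.norm(q - r) <= 20.0 for r in intruders):
            break
    v = -1.5 + 3.0 * rng.random(3)                            # initial velocities in [-1.5, 1.5] m/s (Step 12)
    return q, v, intruders

# Example with a single square-shaped intruder path at 7 m altitude:
q0, v0, starts = reset([[[0, 0, 7], [0, 20, 7], [20, 20, 7], [20, 0, 7]]])
```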

3.2.5. Termination Conditions (D)

Unlike 2D maps, which can be easily bounded, 3D spaces are theoretically infinite, and it is challenging to define a unique Geofence for various scenarios. Therefore, it is essential to establish relevant constraints to control the spatial range and temporal duration of each test. A training episode or a flight test concludes under one of the following conditions:
  • The ownship is not positioned within 30 m of the selected target ship:
    If $\| q - r_{\text{target}} \| > 30$ m.
  • The duration surpasses the predefined maximum time threshold:
    If $t \geq T_{\text{final}} = 30$ s.
  • All intruders have been successfully intercepted:
    A target ship is successfully intercepted if $\| q - r_n \| \leq 2$ m (is_Captured).
    All intruder aircraft are intercepted when $n_{\text{captured}} = N$.
  • The ownship collides with any obstacle, identified when the minimal LiDAR reading drops below 0.75 m (in this study, the distance from the center of the ownship to its edge should not exceed 0.4 m):
    If $d_{\text{Detection}}^{\min} \leq 0.75$ m.
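These four conditions can be collected into a single check, for example as in the following sketch (variable and function names are illustrative only):

```python
# Episode termination check combining the four conditions listed above.
import numpy as np

def episode_done(q, r_target, t, min_lidar, n_captured, n_intruders,
                 t_final=30.0, escape_dist=30.0, lidar_limit=0.75):
    if np.linalg.norm(np.asarray(q) - np.asarray(r_target)) > escape_dist:
        return True, "target farther than 30 m"
    if t >= t_final:
        return True, "time limit reached"
    if n_captured == n_intruders:
        return True, "all intruders intercepted"
    if min_lidar <= lidar_limit:
        return True, "collision (LiDAR reading below 0.75 m)"
    return False, ""

print(episode_done(q=[0, 0, 5], r_target=[40, 0, 5], t=12.0,
                   min_lidar=5.0, n_captured=0, n_intruders=2))
```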

3.3. The Fuzzy Inference System

For training cases in 2D environments, drones often exhibit aimless rotational movements during the early training phases, spanning 200–300 episodes [47]. This consumes significant time and decreases the efficiency of testing new algorithms. Such challenges are further exacerbated in 3D environments due to the increased degrees of freedom. To compensate for this, a FIS has been designed to assist the ownship UAV in avoiding extensive trials of irrational poses and movements during the training process. Additionally, this FIS system also supports collision avoidance maneuvers, further simplifying the training complexity of the agent.
In essence, the primary objective of the proposed FIS is to guide the ownship towards the target ships along an unobstructed path, a functionality that has been partially embedded within the reward function. In practical applications, sensors such as LiDAR or cameras are strategically mounted to face forward on UAVs to ensure optimal data collection; also, for ease of mathematical representation, the FIS will be applied to adjust the orientation of the UAV (by controlling yaw u ψ to steer left or right) towards the target, thus facilitating desired movements in cooperation with the reward functions. The choice of FIS for controlling u ψ is based on the fact that yaw adjustments do not impact the quadrotor’s velocities in any direction [48], thus preserving its maneuverability.
The FIS incorporates three inputs: the angle ( θ target ) between the ownship’s body coordinate system’s x-axis and the ownship–target vector, the LiDAR reading in the direction of ownship’s motion ( d front ), and the difference between the LiDAR readings to the left and right sides of the motion direction ( d left-right ). Additionally, since the ownship’s speed vector may not align with its x-axis, it is advisable to use LiDAR readings from the speed vector’s projection on the ownship’s XY plane for the front distance ( d front ), along with readings at 25 degrees to the left and right to calculate d left-right . Detailed calculations for acquiring these input variables are provided in Appendix A.2.
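Since the exact formulations are deferred to Appendix A.2, the following is only a hedged sketch of how the three FIS inputs could be assembled; the horizontal-plane treatment of $\theta_{\text{target}}$, the sign convention for $d_{\text{left-right}}$, and the lidar_range lookup are all assumptions.

```python
# Hedged sketch of the three FIS inputs: target bearing relative to the body x-axis,
# LiDAR range along the motion direction, and the left/right range difference at +/-25 deg.
import numpy as np

def fis_inputs(q_own, r_target, v_own, yaw, lidar_range):
    # Angle between the body x-axis and the ownship-target vector, wrapped to [-pi, pi].
    to_target = np.asarray(r_target) - np.asarray(q_own)
    theta_target = np.arctan2(to_target[1], to_target[0]) - yaw
    theta_target = (theta_target + np.pi) % (2 * np.pi) - np.pi

    # Bearing of the speed vector's projection onto the ownship XY plane.
    motion_bearing = np.arctan2(v_own[1], v_own[0])
    d_front = min(lidar_range(motion_bearing), 20.0)                    # clipped to the 20 m max range
    d_left_right = (min(lidar_range(motion_bearing - np.radians(25)), 20.0)
                    - min(lidar_range(motion_bearing + np.radians(25)), 20.0))
    return theta_target, d_front, d_left_right

# Example with a stub LiDAR that sees nothing within range in any direction:
print(fis_inputs([0, 0, 0], [10, 5, 0], [1.0, 0.2, 0.0], 0.0, lambda bearing: 20.0))
```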
Furthermore, for the onboard LiDAR, all readings exceeding the maximum range are clipped to the maximum range (20 m). The membership functions (MFs) and fuzzy rules of this FIS are as follows:
   Input MFs:
  • d front : Small [0, 3.5 m), Medium [3.5 m, 7.5 m], and Large (7.5 m, 20 m].
  • d left-right : Small [−20 m, −7.5 m], Medium [−5 m, 5 m], and Large [7.5 m, 20 m].
  • $\theta_{\text{target}}$: Extremely Small $[-\pi, -0.5\pi)$, Small $[-0.5\pi, -0.167\pi)$, Moderately Small $[-0.167\pi, -0.016\pi)$, Neutral $[-0.016\pi, 0.016\pi)$, Moderately Large $[0.016\pi, 0.167\pi)$, Large $[0.167\pi, 0.5\pi)$, and Extremely Large $[0.5\pi, \pi]$.
Output MFs:
  • u ψ : Left [−1, −0.7], Soft Left [−0.5, −0.1], Straight [−0.2, 0.2], Soft Right [0.1, 0.5], and Right [0.7, 1].
Fuzzy Rules:
  • IF d front IS Small AND d left-right IS Small THEN u ψ IS Right.
  • IF d front IS Small AND d left-right IS Medium THEN u ψ IS Right.
  • IF d front IS Small AND d left-right IS Large THEN u ψ IS Left.
  • IF d front IS Medium AND d left-right IS Small THEN u ψ IS Right.
  • IF d front IS Medium AND d left-right IS Large THEN u ψ IS Left.
  • IF d front IS Large AND d left-right IS Small THEN u ψ IS Soft Right.
  • IF d front IS Large AND d left-right IS Large THEN u ψ IS Soft Left.
  • IF d front IS Medium AND d left-right IS Medium AND θ target IS Extremely Small THEN u ψ IS Left.
  • IF d front IS Medium AND d left-right IS Medium AND θ target IS Small THEN u ψ IS Left.
  • IF d front IS Medium AND d left-right IS Medium AND θ target IS Moderately Small THEN u ψ IS Soft Left.
  • IF d front IS Medium AND d left-right IS Medium AND θ target IS Neutral THEN u ψ IS Straight.
  • IF d front IS Medium AND d left-right IS Medium AND θ target IS Moderately Large THEN u ψ IS Soft Right.
  • IF d front IS Medium AND d left-right IS Medium AND θ target IS Large THEN u ψ IS Right.
  • IF d front IS Medium AND d left-right IS Medium AND θ target IS Extremely Large THEN u ψ IS Right.
  • IF d front IS Large AND d left-right IS Medium AND θ target IS Extremely Small THEN u ψ IS Left.
  • IF d front IS Large AND d left-right IS Medium AND θ target IS Small THEN u ψ IS Left.
  • IF d front IS Large AND d left-right IS Medium AND θ target IS Moderately Small THEN u ψ IS Soft Left.
  • IF d front IS Large AND d left-right IS Medium AND θ target IS Neutral THEN u ψ IS Straight.
  • IF d front IS Large AND d left-right IS Medium AND θ target IS Moderately Large THEN u ψ IS Soft Right.
  • IF d front IS Large AND d left-right IS Medium AND θ target IS Large THEN u ψ IS Right.
  • IF d front IS Large AND d left-right IS Medium AND θ target IS Extremely Large THEN u ψ IS Right.
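To illustrate how such rules produce a crisp yaw command, here is a reduced sketch covering four of the rules above, using trapezoidal membership functions and a weighted average of output-set centres for defuzzification; the shoulder points, output centres, and defuzzification method are assumptions, since the text specifies only the ranges.

```python
# Reduced Mamdani-style inference for four of the 21 rules (d_front x d_left_right only).
def trap(x, a, b, c, d):
    """Trapezoidal membership: rises a->b, flat b->c, falls c->d."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Membership functions for d_front and d_left_right (ranges from the text, shoulders assumed).
def front_small(x):  return trap(x, -1.0, 0.0, 2.5, 3.5)
def front_large(x):  return trap(x, 7.5, 9.0, 20.0, 21.0)
def lr_small(x):     return trap(x, -21.0, -20.0, -9.0, -7.5)
def lr_large(x):     return trap(x, 7.5, 9.0, 20.0, 21.0)

# Output-set centres for u_psi, taken as midpoints of the listed output ranges.
U = {"Right": 0.85, "Soft Right": 0.3, "Straight": 0.0, "Soft Left": -0.3, "Left": -0.85}

def yaw_command(d_front, d_left_right):
    rules = [
        (min(front_small(d_front), lr_small(d_left_right)), "Right"),       # Rule 1
        (min(front_small(d_front), lr_large(d_left_right)), "Left"),        # Rule 3
        (min(front_large(d_front), lr_small(d_left_right)), "Soft Right"),  # Rule 6
        (min(front_large(d_front), lr_large(d_left_right)), "Soft Left"),   # Rule 7
    ]
    num = sum(w * U[label] for w, label in rules)
    den = sum(w for w, _ in rules)
    return num / den if den > 1e-9 else 0.0

# Obstacle ahead with more clearance on the right (sign convention assumed) -> turn right.
print(yaw_command(d_front=2.0, d_left_right=-10.0))
```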
In summary, this section introduces the components and algorithms of the proposed SAC-FIS control framework for UAV multiple dynamic target interception, based on the system architecture shown in Figure 1. With a DRL agent as the foundation, the controller provides the ability to continuously learn and adapt to the task in a model-free manner. This learning process includes progressively understanding the dynamic model of the UAV to achieve six-degree-of-freedom motion control for the underactuated quadrotor system using only four control inputs: Roll, Pitch, Yaw, and Thrust. Furthermore, the FIS leverages onboard sensor data (3D LiDAR) to incorporate universal expert experience and human cognition, assisting the RL agent in avoiding numerous unrealistic trials, thereby improving training efficiency. Concurrently, the DRL agent persistently learns and interprets the logic provided by the FIS, gradually enhancing coordination for more flexible and efficient motions.
The system setup primarily involves the following steps: Initially, as per Algorithm 1, the dynamic model of the specific robot (in this study, a quadrotor UAV) must be thoroughly analyzed. Following this, a FIS is designed accordingly (detailed in Section 3.3). Subsequently, the neural network is configured, and the relevant training parameters are fine-tuned (discussed in Section 3.1). Additionally, a series of environment interface functions are developed (elaborated in Section 3.2), which include the reward function, reset function, observation space, among others. In the subsequent section, the training results applied to various scenarios will be illustrated.

4. Results

This section presents a comprehensive display of the simulation results for the proposed hybrid control system, underscoring its robust generalization capabilities. Additionally, comparisons with other approaches are discussed, emphasizing the superiority of the hybrid controller. Moreover, a feasibility analysis of applying the hybrid controller to actual quadcopters is conducted based on extensive testing data.
It is important to highlight that the Matlab UAV Toolbox (2023b) serves as the simulation platform for this study, owing to its incorporation of realistic UAV dynamics and consideration of environmental factors. In the simulation results presented, the ’ownship’ is depicted as a blue quadrotor, whereas other colors, such as orange, purple, and green, represent the intruder aircraft (quadrotors). The simulation space employs the NED (north, east, down) coordinate system; however, the simulation environment displays the north, east, and up directions. This means that the z-coordinates are negative, as the UAV operates above ground level ($z \leq 0$) and will not fly underground ($z > 0$). This convention aligns with the common practice of displaying positive values for the altitude in the simulation environment. Additionally, all line graphs feature the horizontal axis as time, with the unit in seconds and a duration of $T_{\text{final}} = 30$ s.

4.1. Simulation Results of The Hybrid Controller

Figure 3 demonstrates the application of a well-trained SAC-FIS agent across three scenarios, each featuring variations in initial positions, the number of intruder aircraft, intruders’ trajectories, and their motion patterns. Additionally, each intruder UAV operates at a distinct speed: the speed of the purple intruder is set to 0.6 m/s, the orange intruder at 0.9 m/s, and the green intruder at 1.2 m/s. In contrast, the ownship’s speed can reach up to 10 m/s, due to the bounding of each action (control variables). In Scenario 1, the ownship’s initial position q initial = ( 14 , 8 , 0 ) , and the initial positions of the intruder UAVs are r 1 initial = ( 20 , 16 , 10 ) and r 2 initial = ( 15 , 20 , 7 ) . The flight paths of the two intruder aircraft are defined by the waypoints [ 0 , 0 , 7 ; 0 , 20 , 7 ; 20 , 20 , 7 ; 20 , 0 , 7 ] and [ 0 , 0 , 10 ; 0 , 20 , 10 ; 20 , 20 , 10 ; 20 , 0 , 10 ] , respectively. In Scenario 2, the ownship’s q initial = ( 10 , 0 , 8.5 ) , with the intruders’ r 1 initial = ( 20 , 5 , 7 ) and r 2 initial = ( 5 , 0 , 7.75 ) .
Their flight paths are determined by [ 0 , 0 , 10 ; 0 , 20 , 10 ; 20 , 20 , 7 ; 20 , 0 , 7 ] and [ 0 , 0 , 7 ; 0 , 20 , 7 ; 20 , 20 , 10 ; 20 , 0 , 10 ] , respectively. Scenario 3 incorporates three intruder UAVs, with the initial position of the ownship at q initial = ( 10 , 10 , 17 ) and the intruders at r 1 initial = ( 0 , 0 , 5 ) , r 2 initial = ( 20 , 20 , 8 ) , and  r 3 initial = ( 0 , 20 , 11 ) . The flight paths are given by [ 0 , 0 , 5 ; 0 , 20 , 5 ; 20 , 20 , 5 ; 20 , 0 , 5 ] , [ 0 , 0 , 8 ; 0 , 20 , 8 ; 20 , 20 , 8 ; 20 , 0 , 8 ] , and  [ 0 , 0 , 11 ; 0 , 20 , 11 ; 20 , 20 , 11 ; 20 , 0 , 11 ] , respectively. The simulation results for these scenarios are depicted in Figure 4 and Figure 5, which detail the ownship’s and intruder UAVs’ trajectories during the interception process from both perspective and top–down views.
Figure 6, Figure 7 and Figure 8 provide an analysis of some key metrics for three scenarios. Specifically, Figure 6 depicts the distance from the ownship to the current target ship. In Scenario 1, the ownship successfully intercepted the first target within 5.1 s and captured both intruder UAVs in 7.7 s, with each intruder hovering at its position upon successful interception. In Scenario 2, the ownship took 2 s to intercept the first target and 12.8 s for the second, totaling 14.8 s. The interception times for all targets were 4.8 s, 7.6 s, and 7 s, respectively, with a total engagement time of 19.4 s for Scenario 3. None of the scenarios exceeded the maximum allotted time of 30 s.
Figure 7 and  Figure 8 show the reward values for the first and second terms in the reward function. The first term encompasses the coordinate differences between the ownship and the current target in the X, Y, and Z directions, with each component bounded between 0 and 1. The smaller the difference, the higher the reward. Similarly, Figure 8 assesses both the magnitude and direction of the speed vector’s projection on the ownship–target vector, which is bounded between −0.4 and 0.4. If this component aligns with the direction from the ownship to the target, the reward is positive; otherwise, it is negative.
These results demonstrate that the proposed SAC-FIS method can excellently perform the task of tracking multiple dynamic targets under various configurations, showcasing its efficiency and generalization capability.

4.2. Comparison Results

In order to demonstrate the advantages of the proposed hybrid SAC-FIS controller, comparisons were made with an SAC-only approach, which represents using only the DRL model to control all variables. Concurrently, a comprehensive evaluation was conducted to assess the impact of incorporating the captured-intruder-count factor n captured into the observation space on tracking performance.
Figure 9 presents the training processes of three distinct approaches on a desktop equipped with an NVIDIA GeForce RTX 4090 GPU. Figure 9a depicts the training result for the SAC-FIS method with n captured included in the observation space, which reached the required average return after 434 episodes over 6 h and 47 min. The SAC-only approach, also incorporating n captured, took 8 h and 53 min for 471 episodes, as illustrated in Figure 9b. This purely DRL method necessitated additional training time because the agent had to simultaneously understand four interrelated control variables and gradually improve this more complicated control process. In contrast, by the same 434th episode, the SAC-only training in Figure 9b had already taken 8 h and 9 min. Moreover, Figure 9c represents the SAC-FIS method’s training process without n captured, which was completed through 341 episodes in 4 h and 26 min. This method’s shortest training duration is attributed to a smaller state space, facilitating a relatively easier understanding of the environment. To test the performance of these approaches under identical conditions, we designed a unified scenario with more complex configurations. As shown in Figure 10, the trajectories of two intruders were determined by [ 0 , 20 , 0 ; 20 , 20 , 4 ; 20 , 0 , 8 ; 0 , 20 , 14 ] and [ 20 , 0 , 0 ; 0 , 0 , 4 ; 0 , 20 , 8 ; 20 , 0 , 14 ] , respectively, with the ownship’s q initial = ( 16 , 8 , 15 ) , and  r 1 initial = ( 10 , 10 , 11 ) , r 2 initial = ( 20 , 0 , 12 ) for the intruder UAVs. The purple intruder drone incrementally elevated its altitude as it moved away from the ownship, while the orange intruder was in the landing mode.
The simulation results of these approaches are showcased in Figure 11, Figure 12, Figure 13, Figure 14 and Figure 15. The SAC-FIS agent incorporating n captured successfully captured both intruders within 14 s, whereas the SAC-only method took 15.8 s due to a noticeably longer flight path of the ownship in this case, with most of the flight path being redundant. This is because, without the guidance of relevant experience, the DRL-only method requires an extensive period to reduce the cost gradually, and achieving the minimum-cost performance is essentially unattainable. Figure 11c, Figure 12c, Figure 13c, Figure 14c and Figure 15c show the outcomes for the SAC-FIS agent without n captured, indicating a failed attempt. After successfully capturing the first target within 2.5 s, the ownship mechanically endeavored to repeat the same task and, having already achieved that tracking objective, succumbed to the ’Confusion’ issue after several failed attempts, leading to the interception failure of the second target ship.
These comparison results highlight the superiority of the SAC-FIS method, as it possesses certain universal experiences from the outset, significantly reducing training time and aiding the agent in completing tasks at a lower cost. Meanwhile, all three agents successfully performed tasks involving only one intruder UAV. However, for tasks involving multiple dynamic targets, incorporating n captured into the observation space (S) is crucial, as it straightforwardly resolves the ’Agent Confusion’ phenomenon.
We conducted an in-depth assessment of both the SAC-FIS and SAC-only methods, evaluating their performance with and without the inclusion of n captured in S. Additionally, we tested these methods across scenarios involving two and three intruder aircraft, with each scenario undergoing one hundred trials. The initial positions of the intruders and the various initial parameters for the ownship were randomized at the beginning of each trial using the reset function developed in Section 3.2.4. The success rates of the various approaches are summarized in Table 1. It is evident that the success rate of the SAC-only method is markedly lower than that of SAC-FIS, primarily because the SAC-only agent is more likely to surpass the time limit ( T final ). Moreover, the importance of introducing n captured is reaffirmed.

4.3. The Hybrid Controller Feasibility Analysis

For the simulations, the mass of the ownship is set to 0.1 kg, with gravity specified at 9.81 m / s 2 . The quadrotor’s four control inputs—Roll, Pitch, Yaw, and Total Thrust—are bounded, and the reward function also includes mechanisms to prevent the UAV from experiencing jitter and sudden motions, thus ensuring stable flight and avoiding unreasonable poses. However, more importantly, evaluating the safety performance in real-world applications is crucial. After analyzing a vast array of results, we select Scenario 3 from Section 4.1 as an example. As illustrated in Figure 16, the Yaw input, regulated by the FIS, results in relatively smooth signals with fewer oscillations. In contrast, the other control variables, which are governed by the SAC agent, exhibit persistent fluctuations. When applying this hybrid controller for real flight tests, even though most quadrotors may not be as sensitive to fluctuating signals, such volatility could still cause mechanical wear and potentially pose a risk. Cordeiro et al. [49] designed a sliding-mode controller (SMC) for fixed-wing UAVs and effectively smoothed the highly fluctuated control signals by incorporating a low-pass filter, reducing common chattering effects while ensuring robustness. In [50], the authors employed an extended Kalman filter (EKF) within their proposed controller. The results demonstrate that the controller, augmented by the EKF, maintains robust tracking performance even when Gaussian white noise is introduced to the state variables. Therefore, for this study, it is beneficial to add a low-pass or Kalman filter to smooth the control signal curve before applying it to real UAVs.
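As a concrete example of the suggested smoothing, a first-order low-pass filter applied to a control channel might look as follows; the cut-off frequency is illustrative, the 0.01 s sample time matches the 100 Hz simulation update rate, and this is not part of the authors' controller.

```python
# First-order (exponential) low-pass filter for smoothing a fluctuating control signal.
import numpy as np

def low_pass(signal, dt=0.01, cutoff_hz=5.0):
    alpha = dt / (dt + 1.0 / (2.0 * np.pi * cutoff_hz))   # smoothing factor from the RC time constant
    out = np.empty_like(signal, dtype=float)
    out[0] = signal[0]
    for k in range(1, len(signal)):
        out[k] = out[k - 1] + alpha * (signal[k] - out[k - 1])
    return out

# Example: a noisy roll command smoothed before actuation.
t = np.arange(0.0, 2.0, 0.01)
u_roll = 0.3 * np.sin(2 * np.pi * 0.5 * t) + 0.1 * np.random.randn(len(t))
u_roll_smooth = low_pass(u_roll)
```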
Furthermore, it is essential to monitor changes in the acceleration of the ownship. As seen in Figure 17, the maximum fluctuation range of acceleration is within 12 m / s 2 , less than two times that of the gravity (2 g). By consulting handbooks and technical specifications for small quadrotors of similar weight, they can typically withstand acceleration changes from 2 g to 5 g [51]. Thus, the hybrid controller is applicable to real quadrotor UAVs in terms of acceleration changes.
In short, the hybrid SAC-FIS controller demonstrated excellent simulation performance and can be applied to real quadcopters once the control signals are filtered and smoothed.

5. Conclusions

This study introduces a novel hybrid UAV motion control scheme that combines a SAC-based deep reinforcement learning strategy with a fuzzy inference system for the multiple-dynamic-target interception problem. A comprehensive analysis of the simulation results and comparisons with alternative approaches underscore the effectiveness of the proposed method. The control framework adopts a modular architecture, allowing it to be adapted, in part or in its entirety, to varied problems and thereby improving its scalability. Additionally, by using a fuzzy logic model to integrate selected universal expert experience, together with a highly sensitive reward function and a flexible reset function, the approach markedly improves training efficiency, strengthens the generalization ability of the trained agent, and reduces cost at the same time. Dynamic environments and multiple targets have long been two significant challenges in RL. This paper addresses them by redesigning the observation space: focusing exclusively on the current target's information, using relative (instead of absolute) coordinates between the ownship and the selected target aircraft, discretizing each state, and incorporating a counting factor. In the future, we plan to upgrade specific modules within the system to address more complicated problems, such as dense and randomly located obstacles and a swarm of ownships operating under a cooperative protocol. In addition, after smoothing the control signals and completing comprehensive safety tests, applying our method in real-world flight trials will allow potential issues to be identified and corrected, further enhancing the performance of the hybrid control system.

Author Contributions

Conceptualization, B.X.; methodology, B.X.; software, B.X.; validation, B.X.; formal analysis, B.X., I.M. and W.X.; investigation, B.X.; resources, B.X., I.M. and W.X.; data curation, B.X.; writing—original draft preparation, B.X.; writing—review and editing, B.X., I.M. and W.X.; visualization, B.X.; supervision, I.M. and W.X.; project administration, I.M. and W.X.; funding acquisition, I.M. and W.X. All authors have read and agreed to the published version of the manuscript.

Funding

The authors would like to acknowledge the financial support from National Research Council Canada (Integrated Autonomous Mobility program) and NSERC (Grant No.400003917) for the work reported in this paper.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Complementary Algorithms

Appendix A.1. Target Selection Algorithm for Ownship in Autonomous Navigation and Target Interception

Algorithm A1 Target selection algorithm.
Require: Ownship's current position q; array of intruder aircraft positions r[1..n]; number of already captured intruders N.
Ensure: Index of the nearest intruder not yet captured, relative to the ownship.
 1: persistent capturedIndices {Boolean array to track captured intruders.}
 2: persistent lastCapturedCount ← 0 {Number of intruders captured as of the last check.}
 3: if lastCapturedCount ≠ N then
 4:   if N = 0 or capturedIndices is uninitialized then
 5:     capturedIndices ← array of False[1..n] {Initially, no intruder is captured.}
 6:   end if
 7:   lastCapturedCount ← N {Update the count of captured intruder drones.}
 8: end if
 9: nearestIndex ← −1
10: minDistance ← ∞
11: for i ← 1 to n do
12:   if not capturedIndices[i] then
13:     distance ← ‖q − r[i]‖
14:     if distance < minDistance then
15:       minDistance ← distance
16:       nearestIndex ← i
17:     end if
18:   end if
19: end for
20: if nearestIndex ≠ −1 then
21:   capturedIndices[nearestIndex] ← True
22: end if
23: return nearestIndex {Return the index of the nearest uncaptured intruder.}
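For readers who prefer executable code, the Python sketch below mirrors Algorithm A1, keeping the pseudocode's persistent variables as object attributes. Indexing is zero-based, and the class interface is an illustrative choice rather than the implementation used in this study.

```python
import numpy as np

class TargetSelector:
    """Python sketch of Algorithm A1: pick the nearest uncaptured intruder."""

    def __init__(self, n_intruders):
        self.captured = np.zeros(n_intruders, dtype=bool)  # persistent capturedIndices
        self.last_captured_count = 0                       # persistent lastCapturedCount

    def select(self, ownship_pos, intruder_positions, n_captured):
        # Re-synchronize the bookkeeping when the capture count has changed;
        # the 'uninitialized' branch of the pseudocode is handled by __init__.
        if self.last_captured_count != n_captured:
            if n_captured == 0:
                self.captured[:] = False        # no intruder captured yet
            self.last_captured_count = n_captured

        nearest_index, min_distance = -1, np.inf
        for i, pos in enumerate(intruder_positions):
            if not self.captured[i]:
                distance = np.linalg.norm(np.asarray(ownship_pos) - np.asarray(pos))
                if distance < min_distance:
                    min_distance, nearest_index = distance, i

        if nearest_index != -1:
            # Mark the selected intruder so the next call moves on to another one.
            self.captured[nearest_index] = True
        return nearest_index
```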

Appendix A.2. Algorithm for Deriving FIS Inputs Based on Onboard 3D LiDAR Data

Algorithm A2 Computation of inputs for the fuzzy inference system.
Require: Ownship's position q, velocity (v_x, v_y, v_z), and Euler angles (ψ, ϕ, θ); target ship's position r_t; LiDAR sensor's azimuth (θ_az) and elevation (θ_el) limits.
Ensure: Angle θ_target between the ownship's x-axis and the vector towards the target ship; three LiDAR readings: the first in the direction of the ownship's velocity vector projected onto its own XY plane, and the other two at angles of 25° to the left and right of this direction, all within the ownship's XY plane and accounting for its orientation.
Part 1: Calculation of θ_target
 1: Normalize ψ to the range [−2π, 2π].
 2: Initialize θ_target to 0.
 3: if x < x_t AND y < y_t then
 4:   θ_target = atan((x_t − x)/(y_t − y)) + ψ
 5: else if x ≥ x_t AND y < y_t then
 6:   θ_target = atan((x_t − x)/(y_t − y)) + ψ
 7: else if x < x_t AND y ≥ y_t then
 8:   θ_target = π − |atan((x_t − x)/(y_t − y))| + ψ
 9: else if x ≥ x_t AND y ≥ y_t then
10:   θ_target = −(π − |atan((x_t − x)/(y_t − y))|) + ψ
11: end if
12: if x = x_t AND y ≠ y_t then
13:   if y > y_t then
14:     if ψ ≥ 0 then
15:       θ_target = −π + ψ
16:     else
17:       θ_target = π + ψ
18:     end if
19:   else
20:     θ_target = ψ
21:   end if
22: else if x ≠ x_t AND y = y_t then
23:   if x > x_t then
24:     θ_target = −0.5π + ψ
25:   else
26:     θ_target = 0.5π + ψ
27:   end if
28: else if x = x_t AND y = y_t then
29:   θ_target = 0
30: end if
Part 2: Sampling LiDAR readings for d_front and d_left-right
31: Transform the velocity vector from the world frame to the ownship's body frame using the Euler angles (ϕ, θ, ψ) to obtain (v_x′, v_y′, v_z′).
32: Project the transformed velocity vector onto the ownship's XY plane: (v_xy_x, v_xy_y) = (v_x′, v_y′).
33: Compute the direction angle α_xy of this projection with respect to the ownship's X-axis in the body frame: α_xy = arctan2(v_xy_y, v_xy_x).
34: Determine the scanning angles within the LiDAR point cloud: angle_center = α_xy, angle_left = α_xy − 25°, angle_right = α_xy + 25°.
35: Sample the three LiDAR readings from the point cloud according to the LiDAR resolution and sampling density (0.01°).
36: Convert them to the corresponding LiDAR readings and calculate d_left-right.
37: return θ_target, d_front, and d_left-right.
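The following Python sketch reproduces Part 2 of Algorithm A2 (body-frame projection of the velocity and derivation of the three LiDAR scan directions). The Z-Y-X Euler rotation convention and the function interface are assumptions; sampling the actual range readings from the point cloud is sensor-specific and omitted.

```python
import numpy as np

def fis_lidar_scan_angles(velocity_world, roll, pitch, yaw, scan_offset_deg=25.0):
    """Sketch of Part 2 of Algorithm A2.

    Returns the three scan angles (center, left, right) in the body frame,
    measured from the ownship's X-axis. Extracting d_front and d_left-right
    from the point cloud at these angles depends on the LiDAR driver.
    """
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)

    # Body-to-world rotation assuming a Z-Y-X (yaw-pitch-roll) Euler convention.
    R_body_to_world = np.array([
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp,     cp * sr,                cp * cr               ],
    ])
    v_body = R_body_to_world.T @ np.asarray(velocity_world)  # world -> body frame

    # Project onto the body XY plane and take its direction angle.
    alpha_xy = np.arctan2(v_body[1], v_body[0])

    offset = np.deg2rad(scan_offset_deg)
    return alpha_xy, alpha_xy - offset, alpha_xy + offset

# Example: forward flight at 2 m/s along world X with 10 degrees of yaw.
# center, left, right = fis_lidar_scan_angles([2.0, 0.0, 0.0], 0.0, 0.0, np.deg2rad(10.0))
```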

References

  1. Murugan, D.; Garg, A.; Ahmed, T.; Singh, D. Fusion of drone and satellite data for precision agriculture monitoring. In Proceedings of the 2016 11th International Conference on Industrial and Information Systems (ICIIS), Roorkee, India, 3–4 December 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 910–914. [Google Scholar]
  2. Aljehani, M.; Inoue, M. Communication and autonomous control of multi-UAV system in disaster response tasks. In Agent and Multi-Agent Systems: Technology and Applications, Proceedings of the 11th KES International Conference, KES-AMSTA 2017 Vilamoura, Algarve, Portugal, June 2017 Proceedings 11; Springer International Publishing: Cham, Switzerland, 2017; pp. 123–132. [Google Scholar]
  3. NVIDIA GTC 2024 Keynote. Available online: https://www.nvidia.com/gtc/keynote/ (accessed on 1 January 2020).
  4. Li, A.; Peizi, L. Introduction to A* from Amit’s Thoughts on Path Finding. 2012. Available online: http://theory.stanford.edu/~amitp/GameProgramming/AStarComparison.html (accessed on 1 January 2020).
  5. Viet, P.Q.; Romero, D. Probabilistic roadmaps for aerial relay path planning. In Proceedings of the GLOBECOM 2023–2023 IEEE Global Communications Conference, Kuala Lumpur, Malaysia, 4–8 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1650–1655. [Google Scholar]
  6. Wang, J.; Zhang, T.; Ma, N.; Li, Z.; Ma, H.; Meng, F.; Meng, M.Q.H. A survey of learning-based robot motion planning. IET Cyber-Syst. Robot. 2021, 3, 302–314. [Google Scholar] [CrossRef]
  7. Wang, G.; Chen, Y.; An, P.; Hong, H.; Hu, J.; Huang, T. UAV-YOLOv8: A small-object-detection model based on improved YOLOv8 for UAV aerial photography scenarios. Sensors 2023, 23, 7190. [Google Scholar] [CrossRef] [PubMed]
  8. ** Control with Extended Kalman Filter for UAVs. Electronics 2023, 12, 3079. [Google Scholar] [CrossRef]
  9. DJI Official Website. Available online: https://www.dji.com/ (accessed on 1 January 2020).
Figure 1. The schematic of the hybrid control system and the environment structure.
Figure 2. A '+' type quadrotor axes and sign convention [46], and a simulation environment featuring the ownship (a blue quadrotor in the middle) equipped with an onboard LiDAR sensor for obstacle detection.
Figure 3. Initial configurations across diverse scenarios.
Figure 4. Simulation results with trajectories in perspective view.
Figure 5. Simulation results with trajectories in top view.
Figure 6. Euclidean distance from ownship to target ship.
Figure 7. Rewards for coordinate errors in X, Y, and Z directions.
Figure 8. Rewards for speed vector's projection on the ownship–target direction.
Figure 9. Training processes of different approaches.
Figure 10. Initial configurations of a unified scenario for testing different approaches.
Figure 11. Simulation results with trajectories of different approaches in perspective view.
Figure 12. Simulation results with trajectories of different approaches in top view.
Figure 13. Euclidean distance from ownship to target ship: comparing different approaches.
Figure 14. Rewards for coordinate errors in X, Y, and Z directions: comparing different approaches.
Figure 15. Rewards for speed vector's projection: comparing different approaches.
Figure 16. Temporal dynamics of ownship control variables in Scenario 3 from Section 4.1.
Figure 17. Temporal dynamics of ownship's acceleration.
Table 1. Comparative success rates of SAC-FIS and SAC-only approaches: impact of including the captured intruder number in the observation space.

             Success Rate * (Incorporating n_captured in S)    Success Rate * (Excluding n_captured in S)
             Two-Intruder    Three-Intruder                    Two-Intruder    Three-Intruder
SAC-FIS      100%            95%                               61%             44%
SAC only     92%             74%                               48%             8%

* The success rate is calculated based on 100 tests for each method.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
