Over the last years, promising results on parameter robustness have been published using these four basic approaches. We will discuss the different ideas in detail in this section, including some of their variations presented over the years. We try to show how the approaches connect to the basic concepts introduced in the previous section and to other related research to the best of our understanding.
3.1. Transition and Reward Robust Designs
Given the underlying stochastic nature of an MDP, every interaction between agent and environment causes a shift in the environment's state. This change is governed by the transition probability model $P$. A consequence of this stochastic process is the risk of potentially encountering critical states. To minimize this risk in traditional RL, Heger [76] adopted a two-player framework similar to Littman [95]. In Heger [76], however, the adversary takes control over $P$ and hence is modeled in the environment. This framework provides a guaranteed minimum performance by considering transitions with the worst outcome at each time step. As a result, the optimal agent behaves in a risk-averse but also very conservative manner [23]. However, this approach abandons the underlying stochastic properties of the environment, replacing its stochastic nature with a deterministic worst-case design [23]. Further, Heger [76] does not seek robustness w.r.t. errors in the approximation, disturbances, or other outside influences. Instead, the goal should be to derive optimization approaches that are robust against these external uncertainties while also retaining the underlying stochastic nature of the environment. As such, robust reinforcement learning rather considers a layered approach of robust stochastic optimization. This consideration first gained attention in the 1970s, when it was applied in
Markov decision processes with imprecisely known transition probabilities (MDPIP) [
23,
128,
130,
131]. The unknown transition matrix $P$ is assumed to be fixed and lies within a known uncertainty set of possible transition matrices $\mathcal{P}$. In the case of an infinite horizon with a discrete finite state space $\mathcal{S}$ and action space $\mathcal{A}$, the MDPIP framework can be formulated as a zero-sum stochastic game between protagonist and adversary [
112]. Thus, compared to [
76], the goal is then changed to optimizing over the worst possible transition matrix $P \in \mathcal{P}$ instead of a single worst-case transition, a mini-max problem of the form
$$\max_{\pi \in \Pi} \min_{P \in \mathcal{P}} \mathbb{E}^{\pi, P}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \right]. \tag{18}$$
In this formulation, optimization over $\pi$ is still a deterministic process, but the optimization over $\mathcal{P}$ remains stochastic as each $P \in \mathcal{P}$ is a stochastic transition matrix. In general, however, it is NP-hard to find an optimal memoryless policy for Equation (18) with brute force [
132]. In contrast to a two-player simultaneous game in which both agents interact simultaneously, the protagonist first tries to maximize the expected return. Then, based on the protagonist's action, the adversary selects the worst possible transition matrix $P$ within an uncertainty set $\mathcal{P}$. The interaction of this two-player game is shown in Figure 5. Intuitively, due to this additional information, the behavior of an optimal policy for the adversary becomes static. Each time the adversary encounters the same $(s,a)$-pair, it will select the same transition matrix $P$ out of $\mathcal{P}$. Thus, it is also possible to find an optimal stationary policy for the protagonist, which behaves deterministically [128,130]. Nevertheless, finding a solution for statically behaving adversaries is computationally expensive.
Therefore, Bagnell et al. [23] considered a two-player zero-sum dynamic game of complete information as a lower bound for the static game, in which the adversary may choose a new transition matrix $P_t \in \mathcal{P}$ at every time step:
$$\max_{\pi \in \Pi} \min_{(P_0, P_1, \dots) \in \mathcal{P} \times \mathcal{P} \times \cdots} \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \right]. \tag{19}$$
The gap between the solution of the dynamic game (Equation (19)) and the static game (Equation (18)) goes to zero as the horizon $T$ goes to $\infty$ [23,24]. In the case of the dynamic game, the adversary's policy is not restricted to remaining static during the game. The adversary is thus able to explore the uncertainty set to find the worst possible transition matrix for each $(s,a)$-pair. Finding a solution for general uncertainty sets, however, is still NP-hard [45].
To improve tractability, Nilim and El Ghaoui [24] and Iyengar [25] have modeled the uncertainty set as a Cartesian product of individual $(s,a)$-dependent uncertainty sets
$$\mathcal{P} = \prod_{(s,a) \in \mathcal{S} \times \mathcal{A}} \mathcal{P}_{sa}. \tag{20}$$
Here, $\mathcal{P}_{sa}$ represents a subset of the probability simplex for each $(s,a)$-pair that describes the uncertainty over the successor of state $s$ given action $a$ [24]. This property (Equation (20)) is known as the $(s,a)$-rectangularity property. Intuitively, it permits the adversary to select a transition matrix from $\mathcal{P}_{sa}$ without considering transition matrices from other state–action pairs. This property provides the foundation of the robust Markov decision processes (RMDP), defined by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma)$.
Initially, uncertainty sets were constructed based on polytopes or interval matrices [23,128,130,131]. Inspired by game theory, Bagnell et al. [23] have proven that there are optimal stationary deterministic policies both for the protagonist and for the adversary if the uncertainty set is a compact and convex polytope. Therefore, the authors have stated that a robust value iteration must exist that finds optimal policies. However, solving this bi-level optimization problem is computationally expensive [23]. Although these sets can satisfy the $(s,a)$-rectangularity property, they are not statistically accurate enough to represent uncertainties [24,25]. For this reason, Nilim and El Ghaoui [24] and Iyengar [25] have constructed uncertainty sets described by likelihood or entropy bounds, which also require significantly lower computational effort. Using these bounds and an estimated or known reference distribution $\hat{P}$ is a natural way to construct statistically accurate uncertainty sets. Although these sets are not convex, Nilim and El Ghaoui [24] and Iyengar [25] have finally proven that the robust dynamic programming algorithm will find an optimal stationary deterministic policy for all uncertainty sets as long as the $(s,a)$-rectangularity property is satisfied. Moreover, given this uncertainty set, the complexity of solving the problem is only slightly higher than for a classical MDP with a fixed transition matrix [24,25].
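A minimal sketch of robust value iteration under an $(s,a)$-rectangular uncertainty set can make the mini-max backup concrete. All sizes, rewards, and candidate transition vectors below are illustrative, and the finite candidate sets stand in for the likelihood- or entropy-bounded sets used in the literature:

```python
import numpy as np

# Sketch of robust value iteration under (s,a)-rectangular uncertainty.
# For every (s, a), the adversary picks the transition vector from a small
# candidate set P_sa that minimizes the expected next-state value.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

# reward r[s, a] and, per (s, a), a list of candidate next-state distributions
r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

def random_dist():
    p = rng.uniform(size=n_states)
    return p / p.sum()

P_sa = {(s, a): [random_dist() for _ in range(4)]
        for s in range(n_states) for a in range(n_actions)}

V = np.zeros(n_states)
for _ in range(500):  # robust Bellman iteration
    V_new = np.empty(n_states)
    for s in range(n_states):
        q = [r[s, a] + gamma * min(p @ V for p in P_sa[(s, a)])
             for a in range(n_actions)]
        V_new[s] = max(q)  # protagonist maximizes, adversary minimizes
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new
```

Because the robust Bellman operator remains a contraction under rectangularity, the iteration converges to a unique robust value function.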
Subsequently, Wiesemann et al. [45] have presented a generalization of the $(s,a)$-rectangular uncertainty set, which is known as $s$-rectangularity. Robust dynamic programming algorithms constrained to this less restrictive property can still find optimal stationary policies. While optimal policies are deterministic under Equation (20), they may now as well be stochastic [45].
A slightly different setting from standard robust MDPs is presented by Lim et al. [44], who aim to solve the robust MDP in the presence of an unknown adversary, meaning that the full extent of nature's ability to change is unknown. An MDP with finite state space $\mathcal{S}$ and action space $\mathcal{A}$ is considered. As in standard robust MDPs, a possibly history-dependent compact uncertainty set over transition matrices is defined for every state–action pair. However, only a subset of state–action pairs is truly adversarial, while all others behave purely stochastically, i.e., with a fixed transition matrix, as in non-robust MDPs. By optimizing a regret, one can determine a policy as good as the mini-max policy without knowing either the uncertainty set or the adversarial subset. Such solutions slightly deviate from the common solution to robust MDPs as they are more optimistic and hence indirectly address a major issue of robust approaches based on worst-case analysis.
The worst-case analysis is prone to produce overly conservative policies that achieve only mediocre performance across all possible model parameters in exchange for increasing the worst-case performance [
39]. As such, in cases where the nominal model parameters already provide a reasonable representation of the real physical system, a non-robust approach leads to higher-performing policies on that system. In other words, applying worst-case analysis to problems with only a small sim-to-real gap may lead to worse performance. Xu and Mannor [
39], therefore, propose a trade-off as a weighted sum between a nominal and a robust performance criterion. In their paper, the authors consider an MDP with an uncertain reward function $r$. The performance criteria are then defined as the expected return at step $t$ for their respective parameters, where $\hat{r}$ is the nominal reward, $P$ is the nominal criterion, and $R$ is the robust criterion. Therefore, the weighted sum is
$$\lambda P(\pi) + (1 - \lambda) R(\pi),$$
with $\lambda \in [0, 1]$ being the weighting parameter. A policy $\pi$ is said to be Pareto efficient if it obtains the maximum of $P$ among all policies with a certain value of $R$. Xu and Mannor [39] show that this problem can then be solved using parametric linear programming for the whole set of Pareto efficient policies. Unfortunately, this approach only considers uncertainties in the reward function. In the case of uncertain transitions, as presented in [
23,
24], the authors, Xu and Mannor [
39], prove that a solution is not Markovian and, as such, may be intractable. However, the idea of trading-off nominal and robust performance has been adopted for a robust policy optimization algorithm for unknown, noisy system dynamics with possibly noisy observations [
52]. The algorithm is based on multi-objective Bayesian optimization, where the objectives represent a nominal performance measure $f_n(\theta)$ and a robust performance measure $f_r(\theta)$, with $\theta$ being the policy parameters. The solution to this optimization problem is defined as a Pareto set of non-dominated parameters: $\theta$ belongs to the Pareto set iff there is no $\theta'$ such that $f_n(\theta') \geq f_n(\theta)$ and $f_r(\theta') \geq f_r(\theta)$ with at least one strict inequality [52]. Their work has shown that the underlying concept of a trade-off between nominal and robust performance is viable for uncertain transition functions.
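The effect of sweeping the weighting parameter can be illustrated with a toy example. The per-model returns below are made up, and the nominal criterion is taken as the return under the first (nominal) model while the robust criterion is the worst case over all models:

```python
import numpy as np

# Hedged sketch of the nominal/robust trade-off: each candidate policy is
# summarized by its return under a nominal model and under perturbed models.
returns = np.array([
    # nominal, perturbed_1, perturbed_2
    [10.0, 2.0, 3.0],   # high nominal return, fragile
    [ 7.0, 6.0, 5.0],   # balanced
    [ 5.0, 5.0, 4.5],   # conservative, dominated here
])
nominal = returns[:, 0]        # nominal criterion P
robust = returns.min(axis=1)   # worst-case criterion R

def best_policy(lam):
    """Maximize the weighted sum lam * P + (1 - lam) * R."""
    return int(np.argmax(lam * nominal + (1 - lam) * robust))

# Sweeping the weight recovers different Pareto-efficient policies.
chosen = {best_policy(lam) for lam in np.linspace(0.0, 1.0, 11)}
```

With these numbers, large weights favor the fragile high-nominal policy and small weights favor the balanced one, while the dominated third policy is never selected.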
A different realization of such trade-offs is presented in [41]. The authors propose a chance constraint formulation based on risk assessments for the expected performance of a policy. Assuming that the transition function $p$ and reward function $r$ are drawn from probability distributions $\bar{p}$ and $\bar{r}$, Delage and Mannor [41] define the chance constraint optimization problem as
$$\max_{y \in \mathbb{R},\, \pi} \; y \quad \text{s.t.} \quad \Pr\left( \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, \pi, p, r \right] \geq y \right) \geq 1 - \epsilon,$$
which describes the risk-adjusted discounted performance of an uncertain MDP. The constraint in the optimization guarantees with a probability of $1 - \epsilon$ that the expected performance of $\pi$ will be greater than or equal to $y$ given $\bar{p}$ and $\bar{r}$. For $\epsilon = 0$, this problem becomes equivalent to the worst-case analysis of robust MDPs [41]. In this sense, the presented approach relaxes the constraint of worst-case analysis, which is to guarantee minimum performance at all times.
While the trade-off in [
39,
52] is an intuitive counter to the conservatism of the worst-case analysis, other research targets the rectangularity assumption as the main source of the problem [
42,
47,
48]. The rectangularity assumption states that the uncertainty in each state is uncoupled from that in all other states, an assumption made necessary because general coupled uncertainty sets are intractable [
42,
47,
48]. The consensus is that defining tractable uncertainty sets with coupled uncertainty across states mitigates the problem of overly conservative solutions.
Mannor et al. [42] initially proposed the idea of lightning does not strike twice (LDST). The algorithm targets systems where the parameters of transition and reward function deviate from their nominal representations $\hat{p}$, $\hat{r}$ only in a small number of states. This consideration is made under the assumption that a deviation has a low probability. The total number of states allowed to deviate from their nominal parameters is hence bounded by a budget $k$ [42]. Following the example of [24], the authors consider a stationary and a time-variant model for applying LDST. In the stationary model, all deviations from the nominal parameters are chosen at the beginning of an episode and kept fixed thereafter. The resulting optimization problem is again a mini-max problem, now over all parameter realizations in which at most $k$ states deviate from the state-specific nominal representations of the transition and reward function. Accordingly, the time-variant model describes a sequential game, where the deviation is chosen upon entering the corresponding state. Rather than the number of states, here, the number of decision stages in which deviations occur is bounded by $k$. It is, in this case, also possible to visit the same state multiple times, where each time the parameters are different. Their experiments show improved performance compared to not only worst-case robust policies but also to nominal non-robust policies.
In [47], the authors propose the concept of $k$-rectangular uncertainty sets, of which LDST is a special case. The $k$-rectangularity is a generalization of the standard rectangularity concept while capturing the computational difficulty of uncertainty sets. If the projection of the uncertainty set onto any nonempty subset of states is one of at most $k$ different possible sets, the uncertainty set is considered $k$-rectangular. Here, $k$ describes the upper bound of an integer that compactly encodes the coupling of uncertainty among different sets [47]. For a nonempty subset of states, the projection of an uncertainty set onto that subset is the set of parameter realizations restricted to those states. Mannor et al. [47] define an uncertainty set as $k$-rectangular if its class of conditional projection sets, i.e., projections conditioned on any fixed realization of the remaining states, contains at most $k$ sets, where $k$ is an integer. Mannor et al. [47] noted that $k$-rectangularity generalizes the standard rectangularity concept such that standard rectangularity equals 1-rectangularity. With LDST being a special case of $k$-rectangularity, the authors focused their simulations on a comparison to LDST, showing further improvements in performance. Their work is a first attempt at providing coupled uncertainty sets flexible enough to overcome conservatism while remaining computationally tractable for the wider applicability of robust MDPs.
Another concept for tractable coupled uncertainty is factor matrix uncertainty sets [48]. A factor matrix $W = (w_1, \dots, w_r)$ comprised of $r$ factors is defined, where each factor is chosen from a corresponding uncertainty set $\mathcal{W}_i$ [48]. The factors are chosen independently to retain tractability despite coupled uncertainty, such that $\mathcal{W} = \mathcal{W}_1 \times \cdots \times \mathcal{W}_r$ is a Cartesian product. This property is referred to as $(r)$-rectangularity [48]. Each factor $w_i$ represents a probability distribution over the next state. The factor matrix uncertainty set then contains the transition probabilities
$$P_{sa} = \sum_{i=1}^{r} u_i^{sa} w_i,$$
with $u^{sa}$ being coefficients for the convex combination describing all possible transition probabilities for a specific state–action pair. Goyal and Grand-Clement [48] show that if $\mathcal{W}$ is a Cartesian product, any $(s,a)$-rectangular uncertainty set can be reformulated as an $(r)$-rectangular uncertainty set. The authors further propose a robust value iteration algorithm based on $(r)$-rectangular uncertainty sets for finite-state MDPs. The provided experiments show significantly less conservative behavior compared to the $(s)$-rectangular approach in [45] while still achieving improved robust performance w.r.t. nominal non-robust MDPs.
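A small numeric sketch shows why the factor construction couples uncertainty while staying on the probability simplex. The factor count, coefficients, and sizes below are illustrative:

```python
import numpy as np

# Hedged sketch of a factor-matrix transition row: every (s, a)-row of the
# transition matrix is a fixed convex combination of r shared factors, so
# perturbing one factor couples the uncertainty across many state-action
# pairs.
rng = np.random.default_rng(2)
n_states, r = 4, 2

# r factors, each a probability distribution over next states
W = rng.uniform(size=(r, n_states))
W /= W.sum(axis=1, keepdims=True)

# convex-combination coefficients u^{sa} for one state-action pair
u = np.array([0.7, 0.3])
P_sa = u @ W  # the induced transition row

# P_sa is automatically a valid distribution: convex combinations of
# probability vectors stay on the simplex.
```

Because all rows share the same $r$ factors, a single factor perturbation simultaneously moves every row that assigns it nonzero weight, which is exactly the coupling the $(r)$-rectangular construction exploits.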
A third viable solution to the conservatism problem of worst-case analysis is a distributional robust design (see Section 2.1). Robust MDPs take uncertainties related to the assumed transition probability model into account. Nevertheless, potential ambiguities within the matrix $P$ itself remain untouched. The distributional robust optimization framework adds these uncertainties in $P$ itself. The focus shifts to distributions $\mu$ over transition probability matrices that, on average, lead to the worst-case models but tolerate deviations. Consequently, this formulation implies less conservative agents. From a different perspective, a common interpretation of this distributional information is that of a prior in Bayesian formulations [40,46,53]. A prior incorporates an additional layer of uncertainty and prevents overfitting to, in this case, a deterministic worst-case optimization over possible transition matrices. A typical choice of prior information is that all distributions $\mu$ are within a certain range $\epsilon$ of a nominal distribution $\hat{\mu}$. This range is defined by some difference measure, e.g., the KL-divergence or the Wasserstein metric, giving rise to a ball of possible distributions surrounding the nominal distribution [38,43,49,50,51,53,54]. Other approaches define the ambiguity set in terms of constraints on the first and second moments of distributions. The first constraint ensures that the first moment lies within an ellipse, while the second criterion enforces that the second-moment matrix lies within a positive semi-definite cone [41,43]. Such constraints may, for example, be confidence regions placing more plausible transition matrices into higher confidence intervals [40,41,43,46]. Further works include ambiguity sets based on a reproducing kernel Hilbert space metric [133] or near-optimal Bayesian ambiguity sets [134]. While this survey will not further detail distributional approaches, we advise the reader to look into [135] for a comprehensive review focusing solely on distributional robust optimization.
So far, it has been shown that robust MDPs can be solved using a robust dynamic programming approach under convergence guarantees [24,25]. Dynamic programming, however, becomes intractable in large state spaces, a common occurrence for practical problems due to the curse of dimensionality. Solutions to this problem have already been researched for non-robust MDPs in the form of linear and non-linear function approximations. While non-linear function approximations, such as deep neural networks, are more versatile, they no longer guarantee convergence to globally optimal value functions [136]. Linear approximations, on the other hand, retain convergence guarantees while mitigating the curse of dimensionality [136]. One of the first approaches applying linear function approximation to robust MDPs is presented by Tamar et al. [10]. The authors propose a robust variant of approximate dynamic programming (ADP). Given a standard robust MDP with an uncertain transition function and a known uncertainty set $\mathcal{P}$, the $Q$-function is derived as
$$Q(s, a) = r(s, a) + \gamma \min_{p \in \mathcal{P}_{sa}} \mathbb{E}_{s' \sim p}\left[ \max_{a'} Q(s', a') \right].$$
This $Q$-function is then approximated by $\hat{Q}(s, a) = w^{\top} \phi(s, a)$, where $w$ are weights related to a feature representation $\phi$ of the states and actions. Optimizing over the linear approximation of the $Q$-function yields a greedy policy $\pi(s) = \arg\max_{a} w^{\top} \phi(s, a)$ for a given $w$. Tamar et al. [10] further provide convergence guarantees within certain conditions. Only recently, Badrinath and Kalathil [136] showed further development in linear approximations for robust MDPs. The authors derive a robust variant of least squares policy evaluation and least squares policy iteration by defining an approximate robust TD($\lambda$) operator as a more general model-free learning framework, an aspect lacking in [10].
A similar effort has been pursued by Abdullah et al. [137]. It is argued that most robust learning algorithms fail to extend to a generalized robust learning framework as they are often bound to the exploitation of task-specific or other properties such as low-dimensional discrete state and action spaces. To mitigate this problem, Abdullah et al. [137] propose Wasserstein robust reinforcement learning (WR²L). The algorithm is designed to work in both discrete and continuous spaces as well as in low- and high-dimensional problems. Their framework relies on the Wasserstein metric, which, compared to other measures of distance between distributions, such as the KL-divergence, is a genuine distance, exhibiting symmetry. Assuming a possibly unknown reference dynamics model $P_0$, a set of candidate dynamics is defined as the $\epsilon$-Wasserstein ball around $P_0$, where $\epsilon$ denotes the degree of robustness. This set represents the action space of an adversary in a two-player zero-sum game, similar to uncertainty sets. Following this notion, the optimization problem is described as a mini-max problem over all possible candidate dynamics bounded by the expected Wasserstein distance. Abdullah et al. [137] show promising results for improved robustness compared to nominal non-robust RL and other robust algorithms in both low- and high-dimensional MuJoCo robotics environments [138].
A well-fitting but rarely used methodology when encountering uncertainties in RL, whether internal or external, is a Bayesian treatment. One of the traditional Bayesian formulations of RL is posterior sampling. While such sampling methods are typically designed for low-dimensional tabular settings, solutions like the uncertainty Bellman equation (UBE) [139] scale posterior sampling up to large domains [140]. The extension to robust MDPs is presented as the uncertainty robust Bellman equation (URBE) [140]. Following the insights of [24], the recursive robust $Q$-function of a robust MDP with finite horizon $T$ is
$$Q_t(s, a) = \hat{r}(s, a) + \min_{p \in \mathcal{P}_{sa}} \mathbb{E}_{s' \sim p}\left[ \max_{a'} Q_{t+1}(s', a') \right], \quad t < T.$$
Accordingly, Derman et al. [140] derive a posterior of this $Q$-function for a posterior uncertainty set constructed from state–action-dependent confidence levels. This uncertainty set is rebuilt for each episode based on the observed data from all previous ones, and a solution to the URBE propagates the resulting uncertainty through the Bellman recursion. This approach offers a trade-off between robustness and conservatism for robust policies. Derman et al. [140] propose a DQN-URBE algorithm for which they show that it can adapt significantly faster to changing dynamics online than existing robust techniques with fixed uncertainty sets. A slightly out-of-scope approach is ensemble policy optimization (EPOpt) [141], an algorithm that uses an ensemble of simulated source domains representing different parameter settings and slight variations of the true target environment. These source domains contain parameterized stochastic transition and reward functions whose parameters are drawn from a distribution $\psi$. The goal is to learn an optimal policy with good performance for all source domains while simultaneously adapting $\psi$ to approximate the target domain better. The algorithm is split into two alternating steps: (i) given a source distribution, find a robust policy; (ii) gather data from the target domain using said robust policy and adapt the source distribution [141]. There are two nested evaluation metrics for the parameterized policy, optimized for the conditional value at risk (cVaR) to find soft robust policies following the work of [142]. The adaption step of the source domain distribution is defined as a Bayesian update using data acquired by applying the current policy to the target domain. Experiments on the OpenAI hopper environment [143] show no performance losses over a wide range of torso masses. While the source domains were specifically chosen to include the target domain in these experiments, further experiments have shown that even badly initialized sets of source domains require only a few iterations to adapt to the target domain [141].
All of the discussed approaches follow the traditional discrete-time reinforcement learning paradigm. However, a few approaches aim for continuous-time designs [7,144,145]. As one of the first to derive a robust reinforcement learning approach, Morimoto and Doya [7] rely heavily on the concept of $\mathcal{H}_{\infty}$-control and differential games (see Section 2.2). Their approach will be discussed in more detail in Section 3.2 in the context of the disturbance robust design. Mankowitz et al. [144], however, focus on uncertainties in the transition function, presenting a robust variant of maximum a posteriori policy optimization (MPO). Instead of optimizing the squared TD error, the authors propose an optimization of the worst-case squared TD error over a state–action-dependent uncertainty set. To further deal with the problem of overly conservative policies, Mankowitz et al. [144] suggest an entropy regularization of the robust Bellman operator from [25]. Most approaches consider uncertainties only in transitions, actions, observations, or disturbances. Lutter et al. [145], instead, propose a robust procedure akin to dynamic programming for continuous state–action spaces and continuous-time formulations, which accounts for perturbations in states, actions, observations, and model parameters all at once. The authors present their algorithm as robust fitted value iteration (rFVI), a robust variant of their previously presented algorithm continuous fitted value iteration (cFVI). Assuming a priori known or learned transition dynamics that are non-linear in the system state but affine in the action, and a separable reward function, the optimal policy and perturbations are analytically calculable in closed form for each type of perturbation. A separable reward function is a function that decomposes into the sum of an action-dependent and a state-dependent reward, where both summands are non-linear, positive definite, and strictly convex [145]. Following that assumption, Lutter et al. [145] extend the policy evaluation step of the cFVI algorithm to a closed-form mini-max optimization.
3.3. Action Robust Designs
So far, we have discussed a substantial amount of research on robust MDPs in the context of transition and reward uncertainty. While the amount of research on that topic is impressive, a majority is presented in the tabular case for mainly low-dimensional finite spaces [23,24,25,39,42,45]. Few contributions have touched upon linear [10] and non-linear function approximation, and especially non-linear approximations have only been addressed in recent years. It is further often unclear how to obtain the mentioned uncertainty sets [11]. There have been further advances in non-linear robust reinforcement learning in the context of disturbance-based robustness, but partly without any theoretical guarantees [9,11]. Tessler et al. [11] instead argue that a more natural approach to introduce robustness is action perturbations. Naturally, a deviation in the action space also simulates environmental changes to a certain extent. Consider a magnitude reduction for a chosen continuous action that encodes some force or speed. Such a reduction has the same effect as increasing friction or introducing an opposing force. As such, action perturbations change the expected behavior of an environment. A schematic representation of the methods discussed in action robust designs is shown in Figure 7.
Following this general idea, Tessler et al. [11] propose two types of action robust MDPs, the noisy action robust MDP (NR-MDP) and the probabilistic action robust MDP (PR-MDP). Both MDPs are defined by the standard MDP tuple together with some joint policy $\pi^{\text{mix}}$ between the protagonist policy $\pi$ and the adversary policy $\bar{\pi}$. In the case of the NR-MDP, the action space denotes a compact and convex metric space for the joint actions to ensure that the mixture actions are valid. The reason behind proposing two different MDP formulations lies in the inherent nature of the robustness they encode.
The NR-MDP is designed to represent constant interrupting forces applied to the agent, e.g., the unexpected weight of a robot arm constantly applying a downward force. This constant force takes on the form of adversarially chosen noise added by the adversary through the joint policy. As such, Tessler et al. [11] define a noisy joint policy in which the executed action is the convex combination
$$a = (1 - \alpha)\, a_{\pi} + \alpha\, a_{\bar{\pi}},$$
where $\alpha$ scales the impact of the constant disturbance applied by the adversary. The optimal $\alpha$-noisy robust policy then follows as the mini-max solution over the protagonist's set of stationary stochastic policies and the adversary's set of stationary deterministic policies. The optimal policy will then be robust w.r.t. any bounded perturbations added by the adversary. Naturally, by choosing $\alpha = 0$, the NR-MDP collapses back to the standard non-robust MDP.
In contrast, the PR-MDP describes interruptions of the protagonist's movements through, e.g., sudden pushes. In this type of MDP, there is a certain probability $\alpha$ with which the adversary takes control over the decision-making process to perform a worst-case action. This probability is encoded into the probabilistic joint policy, defined as the mixture
$$\pi^{\text{mix}}_{\alpha}(\cdot \mid s) = (1 - \alpha)\, \pi(\cdot \mid s) + \alpha\, \bar{\pi}(\cdot \mid s).$$
The optimal probabilistic robust policy is again given by the corresponding mini-max solution. For their experiments, Tessler et al. [11] introduce a robust variant of deep deterministic policy gradient (DDPG) [101] based on soft policy iteration to train two deterministic policy networks, one each for protagonist and adversary. Similar to standard DDPG, a critic is trained to estimate the $Q$-function of the joint policy. The experiments showed that, with few exceptions, their approaches performed better than a baseline in several MuJoCo robotics environments in which, with increasing probability, random noise is applied instead of the chosen action. Surprisingly, their approaches also outperformed the baseline in the absence of any perturbations. Compared to the classical robustness approach based on known uncertainty sets, the PR-MDP and NR-MDP approaches do not require any knowledge of the uncertainty sets, as these are implicitly given through $\alpha$. However, this advantage also restricts the MDPs, as they cannot handle arbitrary worst-case perturbations. Tessler et al. [11] further show that the PR-MDP is a specific case of the robust MDP formulation.
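The PR-MDP control switch can be simulated in a few lines; the stand-in policies and probability below are illustrative:

```python
import numpy as np

# Hedged sketch of PR-MDP execution: at every step the adversary seizes
# control with probability alpha and plays a worst-case action, otherwise
# the protagonist acts. Policies here are illustrative stand-ins.
rng = np.random.default_rng(5)

def protagonist_action(state):
    return 1.0   # e.g., push toward the goal

def adversary_action(state):
    return -1.0  # e.g., worst-case opposing push

def pr_step(state, alpha):
    """Sample who is in control, then return the executed action."""
    if rng.random() < alpha:
        return adversary_action(state)
    return protagonist_action(state)

alpha = 0.1
actions = [pr_step(state=0.0, alpha=alpha) for _ in range(10_000)]
adversary_rate = np.mean(np.array(actions) < 0)  # empirical takeover rate
```

In contrast to the NR-MDP's per-step blending, only one of the two agents acts at any step, matching the "sudden push" interpretation.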
A similar idea to the PR-MDP was presented by Klima et al. [55], who extended TD learning algorithms by a new robust operator $\kappa$ to improve robustness against potential attacks and perturbations in critical control domains. The approach has similarities to mini-max-Q learning in two-player zero-sum games as proposed by Littman [95] but does not assume a minimization over the opponent's action space. Instead, attacks are defined to minimize over the protagonist's action space, such that both policies learn, but do not act, simultaneously. Klima et al. [55] formalize this idea by replacing the mini-max simultaneous actions with stochastic transitions between multiple controllers with arbitrary objectives taking control in the next state. This formulation is similar to an adversary taking control with probability $\kappa$ in the PR-MDP setting. The value of the next state then depends on who is in control. Thus, the TD error (see Equation (15)) changes to a target that mixes the protagonist's and the adversary's estimate of the next state's value, weighted by $1 - \kappa$ and $\kappa$, respectively, with $\kappa$ either known a priori or estimated by the agent. This framework allows learning robust policies in the presence of an adversary without ever executing critical adversarial actions on the system, as only the future estimate is affected by the adversary. Klima et al. [55] apply this approach to off-policy $Q$-learning and on-policy Expected SARSA, yielding robust variants of both that lie between the worst-case design and risk-sensitivity. Klima et al. [55] show that the algorithms converge to the optimal $Q$-function and the robust $Q$-function, respectively. Both robust variants are compared to nominal $Q$-learning, SARSA, and Expected SARSA. All experiments show improved performance and robustness compared to the baseline methods.
In general, worst-case scenarios, in which an adversary attempts to minimize the expected return, do not automatically cause the agent to experience catastrophic but highly unlikely events, as the adversary does not actively pursue such outcomes. Pan et al. [56] argue that a robust policy should not only aim to maximize the expected return but also be risk-averse. The authors extend the framework of Pinto et al. [9] to provide increasingly harder challenges such that the protagonist learns a policy with minimal variance in its rewards. Pan et al. [56] present the risk-averse robust adversarial reinforcement learning algorithm (RARARL), assuming a two-player zero-sum sequential game in which the protagonist and adversary take turns controlling the environment. The protagonist takes a sequence of m actions followed by a sequence of n actions chosen by the adversary. The game is formulated as a tuple with a possibly infinite state space, while the agents share the same action space. Pan et al. [56] introduce a risk-averse term into the Q-function of the protagonist, given by the variance of an ensemble of k Q-functions weighted by a constant; the objective for the adversary is analogously formulated with a risk-seeking term. The variance for an action is calculated across the ensemble, where each Q-function is realized as one head of a Q-value network, and the same formula is applied for the adversary. The models of the ensemble are trained on different sets of data. Pan et al. [56] utilize an asymmetric reward function to increase the probability of catastrophic events: good behavior receives small positive rewards while catastrophic outcomes result in highly negative rewards. An important observation is that the protagonist needs to be trained separately first to achieve rudimentary control, as it is otherwise unable to keep up with the adversary. Pan et al. [56] show that by introducing risk-averse behavior in the presence of a risk-seeking adversary, far fewer catastrophic events occur during the test phase.
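The ensemble-based risk adjustment can be sketched as follows; the function combines k Q-heads into a single risk-averse value estimate, with the weighting constant lam as an illustrative stand-in for the constant in Pan et al. [56]:

```python
import numpy as np

def risk_averse_q(q_heads, lam=1.0):
    """Combine an ensemble of k Q-heads into a risk-averse estimate:
    mean over heads minus lam times the ensemble variance, so actions
    whose value the ensemble disagrees on are penalized (protagonist).
    The adversary's risk-seeking variant would instead add the variance.
    lam is a hypothetical weighting constant."""
    q_heads = np.asarray(q_heads)   # shape (k, n_actions)
    mean = q_heads.mean(axis=0)
    var = q_heads.var(axis=0)
    return mean - lam * var         # risk-averse value per action
```

Actions on which the heads agree keep their mean value, while contested actions are discounted in proportion to the disagreement.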
While most robust designs formulate some bi-level optimization problem for which a Nash equilibrium needs to be found, Tan et al. [57] propose a new perspective in the context of action robust design. In their paper, Tan et al. [57] discuss the importance of robustness for the validity of deep reinforcement learning, especially in safety-critical applications. The authors approach robustness from the perspective of adversarial attacks [59,146] on deep neural networks as they are known from image classification, where it has been discovered that neural networks are highly susceptible to perturbations of their input vectors [147]. Tan et al. [57] project adversarial attacks onto action robust control in the form of bounded white-box attacks on the protagonist's actions. The robust optimization is defined as a bi-level problem over a bounded adversarial perturbation of the action space: the inner problem, which computes the attack, is solved by projected gradient descent, while the outer problem is optimized through standard policy gradient techniques [57]. An important factor in this formulation is that the adversarial perturbations on the action space are optimized until convergence for each episode of the training process. The approach has been shown to improve the performance of DRL-based controllers in OpenAI Gym environments. Using adversarial attacks to formulate a robust reinforcement learning framework, however, is not new: prior works have utilized this very idea in the context of an AI's perception of its environment [60,62].
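A minimal sketch of such a bounded attack on a continuous action is given below, assuming an l-infinity budget eps and using finite-difference gradients in place of backpropagation; loss_fn and all constants are illustrative:

```python
import numpy as np

def pgd_action_attack(loss_fn, a, eps=0.1, step=0.02, iters=20):
    """Projected gradient ascent on a bounded action perturbation delta:
    repeatedly step in the direction that increases the protagonist's
    loss, then project back onto the l-infinity ball ||delta|| <= eps.
    Gradients are taken by finite differences to keep the sketch
    self-contained; loss_fn, eps, step, and iters are illustrative."""
    delta = np.zeros_like(a, dtype=float)
    for _ in range(iters):
        grad = np.zeros_like(delta)
        for i in range(delta.size):   # finite-difference gradient
            e = np.zeros_like(delta)
            e[i] = 1e-5
            grad[i] = (loss_fn(a + delta + e) - loss_fn(a + delta - e)) / 2e-5
        # gradient step followed by projection onto the eps-ball
        delta = np.clip(delta + step * np.sign(grad), -eps, eps)
    return a + delta
```

Projecting back onto the eps-ball after every gradient step is what keeps the attack bounded, mirroring the white-box setting described above.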
3.4. Observation Robust Designs
Adversarial attacks on observations in reinforcement learning are closely related to
generative adversarial networks (GAN) [
58]. GANs are a framework for estimating generative models via an adversarial process. The approach describes a mini-max game between a
discriminative model (DM) and a
generative model (GM). The goal is to train generative models capable of creating data indistinguishable from a given data set. The GM generates samples that get mixed into the true data set while the adversary tries to identify the artificially created data of the GM. As such, the GM minimizes the probability of samples being distinguished from the true data, while the DM maximizes the probability of correctly separating true from generated data [
58]. Loosely speaking, a GM is trying to trick a classifier into falsely classifying generated samples as true samples. However, considering the nature of classification tasks with more than two classes, there is no way to determine a truly worst-case classification as there is only right and wrong. This problem does not exist in reinforcement learning, i.e., control tasks, where a worst-case action or state can be identified.
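The mini-max value of this game can be sketched numerically as follows; d and g are illustrative callables standing in for the discriminator and generator networks:

```python
import numpy as np

def gan_value(d, g, real_samples, noise_samples):
    """Value of the GAN mini-max game,
    V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]:
    the discriminator d maximizes V while the generator g minimizes it.
    d and g are illustrative callables on numpy arrays; the small
    constant avoids log(0)."""
    real_term = np.mean(np.log(d(real_samples) + 1e-12))
    fake_term = np.mean(np.log(1.0 - d(g(noise_samples)) + 1e-12))
    return float(real_term + fake_term)
```

A discriminator that cannot do better than guessing (outputting 0.5 everywhere) pins the value at -2 log 2, the equilibrium value of the game.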
In a similar fashion to GANs, adversarial attacks on observations aim to trick a protagonist into perceiving the true state of the environment as another (see
Figure 8). The goal is to warp the protagonist's perception to the point that the perceived state causes the protagonist to act adversarially to its own interest [
62]. Interestingly, such an approach is not aimed at robustness against sensory errors or sensor noise but against perturbations in the environmental parameters. The protagonist is forced into unwanted and unexpected transitions of the environment states, which can be seen as a change of the system dynamics. A significant advantage of adversarial attacks, compared to the max–min formulation most research has followed thus far, is that no equilibrium solution of the max–min formulation needs to be found [62]. Adversarial attacks are optimized for each step of the learning process such that, from the perspective of the protagonist, a traditional single-player MDP still exists. As such, traditional RL algorithms can be applied [60,62].
Huang et al. [60] presented early results in the context of image-based reinforcement learning for Atari games using the fast gradient sign method (FGSM) [59] to find the worst possible actions. However, these results focus on the vulnerability of deep neural networks to perturbations in high-dimensional image data, where this vulnerability was initially found [62]. Pattanaik et al. [62] show that this vulnerability is not restricted to high dimensions but also applies to standard lower-dimensional state observations known from control environments such as the Cartpole or Hopper. Further, while the objective function used in [60] for the FGSM to identify adversarial attacks leads to a decrease in the probability of taking the best possible action, it does not necessarily increase the probability of taking the worst possible action [62]. As such, Pattanaik et al. [62] propose a more efficient objective function, defined as the cross-entropy loss between the optimal policy of the protagonist and an adversarial probability distribution that assigns probability 1 to the worst possible action and 0 otherwise. The FGSM is incorporated as an additional step into traditional RL methods, identifying the worst possible perceived state according to the objective function in Equation (21) before querying the protagonist for the next action. Experiments on several OpenAI Gym and MuJoCo environments have shown that the presented approach adds robustness to environmental parameter perturbations in double deep Q-networks (DDQN) [148] and DDPG. However, despite the promising results, a theoretical connection of these adversarial attacks to robustness against parameter perturbations has not been provided [62].
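The gist of such a gradient-sign attack can be sketched as follows, assuming the policy is available as a callable returning action probabilities and using finite differences instead of network gradients; all names and constants are illustrative:

```python
import numpy as np

def fgsm_state_attack(policy_probs, s, worst_action, eps=0.05):
    """Gradient-sign attack on the state observation: perturb s in the
    direction that decreases the cross-entropy between the policy and an
    adversarial distribution placing all mass on the worst action, i.e.
    pushing the policy toward that action. Finite-difference gradients
    stand in for network backpropagation; policy_probs, worst_action,
    and eps are illustrative."""
    def loss(x):  # cross-entropy to the adversarial distribution
        return -np.log(policy_probs(x)[worst_action] + 1e-12)
    grad = np.zeros_like(s)
    for i in range(s.size):
        e = np.zeros_like(s)
        e[i] = 1e-5
        grad[i] = (loss(s + e) - loss(s - e)) / 2e-5
    # descend the loss: raise the probability of the worst action
    return s - eps * np.sign(grad)
```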
Parallel work by Mandlekar et al. [61] not only includes white-box adversarial attacks on the input space but also on the dynamics. The authors propose an extension of the training process in TRPO with a curriculum learning-based procedure that considers physically plausible perturbations. The approach targets dynamic systems subject to dynamics noise, process noise, and observation noise. While the process and observation noise directly perturb the state or observation of the dynamical system, the dynamics noise represents the uncertainty of the physical parameters of the model, e.g., mass and friction. During learning, the frequency with which perturbations occur in the system is increased over time through a scheduling parameter, following the curriculum learning paradigm. Through this process, the agent becomes increasingly more robust to both modeling errors and adversarial perturbations of the input space. In contrast to [62], Mandlekar et al. [61] use the full gradient to optimize the adversarial attacks. They argue that the FGSM is designed for high-dimensional image spaces and may cause scaling issues when applied in low-dimensional dynamic systems. Experiments on the MuJoCo robotics simulator show significant robustness improvements compared to nominal TRPO across all environments. While better results are achieved with adversarial perturbations during training, even random perturbations already show significant improvement.
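The curriculum schedule can be sketched as a perturbation probability that grows over training; the linear schedule and Gaussian noise model below are illustrative assumptions, not the exact procedure of Mandlekar et al. [61]:

```python
import random

def curriculum_perturb(step, total_steps, base_state, noise_scale=0.1,
                       rng=random):
    """Curriculum-style perturbation: the probability phi of perturbing
    a step grows linearly from 0 to 1 over training, so the agent first
    learns nominal control and only gradually faces harder, perturbed
    dynamics. The linear schedule and additive Gaussian noise are
    illustrative stand-ins for the paper's perturbation model."""
    phi = min(1.0, step / total_steps)   # perturbation probability
    if rng.random() < phi:
        return [x + rng.gauss(0.0, noise_scale) for x in base_state]
    return list(base_state)
```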
A more formal approach is presented in the form of state-adversarial MDPs (SA-MDP) [64]. The SA-MDP is defined as a tuple extending the classical MDP in which the protagonist's observations are perturbed by a bounded function. With these perturbations, the authors aim for robustness against sensory errors and sensor noise. Naturally, the protagonist policy then acts on the perturbed observations. The SA-MDP produces a bi-level optimization problem for which finding a solution is challenging. However, leveraging stochastic gradient Langevin dynamics [149] allows for a separate optimization of the inner problem up to a local optimum first. As a consequence, a possibly large gap to the global optimum remains. This gap is bounded in terms of the total variation distance or KL-divergence between the policies at the true and the perturbed state observation, using convex relaxations for neural networks [150]. These bounds then allow for an outer optimization of an inner lower or upper bound, which provides robustness certificates for guaranteed minimum performance. The total variation distance and KL-divergence are integrated as a robust policy regularization into various traditional DRL algorithms, such as DDPG, proximal policy optimization (PPO) [108], and deep Q-networks (DQN) [100].
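The regularization idea can be sketched as a penalty on how much the policy's action distribution changes under an observation perturbation; policy_probs is an illustrative callable:

```python
import numpy as np

def smoothness_regularizer(policy_probs, s, s_perturbed):
    """State-adversarial regularization sketch: penalize the KL
    divergence between the policy at the true state and at a perturbed
    state, encouraging the policy to act consistently under bounded
    observation perturbations. policy_probs is an illustrative callable
    returning action probabilities; the small constant avoids log(0)."""
    p = np.asarray(policy_probs(s))
    q = np.asarray(policy_probs(s_perturbed))
    return float(np.sum(p * np.log((p + 1e-12) / (q + 1e-12))))
```

A policy that acts identically on the true and perturbed state incurs zero penalty; the more the action distribution shifts, the larger the regularization term.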
Further extending existing work on robustness guarantees and certification bound algorithms from computer vision, Lütjens et al. [65] propose certified adversarially-robust reinforcement learning (CARRL) to solidify the value of deep reinforcement learning in safety-critical domains. Certification bounds theoretically provide guaranteed deviation bounds on the output of a neural network given an input perturbation. For deep learning, these bounds are relevant mainly because they guarantee a bounded output even when non-linear activation functions are used. CARRL extends classical deep reinforcement learning algorithms such as DQN to guarantee protection against adversarially perturbed observations or sensor noise. In CARRL, the agent does not rely on the given observed state but rather assumes that the observed state is corrupted. Lütjens et al. [65] instead propose to exploit the possibility that the true state lies somewhere within an ε-ball around the assumed corrupted observed state. Based on the parameter ε, an upper and a lower bound, the certification bounds, are calculated for the predicted Q-value. Assuming that the training process induces the network to converge to the optimal value function given the lower bound, CARRL handles perturbed observations. While for ε = 0 CARRL reduces to nominal DQN, an increase in ε corresponds to increasingly conservative behavior of the policy. Experiments have shown that the proposed method outperforms nominal DQN on benchmarks with perturbed observations.
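The conservative action selection at the core of CARRL can be sketched as follows, assuming the per-action certification bounds are already computed; the deviation bounds here are illustrative inputs rather than the output of a real certification procedure:

```python
import numpy as np

def carrl_action(q_values, q_deviation_bound):
    """CARRL-style action selection sketch: instead of acting on the
    nominal Q-values of the (possibly corrupted) observation, act on a
    certified lower bound Q - deviation, where the deviation bounds how
    much each Q-value can change for any true state in the eps-ball.
    The per-action deviation bounds are assumed given (illustrative)."""
    lower = np.asarray(q_values) - np.asarray(q_deviation_bound)
    return int(np.argmax(lower))   # action with the best guaranteed value
```

Note that the action with the highest nominal Q-value may be rejected when its certified lower bound falls below that of another action, which is exactly the conservative behavior induced by a larger ε.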
Most of the work evolving around observation robust designs considers that the adversary has direct access to the protagonist's observations, which Gleave et al. [63] argue is a critical assumption not present in realistic situations. An example given by Gleave et al. [63] is autonomous driving, where pedestrians and other drivers can take actions that affect an AI's input but cannot directly manipulate any sensor data. Gleave et al. [63] instead propose adversarial policies acting in a multi-agent environment. This environment is described by a classical two-player zero-sum Markov game with a finite state space and finite action spaces for the protagonist and the adversary. This formulation is not designed to target the protagonist's observations specifically. The experiments are designed as MuJoCo robotics simulations, where the protagonist and adversary are the same type of robot, just with different goals. The authors consider a win–loss reward function typically known from games such as chess or Go: the protagonist wins if it successfully completes a given task, while the adversary wins by preventing the completion of that task. An interesting result, and the reason why this work is placed into observation robust design, is the unique way in which the trained adversaries win. Instead of solving the game for two evolving players, Gleave et al. [63] investigate the players' learning behavior for a fixed opponent in MuJoCo robotics simulations. First, the authors show that even without directly manipulating the protagonist's perception, an adversary can be trained with PPO that decimates a fixed protagonist's performance simply by being present in the observations. These results further solidify what is known about the vulnerability of neural network policies to adversarial settings. Second, Gleave et al. [63] present a fine-tuning of the protagonist against a fixed adversary to improve the robustness of the protagonist policy against such adversarial settings. However, as either one of the players is kept fixed, the learning process will eventually overfit to the fixed opponent and thus be again susceptible to new attacks [63].
Especially for neural network input and hence observation robustness, research has proven how vulnerable neural network policies are to adversarial attacks and settings. Reaching a better understanding of adversarial training and behavior is crucial in achieving robustness, security, and a better understanding of deep reinforcement learning [
63].
3.5. Relations to Maximum Entropy RL and Risk Sensitivity
Research in robust reinforcement learning has progressed far, and issues regarding conservatism, the construction of ambiguity sets, and, in part, convergence have been addressed. The latter, however, remains a concern, i.e., the existence of multiple Nash equilibria destabilizes convergence to meaningful robust policies. In recent years, an interesting alternative has been shown to provide robustness properties. With their research, Eysenbach and Levine [68,69] have investigated the relationship between maximum entropy reinforcement learning (MaxEnt RL) and robustness. This research dates back to work by Grünwald et al. [66], where it was shown that maximizing the entropy of a distribution is equivalent to minimizing the worst-case log-loss in prediction problems, which also translates to conditional distributions [68]. In [68], these insights are extended to reinforcement learning for certain classes of uncertain reward functions. The objective of MaxEnt RL is given as the classical reinforcement learning objective extended by an entropy term, $\max_\pi \mathbb{E}_\pi\big[\sum_t r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\big]$, where $\mathcal{H}$ denotes the Shannon entropy. This objective is equivalent to the reward robust objective for a particular uncertainty set of reward functions. This uncertainty set is specified by the definition of the entropy term in the objective, which, in this case, relates to logarithmic functions due to the Shannon entropy. Given other entropy formulations, however, the definition of the uncertainty set changes [68]. These results have further been extended to consider uncertainties in the dynamics [69]. First, the authors propose a different definition of the uncertainty set for the reward robust objective as all reward functions satisfying a constraint that bounds their deviation from the original reward by some positive constant. For adversarial dynamics, the authors then consider an MDP with a transformed reward function in which the entropy of the state transition probability is taken into account. It follows that the MaxEnt RL objective then defines a lower bound on the robust objective for adversarial dynamics from an uncertainty set comprising all transition functions satisfying an analogous constraint. According to Eysenbach and Levine [69] and Ziebart [151], this definition can be interpreted as all dynamics sufficiently close to the original dynamics. For both cases, the authors provide formal proofs that these relations hold. Considering the difficulty of solving bi-level optimization problems like those defined in robust reinforcement learning, MaxEnt RL becomes an attractive and easier-to-solve alternative while still providing robustness properties to a certain extent.
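The entropy-augmented objective can be sketched for a finite trajectory as follows; alpha is an illustrative temperature weighting the Shannon entropy bonus:

```python
import numpy as np

def maxent_objective(rewards, action_probs, alpha=0.1):
    """MaxEnt RL objective sketch for one trajectory: the classical
    return augmented by the Shannon entropy of the policy at each
    visited state, J = sum_t [r_t + alpha * H(pi(.|s_t))]. alpha is an
    illustrative temperature; the small constant avoids log(0)."""
    total = 0.0
    for r, probs in zip(rewards, action_probs):
        p = np.asarray(probs)
        entropy = -float(np.sum(p * np.log(p + 1e-12)))
        total += r + alpha * entropy
    return total
```

A uniform policy earns the maximal entropy bonus at every step, while a deterministic policy collapses the objective back to the plain return.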
Further alternatives for achieving robustness have been found in the context of risk sensitivity. Contrary to the robust MDP, the risk-sensitive MDP [
152,
153,
154] does not consider any uncertainty in its parameters. Instead, the objective aims to optimize some risk measure of the cumulative cost [
67]. As in standard MDPs, early approaches build upon the dynamic programming approach with finite state and action spaces [
152,
153,
154,
155,
156]. Since then, research has progressed to more refined formulations with continuous spaces, such as Linear Quadratic Gaussian control [
157,
158] and relative entropy policy search [
159]. A connection of risk-sensitive MDPs to robust MDPs has been found for two specific formulations, iterated risk measures and expected exponential utility functions [
67].
Osogami [67] has found that robust MDPs whose uncertainty is governed by a parameter, such that the uncertainty set over the transition functions p is described relative to a nominal transition function, are equivalent to risk-sensitive MDPs with an iterated conditional tail expectation (ICTE) as objective. Here, the ICTE involves N recursive applications of the conditional value at risk (CVaR), applied to the cumulative reward. Osogami [67] further extends his proofs to simultaneous transition and reward uncertainty and to a more general class of robust MDPs using coherent risk measures. Xu and Mannor [40] also state that a coherent risk measure is equivalent to a distributionally robust formulation. For specific details, we refer the reader to [67].
The second equivalence to robust MDPs is discussed for expected exponential utility functions [67]. The exponential utility function can be written as $u_\gamma(x) = (e^{\gamma x} - 1)/\gamma$, where the risk-sensitivity factor $\gamma$ governs the resulting shape of the function as convex, linear (in the limit $\gamma \to 0$), or concave. As such, this factor also determines the behavior of the optimization as risk-seeking, neutral, or risk-averse [158,159]. Minimizing the expectation of this utility function is equivalent to minimizing the entropic risk measure (ERM), $\mathrm{ERM}_\gamma(X) = \frac{1}{\gamma} \log \mathbb{E}\big[e^{\gamma X}\big]$, for a risk-sensitivity factor $\gamma$. The relation to robust MDPs relies on the property of the ERM to be expressed in its variational form, $\mathrm{ERM}_\gamma(X) = \sup_{q} \big( \mathbb{E}_q[X] - \tfrac{1}{\gamma} D_{\mathrm{KL}}(q \,\|\, p) \big)$, where $p$ is the nominal probability mass function of $X$ and the supremum ranges over a set of probability mass functions $q$ whose support is contained in the support of $p$ [67,160]. Osogami [67] then derives that risk-sensitive MDPs minimizing the expected exponential utility are equivalent to robust MDPs with a corresponding worst-case objective. This formulation has also been extended to simultaneous transition and reward uncertainty [67]. Even though there exist restrictions on the involved uncertainty sets, as in the MaxEnt RL approach, this connection of risk-sensitivity and robustness provides efficient and easier-to-solve alternative algorithms to achieve parameter robustness.
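The entropic risk measure underlying this equivalence can be sketched directly from samples; gamma is the risk-sensitivity factor:

```python
import numpy as np

def entropic_risk(samples, gamma):
    """Entropic risk measure sketch:
    ERM_gamma(X) = (1/gamma) * log E[exp(gamma * X)].
    For costs, gamma > 0 weights bad outcomes more heavily
    (risk-averse); as gamma -> 0 the measure approaches the plain
    expectation (risk-neutral)."""
    x = np.asarray(samples, dtype=float)
    return float(np.log(np.mean(np.exp(gamma * x))) / gamma)
```

For a positive risk-sensitivity factor the measure exceeds the sample mean of a cost distribution, which is precisely the conservative, worst-case-leaning behavior the robust formulation captures.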