0. Title
- A Review of Safe Reinforcement Learning: Methods, Theory and Applications
1. Authors
- Shangding Gu et al.
2. Abstract
- When RL is deployed in the real world, e.g., in autonomous driving and robotics scenarios, safety concerns are usually raised, leading to a growing demand for safe RL algorithms.
3. Motivation
(1) Safety Policy. How can we perform policy optimization to search for a safe policy?
(2) Safety Complexity. How much training data is required to find a safe policy?
(3) Safety Applications. What is the up-to-date progress of safe RL applications?
(4) Safety Benchmarks. What benchmarks can we use to fairly and holistically examine safe RL performance?
(5) Safety Challenges. What are the challenges faced in future safe RL research?
4. Contributions (Findings)
-
5. Methodology
- As for the optimisation criteria, several methods treat cost as one of the optimisation objectives to achieve safety (the underlying CMDP formulation is sketched after this list): [24, 36, 109, 116, 126, 173, 201, 240, 264]
- Some methods consider safety in the RL exploration process by leveraging external knowledge: [2, 56, 60, 93, 194, 267, 273]
- Policy-based safe RL: [4, 71, 99, 302, 307, 313],
- Value-based safe RL: [20, 70, 78, 108, 127, 164, 191, 293].
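For context, most of the methods cited in the list above are built on the constrained MDP (CMDP) formulation; a generic statement of the objective (notation assumed here, not copied from the paper) is:

$$
\begin{aligned}
\max_{\pi} \quad & J_r(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right] \\
\text{s.t.} \quad & J_c(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)\right] \le b,
\end{aligned}
$$

where $r$ is the reward, $c$ the cost, and $b$ the cost limit; the cited methods differ mainly in how they enforce the constraint.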
- Model-Based Safe Reinforcement Learning: Model-based DRL methods have better learning efficiency than model-free DRL methods, and there are many scenarios, such as robotics and traffic planning, where they can be applied.
In safe model-based RL with control-theoretic settings, Berkenkamp and Schoellig [27] develop a safe model-based RL algorithm that leverages Lyapunov functions to guarantee stability under the assumption of a Gaussian process prior; their method can ensure a high-probability safe policy for an agent in a continuous setting. However, Lyapunov functions are usually hand-crafted, and it is difficult to find a principled way to construct Lyapunov functions for an agent's safety and performance [56]. Moreover, some safe RL methods are proposed from the perspective of Model Predictive Control (MPC) [84]; e.g., MPC is used to make robust decisions in CMDPs [17] by leveraging a constrained MPC method [182], and a general MPC-based safety framework for decision making is also introduced [84].
Different from primal-dual methods, trust-region optimisation with safety constraints, Lyapunov-based methods, and Gaussian-process-based methods, formal methods [15] normally try to ensure safety without leaving any probability of unsafe behaviour. However, most formal methods rely heavily on model knowledge, might not show better reward performance than other methods, and the verification computation can be expensive for each neural network [15]. More generally, the curse of dimensionality is a hard problem when formal methods are deployed for RL safety [25], since verifying RL safety with formal methods may be intractable in continuous, high-dimensional settings [25].
Model-based safe RL methods present excellent performance on most challenging tasks. Nonetheless, the safety or stability of training may need to be investigated more rigorously, and a unified framework may need to be proposed so that we can better examine safe RL performance.
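To make the model-based idea above concrete, here is a minimal sketch of an MPC-style safety filter: an action proposed by an RL policy is executed only if short rollouts under a learned dynamics model keep the predicted cumulative cost within a budget; otherwise a backup action is used. This is an illustrative sketch under assumed components (`dynamics_model`, `cost_fn`, `policy`, `backup_action` are hypothetical placeholders), not the specific algorithms of [17], [27], or [84].

```python
import numpy as np

def predicted_cost(dynamics_model, cost_fn, state, action, horizon=10):
    """Roll the learned model forward for `horizon` steps and sum the predicted cost.
    `dynamics_model(state, action) -> next_state` and `cost_fn(state, action) -> float`
    are assumed, user-supplied components."""
    total_cost = 0.0
    s, a = np.asarray(state), np.asarray(action)
    for _ in range(horizon):
        total_cost += cost_fn(s, a)
        s = dynamics_model(s, a)
        a = np.zeros_like(a)  # assumption: a conservative "do nothing" continuation
    return total_cost

def safe_action(policy, dynamics_model, cost_fn, backup_action, state, cost_budget=1.0):
    """MPC-style safety filter: keep the policy's proposed action only if the learned
    model predicts it stays within the cost budget; otherwise fall back."""
    proposed = policy(state)
    if predicted_cost(dynamics_model, cost_fn, state, proposed) <= cost_budget:
        return proposed
    return backup_action(state)
```

In practice the continuation actions would come from a safe backup controller and the model's uncertainty would be taken into account, which is where the Gaussian-process and Lyapunov machinery discussed above enters.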
- Model-Free Safe Reinforcement Learning:
Constrained Policy Optimisation (CPO) [4] is the first policy gradient method to solve the CMDP problem. In particular, Function (11) and Function (12) have to be optimised to guarantee a monotonic improvement of the reward while satisfying safety constraints. Their method can almost converge to the safety bound and produces performance comparable to the primal-dual method [55] on some tasks. However, CPO's computation is more expensive than that of PPO-Lagrangian, since it needs to compute the Fisher information matrix and uses a second-order Taylor expansion to optimise its objectives.
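For reference, the kind of trust-region constrained update CPO performs can be written generically as follows (notation assumed here, standing in for the note's Functions (11) and (12)):

$$
\begin{aligned}
\pi_{k+1} = \arg\max_{\pi} \quad & \mathbb{E}_{s \sim d^{\pi_k},\, a \sim \pi}\!\left[ A^{r}_{\pi_k}(s, a) \right] \\
\text{s.t.} \quad & J_c(\pi_k) + \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_k},\, a \sim \pi}\!\left[ A^{c}_{\pi_k}(s, a) \right] \le b, \\
& \bar{D}_{\mathrm{KL}}(\pi \,\|\, \pi_k) \le \delta,
\end{aligned}
$$

where $A^{r}_{\pi_k}$ and $A^{c}_{\pi_k}$ are reward and cost advantages, $b$ is the cost limit, and $\delta$ is the trust-region size. Solving this with a natural-gradient step requires the Fisher information matrix from the second-order expansion of the KL term, which is exactly the extra computation that makes CPO more expensive than PPO-Lagrangian.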
- Theory of Safe RL
1. Primal-Dual Approaches
- A standard way to solve a CMDP problem is the Lagrangian approach, also known as primal-dual policy optimization (problem (24) in the paper); a generic update sketch is given below.
Many canonical algorithms have been proposed to solve problem (24), e.g., [51, 152, 160, 207, 235, 239, 271, 301].
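As an illustration, the Lagrangian relaxation referred to as problem (24) generically takes the form $\min_{\lambda \ge 0} \max_{\theta} \; J_r(\pi_\theta) - \lambda\,\big(J_c(\pi_\theta) - b\big)$, and primal-dual algorithms alternate gradient steps on $\theta$ and $\lambda$. The toy sketch below (a made-up one-parameter Gaussian policy with arbitrary reward and cost, not any of the cited algorithms) shows that alternation:

```python
import torch

# Toy single-parameter Gaussian policy on a bandit-like problem, used only to
# illustrate the alternating primal (policy) and dual (lambda) updates.
theta = torch.zeros(1, requires_grad=True)   # policy parameter
lam = torch.zeros(1)                         # Lagrange multiplier, kept >= 0
cost_limit = 0.1                             # constraint threshold b
policy_lr, lam_lr = 1e-2, 1e-2

def rollout(theta, n=64):
    """Sample actions and return (log_probs, rewards, costs); the reward and
    cost functions here are arbitrary stand-ins."""
    dist = torch.distributions.Normal(theta, 1.0)
    actions = dist.sample((n,)).squeeze(-1)
    log_probs = dist.log_prob(actions.unsqueeze(-1)).squeeze(-1)
    rewards = -(actions - 1.0) ** 2              # prefers actions near 1
    costs = (actions > 0.5).float()              # "unsafe" region beyond 0.5
    return log_probs, rewards, costs

for step in range(500):
    log_probs, rewards, costs = rollout(theta)
    # Primal step: REINFORCE-style ascent on the Lagrangian J_r - lambda * (J_c - b).
    loss = -(log_probs * (rewards - lam * costs)).mean()
    loss.backward()
    with torch.no_grad():
        theta -= policy_lr * theta.grad
        theta.grad.zero_()
        # Dual step: ascend lambda on the constraint violation, project to lambda >= 0.
        lam += lam_lr * (costs.mean() - cost_limit)
        lam.clamp_(min=0.0)
```

The essential pattern is that $\lambda$ grows while the measured cost exceeds the limit, increasingly penalising the policy update, and shrinks back towards zero once the constraint is satisfied.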
2. Constrained Policy Optimization
- Constrained Policy Optimization suggests computing the cost constraint using a surrogate cost function, which evaluates the constraint according to samples collected from the current policy $\pi_{\theta_k}$.
Recent works (e.g., [4, 32, 33, 106, 283, 307]) try to find convex approximations to replace the terms $A_{\pi_{\theta_k}}(s, a)$ and $\bar{D}_{\mathrm{KL}}(\pi_\theta, \pi_{\theta_k})$ in Eqs. (25)-(27).
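A common shape of such convex approximations (notation assumed here, not a quote of Eqs. (25)-(27)) linearises the reward and cost surrogates and takes a quadratic approximation of the KL term around $\theta_k$:

$$
\begin{aligned}
\theta_{k+1} = \arg\max_{\theta} \quad & g^{\top}(\theta - \theta_k) \\
\text{s.t.} \quad & c_k + v^{\top}(\theta - \theta_k) \le 0, \\
& \tfrac{1}{2}\,(\theta - \theta_k)^{\top} H\, (\theta - \theta_k) \le \delta,
\end{aligned}
$$

where $g$ and $v$ are the gradients of the reward and cost surrogates, $H$ is the Hessian of $\bar{D}_{\mathrm{KL}}(\pi_\theta, \pi_{\theta_k})$ at $\theta_k$ (the Fisher information matrix), and $c_k$ is the current constraint violation $J_c(\pi_{\theta_k}) - b$. The resulting problem is convex and can be handled (approximately) through its dual.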
6. Measurements
-
7. Limitations (If it's not written, think about it)
Challenges of Safe RL
1. Human-Compatible Safe RL
- Modern safe RL algorithms rely on humans to state the objectives of safety. While humans understand the dangers, the potential risks are less often noticed.
- Human Preference Statement: Humans may maliciously or unintentionally mis-state their preferences, leading the safe RL agent to behave in unexpected ways.
- Ethics and morality concerns: The decisions made by agents always involve ethical issues. How to leverage different value systems to enable safe agents to make ethical decisions is an open question.
2. Industrial Deployment Standard for Safe RL
- Although a wealth of well-understood safe RL methods and algorithms has been developed in recent years, to our knowledge there is no RL deployment standard for industrial applications, covering technical standards, morality standards, etc. More attention should be paid to such standards, and the RL deployment standard should be aligned between academia and industry.
- Technical standard: We need to think about how much efficiency RL can deliver, how much time and money can be saved using RL methods, which environments RL can handle, how to design cost and reward functions that balance reward, performance and safety, etc.
- Law standard: Human-machine interaction needs to be considered in legal judgements. We need to determine how responsibilities are divided, e.g., do programmers of robots need to take more responsibility, or do robot users need to take more responsibility?
3. Efficient Safety Guarantees in Environments with Large Numbers of Agents
- we don't consider this
8. Potential Gap
-
You can refer to this: https://medium.com/@harshitsikchi/towards-safe-reinforcement-learning-88b7caa5702e