0. Title
- Safe Exploration in Continuous Action Spaces
1. Authors
- Gal Dalal, Krishnamurthy Dvijotham, Matej Vecerik, Todd Hester, Cosmin Paduraru, Yuval Tassa
2. Abstract
- This paper addresses learning a policy in the presence of hard constraints, i.e., constraints that should never be violated. In this work, the authors directly add to the policy a safety layer that analytically solves an action-correction formulation at each state. The method is closely related to OptLayer by T.-H. Pham et al.
- We demonstrate the efficacy of our approach on new representative physics-based environments, and prevail where reward shaping fails by maintaining zero constraint violations.
3. Motivation
- In many real-world domains, unless safe operation is addressed thoroughly and ensured from the first moment of deployment, RL is deemed incompatible with them.
- In real-world applications, constraints are an integral part of the problem description, and never violating them is often a strict necessity. Therefore, in this work, we define our goal to be maintaining zero constraint violations throughout the whole learning process.
4. Contributions (Findings)
- The safety layer is merely a simple calculation; it is not limited to the currently popular deep policy networks and can be applied to any continuous-control algorithm (not necessarily RL-based).
5. Methodology
We study safe exploration in the context of policy optimization, where at each state, all safety signals $\bar{c}_i(s)$ are upper bounded by corresponding constants $C_i \in \mathbb{R}$: $\bar{c}_i(s_t) \le C_i \;\; \forall i \in [K]$.
Without prior knowledge of its environment, an RL agent initialized with a random policy cannot ensure per-state constraint satisfaction during the initial training stages. For an RL agent to learn to avoid undesired behavior, it would have to violate the constraints enough times for the negative effect to propagate through the dynamic programming scheme.
1. Linear Safety-Signal Model
We do not attempt to learn the full transition model, but solely the immediate-constraint functions $c_i(s, a)$.
The model is taken to be linear in the action: $c_i(s, a) \triangleq \bar{c}_i(s) + g(s; w_i)^\top a$, where $w_i$ are the weights of a NN $g(s; w_i)$. This model is a first-order approximation to $c_i(s, a)$ with respect to $a$; i.e., an explicit representation of the sensitivity of the safety signal to changes in the action, using features of the state.
We train $g(s; w_i)$ by solving $w_i^* = \arg\min_{w_i} \sum_{(s, a, s') \in D} \left( \bar{c}_i(s') - \left( \bar{c}_i(s) + g(s; w_i)^\top a \right) \right)^2$.
This regression trains the network to predict the next-state safety signal $\bar{c}_i(s')$ from the current state and action. In our experiments, to generate $D$ we merely initialize the agent in a uniformly random location and let it perform uniformly random actions for multiple episodes. The episodes terminate when a time limit is reached or upon constraint violation.
Training $g(s;w_i)$ on D is performed once per task as a pretraining phase that precedes the RL training.
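As an illustration of this pretraining phase, here is a minimal sketch (not the authors' code) of the regression on a batch of transitions from $D$, assuming a single constraint; the network architecture, dimensions, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of pretraining g(s; w_i) on the random-action dataset D.
# Assumptions: a single constraint, illustrative dimensions and hyperparameters.
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2                      # assumed sizes
g_net = nn.Sequential(                            # g(s; w_i): state -> sensitivity vector
    nn.Linear(state_dim, 64), nn.ReLU(),
    nn.Linear(64, action_dim),
)
opt = torch.optim.Adam(g_net.parameters(), lr=1e-3)

def pretrain_step(s, c_s, a, c_s_next):
    """One regression step on a batch of (s, c(s), a, c(s')) transitions from D."""
    pred = c_s + (g_net(s) * a).sum(dim=1, keepdim=True)   # c(s) + g(s; w)^T a
    loss = ((c_s_next - pred) ** 2).mean()                 # squared error vs. observed c(s')
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

In practice, one such network would be trained per safety signal, once per task, before RL training begins.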
2. Safety Layer via Analytical Optimization
Denote by $\mu_\theta(s)$ the deterministic action selected by the deep policy network. Then, on top of the policy network, we compose an additional, last layer whose role is to solve (3): $\arg\min_a \frac{1}{2}\|a - \mu_\theta(s)\|^2$ s.t. $c_i(s, a) \le C_i \;\; \forall i \in [K]$.
To solve (3), we now substitute our linear model for $c_i(s, a)$ and obtain the quadratic program (4): $a^* = \arg\min_a \frac{1}{2}\|a - \mu_\theta(s)\|^2$ s.t. $\bar{c}_i(s) + g(s; w_i)^\top a \le C_i \;\; \forall i \in [K]$.
Thanks to the positive-definite quadratic objective and linear constraints, we can now find the global solution to this convex problem. To solve it one can implement an in-graph iterative QP-solver.
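Before applying the closed-form shortcut below, the general multi-constraint QP (4) can also be handed to an off-the-shelf solver; the following sketch uses cvxpy (my choice for illustration, not the paper's in-graph solver), with illustrative names and shapes.

```python
# Sketch: solving the action-correction QP (4) with an off-the-shelf solver (cvxpy),
# without the single-active-constraint assumption. Shapes and names are illustrative.
import cvxpy as cp

def safe_action_qp(mu, G, c_bar, C):
    """mu: (d,) policy action; G: (K, d) rows g(s; w_i); c_bar, C: (K,) signals/thresholds."""
    a = cp.Variable(mu.shape[0])
    objective = cp.Minimize(0.5 * cp.sum_squares(a - mu))
    constraints = [c_bar + G @ a <= C]           # linearized safety constraints
    cp.Problem(objective, constraints).solve()
    return a.value
```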
To obtain a closed-form solution, the assumption is that no more than a single constraint is active at a time.
We now provide the closed-form solution to (4): $\lambda_i^* = \left[ \frac{g(s; w_i)^\top \mu_\theta(s) + \bar{c}_i(s) - C_i}{g(s; w_i)^\top g(s; w_i)} \right]^+$ (5) and $a^* = \mu_\theta(s) - \lambda_{i^*}^* \, g(s; w_{i^*})$ (6), where $i^* = \arg\max_i \lambda_i^*$ and $[\cdot]^+$ denotes the positive part.
The solution (6) is essentially a linear projection of the original action onto the "safe" hyperplane with slope $g(s; w_{i^*})$ and intercept $\bar{c}_{i^*}(s) - C_{i^*}$.
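A minimal sketch of this closed-form correction, using the notation above; the function name and array shapes are illustrative assumptions.

```python
# Sketch of the closed-form safety layer (Eqs. (5)-(6)), under the assumption that
# at most one constraint is active at a time. Names and shapes are illustrative.
import numpy as np

def safety_layer(mu, G, c_bar, C, eps=1e-8):
    """mu: (d,) policy action mu_theta(s); G: (K, d) rows g(s; w_i);
    c_bar: (K,) safety signals c_i(s); C: (K,) thresholds C_i."""
    # lambda_i^* = [ (g_i^T mu + c_i(s) - C_i) / (g_i^T g_i) ]^+
    lam = np.maximum((G @ mu + c_bar - C) / (np.sum(G * G, axis=1) + eps), 0.0)
    i_star = int(np.argmax(lam))
    # a^* = mu - lambda_{i*}^* g_{i*}: shift along the most-violated constraint's normal
    return mu - lam[i_star] * G[i_star]
```

If all $\lambda_i^*$ are zero, the action already satisfies the linearized constraints and is returned unchanged.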
6. Experiments & Measurements
We construct D with 1000 random-action episodes per task. We then add our pre-trained safety layer to the policy network. In these domains, an object is located in some feasible bounded region; each constraint therefore lower-bounds the object's distance to one of the region's boundaries. In all simulations, the episode immediately terminates in the case of a constraint violation.
Before exhibiting the performance of our safety layer, we first relate to a natural alternative approach for ensuring safety: manipulate the agent to avoid undesired areas by artificially shaping the reward. This can be done by setting the reward to large negative values in subsets of the state-space.
The most prominent insight from Fig. 6 is that the constraints were never violated with the safety layer. This is true for all 10 seeds of each of the four tasks. Secondly, the safety layer dramatically expedited convergence. For Spaceship, it is, in fact, the only algorithm that enabled convergence. In contrast, without the safety layer, a significant number of episodes ended with a constraint violation and convergence was often not attained. This is due to the nature of our tasks: frequent episode terminations upon boundary crossing impede the learning process in our sparse-reward environments. However, with the safety layer, these terminations never occur, allowing the agent to maneuver as if there were no boundaries.
7. Limitations (If it's not written, think about it)
- Jointly modeling the dynamics of more than a few safety signals with a single g(·; ·) network is a topic requiring careful attention, which the authors leave for future work.
8. Potential Gap
- It would be natural to compare against model predictive control (MPC), which also enforces constraints through online optimization.