
[Optimal Control] Paper Review: A Comprehensive Survey on Safe Reinforcement Learning


0. Title

- A Comprehensive Survey on Safe Reinforcement Learning

1. Authors

- Javier García and Fernando Fernández

2. Abstract

- Definition of Safe RL: the process of learning policies that maximize the return in problems where it is important to ensure reasonable system performance, while respecting safety constraints during learning.

 

- Two approaches following from this definition of Safe RL:
1. Optimization criterion: how can we optimize safely by incorporating a safety factor into the criterion?
2. Exploration process: how can the agent explore more safely?

3. Motivation

- Safe RL is used in control problems where satisfying safety constraints is critical.
- With the recent progress in robotics, control based on reinforcement learning is widely used, and because learning-based controllers must behave safely, Safe RL is essential.

4. Contributions (Findings)

5. Methodology

Worst-Case Criterion: The first criterion is based on the Worst-Case Criterion, where a policy is considered to be optimal if it has the maximum worst-case return (Section 3.1). This criterion is used to mitigate the effects of the variability induced by a given policy, since this variability can lead to risky or undesirable situations. This variability can be due to two types of uncertainty: the inherent uncertainty related to the stochastic nature of the system (Heger, 1994b,a; Gaskett, 2003), and the parameter uncertainty that arises when some of the parameters of the MDP are not known exactly (Nilim and El Ghaoui, 2005; Tamar et al., 2013).
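To make the idea concrete, here is a small Python sketch of my own (not code from the survey, and not Heger's exact Q-hat-learning): alongside the usual expected-return Q-learning update, each state-action pair also keeps the most pessimistic one-step backup it has ever observed. The two-action toy MDP is invented for illustration.

```python
import random
from collections import defaultdict

# Rough sketch of a worst-case style backup: in addition to the standard
# expected-return Q-learning estimate, each (state, action) remembers the
# worst one-step backup ever observed. The toy MDP is purely illustrative.

GAMMA, ALPHA, ACTIONS = 0.9, 0.1, ["safe", "risky"]

def step(action):
    # "risky" has the higher expected reward but a rare large penalty
    if action == "safe":
        return 1.0, "s0"
    return (3.0 if random.random() < 0.9 else -5.0), "s0"

Q = defaultdict(float)                      # expected-return estimate
Q_hat = defaultdict(lambda: float("inf"))   # worst-case estimate

random.seed(0)
for _ in range(50_000):
    s = "s0"
    a = random.choice(ACTIONS)
    r, s_next = step(a)
    # standard Q-learning update (expected return)
    Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS) - Q[(s, a)])
    # worst-case update: never forget the worst observed one-step backup
    worst_next = max(Q_hat[(s_next, b)] for b in ACTIONS)
    worst_next = 0.0 if worst_next == float("inf") else worst_next
    Q_hat[(s, a)] = min(Q_hat[(s, a)], r + GAMMA * worst_next)

print("expected-return Q:", {k: round(v, 1) for k, v in Q.items()})
print("worst-case Q_hat :", {k: round(v, 1) for k, v in Q_hat.items()})
# The expected-return estimate ends up higher for "risky", while the
# worst-case estimate clearly prefers "safe" because of the rare penalty.
```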

 

Risk-Sensitive Criterion: In other approaches, the optimization criterion is transformed so as to reflect a subjective measure balancing the return and the risk. These approaches are known as risk-sensitive approaches and are characterized by the presence of a parameter that allows the sensitivity to the risk to be controlled (Section 3.2). In these cases, the optimization criterion is transformed into an exponential utility function (Howard and Matheson, 1972), or a linear combination of return and risk, where risk can be defined as the variance of the return (Markowitz, 1952; Sato et al., 2002), or as the probability of entering into an error state (Geibel and Wysotzki, 2005).
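A minimal sketch of the linear return/risk combination, assuming a variance penalty weighted by a sensitivity parameter beta (the two "policies" are just hard-coded return distributions I made up; the exponential-utility variant of Howard and Matheson would replace the score with a different transformation):

```python
import random
import statistics

# Risk-sensitive criterion of the form  J(policy) = E[R] - beta * Var[R].
# The sensitivity parameter beta controls how much variance is penalized.

random.seed(0)

def sample_return(policy):
    if policy == "conservative":
        return random.gauss(5.0, 1.0)   # modest return, low variance
    return random.gauss(7.0, 6.0)       # "aggressive": higher return, high variance

def risk_sensitive_value(policy, beta, n=10_000):
    returns = [sample_return(policy) for _ in range(n)]
    return statistics.fmean(returns) - beta * statistics.pvariance(returns)

for beta in (0.0, 0.2, 1.0):
    scores = {p: round(risk_sensitive_value(p, beta), 2) for p in ("conservative", "aggressive")}
    best = max(scores, key=scores.get)
    print(f"beta={beta:.1f}  scores={scores}  ->  prefer {best}")
# As beta grows, the criterion becomes more risk-averse and the conservative
# policy is preferred despite its lower expected return.
```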

 

Constrained Criterion: The purpose of this objective is to maximize the return subject to one or more constraints resulting in the constrained optimization criterion (Section 3.3). In such a case, we want to maximize the return while keeping other types of expected measures higher (or lower) than some given bounds (Kadota et al., 2006; Moldovan and Abbeel, 2012a).
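The sketch below illustrates the constrained criterion, maximize E[R] subject to E[C] <= d, via a simple Lagrangian relaxation with dual ascent on the multiplier. The candidate "policies" are hard-coded (expected return, expected cost) pairs of my own invention; a real algorithm would re-optimize the policy at every dual step.

```python
from collections import Counter

# Constrained criterion:  maximize E[R]  subject to  E[C] <= d,
# handled with the Lagrangian  L = E[R] - lambda * (E[C] - d)
# and dual ascent on lambda.

POLICIES = {
    "cautious": (4.0, 0.5),   # (expected return E[R], expected cost E[C])
    "balanced": (6.0, 1.5),
    "reckless": (9.0, 4.0),
}
COST_BOUND = 2.0              # d

lam, lr = 0.0, 0.1
history = []
for _ in range(500):
    # primal step: best policy for the current multiplier
    best = max(POLICIES, key=lambda p: POLICIES[p][0] - lam * (POLICIES[p][1] - COST_BOUND))
    history.append(best)
    # dual step: raise lambda when the constraint is violated, lower it otherwise
    lam = max(0.0, lam + lr * (POLICIES[best][1] - COST_BOUND))

print("final lambda ~", round(lam, 2))
print("policy selected most often at the end:", Counter(history[-100:]).most_common(1)[0][0])
# "reckless" has the highest return but violates E[C] <= 2.0, so lambda rises
# until the feasible "balanced" policy dominates the Lagrangian.
```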

 

Other Optimization Criteria: Finally, other approaches are based on the use of optimization criteria falling into the area of financial engineering, such as the r-squared, value-at-risk (VaR) (Mausser and Rosen, 1998; Kashima, 2007; Luenberger, 2013), or the density of the return (Morimura et al., 2010a,b) (Section 3.4).
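As a rough illustration of a value-at-risk style criterion (my own toy example, with an invented return distribution), VaR at level alpha is simply an alpha-quantile of the return distribution, so it reacts to the tail that the mean hides:

```python
import random

# Empirical value-at-risk (VaR) from sampled returns.
random.seed(0)

def sample_return():
    # most episodes end well; about 5% end with a large loss
    return random.gauss(10.0, 2.0) if random.random() > 0.05 else random.gauss(-40.0, 5.0)

returns = sorted(sample_return() for _ in range(10_000))

alpha = 0.05
var_alpha = returns[int(alpha * len(returns))]   # empirical 5% quantile of the return
mean_return = sum(returns) / len(returns)

print(f"mean return        : {mean_return:6.2f}")
print(f"VaR at alpha = 0.05: {var_alpha:6.2f}")
# The mean looks comfortable, while the VaR exposes the rare catastrophic
# outcomes that criteria based on the return density are designed to capture.
```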

Providing Initial Knowledge: To mitigate the aforementioned exploration difficulties, examples gathered from a teacher or previous information on the task can be used to provide initial knowledge for the learning algorithm (Section 4.1.1). This knowledge can be used to bootstrap the learning algorithm (i.e., a type of initialization procedure). Following this initialization, the system can switch to a Boltzmann or fully greedy exploration based on the values predicted in the initial training phase (Driessens and Džeroski, 2004). In this way, the learning algorithm is exposed to the most relevant regions of the state and action spaces from the earliest steps of the learning process, thereby eliminating the time needed in random exploration for the discovery of these regions.
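A minimal sketch of this bootstrapping idea, assuming a hypothetical list of teacher transitions and a tabular Q-function (all names and constants are mine, not from the survey): the Q-table is seeded from the teacher's data, and only afterwards does the agent switch to Boltzmann exploration.

```python
import math
import random
from collections import defaultdict

GAMMA, ALPHA, TAU = 0.95, 0.5, 0.5
ACTIONS = ["left", "right"]
Q = defaultdict(float)

# (state, action, reward, next_state) tuples supplied by a teacher/controller
teacher_transitions = [
    ("start", "right", 0.0, "mid"),
    ("mid",   "right", 1.0, "goal"),
]

# 1) initialization phase: replay the teacher's transitions to seed Q
for _ in range(100):
    for s, a, r, s_next in teacher_transitions:
        target = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])

# 2) after initialization, explore with a Boltzmann (softmax) policy
def boltzmann_action(state):
    prefs = [math.exp(Q[(state, a)] / TAU) for a in ACTIONS]
    return random.choices(ACTIONS, weights=prefs)[0]

print({k: round(v, 2) for k, v in Q.items() if v})
print("exploratory action in 'start':", boltzmann_action("start"))
# Exploration now starts from the teacher's preferred regions of the state
# and action spaces instead of purely random behaviour.
```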

 

Deriving a policy from a finite set of demonstrations: In a similar way, a set of examples provided by a teacher can be used to derive a policy from demonstrations (Section 4.1.2). In this case, the examples provided by the random exploration policy are replaced by the examples provided by the teacher. In contrast to the previous category, this external knowledge is not used to bootstrap the learning algorithm, but is used to learn a model from which to derive a policy in an off-line and, hence, safe manner (Abbeel et al., 2010; Tang et al., 2010).
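The sketch below is a schematic stand-in for this category (not the apprenticeship method of Abbeel et al.): a tabular model is estimated from a finite set of teacher demonstrations, and a policy is then derived off-line by value iteration on that learned model, with no on-line exploration. The demonstration data are invented.

```python
from collections import defaultdict

GAMMA = 0.9
demos = [  # (state, action, reward, next_state) provided by a teacher
    ("s0", "a1", 0.0, "s1"), ("s1", "a1", 1.0, "s2"),
    ("s0", "a0", 0.0, "s0"), ("s1", "a0", 0.0, "s0"),
    ("s2", "a0", 0.0, "s2"), ("s2", "a1", 0.0, "s2"),
]

# 1) estimate a model (transition counts, average rewards) from the demos
counts = defaultdict(lambda: defaultdict(int))
rewards = defaultdict(list)
for s, a, r, s_next in demos:
    counts[(s, a)][s_next] += 1
    rewards[(s, a)].append(r)

states = {s for s, _, _, _ in demos} | {s for _, _, _, s in demos}
actions = {a for _, a, _, _ in demos}

def model(s, a):
    total = sum(counts[(s, a)].values())
    if total == 0:                      # unvisited pair: assume self-loop, zero reward
        return {s: 1.0}, 0.0
    return {s2: c / total for s2, c in counts[(s, a)].items()}, sum(rewards[(s, a)]) / total

# 2) derive a policy off-line with value iteration on the learned model
V = {s: 0.0 for s in states}
for _ in range(100):
    for s in states:
        V[s] = max(model(s, a)[1] + GAMMA * sum(p * V[s2] for s2, p in model(s, a)[0].items())
                   for a in actions)

policy = {s: max(actions, key=lambda a: model(s, a)[1] +
                 GAMMA * sum(p * V[s2] for s2, p in model(s, a)[0].items()))
          for s in states}
print("policy derived off-line from demonstrations:", policy)
```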

 

Providing Teacher Advice: Other approaches based on teacher advice assist the exploration during the learning process (Section 4.1.3). They assume the availability of a teacher for the learning agent. The teacher may be a human or a simple controller, but in both cases it does not need to be an expert in the task. At every step, the agent observes the state, chooses an action, and receives the reward with the objective of maximizing the return or another optimization criterion. The teacher shares this goal and provides actions or information to the learner agent. Both the agent and the teacher can initiate this interaction during the learning process. In the ask-for-help approaches (Section 4.1.3.1), the learner agent requests advice from the teacher when it considers it necessary (Clouse, 1997; García and Fernández, 2012). In other words, the teacher only provides advice to the learner agent when it is explicitly asked to. In other approaches (Section 4.1.3.2), it is the teacher who provides actions whenever it feels this is necessary (Thomaz and Breazeal, 2008; Vidal et al., 2013). In another group of approaches (Section 4.1.3.3), the main role in this interaction is not so clear (Rosenstein and Barto, 2004; Torrey and Taylor, 2012).
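A small sketch of the ask-for-help interaction, under my own simplifying assumptions: the learner queries the teacher whenever its Q-values for the current state are too close to tell the actions apart (the confidence test, the threshold, and the toy teacher are all invented for this example).

```python
from collections import defaultdict

ACTIONS = ["left", "right"]
CONF_THRESHOLD = 0.1
Q = defaultdict(float)

def teacher_policy(state):
    return "right"          # stand-in for a human or a simple safe controller

def choose_action(state):
    values = [Q[(state, a)] for a in ACTIONS]
    if max(values) - min(values) < CONF_THRESHOLD:
        return teacher_policy(state), True          # low confidence: ask the teacher
    return max(ACTIONS, key=lambda a: Q[(state, a)]), False

asked = 0
for episode in range(200):
    s = "s0"
    a, from_teacher = choose_action(s)
    asked += from_teacher
    r = 1.0 if a == "right" else -1.0               # toy one-step task
    Q[(s, a)] += 0.1 * (r - Q[(s, a)])

print("teacher queries:", asked, " learned Q:", {k: round(v, 2) for k, v in Q.items()})
# Early on the learner leans on the teacher; once its Q-values separate,
# it stops asking and acts on its own estimates.
```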

 

Risk-directed Exploration: In these approaches a risk measure is used to determine the probability of selecting different actions during the exploration process (Section 4.2) while the classic optimization criterion remains (Gehring and Precup, 2013; Law, 2005).
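To illustrate, here is a sketch in which the exploration probabilities are shaped by a risk measure while the learning update still targets the plain expected return. The risk estimate (a running average of the absolute TD error) and all constants are my own choices, loosely in the spirit of controllability-based risk measures, not the exact method of the cited works.

```python
import math
import random
from collections import defaultdict

GAMMA, ALPHA, TAU, OMEGA = 0.9, 0.1, 0.3, 1.0
ACTIONS = ["steady", "volatile"]
Q = defaultdict(float)          # classic expected-return estimate
risk = defaultdict(float)       # running estimate of |TD error| per (state, action)

def reward(action):
    # same expected reward, much higher spread for "volatile"
    return 1.0 if action == "steady" else random.choice([5.0, -3.0])

def explore(state):
    # risk-directed softmax: risky actions are selected less often
    prefs = {a: (Q[(state, a)] - OMEGA * risk[(state, a)]) / TAU for a in ACTIONS}
    m = max(prefs.values())
    weights = [math.exp(prefs[a] - m) for a in ACTIONS]
    return random.choices(ACTIONS, weights=weights)[0]

random.seed(0)
s = "s0"
for _ in range(20_000):
    a = explore(s)
    r = reward(a)
    td = r + GAMMA * max(Q[(s, b)] for b in ACTIONS) - Q[(s, a)]
    Q[(s, a)] += ALPHA * td                            # unchanged optimization criterion
    risk[(s, a)] += ALPHA * (abs(td) - risk[(s, a)])   # risk measure used only for exploration

print("Q   :", {k: round(v, 2) for k, v in Q.items()})
print("risk:", {k: round(v, 2) for k, v in risk.items()})
# The learning update still targets the expected return, while the exploration
# distribution steers visits away from the high-variability action.
```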

6. Measurements

7. Limitations (if not stated in the paper, think about them)

8. Potential Gap

- Although Safe RL has proved successful for learning policies that take risk into account, there are still many open areas to be studied.
