[Optimal Control] Paper Review: Constrained Policy Optimization

0. Title

- Constrained Policy Optimization

1. Authors

- Joshua Achiam, David Held, Aviv Tamar, Pieter Abbeel

2. Abstract

- Systems that physically interact with or around humans should satisfy safety constraints. Recent advances in policy search algorithms have enabled new capabilities in high-dimensional control, but do not consider the constrained setting.

- We propose Constrained Policy Optimization (CPO), the first general-purpose policy search algorithm for constrained reinforcement learning with guarantees for near-constraint satisfaction at each iteration.

- Our method allows us to train neural network policies for high-dimensional control while making guarantees about policy behavior all throughout training. 

3. Motivation

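- Policy search methods for high-dimensional control do not handle constraints, so the paper looks for a theoretical basis for a constrained policy update: new bounds on the difference in returns (or constraint returns) between two arbitrary stochastic policies, expressed in terms of an average divergence between them.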

4. Contributions (Findings)

- The paper proves new bounds on the difference in returns (or constraint returns) between two arbitrary stochastic policies in terms of an average divergence between them. These bounds are what back CPO's guarantee of near-constraint satisfaction at each iteration.
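
For reference, the bounds mentioned above can be written out roughly as follows. This is my transcription of the corollaries as I remember them from the paper, so check the original for the exact statement. Notation (all from the paper, not from these notes): d^π is the discounted state distribution of π, A^π and A^π_{C_i} are the reward and cost advantage functions, D_TV is the total variation divergence, and ε is the worst-case expected advantage of π' over states.

```latex
% Lower bound on reward improvement of \pi' over \pi
\[
J(\pi') - J(\pi) \;\ge\; \frac{1}{1-\gamma}\,
  \mathbb{E}_{s \sim d^{\pi},\; a \sim \pi'}
  \left[ A^{\pi}(s,a)
       - \frac{2\gamma\,\epsilon^{\pi'}}{1-\gamma}\,
         D_{TV}(\pi' \,\|\, \pi)[s] \right],
\qquad
\epsilon^{\pi'} = \max_{s} \left| \mathbb{E}_{a \sim \pi'}\!\left[ A^{\pi}(s,a) \right] \right|
\]

% Upper bound on the change in the i-th constraint cost return
\[
J_{C_i}(\pi') - J_{C_i}(\pi) \;\le\; \frac{1}{1-\gamma}\,
  \mathbb{E}_{s \sim d^{\pi},\; a \sim \pi'}
  \left[ A^{\pi}_{C_i}(s,a)
       + \frac{2\gamma\,\epsilon^{\pi'}_{C_i}}{1-\gamma}\,
         D_{TV}(\pi' \,\|\, \pi)[s] \right],
\qquad
\epsilon^{\pi'}_{C_i} = \max_{s} \left| \mathbb{E}_{a \sim \pi'}\!\left[ A^{\pi}_{C_i}(s,a) \right] \right|
\]
```

The point is that both the reward gain and the cost increase of a new policy can be controlled using only samples from the current policy, as long as the two policies stay close on average.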

5. Methodology

Constrained MDP

A constrained Markov decision process (CMDP) is an MDP augmented with constraints that restrict the set of allowable policies for that MDP.
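
To make the definition concrete: a CMDP adds one or more cost functions C_i with limits d_i, and a policy is allowable when its expected discounted cost return J_{C_i}(π) is at most d_i. Below is a minimal sketch of that feasibility check using a Monte Carlo estimate from sampled trajectories. The function names (`discounted_return`, `estimate_cost_return`, `is_feasible`) and the sampling setup are my own assumptions for illustration, not code from the paper.

```python
from typing import Sequence


def discounted_return(signal: Sequence[float], gamma: float) -> float:
    """Discounted sum of a per-step signal (reward or cost) along one trajectory."""
    total = 0.0
    # Accumulate backwards: G_t = x_t + gamma * G_{t+1}
    for x in reversed(signal):
        total = x + gamma * total
    return total


def estimate_cost_return(cost_trajectories: Sequence[Sequence[float]], gamma: float) -> float:
    """Monte Carlo estimate of J_C(pi): mean discounted cost over sampled trajectories."""
    returns = [discounted_return(costs, gamma) for costs in cost_trajectories]
    return sum(returns) / len(returns)


def is_feasible(cost_trajectories: Sequence[Sequence[float]], gamma: float, limit: float) -> bool:
    """A policy is allowable in the CMDP if its estimated cost return is within the limit d."""
    return estimate_cost_return(cost_trajectories, gamma) <= limit
```

For example, `is_feasible([[1, 1, 1], [0, 0, 0]], gamma=0.5, limit=1.0)` checks whether the average discounted cost over the two sampled trajectories stays within a limit of 1.0. Note that this only constrains the cost in expectation, not on every trajectory.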

 

6. Measurements

- Does CPO succeed at enforcing behavioral constraints when training neural network policies with thousands of parameters?

- How does CPO compare with a baseline that uses primal-dual optimization? Does CPO behave better with respect to constraints?

- How much does it help to constrain a cost upper bound, instead of directly constraining the cost?

- What benefits are conferred by using constraints instead of fixed penalties?
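
To get a feel for the primal-dual baseline in the second question, here is a toy sketch of Lagrangian primal-dual optimization on a scalar problem. It is not the paper's experiment: the "policy" is a single number p in [0, 1], the reward is p - 0.5 p^2, the cost is p, and the cost limit is 0.3. All names and numbers here are assumptions I made up for illustration.

```python
def primal_dual_toy(limit=0.3, primal_lr=0.1, dual_lr=0.1, steps=2000):
    """Gradient ascent on p and projected dual ascent on the multiplier lam.

    Lagrangian (toy, hypothetical): L(p, lam) = (p - 0.5 * p**2) - lam * (p - limit)
    Returns the final p, the final lam, and the largest p seen during training.
    """
    p, lam = 0.0, 0.0
    max_p = p
    for _ in range(steps):
        # Primal step: move p uphill on the Lagrangian, then clip to [0, 1].
        grad_p = 1.0 - p - lam
        p = min(1.0, max(0.0, p + primal_lr * grad_p))
        # Dual step: raise lam while the cost p exceeds the limit, keep lam >= 0.
        lam = max(0.0, lam + dual_lr * (p - limit))
        # Track the worst cost reached at any iteration.
        max_p = max(max_p, p)
    return p, lam, max_p


if __name__ == "__main__":
    p, lam, max_p = primal_dual_toy()
    print(f"final p={p:.4f}, final lam={lam:.4f}, max p during training={max_p:.4f}")
```

In this toy problem the iterates settle at the constrained optimum (p equal to the limit), but p overshoots the limit on the way there because the multiplier only reacts after a violation has happened. That is the kind of behavior the comparison with CPO is probing, since CPO aims for near-constraint satisfaction at every iteration.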

7. Limitations (If it's not written, think about it)

- CPO only guarantees near-constraint satisfaction, so it is unsuitable for use cases where safety must be ensured in every visited state throughout training.

8. Potential Gap

- I'm not sure, since CPO does not fully satisfy the constraints.
