A Comprehensive Survey on Safe Reinforcement Learning

Paper title: A Comprehensive Survey on Safe Reinforcement Learning (2015)

● Authors: Javier Garcia, Fernando Fernandez

● Link: http://www.jmlr.org/papers/volume16/garcia15a/garcia15a.pdf

1 Abstract

Definition of Safe Reinforcement Learning (Safe RL)

maximize the expectation of the return in problems in which it is important to ensure reasonable system performance
or/and respect safety constraints during the learning and/or deployment processes.

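In short: maximize the return in problems where reasonable system performance must be ensured, while respecting safety constraints during learning and/or deployment.

Two approaches that follow from this definition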

the modification of the optimality criterion, the classic discounted finite/infinite horizon, with a safety factor.
the modification of the exploration process through the incorporation of external knowledge or the guidance of a risk metric

Optimization criterion -> How can a safety factor be built into the objective so that what we optimize is itself safer?

Exploration process -> How can the agent explore more safely?

2 Conclusions

Safe RL is used for control problems in which respecting safety constraints is important.

The paper discusses two fundamental trends:

optimization criterion
exploration process
It also covers the advantages and drawbacks of each.

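With the rapid growth in the number of robots, the techniques used for learning tasks in robotics must be safe, so Safe RL techniques like these are essential.

The reason is that parameters learned in simulation do not transfer directly to the real (physical) world. In particular, heavy optimization in simulation tends to exploit the simulator's unavoidable simplifications, so the gap between simulation and application naturally widens.
In addition, an autonomous robotic controller has to deal with many factors: not only the complexity of the environment but also the robot's mechanical system and electrical characteristics.
In conclusion, it is important to develop learning algorithms that can be applied directly on the robot, such as Safe RL algorithms that reduce the amount of damage (risk).

Although Safe RL has proven to be a successful tool for learning policies that take risk into account, many areas still need further research.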

3 Introduction

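In RL, an agent observes the state of the environment and acts to maximize its long-term return.

However, in environments where the agent's safety is particularly important, researchers have recently paid close attention not only to long-term reward maximization but also to damage avoidance.

Safety is the opposite of risk. The concept has been studied in many forms in the RL literature, though it does not always appear explicitly. In many works, risk is related to the following:

The stochasticity of the environment.
In many environments, a policy that is optimal with respect to the return may still perform poorly in some cases.
The inherent uncertainty of the environment (i.e., its stochastic nature).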

Two lines of research address this risk:

optimization criterion

Transform the criterion to account for the probability of entering error states.
Transform the temporal differences so that unexpectedly bad events are weighted more heavily.

exploration/exploitation strategies

Heuristics that rely on statistics collected by sampling the environment.
A random exploratory component (e.g., $\epsilon$-greedy), used to explore the state space efficiently.
Prior knowledge (e.g., from a teacher) -> cannot cover every risky domain -> still an effective way to avoid most dangerous or catastrophic states.
Other work on the exploration process:

A risk metric based on temporal differences.
A risk metric based on a weighted sum of an entropy measure and the expected return.

What follows therefore focuses on the two fundamental tendencies of Safe RL algorithms:

Transforming the optimization criterion.
Modifying the exploration process, by either:

incorporating external knowledge, or
using a risk metric.

In the end, though, changing the optimization criterion also changes how the agent explores, so the two tendencies overlap: a modified criterion can be seen as implicitly modifying the exploration process.

Outline of the paper (section numbers are the paper's own)

Section 2 -> Presents an overview and a categorization of the Safe RL algorithms in the literature.
Section 3 -> Examines the methods based on transforming the optimization criterion.
Section 4 -> Considers the methods that modify the exploration process through prior knowledge or a risk metric.
Section 5 -> Discusses the surveyed methods and identifies open areas of research for future work.
Section 6 -> Concludes.

4 Overview of Safe RL

Optimization Criterion

The goal of an RL algorithm is to find an optimal control policy: a function that specifies an action or a strategy for each state of the system so as to optimize some criterion.

The criterion may be to minimize time or any other cost metric, or to maximize reward.

The return goes by many names: the expected return, expected sum of rewards, cumulative reward, cumulative discounted reward, or simply return. Here I will just call it the return.

The Optimization Criterion can be broken down further.

Worst Case Criterion

A policy is considered optimal if it has the maximum worst-case return.
It is used to mitigate the effects of the variability induced by a given policy.

Variability leads to risk or undesirable situations.
Variability arises from two types of uncertainty:

inherent uncertainty, related to the stochastic nature of the system
parameter uncertainty, related to the parameters of the MDP

Risk-Sensitive Criterion

Uses a parameter that controls the sensitivity to risk.
The criterion is transformed into an exponential utility function or into a linear combination of return and risk.

Here risk is defined as the variance of the return or as the probability of entering an error state.
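To make the two transformations concrete, here is a minimal sketch, assuming we already have a sample of returns from rollouts of each policy. The function names, the sample data, and the parameters `beta` and `omega` are illustrative assumptions of mine, not taken from the paper.

```python
import math
from statistics import mean, pvariance


def exponential_utility(returns, beta):
    """(1/beta) * log E[exp(beta * R)], estimated from sampled returns.

    With beta < 0 bad outcomes are penalized heavily (risk-averse).
    """
    return math.log(mean(math.exp(beta * r) for r in returns)) / beta


def mean_variance(returns, omega):
    """E[R] - omega * Var[R]: a linear combination of return and risk,
    with risk taken to be the variance of the return."""
    return mean(returns) - omega * pvariance(returns)


# Two hypothetical policies with the same mean return (1.0).
safe = [1.0, 1.0, 1.0, 1.0]     # no variability
risky = [4.0, 4.0, 4.0, -8.0]   # occasionally very bad

for name, sample in (("safe", safe), ("risky", risky)):
    print(name, exponential_utility(sample, beta=-0.5), mean_variance(sample, omega=0.1))
```

A risk-neutral agent would be indifferent between the two policies, while both risk-sensitive scores rank `safe` above `risky`.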

Constrained Criterion

Maximize the return subject to one or more constraints.
That is, maximize the return while keeping other expected measures higher (or lower) than some given bounds.
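As a rough illustration of the constrained criterion, the sketch below picks, from a handful of candidate policies, the one with the highest expected return among those whose expected cost stays under a bound. The candidate list, its numbers, and the name `best_feasible_policy` are hypothetical; a real algorithm would search over policies rather than enumerate them.

```python
def best_feasible_policy(candidates, cost_bound):
    """candidates: list of (name, expected_return, expected_cost).

    Returns the name of the highest-return candidate whose expected
    cost does not exceed cost_bound, or None if none is feasible.
    """
    feasible = [c for c in candidates if c[2] <= cost_bound]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c[1])[0]


# Hypothetical numbers: expected cost could be, e.g., the probability of a crash.
candidates = [
    ("aggressive", 10.0, 0.30),
    ("balanced", 8.0, 0.08),
    ("cautious", 5.0, 0.01),
]

print(best_feasible_policy(candidates, cost_bound=0.10))
```

Tightening or loosening `cost_bound` changes which policy wins, which is the whole point of the criterion: the return is only compared among policies that satisfy the constraint.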

Other Optimization Criteria

Uses optimization criteria that come from financial engineering:

r-squared, value-at-risk (VaR), or the density of the return
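Of these, value-at-risk is the easiest to sketch. Below is a minimal empirical estimate, assuming VaR at level `alpha` is read as the lower `alpha`-quantile of sampled returns. Quantile conventions differ between sources, so the indexing rule here is one assumption among several reasonable ones, and the sample data is made up.

```python
import math


def value_at_risk(returns, alpha):
    """Empirical lower alpha-quantile of the returns: roughly, the return
    that the worst alpha-fraction of outcomes does not exceed."""
    ordered = sorted(returns)
    index = max(math.ceil(alpha * len(ordered)) - 1, 0)
    return ordered[index]


sampled_returns = [7, 3, -10, 5, 0, 6, -2, 4, 1, 2]
print(value_at_risk(sampled_returns, alpha=0.2))
```

Unlike the mean, this number only looks at the lower tail, so two policies with the same expected return can have very different VaR.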

Next, let us look at the Exploration Process in more detail.

Most RL algorithms learn without external knowledge; that is, the agent learns on its own through trial and error.
They usually use $\epsilon$-greedy.
With this strategy, the agent gathers information about the task by exploring the state and action spaces at random.

However, a randomized exploration strategy wastes a lot of time exploring irrelevant regions of the state and action spaces.
It can also lead the agent into undesirable states.
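For reference, a minimal $\epsilon$-greedy sketch is shown below (the function name and the list-of-Q-values representation are my own choices). Note that the random branch picks uniformly among all actions, with no regard for how dangerous each one is, which is exactly the weakness described above.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon,
    otherwise the action with the highest estimated value."""
    if rng.random() < epsilon:
        # Blind exploration: every action is equally likely, safe or not.
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


q_values = [0.1, 0.7, -5.0]   # the last action is clearly bad
action = epsilon_greedy(q_values, epsilon=0.1)
print(action)
```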

There are therefore two ways to avoid risky situations.

External Knowledge (three ways of incorporating external knowledge)

Providing Initial Knowledge (a bit like inverse RL: information is provided before learning and used to bootstrap it)

Information gathered from a teacher or from previous experience on the task is used to give the learning algorithm its initial knowledge (provided before learning starts).
This information is used to bootstrap the learning algorithm.
Thanks to this initialization, the agent can switch to Boltzmann or fully greedy exploration.
This cuts down the time needed for random exploration.
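A minimal sketch of this idea, under assumptions of my own: the teacher's knowledge is reduced to one preferred action per state, the Q-table is seeded with a `bonus` for that action, and the agent then explores with a Boltzmann (softmax) distribution instead of uniform random actions. None of these names come from the paper.

```python
import math


def init_q_from_teacher(n_states, n_actions, teacher_action, bonus=1.0):
    """Seed a Q-table so the teacher's action starts out preferred in each state."""
    q = [[0.0] * n_actions for _ in range(n_states)]
    for s in range(n_states):
        q[s][teacher_action[s]] = bonus
    return q


def boltzmann_probs(q_values, temperature):
    """Softmax over Q-values; a lower temperature means greedier behavior."""
    top = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


q = init_q_from_teacher(n_states=2, n_actions=3, teacher_action=[1, 0])
print(boltzmann_probs(q[0], temperature=0.5))
```

Because the table already encodes a sensible preference, exploration concentrates around the teacher's suggestions from the first step, rather than wandering uniformly.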

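Deriving a policy from a finite set of demonstrations (pure imitation learning)

Similar to the idea above, a set of information provided by a teacher can be used to derive a policy.
The information that random exploration would have provided is replaced by information provided by the teacher.
The difference from the idea above is that the external knowledge is not used to bootstrap the learning algorithm. (important)
Instead, it is used to learn a model. (imitation learning)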
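The simplest possible version of deriving a policy from demonstrations is a tabular majority vote, sketched below. This is only an illustration with made-up states and actions; practical methods fit a model that generalizes to states the teacher never visited.

```python
from collections import Counter, defaultdict


def policy_from_demonstrations(demos):
    """demos: iterable of (state, action) pairs recorded from a teacher.

    Returns a dict mapping each demonstrated state to the action the
    teacher chose most often there. No exploration is involved.
    """
    counts = defaultdict(Counter)
    for state, action in demos:
        counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


demos = [("s0", "left"), ("s0", "left"), ("s0", "right"), ("s1", "up")]
policy = policy_from_demonstrations(demos)
print(policy)
```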

Providing Teacher Advice (the agent and the teacher interact, and advice is given only when asked for!)

This approach assumes that a teacher is available to the learning agent.
The teacher may be a human or a simple controller. (It does not have to be an expert.)
The teacher shares the goal and provides the agent with actions or information.
The agent and the teacher interact throughout the learning process.
The teacher provides advice when the agent explicitly asks for it.

Risk-directed Exploration

A metric that measures risk is used to determine the probabilities of selecting the different actions during exploration.
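One way to picture risk-directed exploration is to subtract a weighted risk estimate from each action's value before turning the scores into selection probabilities. The scoring rule `q - risk_weight * risk` and the softmax are assumptions made for this sketch, not the specific metrics surveyed in the paper.

```python
import math


def risk_directed_probs(q_values, risks, risk_weight, temperature=1.0):
    """Action-selection probabilities that shrink as the risk estimate grows."""
    scores = [q - risk_weight * r for q, r in zip(q_values, risks)]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Two actions with equal value, but the second has a higher risk estimate.
print(risk_directed_probs([1.0, 1.0], risks=[0.0, 2.0], risk_weight=1.0))
```

With `risk_weight=0` this reduces to plain Boltzmann exploration; increasing it steers exploration away from actions flagged as risky.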

5 Modifying the Optimization Criterion

I only summarize the parts I want to keep notes on, so in this section I cover just the Worst Case Criterion.

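Definition of the Risk-Neutral Criterion

In risk-neutral control, the objective is to compute (learn) a control policy that maximizes the expectation of the return. This is the standard RL objective; it is called risk-neutral because it takes no account of risk, and the criteria below depart from it by bringing risk in.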

Worst-Case Criterion

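Definition of the Worst-Case or Minimax Criterion under Inherent Uncertainty

In worst-case or minimax control, the objective is to compute (learn) a control policy that maximizes the expectation of the return with respect to the worst case:

$$\max_{\pi \in \Pi} \min_{w \in \Omega^\pi} \mathbb{E}_{\pi, w}(R)$$

My own reading: the worst case picks, among the trajectories a policy can produce, the one with the lowest return, and the policy is chosen to maximize that minimum.

Here $\Omega^\pi$ is the set of trajectories of the form $s_0, a_0, s_1, a_1, \ldots$ that can occur under policy $\pi$.
$\mathbb{E}_{\pi, w} (\cdot)$ denotes the expectation with respect to policy $\pi$ and trajectory $w$.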
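A toy illustration of the difference between the risk-neutral and the worst-case choice, assuming each policy can produce only a few trajectories, all equally likely, whose returns we already know. The policy names and numbers are made up.

```python
from statistics import mean


def risk_neutral_policy(returns_by_policy):
    """Pick the policy with the highest average trajectory return."""
    return max(returns_by_policy, key=lambda name: mean(returns_by_policy[name]))


def maximin_policy(returns_by_policy):
    """Pick the policy whose worst trajectory return is highest."""
    return max(returns_by_policy, key=lambda name: min(returns_by_policy[name]))


# Returns of the trajectories each hypothetical policy can generate.
returns_by_policy = {
    "fast": [10.0, 9.0, -1.0],
    "careful": [6.0, 5.0, 4.0],
}

print(risk_neutral_policy(returns_by_policy), maximin_policy(returns_by_policy))
```

Here `fast` has the better average but the worse worst case, so the two criteria disagree: the minimax agent gives up some expected return to rule out the bad trajectory.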

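Besides the objective function, the minimax criterion can also be applied within Q-learning. A variant of standard Q-learning called $\beta$-pessimistic Q-learning is a compromise between the extreme optimism of standard Q-learning and the extreme pessimism of the minimax approach. The update is:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \left( (1 - \beta) \max_{a} Q(s_{t+1}, a) + \beta \min_{a} Q(s_{t+1}, a) \right) - Q(s_t, a_t) \right)$$

where $\alpha$ is the learning rate and $\gamma$ the discount factor. Here $\beta \in [0,1]$: $\beta = 0$ recovers standard Q-learning and $\beta = 1$ the pure minimax update. Reportedly, setting $\beta$ to 0.5 gave the most pessimistic behavior, while 0.1 found the safe path.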
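The update can be sketched in a few lines for a tabular Q-function. The sketch assumes the target mixes the best and worst next-state values as `(1 - beta) * max + beta * min`; the table layout and the function name are my own.

```python
def beta_pessimistic_update(q, s, a, r, s_next, alpha, gamma, beta):
    """One tabular update. beta=0 gives the usual Q-learning target,
    beta=1 backs up the worst next-state value (pure pessimism)."""
    best = max(q[s_next])
    worst = min(q[s_next])
    target = r + gamma * ((1 - beta) * best + beta * worst)
    q[s][a] += alpha * (target - q[s][a])
    return q[s][a]


def make_q():
    # State 1 has one good action (+1) and one bad action (-1).
    return [[0.0, 0.0], [1.0, -1.0]]


for beta in (0.0, 0.1, 0.5, 1.0):
    q = make_q()
    print(beta, beta_pessimistic_update(q, s=0, a=0, r=0.0, s_next=1,
                                        alpha=1.0, gamma=1.0, beta=beta))
```

The only change from ordinary Q-learning is the target: as `beta` grows, the agent increasingly assumes that the worst available action will be taken in the next state.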

Definition of the Worst-Case or Minimax Criterion under Parameter Uncertainty

$$\max_{\pi \in \Pi} \min_{p \in P} \mathbb{E}_{\pi, p}(R)$$

Here $P$ is a set of possible transition matrices (the uncertainty set).
The objective is to maximize the expectation of the return under the worst case over all possible models $p \in P$.
$\mathbb{E}_{\pi, p} (\cdot)$ denotes the expectation with respect to policy $\pi$ and transition model $p$.

In this case there are two things to choose, $\pi$ and $p$: $p$ plays the worst-case model that minimizes the return, and $\pi$ maximizes that minimized return.
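A small robust value iteration sketch for a finite uncertainty set is given below. It makes a simplifying assumption that is not stated in the post: the adversary may pick a different model at every state-action pair (a so-called rectangular uncertainty set), which lets the min be taken inside the Bellman backup. The three-state MDP and all numbers are made up.

```python
def robust_value_iteration(models, rewards, gamma=0.9, iters=200):
    """models[m][s][a] is a list of next-state probabilities under model m;
    rewards[s][a] is the immediate reward. Returns worst-case state values."""
    n_states = len(rewards)
    v = [0.0] * n_states
    for _ in range(iters):
        v = [
            max(  # the agent picks the best action...
                min(  # ...against the least favorable model
                    rewards[s][a]
                    + gamma * sum(p * v[s2] for s2, p in enumerate(m[s][a]))
                    for m in models
                )
                for a in range(len(rewards[s]))
            )
            for s in range(n_states)
        ]
    return v


# States: 0 = start, 1 = ok (reward 1 forever), 2 = crash (reward 0 forever).
# In state 0, action 0 is "safe" and action 1 is "risky".
rewards = [[0.0, 1.0], [1.0], [0.0]]
model_a = [[[0.0, 0.9, 0.1], [0.0, 1.0, 0.0]], [[0.0, 1.0, 0.0]], [[0.0, 0.0, 1.0]]]
model_b = [[[0.0, 0.9, 0.1], [0.0, 0.5, 0.5]], [[0.0, 1.0, 0.0]], [[0.0, 0.0, 1.0]]]

print(robust_value_iteration([model_a], rewards)[0])           # nominal model only
print(robust_value_iteration([model_a, model_b], rewards)[0])  # worst case over both
```

Under `model_a` alone the risky action looks best, but once `model_b` is allowed, its worst-case value drops below that of the safe action, so the robust solution switches to the safe action and reports a lower, guaranteed value.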

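What do we gain from knowing the worst case? Let me sum it up.

Knowing the worst case tells us how much variability risky or undesirable situations can cause, and it is used to mitigate the effect of the variability induced by a given policy. Variability here means the variability of the return in a stochastic environment.
It also gives an effective lower bound on the value, or an upper bound on the total cost.
In the end, which criterion to use depends on the situation. In an environment where all that matters is avoiding the worst outcome, computing just the worst-case or minimax criterion is convenient. An example: when a car in the next lane cuts in front of mine, the cut-in trajectory will differ every time. Some cut in gently, others sharply. The worst case is the sharpest cut-in, which leaves the smallest distance to my car. If I can avoid that case, I do not need to think about the gentler ones.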

6 Modifying the Exploration Process

Again, I only summarize the parts I want to keep notes on. Here I am curious about Providing Teacher Advice, so I cover only that.

Using Teacher Advice

Exploring the environment while avoiding fatal states is important for learning, because a bad decision can lead the agent into a dangerous situation.
Teacher advice was devised to address this problem, and there are two ways in which a teacher makes safe exploration possible.

The teacher guides the learner toward the good parts of the state space suggested by the teacher's policy. This reduces the sample complexity of learning.
The teacher gives the learner advice (e.g., safe actions) whenever the learner or the teacher considers it necessary to prevent a catastrophic situation.

Definition of Teacher Advising

Any external entity that can provide input to the control algorithm, which the agent can use to make decisions or to modify the progress of its exploration.

The teacher generally receives the same state as the learner, and either the learner or the teacher decides when it is appropriate for the teacher to give advice. However, the states the two actually observe (the agent $s$, the teacher $s'$) may differ.

Advice can also take several forms:

a single action for the learner to take
a complete sequence of actions that the learner agent can deliberately replay
a reward used to judge the agent's behavior interactively
a set of actions from which the agent must choose one, at random or greedily

The framework for teacher advice covered in the paper is organized around either requesting or receiving advice.

The Learner Agent Asks for Advice

Here the learner agent uses a confidence parameter. When its confidence in a state is low, the learner agent asks the teacher for advice. Typically the advice is an action.
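A minimal sketch of the ask-for-advice loop, assuming confidence is measured as the gap between the two highest Q-values in the current state. That measure, the threshold, and the function name are illustrative assumptions; the surveyed methods define confidence in their own ways.

```python
def choose_action(q_values, teacher_action, confidence_threshold):
    """Return (action, asked_teacher).

    If the learner cannot clearly separate its best action from the
    runner-up, it asks the teacher and follows the advised action.
    """
    ranked = sorted(q_values, reverse=True)
    confidence = ranked[0] - ranked[1]
    if confidence < confidence_threshold:
        return teacher_action, True
    greedy = max(range(len(q_values)), key=q_values.__getitem__)
    return greedy, False


print(choose_action([1.0, 0.95], teacher_action=1, confidence_threshold=0.1))
print(choose_action([1.0, 0.2], teacher_action=1, confidence_threshold=0.1))
```

As learning progresses and the value estimates separate, the agent asks less often, so the teacher's effort is concentrated on the states where the learner is genuinely unsure.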

The Teacher Provides Advice

In this approach, the teacher provides actions or information whenever the teacher judges that help is needed.
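The teacher-initiated case can be sketched as a wrapper that inspects the learner's proposed action and substitutes a safe one when it looks dangerous. The one-dimensional "cliff" world, the predicate `is_unsafe`, and the helper `safe_action` are all hypothetical stand-ins for whatever knowledge the teacher has.

```python
def act_with_teacher(state, learner_action, is_unsafe, safe_action):
    """Return (action, teacher_intervened).

    The teacher stays silent unless it judges the learner's action unsafe.
    """
    if is_unsafe(state, learner_action):
        return safe_action(state), True
    return learner_action, False


# Hypothetical 1-D world: positions are integers and position 0 is a cliff.
def is_unsafe(position, move):
    return position + move <= 0


def safe_action(position):
    return 1  # step away from the cliff


print(act_with_teacher(1, -1, is_unsafe, safe_action))
print(act_with_teacher(3, -1, is_unsafe, safe_action))
```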
