Human-level control through deep reinforcement learning


Here, I'll talk about a red-hot topic: deep reinforcement learning. Since AlphaGo made headlines, many people have been stunned by how smart a machine can be, and the debate over human versus artificial intelligence has come up again. This blog post mainly focuses on one work, Human-level control through deep reinforcement learning, published in Nature. I'll also briefly talk about the recently released open-source toolkit Gym, from OpenAI.

Before digging into reinforcement learning, let's talk about the Markov decision process, aka MDP. An MDP is a model that helps us make good choices. It has several components, including states (s), actions (a), and so on. It rests on one very important assumption: what happens next depends only on the current state and action, not on the earlier history, namely P(s_{t+1} | s_t, a_t).
Here are the components of an MDP:


  1. State(s): the green circle
  2. Q-state: the orange circle
  3. Action(a)
  4. Transition func: T(s,a,s')
  5. Reward: R(s,a,s')
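
To make the components concrete, here is a small sketch of how an MDP could be written down in Python. The two states, the two actions, and every probability and reward below are made up for illustration; only the shapes T(s,a,s') and R(s,a,s') come from the list above.

```python
# A toy two-state MDP. All names and numbers are invented for illustration.
STATES = ["cool", "warm"]
ACTIONS = ["slow", "fast"]

# T[(s, a)] maps each possible next state s' to its probability T(s, a, s').
T = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "warm": 0.5},
    ("warm", "slow"): {"cool": 0.5, "warm": 0.5},
    ("warm", "fast"): {"warm": 1.0},
}

# R[(s, a, s')] is the reward received for that particular transition.
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0,
    ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "warm"): -10.0,
}


def expected_reward(state, action):
    """Average immediate reward of the q-state (state, action)."""
    outcomes = T[(state, action)]
    return sum(prob * R[(state, action, nxt)] for nxt, prob in outcomes.items())
```

A q-state is simply a (state, action) pair: the agent has committed to an action, but the environment has not yet decided which s' it lands in.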

Reinforcement learning (RL for short from here on) is a learning process between an agent and an environment. The agent takes an action that depends on the current state. The environment then gives the agent a reward (either positive or negative) after each transition and tells it the next state, which is used in the next decision step. This concept is quite simple and reasonable, right? In our daily life we make decisions all the time, e.g. deciding when to leave, what to do in our leisure time, and so on.


In RL, we focus on approximating the optimal Q-value Q(s,a): the expectimax utility (accumulated reward) of taking action "a" in state "s" and acting optimally afterwards. The Q-value ties actions to values, so once we know the Q-values we also get the corresponding policy: in each state, pick the action with the highest Q-value. If the term "Q-learning" just flashed through your mind, you already understand why it is called "Q-learning".
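
Here is a minimal sketch of reading a policy off a table of Q-values. The state names, action names, and numbers are hypothetical; the only idea taken from the text is "choose the action with the largest Q-value".

```python
def greedy_policy(q_values):
    """For every state, pick the action with the largest Q-value."""
    policy = {}
    for (state, action), value in q_values.items():
        best_so_far = policy.get(state)
        if best_so_far is None or value > q_values[(state, best_so_far)]:
            policy[state] = action
    return policy


# Hypothetical Q-table keyed by (state, action).
q_table = {
    ("s0", "left"): 0.2, ("s0", "right"): 0.7,
    ("s1", "left"): 1.5, ("s1", "right"): -0.3,
}
policy = greedy_policy(q_table)
```

With these numbers the policy goes right in s0 and left in s1.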

We can use samples to iteratively update our Q-values. The following equation is based on the concept of temporal difference learning. The alpha is the learning rate: it denotes how much we update our Q-value in each iteration.

Q(s,a) ← (1 − α) Q(s,a) + α [r + γ max_a' Q(s',a')]
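
As a rough sketch, one temporal-difference update for a table of Q-values might look like the code below. The discount factor gamma, the default values of alpha and gamma, and the state and action names are my own assumptions for illustration.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Blend the old estimate of Q(state, action) with one new sample."""
    # The sample is the observed reward plus the discounted best next Q-value.
    sample = reward + gamma * max(q[(next_state, a)] for a in actions)
    q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * sample
    return q[(state, action)]


q = defaultdict(float)  # unseen (state, action) pairs start at 0
actions = ["left", "right"]
first = q_update(q, "s0", "right", 1.0, "s1", actions)
second = q_update(q, "s0", "right", 1.0, "s1", actions)
```

With alpha = 0.5 the estimate moves halfway towards each new sample, so seeing the same reward of 1 twice takes Q("s0", "right") from 0 to 0.5 and then to 0.75.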


However, updating a table entry by entry isn't a smart approach, since it treats every state-action pair as a totally separate individual, and in realistic problems the state-action space is so large that it is nearly infinite. Thus, instead of treating each pair as a separate individual, we use a feature-based representation. To put it briefly, we define a function that maps similar state-action pairs to similar values.

To define a function, we can use a naïve linear function to approximate the optimal Q-function.

Image from UCBerkeley CS188
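
A minimal sketch of the linear version, in the style taught in CS188: Q(s,a) is a weighted sum of features, and each weight is nudged in proportion to its feature and to the TD error. The feature values, alpha, and gamma below are assumptions for illustration.

```python
def q_value(weights, features):
    """Linear Q-function: a weighted sum of the features of (s, a)."""
    return sum(w * f for w, f in zip(weights, features))


def approx_q_update(weights, features, reward, next_feature_sets,
                    alpha=0.1, gamma=0.9):
    """Return new weights after one approximate Q-learning step.

    next_feature_sets holds one feature vector per action available in the
    next state; pass an empty list when the next state is terminal.
    """
    best_next = max((q_value(weights, f) for f in next_feature_sets), default=0.0)
    difference = reward + gamma * best_next - q_value(weights, features)
    return [w + alpha * difference * f for w, f in zip(weights, features)]


weights = [0.0, 0.0]
weights = approx_q_update(weights, [1.0, 0.5], reward=1.0, next_feature_sets=[])
```

Because the weights are shared, one update changes the Q-value of every state-action pair that has similar features, which is exactly the generalization we wanted.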


However, we have a better tool: deep learning! Deep learning is very good at approximating complex functions. It has had a lot of success in image recognition, object detection, language translation, speech recognition and so on. So here we resort to a deep neural network. To learn more about the concept of deep learning, you can refer to this paper.

In the paper, Human-level control through deep reinforcement learning, they used a neural network with the following architecture, which looks simpler than the networks used in other domains. However, reinforcement learning is known to be unstable, or even to diverge, when a non-linear function approximator (such as a CNN) is used to represent the Q-value. There are many papers presenting clever ways to avoid this, but we won't go through them all here.
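
To get a feel for how small the network is, the sketch below just does the shape arithmetic. The layer sizes are what I recall from the paper (an 84x84 input of 4 stacked frames, three convolutional layers, then a 512-unit fully connected layer and one output per action), so treat them as an assumption and check them against the paper itself.

```python
def conv_output_size(size, kernel, stride):
    """Side length of a square feature map after a convolution with no padding."""
    return (size - kernel) // stride + 1


# (number of filters, kernel size, stride) for each convolutional layer.
# These values are recalled from the paper, not taken from this post.
CONV_LAYERS = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

size = 84  # the input is assumed to be 84x84 pixels
sizes = []
for filters, kernel, stride in CONV_LAYERS:
    size = conv_output_size(size, kernel, stride)
    sizes.append(size)

# Number of values handed to the fully connected layer.
flattened = size * size * CONV_LAYERS[-1][0]
```

Under these assumptions the feature maps shrink from 84 to 20, 9 and finally 7 pixels per side, leaving 7 x 7 x 64 = 3136 values for the fully connected part.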


Here we are going to talk about two strategies for training the deep Q-network that are mentioned in the paper:

  1. Experience replay: the agent stores what it has been through and continually recalls it, replaying randomly drawn past transitions during training. In this way the agent is frequently reminded which moves were good and which were bad, and the samples it learns from are less correlated, which helps our Q-function converge.
  2. Iterative update of the Q-value: in machine learning, we often use gradient descent to train our model, and in most cases we don't want the model to change or oscillate too frequently, since that can make training unstable or leave us stuck in a poor local minimum. Here the Q-values are adjusted towards target values that are only updated periodically, which keeps the targets from shifting at every step.
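
Experience replay can be sketched as a fixed-size memory that we sample from at random. The class name, the capacity, and the transition layout below are my own choices for illustration, not the paper's code.

```python
import random
from collections import deque


class ReplayBuffer:
    """Keeps the most recent transitions and hands back random mini-batches."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest item when full.
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampling without replacement from everything currently stored.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.add(step, 0, 1.0, step + 1, False)
batch = buffer.sample(2)
```

After five additions with a capacity of three, only the transitions starting in states 2, 3 and 4 remain, and the batch is drawn from those.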

Both of these strategies are now viewed as basic building blocks of DQN.
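
The periodically updated targets can be sketched as two copies of the parameters: an online copy that changes at every training step and a target copy that is only refreshed every few steps. The class name, the sync interval, and the fake "gradient step" below are hypothetical stand-ins.

```python
import copy


class TargetSync:
    """Holds a frozen copy of the parameters and refreshes it periodically."""

    def __init__(self, online_params, sync_every):
        self.online = online_params
        self.target = copy.deepcopy(online_params)
        self.sync_every = sync_every
        self.steps = 0

    def after_update(self):
        """Call once after every training step on the online parameters."""
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.target = copy.deepcopy(self.online)


online = {"w": [0.0]}
sync = TargetSync(online, sync_every=3)
history = []
for step in range(1, 7):
    online["w"][0] += 1.0  # stand-in for a real gradient step
    sync.after_update()
    history.append(sync.target["w"][0])
```

The target copy stays put for two steps and then jumps to the online value on every third step, so the values the Q-network is chasing do not move at every update.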

Introduction to OpenAI Gym

It was quite big news when OpenAI released their open-source toolkit of reinforcement learning environments. When developing a new algorithm, we often need a benchmark; however, since everyone does their work in a different environment, it's difficult to make a fair comparison. Now OpenAI has created a playground with multiple environments where developers can evaluate their algorithms under the same circumstances.

Image from OpenAI Gym


OpenAI Gym consists of two parts:

  1. The gym open-source library: a collection of test problems — environments — that you can use to work out your reinforcement learning algorithms.
  2. The OpenAI Gym service: a site and API allowing people to meaningfully compare performance of their trained agents.
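
Every Gym environment is driven through the same small loop: reset() gives the first observation, and step(action) returns the next observation, the reward, a done flag and an info dict (that is the interface as of the initial release; later versions may differ). So that the sketch runs without installing anything, it uses a tiny made-up environment, CoinFlipEnv, in place of a real one from gym.make.

```python
import random


class CoinFlipEnv:
    """Hypothetical stand-in that mimics the reset/step shape of a Gym environment."""

    def __init__(self, episode_length=5, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.steps = 0
        self.coin = 0

    def reset(self):
        self.steps = 0
        self.coin = self.rng.randint(0, 1)
        return self.coin  # the observation is the current coin face

    def step(self, action):
        # Reward 1 when the action matches the coin that was just observed.
        reward = 1.0 if action == self.coin else 0.0
        self.steps += 1
        self.coin = self.rng.randint(0, 1)
        done = self.steps >= self.episode_length
        return self.coin, reward, done, {}


def run_episode(env, policy):
    """The agent-environment loop: observe, act, collect reward, repeat."""
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward


total = run_episode(CoinFlipEnv(), policy=lambda obs: obs)
```

Because every environment speaks this same interface, the run_episode function would not need to change when the toy environment is swapped for a real one, which is what makes fair comparisons possible.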

It contains a variety of games and tasks:



UC Berkeley CS188 Intro to AI