- introduction
- Optimal control
- solve the optimal control problem
- Belman equation for reinforcement learning

## introduction

Speaking of reinforcement learning, a key technology which is enable machines to learn automatically with try and error to control a environment is expected to be lead to artificial general intelligence. However, reinforcement learning is not magic. It is cleary fomulated and related to optimal control which is used in Real-World industory.

In this article, I will explain reinforcement learning in relation to optimal control.

## Optimal control

### What is control problem?

In a case of control a speed of a car, the rotational speed of the tires are required to be controlled. However we don't control rotational speed of tires but only handle with the accelerator and brake pedals of a car.

Control problem determines the "inputs"(pushing pedals) to a system that will allow the target system to perform the desired state(speed of car).

### optimal control

Optimal control determine the inputs $\bf u$ with solving optimization problem which minimize an evaluation function $J$ which indicates how the state $\bf x$ should be.

For example, a typecal evaluation is as below.

$$ {\rm L} = ({\mathbf x _ {ref} - \mathbf x )^ {\bf T} \mathbf Q(\mathbf x _ {ref} - \mathbf x) + \mathbf u ^ {\bf T} \mathbf R \mathbf u} $$

Where $\mathbf x _ {ref}$ is a target of $\mathbf x$ in each time, therefore quadratic form of $(\mathbf x _ {ref} - \mathbf x )^ {\bf T} \mathbf Q(\mathbf x _ {ref} - \mathbf x)$ term indicates an error between target and actual state. The second term $\mathbf u ^ {\bf T} \mathbf R \mathbf u $ which is often consume some energy is a penalty of amount of input $\bf u$.

In the fact, $\mathbf x$ and $\mathbf u$ are time dependent signal $\mathbf x=\mathbf x(t)$ and $\mathbf u=\mathbf u(t)$, therefore we should consider sum of $L$ for all time and evaluation function $J$ is represented as below.

$$ J = \int _ 0 ^ \infty \left( {\bf (x _ {ref} - x )^ T Q(x _ {ref} - x) + u ^ T R u} \right) {\rm d} t $$

A solution of a optimization problem $\mathbf u ^ * (t) = {\rm argmin} _ {\mathbf u(t)} J $ is called optimal control rule.

### system equation

Now, I have not mention the relation between input and state which is called system equation yet. Let's back to the example about to control a speed of a car. When an accel pedal is pushed, the car raise up the running speed. What is happening? The explanation of this phenomenon using mathmetical equation is the system equation.

The phisical phenomenon's detail is so difficult that I can't explane here, but amount of pushing accel pedal affect the rate of change of speed, which can be imagend. Therefore that phenomenon is written as below In the interim

$$ \frac {dx}{dt} = f(x, u) $$

where $x$ is speed of a car and $u$ is an amount of pushing accel pedal. In an actual case, this differential equation is need to be given explicitly (Concretely, the function of $f$ need to be explicit. The $f$ is determined when the car is designed, so in actual case, I think system equation is known). If the system equation is not explicit, it is required identified by some methods.

## solve the optimal control problem

If that system equation is known, we are able to control the state with optimal control method, with the two equation

$$ \frac {d \mathbf x}{dt} = f(\mathbf x, \mathbf u) $$

and

$$ J = \int _ 0 ^ \infty \left( {\bf (x _ {ref} - x )^ T Q(x _ {ref} - x) + u ^ T R u} \right) {\rm d} t $$

The state $\bf x$ can be known using system equation when some $u$ is inputed, so the evaluation value can be determined that case of input $u$. Especially, when the system equation is as below

$$ \frac {d \mathbf x}{dt} = \bf Ax + Bu $$

which called linear system equation, that optimal control problem is solved analytically with Riccati equation.

In the fact, in real-industry there are non-linear system and various type of evaluaton function $J$, which often includes state $\bf x$ and input $\bf u$, so in this article, evaluation function $J$ is written as below, after this, for the sake of generality.

$$ J = \int _ 0 ^ {\infty} L(\mathbf x, \mathbf u, t) \mathrm dt $$

This evaluation function should be designed to fit the control you want to achieve. If you want to evaluate the goodness of the control rule $\bf u$, the optimization problem is represented ${\rm max} _ {\bf u} J$.

### value function

In this section, the evaluation function $J$ evaluate goodness of input $\bf u$. Now, let's look at the evaluation function from a different perspective. What is the point of view is that, in fact, the only time $t$ has been already elapsed since control started. Since we can no longer go back in time, we consider evaluation functions only for time t and later. It would simply look something like

$$ J _ t = \int _ t ^ \infty l({\bf x}, {\bf u}, t ){\rm d} t $$

It's simply that the integration starts at t. If you want to put in as much effort as possible after t has already passed, it is natural to consider the above evaluation function. Considering the case that the optimized $J _ t$ already had be finded with optimal rule $\bf u$, the value of $J ^ * _ t = {\rm max} _ {\bf u} J$ is called value function which is often written as below.

$$ V _ t = \max _ {\bf u} J _ t $$

In other words, the value function is the maximum value of the evaluation function starting at time t. Now, to avoid confusion, I'll clarify what the value function $V _ t$ takes as its argument.

$$ V ({\bf x}, t) = \max _ {\bf u} J({\bf x}, {\bf u}, t) $$

Note that the value function is no longer a function of u, as it has already been optimized by u for the evaluation function. It evaluate the state $x$ is good or not.

### belman principle of optimality

Now rewrite the value function to look like this It is "the sum of the value function of time from time t to t+dt and the value function after time t+dt".

$$ \begin{align} V ({\bf x}, t) & = \max _ {\bf u} J({\bf x}, {\bf u}, t) \\ & = \max _ {\bf u} \int _ t ^ \infty L({\bf x}, {\bf u}, t ){\rm d} t \\ & = \max _ {\bf u} \int _ t ^ {t+{\rm d} t} L({\bf x}, {\bf u}, t ){\rm d} t + \max _ {\bf u} \int _ {t + {\rm d} t}^ \infty L({\bf x}, {\bf u}, t ){\rm d} t \\ & = \max _ {\bf u} \int _ t ^ {t+{\rm d} t} L({\bf x}, {\bf u}, t ){\rm d} t + V ({\bf x}, t + {\rm d}t) \end{align} $$

Here, the last equation transformation focuses on the fact that the starting point of the time has only changed from t to t+dt. This equation is called belman principle of optimality which represetned that If there is an optimal state $x(t)$ at all times, then it is also an optimal function when cropped at some partial time.

## Belman equation for reinforcement learning

### Prepare for belman equation

Now, we have earned the important represent of value function as below.

$$ V ({\bf x}, t) = \max _ {\bf u} \int _ t ^ {t+{\rm d} t} L({\bf x}, {\bf u}, t ){\rm d} t + V ({\bf x}, t + {\rm d}t) $$

We assume that the time is continuous.Next, we transform it into a number that can be counted as $t,t+1,t+2$ for a given sample time. Then the above equation is

$$ V ({\bf x}, t) = \max _ {\bf u} L({\bf x}, {\bf u}, t ) + V ({\bf x}, t + 1) $$

where, in function $L$, the integration of the first term cannot be evaluated because it is discretized. So, roughly speaking, the value of $L$ at an appropriate time in the interval $[t,t+dt)$ is adopted as representative of the value of $L$.

After this, the value function is represented as $V(\mathbf x _ t)$ instead of $V ({\bf x}, t)$ because the value function evalueates state $\bf x$ mainly, and stating time point $t$ can be known from index of state $\mathbf x _ t$. And, we evaluate goodness throught function $L$ which mean initial of Loss, so using $R$ which mean initial Reward instead of $L$. Therefore the above equation is represented as

$$ V ({\bf x} _ t) = \max _ {\bf u} R({\bf x} _ t, {\bf u} _ t) + V ({\bf x} _ {t + 1}) $$

Let's call simple belman equation after now.

### system equation for random variables

The systems we've been working with so far can be represented as below differential equation.

$$ \frac {dx}{dt} = f(x, u) $$

However, reinforcement learning often handle a state which is a random variable, so the system equation is not able to be represented by differential equation. The systems are represented as stochastic process, especially, markov decision process.

$$ \mathbf x _ {t + 1} \sim p(\mathbf x _ {t + 1} \mid \mathbf x _ t, \mathbf u _ t) $$

In strictly, markov decision process includes other components, but now, we don't enter the details. That equation represents that the next state $\mathbf x _ {t+1}$ depends on current state $\mathbf x _ t$ and current input $\mathbf u _ t$ and not determistic but only probabilistic although there is a rule $p$ in fact, not mean that it's bullshit.

### simple belman equation to actual belman equation

Now, simple belman equation is below.

$$ V ({\bf x} _ t) = \max _ {\bf u} R({\bf x} _ t, {\bf u} _ t) + V ({\bf x} _ {t + 1}) $$

Can you determine the value of this function with not determistic system $p(\mathbf x _ {t+1} \mid \mathbf x _ t , \mathbf u _ t)$ ? The reason why it can be done in optimal control problem to solve the problem is that we can evaluate the value of evaluation function given input $\bf u$. However, now, when system is given input $\mathbf u _ t$, the state process is not determistic, so we can not able earn fixed evaluated value.

We can only estimate the second term for instance using expected value. So, simple belman equation is rewritten as below.

$$ \begin{align} V ({\bf x} _ t) &= \max _ {\bf u} R({\bf x} _ t, {\bf u} _ t) + V ({\bf x} _ {t + 1}) \\ &\simeq \max _ {\bf u} R({\bf x} _ t, {\bf u} _ t) + \max _ {\bf u} \mathbb E _ {\mathbf x _ {t+1} \sim p}[V ({\bf x} _ {t + 1})] \\ &= \max _ {\bf u} R({\bf x} _ t, {\bf u} _ t) + \max _ {\bf u} \sum _ {\mathbf x _ {t+1}} p(\mathbf x _ {t+1} \mid \mathbf x _ t, \mathbf u _ t) V ({\bf x} _ {t + 1}) \end{align} $$

The reason why the second term includes $\rm max$ operation is that dispite the value function which has been optimized about input $\mathbf u$ should has only state $x$ arguments, the operator $\mathbb E$ includes input $\mathbf u$ with stohastic process system $p$.

In addition, the value function $V$ includes infinite sum of time in the future because originaly value function is represented integral from $t$ to $\infty$ (Now, that represent has been transformed into recursive style). Therefore, if the evaluated value of value function with concrete $\mathbf x _ t$ and $\mathbf u _ t$ is sumed in the future, the value function can be diverge. It is prevented by below equation using special hyperparameter which called discount rate $\gamma$.

$$ V ({\bf x} _ t) = \max _ {\bf u} R({\bf x} _ t, {\bf u} _ t) + \gamma \max _ {\bf u} \sum _ {\mathbf x _ {t+1}} p(\mathbf x _ {t+1} \mid \mathbf x _ t, \mathbf u _ t) V ({\bf x} _ {t + 1}) $$

This is called optimal belman equation which is important basis in reinforcement learning.

Please remind the optimal control problem, the value function represents optimized evaluation function from time step $t$. Therefore the belman equation which is a time recursive style of value function represents the equation about evaluation function of optimal control problem optimized by input $\mathbf u$.

In other words, belman equation indicates what evaluation function should become when optimal control problem is solved. Reinforcement learning use belman equation (or belman operator) to estimate the value function with rewards earned from trying and error to marcov dicision process we want to control.