Andre Weiner
TU Dresden, Institute of Fluid Mechanics, PSM
These slides and most of the linked resources are licensed under a Creative Commons Attribution 4.0 International License.
Reduced-order modeling of flow fields
Controlling the flow past a cylinder
Goals of flow control:
Categories of flow control:
Active flow control can be more effective but requires energy.
Categories of active flow control:
Closed-loop flow control can be more effective but defining the control law is extremely challenging.
Flow past a circular cylinder at $Re=100$.
Drag and lift coefficients for the cylinder surface; source: figure 2.9 in D. Thummar 2021.
Typical steps in open-loop control:
Open-loop example:
Loss landscape for periodic control; source: figure 3.5 in D. Thummar 2021.
Closed-loop example: $\omega = f(\mathbf{p})$.
Closed-loop flow control with variable Reynolds number; source: F. Gabriel 2021.
How to find the control law?
Favorable attributes of DRL:
Why CFD-based closed-loop control via DRL?
Main challenge: CFD environments are expensive!
Reinforcement learning cycle.
The agent:
The environment at time step $n$:
State $S_n\in \mathcal{S}$:
Note that state and observation are typically used as synonyms.
Action $ A_n\in \mathcal{A}(s)$:
Even more definitions:
Describing uncertainty: "slippery walk" presented as Markov decision process (MDP).
Markov property:
$$ P(S_{n+1}|S_n, A_n) = P(S_{n+1}|S_n, A_n, S_{n-1}, A_{n-1}, ...) $$
May be relaxed in practice by adding several time levels to the state.
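A minimal sketch of this idea (not part of the original setup; the probe count and history length are made up): several consecutive sensor snapshots are concatenated into a single state vector.
import torch as pt

# hypothetical example: the last 3 snapshots of 12 pressure probes
# form the state to (approximately) restore the Markov property
snapshots = [pt.rand(12) for _ in range(3)]  # placeholder probe data
state = pt.cat(snapshots)                    # state vector with 36 entries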
Transition function:
$$ p(s^\prime | s, a) = P(S_{n}=s^\prime | S_{n-1} = s, A_{n-1} = a) $$
Reward function:
$$ r(s, a) = \mathbb{E}[R_{n} | S_{n-1}=s, A_{n-1}=a] $$
Slippery walk in gym/gym_walk:
import gym
import gym_walk  # registers the SlipperyWalk environments

env = gym.make('SlipperyWalkSeven-v0')
init_state = env.reset()
# P[state][action] -> list of (transition probability, next state, reward, done) tuples
P = env.env.P
list(P.items())[-2:]
# output
[(7,
{0: [(0.5000000000000001, 6, 0.0, False),
(0.3333333333333333, 7, 0.0, False),
(0.16666666666666666, 8, 1.0, True)],
1: [(0.5000000000000001, 8, 1.0, True),
(0.3333333333333333, 7, 0.0, False),
(0.16666666666666666, 6, 0.0, False)]}),
(8,
{0: [(0.5000000000000001, 8, 0.0, True),
(0.3333333333333333, 8, 0.0, True),
(0.16666666666666666, 8, 0.0, True)],
1: [(0.5000000000000001, 8, 0.0, True),
(0.3333333333333333, 8, 0.0, True),
(0.16666666666666666, 8, 0.0, True)]})]
What is the meaning of $(3, 0, 0, 4)$?
Why might the CFD environment be uncertain?
Intuitively, what is the optimal policy?
Dealing with sequential feedback:
return: $$ G_n = R_{n+1} + R_{n+2} + ... + R_N $$
discounted return: $$ G_n = R_{n+1} + \gamma R_{n+2} + \gamma^2 R_{n+3} + ... + \gamma^{N-n-1}R_N $$
$N$ - number of time steps, $\gamma$ - discounting factor, typically $\gamma = 0.99$
recursive definition: $$ G_n = R_{n+1} + \gamma R_{n+2} + \gamma^2 R_{n+3} + ... + \gamma^{N-n-1}R_N $$ $$ G_{n+1} = R_{n+2} + \gamma R_{n+3} + \gamma^2 R_{n+4} + ... + \gamma^{N-n-2}R_N $$ $$ G_n = R_{n+1} + \gamma G_{n+1} $$
Why should we discount?
rewards trajectory 1: $(0, 0, 0, 0, 1)$
rewards trajectory 2: $(0, 0, 0, 0, 0, 0, 1)$
Which one is better?
returns: 1 and 1
discounted returns: $0.99^4\cdot 1$ and $0.99^6\cdot 1$
$\rightarrow$ discounting expresses urgency
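The two trajectories above can be checked with a few lines of Python; the sketch below is only for illustration and uses both the direct and the recursive definition of the discounted return.
def discounted_return(rewards, gamma=0.99):
    # direct evaluation: G_0 = R_1 + gamma*R_2 + gamma^2*R_3 + ...
    return sum(gamma**i * r for i, r in enumerate(rewards))

def discounted_return_recursive(rewards, gamma=0.99):
    # recursive evaluation: G_n = R_{n+1} + gamma*G_{n+1}
    if not rewards:
        return 0.0
    return rewards[0] + gamma * discounted_return_recursive(rewards[1:], gamma)

print(discounted_return([0, 0, 0, 0, 1]))        # 0.99^4 ≈ 0.9606
print(discounted_return([0, 0, 0, 0, 0, 0, 1]))  # 0.99^6 ≈ 0.9415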
Returns combined with uncertainty: the state-value function
$$ v_\pi (s) = \mathbb{E}_\pi [G_n| S_n=s] = \mathbb{E}_\pi [R_{n+1} + \gamma G_{n+1}| S_n=s] $$ In words: the value function expresses the expected return at time step $n$ given that $S_n = s$ when following policy $\pi$.
How to compute the value function if the MDP is known? - Bellman equation
$$ v_\pi (s) = \mathbb{E}_\pi [R_{n+1} + \gamma G_{n+1}| S_n=s] $$ $$ v_\pi (s) = \sum\limits_{s^\prime , r, a} p(s^\prime,r| s, a) [r + \gamma \mathbb{E}_\pi [G_{n+1} | S_{n+1} = s^\prime]] $$ $$ v_\pi (s) = \sum\limits_{s^\prime , r, a} p(s^\prime,r| s, a) [r + \gamma v_\pi (s^\prime)]\quad \forall s\in S $$
$s^\prime$ - next state; a deterministic policy is assumed, so the actions are not weighted with $\pi(a|s)$
Intuitively, which state has the highest value?
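The Bellman equation above can also be used to evaluate a fixed policy. The sketch below assumes the transition dictionary P of the slippery walk from before and an arbitrary deterministic policy; it only illustrates the update rule.
import torch as pt

def evaluate_policy(pi, P, gamma=0.99, theta=1e-10):
    # iterative policy evaluation: apply the Bellman equation
    # repeatedly until the value estimate stops changing
    V = pt.zeros(len(P))
    while True:
        V_new = pt.zeros_like(V)
        for s in range(len(P)):
            for prob, next_state, reward, done in P[s][pi(s)]:
                V_new[s] += prob * (reward + gamma * V[next_state] * (not done))
        if pt.max(pt.abs(V - V_new)) < theta:
            return V_new
        V = V_new

# example: value of the policy that always walks to the right (action 1)
V_right = evaluate_policy(lambda s: 1, P)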
What action to take? The action-value function:
$$ q_\pi (s, a) = \mathbb{E}_\pi [G_n | S_n = s, A_n=a] $$ $$ q_\pi (s, a) = \mathbb{E}_\pi [R_{n+1} + \gamma G_{n+1} | S_n = s, A_n=a] $$ $$ q_\pi (s, a) = \sum\limits_{s^\prime , r} p(s^\prime,r| s, a)[r + \gamma v_\pi (s^\prime)]\quad \forall s\in S, \forall a\in A $$
Intuitively, what would you expect $q_\pi (s, a=0)$ vs. $q_\pi (s, a=1)$ ?
How much better is an action? - action-advantage function
$$ a_\pi (s,a) = q_\pi(s,a) - v_\pi(s) $$
The optimal policy:
$$ v^\ast (s) = \underset{\pi}{\mathrm{max}}\ v_\pi (s)\quad \forall s\in \mathcal{S} $$ $$ q^\ast (s, a) = \underset{\pi}{\mathrm{max}}\ q_\pi (s, a)\quad \forall s\in \mathcal{S}, \forall a\in \mathcal{A} $$
Planning: finding $\pi^\ast$ with value iterations
$$ v_{k+1}(s) = \underset{a}{\mathrm{max}}\sum\limits_{s^\prime, r} p(s^\prime , r | s, a)[r+\gamma v_k(s^\prime)] $$
$v_k$ - value estimate at iteration $k$
Value iteration implementation:
from typing import Dict, List, Tuple
import torch as pt

def value_iteration(P: Dict[int, Dict[int, List[Tuple]]], gamma: float=0.99, theta: float=1e-10):
    # state values, initialized to zero
    V = pt.zeros(len(P))
    while True:
        # action values based on the current state-value estimate
        Q = pt.zeros((len(P), len(P[0])))
        for s in range(len(P)):
            for a in range(len(P[s])):
                for prob, next_state, reward, done in P[s][a]:
                    Q[s][a] += prob * (reward + gamma * V[next_state] * (not done))
        # stop once the value estimate changes by less than theta
        if pt.max(pt.abs(V - pt.max(Q, dim=1).values)) < theta:
            break
        V = pt.max(Q, dim=1).values
    # greedy policy with respect to the converged action values
    pi = lambda s: {s: a for s, a in enumerate(pt.argmax(Q, dim=1))}[s]
    return Q, V, pi
Optimal action-value function:
optimal_Q, optimal_V, optimal_pi = value_iteration(P)
optimal_Q[1:-1]
# output
tensor([[0.3704, 0.6668],
[0.7902, 0.8890],
[0.9302, 0.9631],
[0.9768, 0.9878],
[0.9924, 0.9960],
[0.9976, 0.9988],
[0.9993, 0.9997]])
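A small usage sketch (not from the original slides): since the right column of optimal_Q dominates in every non-terminal state, the greedy policy should select action 1 (step to the right) everywhere.
# greedy action in each non-terminal state (1...7)
print([int(optimal_pi(s)) for s in range(1, 8)])
# expected output based on the Q-values above: [1, 1, 1, 1, 1, 1, 1]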
Why is it called reinforcement learning and not planning?
Deep RL uses neural networks to learn function approximations of:
Some DRL lingo:
General learning strategy:
How do DRL agents explore?
PDF - probability density function
Closed-loop example: $\omega = f(\mathbf{p})$.
PPO overview; source: figure 4.6 in D. Thummar 2021.
Policy network; source: figure 4.4 in D. Thummar 2021.
Action of example trajectories in training mode; source: figure 5.2 in F. Gabriel 2021.
Comparison of Gauss and Beta distribution.
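A minimal sampling sketch with torch.distributions (not from the original implementation; the parameter values are arbitrary): the Gaussian has unbounded support, so sampled actions must be clipped, whereas the Beta distribution is bounded on (0, 1) and only needs rescaling to the admissible action range.
import torch as pt

# Gaussian policy: samples may leave the admissible range and must be clipped
gauss = pt.distributions.Normal(loc=0.0, scale=1.0)
omega_gauss = pt.clamp(gauss.sample((5,)), -1.0, 1.0)

# Beta policy: samples always lie in (0, 1) and can be rescaled, e.g., to [-1, 1]
beta = pt.distributions.Beta(concentration1=2.0, concentration0=2.0)
omega_beta = 2.0 * beta.sample((5,)) - 1.0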
learning what to expect in a given state - value function loss
$$ L_V = \frac{1}{N_\tau N} \sum\limits_{\tau = 1}^{N_\tau}\sum\limits_{n = 1}^{N} \left( V(s_n^\tau) - G_n^\tau \right)^2 $$
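A possible implementation sketch of $L_V$ for one batch of states and sampled returns (the network and tensor names are assumptions):
import torch as pt

def value_loss(value_net, states, returns):
    # L_V: mean squared error between predicted values and sampled returns
    return ((value_net(states).squeeze() - returns)**2).mean()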
Was the selected action a good one?
$$\delta_n = R_n + \gamma V(s_{n+1}) - V(s_n) $$ $$\delta_n + \gamma \delta_{n+1} = R_n + \gamma R_{n+1} + \gamma^2 V(s_{n+2}) - V(s_n) $$ $$ A_n^{GAE} = \sum\limits_{l=0}^{N-n} (\gamma \lambda)^l \delta_{n+l} $$
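A sketch of how $A^{GAE}$ could be computed for a single trajectory (the value of $\lambda$ is chosen arbitrarily):
import torch as pt

def gae(rewards, values, gamma=0.99, lam=0.97):
    # rewards: tensor of length N; values: tensor of length N+1
    # one-step TD errors: delta_n = R_n + gamma*V(s_{n+1}) - V(s_n)
    deltas = rewards + gamma * values[1:] - values[:-1]
    # discounted sum of TD errors, accumulated backwards in time
    advantages = pt.zeros_like(deltas)
    running = 0.0
    for n in reversed(range(len(deltas))):
        running = deltas[n] + gamma * lam * running
        advantages[n] = running
    return advantages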
make good actions more likely - policy objective function
$$ J_\pi = \frac{1}{N_\tau N} \sum\limits_{\tau = 1}^{N_\tau}\sum\limits_{n = 1}^{N} \mathrm{min}\left[ \frac{\pi(a_n|s_n)}{\pi^{old}(a_n|s_n)} A^{GAE,\tau}_n, \mathrm{clamp}\left(\frac{\pi(a_n|s_n)}{\pi^{old}(a_n|s_n)}, 1-\epsilon, 1+\epsilon\right) A^{GAE,\tau}_n\right] $$
Effect of clipping in the PPO policy loss function.
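A sketch of the clipped objective $J_\pi$ for one batch, given the log-probabilities under the current and the old policy and the GAE estimates (all argument names are assumptions):
import torch as pt

def ppo_policy_objective(log_p, log_p_old, advantages, epsilon=0.2):
    # probability ratio pi(a|s) / pi_old(a|s)
    ratio = (log_p - log_p_old).exp()
    clipped = pt.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # elementwise minimum of unclipped and clipped objective
    return pt.min(ratio * advantages, clipped * advantages).mean()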
Refer to R. Paris et al. 2021 and the references therein for similar works employing PPO.
Cumulative reward over the number of training episodes; source: figure 5.8 in F. Gabriel 2021.
Closed-loop flow control with variable Reynolds number; source: F. Gabriel 2021.
Drag coefficient with and without control; source: figure 6.5 in F. Gabriel 2021.
Pressure on the cylinder's surface; source: figure 5.7 in F. Gabriel 2021.