Andre Weiner
TU Dresden, Institute of Fluid Mechanics, PSM
These slides and most of the linked resources are licensed under a Creative Commons Attribution 4.0 International License.
Reduced-order modeling of flow fields
Controlling the flow past a cylinder
Goals of flow control:
Categories of flow control:
Active flow control can be more effective but requires energy.
Categories of active flow control:
Closed-loop flow control can be more effective but defining the control law is extremely challenging.
Flow past a circular cylinder at Re=100.
Drag and lift coefficients for the cylinder surface; source: figure 2.9 in D. Thummar 2021.
Typical steps in open-loop control:
Open-loop example:
Loss landscape for periodic control; source: figure 3.5 in D. Thummar 2021.
Closed-loop example: ω=f(p).
Closed-loop flow control with variable Reynolds number; source: F. Gabriel 2021.
How to find the control law?
Favorable attributes of DRL:
Why CFD-based closed-loop control via DRL?
Main challenge: CFD environments are expensive!
Reinforcement learning cycle.
The agent:
The environment at time step n:
State Sn∈S:
Note that state and observation are typically used as synonyms.
Action An∈A(s):
Even more definitions:
Describing uncertainty: "slippery walk" presented as Markov decision process (MDP).
Markov property:
$$P(S_{n+1}|S_n, A_n) = P(S_{n+1}|S_n, A_n, S_{n-1}, A_{n-1}, \ldots)$$
May be relaxed in practice by adding several time levels to the state.
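A minimal sketch of such a state augmentation, assuming the observation is a vector of pressure probe values (probe count and history length below are made-up values for illustration):
from collections import deque
import torch as pt

n_probes, n_history = 12, 3        # illustrative sizes, not from the lecture
history = deque(maxlen=n_history)  # rolling buffer of the most recent observations

def augmented_state(observation: pt.Tensor) -> pt.Tensor:
    # append the new observation and pad the buffer on the first call(s)
    history.append(observation)
    while len(history) < n_history:
        history.append(observation)
    # concatenate the time levels into one flat state vector
    return pt.cat(list(history))   # shape: (n_history * n_probes,)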
Transition function:
$$p(s'|s,a) = P(S_n=s'|S_{n-1}=s, A_{n-1}=a)$$
Reward function:
$$r(s,a) = E[R_n|S_{n-1}=s, A_{n-1}=a]$$
Slippery walk in gym/gym_walk:
import gym
import gym_walk  # registers the SlipperyWalkSeven-v0 environment

env = gym.make('SlipperyWalkSeven-v0')
init_state = env.reset()
# P[state][action] yields a list of (probability, next state, reward, done) tuples
P = env.env.P
list(P.items())[-2:]
# output
[(7,
{0: [(0.5000000000000001, 6, 0.0, False),
(0.3333333333333333, 7, 0.0, False),
(0.16666666666666666, 8, 1.0, True)],
1: [(0.5000000000000001, 8, 1.0, True),
(0.3333333333333333, 7, 0.0, False),
(0.16666666666666666, 6, 0.0, False)]}),
(8,
{0: [(0.5000000000000001, 8, 0.0, True),
(0.3333333333333333, 8, 0.0, True),
(0.16666666666666666, 8, 0.0, True)],
1: [(0.5000000000000001, 8, 0.0, True),
(0.3333333333333333, 8, 0.0, True),
(0.16666666666666666, 8, 0.0, True)]})]
What is the meaning of (3,0,0,4)?
Why might the CFD environment be uncertain?
Intuitively, what is the optimal policy?
Dealing with sequential feedback:
return: $G_n = R_{n+1} + R_{n+2} + \ldots + R_N$
discounted return: $G_n = R_{n+1} + \gamma R_{n+2} + \gamma^2 R_{n+3} + \ldots + \gamma^{N-n-1} R_N$
$N$ - number of time steps, $\gamma$ - discount factor, typically $\gamma=0.99$
recursive definition:
$$G_n = R_{n+1} + \gamma R_{n+2} + \gamma^2 R_{n+3} + \ldots + \gamma^{N-n-1} R_N$$
$$G_{n+1} = R_{n+2} + \gamma R_{n+3} + \gamma^2 R_{n+4} + \ldots + \gamma^{N-n-2} R_N$$
$$G_n = R_{n+1} + \gamma G_{n+1}$$
Why should we discount?
rewards trajectory 1: (0,0,0,0,1)
rewards trajectory 2: (0,0,0,0,0,0,1)
Which one is better?
returns: 1 and 1
discounted returns: $0.99^4 \cdot 1$ and $0.99^6 \cdot 1$
→ discounting expresses urgency
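The recursive definition translates directly into code; a small sketch evaluating the discounted return of the two example trajectories:
def discounted_return(rewards, gamma=0.99):
    # backward recursion G_n = R_(n+1) + gamma * G_(n+1)
    G = 0.0
    for R in reversed(rewards):
        G = R + gamma * G
    return G

print(discounted_return([0, 0, 0, 0, 1]))        # 0.99^4 ≈ 0.961
print(discounted_return([0, 0, 0, 0, 0, 0, 1]))  # 0.99^6 ≈ 0.941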
Returns combined with uncertainty: the state-value function
$$v_\pi(s) = E_\pi[G_n|S_n=s] = E_\pi[R_{n+1} + \gamma G_{n+1}|S_n=s]$$
In words: the value function expresses the expected return at time step $n$, given that $S_n=s$, when following policy $\pi$.
How to compute the value function if the MDP is known? - Bellman equation
$$v_\pi(s) = E_\pi[R_{n+1} + \gamma G_{n+1}|S_n=s]$$
$$v_\pi(s) = \sum_{s',r,a} p(s',r|s,a)\left[r + \gamma E_\pi[G_{n+1}|S_{n+1}=s']\right]$$
$$v_\pi(s) = \sum_{s',r,a} p(s',r|s,a)\left[r + \gamma v_\pi(s')\right] \quad \forall s \in \mathcal{S}$$
s′ - next state; deterministic actions assumed
Intuitively, which state has the highest value?
What action to take? The action-value function:
$$q_\pi(s,a) = E_\pi[G_n|S_n=s, A_n=a]$$
$$q_\pi(s,a) = E_\pi[R_{n+1} + \gamma G_{n+1}|S_n=s, A_n=a]$$
$$q_\pi(s,a) = \sum_{s',r} p(s',r|s,a)\left[r + \gamma v_\pi(s')\right] \quad \forall s \in \mathcal{S}, \forall a \in \mathcal{A}$$
Intuitively, what would you expect: $q_\pi(s, a=0)$ vs. $q_\pi(s, a=1)$?
How much better is an action? - action-advantage function
$$a_\pi(s,a) = q_\pi(s,a) - v_\pi(s)$$
The optimal policy:
$$v^*(s) = \max_\pi v_\pi(s) \quad \forall s \in \mathcal{S}$$
$$q^*(s,a) = \max_\pi q_\pi(s,a) \quad \forall s \in \mathcal{S}, \forall a \in \mathcal{A}$$
Planning: finding π∗ with value iteration
$$v_{k+1}(s) = \max_a \sum_{s',r} p(s',r|s,a)\left[r + \gamma v_k(s')\right]$$
$v_k$ - value estimate at iteration $k$
Value iteration implementation:
from typing import Dict
import torch as pt

def value_iteration(P: Dict[int, Dict[int, list]], gamma: float=0.99, theta: float=1e-10):
    # V - current estimate of the state-value function
    V = pt.zeros(len(P))
    while True:
        # Q - action-value estimate based on the current V
        Q = pt.zeros((len(P), len(P[0])))
        for s in range(len(P)):
            for a in range(len(P[s])):
                for prob, next_state, reward, done in P[s][a]:
                    Q[s][a] += prob * (reward + gamma * V[next_state] * (not done))
        # stop once the value estimate no longer changes significantly
        if pt.max(pt.abs(V - pt.max(Q, dim=1).values)) < theta:
            break
        V = pt.max(Q, dim=1).values
    # greedy policy: pick the action with the highest action value in each state
    pi = lambda s: {s: a for s, a in enumerate(pt.argmax(Q, dim=1))}[s]
    return Q, V, pi
Optimal action-value function:
optimal_Q, optimal_V, optimal_pi = value_iteration(P)
optimal_Q[1:-1]
# output
tensor([[0.3704, 0.6668],
[0.7902, 0.8890],
[0.9302, 0.9631],
[0.9768, 0.9878],
[0.9924, 0.9960],
[0.9976, 0.9988],
[0.9993, 0.9997]])
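The state values, greedy policy, and action advantages defined earlier follow directly from the tabulated action values; a short sketch reusing the variables above:
# state value: best achievable action value in each state
V = optimal_Q.max(dim=1).values
# greedy policy: index of the best action per state
best_actions = optimal_Q.argmax(dim=1)
# action advantage a(s, a) = q(s, a) - v(s)
A = optimal_Q - V.unsqueeze(1)
print(best_actions)  # action 1 in every non-terminal state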
Why is it called reinforcement learning and not planning?
Deep RL uses neural networks to learn function approximations of:
Some DRL lingo:
General learning strategy:
How do DRL agents explore?
PDF - probability density function
Closed-loop example: ω=f(p).
PPO overview; source: figure 4.6 in D. Thummar 2021.
Policy network; source: figure 4.4 in D. Thummar 2021.
Action of example trajectories in training mode; source: figure 5.2 in F. Gabriel 2021.
Comparison of Gauss and Beta distribution.
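A small comparison of sampling from a Gaussian and a Beta policy; the action bound and distribution parameters below are made-up values for illustration:
import torch as pt

omega_max = 5.0  # assumed bound on the control action (angular velocity)
gauss = pt.distributions.Normal(0.0, 2.0)
beta = pt.distributions.Beta(2.0, 2.0)   # support bounded to (0, 1)

# Gaussian samples are unbounded and must be clipped to the admissible range ...
a_gauss = gauss.sample((1000,)).clamp(-omega_max, omega_max)
# ... while Beta samples only need to be rescaled
a_beta = beta.sample((1000,)) * 2.0 * omega_max - omega_max

print(a_gauss.min(), a_gauss.max(), a_beta.min(), a_beta.max())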
learning what to expect in a given state - value function loss
$$L_V = \frac{1}{N_\tau N} \sum_{\tau=1}^{N_\tau} \sum_{n=1}^{N} \left(V(s_n^\tau) - G_n^\tau\right)^2$$
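A compact sketch of this loss, assuming value predictions and sampled returns are gathered in tensors of shape (number of trajectories, number of steps):
import torch as pt

def value_loss(values: pt.Tensor, returns: pt.Tensor) -> pt.Tensor:
    # mean squared error between predicted values V(s) and sampled returns G
    return ((values - returns)**2).mean()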
Was the selected action a good one?
$$\delta_n = R_n + \gamma V(s_{n+1}) - V(s_n)$$
$$\delta_n + \gamma\delta_{n+1} = R_n + \gamma R_{n+1} + \gamma^2 V(s_{n+2}) - V(s_n)$$
$$A_n^{GAE} = \sum_{l=0}^{N-n} (\gamma\lambda)^l \delta_{n+l}$$
$\lambda$ - smoothing parameter of the generalized advantage estimate (GAE)
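A minimal implementation of the estimator for a single trajectory; the value of λ below is only an example:
import torch as pt

def gae(rewards: pt.Tensor, values: pt.Tensor, gamma: float=0.99, lam: float=0.97) -> pt.Tensor:
    # values must contain one more entry than rewards (value of the final state)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = pt.zeros_like(deltas)
    A = 0.0
    # backward recursion A_n = delta_n + gamma * lambda * A_(n+1)
    for n in reversed(range(len(deltas))):
        A = deltas[n] + gamma * lam * A
        advantages[n] = A
    return advantages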
make good actions more likely - policy objective function
$$J_\pi = \frac{1}{N_\tau N} \sum_{\tau=1}^{N_\tau} \sum_{n=1}^{N} \min\left[ \frac{\pi(a_n|s_n)}{\pi_{old}(a_n|s_n)} A_n^{GAE,\tau},\ \mathrm{clamp}\left(\frac{\pi(a_n|s_n)}{\pi_{old}(a_n|s_n)}, 1-\epsilon, 1+\epsilon\right) A_n^{GAE,\tau} \right]$$
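A sketch of the clipped objective, assuming log-probabilities and advantages are flattened into 1D tensors; ε=0.2 is a common but not prescribed choice:
import torch as pt

def policy_objective(log_p: pt.Tensor, log_p_old: pt.Tensor,
                     advantages: pt.Tensor, epsilon: float=0.2) -> pt.Tensor:
    # probability ratio pi(a|s) / pi_old(a|s)
    ratio = (log_p - log_p_old).exp()
    clipped = ratio.clamp(1.0 - epsilon, 1.0 + epsilon)
    # element-wise minimum of unclipped and clipped objective, averaged over samples
    return pt.min(ratio * advantages, clipped * advantages).mean()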
Effect of clipping in the PPO policy loss function.
Refer to R. Paris et al. 2021 and the references therein for similar works employing PPO.
Cumulative reward over the number of training episodes; source: figure 5.8 in F. Gabriel 2021.
Closed-loop flow control with variable Reynolds number; source: F. Gabriel 2021.
Drag coefficient with and without control; source: figure 6.5 in F. Gabriel 2021.
Pressure on the cylinder's surface; source: figure 5.7 in F. Gabriel 2021.