Model-based DRL for accelerated learning from flow simulations

Andre Weiner, Janis Geise, Chair of Fluid Mechanics

01 simulation-based learning - motivation and challenges
02 model-based learning - DRL basics, model-based PPO
03 benchmark results - flow past a cylinder, fluidic pinball

simulation-based learning

motivation and challenges

closed-loop control benchmark, $Re=100$

evaluation of optimal policy (control law)

optimal sensor placement - R. Paris et al. (2021)

optimal actuator placement - R. Paris et al. (2023)

separation control, B. Font et al. (2025)

  • LES, higher-order spectral elements
  • 8 simulations in parallel
  • 96 episodes (iterations)
  • 6 days turnaround time
  • 1152 GPUh (A100)
  • $4$ EUR/GPUh $\rightarrow 5$ kEUR

Training cost DrivAer model

  • $5$ hours/simulation (1000 MPI ranks)
  • $10$ parallel simulations
  • $100$ iterations $\rightarrow 20$ days turnaround time
  • $20\times 24\times 10\times 1000 \approx 5\times 10^6 $ CPUh
  • $0.01-0.05$ EUR/CPUh $\rightarrow 0.5-2$ mEUR

CFD simulations are expensive!

model-based learning

DRL basics, model-based PPO

reinforcement learning: sequential decision making (control) under uncertainty

experience tuple at step $n$ $$ (S_n, A_n, R_{n+1}, S_{n+1}) $$

trajectory over $N$ steps $$\tau = \left[ (S_0, A_0, R_1, S_1), \ldots ,(S_{N-1}, A_{N-1}, R_N, S_N)\right]$$
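
A minimal Python sketch of these data structures; the sensor count and random values are placeholders, not the setup used in the benchmarks:

```python
from collections import namedtuple
import numpy as np

# experience tuple (S_n, A_n, R_{n+1}, S_{n+1})
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])

# dummy data standing in for one trajectory of N control steps
N = 5
states = np.random.rand(N + 1, 12)   # e.g. 12 surface pressure sensors
actions = np.random.rand(N)          # e.g. angular velocity of the cylinder
rewards = np.random.rand(N)          # R_1, ..., R_N

trajectory = [Experience(states[n], actions[n], rewards[n], states[n + 1])
              for n in range(N)]
```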

return - dealing with sequential feedback

$$ G_n = R_{n+1} + R_{n+2} + ... + R_N $$

discounted return $$ G_n = R_{n+1} + \gamma R_{n+2} + \gamma^2 R_{n+3} + \ldots + \gamma^{N-n-1}R_N $$

$\gamma$ - discounting factor, typically $\gamma = 0.99$
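
The discounted return follows from a simple backward recursion; a short sketch (the array layout, with `rewards[n]` holding $R_{n+1}$, is an assumption):

```python
import numpy as np

def discounted_returns(rewards: np.ndarray, gamma: float = 0.99) -> np.ndarray:
    """Compute G_n = R_{n+1} + gamma*R_{n+2} + ... for every step n."""
    returns = np.zeros_like(rewards)
    g = 0.0
    # backward recursion: G_n = R_{n+1} + gamma * G_{n+1}
    for n in reversed(range(len(rewards))):
        g = rewards[n] + gamma * g
        returns[n] = g
    return returns

print(discounted_returns(np.array([1.0, 1.0, 1.0])))  # [2.9701 1.99 1.0]
```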

learning what to expect in a given state

$$ L_V = \frac{1}{N_\tau N} \sum\limits_{\tau = 1}^{N_\tau}\sum\limits_{n = 1}^{N} \left( V_{\theta_v}(S_n^\tau) - G_n^\tau \right)^2 $$

  • $\tau$ - trajectory (single simulation)
  • $S_n$ - state/observation (pressure)
  • $V_{\theta_v}$ - parametrized value function
  • clipping not included
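
A minimal PyTorch sketch of this loss, assuming a small MLP as value network; the sensor count and layer sizes are illustrative:

```python
import torch

def value_loss(value_net: torch.nn.Module,
               states: torch.Tensor,   # shape (N_tau * N, n_sensors)
               returns: torch.Tensor   # shape (N_tau * N,)
               ) -> torch.Tensor:
    """MSE between V(S_n) and the sampled returns G_n (clipping omitted)."""
    values = value_net(states).squeeze(-1)
    return ((values - returns) ** 2).mean()

# usage with a small MLP as parametrized value function
value_net = torch.nn.Sequential(
    torch.nn.Linear(12, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)
loss = value_loss(value_net, torch.rand(32, 12), torch.rand(32))
loss.backward()
```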

Was the selected action a good one?

$$\delta_n = R_{n+1} + \gamma V_{\theta_v}(S_{n+1}) - V_{\theta_v}(S_n) $$ $$\delta_n + \gamma \delta_{n+1} = R_{n+1} + \gamma R_{n+2} + \gamma^2 V_{\theta_v}(S_{n+2}) - V_{\theta_v}(S_n) $$

$$ A_n^{GAE} = \sum\limits_{l=0}^{N-n-1} (\gamma \lambda)^l \delta_{n+l} $$

  • $\delta_n$ - one-step advantage estimate (TD error)
  • $\lambda$ - smoothing parameter of the exponential weighting
  • $A_n^{GAE}$ - generalized advantage estimate
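
A sketch of the GAE computation for a single trajectory; $\lambda = 0.97$ and the array layout (`rewards[n]` holds $R_{n+1}$, `values` holds $V(S_0),\ldots,V(S_N)$) are assumptions for illustration:

```python
import numpy as np

def gae(rewards: np.ndarray, values: np.ndarray,
        gamma: float = 0.99, lam: float = 0.97) -> np.ndarray:
    """Generalized advantage estimate for one trajectory."""
    N = len(rewards)
    # one-step TD errors: delta_n = R_{n+1} + gamma*V(S_{n+1}) - V(S_n)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(N)
    adv = 0.0
    # backward recursion: A_n = delta_n + gamma*lambda*A_{n+1}
    for n in reversed(range(N)):
        adv = deltas[n] + gamma * lam * adv
        advantages[n] = adv
    return advantages

print(gae(np.ones(3), np.zeros(4)))
```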

make good actions more likely

$$ J_\pi = \frac{1}{N_\tau N} \sum\limits_{\tau = 1}^{N_\tau}\sum\limits_{n = 1}^{N} \frac{\pi_{\theta_\pi}(A_n|S_n)}{\pi_{\theta_\pi}^{old}(A_n|S_n)} A^{GAE,\tau}_n $$

  • $\pi_{\theta_\pi}$ - current policy
  • $\pi_{\theta_\pi}^{old}$ - old policy (previous episode)
  • simplified (no clipping, entropy)
  • $J_\pi$ is maximized

model-ensemble PPO (MEPPO) flow chart

auto-regressive surrogate models with weights $\theta_m$

$$ m_{\theta_m} : (\underbrace{S_{n-d}, \ldots, S_{n-1}, S_n}_{\hat{S}_n}, A_n) \rightarrow (S_{n+1}, R_{n+1}) $$

$\mathbf{x}_n = [\hat{S}_n, A_n]$ and $\mathbf{y}_n = [S_{n+1}, R_{n+1}]$

$$ L_m = \frac{1}{|D|}\sum\limits_{i=1}^{|D|} ||\mathbf{y}_i - m_{\theta_m}(\mathbf{x}_i)||_2^2 $$
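
A minimal sketch of one ensemble member and its training loss, assuming a fully connected network; the sensor count, history length $d$, and layer sizes are illustrative:

```python
import torch

n_sensors, d, n_actions = 12, 2, 1            # illustrative sizes
n_features = (d + 1) * n_sensors + n_actions  # [S_{n-d}, ..., S_n, A_n]
n_outputs = n_sensors + 1                     # [S_{n+1}, R_{n+1}]

# one fully connected surrogate model of the ensemble
model = torch.nn.Sequential(
    torch.nn.Linear(n_features, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, n_outputs)
)

def model_loss(model, x, y):
    """Mean squared error between predicted and true (next state, reward) pairs."""
    return ((model(x) - y) ** 2).mean()

loss = model_loss(model, torch.rand(128, n_features), torch.rand(128, n_outputs))
loss.backward()
```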

How to sample from the ensemble?

  1. pick initial sequence from CFD
  2. generate model trajectories
    1. select random model
    2. sample action
    3. predict next state
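
A sketch of this sampling scheme, assuming an ensemble of fully connected models and a Gaussian policy; all sizes and helper names are illustrative, not the study's implementation:

```python
import random
import torch

def model_trajectory(ensemble, policy, initial_sequence, n_steps, d=2):
    """Roll out one model-based trajectory.

    ensemble: surrogate models mapping [S_{n-d}, ..., S_n, A_n] -> [S_{n+1}, R_{n+1}]
    policy: maps a state to a torch.distributions object over actions
    initial_sequence: list of d+1 state tensors taken from a CFD trajectory
    """
    states = list(initial_sequence)             # 1. initial sequence from CFD
    actions, rewards = [], []
    for _ in range(n_steps):                    # 2. generate model trajectory
        model = random.choice(ensemble)         # 2.1 select a random model
        action = policy(states[-1]).sample()    # 2.2 sample an action
        x = torch.cat(states[-(d + 1):] + [action])
        with torch.no_grad():
            y = model(x)                        # 2.3 predict next state and reward
        states.append(y[:-1])
        rewards.append(y[-1:])
        actions.append(action)
    return states, actions, rewards

# dummy ensemble and Gaussian policy for illustration
n_sensors, d = 12, 2
ensemble = [torch.nn.Linear((d + 1) * n_sensors + 1, n_sensors + 1) for _ in range(5)]
policy = lambda s: torch.distributions.Normal(s.mean(dim=-1, keepdim=True), 1.0)
init = [torch.rand(n_sensors) for _ in range(d + 1)]
states, actions, rewards = model_trajectory(ensemble, policy, init, n_steps=10)
```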

Are the models still reliable?

  1. evaluate the policy loss for every model
  2. compare to the loss values of the previous episode
  3. switch back to CFD sampling if the loss did not decrease
    for at least $N_\mathrm{thr}$ of the models (see the sketch below)
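
A sketch of this switching criterion; the per-model policy losses are assumed to be available from the evaluation step:

```python
def switch_to_cfd(current_losses, previous_losses, n_thr):
    """Return True if the policy loss did not decrease for at least
    n_thr of the ensemble models, i.e. the models are no longer trusted
    and sampling should switch back to the CFD environment."""
    not_improved = sum(c >= p for c, p in zip(current_losses, previous_losses))
    return not_improved >= n_thr

# example: 3 of 5 models show no improvement -> switch back to CFD
print(switch_to_cfd([0.9, 1.1, 1.2, 0.8, 1.3],
                    [1.0, 1.0, 1.0, 1.0, 1.0], n_thr=3))  # True
```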

benchmark results

flow past a cylinder, fluidic pinball


cylinder flow setup, $Re=100$

force coefficients

$$c_x = \frac{2F_x}{U_\mathrm{in}^2 A_\mathrm{ref}}\quad c_y = \frac{2F_y}{U_\mathrm{in}^2 A_\mathrm{ref}}$$

instantaneous reward

$$R_n = 3 - \left(c_{x,n} + 0.1|c_{y,n}|\right)$$
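
A small sketch combining force coefficients and reward; the values of $U_\mathrm{in}$ and $A_\mathrm{ref}$ are placeholders, not the benchmark settings:

```python
def cylinder_reward(fx, fy, u_in=1.0, a_ref=0.1):
    """Reward R_n = 3 - (c_x + 0.1|c_y|) from the instantaneous forces."""
    cx = 2.0 * fx / (u_in**2 * a_ref)
    cy = 2.0 * fy / (u_in**2 * a_ref)
    return 3.0 - (cx + 0.1 * abs(cy))

print(cylinder_reward(fx=0.1, fy=0.01))  # c_x = 2.0, c_y = 0.2 -> reward 0.98
```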

normalized training time (cylinder flow)

cumulative rewards (cylinder flow)

fluidic pinball setup, $Re=100$

cumulative force coefficients

$$c_x = \sum\limits_{i=1}^3 c_{x,i}\quad c_y = \sum\limits_{i=1}^3 c_{y,i}$$

instantaneous reward

$$R_n = 1.5 - (c_{x,n} + 0.5 |c_{y,n}|)$$
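
The corresponding sketch for the pinball reward, with illustrative per-cylinder coefficients:

```python
def pinball_reward(cx_i, cy_i):
    """Reward R_n = 1.5 - (c_x + 0.5|c_y|); cx_i and cy_i hold the
    force coefficients of the three cylinders."""
    cx, cy = sum(cx_i), sum(cy_i)
    return 1.5 - (cx + 0.5 * abs(cy))

print(pinball_reward([0.3, 0.4, 0.3], [0.1, -0.05, 0.02]))  # 0.465
```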

normalized training time (fluidic pinball)

cumulative rewards (fluidic pinball)

evaluation of optimal policy

force coefficients and actuation (optimal policy)

local velocity field

final remarks

future work will focus on

  • end-to-end control design
  • surrogate model improvement
  • application to turbulent flows

THE END

Thank you for your attention!

andre.weiner@tu-dresden.de
github.com/AndreWeiner