Andre Weiner, Janis Geise
TU Dresden, Chair of Fluid Mechanics
Flow past a cylinder in a narrow channel at $Re=\bar{U}_\mathrm{in}d/\nu = 100$.
Closed-loop control starts at $t=4\,\mathrm{s}$.
motivation for closed-loop AFC
How to design the control system?
$\rightarrow $ end-to-end optimization via simulations
Training cost of the DrivAer model
CFD simulations are expensive!
Create an intelligent agent that learns to map states to actions such that expected returns are maximized.
experience tuple at step $n$ $$ (S_n, A_n, R_{n+1}, S_{n+1}) $$
trajectory over $N$ steps $$\tau = \left[ (S_0, A_0, R_1, S_1), \ldots ,(S_{N-1}, A_{N-1}, R_N, S_N)\right]$$
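A minimal Python sketch (names and dummy values are illustrative) of how experience tuples and a trajectory could be stored:

from collections import namedtuple

# one experience tuple (S_n, A_n, R_{n+1}, S_{n+1})
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])

# dummy data standing in for probe readings, control actions, and rewards
states = [0.0, 0.1, 0.2, 0.3]   # S_0 ... S_N
actions = [1.0, -1.0, 0.5]      # A_0 ... A_{N-1}
rewards = [0.2, 0.4, 0.6]       # R_1 ... R_N

# the trajectory is the ordered list of N experience tuples
trajectory = [Experience(s, a, r, s_next)
              for s, a, r, s_next in zip(states[:-1], actions, rewards, states[1:])]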
return - dealing with sequential feedback
$$ G_n = R_{n+1} + R_{n+2} + \ldots + R_N $$
discounted return $$ G_n = R_{n+1} + \gamma R_{n+2} + \gamma^2 R_{n+3} + \ldots + \gamma^{N-n-1}R_N $$
$\gamma$ - discounting factor, typically $\gamma = 0.99$
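A short sketch of how the discounted return could be computed by iterating backwards over the rewards (using the recursion $G_n = R_{n+1} + \gamma G_{n+1}$ implied by the definition above):

def discounted_returns(rewards, gamma=0.99):
    """Compute G_n = R_{n+1} + gamma*R_{n+2} + ... for n = 0, ..., N-1."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # iterate backwards: G_n = R_{n+1} + gamma * G_{n+1}
    for n in reversed(range(len(rewards))):
        g = rewards[n] + gamma * g
        returns[n] = g
    return returns

# example: three steps with constant reward
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.99))  # [2.9701, 1.99, 1.0]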
learning what to expect in a given state - value function loss
$$ L_V = \frac{1}{N_\tau N} \sum\limits_{\tau = 1}^{N_\tau}\sum\limits_{n = 1}^{N} \left( V(S_n^\tau) - G_n^\tau \right)^2 $$
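A hedged PyTorch sketch of this loss; the tensors are placeholders standing in for $V(S_n^\tau)$ and $G_n^\tau$ collected over all trajectories:

import torch

def value_loss(values: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    """Mean squared error between predicted values V(S_n) and returns G_n."""
    return ((values - returns) ** 2).mean()

# placeholder data: N_tau = 2 trajectories with N = 4 steps each
values = torch.rand(2, 4)    # V(S_n^tau) from the value network
returns = torch.rand(2, 4)   # G_n^tau computed from the sampled rewards
print(value_loss(values, returns))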
Was the selected action a good one?
$$\delta_n = R_{n+1} + \gamma V(S_{n+1}) - V(S_n) $$ $$\delta_n + \gamma \delta_{n+1} = R_{n+1} + \gamma R_{n+2} + \gamma^2 V(S_{n+2}) - V(S_n) $$
$$ A_n^{GAE} = \sum\limits_{l=0}^{N-n-1} (\gamma \lambda)^l \delta_{n+l} $$
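A minimal sketch of the GAE recursion under the indexing above; rewards[n] holds $R_{n+1}$ and values holds $V(S_0), \ldots, V(S_N)$ (the $\lambda$ value is chosen for illustration only):

def gae(rewards, values, gamma=0.99, lam=0.97):
    """Generalized advantage estimate A_n^GAE = sum_l (gamma*lam)^l * delta_{n+l}."""
    advantages = [0.0] * len(rewards)
    adv = 0.0
    for n in reversed(range(len(rewards))):
        # one-step TD error delta_n = R_{n+1} + gamma*V(S_{n+1}) - V(S_n)
        delta = rewards[n] + gamma * values[n + 1] - values[n]
        adv = delta + gamma * lam * adv
        advantages[n] = adv
    return advantages

# example with dummy rewards and values; values has one more entry than rewards
print(gae([1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.0]))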
The policy network parametrizes a PDF over possible actions.
make good actions more likely - policy objective function
$$ J_\pi = \frac{1}{N_\tau N} \sum\limits_{\tau = 1}^{N_\tau}\sum\limits_{n = 1}^{N} \mathrm{min}\left[ \frac{\pi(A_n|S_n)}{\pi^{old}(A_n|S_n)} A^{GAE,\tau}_n, \mathrm{clamp}\left(\frac{\pi(A_n|S_n)}{\pi^{old}(A_n|S_n)}, 1-\epsilon, 1+\epsilon\right) A^{GAE,\tau}_n\right] $$
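A PyTorch sketch of the clipped surrogate objective; the log-probabilities and advantages are assumed to be flattened over all trajectories and steps:

import torch

def ppo_objective(log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped PPO objective; maximize this (or minimize its negative)."""
    ratio = torch.exp(log_probs - old_log_probs)        # pi(A|S) / pi_old(A|S)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)  # clamp(..., 1-eps, 1+eps)
    return torch.min(ratio * advantages, clipped * advantages).mean()

# placeholder tensors for N_tau * N state-action pairs
log_probs = torch.randn(8, requires_grad=True)
old_log_probs = log_probs.detach() + 0.1 * torch.randn(8)
advantages = torch.randn(8)
loss = -ppo_objective(log_probs, old_log_probs, advantages)  # gradient ascent via descent
loss.backward()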
Idea: replace CFD with model(s) in some episodes
Challenge: dealing with surrogate model errors
# main training loop: replace CFD episodes with model rollouts when possible
for e in episodes:
    if not models_reliable():
        # fall back to the CFD environment and refit the model ensemble
        sample_trajectories_from_simulation()
        update_models()
    else:
        # cheap rollouts generated by the surrogate models
        sample_trajectories_from_models()
    # PPO update of the policy (and value) networks in every episode
    update_policy()
Based on Model-Ensemble TRPO (ME-TRPO).
auto-regressive surrogate models with weights $\theta_m$
$$ m_{\theta_m} : (\underbrace{S_{n-d}, \ldots, S_{n-1}, S_n}_{\hat{S}_n}, A_n) \rightarrow (S_{n+1}, R_{n+1}) $$
$\mathbf{x}_n = [\hat{S}_n, A_n]$ and $\mathbf{y}_n = [S_{n+1}, R_{n+1}]$
$$ L_m = \frac{1}{|D|}\sum\limits_{i=1}^{|D|} \left\| \mathbf{y}_i - m_{\theta_m}(\mathbf{x}_i)\right\|_2^2 $$
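A simplified PyTorch sketch of one ensemble member; the number of probes, actuators, past states, and neurons are illustrative assumptions, not the settings used here:

import torch

class SurrogateModel(torch.nn.Module):
    """m_theta: (S_{n-d}, ..., S_n, A_n) -> (S_{n+1}, R_{n+1})."""
    def __init__(self, n_states=14, n_actions=3, d=2, n_neurons=64):
        super().__init__()
        n_in = (d + 1) * n_states + n_actions   # stacked past states plus action
        n_out = n_states + 1                    # next state plus scalar reward
        self.net = torch.nn.Sequential(
            torch.nn.Linear(n_in, n_neurons), torch.nn.ReLU(),
            torch.nn.Linear(n_neurons, n_neurons), torch.nn.ReLU(),
            torch.nn.Linear(n_neurons, n_out),
        )

    def forward(self, x):
        return self.net(x)

# MSE loss L_m over a batch of (x_i, y_i) pairs from the replay data D
model = SurrogateModel()
x = torch.rand(32, 3 * 14 + 3)   # [S_hat_n, A_n]
y = torch.rand(32, 14 + 1)       # [S_{n+1}, R_{n+1}]
loss = ((y - model(x)) ** 2).mean()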
How to sample from the ensemble?
When are the models reliable?
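One possible answer, sketched here as an assumption rather than as the scheme used in this work: draw a random ensemble member per model-based trajectory, and judge reliability from the validation losses of the ensemble members (cf. $N_\mathrm{thr}$ below):

import random

def sample_trajectories_from_models(models, n_trajectories, rollout):
    # assumption: draw one random ensemble member per model-based trajectory
    return [rollout(random.choice(models)) for _ in range(n_trajectories)]

def models_reliable(val_losses, threshold, n_thr):
    # assumption: the ensemble counts as reliable if at least N_thr members
    # stay below a validation-loss threshold
    return sum(loss < threshold for loss in val_losses) >= n_thr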
Fluidic pinball at $Re=\bar{U}_\mathrm{in}d/\nu = 100$; $\omega^\ast_i = \omega_i d/\bar{U}_\mathrm{in} \in [-0.5, 0.5]$.
reward at step $n$
$$ c_x = \sum\limits_{i=1}^3 c_{x,i},\quad c_y = \sum\limits_{i=1}^3 c_{y,i} $$
$$ R_n = 1.5 - (c_{x,n} + 0.5 |c_{y,n}|) $$
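A small sketch of this reward computation with dummy force coefficients for the three cylinders:

def compute_reward(cx_i, cy_i):
    """R_n = 1.5 - (c_x + 0.5*|c_y|) with c_x, c_y summed over the three cylinders."""
    cx, cy = sum(cx_i), sum(cy_i)
    return 1.5 - (cx + 0.5 * abs(cy))

# dummy drag and lift coefficients of the three cylinders at step n
print(compute_reward([0.4, 0.3, 0.35], [0.05, -0.02, 0.01]))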
Mean reward $R$ per episode; $N_\mathrm{m}$ - ensemble size; $N_\mathrm{thr}$ - switching criterion.
Composition of model-based training time $T_\mathrm{MB}$ relative to model-free training time $T_\mathrm{MF}$.
Closed-loop control starts at $t=200\,\mathrm{s}$.
Snapshot of velocity fields (best policies).
toward realistic AFC applications
Slides
GitHub