Actor-Critic · Policy + Value · unseel.com · Advantage = Q − V · cuts gradient variance
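The relation Advantage = Q − V can be made concrete with the one-step TD error δ = r + γ·V(s′) − V(s). When V is accurate, δ is a one-sample estimate of the advantage. The following is a minimal sketch, assuming a tabular critic and a tabular softmax actor. The function names, learning rates, and the two-state example are hypothetical and chosen only for illustration; they do not come from the source.

```python
import math

def td_error(reward, value_s, value_next, gamma=0.99, done=False):
    """One-step TD error: delta = r + gamma * V(s') - V(s)."""
    # No bootstrapping from the next state once the episode has ended.
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s

def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(prefs, values, s, a, r, s_next, done,
                      gamma=0.99, actor_lr=0.1, critic_lr=0.5):
    """Update the actor (prefs) and critic (values) in place; return delta.

    Assumed layout (illustrative): prefs[s][a] is a softmax preference,
    values[s] is the critic's estimate of V(s).
    """
    delta = td_error(r, values[s], values[s_next], gamma, done)
    # Critic: move V(s) toward the TD target.
    values[s] += critic_lr * delta
    # Actor: policy-gradient step, with delta standing in for the advantage.
    probs = softmax(prefs[s])
    for b in range(len(prefs[s])):
        grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
        prefs[s][b] += actor_lr * delta * grad_log_pi
    return delta

# Hypothetical two-state, two-action example.
prefs = [[0.0, 0.0], [0.0, 0.0]]
values = [0.0, 0.0]
delta = actor_critic_step(prefs, values, s=0, a=1, r=1.0, s_next=1, done=False)
```

A positive δ raises the preference for the action taken and a negative δ lowers it. Subtracting V(s) as a baseline is what reduces gradient variance compared with weighting the update by the raw return.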
[Interactive animation: Actor (policy π) and Critic (value V). Readouts: step, state, TD error δ. Legend: reward / advantage +, advantage −.]