Actor-Critic · Policy + Value
unseel.com · Advantage = Q − V · cuts gradient variance
[Figure: actor-critic diagram. Labels: State, Actor (policy π), Critic (value V), TD error δ. Legend: reward / positive advantage (+), negative advantage (−).]
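The quantities named in the diagram (state, actor, critic, TD error δ) can be tied together in a single update step. The sketch below is a minimal illustration rather than the method behind the original animation. It assumes a tabular setting, a softmax policy over per-state action preferences, and a one-step TD error used as a sampled stand-in for the advantage Q − V. The names `actor_critic_step`, `theta`, `V`, `alpha_actor`, `alpha_critic`, and `gamma` are hypothetical and chosen only for this example.

```python
import math


def softmax(prefs):
    """Turn a list of action preferences into probabilities."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, V, s, a, r, s_next, done,
                      gamma=0.99, alpha_actor=0.1, alpha_critic=0.1):
    """One tabular actor-critic update (illustrative sketch).

    theta[s] is a list of action preferences for state s (the actor).
    V[s] is the estimated value of state s (the critic).
    Returns the TD error delta.
    """
    # TD error: delta = r + gamma * V(s') - V(s).
    # Assumption: delta serves as a one-sample estimate of Q(s, a) - V(s).
    target = r + (0.0 if done else gamma * V[s_next])
    delta = target - V[s]

    # Policy probabilities, computed before any parameter changes.
    pi = softmax(theta[s])

    # Critic: move V(s) toward the TD target.
    V[s] += alpha_critic * delta

    # Actor: for a softmax policy, d log pi(a|s) / d theta[s][b]
    # equals 1[b == a] - pi[b]; scale it by delta.
    for b in range(len(theta[s])):
        grad_log = (1.0 if b == a else 0.0) - pi[b]
        theta[s][b] += alpha_actor * delta * grad_log

    return delta


# Usage: two states, two actions, everything initialised to zero.
theta = [[0.0, 0.0], [0.0, 0.0]]
V = [0.0, 0.0]
delta = actor_critic_step(theta, V, s=0, a=1, r=1.0, s_next=1, done=False)
```

With a positive δ the chosen action's preference rises and the other falls. A negative δ would push them the opposite way, which matches the "+" and "−" advantage cases in the legend.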