Actor-Critic

Two main components in policy gradient methods are the policy model and the value function. It makes a lot of sense to learn the value function in addition to the policy, since knowing the value function can assist the policy update, for example by reducing gradient variance in vanilla policy gradients. That is exactly what the Actor-Critic method does.

Actor-critic methods consist of two models, which may optionally share parameters:

- Critic: updates the value function parameters $w$; depending on the algorithm this could be the action-value $Q_w(s, a)$ or the state-value $V_w(s)$.
- Actor: updates the policy parameters $\theta$ of $\pi_\theta(a \vert s)$, in the direction suggested by the critic.
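As an illustration of what "two models that may optionally share parameters" can look like, here is a minimal sketch assuming a PyTorch-style setup; the module names, hidden size, and shared-torso design are illustrative choices, not from the note:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Sketch: actor pi_theta(a|s) and critic Q_w(s, a) sharing a feature torso."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())  # shared parameters
        self.actor_head = nn.Linear(hidden, n_actions)    # logits defining pi_theta(a|s)
        self.critic_head = nn.Linear(hidden, n_actions)   # one Q_w(s, a) estimate per action

    def forward(self, obs: torch.Tensor):
        h = self.torso(obs)
        policy = torch.distributions.Categorical(logits=self.actor_head(h))
        q_values = self.critic_head(h)
        return policy, q_values
```

If the two models do not share parameters, one would simply instantiate two independent networks instead of a shared torso.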

Let's see how it works in a simple action-value actor-critic algorithm:

1. Initialize $s, \theta, w$ at random; sample $a \sim \pi_\theta(a \vert s)$.
2. For $t = 1 \dots T$:
   1. Sample the reward $r_t \sim R(s, a)$ and the next state $s' \sim P(s' \vert s, a)$;
   2. Then sample the next action $a' \sim \pi_\theta(a' \vert s')$;
   3. Update the policy parameters: $\theta \leftarrow \theta + \alpha_\theta Q_w(s, a) \nabla_\theta \ln \pi_\theta(a \vert s)$;
   4. Compute the correction (TD error) for the action-value at time $t$: $\delta_t = r_t + \gamma Q_w(s', a') - Q_w(s, a)$, and use it to update the parameters of the action-value function: $w \leftarrow w + \alpha_w \delta_t \nabla_w Q_w(s, a)$;
   5. Update $a \leftarrow a'$ and $s \leftarrow s'$.

Two learning rates, $\alpha_\theta$ and $\alpha_w$, are predefined for the policy and value function parameter updates respectively.
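To make the update rules concrete, below is a minimal, self-contained sketch of this loop in NumPy. The tiny two-state MDP, the tabular critic, the softmax policy, and all hyperparameters ($\gamma$, $\alpha_\theta$, $\alpha_w$, step count) are made up for illustration and are not part of the note:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP, made up purely for illustration.
n_states, n_actions = 2, 2
P = np.array([[[0.8, 0.2], [0.2, 0.8]],    # P[s, a, s'] transition probabilities
              [[0.6, 0.4], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],                  # R[s, a] expected rewards
              [0.0, 2.0]])

gamma = 0.9
alpha_theta, alpha_w = 0.01, 0.05          # the two learning rates

theta = np.zeros((n_states, n_actions))    # softmax policy: pi_theta(a|s) = softmax(theta[s])
w = np.zeros((n_states, n_actions))        # tabular critic:  Q_w(s, a) = w[s, a]

def pi(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

# Initialize s and sample a ~ pi_theta(a|s)
s = rng.integers(n_states)
a = rng.choice(n_actions, p=pi(s))

for t in range(10_000):
    # Sample reward r_t ~ R(s, a) and next state s' ~ P(s'|s, a)
    r = R[s, a] + rng.normal(scale=0.1)
    s_next = rng.choice(n_states, p=P[s, a])
    # Sample the next action a' ~ pi_theta(a'|s')
    a_next = rng.choice(n_actions, p=pi(s_next))

    # Actor: theta <- theta + alpha_theta * Q_w(s, a) * grad_theta ln pi_theta(a|s)
    grad_log_pi = -pi(s)                   # softmax log-prob gradient: onehot(a) - pi(.|s)
    grad_log_pi[a] += 1.0
    theta[s] += alpha_theta * w[s, a] * grad_log_pi

    # Critic: TD error delta_t = r_t + gamma * Q_w(s', a') - Q_w(s, a)
    delta = r + gamma * w[s_next, a_next] - w[s, a]
    w[s, a] += alpha_w * delta             # grad_w Q_w(s, a) = 1 for a tabular critic

    # Update a <- a' and s <- s'
    s, a = s_next, a_next

print("greedy action of the learned policy per state:", theta.argmax(axis=1))
```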


uid: 202009181412 tags: #knowledge


Date
February 22, 2023