Actor-Critic

Two main components in policy gradient methods are the policy model and the value function. It makes a lot of sense to learn the value function in addition to the policy, since knowing the value function can assist the policy update, for example by reducing gradient variance in vanilla policy gradients. That is exactly what the Actor-Critic method does.

Actor-critic methods consist of two models, which may optionally share parameters:

- Critic: updates the value function parameters $w$; depending on the algorithm this could be the action-value $Q_w(s, a)$ or the state-value $V_w(s)$.
- Actor: updates the policy parameters $\theta$ of $\pi_\theta(a \vert s)$, in the direction suggested by the critic.
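As an illustration of what "two models that may optionally share parameters" can look like, here is a minimal sketch assuming a PyTorch-style setup; the module names, hidden size, and shared-torso design are illustrative choices, not from the note:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Sketch: actor pi_theta(a|s) and critic Q_w(s, a) sharing a feature torso."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())  # shared parameters
        self.actor_head = nn.Linear(hidden, n_actions)    # logits defining pi_theta(a|s)
        self.critic_head = nn.Linear(hidden, n_actions)   # one Q_w(s, a) estimate per action

    def forward(self, obs: torch.Tensor):
        h = self.torso(obs)
        policy = torch.distributions.Categorical(logits=self.actor_head(h))
        q_values = self.critic_head(h)
        return policy, q_values
```

If the two models do not share parameters, one would simply instantiate two independent networks instead of a shared torso.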

Let's see how it works in a simple action-value actor-critic algorithm:

1. Initialize $s, \theta, w$ at random; sample $a \sim \pi_\theta(a \vert s)$.
2. For $t = 1 \dots T$:
   1. Sample the reward $r_t \sim R(s, a)$ and the next state $s' \sim P(s' \vert s, a)$;
   2. Then sample the next action $a' \sim \pi_\theta(a' \vert s')$;
   3. Update the policy parameters: $\theta \leftarrow \theta + \alpha_\theta Q_w(s, a) \nabla_\theta \ln \pi_\theta(a \vert s)$;
   4. Compute the correction (TD error) for the action-value at time $t$: $\delta_t = r_t + \gamma Q_w(s', a') - Q_w(s, a)$, and use it to update the parameters of the action-value function: $w \leftarrow w + \alpha_w \delta_t \nabla_w Q_w(s, a)$;
   5. Update $a \leftarrow a'$ and $s \leftarrow s'$.

Two learning rates, $\alpha_\theta$ and $\alpha_w$, are predefined for the policy and value function parameter updates respectively.
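To make the update rules concrete, below is a minimal, self-contained sketch of this loop in NumPy. The tiny two-state MDP, the tabular critic, the softmax policy, and all hyperparameters ($\gamma$, $\alpha_\theta$, $\alpha_w$, step count) are made up for illustration and are not part of the note:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP, made up purely for illustration.
n_states, n_actions = 2, 2
P = np.array([[[0.8, 0.2], [0.2, 0.8]],    # P[s, a, s'] transition probabilities
              [[0.6, 0.4], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],                  # R[s, a] expected rewards
              [0.0, 2.0]])

gamma = 0.9
alpha_theta, alpha_w = 0.01, 0.05          # the two learning rates

theta = np.zeros((n_states, n_actions))    # softmax policy: pi_theta(a|s) = softmax(theta[s])
w = np.zeros((n_states, n_actions))        # tabular critic:  Q_w(s, a) = w[s, a]

def pi(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

# Initialize s and sample a ~ pi_theta(a|s)
s = rng.integers(n_states)
a = rng.choice(n_actions, p=pi(s))

for t in range(10_000):
    # Sample reward r_t ~ R(s, a) and next state s' ~ P(s'|s, a)
    r = R[s, a] + rng.normal(scale=0.1)
    s_next = rng.choice(n_states, p=P[s, a])
    # Sample the next action a' ~ pi_theta(a'|s')
    a_next = rng.choice(n_actions, p=pi(s_next))

    # Actor: theta <- theta + alpha_theta * Q_w(s, a) * grad_theta ln pi_theta(a|s)
    grad_log_pi = -pi(s)                   # softmax log-prob gradient: onehot(a) - pi(.|s)
    grad_log_pi[a] += 1.0
    theta[s] += alpha_theta * w[s, a] * grad_log_pi

    # Critic: TD error delta_t = r_t + gamma * Q_w(s', a') - Q_w(s, a)
    delta = r + gamma * w[s_next, a_next] - w[s, a]
    w[s, a] += alpha_w * delta             # grad_w Q_w(s, a) = 1 for a tabular critic

    # Update a <- a' and s <- s'
    s, a = s_next, a_next

print("greedy action of the learned policy per state:", theta.argmax(axis=1))
```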


uid: 202009181412 tags: #knowledge


Date
February 22, 2023