Imitation Learning
In driving, observations are images from the car’s camera, and actions are how you turn the steering wheel.
You collect a large dataset of (observation, action) tuples and use supervised learning to predict actions from observations (behavioral cloning).
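A minimal sketch of the supervised-learning step, assuming PyTorch; the small MLP policy and the random placeholder tensors stand in for a real camera-image dataset and a CNN:

```python
# Behavioral cloning sketch: regress actions onto expert actions.
import torch
import torch.nn as nn

obs_dim, act_dim = 32, 1          # placeholder: flattened features -> steering angle

policy = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, act_dim),       # continuous action (steering)
)

# Hypothetical expert dataset of (observation, action) pairs.
obs = torch.randn(10_000, obs_dim)
act = torch.randn(10_000, act_dim)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()            # supervised regression loss

for epoch in range(10):
    for i in range(0, len(obs), 256):
        batch_o, batch_a = obs[i:i + 256], act[i:i + 256]
        loss = loss_fn(policy(batch_o), batch_a)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```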
- Behavioral cloning doesn’t work (in theory)
- Small mistakes compound into larger ones: a small mistake pushes the trajectory off the training distribution, and because the policy has never seen anything off the training trajectory, it becomes even more likely to make bigger mistakes.
- But it sometimes works in practice (with a lot of data and a few tricks)
- Why? This can partly be explained by the use of three camera angles (left, right, forward): the left and right cameras teach the car how to correct small mistakes, which helps it recover after it has veered off course.
- Mathematically, this “training error drift” comes from the fact that the training and test distributions are not the same: taking an action changes the observations you see, which changes the distribution.
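One way to make this precise (roughly following the standard DAgger analysis): training minimizes error under the expert’s observation distribution, but at test time the policy sees observations induced by its own actions,

$$p_{\text{data}}(o_t) \neq p_{\pi_\theta}(o_t).$$

If the policy errs with probability at most $\epsilon$ under $p_{\text{data}}(o_t)$, those errors compound over a horizon of $T$ steps, and the expected total number of mistakes can grow as $O(\epsilon T^2)$; matching the two distributions brings this back to $O(\epsilon T)$.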
Instead of trying to be clever about the policy, we can try to be clever about the underlying data distribution (so that the distribution of observations under the data is the same as the distribution of observations under the policy). This is a technique called DAgger (Dataset Aggregation).
DAgger steps (a code sketch follows the list):
- Train a policy $\pi_\theta(a_t \mid o_t)$ on human data $\mathcal{D} = \{o_1, a_1, \dots, o_N, a_N\}$
- Run the policy to get a policy dataset $\mathcal{D}_\pi = \{o_1, \dots, o_M\}$ (these observations come from $p_{\pi_\theta}(o_t)$)
- Ask a human to label $\mathcal{D}_\pi$ with actions $a_t$
- Aggregate: $\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{D}_\pi$
- Go back to step 1 and train the policy on the aggregated dataset.
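A sketch of the loop, where `train_policy`, `run_policy_collect_obs`, and `query_expert_labels` are hypothetical helpers standing in for the supervised-learning step, the environment rollouts, and the human labeler:

```python
# DAgger loop sketch; the three helper functions are placeholders.
def dagger(train_policy, run_policy_collect_obs, query_expert_labels,
           expert_obs, expert_acts, n_iters=10):
    # Step 1: start from the human demonstration data D = {(o_i, a_i)}.
    obs, acts = list(expert_obs), list(expert_acts)
    policy = train_policy(obs, acts)

    for _ in range(n_iters):
        # Step 2: run the current policy; these observations come from p_pi(o_t).
        new_obs = run_policy_collect_obs(policy)
        # Step 3: the hard part in practice -- ask a human to label them.
        new_acts = query_expert_labels(new_obs)
        # Step 4: aggregate D <- D union D_pi.
        obs += list(new_obs)
        acts += list(new_acts)
        # Step 5: retrain on the aggregated dataset.
        policy = train_policy(obs, acts)
    return policy
```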
In practice, though, step 3 is often the hard part: a human has to label every observation the policy visits, which is expensive.
Problems with imitation learning
- humans need to provide data, which is typically finite
- deep learning works best when data is plentiful
- humans are not good at providing some kinds of actions
- humans can learn autonomously; can our machines do the same?
uid: 202008311422 tags: #cs285