Imitation Learning
In driving, observations are images from the car’s camera, and actions are how you turn the steering wheel.
You collect a large dataset of (observation, action) tuples and use supervised learning to predict actions from observations (behavioral cloning).
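A minimal sketch of the supervised-learning step, assuming PyTorch; the small MLP policy and the random placeholder tensors stand in for a real camera-image dataset and a CNN:

```python
# Behavioral cloning sketch: regress actions onto expert actions.
import torch
import torch.nn as nn

obs_dim, act_dim = 32, 1          # placeholder: flattened features -> steering angle

policy = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, act_dim),       # continuous action (steering)
)

# Hypothetical expert dataset of (observation, action) pairs.
obs = torch.randn(10_000, obs_dim)
act = torch.randn(10_000, act_dim)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()            # supervised regression loss

for epoch in range(10):
    for i in range(0, len(obs), 256):
        batch_o, batch_a = obs[i:i + 256], act[i:i + 256]
        loss = loss_fn(policy(batch_o), batch_a)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```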
- Behavioral cloning doesn’t work (in theory)
- Small mistakes compound into larger ones: a small mistake pushes the trajectory off the training distribution, and because the policy has never seen anything off the training trajectory, it becomes even more likely to make bigger mistakes.
- But it sometimes works in practice (with a lot of data and a few tricks)
- Why? This can partly be explained by the use of three camera angles (left, right, forward): the left and right cameras teach the car how to correct small mistakes, which helps it recover after it has veered off course.
- Mathematically, this “training error drift” comes from the fact that the training and test distributions are not the same: taking an action changes the observations you see, which changes the distribution.
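One way to make this precise (roughly following the standard DAgger analysis): training minimizes error under the expert’s observation distribution, but at test time the policy sees observations induced by its own actions,

$$p_{\text{data}}(o_t) \neq p_{\pi_\theta}(o_t).$$

If the policy errs with probability at most $\epsilon$ under $p_{\text{data}}(o_t)$, those errors compound over a horizon of $T$ steps, and the expected total number of mistakes can grow as $O(\epsilon T^2)$; matching the two distributions brings this back to $O(\epsilon T)$.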
Instead of trying to be clever about the policy, we can try to be clever about the underlying data distribution (so that the distribution of observations under the data is the same as the distribution of observations under the policy). This is a technique called DAgger (Dataset Aggregation).
DAgger steps (a code sketch follows the list):
- Train a policy $\pi_\theta(a_t \mid o_t)$ on human data $\mathcal{D} = \{o_1, a_1, \dots, o_N, a_N\}$
- Run the policy to get a policy dataset $\mathcal{D}_\pi = \{o_1, \dots, o_M\}$ (these observations come from $p_{\pi_\theta}(o_t)$)
- Ask a human to label $\mathcal{D}_\pi$ with actions $a_t$
- Aggregate: $\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{D}_\pi$
- Go back to step 1 and train the policy on the aggregated dataset.
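A sketch of the loop, where `train_policy`, `run_policy_collect_obs`, and `query_expert_labels` are hypothetical helpers standing in for the supervised-learning step, the environment rollouts, and the human labeler:

```python
# DAgger loop sketch; the three helper functions are placeholders.
def dagger(train_policy, run_policy_collect_obs, query_expert_labels,
           expert_obs, expert_acts, n_iters=10):
    # Step 1: start from the human demonstration data D = {(o_i, a_i)}.
    obs, acts = list(expert_obs), list(expert_acts)
    policy = train_policy(obs, acts)

    for _ in range(n_iters):
        # Step 2: run the current policy; these observations come from p_pi(o_t).
        new_obs = run_policy_collect_obs(policy)
        # Step 3: the hard part in practice -- ask a human to label them.
        new_acts = query_expert_labels(new_obs)
        # Step 4: aggregate D <- D union D_pi.
        obs += list(new_obs)
        acts += list(new_acts)
        # Step 5: retrain on the aggregated dataset.
        policy = train_policy(obs, acts)
    return policy
```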
In practice, though, step 3 is often the hard part: a human has to label every observation the policy visits, which is expensive.
Problems with imitation learning
- humans need to provide data, which is typically finite
- deep learning works best when data is plentiful
- humans are not good at providing some kinds of actions
- humans can learn autonomously; can our machines do the same?
uid: 202008311422 tags: #cs285