Shared World Models
The goal of this article is to clarify the papers Dream to Control: Learning Behaviors by Latent Imagination and ...
In the first paper, the authors repeatedly write "backpropagate through the neural network dynamics", or variants of that. I found it an unclear choice of words, especially on a first reading. Because you don't backpropagate through neural network dynamics, do you? You simply backpropagate gradients through a neural network. So what do they mean?
Well, as it turns out, explaining that phrase explains a lot of the paper: it touches on nearly everything in it. That is what I will try to do in this article.
And you might expect an IPython notebook here. But notebooks mainly run through source code, while here I try to run through the paper and clarify some things. Some of these clarifications are references to external sources; some are probably basic for many readers but may not be active knowledge for others. I also include a few pointers into the source code, which is fairly concise. This won't help us achieve our main mission, which is to change `dm_control` to `RLBench`, so this article doesn't have priority, but it will hopefully help with getting started with the source code.
The network encodes the MDP into an abstract representation, which then has its own dynamics. It would seem that these dynamics follow the original dynamics in lockstep. Indeed, abstracting over time is left as future work as far as this paper goes.
However, the model is Markovian: "The latent dynamics define a Markov decision process (MDP; Sutton, 1991) that is fully observed because the compact model states s_t are Markovian." So the model can be sampled deep into the future.
In sum, "leveraging the neural network latent dynamics" is what it is.
Experiment 1
The sparse reward didn't seem to produce the desired results, so we implemented a dense reward. This experiment uses it, with one 64-pixel camera, gcloud, and ReachTarget.
Experiment 2
The sparse reward didn't seem to produce the desired results, so we implemented a dense reward. This experiment uses it, with one 64-pixel camera, gcloud, and ReachTarget. RLBench originally has objects that were detectable by a 128-pixel camera. To make sure a 64-pixel camera can see the target, we doubled its size. We also removed two distractor objects, to make this task easier to learn. This does work.
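As a rough illustration of what such a dense reward could look like for ReachTarget (negative distance between the gripper tip and the target), here is a hypothetical sketch; the positions are placeholders and the reward actually used in the experiments may differ.

```python
# Hypothetical dense reward for a reaching task: the closer the gripper tip
# is to the target, the higher (less negative) the reward.
import numpy as np

def dense_reward(tip_position: np.ndarray, target_position: np.ndarray) -> float:
    return -float(np.linalg.norm(tip_position - target_position))

print(dense_reward(np.array([0.1, 0.2, 0.8]), np.array([0.1, 0.2, 1.0])))  # roughly -0.2
```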
Experiment 3
We started an implementation of imitation learning. It doesn't use the real reward. Instead, the prefill replay_buffer is filled with real actions and observations, and the reward is based on the episode length: 0 if done is not reached, a discounted reward if it is.
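As a rough sketch of such a reward, under the assumption that "discounted" means a terminal reward of 1 discounted by the number of steps taken (the actual implementation may differ):

```python
# Hypothetical episode-length based reward: 0 while the episode is running,
# gamma**step when done is reached, so shorter episodes score higher.
def imitation_reward(step: int, done: bool, gamma: float = 0.99) -> float:
    return gamma ** step if done else 0.0

print(imitation_reward(step=40, done=True))   # ~0.67
print(imitation_reward(step=40, done=False))  # 0.0
```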
Experiment 4
We also began an implementation of shared world models, that is, the normal action model is a decoder. In this experiment, we replaced the final layer of the decoder when the robot arm changed, where each of the two arms has its own action parameterization.
The actions in the original model are used in a few places. There can then be variables in the code called action, and we have to look carefully whether, at that place in the code, it is a good idea to encode the actions. Also, actions form a sequence, so we need to think carefully about whether we want to encode sequences or individual actions, whether this matters, and so on.
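Below is a minimal sketch, not the experiment's code, of the per-arm final layer idea described above: a decoder with a shared trunk and one output head per arm, so that swapping arms only swaps the last, arm-specific layer. The class name, sizes, and arm names are assumptions.

```python
# Sketch of an action decoder shared across robot arms: a shared trunk plus
# one arm-specific final layer per action parameterization.
from typing import Dict

import torch
import torch.nn as nn

class SharedActionDecoder(nn.Module):
    def __init__(self, feature_dim: int, action_dims: Dict[str, int]):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feature_dim, 128), nn.ELU())  # shared across arms
        self.heads = nn.ModuleDict({name: nn.Linear(128, dim)              # one head per arm
                                    for name, dim in action_dims.items()})

    def forward(self, features: torch.Tensor, arm: str) -> torch.Tensor:
        return self.heads[arm](self.trunk(features))

decoder = SharedActionDecoder(feature_dim=230, action_dims={"panda": 8, "sawyer": 9})
actions = decoder(torch.zeros(1, 230), arm="panda")  # only the "panda" head is used
print(actions.shape)  # torch.Size([1, 8])
```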
Experiment 5: Behavior Learning
Comparison of action selection schemes on the continuous control task 'ReachTarget' across 5 seeds. We compare Dreamer, which learns both actions and values in the latent rollout, to a variant that learns only actions in the latent rollout.
Experiment 6: Representation Learning
On the sparse reward tasks, Hafner et al. report that D4PG (at 1e9 steps) outperforms, but also that Dreamer with a contrastive loss for representation learning is competitive.
Experiment 7: Representation Learning
On the sparse reward tasks, Hafner et al. report competitive performance for Dreamer with an action repeat of 4.
Experiment 8: Behavior Learning
Then we have:
latent imagination
hypothetical trajectories in the compact latent space of the world model
The answer, I think, mainly relates to the learning of the value model. The action model learning follows the value model (and vice versa) and uses analytic gradients (see below), but the action model is just a state -> action function. Value learning brings together the value estimates and the value model: it is a regression of (neural) value model predictions onto value estimates, and these estimates are the constructions mentioned above. The whole construction process occurs in the neural representation: value estimates depend on the reward and value predictions, which depend on represented states, which in turn depend on represented actions. Despite all these operations in "abstract algebra", it is, at least in this iteration of PlaNet, still a stepwise procedure (in 2021, Hafner abstracts over time).
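Since value learning is the crux here, a small sketch of the value target may help: the λ-return targets are computed backwards over an imagined trajectory, and the value model is regressed onto them. The function below is my own paraphrase in plain numpy, not the paper's code; variable names are assumptions.

```python
# Lambda-return targets over an imagined trajectory, computed by the recursion
#   V_lambda(s_t) = r_t + gamma * ((1 - lam) * v(s_{t+1}) + lam * V_lambda(s_{t+1})),
# with V_lambda(s_H) = v(s_H). The value model is then regressed onto these
# targets (with gradients stopped through the targets in the paper).
import numpy as np

def lambda_returns(rewards, next_values, gamma=0.99, lam=0.95):
    """rewards[t] = r_t and next_values[t] = v(s_{t+1}) for t = 0..H-1."""
    targets = np.zeros(len(rewards))
    next_target = next_values[-1]        # bootstrap: V_lambda(s_H) = v(s_H)
    for t in reversed(range(len(rewards))):
        targets[t] = rewards[t] + gamma * ((1 - lam) * next_values[t] + lam * next_target)
        next_target = targets[t]
    return targets

rewards = np.zeros(15)
rewards[-1] = 1.0                        # a single imagined reward at the horizon
values = np.full(15, 0.5)                # predictions of the value model
print(lambda_returns(rewards, values))   # targets the value model is trained to match
```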
Control with latent dynamics
Embed to Control (Watter et al., 2015) and Robust Locally-Linear Controllable Embedding (Banijamali et al., 2017) embed images to predict forward in a compact space to solve simple tasks. World Models (Ha and Schmidhuber, 2018) learn latent dynamics in a two-stage process to evolve linear controllers in imagination. PlaNet (Hafner et al., 2018) learns them jointly and solves visual locomotion tasks by latent online planning. Stochastic Optimal control with LAtent Representations (Zhang et al., 2019) solves robotic tasks via guided policy search in latent space. Imagination-Augmented Agents (Weber et al., 2017) hands imagined trajectories to a model-free policy, while Lee et al. (2019) and Gregor et al. (2019) learn belief representations to accelerate model-free agents. All these methods were unable to do everything in the represented RL-MDP or unable to do so over many time steps.
Reparametrization is a key trick for this paper and worth taking note of. To understand this aspect of Dreamer, I paraphrase Bengio et al.: traditional neural networks implement a deterministic transformation of some input variables x. When sampling is involved, it is not possible to attribute gradients properly. I think it was Kingma et al. who conceived of a method to make this possible. For the Gaussian it is simple, but for other distributions it is not (it will be learned in time). I replace the Bernoulli distribution by a Continuous Bernoulli.
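A minimal sketch of the Gaussian case: instead of sampling from N(mu, sigma^2) directly, sample the noise separately and shift and scale it, so that gradients can flow into the distribution parameters.

```python
# Reparameterization trick for a Gaussian: s = mu + sigma * eps with eps ~ N(0, 1).
# The sample is now a deterministic, differentiable function of mu and sigma.
import torch

mu = torch.zeros(3, requires_grad=True)
log_sigma = torch.zeros(3, requires_grad=True)

eps = torch.randn(3)                     # noise drawn independently of the parameters
sample = mu + log_sigma.exp() * eps      # differentiable w.r.t. mu and log_sigma

sample.sum().backward()
print(mu.grad, log_sigma.grad)           # both parameters receive gradients through the sample
```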
PlaNet, people, profit
Dreamer is a different name for a second iteration of PlaNet (This boy doesn't reach to the ground). I am understating it because Dreamer is an overstatement.
Value Prediction Network (Oh et al., 2017), Model-based Value Estimation (Feinberg et al., 2018), and STochastic Ensemble Value Expansion (Buckman et al., 2018) learn dynamics for multi-step Q-learning from a replay buffer. AlphaGo (Silver et al., 2017) combines predictions of actions and state values with planning, assuming access to the true dynamics. Also assuming access to the dynamics, Plan Online, Learn Offline (Lowrey et al., 2018) plans to explore by learning a value ensemble. MuZero (Schrittwieser et al., 2019) learns task-specific reward and value models to solve challenging tasks but requires large amounts of experience. Probabilistic Ensembles with Trajectory Sampling (Chua et al., 2018), Visual Model Predictive Control (Ebert et al., 2017), and PlaNet (Hafner et al., 2018) plan online using derivative-free optimization. Model-based POlicy PLannINg (Wang and Ba, 2019) improves over online planning by self-imitation. Piergiovanni et al. (2018) learn robot policies by imagination with a latent dynamics model. Planning with neural network gradients was shown on small problems (Schmidhuber, 1990; Henaff et al., 2018) but has been challenging to scale (Parmas et al., 2019).
Deterministic Policy Gradient (Silver et al., 2014), Deep Deterministic Policy Gradient (Lillicrap et al., 2015), and Soft Actor-Critic (Haarnoja et al., 2018) leverage gradients of learned immediate action values to learn a policy by experience replay.
Stochastic Value Gradient (Heess et al., 2015) reduces the variance of model-free on-policy algorithms by analytic value gradients of one-step model predictions.
Concurrent work by Byravan et al. (2019) uses latent imagination with deterministic models for navigation and manipulation tasks.
Model-Ensemble Trust-Region Policy Optimization (Kurutach et al., 2018) accelerates an otherwise model-free agent via gradients of predicted rewards for proprioceptive inputs.
Distilled Gradient Based Planning (Henaff et al., 2017; 2019) uses model gradients for online planning in simple tasks.