Dreamer

Shared World Models


Experiments

Dynamics

Latent Imagination

Reparameterization

RSSM

Imagination Horizon

Analytic Gradients

Dreamer V1


The goal of this article is to clarify the papers Dream to Control: Learning Behaviors by Latent Imagination and ...

In the first paper, the authors repeatedly write

backpropagate through the neural network dynamics

or variants of that. I found that an unclear choice of words, especially on a first reading, because you don't backpropagate through neural network dynamics, do you? You simply backpropagate gradients through a neural network. So what do they mean?

Well, as it turns out, explaining that phrase explains a lot of the paper; working through it touches on nearly everything in it. That is what I will try to do in this article.

You might expect an IPython notebook here, but notebooks mainly run through source code, while here I try to run through the paper and clarify some things. Some of these clarifications are references to external sources; some are probably basic for many readers but may not be active knowledge for others. I also include a few pointers into the source code, which is fairly concise. This won't help us achieve our main mission, which is to change `dm_control` to `RLBench`, so this article doesn't have priority, but it will hopefully help with getting started with the source code.

Dynamics

The network encodes the MDP into an abstract representation, which then has its own dynamics. These latent dynamics appear to follow the original dynamics in lockstep; indeed, abstracting over time is left as future work in this paper.

However, the model is Markovian:

"The latent dynamics define a Markov decision process (MDP; Sutton, 1991) that is fully observed because the compact model states s_t are Markovian."

So the model can be sampled deep into the future.

In sum, "the neural network latent dynamics" are exactly that: the learned dynamics of the latent states, which the network can roll forward on its own.
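To make that concrete, here is a minimal sketch of sampling the latent model into the future. This is PyTorch-style pseudocode of the idea, not the actual implementation: `transition_model` and `action_model` are placeholder names for the learned transition and action models, and the `.rsample()` calls assume they return reparameterized distributions (see the Reparameterization section below).

```python
import torch

def imagine_rollout(transition_model, action_model, state, horizon=15):
    # Roll the learned latent dynamics forward without touching the environment.
    # Because the latent states are Markovian, each step only needs the previous
    # latent state and the sampled action.
    states, actions = [], []
    for _ in range(horizon):
        action = action_model(state).rsample()             # reparameterized action sample
        state = transition_model(state, action).rsample()  # sample the next latent state
        states.append(state)
        actions.append(action)
    return torch.stack(states), torch.stack(actions)
```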

Experiments

Experiment 1

The sparse reward didn't seem to produce the desired results, so we implemented a dense reward. This experiment uses it, with a single 64×64 pixel camera, running on gcloud, on the ReachTarget task.

Experiment 2

As in Experiment 1, this experiment uses the dense reward, a single 64×64 pixel camera, gcloud, and ReachTarget. RLBench originally has objects that were detectable by a 128×128 camera. To make sure a 64×64 camera can see the target, we doubled the target's size. We also removed two distractor objects to make this task easier to learn. This does work.

Experiment 3

We started an implementation of imitation learning. It doesn't use the real reward. Instead, the prefill replay_buffer is filled with real actions and observations, and the reward is based on the episode length: 0 if done is not reached, discounted if it is.
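As a sketch of that reward scheme; the function name, the discount factor, and the exact form of the discounting are illustrative, not taken from the actual implementation:

```python
def imitation_reward(done, step, gamma=0.99):
    # Zero while the episode is still running; a discounted bonus once `done`
    # is reached, so reaching the goal in fewer steps yields a higher reward.
    return gamma ** step if done else 0.0
```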

Experiment 4

We also began an implementation of shared world models: the action model is a decoder, and in this experiment we replaced the final layer of that decoder when the robot arm changed, so that each of the two arms has its own action parameterization (sketched at the end of this experiment).
The actions in the original model are used in a few places:

  • the representation model (a_{t-1})
  • the transition model (a_{t-1})
  • to fill the prefill buffer
  • to imagine trajectories {(s_τ, a_τ)}_{τ=t}^{t+H} from each s_t
  • to interact with the environment
  • when adding experience to the dataset

In this case, imagining trajectories can occur based on embedded actions; in fact this already happens, though it is a little experimental. Sometimes encoded actions cannot be used, for example when interacting with the environment. When using the dynamics model to generate additional rollouts, the format of the action vectors may not matter.

There can also be variables in the code called action, and we have to look carefully at whether, at that place in the code, it is a good idea to encode the actions. Actions also form a sequence, so we need to think about whether we want to encode sequences or individual actions, whether this matters, and so on.
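Below is a minimal sketch of the final-layer swap for the shared action decoder, assuming a PyTorch implementation; the class name, arm names, and dimensions are made up for illustration:

```python
import torch
from torch import nn

class SharedActionDecoder(nn.Module):
    # Shared trunk with one output head per robot arm, so only the final layer
    # changes when the arm (and hence the action parameterization) changes.
    def __init__(self, feature_dim=230, hidden_dim=300, action_dims=None):
        super().__init__()
        action_dims = action_dims or {"arm_a": 8, "arm_b": 9}  # illustrative sizes
        self.trunk = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ELU(),
        )
        self.heads = nn.ModuleDict(
            {arm: nn.Linear(hidden_dim, dim) for arm, dim in action_dims.items()})

    def forward(self, features, arm):
        # Pick the output head belonging to the current embodiment.
        return torch.tanh(self.heads[arm](self.trunk(features)))
```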

Experiment 5: Behavior Learning

Comparison of action selection schemes on the continuous control task 'ReachTarget' across 5 seeds. We compare Dreamer, which learns both an action model and a value model from latent rollouts, to a variant that learns only an action model in latent rollouts.

Experiment 6: Representation Learning

On the sparse reward tasks, Hafner et al. report outperformance by D4PG (at 1e9 steps), but also competitive performance from Dreamer with a contrastive loss for representation learning.

Experiment 7: Representation Learning

On the sparse reward tasks, Hafner et al. report competitive performance from Dreamer with an action repeat of 4.

Experiment 8: Behavior Learning

Latent Imagination

Then we have:

  • latent imagination

  • hypothetical trajectories in the compact latent space of the world model

These are synonymous and, given the lack of actual imagery, also a poor choice of words; "rollout" or "forward pass" would be better. "Latent imagination" is a phrase mainly used in the title. You might as well ignore it, just like "Dreamer" (PlaNet 1.2). Imagination is effectively sampling a predictive model, except that the samples drawn from a probability distribution (for example vectors of ordinary scalar values, floats) are then filled in for the model's input variables, much like assigning a value to a mathematical function argument such as x in y(x) = 2*x, for example setting x to 9 to obtain 18. In imagination this happens repeatedly, each time collecting the outputs and reusing them in the next step, so sampling becomes simulation, a constructed sampling program. The question is how this sampling construction precisely proceeds.
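A toy version of that repeated sampling, just to fix the picture (imagination replaces 2*x with learned networks and x with the latent state):

```python
import random

x = 9.0
for step in range(3):
    x = 2 * x + random.gauss(0.0, 0.1)  # apply the function, plus a sampled disturbance
    print(step, x)                      # each output becomes the next input
```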

The answer, I think, mainly relates to the learning of the value model. The action model learning follows the value model (and vice versa) and uses analytic gradients (see below), but the action model is just a state -> action function. Value learning brings together the value estimates and the value model: it is a regression of (neural) value model predictions onto value estimates, and these estimates are the aforementioned constructions. The whole construction process occurs in the neural representation: value estimates depend on rewards and value predictions, which depend on represented states, which in turn depend on represented actions. Despite all these operations in "abstract algebra", it is, at least in this iteration of PlaNet, still a stepwise procedure (in 2021, Hafner abstracts over time).
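A sketch of that construction: the value estimates can be built backwards over the imagined trajectory with the usual lambda-return recursion, which is equivalent to the V_λ definition in the paper; the function and variable names here are mine, not the paper's or the codebase's.

```python
import torch

def lambda_returns(rewards, values, bootstrap, gamma=0.99, lam=0.95):
    # rewards[t] and values[t] belong to imagined step t; `bootstrap` is the
    # value prediction for the state just beyond the imagination horizon.
    # Recursion: V_lam(s_t) = r_t + gamma * ((1 - lam) * v(s_{t+1}) + lam * V_lam(s_{t+1})).
    returns, next_return = [], bootstrap
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else bootstrap
        next_return = rewards[t] + gamma * ((1 - lam) * next_value + lam * next_return)
        returns.append(next_return)
    return torch.stack(returns[::-1])

# Value learning regresses the value model onto these (gradient-stopped) estimates,
# e.g. value_loss = 0.5 * ((value_model(states) - targets.detach()) ** 2).mean(),
# while the action model maximizes the same estimates by backpropagating through
# the learned dynamics, reward, and value networks (the analytic gradients below).
```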

Control with latent dynamics

Embed to Control (E2C; Watter et al., 2015) and Robust Locally-Linear Controllable Embedding (RCE; Banijamali et al., 2017) embed images to predict forward in a compact space and solve simple tasks. World Models (Ha and Schmidhuber, 2018) learn latent dynamics in a two-stage process to evolve linear controllers in imagination. PlaNet (Hafner et al., 2018) learns them jointly and solves visual locomotion tasks by latent online planning. Stochastic Optimal control with LAtent Representations (SOLAR; Zhang et al., 2019) solves robotic tasks via guided policy search in latent space. Imagination-Augmented Agents (I2A; Weber et al., 2017) hands imagined trajectories to a model-free policy, while Lee et al. (2019) and Gregor et al. (2019) learn belief representations to accelerate model-free agents. All these methods were either unable to do everything inside the represented RL-MDP or unable to do so over many time steps.

Reparameterization

Reparameterization is a key trick in this paper and worth taking note of. To understand this aspect of Dreamer, I paraphrase Bengio et al.: traditional neural networks implement a deterministic transformation of some input variables x, and once sampling is involved, it is not possible to attribute gradients properly. I think it was Kingma et al. who conceived of a method to make this possible. For the Gaussian it is simple, but for other distributions less so (it will be learned in time). I replace the Bernoulli distribution by a Continuous Bernoulli.
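A minimal sketch of the trick for the Gaussian case (PyTorch). As far as I can tell, torch.distributions also ships a ContinuousBernoulli with a reparameterized sampler, which matches the substitution mentioned above.

```python
import torch
from torch.distributions import Normal

mean = torch.zeros(3, requires_grad=True)
std = torch.ones(3, requires_grad=True)

# Manual reparameterization: the noise is drawn outside the graph, so the
# sample is a differentiable function of mean and std.
eps = torch.randn(3)
sample = mean + std * eps
sample.sum().backward()
print(mean.grad, std.grad)         # gradients reach the distribution parameters

# The same via torch.distributions: rsample() is the reparameterized sampler,
# whereas sample() would block the gradients.
sample2 = Normal(mean, std).rsample()
```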

RSSM

PlaNet, people, profit

Dreamer is a different name for a second iteration of PlaNet (This boy doesn't reach to the ground). I am understating it because Dreamer is an overstatement.

Imagined multi-step returns

Value Prediction Network (Oh et al., 2017), Model-based Value Estimation (Feinberg et al., 2018), and STochastic Ensemble Value Expansion (Buckman et al., 2018) learn dynamics for multi-step Q-learning from a replay buffer. AlphaGo (Silver et al., 2017) combines predictions of actions and state values with planning, assuming access to the true dynamics. Also assuming access to the dynamics, Plan Online Learn Offline (Lowrey et al., 2018) plans to explore by learning a value ensemble. MuZero (Schrittwieser et al., 2019) learns task-specific reward and value models to solve challenging tasks but requires large amounts of experience. Probabilistic Ensembles with Trajectory Sampling (Chua et al., 2018), Visual Model Predictive Control (Ebert et al., 2017), and PlaNet (Hafner et al., 2018) plan online using derivative-free optimization. Model-based POlicy PLannINg (Wang and Ba, 2019) improves over online planning by self-imitation. Piergiovanni et al. (2018) learn robot policies by imagination with a latent dynamics model. Planning with neural network gradients was shown on small problems (Schmidhuber, 1990; Henaff et al., 2018) but has been challenging to scale (Parmas et al., 2019).

Analytic Value Gradients

Deterministic Policy Gradient (Silver et al., 2014), Deep Deterministic Policy Gradient (Lillicrap et al., 2015), and Soft Actor-Critic (Haarnoja et al., 2018) leverage gradients of learned immediate action values to learn a policy by experience replay.

Stochastic Value Gradient (Heess et al., 2015) reduces the variance of model-free on-policy algorithms by analytic value gradients of one-step model predictions.

Concurrent work by Byravan et al. (2019) uses latent imagination with deterministic models for navigation and manipulation tasks.

Model-Ensemble Trust Region Policy Optimization (Kurutach et al., 2018) accelerates an otherwise model-free agent via gradients of predicted rewards for proprioceptive inputs.

Distilled Gradient Based Planning (Henaff et al., 2017; 2019) uses model gradients for online planning in simple tasks.