Much of model-based reinforcement learning involves learning a model of an agent's world, and training an agent to leverage this model to perform a task more efficiently. While these models are demonstrably useful for agents, every naturally occurring model of the world of which we are aware – e.g., a brain – arose as the byproduct of competing evolutionary pressures for survival, not of minimizing a supervised forward-predictive loss via gradient descent. That useful models can arise out of the messy and slow optimization process of evolution suggests that forward-predictive modeling can emerge as a side-effect of optimization under the right circumstances, and crucially, that the optimization process need not involve an explicit forward-predictive loss. In this work, we introduce a modification to traditional reinforcement learning which we call observational dropout, whereby we limit the agent's ability to observe the real environment at each timestep. We show that the emergent world model, while not explicitly trained to predict the future, can help the agent learn key skills required to perform well in its environment.
We hypothesize that explicit forward prediction is not required to learn useful models of the world, and that prediction may arise as an emergent property if it is useful for the agent to perform its task. To encourage prediction to emerge, we introduce a constraint: at each timestep, the agent is only allowed to observe its environment with some probability p. To cope with this constraint, we give our agent an internal model that takes the previous observation and the action taken as input, and outputs a generated observation. The agent's internal observation is thus either the ground truth observation, with probability p (referred to as the peek probability), or the output of its model, with probability 1 − p. The agent's policy acts on this internal observation without knowing whether it is real or generated by the internal model.
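To make this concrete, below is a minimal sketch of one episode under observational dropout. The `env`, `policy`, and `model` callables and the Gym-style `reset`/`step` interface are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rollout_with_observational_dropout(env, policy, model, peek_prob, rng=None):
    """Run one episode in which the agent only peeks at the real observation
    with probability `peek_prob`; otherwise its internal model fills the gap."""
    rng = rng or np.random.default_rng(0)
    obs = env.reset()                # the first observation is always real
    total_reward, done = 0.0, False
    while not done:
        action = policy(obs)         # the policy cannot tell real from generated
        real_obs, reward, done, _ = env.step(action)
        total_reward += reward
        if rng.random() < peek_prob:
            obs = real_obs           # peek: resynchronize with reality
        else:
            obs = model(obs, action) # generate the next internal observation
    return total_reward
```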
In this work, we investigate to what extent world models trained with policy gradients behave like forward-predictive models, by restricting the agent's ability to observe its environment. By jointly learning the policy and the model to perform well on the given task, we can directly optimize the model without ever explicitly training it for forward prediction. This frees the model to generate whatever "predictions" are useful for the policy to perform well on the task, even if they are not realistic. The models that emerge under our constraints capture the essence of what the agent needs to see from its world. We conduct various experiments to show that, under certain conditions, the models learn to behave like imperfect forward predictors. We demonstrate that these models can be used to generate environments that do not follow the rules governing the actual environment, but that can nonetheless be used to teach the agent important skills it needs to perform its task in the actual environment. We also examine the role of inductive biases in the world model, and show that the model's architecture plays a role not only in its usefulness, but also in its interpretability.
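One way to see how the model can be optimized without a forward-prediction loss is to treat the concatenated policy and model parameters as a single vector and update them with a return-only, black-box estimator. The sketch below uses a generic evolution-strategies update purely for illustration; `evaluate_return` is a hypothetical fitness function that runs episodes under observational dropout, and this is not necessarily the optimizer or hyperparameters used in the paper.

```python
import numpy as np

def es_joint_update(theta, evaluate_return, sigma=0.1, lr=0.01, pop_size=64, rng=None):
    """One evolution-strategies step over the concatenated policy + model
    parameter vector `theta`. No forward-prediction loss appears anywhere;
    the model is shaped only by how much it helps the policy earn reward."""
    rng = rng or np.random.default_rng(0)
    noise = rng.standard_normal((pop_size, theta.size))
    returns = np.array([evaluate_return(theta + sigma * eps) for eps in noise])
    advantages = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalize returns
    grad_estimate = noise.T @ advantages / (pop_size * sigma)         # score-function estimate
    return theta + lr * grad_estimate
```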
Cartpole Swingup Experiment
In this experiment, described in detail in the paper, our agent is given only infrequent observations of its environment and must learn a world model to fill in the observation gaps. In the video below, the peek probability is p = 5%.
The world model, jointly learned with the policy, fills in the observation gaps in the cartpole swingup environment. The colorless cartpole represents the observation seen by the policy. A solid-color cartpole indicates timesteps when the agent is allowed to observe the actual environment and its internal model is realigned with reality. A light-color cartpole indicates actual observations that the agent cannot see.
We can verify that the policy jointly learned with the world model still works when observational dropout is disabled in the environment:
Section 4.1 of the paper also attempts to train a policy from scratch inside an open-loop environment generated by the world model. The video below shows the learned policy:
As seen in the video above, the optimal policy learns to swing up the pole and balance it only for a short period of time, even in the open-loop environment inside the world model. It should not surprise us, then, that the most successful policies, when deployed back into the actual environment, can swing up the pole and balance it only for a short while before it collapses:
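For concreteness, the open-loop rollout described above can be sketched as follows: after a single real initial observation, the model consumes its own predictions for the rest of the episode. The `policy`, `model`, and `reward_fn` callables are illustrative assumptions; in particular, `reward_fn` stands in for a task reward computed from the (possibly unrealistic) generated state.

```python
def open_loop_episode(initial_obs, policy, model, reward_fn, horizon=1000):
    """Roll out a policy entirely inside the learned world model: after the
    initial (real) observation, every subsequent observation is generated by
    `model` from its own previous output, with no resynchronization."""
    obs, total_reward = initial_obs, 0.0
    for _ in range(horizon):
        action = policy(obs)
        obs = model(obs, action)        # the model consumes its own prediction
        total_reward += reward_fn(obs)  # assumed reward computed from the generated state
    return total_reward
```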
We can visualize other open-loop environments generated by world models at different peek probabilities. The following corresponds to Figure 3 in the paper:
It is interesting to note that the dynamics of the world model in the video above are not perfect. For instance, the optimal policy inside the world model can only swing up and balance the pole at an angle that is not perpendicular to the ground. Surprisingly, when this policy is deployed back into the actual environment, it still manages to swing up the pole and balance it for a short period of time:
Car Racing Experiment
In more challenging environments, observations are often high-dimensional pixel images rather than state vectors. In this experiment, we apply observational dropout to learn a world model of a car racing game from pixel observations. We would like to know to what extent the world model can help the policy drive when the agent is allowed to see the road only a fraction of the time. We are also interested in the representations the model learns to facilitate driving, and in measuring how useful its internal representation is for the task.
To reduce the dimensionality of the pixel observations, we follow previous work in the literature and train a Variational Autoencoder (VAE) on rollouts collected from a random policy, to compress each pixel observation into a low-dimensional latent vector z. Our agent then uses z as its observation.
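As a rough sketch of how the latent observation might be produced (the `vae_encoder` callable and its mean/log-variance output are assumptions about a typical VAE interface, not the paper's exact implementation):

```python
import numpy as np

def encode_observation(vae_encoder, frame, rng=None):
    """Compress a pixel frame into a low-dimensional latent vector z.

    `vae_encoder(frame)` is a hypothetical callable returning the mean and
    log-variance of the approximate posterior q(z | frame) of a VAE trained
    on frames collected by a random policy."""
    rng = rng or np.random.default_rng(0)
    mu, logvar = vae_encoder(frame)
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)  # reparameterized sample
    return z
```

Observational dropout then operates on these latent vectors: when the agent is not allowed to peek, the world model generates the next z from the previous z and the action, rather than generating raw pixels.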
The above video, corresponding to Figure 7 in the paper, illustrates the world model's predictions on the right versus the ground truth pixel observations on the left. Frames with a red border indicate actual observations from the environment that the agent is allowed to see (p = 10% for this environment). The policy acts on the observations, real or generated, shown on the right.