1. Title of Publication
Discovered Policy Optimisation

2. Author Information
Name: Chris Lu
Institution/Company: University of Oxford
Email: christopher.lu@exeter.ox.ac.uk

Name: Jakub Grudzien Kuba
Institution/Company: UC Berkeley
Email: kuba@berkeley.edu

Name: Alistair Letcher
Institution/Company: N/A
Email: ahp.letcher@gmail.com

Name: Luke Metz
Institution/Company: OpenAI
Email: luke.s.metz@gmail.com

Name: Christian Schroeder de Witt
Institution/Company: University of Oxford
Email: cs@robots.ox.ac.uk

Name: Jakob Foerster
Institution/Company: University of Oxford
Email: jakob.foerster@eng.ox.ac.uk

3. Corresponding Author
Name: Chris Lu
Institution: University of Oxford
Email: christopher.lu@exeter.ox.ac.uk

4. Paper Abstract
Tremendous progress has been made in reinforcement learning (RL) over the past decade. Most of these advancements came through the continual development of new algorithms, which were designed using a combination of mathematical derivations, intuitions, and experimentation. Such an approach of creating algorithms manually is limited by human understanding and ingenuity. In contrast, meta-learning provides a toolkit for automatic machine learning method optimisation, potentially addressing this flaw. However, black-box approaches which attempt to discover RL algorithms with minimal prior structure have thus far not outperformed existing hand-crafted algorithms. Mirror Learning, which includes RL algorithms, such as PPO, offers a potential middle-ground starting point: while every method in this framework comes with theoretical guarantees, components that differentiate them are subject to design. In this paper we explore the Mirror Learning space by meta-learning a "drift" function. We refer to the immediate result as Learnt Policy Optimisation (LPO). By analysing LPO we gain original insights into policy optimisation which we use to formulate a novel, closed-form RL algorithm, Discovered Policy Optimisation (DPO). Our experiments in Brax environments confirm state-of-the-art performance of LPO and DPO, as well as their transfer to unseen settings.
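To make the "drift" formulation above concrete, the following minimal JAX sketch contrasts PPO's clipped surrogate with a Mirror-Learning-style surrogate whose drift penalty is parametrised by a small network, which is the kind of object that LPO meta-learns. The feature choice, the tiny network, and all names here are illustrative assumptions made for this entry, not the released implementation.

import jax
import jax.numpy as jnp


def ppo_clip_objective(ratio, advantage, eps=0.2):
    # Standard PPO surrogate: the importance ratio is clipped to [1 - eps, 1 + eps].
    return jnp.minimum(ratio * advantage,
                       jnp.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)


def drift_penalty(phi, ratio, advantage):
    # Toy parametrised drift: a tiny network over simple features of the ratio
    # and advantage. Softplus keeps the penalty non-negative, as required of
    # drift functions in Mirror Learning; other drift conditions (e.g. vanishing
    # when the new policy equals the old one) are not enforced in this sketch.
    x = jnp.stack([ratio - 1.0, jnp.log(ratio), advantage], axis=-1)
    h = jnp.tanh(x @ phi["w1"] + phi["b1"])
    return jax.nn.softplus(h @ phi["w2"] + phi["b2"]).squeeze(-1)


def lpo_style_objective(phi, ratio, advantage):
    # Mirror-Learning-style surrogate: importance-weighted advantage minus a
    # (meta-learnable) drift penalty. PPO corresponds to one fixed choice of
    # drift; LPO instead searches over the drift parameters phi.
    return ratio * advantage - drift_penalty(phi, ratio, advantage)


# Tiny usage example with randomly initialised drift parameters.
key_w1, key_w2 = jax.random.split(jax.random.PRNGKey(0))
phi = {"w1": 0.1 * jax.random.normal(key_w1, (3, 16)), "b1": jnp.zeros(16),
       "w2": 0.1 * jax.random.normal(key_w2, (16, 1)), "b2": jnp.zeros(1)}
ratio = jnp.array([0.8, 1.0, 1.2])
advantage = jnp.array([1.0, -0.5, 2.0])
print(ppo_clip_objective(ratio, advantage))
print(lpo_style_objective(phi, ratio, advantage))

Meta-training then amounts to searching over the drift parameters so that agents trained with this surrogate obtain higher returns; the search procedure is discussed under criterion (G) below.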
5. Competition Criteria
(D) The result is publishable in its own right as a new scientific result independent of the fact that the result was mechanically created.
(E) The result is equal to or better than the most recent human-created solution to a long-standing problem for which there has been a succession of increasingly better human-created solutions.
(F) The result is equal to or better than a result that was considered an achievement in its field at the time it was first discovered.
(G) The result solves a problem of indisputable difficulty in its field.

6. Statement of Why the Results Satisfy the Criteria (D), (E), (F), and (G)

(E) The result is equal to or better than the most recent human-created solution to a long-standing problem for which there has been a succession of increasingly better human-created solutions.
One of the first widely successful Deep RL algorithms was DQN [6], which was the first to achieve human-level performance across a range of Atari games. DQN was followed by A3C [7] and other actor-critic approaches, which used policy-based methods to achieve superior results on Atari games and impressive initial results on continuous motor control problems, while also running significantly faster. Concurrently, Trust Region Policy Optimization (TRPO) [8] was developed, which significantly stabilized learning. Finally, Proximal Policy Optimization (PPO) [9] stood out as a simple yet more performant approach to Deep Reinforcement Learning and is still one of the most widely used Deep RL algorithms today. While other algorithms have since been proposed, PPO remains the go-to algorithm for most RL tasks because of its performance and speed. PPO was used to train GPT-4, OpenAI's recent and widely used chatbot, from human feedback [5]. We evaluate our algorithms, LPO and DPO, on continuous control and Atari-like tasks and find that they significantly outperform PPO, despite being meta-trained on only a single task. At a high level, we build on theoretical work that unifies the aforementioned algorithms under a single broad framework of algorithms with theoretical guarantees, and we then evolve algorithms within that space.

(F) The result is equal to or better than a result that was considered an achievement in its field at the time it was first discovered.
Our discovered closed-form algorithm, DPO, outperforms PPO on continuous control tasks such as Ant, Humanoid, Walker2D, and HalfCheetah. The original PPO paper reported results on the same set of environments using the MuJoCo simulator. While we use the Brax simulator, which was developed afterwards and runs significantly faster, the environments are designed to be very similar.

(D) The result is publishable in its own right as a new scientific result independent of the fact that the result was mechanically created.
New, simple algorithms that significantly outperform PPO are extremely uncommon. While certain algorithms can outperform PPO in sample efficiency, they often do so by including hand-crafted and brittle components or by introducing vast amounts of additional complexity and extra computational cost. DPO matches PPO in its simplicity while being more performant. DPO, or future variants, could plausibly become the next widely used standard Deep RL algorithm.

(G) The result solves a problem of indisputable difficulty in its field.
Previous attempts at meta-learning novel reinforcement learning algorithms have not yielded RL algorithms that outperform hand-crafted alternatives like PPO across the board [10, 11]. Most prior attempts at learning reinforcement learning algorithms used meta-gradients for optimisation. However, using meta-gradients to optimise across such long horizons is notoriously challenging and high-variance. Instead, we used evolution strategies (ES), which are uniquely suited for this task because they are agnostic to the length of the optimisation horizon, are unbiased, and are highly parallelisable. To overcome the computational cost of ES, we take advantage of recent advancements in hardware acceleration to vectorise entire RL algorithms, allowing us to train thousands of agents in parallel on a single GPU, which results in training times that are over 4000x faster. This technique is itself novel and is a significant contribution of the paper; it has the potential to radically democratize Deep RL research by vastly lowering the computational barrier to entry (an informal write-up accompanying our open-source implementation is available at https://chrislu.page/blog/meta-disco/).
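As an illustration of this meta-evolution setup, the schematic JAX snippet below applies an OpenAI-ES-style update to a population of candidate drift parameters, using jax.vmap to evaluate the whole population in parallel; train_and_evaluate is a stand-in for a full vectorised RL training run (e.g. in Brax), and all names and hyperparameters are assumptions for this sketch rather than the released code.

import jax
import jax.numpy as jnp

POP_SIZE, N_PARAMS, SIGMA, LR = 64, 128, 0.03, 0.02


def train_and_evaluate(drift_params, key):
    # Placeholder fitness. In the real setup this would construct a drift
    # function from drift_params, run an entire RL training with it, and
    # return the trained agent's final return.
    return -jnp.sum(drift_params ** 2) + 0.01 * jax.random.normal(key)


@jax.jit
def es_step(mean_params, key):
    # Antithetic OpenAI-ES update: perturb the current mean, evaluate every
    # candidate in parallel with vmap, and move the mean along the
    # fitness-weighted perturbations.
    key_noise, key_eval = jax.random.split(key)
    noise = jax.random.normal(key_noise, (POP_SIZE // 2, N_PARAMS))
    noise = jnp.concatenate([noise, -noise], axis=0)
    candidates = mean_params + SIGMA * noise
    fitness = jax.vmap(train_and_evaluate)(candidates,
                                           jax.random.split(key_eval, POP_SIZE))
    fitness = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
    return mean_params + LR * (noise.T @ fitness) / (POP_SIZE * SIGMA)


mean_params = jnp.zeros(N_PARAMS)
for step_key in jax.random.split(jax.random.PRNGKey(42), 100):
    mean_params = es_step(mean_params, step_key)

Because each fitness evaluation is an entire RL training run, being able to jit and vmap those runs on a single accelerator is what makes the search computationally tractable.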
7. Full Citation
Lu, Chris, Jakub Kuba, Alistair Letcher, Luke Metz, Christian Schroeder de Witt, and Jakob Foerster. "Discovered policy optimisation." Advances in Neural Information Processing Systems 35 (2022): 16455-16468. DOI: https://doi.org/10.48550/arXiv.2210.05639

8. Prize Money Breakdown
Prize money is to be equally distributed amongst all co-authors.

9. A Statement Indicating Why this Entry Could Be the "Best"
Deep Reinforcement Learning (RL) has been a key component of many recent successes in machine learning, ranging from impressive results in games such as Go, StarCraft, and Dota [1, 2, 3] to more practical applications such as nuclear fusion plasma control and fine-tuning language models with human feedback (e.g. GPT-4) [4, 5]. The underlying RL algorithms for these tasks have gradually developed over the years; however, an RL algorithm that is six years old (PPO) was used to train the latest state-of-the-art large language model, despite extensive attempts to improve on it. Most efforts to improve on PPO have involved hand-crafted algorithms that introduce large amounts of additional complexity and assumptions, rendering them highly domain-specific. Efforts that attempt to meta-learn novel reinforcement learning algorithms have failed to outperform existing algorithms across the board [10, 11]. We instead combine recent theoretical results in RL [12] with a novel approach to vectorising RL training to rapidly evolve new RL algorithms that outperform PPO. Furthermore, we are able to visualise the resulting discovered artifact to generate insights for policy optimisation, allowing us to construct a closed-form analytical approximation to the learned function. Discovered Policy Optimisation (DPO), or future versions of it, could plausibly become the next go-to RL algorithm used to train large systems like GPT-4.

10. Evolutionary Computation Type
Evolution Strategies (ES)

11. Publication Date
The publication was accepted on 14 September 2022 to the Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS 2022). (https://openreview.net/forum?id=bVVIZjQ2AA)

[1] Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
[2] Vinyals, Oriol, et al. "Grandmaster level in StarCraft II using multi-agent reinforcement learning." Nature 575.7782 (2019): 350-354.
[3] Berner, Christopher, et al. "Dota 2 with large scale deep reinforcement learning." arXiv preprint arXiv:1912.06680 (2019).
[4] Degrave, Jonas, et al. "Magnetic control of tokamak plasmas through deep reinforcement learning." Nature 602.7897 (2022): 414-419.
[5] OpenAI. "GPT-4 Technical Report." arXiv preprint arXiv:2303.08774 (2023).
[6] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
[7] Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning." International Conference on Machine Learning. PMLR, 2016.
[8] Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. PMLR, 2015.
[9] Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
[10] Oh, Junhyuk, et al. "Discovering reinforcement learning algorithms." Advances in Neural Information Processing Systems 33 (2020): 1060-1070.
[11] Kirsch, Louis, Sjoerd van Steenkiste, and Jürgen Schmidhuber. "Improving generalization in meta reinforcement learning using learned objectives." arXiv preprint arXiv:1910.04098 (2019).
[12] Grudzien, Jakub, Christian A. Schroeder De Witt, and Jakob Foerster. "Mirror learning: A unifying framework of policy optimisation." International Conference on Machine Learning. PMLR, 2022.