Policy Learning from Tutorial Books via Understanding, Rehearsing and Introspecting

1Nanjing University, 2King's College London, 3University of Liverpool, 4The University of Hong Kong, 5University College London
* Equal contribution

Abstract

When humans need to learn a new skill, we can acquire knowledge through written books, including textbooks, tutorials, etc. However, current research on decision-making, such as reinforcement learning (RL), typically requires numerous real interactions with the target environment to learn a skill and fails to utilize the knowledge already summarized in text. The success of Large Language Models (LLMs) sheds light on utilizing the knowledge behind such books. In this paper, we discuss a new policy learning problem called Policy Learning from tutorial Books (PLfB), which aims to leverage rich resources such as books and tutorials to derive a policy network. Inspired by how humans learn from books, we solve the problem via a three-stage framework: Understanding, Rehearsing, and Introspecting (URI). In particular, URI first understands the books to build a knowledge database, then rehearses decision-making trajectories based on the derived knowledge, and finally introspects on the imaginary dataset to distill a policy network. To validate the practicality of this methodology, we train a football-playing policy via URI and test it in the Google Football game. Without any interaction with the environment during training, the agent beats the built-in AI with a 37% winning rate, whereas using GPT directly as the agent achieves only a 6% winning rate.

Policy Learning from Books

Comparison of Policy Learning Methods

In this study, we introduce a novel topic built on the capabilities of LLM systems: Policy Learning from tutorial Books (PLfB). PLfB aims to derive a policy network directly from natural-language texts, bypassing the need for large amounts of real-world interaction data, as shown in the figure above. This can be viewed as a further step towards enabling more resources for policy learning, and as a more general form of offline RL that learns a policy offline from textbooks rather than logged interaction data.

Method

URI Framework Detailed Overview

To realize PLfB, inspired by human learning processes, we propose a three-stage learning methodology: Understanding, Rehearsing, and Introspecting (URI), shown in the figure above. URI first extracts knowledge from the books to form a knowledge database (understanding); it then rehearses imaginary decision-making trajectories with the help of knowledge retrieved from the database (rehearsing); finally, it introspects on the imaginary dataset to distill a policy network for decision-making (introspecting).

Our proposed URI framework consists of three main stages:

  1. Understanding: The knowledge extractor and aggregator modules process paragraphs from books to form a structured knowledge database organized as pseudo-code.
  2. Rehearsing: Using the knowledge database, the simulator generates and iterates through imagined states, actions, and rewards to create an extensive imaginary dataset.
  3. Introspecting: The introspection module refines the policy network by evaluating and correcting for errors in the generated states, actions, and rewards, yielding an accurate and effective policy.
We give a first implementation of the URI framework, detailed in the following subsections.
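As a rough picture of how the stages compose (not the paper's actual API), the skeleton below wires together three stage callables; the names `understand`, `rehearse`, and `train_ciql` are illustrative placeholders, and each stage is sketched in more detail in the subsections that follow.

```python
# Illustrative skeleton of the URI pipeline; the three stage callables are
# hypothetical placeholders whose internals are sketched in the subsections below.

def uri_pipeline(book_paragraphs, initial_states, understand, rehearse, train_ciql):
    # 1. Understanding: tutorial books -> pseudo-code knowledge database.
    knowledge_db = understand(book_paragraphs)

    # 2. Rehearsing: knowledge database -> imaginary (state, action, reward, next-state) dataset.
    imaginary_dataset = []
    for state in initial_states:
        imaginary_dataset += rehearse(state, knowledge_db)

    # 3. Introspecting: imaginary dataset -> policy network via conservative offline RL.
    return train_ciql(imaginary_dataset)
```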

Book Content Understanding

The understanding module extracts knowledge K from books B using a knowledge extractor and aggregator. Knowledge is represented as pseudo-code, which is interpretable, compact, and expressive. The process involves:

  • Using an LLM-injected model to extract knowledge from paragraphs
  • Aggregating knowledge using another LLM-injected model
  • Iteratively refining and organizing knowledge into dynamics, reward, and policy-related databases
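A minimal sketch of this extract-then-aggregate pipeline is given below. It assumes a generic `llm` callable (prompt in, text out) and simple label-based parsing; the prompts and the categorization scheme are illustrative, not the paper's actual implementation.

```python
# Minimal sketch of the Understanding stage (illustrative prompts, not the paper's code).
# `llm` is an assumed callable: prompt string in, completion string out.

EXTRACT_PROMPT = (
    "Extract decision-making knowledge from the paragraph below as pseudo-code rules.\n"
    "Prefix each rule with DYNAMICS, REWARD, or POLICY.\n\nParagraph:\n{paragraph}"
)
AGGREGATE_PROMPT = (
    "Merge the following pseudo-code rules, removing duplicates and resolving conflicts:\n{rules}"
)

def understand(paragraphs, llm):
    """Turn tutorial-book paragraphs into a categorized pseudo-code knowledge database."""
    knowledge = {"dynamics": [], "reward": [], "policy": []}
    for paragraph in paragraphs:
        extracted = llm(EXTRACT_PROMPT.format(paragraph=paragraph))
        for line in extracted.splitlines():
            if line.startswith("DYNAMICS"):
                knowledge["dynamics"].append(line)
            elif line.startswith("REWARD"):
                knowledge["reward"].append(line)
            elif line.startswith("POLICY"):
                knowledge["policy"].append(line)
    # Aggregate each category into a compact, de-duplicated pseudo-code database.
    return {
        category: llm(AGGREGATE_PROMPT.format(rules="\n".join(rules)))
        for category, rules in knowledge.items()
    }
```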

Knowledge-based Rehearsing of Decision-Making

The rehearsing stage implements a closed-loop generation process involving LLM-injected dynamics, reward, and policy functions. Key components include:

  • State-based Knowledge Scope Retrieval to address modality gaps
  • Post-Retrieval Knowledge Instantiation for tailoring knowledge to specific states
  • Generation of imaginary dataset using LLMs for actions, rewards, and next states
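The loop below sketches one imaginary rollout under stated assumptions: `llm` is a generic text-in/text-out callable and `retrieve(db, state)` is an assumed retriever returning knowledge entries relevant to the current state; the prompts are illustrative rather than the paper's.

```python
# Illustrative sketch of the Rehearsing stage: a closed-loop rollout in which the
# policy, reward, and dynamics are all emulated by knowledge-conditioned LLM queries.

def rehearse(initial_state, knowledge_db, llm, retrieve, horizon=50):
    """Roll out one imaginary trajectory of (state, action, reward, next_state) tuples."""
    dataset, state = [], initial_state
    for _ in range(horizon):
        # State-based knowledge scope retrieval + post-retrieval instantiation.
        policy_k = retrieve(knowledge_db["policy"], state)
        reward_k = retrieve(knowledge_db["reward"], state)
        dynamics_k = retrieve(knowledge_db["dynamics"], state)

        action = llm(f"State: {state}\nKnowledge: {policy_k}\nChoose an action:")
        reward = llm(f"State: {state}\nAction: {action}\nKnowledge: {reward_k}\nEstimate the reward:")
        next_state = llm(f"State: {state}\nAction: {action}\nKnowledge: {dynamics_k}\nPredict the next state:")

        dataset.append((state, action, reward, next_state))
        state = next_state
    return dataset
```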

Introspecting based on the Imaginary Dataset

The introspection stage refines the policy using the imaginary dataset. It involves:

  • Adopting Conservative Q-learning as the base offline RL algorithm
  • Adding uncertainty penalties for reward and transition estimations
  • Implementing Conservative Imaginary Q-Learning (CIQL) to address inaccuracies in LLM-generated data
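The loss below is a sketch of the CIQL idea, not the paper's exact objective: a CQL-style conservatism term plus reward and transition uncertainty penalties that down-weight imagined samples the LLM is likely to have generated inaccurately. `q_net`, `q_target`, and `policy` are assumed PyTorch modules, and the uncertainty estimates are assumed to come from, e.g., ensemble disagreement.

```python
# Sketch of a CIQL-style objective (illustrative; hyperparameters and uncertainty
# sources are assumptions, not values from the paper).
import torch
import torch.nn.functional as F

def ciql_loss(q_net, q_target, policy, batch,
              gamma=0.99, alpha=1.0, beta_r=1.0, beta_t=1.0):
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]
    u_r, u_t = batch["reward_uncertainty"], batch["transition_uncertainty"]

    with torch.no_grad():
        # Penalize imagined rewards in proportion to how uncertain the generated
        # reward and next-state estimates are.
        penalized_r = r - beta_r * u_r - beta_t * u_t
        target = penalized_r + gamma * q_target(s_next, policy(s_next)).squeeze(-1)

    td_loss = F.mse_loss(q_net(s, a).squeeze(-1), target)

    # CQL-style conservatism: push down Q-values on actions outside the imaginary
    # dataset relative to the actions it contains.
    random_a = torch.rand_like(a) * 2.0 - 1.0   # assumes actions normalized to [-1, 1]
    conservative_gap = (q_net(s, random_a) - q_net(s, a)).mean()
    return td_loss + alpha * conservative_gap
```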

Key Results

1. Performance in Multiple Environments

🎲 Tic-Tac-Toe Game

  • URI shows superior performance against various opponents.
  • URI's performance is closer to the optimal Minimax strategy compared to other methods.

Performance Comparison in Tic-Tac-Toe (URI as Player X)

Opponent (O)     Win    Draw   Loss   Net Win Rate
LLM-as-agent     80%    6%     14%    +66%
LLM-RAG          62%    10%    18%    +44%
Random Policy    70%    12%    18%    +52%
Minimax-noise    36%    54%    10%    +26%

⚽ Google Football (11v11)

  • URI outperforms baseline methods across all difficulty levels (Easy, Medium, Hard).
  • In the Hard task, URI achieves a higher win rate than the Rule-based Policy.
  • URI demonstrates consistent performance with an average Goal Difference per Match (GDM) of 0.38 ± 0.05 across all levels.

Performance Comparison Against Built-in AI Levels in GRF 11 vs 11 Settings

Level     Metric   LLM-as-agent    LLM-RAG          Random Policy   URI (Ours)     Rule-based AI
Easy      Win      20%             30%              2%              37% ± 4%       70%
          Draw     60%             60%              55%             57% ± 4%       30%
          Lose     20%             10%              43%             6% ± 4%        0%
          GDM      0.0             0.2              -0.58           0.40 ± 0.14    0.7
Medium    Win      0%              20%              2%              42% ± 12%      70%
          Draw     60%             60%              43%             50% ± 8%       30%
          Lose     40%             20%              55%             8% ± 4%        0%
          GDM      -0.4            0.0              -0.76           0.43 ± 0.24    0.7
Hard      Win      0%              0%               3%              32% ± 14%      30%
          Draw     50%             40%              43%             58% ± 6%       70%
          Lose     50%             60%              53%             10% ± 7%       0%
          GDM      -0.5            -0.6             -0.73           0.32 ± 0.14    0.3
Average   Win      6.7% ± 9.4%     16.7% ± 12.5%    2.3% ± 0.5%     40.3% ± 6.2%   56%
          GDM      -0.30 ± 0.22    -0.13 ± 0.34     -0.69 ± 0.08    0.38 ± 0.05    0.56

2. Efficiency

  • URI significantly outperforms LLM-based methods in inference time, taking only 0.009 seconds on average to choose an action in Google Football.

3. Data Quality

  • Visualization of the imaginary dataset shows that generated data follows a similar distribution to real data.
  • Uncertainty estimation effectively identifies out-of-distribution clusters, enhancing the robustness of the learned policy.

Figure 1: t-SNE visualization of real and generated data distributions in the football game
Figure 2: t-SNE visualization of real and generated data distributions in the Tic-Tac-Toe game
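A check of this kind can be reproduced with a short script like the sketch below, assuming `real_states` and `generated_states` are NumPy arrays with the same feature dimension; it uses scikit-learn's t-SNE and matplotlib.

```python
# Minimal sketch of the t-SNE comparison illustrated above (assumes two NumPy arrays
# of states with identical feature dimensions).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(real_states, generated_states):
    data = np.concatenate([real_states, generated_states], axis=0)
    embedded = TSNE(n_components=2, init="pca", random_state=0).fit_transform(data)
    n_real = len(real_states)
    plt.scatter(embedded[:n_real, 0], embedded[:n_real, 1], s=5, label="real")
    plt.scatter(embedded[n_real:, 0], embedded[n_real:, 1], s=5, label="generated")
    plt.legend()
    plt.show()
```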

LLM-RAG Agent (Baseline)

This video demonstrates the performance of our LLM-RAG Agent baseline in the challenging 11 vs 11 Medium-level football scenario. The agent uses Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) to make decisions in real-time gameplay.

URI Policy (Ours)

The URI Agent demonstrates superior performance compared to the LLM-RAG baseline, showcasing enhanced decision-making capabilities and a significantly higher goal-scoring rate. This improvement is particularly evident in the agent's ability to create and capitalize on scoring opportunities more effectively.

BibTeX

@inproceedings{chen2024understanding,
  title     = {Understanding, Rehearsing, and Introspecting: Learn a Policy from Textual Tutorial Books in Football Games},
  author    = {Chen, Xiong-Hui and Wang, Ziyan and Du, Yali and Fang, Meng and Jiang, Shengyi and Yu, Yang and Wang, Jun},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2024}
}