Policy Learning from Tutorial Books via Understanding, Rehearsing and Introspecting

1Nanjing University, 2King's College London, 3University of Liverpool, 4The University of Hong Kong, 5University College London
* Equal contribution

Abstract

When humans need to learn a new skill, we can acquire knowledge through written books, including textbooks, tutorials, etc. However, current research on decision-making, such as reinforcement learning (RL), typically requires numerous real interactions with the target environment to learn a skill and fails to utilize the knowledge already summarized in text. The success of Large Language Models (LLMs) sheds light on utilizing the knowledge behind such books. In this paper, we discuss a new policy learning problem called Policy Learning from tutorial Books (PLfB), which aims to leverage rich resources such as books and tutorials to derive a policy network. Inspired by how humans learn from books, we solve the problem via a three-stage framework: Understanding, Rehearsing, and Introspecting (URI). In particular, URI first understands the books to build a knowledge database, then rehearses decision-making trajectories based on the derived knowledge, and finally introspects on the imaginary dataset to distill a policy network. To validate the practicality of this methodology, we train a football-playing policy via URI and test it in the Google Football game. Without any interaction with the environment during training, the agent beats the built-in AI with a 37% winning rate, whereas using GPT directly as the agent achieves only a 6% winning rate.

Policy Learning from Books

Comparison of Policy Learning Methods

In this study, we introduce a novel topic built on the capabilities of LLM systems: Policy Learning from tutorial Books (PLfB). PLfB aims to derive a policy network directly from natural-language texts, bypassing the need for large amounts of real-world interaction data, as shown in the figure above. This can be viewed as a further step towards enabling more resources for policy learning, and as a more general form of offline RL that learns a policy offline from textbooks rather than logged interaction data.

Method

URI Framework Detailed Overview

To realize PLfB, inspired by human learning processes, we propose a three-stage learning methodology: Understanding, Rehearsing, and Introspecting (URI), shown in the figure above. URI first extracts knowledge from the books to form a knowledge database (understanding); it then rehearses imaginary decision-making trajectories with the help of knowledge retrieved from the database (rehearsing); finally, it introspects on the imaginary dataset to distill a policy network for decision-making (introspecting).

Our proposed URI framework consists of three main stages:

  1. Understanding: The knowledge extractor and aggregator modules process paragraphs from books to form a structured knowledge database organized as pseudo-code.
  2. Rehearsing: Using the knowledge database, the simulator generates and iterates through imagined states, actions, and rewards to create an extensive imaginary dataset.
  3. Introspecting: The introspection module refines the policy network by evaluating and correcting for errors in the generated states, actions, and rewards, yielding an accurate and effective policy.
We give a first implementation of the URI framework, detailed in the following subsections.
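As a rough picture of how the stages compose (not the paper's actual API), the skeleton below wires together three stage callables; the names `understand`, `rehearse`, and `train_ciql` are illustrative placeholders, and each stage is sketched in more detail in the subsections that follow.

```python
# Illustrative skeleton of the URI pipeline; the three stage callables are
# hypothetical placeholders whose internals are sketched in the subsections below.

def uri_pipeline(book_paragraphs, initial_states, understand, rehearse, train_ciql):
    # 1. Understanding: tutorial books -> pseudo-code knowledge database.
    knowledge_db = understand(book_paragraphs)

    # 2. Rehearsing: knowledge database -> imaginary (state, action, reward, next-state) dataset.
    imaginary_dataset = []
    for state in initial_states:
        imaginary_dataset += rehearse(state, knowledge_db)

    # 3. Introspecting: imaginary dataset -> policy network via conservative offline RL.
    return train_ciql(imaginary_dataset)
```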

Book Content Understanding

The understanding module extracts knowledge K from books B using a knowledge extractor and aggregator. Knowledge is represented as pseudo-code, which is interpretable, compact, and expressive. The process involves:

  • Using an LLM-injected model to extract knowledge from paragraphs
  • Aggregating knowledge using another LLM-injected model
  • Iteratively refining and organizing knowledge into dynamics, reward, and policy-related databases
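A minimal sketch of this extract-then-aggregate pipeline is given below. It assumes a generic `llm` callable (prompt in, text out) and simple label-based parsing; the prompts and the categorization scheme are illustrative, not the paper's actual implementation.

```python
# Minimal sketch of the Understanding stage (illustrative prompts, not the paper's code).
# `llm` is an assumed callable: prompt string in, completion string out.

EXTRACT_PROMPT = (
    "Extract decision-making knowledge from the paragraph below as pseudo-code rules.\n"
    "Prefix each rule with DYNAMICS, REWARD, or POLICY.\n\nParagraph:\n{paragraph}"
)
AGGREGATE_PROMPT = (
    "Merge the following pseudo-code rules, removing duplicates and resolving conflicts:\n{rules}"
)

def understand(paragraphs, llm):
    """Turn tutorial-book paragraphs into a categorized pseudo-code knowledge database."""
    knowledge = {"dynamics": [], "reward": [], "policy": []}
    for paragraph in paragraphs:
        extracted = llm(EXTRACT_PROMPT.format(paragraph=paragraph))
        for line in extracted.splitlines():
            if line.startswith("DYNAMICS"):
                knowledge["dynamics"].append(line)
            elif line.startswith("REWARD"):
                knowledge["reward"].append(line)
            elif line.startswith("POLICY"):
                knowledge["policy"].append(line)
    # Aggregate each category into a compact, de-duplicated pseudo-code database.
    return {
        category: llm(AGGREGATE_PROMPT.format(rules="\n".join(rules)))
        for category, rules in knowledge.items()
    }
```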

Knowledge-based Rehearsing of Decision-Making

The rehearsing stage implements a closed-loop generation process involving LLM-injected dynamics, reward, and policy functions. Key components include:

  • State-based Knowledge Scope Retrieval to address modality gaps
  • Post-Retrieval Knowledge Instantiation for tailoring knowledge to specific states
  • Generation of imaginary dataset using LLMs for actions, rewards, and next states
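The loop below sketches one imaginary rollout under stated assumptions: `llm` is a generic text-in/text-out callable and `retrieve(db, state)` is an assumed retriever returning knowledge entries relevant to the current state; the prompts are illustrative rather than the paper's.

```python
# Illustrative sketch of the Rehearsing stage: a closed-loop rollout in which the
# policy, reward, and dynamics are all emulated by knowledge-conditioned LLM queries.

def rehearse(initial_state, knowledge_db, llm, retrieve, horizon=50):
    """Roll out one imaginary trajectory of (state, action, reward, next_state) tuples."""
    dataset, state = [], initial_state
    for _ in range(horizon):
        # State-based knowledge scope retrieval + post-retrieval instantiation.
        policy_k = retrieve(knowledge_db["policy"], state)
        reward_k = retrieve(knowledge_db["reward"], state)
        dynamics_k = retrieve(knowledge_db["dynamics"], state)

        action = llm(f"State: {state}\nKnowledge: {policy_k}\nChoose an action:")
        reward = llm(f"State: {state}\nAction: {action}\nKnowledge: {reward_k}\nEstimate the reward:")
        next_state = llm(f"State: {state}\nAction: {action}\nKnowledge: {dynamics_k}\nPredict the next state:")

        dataset.append((state, action, reward, next_state))
        state = next_state
    return dataset
```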

Introspecting based on the Imaginary Dataset

The introspection stage refines the policy using the imaginary dataset. It involves:

  • Adopting Conservative Q-learning as the base offline RL algorithm
  • Adding uncertainty penalties for reward and transition estimations
  • Implementing Conservative Imaginary Q-Learning (CIQL) to address inaccuracies in LLM-generated data
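The loss below is a sketch of the CIQL idea, not the paper's exact objective: a CQL-style conservatism term plus reward and transition uncertainty penalties that down-weight imagined samples the LLM is likely to have generated inaccurately. `q_net`, `q_target`, and `policy` are assumed PyTorch modules, and the uncertainty estimates are assumed to come from, e.g., ensemble disagreement.

```python
# Sketch of a CIQL-style objective (illustrative; hyperparameters and uncertainty
# sources are assumptions, not values from the paper).
import torch
import torch.nn.functional as F

def ciql_loss(q_net, q_target, policy, batch,
              gamma=0.99, alpha=1.0, beta_r=1.0, beta_t=1.0):
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]
    u_r, u_t = batch["reward_uncertainty"], batch["transition_uncertainty"]

    with torch.no_grad():
        # Penalize imagined rewards in proportion to how uncertain the generated
        # reward and next-state estimates are.
        penalized_r = r - beta_r * u_r - beta_t * u_t
        target = penalized_r + gamma * q_target(s_next, policy(s_next)).squeeze(-1)

    td_loss = F.mse_loss(q_net(s, a).squeeze(-1), target)

    # CQL-style conservatism: push down Q-values on actions outside the imaginary
    # dataset relative to the actions it contains.
    random_a = torch.rand_like(a) * 2.0 - 1.0   # assumes actions normalized to [-1, 1]
    conservative_gap = (q_net(s, random_a) - q_net(s, a)).mean()
    return td_loss + alpha * conservative_gap
```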

Key Results

1. Performance in Multiple Environments

🎲 Tic-Tac-Toe Game

  • URI shows superior performance against various opponents.
  • URI's performance is closer to the optimal Minimax strategy compared to other methods.

Performance Comparison in Tic-Tac-Toe (URI as Player X)

Opponent (O)     Win    Draw   Loss   Net Win Rate
LLM-as-agent     80%    6%     14%    +66%
LLM-RAG          62%    10%    18%    +44%
Random Policy    70%    12%    18%    +52%
Minimax-noise    36%    54%    10%    +26%

⚽ Google Football (11v11)

  • URI outperforms baseline methods across all difficulty levels (Easy, Medium, Hard).
  • In the Hard task, URI achieves a higher win rate than the Rule-based Policy.
  • URI demonstrates consistent performance with an average Goal Difference per Match (GDM) of 0.38 ± 0.05 across all levels.

Performance Comparison Against Built-in AI Levels in GRF 11 vs 11 Settings

Level     Metric   LLM-as-agent    LLM-RAG          Random Policy   URI (Ours)     Rule-based AI
Easy      Win      20%             30%              2%              37% ± 4%       70%
          Draw     60%             60%              55%             57% ± 4%       30%
          Lose     20%             10%              43%             6% ± 4%        0%
          GDM      0.0             0.2              -0.58           0.40 ± 0.14    0.7
Medium    Win      0%              20%              2%              42% ± 12%      70%
          Draw     60%             60%              43%             50% ± 8%       30%
          Lose     40%             20%              55%             8% ± 4%        0%
          GDM      -0.4            0.0              -0.76           0.43 ± 0.24    0.7
Hard      Win      0%              0%               3%              32% ± 14%      30%
          Draw     50%             40%              43%             58% ± 6%       70%
          Lose     50%             60%              53%             10% ± 7%       0%
          GDM      -0.5            -0.6             -0.73           0.32 ± 0.14    0.3
Average   Win      6.7% ± 9.4%     16.7% ± 12.5%    2.3% ± 0.5%     40.3% ± 6.2%   56%
          GDM      -0.30 ± 0.22    -0.13 ± 0.34     -0.69 ± 0.08    0.38 ± 0.05    0.56

2. Efficiency

  • URI significantly outperforms LLM-based methods in inference time, taking only 0.009 seconds on average to choose an action in Google Football.

3. Data Quality

  • Visualization of the imaginary dataset shows that generated data follows a similar distribution to real data.
  • Uncertainty estimation effectively identifies out-of-distribution clusters, enhancing the robustness of the learned policy.

Figure 1: t-SNE visualization of real and generated data distributions in the football game
Figure 2: t-SNE visualization of real and generated data distributions in the Tic-Tac-Toe game
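A check of this kind can be reproduced with a short script like the sketch below, assuming `real_states` and `generated_states` are NumPy arrays with the same feature dimension; it uses scikit-learn's t-SNE and matplotlib.

```python
# Minimal sketch of the t-SNE comparison illustrated above (assumes two NumPy arrays
# of states with identical feature dimensions).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(real_states, generated_states):
    data = np.concatenate([real_states, generated_states], axis=0)
    embedded = TSNE(n_components=2, init="pca", random_state=0).fit_transform(data)
    n_real = len(real_states)
    plt.scatter(embedded[:n_real, 0], embedded[:n_real, 1], s=5, label="real")
    plt.scatter(embedded[n_real:, 0], embedded[n_real:, 1], s=5, label="generated")
    plt.legend()
    plt.show()
```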

LLM-RAG Agent (Baseline)

This video demonstrates the performance of our LLM-RAG Agent baseline in the challenging 11 vs 11 Medium-level football scenario. The agent uses Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) to make decisions in real-time gameplay.

URI Policy (Ours)

The URI Agent demonstrates superior performance compared to the LLM-RAG baseline, showcasing enhanced decision-making capabilities and a significantly higher goal-scoring rate. This improvement is particularly evident in the agent's ability to create and capitalize on scoring opportunities more effectively.

BibTeX

@inproceedings{chen2024understanding,
  title     = {Understanding, Rehearsing, and Introspecting: Learn a Policy from Textual Tutorial Books in Football Games},
  author    = {Chen, Xiong-Hui and Wang, Ziyan and Du, Yali and Fang, Meng and Jiang, Shengyi and Yu, Yang and Wang, Jun},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2024}
}