Mastering Board Games by External and Internal Planning with Language Models

DeepMind

This paper applies Monte Carlo Tree Search (MCTS) to board games using a pre-trained Multi-Action-Value (MAV) model as the world model and value function. Additionally, the authors distill the search procedure into the LLM itself.

Planning in language models aims to enhance performance on reasoning benchmarks and can be categorized into two approaches:

  1. Internal Planning:
    The model develops a plan within its own context (e.g., Chain-of-Thought prompting), autoregressively considering possible steps and their outcomes.

  2. External Planning:
    The model generates steps inside a neurosymbolic system (e.g., Tree of Thoughts), with an outer loop explicitly searching over possible step sequences.

The paper explores training language models for both approaches to improve reasoning in sequential decision-making, using board games as the experimental domain.
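
To make the external-planning loop concrete, here is a minimal MCTS sketch in Python. The `model` object stands in for the MAV model described below; its methods (`legal_moves`, `next_state`, `is_terminal`, `value`) are hypothetical names, not the paper's API. The selection/expansion/evaluation/backup structure is standard MCTS, with the model's value estimate replacing random rollouts:

```python
import math

class Node:
    """One search-tree node; value_sum is from the perspective of the side to move."""
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}    # move -> Node
        self.visits = 0
        self.value_sum = 0.0

def ucb(parent, child, c=1.4):
    """UCB1; child values are from the opponent's view, so flip them for the parent."""
    if child.visits == 0:
        return float("inf")
    exploit = 1.0 - child.value_sum / child.visits
    explore = c * math.sqrt(math.log(parent.visits) / child.visits)
    return exploit + explore

def mcts(root_state, model, simulations=200):
    root = Node(root_state)
    for _ in range(simulations):
        node = root
        # 1. Selection: descend while the current node is fully expanded.
        while node.children and not model.is_terminal(node.state):
            if len(node.children) < len(model.legal_moves(node.state)):
                break  # unexplored moves remain; expand here
            parent = node
            _, node = max(parent.children.items(),
                          key=lambda mc: ucb(parent, mc[1]))
        # 2. Expansion: a world-model query yields the successor of one untried move.
        if not model.is_terminal(node.state):
            untried = [m for m in model.legal_moves(node.state)
                       if m not in node.children]
            if untried:
                child = Node(model.next_state(node.state, untried[0]), parent=node)
                node.children[untried[0]] = child
                node = child
        # 3. Evaluation: the model's value estimate replaces a random rollout.
        value = model.value(node.state)  # win probability for the side to move
        # 4. Backup: alternate the perspective at every ply.
        while node is not None:
            node.visits += 1
            node.value_sum += value
            value = 1.0 - value
            node = node.parent
    # Return the most-visited root move.
    return max(root.children.items(), key=lambda mc: mc[1].visits)[0]
```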

Summary: Multi-Action-Value (MAV) Model

The MAV model is a Transformer pre-trained on textual game data, designed to function as:

  1. World Model:
    • Tracks game states after moves.
    • Predicts legal moves.
    • Detects terminal states.
  2. Value Function:
    • Outputs action values as win probabilities.
    • Uses discrete buckets (e.g., 64) to represent win probabilities.
  3. Policy Function:
    • Determines the best action for multiple board games.
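
As a sketch of how a single model call can serve all three roles at once, here is a minimal parser in Python. The response format, field names, and helper are hypothetical; only the idea of per-move value tokens drawn from 64 buckets comes from the paper:

```python
from dataclasses import dataclass

@dataclass
class MAVOutput:
    """Result of a single MAV call; all field names here are illustrative."""
    legal_moves: list[str]           # world model: legal moves in the position
    is_terminal: bool                # world model: no legal moves => game over (simplified proxy)
    action_values: dict[str, float]  # value function: move -> win probability
    best_move: str | None            # policy: argmax over the action values

def parse_mav_response(text: str, num_buckets: int = 64) -> MAVOutput:
    """Parse a toy 'move <ctrlK>' response into the three MAV roles."""
    action_values: dict[str, float] = {}
    for line in text.strip().splitlines():
        move, token = line.split()
        bucket = int(token.removeprefix("<ctrl").removesuffix(">"))
        # Decode a bucket index to the midpoint of its probability interval.
        action_values[move] = (bucket + 0.5) / num_buckets
    best = max(action_values, key=action_values.get) if action_values else None
    return MAVOutput(
        legal_moves=list(action_values),
        is_terminal=not action_values,
        action_values=action_values,
        best_move=best,
    )

# Toy response covering three legal moves:
print(parse_mav_response("e2e4 <ctrl46>\ng1f3 <ctrl44>\nd2d4 <ctrl45>"))
```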

Key Features:

Improvements:

  1. Unified modeling of world state, policy, and action values.
  2. Outputs best actions without reliance on external engines.
  3. Efficient single-call inference for reduced computational cost.
  4. Scalable inference-time computation for higher-quality results.

Datasets:

Applications:

Discrete Representation of Win Rate in MAV Model

The MAV model represents win rates in discrete form rather than as continuous values: win probabilities are quantized into 64 discrete buckets, each represented by a unique token (e.g., <ctrl28> for bucket 28).
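
A minimal sketch of the quantization, assuming uniformly spaced buckets over [0, 1] (the 64 buckets and <ctrlK> tokens are from the paper; the uniform spacing and exact rounding are assumptions):

```python
NUM_BUCKETS = 64

def win_prob_to_token(p: float) -> str:
    """Quantize a win probability in [0, 1] to one of 64 bucket tokens."""
    bucket = min(int(p * NUM_BUCKETS), NUM_BUCKETS - 1)  # p = 1.0 falls in the top bucket
    return f"<ctrl{bucket}>"

def token_to_win_prob(token: str) -> float:
    """Decode a bucket token back to the midpoint of its probability interval."""
    bucket = int(token.removeprefix("<ctrl").removesuffix(">"))
    return (bucket + 0.5) / NUM_BUCKETS

print(win_prob_to_token(0.734))       # -> <ctrl46>
print(token_to_win_prob("<ctrl46>"))  # -> 0.7265625
```

Decoding to the bucket midpoint bounds the quantization error at 1/128 ≈ 0.8 percentage points of win probability.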

Key Details:

Advantages of Discrete Representation:

  1. Stability in Training:
    • Classification targets are less sensitive to noise than regression targets, leading to more stable training.
  2. Efficiency in Inference:
    • Predicting a single discrete token is computationally simpler and faster than decoding a multi-digit continuous value.
  3. Improved Differentiation:
    • Discrete buckets let the model cleanly separate moves with similar values, helping it select the best move.

Example in Chess:
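
The paper's own example isn't reproduced here; as an illustration consistent with the 64-bucket scheme above (the numbers are hypothetical): if the model estimates a 73.4% win probability after 1. e4, that estimate falls in bucket ⌊0.734 × 64⌋ = 46 and is emitted as the token <ctrl46>, while a rival move estimated at 68% falls in bucket ⌊0.68 × 64⌋ = 43 (<ctrl43>), so the two moves remain clearly separated despite quantization.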

By using this bucketized representation, the MAV model achieves a balance between precision and computational efficiency, enabling better performance in both training and inference.

Explanation of MAV Input/Output Specification

The paper's figure provides an example of how the Multi-Action-Value (MAV) model formats its input and output for game-playing tasks, using Chess as the running example.

1. Structure Overview
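
The exact format is defined in the paper's figure; the following is a rough, hypothetical reconstruction of the kind of structured text involved (field labels, layout, and values are illustrative only, not the paper's verbatim specification):

```
# Input (illustrative)
[FEN] rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1
[MOVE HISTORY] 1. e4

# Output (illustrative)
[LEGAL MOVES] e7e5, c7c5, g8f6, ...
[ACTION VALUES] e7e5 <ctrl33>  c7c5 <ctrl32>  g8f6 <ctrl31>
[BEST MOVE] e7e5
```

A single such call supplies everything the external search loop needs: the updated state, the legal moves, and per-move value estimates.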

Performance