Training Process
The AZ-Go training pipeline implements the AlphaZero algorithm with distributed self-play and iterative improvement.
Training Loop Overview
┌────────────────────────────────────────────────────┐
│                  Iteration Start                   │
│                                                    │
│  1. Self-Play ───▶ 2. Train NN ───▶ 3. Evaluate    │
│       ▲                                  │         │
│       │                                  │         │
│       └──────── 4. Update Model ◀────────┘         │
│                                                    │
└────────────────────────────────────────────────────┘
Phase 1: Self-Play Generation
Overview
Self-play is the most time-consuming phase of training. The model starts with zero knowledge other than the rules of Go: beginning from randomly initialized weights, we generate training samples through self-play games using a reinforcement learning approach.
Training Data Structure
Each training example consists of the following (an encoding sketch follows the list):
- Board State: 18x7x7 stack representing current position
- Layers 1-16: Past 16 board states from each player’s perspective
- Layer 17: Current player indicator (all 1s for black, all -1s for white)
- Layer 18: Sensibility layer - encodes territory control
- Policy Target: Length-49 vector (one entry per point of the 7x7 board) with a 1 at the index of the most-visited MCTS move
- Value Target: Game result (-1 for loss, 1 for win)
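For concreteness, the sketch below assembles one training example in this layout. It is a minimal illustration, not the repository's actual encoder: the argument names (history, current_player, sensibility_plane, best_move_index, game_result) are assumptions, and the one-hot policy encoding simply follows the description above.

import numpy as np

BOARD_SIZE = 7

def encode_example(history, current_player, sensibility_plane,
                   best_move_index, game_result):
    """Assemble one training example in the layout described above (sketch).

    history            -- the 16 most recent 7x7 board planes
    current_player     -- +1 for black, -1 for white
    sensibility_plane  -- 7x7 plane encoding territory control
    best_move_index    -- flat index (0..48) of the most-visited MCTS move
    game_result        -- +1 for a win, -1 for a loss (current player's view)
    """
    planes = [np.asarray(p, dtype=np.float32) for p in history]            # layers 1-16
    planes.append(np.full((BOARD_SIZE, BOARD_SIZE), current_player,
                          dtype=np.float32))                               # layer 17
    planes.append(np.asarray(sensibility_plane, dtype=np.float32))         # layer 18
    board_state = np.stack(planes)                                         # shape (18, 7, 7)

    policy_target = np.zeros(BOARD_SIZE * BOARD_SIZE, dtype=np.float32)
    policy_target[best_move_index] = 1.0                                   # most-visited move
    value_target = np.float32(game_result)                                 # -1 loss / +1 win
    return board_state, policy_target, value_target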
Data Collection
- 50,000 samples per iteration
- The training set includes samples from the last 4 iterations (see the rolling-window sketch after this list)
- Iteration 1: 50,000 samples
- Iteration 2: 100,000 samples
- Iteration 3: 150,000 samples
- Iteration 4+: 200,000 samples (rolling window)
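One simple way to keep such a rolling window is a bounded deque of per-iteration sample lists, as sketched below; this assumes roughly 50,000 samples arrive per iteration and is illustrative rather than the pipeline's actual data store.

from collections import deque

ITERATIONS_KEPT = 4                      # keep only the last 4 iterations of data

# each element is the full sample list from one iteration (~50,000 samples)
example_history = deque(maxlen=ITERATIONS_KEPT)

def add_iteration_samples(samples):
    """Append one iteration's samples; the oldest iteration falls off automatically."""
    example_history.append(samples)

def training_set():
    """Flatten the retained iterations into one training set (up to ~200,000 samples)."""
    return [example for iteration in example_history for example in iteration]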
Randomness for Exploration
Temperature-based Selection:
- First n moves: sampled stochastically from the MCTS visit counts (the cutoff and temperature are configurable)
- Remaining moves: near-deterministic selection of the most-visited move
- Reduces overfitting and ensures game variety
Dirichlet Noise:
- Applied to the prior probability distribution at the root node during MCTS
- Ensures variety in training games
- Prevents learning repetitive patterns
MCTS Configuration
- Simulations: 500 per move
- C_PUCT: 1.0 (exploration constant)
- Temperature: Configurable for move randomness
- Dirichlet α: 0.25 at root node (a sketch combining this noise with temperature-based selection follows)
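The sketch below shows one common way to realize both randomness mechanisms with the parameters listed above (α = 0.25 at the root, a configurable temperature cutoff). The noise mixing weight epsilon and the helper signatures are assumptions for illustration, not values taken from AZ-Go.

import numpy as np

def add_root_dirichlet_noise(priors, alpha=0.25, epsilon=0.25):
    """Mix Dirichlet noise into the root-node prior probabilities.

    priors  -- 1-D array of move priors at the root (sums to 1)
    epsilon -- mixing weight for the noise (assumed value, not from AZ-Go)
    """
    noise = np.random.dirichlet([alpha] * len(priors))
    return (1.0 - epsilon) * np.asarray(priors) + epsilon * noise

def select_move(visit_counts, move_number, temp_threshold):
    """Pick a move from MCTS visit counts with temperature-based randomness."""
    visit_counts = np.asarray(visit_counts, dtype=np.float64)
    if move_number < temp_threshold:
        # early moves: sample in proportion to visit counts (temperature = 1)
        probs = visit_counts / visit_counts.sum()
        return int(np.random.choice(len(visit_counts), p=probs))
    # later moves: play the most-visited move (near-deterministic)
    return int(np.argmax(visit_counts))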
Phase 2: Neural Network Training
Training Configuration
optimizer: Adam
batch_size: 2048
learning_rate: 0.0001
epochs: 10
Loss Function
Combined loss with equal weighting:
Loss = MSE(value_pred, value_target) + CrossEntropy(policy_pred, policy_target)
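In PyTorch terms, the equally weighted loss can be written as below. This assumes the policy head returns raw logits over the 49 moves and the value head returns a scalar per position; the actual head shapes in AZ-Go may differ.

import torch.nn.functional as F

def combined_loss(policy_logits, value_pred, policy_target, value_target):
    """Equal-weight sum of value MSE and policy cross-entropy.

    policy_logits -- raw policy head output, shape (batch, 49)
    value_pred    -- value head output, shape (batch,) or (batch, 1)
    policy_target -- MCTS-derived target distribution, shape (batch, 49)
    value_target  -- game results in {-1, +1}, shape (batch,)
    """
    value_loss = F.mse_loss(value_pred.view(-1), value_target.view(-1))
    # cross-entropy against the (possibly soft) target distribution
    policy_loss = -(policy_target * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
    return value_loss + policy_loss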
Training Process
- Load collected self-play samples
- Shuffle and batch data
- Train for 10 epochs using Adam optimizer
- Monitor both policy and value losses
- Save a checkpoint after training (a compact version of this loop is sketched below)
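A compact sketch of this training phase is shown below, using the Adam settings from the configuration above (batch size 2048, 10 epochs, learning rate 1e-4). It assumes the model returns a (policy_logits, value) pair and that the samples are already stacked into tensors; both are assumptions for illustration.

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def train_network(model, boards, policy_targets, value_targets,
                  epochs=10, batch_size=2048, lr=1e-4, device="cuda"):
    """Train on the collected self-play samples (illustrative sketch).

    boards, policy_targets, value_targets -- pre-stacked float tensors
    model -- network returning (policy_logits, value) for a batch of boards
    """
    loader = DataLoader(TensorDataset(boards, policy_targets, value_targets),
                        batch_size=batch_size, shuffle=True)     # shuffle and batch
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()

    for epoch in range(epochs):
        policy_sum = value_sum = 0.0
        for b, p, v in loader:
            b, p, v = b.to(device), p.to(device), v.to(device)
            policy_logits, value_pred = model(b)
            value_loss = F.mse_loss(value_pred.view(-1), v.view(-1))
            policy_loss = -(p * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
            loss = value_loss + policy_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            policy_sum += policy_loss.item()
            value_sum += value_loss.item()
        # monitor both losses per epoch
        print(f"epoch {epoch + 1}: policy={policy_sum / len(loader):.3f} "
              f"value={value_sum / len(loader):.3f}")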
Phase 3: Model Evaluation (Arena)
Competition Setup
Previous best model vs newly trained model:
- Total Games: 50 matches
- Acceptance Threshold: 54% win rate (27/50 games)
- Fair Comparison: Same MCTS parameters for both models
- Failure Case: If the threshold is not met, the new model is discarded and the previous best is retained (a minimal arena loop is sketched below)
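A minimal version of the arena loop, using the 50-game count and 54% threshold above; the play_game callable (which should alternate colors internally and report +1 for a challenger win) is a hypothetical helper, not part of the repository's API.

def arena(challenger, champion, play_game, num_games=50, win_threshold=0.54):
    """Pit the newly trained model against the previous best (sketch).

    play_game -- callable playing one full game; returns +1 if the challenger
                 wins, -1 otherwise (it should alternate colors internally)
    """
    wins = sum(1 for _ in range(num_games) if play_game(challenger, champion) > 0)
    win_rate = wins / num_games
    accepted = win_rate >= win_threshold          # e.g. 27/50 games
    return accepted, win_rate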
Arena Randomness
To add variation without modifying model output:
- Board states rotated by 90° random number of times before NN input
- Output transformed back to match original orientation
- Go positions are symmetric under rotation, so all four board orientations are equivalent
- Ensures diverse game positions during evaluation (see the rotation sketch below)
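The rotation trick can be implemented with numpy.rot90: rotate the input planes a random number of quarter turns, run the network, then rotate the policy back. The predict callable and its return shape are assumptions for illustration.

import numpy as np

BOARD_SIZE = 7

def predict_with_random_rotation(predict, board_planes):
    """Rotate the position a random number of quarter turns, run the network,
    then rotate the policy output back to the original orientation (sketch).

    predict      -- callable mapping an (18, 7, 7) array to (policy[49], value)
    board_planes -- numpy array of shape (18, 7, 7)
    """
    k = np.random.randint(4)                                  # 0-3 rotations of 90 degrees
    rotated = np.rot90(board_planes, k, axes=(1, 2)).copy()   # rotate every plane
    policy, value = predict(rotated)
    policy_board = np.asarray(policy).reshape(BOARD_SIZE, BOARD_SIZE)
    policy_board = np.rot90(policy_board, -k)                 # undo the rotation
    return policy_board.reshape(-1), value                    # value is rotation-invariant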
Evaluation Metrics
Tracked for each iteration:
- Win rate vs previous best
- Game length distribution
- Model acceptance decision
Phase 4: Model Update
Promotion Criteria
The new model is promoted if any of the following holds (expressed as a condition sketch below):
- Win rate ≥ 55% in arena
- OR iteration number is multiple of 10 (checkpoint)
- OR this is the first iteration
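Expressed directly as a condition, using the 55% threshold stated above (a sketch, not the repository's code):

def should_promote(win_rate, iteration, win_threshold=0.55):
    """Decide whether the newly trained model replaces the current best."""
    return (win_rate >= win_threshold        # clearly beat the previous best
            or iteration % 10 == 0           # periodic checkpoint promotion
            or iteration == 1)               # first iteration is always accepted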
Model Management
logs/
├── checkpoints/
│ ├── best_model.pth.tar
│ ├── checkpoint_100.pth.tar
│ ├── checkpoint_200.pth.tar
│ └── current_model.pth.tar
└── train_examples/
├── iter_100/
├── iter_200/
└── current/
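Checkpoints in this layout are ordinary PyTorch archives; the sketch below shows plausible save/load helpers for the paths above. The dictionary key ("state_dict") is an assumption and may not match what AZ-Go actually stores.

import os
import torch

CHECKPOINT_DIR = "logs/checkpoints"

def save_checkpoint(model, filename="current_model.pth.tar"):
    """Save the model weights under logs/checkpoints/ (sketch)."""
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    torch.save({"state_dict": model.state_dict()},
               os.path.join(CHECKPOINT_DIR, filename))

def load_checkpoint(model, filename="best_model.pth.tar", device="cpu"):
    """Load saved weights back into an existing model instance."""
    checkpoint = torch.load(os.path.join(CHECKPOINT_DIR, filename),
                            map_location=device)
    model.load_state_dict(checkpoint["state_dict"])
    return model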
Training Configuration
Standard Settings
# 7x7 Go configuration
num_iterations: 500
num_episodes: 5000 # per iteration
num_mcts_sims: 500 # per move
num_epochs: 10 # NN training
batch_size: 512
num_games_arena: 40 # evaluation games
win_threshold: 0.55 # promotion threshold
Memory Management
- Example History: 200,000 positions
- Sampling: Uniform from recent 10 iterations
- Deduplication: Remove duplicate positions (see the dedup sketch below)
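Deduplication can be done by fingerprinting the board planes; a minimal sketch, assuming each example is a (board, policy, value) tuple with a numpy board:

def deduplicate(examples):
    """Remove duplicate positions, keeping the first occurrence of each board.

    examples -- iterable of (board_planes, policy_target, value_target) tuples,
                where board_planes is a numpy array
    """
    seen = set()
    unique = []
    for board, policy, value in examples:
        key = board.tobytes()                 # hashable fingerprint of the position
        if key not in seen:
            seen.add(key)
            unique.append((board, policy, value))
    return unique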
Monitoring Progress
Real-time Metrics
During training, monitor:
Iteration 42/500
Self-Play: 100%|████████| 5000/5000 [02:34:56]
Training Loss: 0.423
Arena: New Model Wins 24/40 (60.0%)
Model Updated: Yes
ELO Estimate: 1823 (+45)
Visualization
Graphs are generated in logs/graphs/ (a plotting sketch follows this list):
- Loss curves (policy and value)
- Win rates over iterations
- Game length distributions
- Move prediction accuracy
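The loss and win-rate plots can be reproduced with matplotlib from per-iteration metrics, as sketched below; the argument names and output filename are placeholders, not the repository's logging format.

import os
import matplotlib.pyplot as plt

def plot_training_curves(policy_losses, value_losses, win_rates, out_dir="logs/graphs"):
    """Plot loss curves and arena win rates over iterations (sketch)."""
    os.makedirs(out_dir, exist_ok=True)
    iterations = range(1, len(policy_losses) + 1)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(iterations, policy_losses, label="policy loss")
    ax1.plot(iterations, value_losses, label="value loss")
    ax1.set_xlabel("iteration")
    ax1.set_ylabel("loss")
    ax1.legend()

    ax2.plot(iterations, win_rates)
    ax2.axhline(0.55, linestyle="--")         # promotion threshold
    ax2.set_xlabel("iteration")
    ax2.set_ylabel("arena win rate")

    fig.tight_layout()
    fig.savefig(os.path.join(out_dir, "training_curves.png"))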
Distributed Training
Worker Configuration
num_parallel_workers: 4
worker_timeout: 3600 # seconds
episodes_per_worker: 1250 # 5000 total / 4 workers
Load Balancing
- Dynamic work assignment
- Automatic failure recovery
- Progress synchronization across workers (a local multiprocessing sketch of the episode split follows)
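As a local stand-in for the distributed setup, the sketch below splits the self-play episodes evenly across worker processes with multiprocessing (5000 / 4 = 1250 each, matching the configuration above). The real pipeline coordinates remote workers over SSH, so this only illustrates the shape of the work split; play_game is a hypothetical top-level function.

from multiprocessing import Pool

def _worker(args):
    """Play this worker's share of self-play games and return the samples."""
    play_game, episodes = args
    samples = []
    for _ in range(episodes):
        samples.extend(play_game())
    return samples

def run_self_play(play_game, total_episodes=5000, num_workers=4):
    """Split self-play episodes evenly across local worker processes (sketch).

    play_game -- top-level (picklable) function that plays one self-play game
                 and returns a list of training examples
    """
    per_worker = total_episodes // num_workers            # 5000 / 4 = 1250
    with Pool(processes=num_workers) as pool:
        batches = pool.map(_worker, [(play_game, per_worker)] * num_workers)
    return [sample for batch in batches for sample in batch]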
Advanced Training
Curriculum Learning
Start with simpler positions (schedule sketched below):
- Iterations 1-50: Random opening positions
- Iterations 51-200: Balanced middlegame
- Iterations 201+: Full games
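The schedule can be encoded as a simple lookup over the iteration number; this mirrors the list above and is a sketch, not code from the repository.

def curriculum_phase(iteration):
    """Map an iteration number to the curriculum phase described above."""
    if iteration <= 50:
        return "random_openings"          # iterations 1-50
    if iteration <= 200:
        return "balanced_middlegame"      # iterations 51-200
    return "full_games"                   # iterations 201+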
Hyperparameter Tuning
Key parameters to experiment with:
- C_PUCT: Higher = more exploration
- Temperature: Lower = more deterministic
- Learning Rate: Affects convergence speed
- MCTS Sims: More = stronger but slower
Training from Scratch vs Fine-tuning
# From scratch (default)
python start_main.py
# Fine-tune existing model
python start_main.py --load-model logs/checkpoints/best_model.pth.tar
Troubleshooting
Slow Convergence
- Increase MCTS simulations
- Lower learning rate
- More training epochs
Overfitting
- Increase L2 regularization
- Add dropout layers
- Larger example buffer
Worker Failures
- Check SSH connectivity
- Verify CUDA availability
- Monitor memory usage
Next Steps
- Usage Guide - Running trained models
- Architecture - System design details