Training Process
The AZ-Go training pipeline implements the AlphaZero algorithm with distributed self-play and iterative improvement.
Training Loop Overview
┌────────────────────────────────────────────────────┐
│                  Iteration Start                   │
│                                                    │
│  1. Self-Play ───▶ 2. Train NN ───▶ 3. Evaluate    │
│       ▲                                  │         │
│       │                                  │         │
│       └──────── 4. Update Model ◀────────┘         │
│                                                    │
└────────────────────────────────────────────────────┘
Phase 1: Self-Play Generation
Overview
Self-play is the most time-consuming phase of training. The model starts with zero knowledge other than the rules of Go: beginning from randomly initialized weights, we generate training samples through self-play games using a reinforcement learning approach.
Training Data Structure
Each training example consists of the following (an encoding sketch follows the list):
- Board State: 18x7x7 stack representing current position
- Layers 1-16: Past 16 board states from each player’s perspective
- Layer 17: Current player indicator (all 1s for black, all -1s for white)
- Layer 18: Sensibility layer - encodes territory control
- Policy Target: Length-49 vector (one entry per point of the 7x7 board) with a 1 at the index of the most-visited MCTS move
- Value Target: Game result (-1 for loss, 1 for win)
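For concreteness, the sketch below assembles one training example in this layout. It is a minimal illustration, not the repository's actual encoder: the argument names (history, current_player, sensibility_plane, best_move_index, game_result) are assumptions, and the one-hot policy encoding simply follows the description above.

import numpy as np

BOARD_SIZE = 7

def encode_example(history, current_player, sensibility_plane,
                   best_move_index, game_result):
    """Assemble one training example in the layout described above (sketch).

    history            -- the 16 most recent 7x7 board planes
    current_player     -- +1 for black, -1 for white
    sensibility_plane  -- 7x7 plane encoding territory control
    best_move_index    -- flat index (0..48) of the most-visited MCTS move
    game_result        -- +1 for a win, -1 for a loss (current player's view)
    """
    planes = [np.asarray(p, dtype=np.float32) for p in history]            # layers 1-16
    planes.append(np.full((BOARD_SIZE, BOARD_SIZE), current_player,
                          dtype=np.float32))                               # layer 17
    planes.append(np.asarray(sensibility_plane, dtype=np.float32))         # layer 18
    board_state = np.stack(planes)                                         # shape (18, 7, 7)

    policy_target = np.zeros(BOARD_SIZE * BOARD_SIZE, dtype=np.float32)
    policy_target[best_move_index] = 1.0                                   # most-visited move
    value_target = np.float32(game_result)                                 # -1 loss / +1 win
    return board_state, policy_target, value_target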
Data Collection
- 50,000 samples per iteration
- The training set includes samples from the last 4 iterations (see the rolling-window sketch after this list)
- Iteration 1: 50,000 samples
- Iteration 2: 100,000 samples
- Iteration 3: 150,000 samples
- Iteration 4+: 200,000 samples (rolling window)
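One simple way to keep such a rolling window is a bounded deque of per-iteration sample lists, as sketched below; this assumes roughly 50,000 samples arrive per iteration and is illustrative rather than the pipeline's actual data store.

from collections import deque

ITERATIONS_KEPT = 4                      # keep only the last 4 iterations of data

# each element is the full sample list from one iteration (~50,000 samples)
example_history = deque(maxlen=ITERATIONS_KEPT)

def add_iteration_samples(samples):
    """Append one iteration's samples; the oldest iteration falls off automatically."""
    example_history.append(samples)

def training_set():
    """Flatten the retained iterations into one training set (up to ~200,000 samples)."""
    return [example for iteration in example_history for example in iteration]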
Randomness for Exploration
Temperature-based Selection:
- First n moves: sampled stochastically from the MCTS visit counts (the cutoff and temperature are configurable)
- Remaining moves: near-deterministic selection of the most-visited move
- Reduces overfitting and ensures game variety
Dirichlet Noise:
- Applied to the prior probability distribution at the root node during MCTS
- Ensures variety in training games
- Prevents learning repetitive patterns
MCTS Configuration
- Simulations: 500 per move
- C_PUCT: 1.0 (exploration constant)
- Temperature: Configurable for move randomness
- Dirichlet α: 0.25 at root node (a sketch combining this noise with temperature-based selection follows)
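The sketch below shows one common way to realize both randomness mechanisms with the parameters listed above (α = 0.25 at the root, a configurable temperature cutoff). The noise mixing weight epsilon and the helper signatures are assumptions for illustration, not values taken from AZ-Go.

import numpy as np

def add_root_dirichlet_noise(priors, alpha=0.25, epsilon=0.25):
    """Mix Dirichlet noise into the root-node prior probabilities.

    priors  -- 1-D array of move priors at the root (sums to 1)
    epsilon -- mixing weight for the noise (assumed value, not from AZ-Go)
    """
    noise = np.random.dirichlet([alpha] * len(priors))
    return (1.0 - epsilon) * np.asarray(priors) + epsilon * noise

def select_move(visit_counts, move_number, temp_threshold):
    """Pick a move from MCTS visit counts with temperature-based randomness."""
    visit_counts = np.asarray(visit_counts, dtype=np.float64)
    if move_number < temp_threshold:
        # early moves: sample in proportion to visit counts (temperature = 1)
        probs = visit_counts / visit_counts.sum()
        return int(np.random.choice(len(visit_counts), p=probs))
    # later moves: play the most-visited move (near-deterministic)
    return int(np.argmax(visit_counts))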
Phase 2: Neural Network Training
Training Configuration
optimizer: Adam
batch_size: 2048
learning_rate: 0.0001
epochs: 10
Loss Function
Combined loss with equal weighting:
Loss = MSE(value_pred, value_target) + CrossEntropy(policy_pred, policy_target)
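In PyTorch terms, the equally weighted loss can be written as below. This assumes the policy head returns raw logits over the 49 moves and the value head returns a scalar per position; the actual head shapes in AZ-Go may differ.

import torch.nn.functional as F

def combined_loss(policy_logits, value_pred, policy_target, value_target):
    """Equal-weight sum of value MSE and policy cross-entropy.

    policy_logits -- raw policy head output, shape (batch, 49)
    value_pred    -- value head output, shape (batch,) or (batch, 1)
    policy_target -- MCTS-derived target distribution, shape (batch, 49)
    value_target  -- game results in {-1, +1}, shape (batch,)
    """
    value_loss = F.mse_loss(value_pred.view(-1), value_target.view(-1))
    # cross-entropy against the (possibly soft) target distribution
    policy_loss = -(policy_target * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
    return value_loss + policy_loss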
Training Process
- Load collected self-play samples
- Shuffle and batch data
- Train for 10 epochs using Adam optimizer
- Monitor both policy and value losses
- Save a checkpoint after training (a compact version of this loop is sketched below)
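A compact sketch of this training phase is shown below, using the Adam settings from the configuration above (batch size 2048, 10 epochs, learning rate 1e-4). It assumes the model returns a (policy_logits, value) pair and that the samples are already stacked into tensors; both are assumptions for illustration.

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def train_network(model, boards, policy_targets, value_targets,
                  epochs=10, batch_size=2048, lr=1e-4, device="cuda"):
    """Train on the collected self-play samples (illustrative sketch).

    boards, policy_targets, value_targets -- pre-stacked float tensors
    model -- network returning (policy_logits, value) for a batch of boards
    """
    loader = DataLoader(TensorDataset(boards, policy_targets, value_targets),
                        batch_size=batch_size, shuffle=True)     # shuffle and batch
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()

    for epoch in range(epochs):
        policy_sum = value_sum = 0.0
        for b, p, v in loader:
            b, p, v = b.to(device), p.to(device), v.to(device)
            policy_logits, value_pred = model(b)
            value_loss = F.mse_loss(value_pred.view(-1), v.view(-1))
            policy_loss = -(p * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
            loss = value_loss + policy_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            policy_sum += policy_loss.item()
            value_sum += value_loss.item()
        # monitor both losses per epoch
        print(f"epoch {epoch + 1}: policy={policy_sum / len(loader):.3f} "
              f"value={value_sum / len(loader):.3f}")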
Phase 3: Model Evaluation (Arena)
Competition Setup
Previous best model vs newly trained model:
- Total Games: 50 matches
- Acceptance Threshold: 54% win rate (27/50 games)
- Fair Comparison: Same MCTS parameters for both models
- Failure Case: If the threshold is not met, the new model is discarded and the previous best is retained (a minimal arena loop is sketched below)
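A minimal version of the arena loop, using the 50-game count and 54% threshold above; the play_game callable (which should alternate colors internally and report +1 for a challenger win) is a hypothetical helper, not part of the repository's API.

def arena(challenger, champion, play_game, num_games=50, win_threshold=0.54):
    """Pit the newly trained model against the previous best (sketch).

    play_game -- callable playing one full game; returns +1 if the challenger
                 wins, -1 otherwise (it should alternate colors internally)
    """
    wins = sum(1 for _ in range(num_games) if play_game(challenger, champion) > 0)
    win_rate = wins / num_games
    accepted = win_rate >= win_threshold          # e.g. 27/50 games
    return accepted, win_rate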
Arena Randomness
To add variation without modifying model output:
- Board states rotated by 90° random number of times before NN input
- Output transformed back to match original orientation
- Go positions are symmetric under rotation, so all four board orientations are equivalent
- Ensures diverse game positions during evaluation (see the rotation sketch below)
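The rotation trick can be implemented with numpy.rot90: rotate the input planes a random number of quarter turns, run the network, then rotate the policy back. The predict callable and its return shape are assumptions for illustration.

import numpy as np

BOARD_SIZE = 7

def predict_with_random_rotation(predict, board_planes):
    """Rotate the position a random number of quarter turns, run the network,
    then rotate the policy output back to the original orientation (sketch).

    predict      -- callable mapping an (18, 7, 7) array to (policy[49], value)
    board_planes -- numpy array of shape (18, 7, 7)
    """
    k = np.random.randint(4)                                  # 0-3 rotations of 90 degrees
    rotated = np.rot90(board_planes, k, axes=(1, 2)).copy()   # rotate every plane
    policy, value = predict(rotated)
    policy_board = np.asarray(policy).reshape(BOARD_SIZE, BOARD_SIZE)
    policy_board = np.rot90(policy_board, -k)                 # undo the rotation
    return policy_board.reshape(-1), value                    # value is rotation-invariant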
Evaluation Metrics
Tracked for each iteration:
- Win rate vs previous best
- Game length distribution
- Model acceptance decision
Phase 4: Model Update
Promotion Criteria
The new model is promoted if any of the following holds (expressed as a condition sketch below):
- Win rate ≥ 55% in arena
- OR iteration number is multiple of 10 (checkpoint)
- OR this is the first iteration
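Expressed directly as a condition, using the 55% threshold stated above (a sketch, not the repository's code):

def should_promote(win_rate, iteration, win_threshold=0.55):
    """Decide whether the newly trained model replaces the current best."""
    return (win_rate >= win_threshold        # clearly beat the previous best
            or iteration % 10 == 0           # periodic checkpoint promotion
            or iteration == 1)               # first iteration is always accepted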
Model Management
logs/
├── checkpoints/
│ ├── best_model.pth.tar
│ ├── checkpoint_100.pth.tar
│ ├── checkpoint_200.pth.tar
│ └── current_model.pth.tar
└── train_examples/
├── iter_100/
├── iter_200/
└── current/
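Checkpoints in this layout are ordinary PyTorch archives; the sketch below shows plausible save/load helpers for the paths above. The dictionary key ("state_dict") is an assumption and may not match what AZ-Go actually stores.

import os
import torch

CHECKPOINT_DIR = "logs/checkpoints"

def save_checkpoint(model, filename="current_model.pth.tar"):
    """Save the model weights under logs/checkpoints/ (sketch)."""
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    torch.save({"state_dict": model.state_dict()},
               os.path.join(CHECKPOINT_DIR, filename))

def load_checkpoint(model, filename="best_model.pth.tar", device="cpu"):
    """Load saved weights back into an existing model instance."""
    checkpoint = torch.load(os.path.join(CHECKPOINT_DIR, filename),
                            map_location=device)
    model.load_state_dict(checkpoint["state_dict"])
    return model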
Training Configuration
Standard Settings
# 7x7 Go configuration
num_iterations: 500
num_episodes: 5000 # per iteration
num_mcts_sims: 500 # per move
num_epochs: 10 # NN training
batch_size: 512
num_games_arena: 40 # evaluation games
win_threshold: 0.55 # promotion threshold
Memory Management
- Example History: 200,000 positions
- Sampling: Uniform from recent 10 iterations
- Deduplication: Remove duplicate positions (see the dedup sketch below)
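Deduplication can be done by fingerprinting the board planes; a minimal sketch, assuming each example is a (board, policy, value) tuple with a numpy board:

def deduplicate(examples):
    """Remove duplicate positions, keeping the first occurrence of each board.

    examples -- iterable of (board_planes, policy_target, value_target) tuples,
                where board_planes is a numpy array
    """
    seen = set()
    unique = []
    for board, policy, value in examples:
        key = board.tobytes()                 # hashable fingerprint of the position
        if key not in seen:
            seen.add(key)
            unique.append((board, policy, value))
    return unique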
Monitoring Progress
Real-time Metrics
During training, monitor:
Iteration 42/500
Self-Play: 100%|████████| 5000/5000 [02:34:56]
Training Loss: 0.423
Arena: New Model Wins 24/40 (60.0%)
Model Updated: Yes
ELO Estimate: 1823 (+45)
Visualization
Graphs are generated in logs/graphs/ (a plotting sketch follows this list):
- Loss curves (policy and value)
- Win rates over iterations
- Game length distributions
- Move prediction accuracy
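The loss and win-rate plots can be reproduced with matplotlib from per-iteration metrics, as sketched below; the argument names and output filename are placeholders, not the repository's logging format.

import os
import matplotlib.pyplot as plt

def plot_training_curves(policy_losses, value_losses, win_rates, out_dir="logs/graphs"):
    """Plot loss curves and arena win rates over iterations (sketch)."""
    os.makedirs(out_dir, exist_ok=True)
    iterations = range(1, len(policy_losses) + 1)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(iterations, policy_losses, label="policy loss")
    ax1.plot(iterations, value_losses, label="value loss")
    ax1.set_xlabel("iteration")
    ax1.set_ylabel("loss")
    ax1.legend()

    ax2.plot(iterations, win_rates)
    ax2.axhline(0.55, linestyle="--")         # promotion threshold
    ax2.set_xlabel("iteration")
    ax2.set_ylabel("arena win rate")

    fig.tight_layout()
    fig.savefig(os.path.join(out_dir, "training_curves.png"))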
Distributed Training
Worker Configuration
num_parallel_workers: 4
worker_timeout: 3600 # seconds
episodes_per_worker: 1250 # 5000 total / 4 workers
Load Balancing
- Dynamic work assignment
- Automatic failure recovery
- Progress synchronization across workers (a local multiprocessing sketch of the episode split follows)
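As a local stand-in for the distributed setup, the sketch below splits the self-play episodes evenly across worker processes with multiprocessing (5000 / 4 = 1250 each, matching the configuration above). The real pipeline coordinates remote workers over SSH, so this only illustrates the shape of the work split; play_game is a hypothetical top-level function.

from multiprocessing import Pool

def _worker(args):
    """Play this worker's share of self-play games and return the samples."""
    play_game, episodes = args
    samples = []
    for _ in range(episodes):
        samples.extend(play_game())
    return samples

def run_self_play(play_game, total_episodes=5000, num_workers=4):
    """Split self-play episodes evenly across local worker processes (sketch).

    play_game -- top-level (picklable) function that plays one self-play game
                 and returns a list of training examples
    """
    per_worker = total_episodes // num_workers            # 5000 / 4 = 1250
    with Pool(processes=num_workers) as pool:
        batches = pool.map(_worker, [(play_game, per_worker)] * num_workers)
    return [sample for batch in batches for sample in batch]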
Advanced Training
Curriculum Learning
Start with simpler positions (schedule sketched below):
- Iterations 1-50: Random opening positions
- Iterations 51-200: Balanced middlegame
- Iterations 201+: Full games
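The schedule can be encoded as a simple lookup over the iteration number; this mirrors the list above and is a sketch, not code from the repository.

def curriculum_phase(iteration):
    """Map an iteration number to the curriculum phase described above."""
    if iteration <= 50:
        return "random_openings"          # iterations 1-50
    if iteration <= 200:
        return "balanced_middlegame"      # iterations 51-200
    return "full_games"                   # iterations 201+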
Hyperparameter Tuning
Key parameters to experiment with:
- C_PUCT: Higher = more exploration
- Temperature: Lower = more deterministic
- Learning Rate: Affects convergence speed
- MCTS Sims: More = stronger but slower
Training from Scratch vs Fine-tuning
# From scratch (default)
python start_main.py
# Fine-tune existing model
python start_main.py --load-model logs/checkpoints/best_model.pth.tar
Troubleshooting
Slow Convergence
- Increase MCTS simulations
- Lower learning rate
- More training epochs
Overfitting
- Increase L2 regularization
- Add dropout layers
- Larger example buffer
Worker Failures
- Check SSH connectivity
- Verify CUDA availability
- Monitor memory usage
Next Steps
- Usage Guide - Running trained models
- Architecture - System design details