Architecture Overview

AZ-Go implements the AlphaZero algorithm with a distributed training architecture designed for efficiency and scalability.

System Components

1. Overseer (training/overseer.py)

The central coordinator that:

  • Manages the training loop
  • Distributes models to workers
  • Collects self-play games
  • Trains neural networks
  • Evaluates model improvements

2. Workers (training/worker.py)

Distributed nodes that:

  • Generate self-play games
  • Run MCTS simulations
  • Execute neural network inference
  • Send training data back to overseer

3. Neural Network (neural_network/go_alphanet.py)

ResNet-18 Architecture

At the core of this program is a neural network that produces move probabilities and a win rate estimate for every board state. We use a Residual Neural Network (ResNet-18), which addresses the “vanishing gradient problem” through residual connections (skip connections).

Key Innovation: The input to each Residual Block is added to the output of the final Convolutional Layer in the block, allowing the network to learn residual mappings rather than complete transformations from scratch. This allows gradients to flow more easily through the network during backpropagation, reducing vanishing gradient issues and enabling much deeper architectures.

Why ResNet-18: Despite being relatively shallow compared to larger versions like ResNet-50 or ResNet-152, ResNet-18 works well for our use case because the board size is relatively small (7x7 compared to standard 19x19 Go boards). It consists of 18 layers with learnable weights, structured into four major blocks, each containing two residual units.
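
The skip connection can be sketched in PyTorch as follows. This is a minimal illustration of the residual block idea, assuming 3x3 convolutions with batch normalization; it is not the exact block definition in go_alphanet.py.

    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualBlock(nn.Module):
        """Two 3x3 convolutions whose output is added back to the block's input."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)

        def forward(self, x):
            out = F.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return F.relu(out + x)  # skip connection: add the block's input to the conv output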

Input Structure (18x7x7 stack):

  • Layers 1-16: Past 16 board states from each player’s perspective
    • 1 = black stones, -1 = white stones, 0 = empty
  • Layer 17: Current player indicator (all 1s for black, all -1s for white)
  • Layer 18: Sensibility layer - encodes intersections considered to be the current player’s territory (reachable only by current player’s stones)
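
A rough sketch of assembling such an 18x7x7 stack (the history buffer and helper signature here are assumptions for illustration, not the repository’s actual encoder):

    import numpy as np

    BOARD_SIZE = 7

    def build_input_stack(history, current_player, sensibility):
        """history: the last 16 board states as 7x7 arrays (1 = black, -1 = white, 0 = empty)
        current_player: 1 for black, -1 for white
        sensibility: 7x7 mask of intersections reachable only by the current player's stones"""
        planes = np.zeros((18, BOARD_SIZE, BOARD_SIZE), dtype=np.float32)
        for i, board in enumerate(history[-16:]):   # layers 1-16: past board states
            planes[i] = board
        planes[16] = current_player                 # layer 17: player indicator plane
        planes[17] = sensibility                    # layer 18: sensibility layer
        return planes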

Output Components:

  • Policy Head: 49-element probability distribution (one per board intersection)
  • Value Head: Win rate estimate in [-1, 1] (-1 = loss, 1 = win)

Input (18x7x7)
      ↓
Conv Block
      ↓
ResNet Blocks (x9)
      ↓
      ├─── Policy Head ──▶ 49 move probabilities
      │
      └─── Value Head ───▶ Position evaluation [-1, 1]

This is effectively an image classification network where:

  • Input “channels” are board states instead of RGB
  • Output “classes” are possible next moves instead of object categories
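
A minimal sketch of the two output heads for a 7x7 board, assuming a shared convolutional trunk; the layer sizes are illustrative rather than taken from go_alphanet.py:

    import torch.nn as nn
    import torch.nn.functional as F

    class PolicyValueHeads(nn.Module):
        def __init__(self, channels, board_size=7):
            super().__init__()
            n = board_size * board_size                        # 49 intersections
            self.policy = nn.Sequential(
                nn.Conv2d(channels, 2, kernel_size=1), nn.ReLU(), nn.Flatten(),
                nn.Linear(2 * n, n))                           # 49 move logits
            self.value = nn.Sequential(
                nn.Conv2d(channels, 1, kernel_size=1), nn.ReLU(), nn.Flatten(),
                nn.Linear(n, 64), nn.ReLU(),
                nn.Linear(64, 1), nn.Tanh())                   # scalar in [-1, 1]

        def forward(self, trunk_out):
            return F.softmax(self.policy(trunk_out), dim=1), self.value(trunk_out)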

4. MCTS (mcts.py)

Monte Carlo Tree Search Implementation

MCTS is a heuristic search algorithm that combines tree search with random sampling to make strong move decisions in Go’s vast decision space. It operates through four main steps:

1. Selection:

  • Navigate through the tree selecting nodes with highest Upper Confidence Bound (UCB)
  • UCB balances:
    • Exploitation: Following promising paths (high win rate)
    • Exploration: Visiting less-explored nodes (low visit count)
  • C_PUCT parameter controls exploration vs exploitation balance
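
A sketch of a PUCT-style selection rule, where Q is the mean value of a child, P is its prior from the policy head, and N counts visits. This is the standard AlphaZero formulation, not necessarily the exact expression used in mcts.py:

    import math

    def ucb_score(parent_visits, child_visits, child_value_sum, prior, c_puct):
        q = child_value_sum / child_visits if child_visits > 0 else 0.0      # exploitation
        u = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)   # exploration
        return q + u

    def select_child(children, parent_visits, c_puct=1.0):
        """children: dict mapping move -> node with .visits, .value_sum, .prior attributes.
        Returns the (move, node) pair with the highest UCB score."""
        return max(children.items(),
                   key=lambda kv: ucb_score(parent_visits, kv[1].visits,
                                            kv[1].value_sum, kv[1].prior, c_puct))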

2. Expansion:

  • Add a new child node when reaching a leaf that is not a terminal state
  • Initialize node value using neural network evaluation
  • Store win rate estimate and probability distribution with initial visit count of 1

3. Rollout:

  • Traditional MCTS uses random playouts to terminal states
  • We use the neural network’s value estimate instead, which is more efficient
  • Node initialized with NN output values

4. Backpropagation:

  • Update all traversed nodes with new information
  • Propagate values up the tree
  • Higher-level nodes reflect deeper exploration results
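
A sketch of the backup step: the value estimated at the new leaf is pushed back through every node visited during selection, flipping sign at each level because the players alternate (the sign convention here is an assumption, not necessarily how mcts.py stores values):

    def backpropagate(path, leaf_value):
        """path: nodes from the root down to the expanded leaf
        leaf_value: network value estimate from the leaf player's perspective"""
        value = leaf_value
        for node in reversed(path):
            node.visits += 1
            node.value_sum += value
            value = -value   # a good outcome for one player is a bad outcome for the parent's player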

Usage in Go:

  • Run MCTS for every move selection in every game
  • Output: a vector of visit counts for each node one level below the root
  • Next move: the index with the maximum visit count corresponds to the board intersection to play
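
For example, turning the root’s visit counts into the next move, assuming a flat 49-element vector indexed row-major over the 7x7 board:

    import numpy as np

    def choose_move(visit_counts, board_size=7):
        """visit_counts: length-49 array of visit counts for the root's children."""
        idx = int(np.argmax(visit_counts))     # most-visited child
        row, col = divmod(idx, board_size)     # flat index -> board intersection
        return row, col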

5. Game Engine (go/go_game.py)

Go game implementation:

  • Legal move generation
  • Capture detection
  • Ko rule enforcement
  • Territory scoring
  • SGF export
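
As a flavor of what capture detection involves, here is a generic liberty-counting sketch using a flood fill over a stone’s group. This is standard Go logic for illustration, not the actual implementation in go_game.py:

    def has_liberties(board, row, col):
        """Return True if the group containing (row, col) has at least one liberty.
        board: 2D list where 1 = black, -1 = white, 0 = empty."""
        color = board[row][col]
        size = len(board)
        seen, stack = set(), [(row, col)]
        while stack:
            r, c = stack.pop()
            if (r, c) in seen:
                continue
            seen.add((r, c))
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if 0 <= nr < size and 0 <= nc < size:
                    if board[nr][nc] == 0:
                        return True              # empty neighbor = liberty
                    if board[nr][nc] == color:
                        stack.append((nr, nc))   # same-color stone: part of the group
        return False

A group left with no liberties after the opponent’s move is captured and removed from the board.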

Training Pipeline Overview

The system implements an iterative self-improvement loop:

  1. Self-Play Generation → Collect training data
  2. Neural Network Training → Improve model
  3. Arena Evaluation → Validate improvements
  4. Model Update → Deploy better models

For detailed training process information, see Training Process.
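
The control flow of one iteration can be pictured as below; the callables passed in stand for the overseer’s actual self-play, training, and arena routines, and the promotion threshold is illustrative:

    def training_iteration(best_model, self_play, train, arena, example_buffer,
                           promotion_threshold=0.55):
        """One pass of the self-improvement loop."""
        example_buffer.extend(self_play(best_model))      # 1. self-play generation
        candidate = train(best_model, example_buffer)     # 2. neural network training
        win_rate = arena(candidate, best_model)           # 3. arena evaluation
        if win_rate > promotion_threshold:                # 4. model update
            best_model = candidate
        return best_model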

Custom Communication Protocol

Workers communicate with the overseer across nodes via:

  • Model Distribution: Pickled PyTorch models
  • Game Collection: Serialized training examples
  • Status Updates: JSON progress reports
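
A minimal sketch of how a pickled payload (a model state or a batch of examples) could be framed and sent over a socket. The length-prefixed framing and helper names here are assumptions for illustration, not the repository’s protocol code:

    import pickle
    import struct

    def recv_exact(sock, n):
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("socket closed")
            buf += chunk
        return buf

    def send_payload(sock, obj):
        data = pickle.dumps(obj)                      # e.g. model.state_dict() or training examples
        sock.sendall(struct.pack(">I", len(data)))    # 4-byte length prefix
        sock.sendall(data)

    def recv_payload(sock):
        (length,) = struct.unpack(">I", recv_exact(sock, 4))
        return pickle.loads(recv_exact(sock, length))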

Key Design Decisions

1. Distributed Architecture

  • Scalability: Add workers dynamically
  • Fault Tolerance: Workers can disconnect/reconnect

2. Memory Management

  • Example Buffer: Fixed-size training history
  • Tree Reuse: MCTS tree persistence between moves
  • Batch Processing: Efficient GPU utilization
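
For instance, a fixed-size example buffer can be kept with a bounded deque so that the oldest games fall out automatically (a sketch with an illustrative capacity, not the overseer’s actual buffer):

    from collections import deque

    MAX_EXAMPLES = 200_000                         # illustrative capacity
    example_buffer = deque(maxlen=MAX_EXAMPLES)    # oldest examples are dropped automatically

    def add_examples(new_examples):
        # each example: (input_stack, policy_target, value_target)
        example_buffer.extend(new_examples)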

3. Model Versioning

  • Checkpointing: Regular model saves
  • Best Model Tracking: Automatic selection
  • Rollback Support: Previous model recovery
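
A common PyTorch checkpointing pattern for this kind of versioning and rollback (the file path and dictionary keys are illustrative):

    import torch

    def save_checkpoint(model, optimizer, iteration, path="checkpoint.pth"):
        torch.save({"iteration": iteration,
                    "model_state": model.state_dict(),
                    "optimizer_state": optimizer.state_dict()}, path)

    def load_checkpoint(model, optimizer, path="checkpoint.pth"):
        ckpt = torch.load(path, map_location="cpu")
        model.load_state_dict(ckpt["model_state"])
        optimizer.load_state_dict(ckpt["optimizer_state"])
        return ckpt["iteration"]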

Next Steps