Budget Alpha Zero trained to play chess in Python

Hengbin Fang
Mar 26, 2023


“An AI is only as good as its data”

Alpha Zero breaks this rule. And I tried to do the same. Let me explain.

What I think Alpha Zero is

Stockfish is a chess bot that took 10 years of manual engineering to build. Alpha Zero demolished it after just 4 hours of training WITHOUT human data.

As a chess addict, that got me interested, so I tried to make it in Python even knowing that the original used:

  • 5,000 TPUs for self-play (TPU = Your math teacher on steroids)
  • 64 TPUs for training
  • All simultaneously ran for 24 hrs

Personally, that would devastate my wallet 💀. So what do I mean when I say “$8 GPU”? Let me explain:

While training, I quickly ran out of free resources on Colab. To deal with this, I looked up cheap GPUs I could rent.

Then I came across Paperspace Gradient, which offers the following for $8/month:

  • 6H session limit
  • RTX4000 GPU
  • P5000 GPU
  • RTX5000 GPU
  • A4000 GPU (Best one here)

I was convinced by the big numbers. I was originally going to replicate this GitHub repo, but reinforcement learning was far too slow for the GPU I’m using.

This is my 1v1 against the bot:

Prepping the dataset

By finding games played by professionals, we can try to make the AI mimic them. That’s where the CCRL dataset comes in.

The first step to training an AI is to pre-process the data. For this project, that’s fairly simple.

Each file has 8,000 chess games in it. The first step is to split them all up so the AI can access each game individually (a sketch of the split follows the example PGN below).

( What 1 chess PGN (game) looks like. ) 

[Event "CCRL 40/40"]
[Site "CCRL"]
[Date "2014.08.28"]
[Round "418.2.300"]
[White "Stockfish 5 64-bit"]
[Black "Sting SF 4.7"]
[Result "1-0"]

1. d4 f5 2. Bf4 Nf6 3. e3 g6 4. h4 h6 5. c4 d6 6. Nc3 Bg7 7. Nf3 Nbd7 8. Qc2 c5 9. h5 g5 10. Bg3 O-O 11. Nxg5 hxg5 12. h6 Bxh6 13. Rxh6 Kg7 14. Rh1 cxd4 15. exd4 f4 16. Bh2 Rh8 17. O-O-O Nb8 18. f3 Nc6 19. Qf2 Bf5 20. g3 fxg3 21. Qxg3 Rh5 22. d5 Nb4 23. Qg2 Kf7 24. a3 Nc2 25. Bg1 Rxh1 26. Qxh1 Qa5 27. Kd2 b5 28. Bd3 Bxd3 29. Kxd3 b4 30. Ne4 Qa4 31. Nxf6 bxa3 32. Ne4 Nb4+ 33. Ke2 Qc2+ 34. Rd2 Qxc4+ 35. Kf2 a2 36. Qh5+ Kg7 37. Nxg5 a1=Q 38. Qh7+ Kf6 39. Ne4+ Qxe4 40. Qxe4 Nd3+ 41. Qxd3 Rg8 42. Bh2 Qh1 1-0
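
Splitting can be done with python-chess. Here’s a minimal sketch of that step; the file names and output layout are my own assumptions, not the exact script I ran:

# Minimal sketch: split one multi-game .pgn file into one file per game.
# File names and directory layout are assumptions for illustration.
import os
import chess.pgn

def split_pgn( big_pgn_path, out_dir ):
    os.makedirs( out_dir, exist_ok=True )
    with open( big_pgn_path ) as pgn_fh:
        idx = 0
        while True:
            game = chess.pgn.read_game( pgn_fh ) # Returns None at the end of the file
            if game is None:
                break
            with open( os.path.join( out_dir, f'{idx}.pgn' ), 'w' ) as out_fh:
                print( game, file=out_fh )       # Writes the headers + movetext
            idx += 1

split_pgn( 'CCRL-4040.pgn', 'ccrl_games' )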

16 hours later… (2.5 million .pgn files)

Oh wow, that took longer than I thought.

Alright, now how does the AI mimic the professionals?

It does that by evaluating 2 things (assigning each a number, where higher is better):

  • Its own position, trained by predicting who it thinks is winning and comparing that with the actual winner of the game.
  • All the moves in the current position, trained by predicting the “good moves” and comparing them with the actual move made.

These are the steps the code takes to prepare the data for that:

  1. Select a random game from the dataset.
  2. Select a random move within the game.
  3. Play through the game’s moves until the randomly chosen move is reached.
  4. Encode these into numbers:

  • Board position
  • The next move from the current position
  • Winner of the game
  • All the legal moves of the current position

  5. Ready for use in training.

# What it looks like in code, for the process I mentioned above.

# Imports this snippet relies on (encoder is the repo's board/move encoder module;
# tolist is a small helper defined alongside this class that turns mainline_moves() into a list)
import os
import numpy as np
import torch
import chess.pgn
from torch.utils.data import Dataset
import encoder

class CCRLDataset( Dataset ):
    def __init__( self, ccrl_dir ):
        self.ccrl_dir = ccrl_dir
        self.pgn_file_names = os.listdir( ccrl_dir ) # Load dataset

    def __len__( self ):
        return len( self.pgn_file_names ) # Dataset length

    def __getitem__( self, idx ):
        """
        Load the game in idx.pgn
        Get a random position, the move made from it, and the winner
        Encode these as numpy arrays

        Args:
            idx (int) the index into the dataset.

        Returns:
            position (torch.Tensor (16, 8, 8) float32) the encoded position
            policy (torch.Tensor (1) long) the target move's index
            value (torch.Tensor (1) float) the encoded winner of the game
            mask (torch.Tensor (72, 8, 8) int) the legal move mask
        """
        pgn_file_name = self.pgn_file_names[ idx ] # Get PGN file name from index
        pgn_file_name = os.path.join( self.ccrl_dir, pgn_file_name ) # Full file name
        with open( pgn_file_name ) as pgn_fh:
            game = chess.pgn.read_game( pgn_fh ) # Read PGN file

        moves = tolist( game.mainline_moves() ) # Puts all the moves into a list. E.g. [Move.from_uci('e2e4'), ..., Move.from_uci('c4c8')]
        randIdx = int( np.random.random() * ( len( moves ) - 1 ) ) # Gets a random move's index from the moves list
        board = game.board()

        for idx, move in enumerate( moves ): # Plays through the game's moves UNTIL randIdx has been reached
            board.push( move )
            if randIdx == idx: # Checks if the randomly selected move is the current move
                next_move = moves[ idx + 1 ] # Defines the move AFTER randIdx
                break

        winner = encoder.parseResult( game.headers['Result'] ) # Gets winner of the SELECTED game

        position, policy, value, mask = encoder.encodeTrainingPoint( board, next_move, winner )
        """
        position = the encoded position that the AI can understand -> encodePosition(board)
        policy = the index of the move that was actually played
        value = winner of the game
        mask = a mask containing all legal moves
        """

        return { 'position': torch.from_numpy( position ),
                 'policy': torch.Tensor( [policy] ).type( dtype=torch.long ),
                 'value': torch.Tensor( [value] ),
                 'mask': torch.from_numpy( mask ) }
# The AI is now ready to use the dataset

1 epoch of budget alpha zero

The neural network evaluates different positions LIKE professionals. The playing comes later.

Here’s what 1 epoch looks like (a code sketch follows the list):

  1. Iterate through a batch of chess games
  2. Get the:
  • Encoded position of the board of the randomly chosen move
  • Index of next move played in the game
  • Winner of game

3. Forward pass the data through the model. (More on this later)

4. Backpropagation through the 2 networks; this is when the AI adjusts the weights and biases

5. Adam optimizer step

6. After every epoch, save the model
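
To make those steps concrete, here’s a rough sketch of what that loop could look like. The AlphaZeroNet constructor, the loss weighting, and the hyperparameters are my assumptions, not the exact training script:

# Rough sketch of one epoch (model constructor, loss weighting and
# hyperparameters are assumptions, not the exact training script).
import torch
from torch.utils.data import DataLoader

train_ds = CCRLDataset( 'ccrl_games' )                     # The dataset class from earlier
loader = DataLoader( train_ds, batch_size=256, shuffle=True )

model = AlphaZeroNet()                                     # Assumed constructor; the architecture is covered later
optimizer = torch.optim.Adam( model.parameters(), lr=1e-3 )
mse = torch.nn.MSELoss()                                   # Value head loss
ce = torch.nn.CrossEntropyLoss()                           # Policy head loss

for batch in loader:                                       # 1. Iterate through a batch
    position = batch['position']                           # 2. Encoded board, target move index, winner
    policy_target = batch['policy'].squeeze( 1 )
    value_target = batch['value']

    value_pred, policy_logits = model( position )          # 3. Forward pass (output order assumed)
    loss = mse( value_pred, value_target ) + ce( policy_logits, policy_target )

    optimizer.zero_grad()
    loss.backward()                                        # 4. Backpropagation through both networks
    optimizer.step()                                       # 5. Adam optimizer step

torch.save( model.state_dict(), 'budget_alphazero.pt' )    # 6. Save the model after every epoch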

But when training, how would the AI know where the pieces are?

Everything -> Numbers

Y’know how Machines only understand numbers? To convert a board into numbers, we “one-hot encode” it.

Imagine an empty chessboard as an 8×8 matrix filled with 0s, one matrix per piece type. For the white pawns, the squares they stand on become 1s, so their plane has a full row of 1s on the second rank and 0s everywhere else.

This is done for every piece type for both black & white, giving 12 planes. For castling, each of the 4 castling rights gets a plane that is set to all 1s, so that’s another 4 (16 planes total).
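
To make that concrete, here’s a simplified sketch of how a board could be turned into those 16 planes with python-chess. The plane ordering here is my own assumption; the real version lives in encoder.py as encodePosition(board):

# Simplified sketch of one-hot encoding a board into a (16, 8, 8) array.
# The plane ordering is an assumption for illustration.
import numpy as np
import chess

def encode_position( board ):
    planes = np.zeros( (16, 8, 8), dtype=np.float32 )
    piece_order = [ chess.PAWN, chess.KNIGHT, chess.BISHOP,
                    chess.ROOK, chess.QUEEN, chess.KING ]
    for square, piece in board.piece_map().items():
        plane = piece_order.index( piece.piece_type )
        if piece.color == chess.BLACK:
            plane += 6                            # Planes 0-5 white, 6-11 black
        planes[ plane, chess.square_rank( square ), chess.square_file( square ) ] = 1.
    # Planes 12-15: one all-ones plane per remaining castling right
    if board.has_kingside_castling_rights( chess.WHITE ):  planes[12] = 1.
    if board.has_queenside_castling_rights( chess.WHITE ): planes[13] = 1.
    if board.has_kingside_castling_rights( chess.BLACK ):  planes[14] = 1.
    if board.has_queenside_castling_rights( chess.BLACK ): planes[15] = 1.
    return planes

print( encode_position( chess.Board() )[0] ) # White pawn plane: a full row of 1s on the second rank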

Alright, but how does the AI know where they’re going?

The AI encodes the legal moves by:

  1. Making an array of 72 boards of 0s, each representing a certain direction and distance. Queen-like (rook + bishop) directions make up 64, knight jumps the other 8.

For example:

Array Index: 18
Direction: Rook Forward | Distance: 2
[0 0 0 0 0 0 0 0]
[1 1 1 1 1 1 1 1] <- Pawns, won't matter which piece though
[0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0] <- Target location
[0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0]

(Castling is just encoded as the king moving two squares left or right.)

2. Loop through legal moves.

3. Put a 1 at the move's starting square, in the plane for its direction and distance (see the sketch after this list).

Inside the Neural network:

4. Flatten into a 1D array

5. Multiply this with the policy (move probabilities); that way only the legal moves will stay.
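
Here’s a simplified sketch of how such a mask could be built with python-chess. The exact plane numbering is my assumption; the real mapping lives in encoder.py:

# Simplified sketch of building the (72, 8, 8) legal move mask.
# The direction/plane numbering is an assumption; the real mapping is in encoder.py.
import numpy as np
import chess

QUEEN_DIRS  = [ (1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1) ]
KNIGHT_DIRS = [ (2, 1), (1, 2), (-1, 2), (-2, 1), (-2, -1), (-1, -2), (1, -2), (2, -1) ]

def move_to_plane( move ):
    dr = chess.square_rank( move.to_square ) - chess.square_rank( move.from_square )
    df = chess.square_file( move.to_square ) - chess.square_file( move.from_square )
    if (dr, df) in KNIGHT_DIRS:                   # Planes 64-71: the 8 knight jumps
        return 64 + KNIGHT_DIRS.index( (dr, df) )
    distance = max( abs( dr ), abs( df ) )        # 1-7 squares travelled
    direction = QUEEN_DIRS.index( (dr // distance, df // distance) )
    return direction * 8 + distance               # Planes 0-63: direction x distance

def legal_move_mask( board ):
    mask = np.zeros( (72, 8, 8), dtype=np.int32 )
    for move in board.legal_moves:                # 2. Loop through legal moves
        # 3. Put a 1 at the move's starting square, in the plane for its direction + distance
        mask[ move_to_plane( move ),
              chess.square_rank( move.from_square ),
              chess.square_file( move.from_square ) ] = 1
    return mask

print( legal_move_mask( chess.Board() ).sum() ) # 20 legal moves in the starting position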

How the data goes through the network

The model architecture is about the same as Alpha Zero’s. The data goes through it in this order (a skeleton sketch follows the list):

  1. 1 Convolution Block
  2. 20 Residual Blocks
  3. Policy & value heads (two separate outputs)
  4. Training: MSE & cross entropy loss
  5. Playing: mask out the illegal moves and softmax the rest
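
As a rough sketch, the whole thing could be wired up like this. It reuses the ConvBlock and ResidualBlock classes covered in the next sections, and the filter count and head sizes are my assumptions, not the exact AlphaZeroNetwork.py code:

# Rough skeleton of how the data flows through the network.
# Filter counts and head sizes are assumptions for illustration.
import torch
import torch.nn as nn

class BudgetAlphaZeroNet( nn.Module ):
    def __init__( self, num_filters=256, num_blocks=20 ):
        super().__init__()
        self.convBlock = ConvBlock( 16, num_filters )        # 1. Convolution block
        self.residualBlocks = nn.ModuleList(
            [ ResidualBlock( num_filters ) for _ in range( num_blocks ) ] ) # 2. 20 residual blocks
        # 3. Value head: one number per position, squashed to [-1, 1] by tanh
        self.valueHead = nn.Sequential(
            nn.Conv2d( num_filters, 1, kernel_size=1 ), nn.Flatten(),
            nn.Linear( 8 * 8, 1 ), nn.Tanh() )
        # 3. Policy head: one logit per (plane, square), i.e. 72 * 8 * 8 possible moves
        self.policyHead = nn.Sequential(
            nn.Conv2d( num_filters, 72, kernel_size=1 ), nn.Flatten() )

    def forward( self, x ):
        x = self.convBlock( x )
        for block in self.residualBlocks:
            x = block( x )
        return self.valueHead( x ), self.policyHead( x )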

Policy Network

Maps out a probability distribution over the legal moves, i.e. the likelihood of playing each one.

[0. 0. 0. ... 0. 0. 0.] # Output of policy network. Too big to be seen.

The shape of the output is the same as the legal move mask. Let’s print out something visible.

# This is the [18*8*8] to [19*8*8] slice of the policy network's output,
# i.e. plane 18: the second-rank pawns from the earlier example.
# The numbers shown are the probability of playing each move; 0 means illegal.

[0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 6.0451668e-05 6.1435247e-04
2.4059096e-01 1.7351016e-01 3.7569043e-01 7.9157567e-03 4.4767123e-05
1.0381683e-04 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00]

Value Network

Predicts the value of the current position, and outputs a value from -1 to 1 using the tanh activation.

Convolution block

Consists of a:

  • Convolution Layer
  • Batch Normalization
  • ReLU

And gets the data ready for the residual blocks.
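
A minimal sketch of such a block (the kernel size and padding are assumptions):

# Minimal sketch of the convolution block (kernel size and padding are assumptions).
import torch.nn as nn

class ConvBlock( nn.Module ):
    def __init__( self, input_channels, num_filters ):
        super().__init__()
        self.conv1 = nn.Conv2d( input_channels, num_filters, kernel_size=3, padding=1 ) # Convolution layer
        self.bn1 = nn.BatchNorm2d( num_filters )                                        # Batch normalization
        self.relu1 = nn.ReLU()                                                          # ReLU

    def forward( self, x ):
        return self.relu1( self.bn1( self.conv1( x ) ) )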

Residual blocks

When models are too deep, they encounter the vanishing/exploding gradient problem.

It’s when the gradients used to update the model’s parameters get way too small or too big because of the enormous number of layers they travel through. But chess is a complex game; the network needs to be deep.

Residual blocks use “skip connections” to circumvent this:

It adds a shortcut connection between the input and output of the block.

class ResidualBlock( nn.Module ):
    def __init__( self, num_filters ):
        ...

    def __call__( self, x ):
        """
        Args:
            x (torch.Tensor) the tensor to apply the layers to.
        """
        residual = x
        x = self.conv1( x ) # Convolutional layer
        x = self.bn1( x )   # Batch normalization
        x = self.relu1( x ) # ReLU
        x = self.conv2( x )
        x = self.bn2( x )
        x += residual       # <- Skip connection
        x = self.relu2( x )
        return x

Mean squared error loss (Value Network when training)

The MSE loss is used to compare the predicted value of a game position with the actual outcome of the game. This loss is commonly used for regression (predicting numbers.)

MSELoss averages (the squared (differences (between the actual and predicted values.)))

def mse_loss(y_pred, y_true):
    squared_error = (y_pred - y_true) ** 2
    sum_squared_error = torch.sum(squared_error)
    loss = sum_squared_error / len(y_true)
    return loss

real = torch.Tensor([1])   # Real value: whoever won the game. -1/0/1
pred = torch.Tensor([.9])  # Alpha Zero's predicted Q value
# The closer these 2 values align, the lower the loss.
print(mse_loss(pred, real))       # -> 0.0100
print(mse_loss(pred - 1.9, real)) # -> 4

Cross Entropy (Policy network while training)

Measures the difference between the predicted move probabilities and the move that was actually played (the target).
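
A small sketch of how that could be computed for one position (the logits and the target index are made up for illustration):

# Sketch: cross entropy between the predicted move logits and the move actually played.
import torch
import torch.nn.functional as F

policy_logits = torch.randn( 1, 72 * 8 * 8 ) # Raw policy output for one position (made up)
target_move = torch.tensor( [1178] )         # Index of the move the professional played (made up)

loss = F.cross_entropy( policy_logits, target_move )
print( loss ) # Lower when the network puts more probability on the played move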

Soft Max (Policy Network while playing)

Finally, turning all the legal moves into probabilities of doing them.

# Inside AlphaZeroNetwork.py

policyMask = policyMask.view( policyMask.shape[0], -1 ) # Flattening legal moves
policy_exp = torch.exp( policy ) # Softmax stuff
policy_exp *= policyMask.type( torch.float32 ) # Filtering only the legal moves
policy_exp_sum = torch.sum( policy_exp, dim=1, keepdim=True ) # Softmax stuff
policy_softmax = policy_exp / policy_exp_sum # Softmax stuff
# policy_softmax is the softmax output
# will later be converted for easier access.

You might say, yeah it can evaluate positions now, so what? How does it play?

The Monte Carlo Tree Search

Imagine this is you:

You’re walking down a path to reach the end of a maze.

Your friend Stockfish decides to brute-force search every single path up to a certain point.

But you have a brain, and decide to ONLY go down the promising routes based on your current position. That’s basically the Monte Carlo Tree Search.

While traditional engines use brute-force search with handcrafted evaluation functions to score leaf nodes, Alpha Zero uses a smarter method, expanding only the promising nodes. But how?

Choosing which node to expand

During the selection phase of the MCTS, a crucial algorithm is needed to calculate which node to expand.

It’s called the Upper Confidence Bound applied to Trees (UCT).

Which is:

UCT = Q + C * sqrt(log(parent_visits) / node_visits)

Where:

Q = Value of the Node

C = a constant that balances exploration and exploitation

def calcUCT( edge, N_p ):
    """
    Calculate the UCT formula.
    Args:
        edge (Edge) the edge which the UCT formula is for
        N_p (float) the parent visit count
    Returns:
        (float) the calculated value
    """
    Q = edge.getQ()   # Value of the node
    N_c = edge.getN() # Visit count of the node, if expanded
    P = edge.getP()   # The probability of selecting this node according to the neural net
    C = 1.5           # Controls the balance between exploration and exploitation
    # Exploitation: favouring nodes with a high Q value
    # Exploration: favouring nodes with a low visit count
    UCT = Q + P * C * math.sqrt( N_p ) / ( 1 + N_c ) # The PUCT variant Alpha Zero uses: the prior P weights the exploration term
    assert not math.isnan( UCT ), 'Q {} N_c {} P {}'.format( Q, N_c, P )

    return UCT

board = chess.Board()
root = MCTS.Root(board, alphaZeroNet)

for edge in root.edges:
print(round(calcUCT(edge, root.N), 5), edge.getMove())

0.00016 g1h3
0.19131 g1f3
...
0.56352 e2e4 # King's Pawn Game <- AI likes this move the most
0.26023 d2d4 # Queen's Pawn Game
0.36094 c2c4 # English Opening
0.00092 b2b4

In short, High UCT = probably a good move.

But wait, there’s still more to the MCTS.

The 3 phases of Monte Carlo Tree Search

1. Selection

The algorithm chooses the move with the highest UCT. This keeps going until a leaf node is reached. (A leaf node is a node with nothing below it.)

# Inside MCTS.py | Root
def selectTask( self, board, node_path, edge_path ):
    cNode = self # The root is a node itself
    while True:
        node_path.append( cNode )
        cEdge = cNode.UCTSelect() # Select the move with the highest UCT
        edge_path.append( cEdge )
        if cEdge == None: # Exit if it's a terminal node | checkmate / draw
            assert cNode.isTerminal()
            break
        cEdge.addVirtualLoss()
        board.push( cEdge.getMove() ) # Do the move
        if not cEdge.has_child(): # Leaf node detected
            break
        cNode = cEdge.getChild() # Traversing the tree

# Inside __main__.py

board = chess.Board() # Chess board
root = Root(board, alphaZeroNet) # Initiating the first node
node_path = [] # List of nodes traversed
edge_path = [] # List of edges traversed

def selection(board, root):
    with torch.no_grad(): # For faster calculations
        root.selectTask(board, node_path, edge_path)

selection(board, root)

print(node_path, edge_path)

Output:
[<__main__.Root object at 0x7f5511d17820>]
[<__main__.Edge object at 0x7f5511cefa30>]

2. Expand and Evaluate

If the node isn't a checkmate/draw, the new board position (together with its legal moves) gets evaluated by the network, and those values get assigned to a new child node.

# Inside MCTS.py | Edge
def expand( self, board, new_Q, move_probabilities ):
    """
    Create the child node with the given board position.

    board = current chessboard
    new_Q = the probability of winning according to the AI
    move_probabilities = the move probabilities according to the AI

    Returns:
        Whether or not it was expanded.
    """

    if self.child == None:
        self.child = Node( board, new_Q, move_probabilities )
        return True
    else:
        return False

# __main__.py
def expand(edge_path, board):
    edge = edge_path[ -1 ] # The last edge traversed, leading to the leaf
    if edge != None:
        value, move_probabilities = encoder.callNeuralNetwork(board, alphaZeroNet) # EVALUATION
        # value = value of the current position | Q
        # move_probabilities = probability distribution across all legal moves
        new_Q = value / 2. + 0.5
        edge.expand( board, new_Q, move_probabilities ) # EXPANSION
        new_Q = 1. - new_Q # Probability of winning from the parent's point of view
    else: # Game ended
        winner = encoder.parseResult( board.result() )
        if not board.turn:
            winner *= -1
        new_Q = float( winner ) / 2. + 0.5
    return new_Q # Needed for backpropagation in step 3

new_Q = expand(edge_path, board)

3. Backpropagation

Once the Evaluation finishes, it goes back to the nodes it visited and changes their variables.

When going back, for each node visited:

  1. Visit count gets increased
  2. Node value gets increased based on the eval of the leaf node
# new_Q comes from step 2 (Expand and Evaluate)

last_node_idx = len( node_path ) - 1 # Start from the last node visited during selection
for i in range( last_node_idx, -1, -1 ): # E.g. [4, 3, 2, 1, 0]
    node = node_path[ i ] # Select node
    node.N += 1 # Add to the visit count
    if ( last_node_idx - i ) % 2 == 0: # Nodes with the same side to move as the leaf
        node.sum_Q += new_Q
    else: # Nodes with the opposite side to move
        node.sum_Q += 1. - new_Q

for edge in edge_path:
    if edge != None:
        edge.clearVirtualLoss() # Clear the virtual loss so the UCT calculation returns to normal

This is 1 full iteration of the tree search. Once the bot has done this a certain number of times (or hit the time limit), the move with the highest visit count gets played, because that's the move the AI found most promising to analyze.

To go down the tree faster, MCTS has to be run in parallel. But if you do that naively, all threads will simply take the exact same path, ruining the exploration.

Using Virtual Loss for parallel MCTS

You’ve probably seen this many times in the code. I’ll explain why it’s here.

To counteract the problem, a fake loss is introduced:

  1. For every node a thread visits, it adds 1 to its virtual loss.
  2. The higher the virtual loss, the lower that node's UCT value becomes.
  3. Once the thread reaches the final node, it clears the virtual loss on all the nodes it visited.

That way threads will be discouraged from taking the same path.

# These variables are affected by the Virtual Loss

def getN( self ):
    """
    Returns:
        (int) the child's visit count.
    """
    if self.has_child():
        return self.child.N + self.virtualLosses
    else:
        return 0. + self.virtualLosses

def getQ( self ):
    """
    Returns:
        (float) the child's Q value
    """
    if self.has_child():
        return 1. - ( ( self.child.sum_Q + self.virtualLosses ) / ( self.child.N + self.virtualLosses ) )
    else:
        return 0.

# The Higher the virtualLoss the smaller the UCT will be.

Playing against my budget Alpha Zero

There’s only 1 thing left to do.

White: AI | Black: Me | Check it out over here

I resigned at the end because I knew I was doomed.

My Rematch Video:

___

Credits to jackdawkins11 and to geochri for the AlphaZero replication.

Thanks for reading this article! See you next time.
