Implementing a Gomoku AI with the AlphaZero Algorithm

2025-02-18

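I suddenly felt like playing with RL and picked AlphaZero. I can't play Go, and it is hard to train anyway, so I went with a Gomoku version.

Note: what is shown here is basically the path I followed while learning from these references. If you want the final, most complete version, jump to it directly from the table of contents. Some of the code has since been optimized and improved, but I have kept the first drafts here anyway.

Game environment implementation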

# -*- coding: utf-8 -*-

import numpy as np
import pygame


class Board:
    """
    board: 0: available, 1: player1, 2: player2
    """
    board: np.ndarray

    current_player = 1
    last_move = -1
    move_count = 0

    def __init__(self, width=8, height=8, n_in_rows=4):
        self.width = width
        self.height = height
        self.n_in_rows = n_in_rows
        if width < self.n_in_rows or height < self.n_in_rows:
            raise ValueError("Board size must be greater than or equal to n_in_rows")

    def init_board(self, start_player=1):
        self.board = np.zeros((self.height, self.width))
        self.current_player = start_player
        self.last_move = -1  # reset too, otherwise game_over() would see the previous game's last move
        self.move_count = 0

    def position2move(self, pos):
        """

        :param pos: (y, x)
        :return: move x + w * y
        """
        assert 0 <= pos[1] < self.width and 0 <= pos[0] < self.height, f'Position {pos} out of bounds'
        return pos[1] + self.width * pos[0]

    def move2position(self, move: int):
        """
        :param move:
        :return: pos: (y, x)
        """
        assert 0 <= move < self.height * self.width, f'Move {move} out of bounds'
        return move // self.width, move % self.width

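    def game_over(self):
        """
        :return: (end, winner); winner is None while the game is running or on a tie
        """
        if self.last_move == -1:
            return False, None
        # the first player needs n_in_rows stones, i.e. at least 2 * n_in_rows - 1 moves in total
        if self.move_count < self.n_in_rows * 2 - 1:
            return False, None

        directions = [(0, 1), (1, 0), (1, 1), (1, -1)]
        last_pos = self.move2position(self.last_move)

        # the check uses opponent_player, i.e. it assumes current_player
        # was already switched after the last stone was placed
        for d in directions:
            count = 1  # the stone that was just placed
            # scan in the positive direction
            y, x = last_pos[0] + d[0], last_pos[1] + d[1]
            while 0 <= x < self.width and 0 <= y < self.height and self.board[y, x] == self.opponent_player:
                count += 1
                x += d[1]
                y += d[0]

            # scan in the negative direction
            y, x = last_pos[0] - d[0], last_pos[1] - d[1]
            while 0 <= x < self.width and 0 <= y < self.height and self.board[y, x] == self.opponent_player:
                count += 1
                x -= d[1]
                y -= d[0]

            if count >= self.n_in_rows:
                return True, self.opponent_player

        # test for a full board only after the win check, so that a winning
        # move on the last empty cell is not reported as a tie
        if self.move_count == self.width * self.height:
            return True, None
        return False, None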

    @property
    def state(self):
        state = np.zeros((4, self.height, self.width))
        state[0] = self.board == self.current_player
        state[1] = self.board == self.opponent_player
        if self.last_move != -1:
            last_pos = self.move2position(self.last_move)
            state[2, last_pos[0], last_pos[1]] = 1.0
        if self.move_count % 2 == 0:
            state[3][:, :] = 1.0
        # https://github.com/junxiaosong/AlphaZero_Gomoku/issues/33
        # original state[:, ::-1, :]
        return state

    @property
    def opponent_player(self):
        return 1 if (self.current_player == 2) else 2

    def get_available_moves(self):
        rows, columns = np.where(self.board == 0)
        return rows * self.width + columns
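
The MCTS code further down calls `state.perform_move(action)`, which is not part of the listing above. As a rough sketch (the class name `TinyBoard` and the method body are my assumptions, not the author's code), such a method would place the stone, record the move, and hand the turn to the other player, which is what `game_over` relies on when it compares cells against `opponent_player`:

```python
import numpy as np


class TinyBoard:
    """Hypothetical stand-in for Board, reduced to what perform_move needs."""

    def __init__(self, width=8, height=8):
        self.width = width
        self.height = height
        self.board = np.zeros((height, width))
        self.current_player = 1
        self.last_move = -1
        self.move_count = 0

    def move2position(self, move):
        # same encoding as Board: move = x + width * y
        return move // self.width, move % self.width

    def perform_move(self, move):
        y, x = self.move2position(move)
        assert self.board[y, x] == 0, f'Move {move} is already taken'
        self.board[y, x] = self.current_player
        self.last_move = move
        self.move_count += 1
        # switch sides: 1 -> 2, 2 -> 1
        self.current_player = 2 if self.current_player == 1 else 1


board = TinyBoard(width=8, height=8)
board.perform_move(10)  # row 1, column 2
print(board.board[1, 2], board.current_player, board.move_count)
```

After the call, the cell holds the stone of the player who moved and `current_player` already points at the other side.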

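Basic idea of the algorithm

Classic MCTS works like this: given a board position, MCTS runs N simulations. Each simulation has four main phases: selection, expansion, simulation, and backpropagation.

The Monte Carlo tree search process

Step 1 is selection. Starting from the root, repeatedly pick the "child most worth searching", usually the one with the highest score under the UCT rule (Upper Confidence bounds applied to Trees), until reaching a node that still has unexpanded children.
Step 2 is expansion. Under the node found above, add one child that has no history yet and initialize it.
Step 3 is simulation. Starting from that untried move, play the game out to the end with a simple policy such as a fast rollout policy, and obtain a win/loss result. The rollout policy is not very accurate, but it is fast, and that is the advantage here. A slower policy would give more accurate results, but fewer simulations would fit into the same time budget, so the resulting play would not necessarily be stronger and could even be weaker. This is also why usually only one rollout is run per leaf: more rollouts are more accurate but slower.
Step 4 is backpropagation. The final result is propagated back up the MCTS tree. Note that besides the nodes already on the path, the newly added node also receives its first win/loss record. ^[1]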
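
To make the selection step concrete, here is a minimal sketch of the classic UCB1-style score that UCT uses to pick a child. The function name `uct_score`, the exploration constant `c = sqrt(2)`, and the win/visit numbers are illustrative choices of mine, not taken from the code in this post (which uses the PUCT variant with a prior, via `c_puct`).

```python
import math


def uct_score(child_wins, child_visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: average result plus an exploration bonus."""
    if child_visits == 0:
        return math.inf  # unvisited children are tried first
    exploit = child_wins / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore


# (wins, visits) for three hypothetical children of one node
children = {'a': (12, 20), 'b': (5, 8), 'c': (1, 2)}
parent_visits = sum(visits for _, visits in children.values())
best = max(children, key=lambda k: uct_score(*children[k], parent_visits))
print(best)
```

With these numbers the rarely visited child `c` gets the highest score even though its win rate is the lowest, which is the exploration bonus doing its job.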

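AlphaZero replaces the simulation in step 3 with a neural network. My own understanding is that a trained network is more accurate than a single rollout while taking less time than many rollouts, so accuracy improves substantially without adding much time.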

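Implementing players based on MCTS

Classic MCTS in Python

MCTS code looks much the same everywhere (see ^[2] for the principles). I wrote this version modeled on @junxiaosong's code ^[3]; the key points to watch are in the comments.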

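from copy import deepcopy
from operator import itemgetter

import numpy as np

from game import Board


class TreeNode(object):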

    def __init__(self, parent, prior_p):
        self.parent = parent
        self.children = {}

        self.n_visits = 0
        self.Q = 0
        self.u = 0

        self.P = prior_p

    def expand(self, action_priors):
        """Expand tree by creating new children.
        :param action_priors: a list of tuples of actions and their prior probability
            according to the policy function.
        :return:
        """

        for action, prior_prob in action_priors:
            if action not in self.children:
                self.children[action] = TreeNode(self, prior_prob)

    def select(self, c_puct):
        """Select action among children that gives maximum action value Q plus bonus u(P).
        :return: A tuple of (action, next_node)
        """

        return max(self.children.items(), key=lambda act_node: act_node[1].get_value(c_puct))

    def update(self, leaf_value):
        """Update node values from leaf evaluation.
        :param leaf_value: the value of subtree evaluation from the current player's
            perspective.
        """

        # Count visit.
        self.n_visits += 1
        # Update Q, a running average of values for all visits.
        # (n * Q + v) / (n + 1) - Q = (v - Q) / (n + 1)
        self.Q += 1.0 * (leaf_value - self.Q) / self.n_visits

    def update_recursive(self, leaf_value):
        """Like a call to update(), but applied recursively for all ancestors."""

        # If it is not root, this node's parent should be updated first.
        if self.parent:
            self.parent.update_recursive(-leaf_value)
        self.update(leaf_value)

    def get_value(self, c_puct):
        """Calculate and return the value for this node.
        It is a combination of leaf evaluations Q, and this node's prior
        adjusted for its visit count, u.
        :param c_puct: a number in (0, inf) controlling the relative impact of
            value Q, and prior probability P, on this node's score.
        """

        self.u = (c_puct * self.P * np.sqrt(self.parent.n_visits) / (1 + self.n_visits))
        return self.Q + self.u

    def is_leaf(self):
        """Check if leaf node (i.e. no nodes below this have been expanded)."""

        return self.children == {}

    def is_root(self):
        return self.parent is None


class MCTSPure(object):
    def __init__(self, policy_value_fn, c_puct=5, n_playout=10000):
        """
        :param policy_value_fn: a function that takes in a board state and outputs
            a list of (action, probability) tuples and also a score in [-1, 1]
            (i.e. the expected value of the end game score from the current
            player's perspective) for the current player.
        :param c_puct: a number in (0, inf) that controls how quickly exploration
            converges to the maximum-value policy. A higher value means
            relying on the prior more.
        :param n_playout:
        """
        self.root = TreeNode(None, 1.0)
        self.policy_value_fn = policy_value_fn
        self.c_puct = c_puct
        self.n_playout = n_playout

    def playout(self, state: Board):
        """Run a single playout from the root to the leaf, getting a value at
        the leaf and propagating it back through its parents.
        State is modified in-place, so a copy must be provided.
        """
        node = self.root
        while not node.is_leaf():
            action, node = node.select(self.c_puct)
            state.perform_move(action)
        action_probs, _ = self.policy_value_fn(state)

        end, winner = state.game_over()
        if not end:
            node.expand(action_probs)

        # Evaluate the leaf node by random rollout
        leaf_value = self.evaluate_rollout(state)
        # Update value and visit count of nodes in this traversal.
        """
        Q:  Why is the negative leaf_value for update_recursive function?
        A:  We use the negative value of the state, this is because alternate levels 
            in the search tree are from the perspective of different players and 
            the Q-values are in fact used by the parent node in select stage.
        Reference https://github.com/junxiaosong/AlphaZero_Gomoku/issues/25
        """
        node.update_recursive(-leaf_value)

    def evaluate_rollout(self, state: Board, limit=1000):
        """Use the rollout policy to play until the end of the game,
        returning +1 if the current player wins, -1 if the opponent wins,
        and 0 if it is a tie."""
        assert limit >= 1, 'Limit must be > 0'
        player = state.current_player
        winner = None
        for i in range(limit):
            end, winner = state.game_over()
            if end:
                break
            action_probs = random_rollout_policy_fn(state)
            max_action = max(action_probs, key=itemgetter(1))[0]
            state.perform_move(max_action)
        else:
            # If no break from the loop, issue a warning.
            print("WARNING: rollout reached move limit")
        if winner is None:  # tie
            return 0
        else:
            return 1 if winner == player else -1

    def get_move(self, state):
        """Runs all playouts sequentially and returns the most visited action.
        state: the current game state
        :return: the selected action
        """
        for n in range(self.n_playout):
            state_copy = deepcopy(state)
            self.playout(state_copy)
        return max(self.root.children.items(), key=lambda act_node: act_node[1].n_visits)[0]

    def update_with_move(self, last_move):
        """Step forward in the tree, keeping everything we already know
        about the subtree.
        """
        if last_move in self.root.children:
            self.root = self.root.children[last_move]
            self.root.parent = None
        else:
            self.root = TreeNode(None, 1.0)
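
The sign flip in `update_recursive` is the part of this code that is easiest to get wrong, so here is a stripped-down sketch of just the backpropagation on a three-node chain. The `Node` class is a reduced copy of `TreeNode` written for this example, and the leaf value is made up.

```python
class Node:
    """Stripped-down TreeNode: only what backpropagation needs."""

    def __init__(self, parent=None):
        self.parent = parent
        self.n_visits = 0
        self.Q = 0.0

    def update_recursive(self, leaf_value):
        # the parent sees the result from the other player's side
        if self.parent:
            self.parent.update_recursive(-leaf_value)
        self.n_visits += 1
        self.Q += (leaf_value - self.Q) / self.n_visits


root = Node()
child = Node(root)        # reached by the root player's move
grandchild = Node(child)  # reached by the opponent's reply

# Suppose the player to move at `grandchild` (the root player) is judged to be winning.
leaf_value = 1.0
grandchild.update_recursive(-leaf_value)
print(root.Q, child.Q, grandchild.Q)
```

Each node's Q ends up expressed from the point of view of the player who chose to move into it: the opponent's reply (`grandchild`) looks bad for the opponent, while the root player's move (`child`) looks good for the root player.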

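MCTS for the AlphaZero algorithm

This class inherits directly from the pure MCTS above and overrides playout (one complete MCTS simulation: selection, expansion, simulation, backpropagation), using the neural network (policy_value_fn) in place of the simulation step. Because training needs the full move-probability distribution, get_move_probs is implemented as well.

On computing the probabilities ^[1]

Once the MCTS search has finished, the model can choose which branch to play at the root node s according to the following formula:

$$\pi(a \mid s) = \frac{N(s, a)^{1/\tau}}{\sum_b N(s, b)^{1/\tau}}$$

τ controls the degree of exploration and takes values in (0, 1]. The closer τ is to 1, the closer the sampling is to the raw visit-count distribution from MCTS; the closer τ is to 0, the closer it is to a greedy policy that picks the action with the largest visit count N. When τ is very small, computing N^(1/τ) directly can cause numerical problems. To avoid this, the action probabilities are computed by adding a tiny constant to N (1e-10 in this project), taking the natural log, multiplying by 1/τ, and then passing the result through a simplified softmax to turn it back into probabilities. Mathematically this is essentially equivalent to the original formula.

Subtracting the maximum before exponentiating in softmax keeps the exponentials from overflowing; the final result is unchanged.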

def softmax(x):
    probs = np.exp(x - np.max(x))
    probs /= np.sum(probs)
    return probs

class MCTSAlphaZero(MCTSPure):
    def playout(self, state: Board):
        """Run a single playout from the root to the leaf, getting a value at
        the leaf and propagating it back through its parents.
        State is modified in-place, so a copy must be provided.
        """
        node = self.root
        while not node.is_leaf():
            action, node = node.select(self.c_puct)
            state.perform_move(action)
        # Evaluate the leaf using a network which outputs a list of
        # (action, probability) tuples player and also a score v in [-1, 1]
        # for the current player.
        action_probs, leaf_value = self.policy_value_fn(state)
        end, winner = state.game_over()
        if not end:
            node.expand(action_probs)
        else:
            # for an end state, return the "true" leaf_value
            if winner is None:
                leaf_value = 0
            elif winner == state.current_player:
                leaf_value = 1
            else:
                leaf_value = -1
        # Update value and visit count of nodes in this traversal.
        node.update_recursive(-leaf_value)

    def get_move_probs(self, state, temperature=1e-3):
        """Run all playouts sequentially and return the available actions and
        their corresponding probabilities.
        state: the current game state
        temperature: temperature parameter in (0, 1] controls the level of exploration
        """
        for n in range(self.n_playout):
            state_copy = deepcopy(state)
            self.playout(state_copy)

        # calc the move probabilities based on visit counts at the root node
        act_visits = [(act, node.n_visits) for act, node in self.root.children.items()]
        acts, visits = zip(*act_visits)
        act_probs = softmax(1.0 / temperature * np.log(np.array(visits) + 1e-10))
        return acts, act_probs
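
As a quick numerical check of the log-then-softmax trick used in `get_move_probs`, the sketch below compares it with the direct form N^(1/τ) / Σ N^(1/τ). The visit counts are made-up numbers for illustration.

```python
import numpy as np


def softmax(x):
    probs = np.exp(x - np.max(x))
    probs /= np.sum(probs)
    return probs


visits = np.array([120.0, 60.0, 15.0, 5.0])  # hypothetical root visit counts

# at a moderate temperature both forms agree
tau = 0.5
direct = visits ** (1.0 / tau) / np.sum(visits ** (1.0 / tau))
via_log = softmax(1.0 / tau * np.log(visits + 1e-10))
print(np.allclose(direct, via_log))

# at tau = 1e-3 the direct power overflows, the log form does not
with np.errstate(over='ignore', invalid='ignore'):
    naive = visits ** (1.0 / 1e-3) / np.sum(visits ** (1.0 / 1e-3))
stable = softmax(1.0 / 1e-3 * np.log(visits + 1e-10))
print(np.isnan(naive).any(), stable.round(3))
```

At the default `temperature=1e-3` the stable form puts essentially all of the probability on the most visited move, which is why the non-self-play branch of `act` behaves almost greedily.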

Implementing players on top of the two MCTS variants

def random_rollout_policy_fn(board: Board):
    """a coarse, fast version of policy_fn used in the rollout phase."""
    # rollout randomly
    available_moves = board.get_available_moves()
    action_probs = np.random.rand(len(available_moves))
    return zip(available_moves, action_probs)


def random_policy_value_fn(board: Board):
    """a function that takes in a state and outputs a list of (action, probability)
    tuples and a score for the state"""
    # return uniform probabilities and 0 score for pure MCTSPure
    available_moves = board.get_available_moves()
    action_probs = np.ones(len(available_moves)) / len(available_moves)
    return zip(available_moves, action_probs), 0

class BaceMCTSPlayer(object):
    """AI player based on MCTS"""

    def __init__(self, mcts):
        self.player = None
        self.mcts = mcts

    def set_player(self, player: int):
        self.player = player

    def reset_player(self):
        self.mcts.update_with_move(-1)

    def act(self, board: Board):
        raise NotImplementedError()


class MCTSPlayer(BaceMCTSPlayer):
    """AI player based on MCTSPure"""

    def __init__(self, c_puct=5, n_playout=2000):
        super().__init__(MCTSPure(random_policy_value_fn, c_puct, n_playout))

    def act(self, board: Board):
        sensible_moves = board.get_available_moves()
        if len(sensible_moves) > 0:
            move = self.mcts.get_move(board)
            # https://github.com/junxiaosong/AlphaZero_Gomoku/issues/108
            self.mcts.update_with_move(-1)
            # self.mcts.update_with_move(move)
            return move
        else:
            print("WARNING: the board is full")


class MCTSAlphaZeroPlayer(BaceMCTSPlayer):
    """AI player based on MCTSAlphaZero"""

    def __init__(self, policy_value_function, c_puct=5, n_playout=2000, dirichlet=0.3, selfplay=False):
        super().__init__(MCTSAlphaZero(policy_value_function, c_puct, n_playout))
        self.dirichlet = dirichlet
        self.selfplay = selfplay

    def act(self, board, temperature=1e-3, return_prob=False):
        sensible_moves = board.get_available_moves()
        # the pi vector returned by MCTS as in the alphaGo Zero paper
        move_probs = np.zeros(board.width * board.height)
        if len(sensible_moves) > 0:
            acts, probs = self.mcts.get_move_probs(board, temperature)
            move_probs[list(acts)] = probs
            if self.selfplay:
                # add Dirichlet Noise for exploration (needed for
                # self-play training)
                move = np.random.choice(
                    acts,
                    p=0.75 * probs + 0.25 * np.random.dirichlet(self.dirichlet * np.ones(len(probs)))
                )
                # update the root node and reuse the search tree
                self.mcts.update_with_move(move)
            else:
                # with the default temperature=1e-3, it is almost equivalent
                # to choosing the move with the highest prob
                move = np.random.choice(acts, p=probs)
                # reset the root node
                self.mcts.update_with_move(-1)
                # location = board.move_to_location(move)
                # print("AI move: %d,%d\n" % (location[0], location[1]))

            if return_prob:
                return move, move_probs
            else:
                return move
        else:
            print("WARNING: the board is full")
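
The self-play branch of `act` mixes the search probabilities with Dirichlet noise before sampling. Below is a small sketch of just that mixing step. The probabilities and the fixed seed are assumptions for illustration; the 0.75/0.25 weights and alpha = 0.3 are the values used in the code above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

probs = np.array([0.90, 0.07, 0.02, 0.01])  # hypothetical MCTS move probabilities
alpha = 0.3  # same role as the `dirichlet` argument of MCTSAlphaZeroPlayer

noise = rng.dirichlet(alpha * np.ones(len(probs)))
mixed = 0.75 * probs + 0.25 * noise

# both inputs sum to 1, so the mixture is still a valid distribution
print(mixed.sum())

# every move keeps at least 75% of its searched probability
move = rng.choice(len(probs), p=mixed)
print(move, mixed.round(3))
```

Note that in `act` the noise only affects which move is sampled; the `move_probs` returned for training are the unmixed search probabilities.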

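Neural network implementation

The network I use here is not the same as @junxiaosong's. His network works very well for 4-in-a-row on a 6x6 board, but once the board gets larger its fitting capacity falls short and the results are not as good, so I added residual blocks (ResidualBlock).

For the theory, see ^[1]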

# -*- coding: utf-8 -*-
import os
import warnings
from pathlib import Path

import numpy as np
import paddle
from paddle import nn
from paddle.static import InputSpec

from game import Board


class ResidualBlock(nn.Layer):
    def __init__(self, channels):
        super(ResidualBlock, self).__init__()
        self.conv_block = nn.Sequential(
            nn.Conv2D(channels, channels, kernel_size=3, padding=1, bias_attr=False),
            nn.BatchNorm2D(channels),
            nn.ReLU(),
            nn.Conv2D(channels, channels, kernel_size=3, padding=1, bias_attr=False),
            nn.BatchNorm2D(channels)
        )

    def forward(self, x):
        residual = x  # shortcut branch
        out = self.conv_block(x)
        out += residual  # skip connection
        return nn.functional.relu(out)  # ReLU after the addition


class Net(nn.Layer):
    """
    https://github.com/junxiaosong/AlphaZero_Gomoku
    """

    def __init__(self, board_width, board_height, num_blocks=5):
        super().__init__()

        self.board_width = board_width
        self.board_height = board_height

        # common layers
        self.common_layers = nn.Sequential(
            nn.Conv2D(4, 128, kernel_size=3, stride=1, padding=2),
            nn.BatchNorm2D(128),
            *[ResidualBlock(128) for _ in range(num_blocks)]
        )

        # action policy layers
        self.act_layers = nn.Sequential(
            nn.Conv2D(128, 4, kernel_size=1, stride=1, padding=0),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(4 * (board_height + 2) * (board_width + 2), board_width * board_height),
            nn.LogSoftmax()
        )

        # action value layers
        self.val_layers = nn.Sequential(
            nn.Conv2D(128, 2, kernel_size=1, stride=1, padding=0),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(2 * (board_height + 2) * (board_width + 2), 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Tanh()
        )

    def forward(self, x):
        x = self.common_layers(x)
        x_act = self.act_layers(x)
        x_val = self.val_layers(x)
        return x_act, x_val


class PolicyValueNet(object):
    def __init__(self, board_width, board_height, model_file=None):
        self.board_width = board_width
        self.board_height = board_height
        self.model_file = model_file

        self.l2_const = 1e-4  # coef of l2 penalty

        self.net = Net(self.board_width, self.board_height)

        self.optimizer = paddle.optimizer.Adam(parameters=self.net.parameters(), weight_decay=self.l2_const,
                                               learning_rate=2e-3)

        if model_file is not None:
            # if not os.path.isfile(model_file):
            #     raise FileNotFoundError(f'{model_file} is not exist')
            net_params = paddle.load(model_file + '.pdparams')
            self.net.set_state_dict(net_params)
            opt_params = paddle.load(model_file + '.pdopt')
            self.optimizer.set_state_dict(opt_params)

    def policy_value(self, state: np.ndarray | list):
        """
        input: a batch of states
        output: a batch of action probabilities and state values
        """
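        # Inference should use the BatchNorm running statistics rather than
        # the statistics of the (often single-sample) batch, so switch to eval
        # mode here and restore the previous mode afterwards.
        was_training = self.net.training
        self.net.eval()
        with paddle.no_grad():
            log_act_probs, value = self.net(paddle.to_tensor(state, 'float32'))
            act_probs = log_act_probs.exp()
        if was_training:
            self.net.train()
        return act_probs.numpy(), value.numpy()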

    def policy_value_fn(self, board: Board):
        """
        input: board
        output: a list of (action, probability) tuples for each available
        action and the score of the board state
        """
        legal_positions = board.get_available_moves()
        current_state = np.ascontiguousarray(board.state.reshape(-1, 4, self.board_height, self.board_width))
        act_probs, value = self.policy_value(current_state)
        value = value[0][0]
        act_probs = zip(legal_positions, act_probs[0][legal_positions])
        return act_probs, value

    def train_step(self, state_batch, mcts_probs, winner_batch, lr):
        """perform a training step"""
        state_batch = paddle.to_tensor(state_batch, dtype='float32')
        mcts_probs = paddle.to_tensor(mcts_probs, dtype='float32')
        winner_batch = paddle.to_tensor(winner_batch, dtype='float32')

        # zero the parameter gradients
        self.optimizer.clear_grad()
        # set learning rate
        self.optimizer.set_lr(lr)

        # forward
        log_act_probs, value = self.net(state_batch)
        # define the loss = (z - v)^2 - pi^T * log(p) + c||theta||^2
        # Note: the L2 penalty is incorporated in optimizer
        value = paddle.reshape(x=value, shape=[-1])
        value_loss = nn.functional.mse_loss(value, winner_batch)
        policy_loss = -paddle.mean(paddle.sum(mcts_probs * log_act_probs, 1))
        loss = value_loss + policy_loss
        # backward and optimize
        loss.backward()
        self.optimizer.step()
        # calc policy entropy, for monitoring only
        entropy = -paddle.mean(
            paddle.sum(paddle.exp(log_act_probs) * log_act_probs, 1)
        )
        return loss.item(), entropy.item()

    def get_policy_param(self):
        return self.net.state_dict()

    def set_policy_param(self, param):
        self.net.set_state_dict(param)

    def save_model(self, path: Path | str):
        """ save model params to file """
        if isinstance(path, Path):
            path = str(path)
        net_params = self.get_policy_param()  # get model params
        paddle.save(net_params, path + '.pdparams')
        opt_params = self.optimizer.state_dict()
        paddle.save(opt_params, path + '.pdopt')

    def save_for_predict(self, path: Path | str):
        if isinstance(path, Path):
            path = str(path)
        # save inferencing format model
        paddle.jit.save(
            self.net, path, input_spec=[
                InputSpec(shape=[None, 4, self.board_height, self.board_width], dtype='float32')
            ]
        )
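
One detail in `Net` that is easy to miss: the first convolution uses `padding=2` with a 3x3 kernel, so the feature maps are two cells larger than the board in each dimension, which is why the `Linear` layers are sized with `board_height + 2` and `board_width + 2`. The helper below (`conv_out_size` is a name I made up for this sketch) applies the standard convolution output-size formula to check that:

```python
def conv_out_size(size, kernel_size, stride=1, padding=0):
    """Output length of a convolution along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


board_height, board_width = 8, 8

# first layer: kernel 3, padding 2 -> grows by 2
h = conv_out_size(board_height, kernel_size=3, padding=2)
w = conv_out_size(board_width, kernel_size=3, padding=2)

# residual blocks (kernel 3, padding 1) and the 1x1 heads keep the size
h_res = conv_out_size(h, kernel_size=3, padding=1)
h_head = conv_out_size(h_res, kernel_size=1, padding=0)

policy_in = 4 * h_head * w   # input features of the policy head's Linear layer
value_in = 2 * h_head * w    # input features of the value head's Linear layer
print(h, w, policy_in, value_in)
```

If the first convolution were changed to `padding=1`, the maps would stay at board size and both `Linear` input sizes would have to drop the `+ 2`.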

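Training pipeline

The overall idea: first collect data through self-play, then augment the data (rotations and mirror flips), then optimize the parameters by gradient descent, and repeat.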

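# -*- coding: utf-8 -*-
import logging
import os
import random
from collections import deque
from pathlib import Path

import numpy as np

from game import Board, GameGomuku
from mcts import MCTSPlayer, MCTSAlphaZeroPlayer
from policy_value_net import PolicyValueNet

MODEL_DIR = Path('./models')

# The logger setup is not shown in this post. The "[bold]" style tags in the
# messages below suggest a rich-style handler; this plain stdlib logger is only
# a stand-in so that the listing does not fail with a NameError.
logger = logging.getLogger(__name__)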


class TrainPipeline:
    def __init__(self, name, init_model=None, **kwargs):
        self.name = name

        # params of the board and the game
        self.board_width = kwargs.get('board_width', 8)
        self.board_height = kwargs.get('board_height', 8)
        self.n_in_rows = kwargs.get('n_in_rows', 4)
        self.board = Board(width=self.board_width,
                           height=self.board_height,
                           n_in_rows=self.n_in_rows)
        self.game = GameGomuku(self.board)

        # training params
        self.learning_rate = kwargs.get('learning_rate', 2e-3)
        self.lr_multiplier = kwargs.get('lr_multiplier', 1.0)  # adaptively adjust the learning rate based on KL
        self.temperature = kwargs.get('temperature', 1.0)
        self.n_playout = kwargs.get('n_playout', 400)
        self.c_puct = kwargs.get('c_puct', 5)
        self.buffer_size = kwargs.get('buffer_size', 10000)
        self.batch_size = kwargs.get('batch_size', 512)
        self.data_buffer = deque(maxlen=self.buffer_size)

        self.play_batch_size = 1
        self.epochs = kwargs.get('epochs', 5)  # num of train_steps for each update
        self.kl_targ = kwargs.get('kl_targ', 0.02)
        self.check_freq = kwargs.get('check_freq', 50)
        self.game_batch_num = kwargs.get('game_batch_num', 1500)
        self.dirichlet = kwargs.get('dirichlet', 0.3)

        self.best_win_ratio = 0.0
        # num of simulations used for the pure mcts, which is used as
        # the opponent to evaluate the trained policy
        self.pure_mcts_playout_num = kwargs.get('pure_mcts_playout_num', 1000)
        if init_model is not None:
            logger.info(f'Loading model from [underline]{init_model}[/]')
            # start training from an initial policy-value net
            self.policy_value_net = PolicyValueNet(self.board_width,
                                                   self.board_height,
                                                   model_file=init_model)
        else:
            # start training from a new policy-value net
            self.policy_value_net = PolicyValueNet(self.board_width,
                                                   self.board_height)
        self.mcts_player = MCTSAlphaZeroPlayer(self.policy_value_net.policy_value_fn,
                                               c_puct=self.c_puct,
                                               n_playout=self.n_playout,
                                               dirichlet=self.dirichlet,
                                               selfplay=True)

        self.episode_len = 0

    def get_equi_data(self, play_data):
        """augment the data set by rotation and flipping
        :param play_data: [(state, mcts_prob, winner_z), ..., ...]
        """
        extend_data = []
        for state, mcts_prob, winner in play_data:
            for i in ([0, 1, 2, 3] if (self.board_width == self.board_height) else [0, 2]):
                # rotate counterclockwise
                equi_state = np.rot90(state, i, axes=(1, 2))
                equi_mcts_prob = np.rot90(mcts_prob.reshape(self.board_height, self.board_width), i)
                extend_data.append((equi_state, equi_mcts_prob.flatten(), winner))
                # flip horizontally
                equi_state = equi_state[:, :, ::-1]
                equi_mcts_prob = np.fliplr(equi_mcts_prob)
                extend_data.append((equi_state, equi_mcts_prob.flatten(), winner))
        return extend_data

    def collect_selfplay_data(self, n_games=1):
        """collect self-play data for training"""
        for i in range(n_games):
            winner, play_data = self.game.start_self_play(self.mcts_player, temperature=self.temperature)
            play_data = list(play_data)[:]
            self.episode_len = len(play_data)
            # augment the data
            play_data = self.get_equi_data(play_data)
            self.data_buffer.extend(play_data)

    def policy_update(self):
        """update the policy-value net"""
        mini_batch = random.sample(self.data_buffer, self.batch_size)
        state_batch, mcts_probs_batch, winner_batch = zip(*mini_batch)
        old_probs, old_v = self.policy_value_net.policy_value(state_batch)
        for i in range(self.epochs):
            loss, entropy = self.policy_value_net.train_step(
                state_batch,
                mcts_probs_batch,
                winner_batch,
                self.learning_rate * self.lr_multiplier)
            new_probs, new_v = self.policy_value_net.policy_value(state_batch)
            kl = np.mean(np.sum(old_probs * (np.log(old_probs + 1e-10) - np.log(new_probs + 1e-10)), axis=1))
            if kl > self.kl_targ * 4:  # early stopping if D_KL diverges badly
                break
        # adaptively adjust the learning rate
        if kl > self.kl_targ * 2 and self.lr_multiplier > 0.1:
            self.lr_multiplier /= 1.5
        elif kl < self.kl_targ / 2 and self.lr_multiplier < 20:
            self.lr_multiplier *= 1.5

        explained_var_old = (1 -
                             np.var(np.array(winner_batch) - old_v.flatten()) /
                             np.var(np.array(winner_batch)))
        explained_var_new = (1 -
                             np.var(np.array(winner_batch) - new_v.flatten()) /
                             np.var(np.array(winner_batch)))
        logger.info(f'[bold]Train[/] Step '
                    f'kl: {kl:.5f}, '
                    f'lr_multiplier: {self.lr_multiplier:.3f}, '
                    f'loss: {loss}, '
                    f'entropy: {entropy}, '
                    f'explained_var_old: {explained_var_old:.3f}, '
                    f'explained_var_new: {explained_var_new:.3f}')

    def policy_evaluate(self, n_games=10):
        """
        Evaluate the trained policy by playing against the pure MCTS player
        Note: this is only for monitoring the progress of training
        """
        logger.info(f'[bold]Evaluating[/]')
        current_mcts_player = MCTSAlphaZeroPlayer(self.policy_value_net.policy_value_fn,
                                                  c_puct=self.c_puct,
                                                  n_playout=self.n_playout)
        pure_mcts_player = MCTSPlayer(c_puct=5,
                                      n_playout=self.pure_mcts_playout_num)
        win, lose, tie = 0, 0, 0
        for i in range(n_games):
            winner = self.game.start_play(current_mcts_player,
                                          pure_mcts_player,
                                          start_player=i % 2 + 1, )
            if winner == 1:
                logger.info(f'    {i + 1}/{n_games} [bold]Win[/]')
                win += 1
            elif winner == 2:
                logger.info(f'    {i + 1}/{n_games} [bold]Lose[/]')
                lose += 1
            else:
                logger.info(f'    {i + 1}/{n_games} [bold]Tie[/]')
                tie += 1
        win_ratio = 1.0 * (win + 0.5 * tie) / n_games
        logger.info(f"num_playouts:{self.pure_mcts_playout_num}, "
                    f"win: {win}, lose: {lose}, tie: {tie}")
        return win_ratio

    def run(self):
        """run the training pipeline"""
        try:
            logger.info('Start training.')
            for i in range(self.game_batch_num):
                logger.info(f'[bold]Selfplay[/] batch {i + 1}')
                self.collect_selfplay_data(self.play_batch_size)
                if len(self.data_buffer) > self.batch_size:
                    self.policy_update()
                # check the performance of the current model,
                # and save the model params
                if (i + 1) % self.check_freq == 0:
                    self.policy_value_net.save_model(
                        MODEL_DIR / self.name / 'train' / f'current_step-{i + 1}'
                    )
                    win_ratio = self.policy_evaluate()
                    if win_ratio > self.best_win_ratio:
                        logger.info("[bold]New best policy!!!!!!!!")
                        self.best_win_ratio = win_ratio
                        # update the best_policy
                        self.policy_value_net.save_model(
                            MODEL_DIR / self.name / 'train' / 'best'
                        )
                        self.policy_value_net.save_for_predict(
                            MODEL_DIR / self.name / 'production' / 'best'
                        )
                        if (self.best_win_ratio == 1.0 and
                                self.pure_mcts_playout_num < 5000):
                            self.pure_mcts_playout_num += 1000
                            self.best_win_ratio = 0.0
        except KeyboardInterrupt:
            logger.info('[bold red]quit.')

    @property
    def name_suffix(self):
        return f'{self.board_width}x{self.board_height}-{self.n_in_rows}'
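
The augmentation in `get_equi_data` only works if the state planes and the probability vector are transformed in exactly the same way. Here is a minimal sketch that checks this on a toy 3x3 position. The array contents and the helper name `augment` are mine, for illustration only, and it handles just the square-board case.

```python
import numpy as np


def augment(state, mcts_prob, height, width):
    """Return the 8 rotated/flipped copies of one (state, prob) sample."""
    out = []
    for i in range(4):
        # rotate the planes and the probability grid by the same angle
        s = np.rot90(state, i, axes=(1, 2))
        p = np.rot90(mcts_prob.reshape(height, width), i)
        out.append((s, p.flatten()))
        # mirror both left-to-right
        out.append((s[:, :, ::-1], np.fliplr(p).flatten()))
    return out


h = w = 3
state = np.zeros((4, h, w))
state[2, 0, 1] = 1.0            # "last move" plane: stone at row 0, column 1
prob = np.zeros(h * w)
prob[0 * w + 1] = 1.0           # put all probability mass on the same cell

samples = augment(state, prob, h, w)
# in every copy the marked cell of the plane and of the policy still coincide
aligned = all(np.array_equal(s[2].flatten(), p) for s, p in samples)
print(len(samples), aligned)
```

One self-play position therefore yields eight training samples on a square board, which is why the replay buffer fills up much faster than the number of moves played.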

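Speeding up training (C++ multithreading)

This part is the real focus of this post. During training, most of the time goes into generating self-play games, and this process makes poor use of the available compute. My first idea was to replace this part with C++. I did try it: with the MCTS search rewritten in C++, the time for 2000 simulations dropped from a bit over 8 seconds to a bit over 6 seconds, and further profiling showed that most of the simulation time is actually spent in neural network inference. So the gain was not great, which led me to multithreading. The final implementation generates several games concurrently and also runs MCTS in parallel, with reference to @hijkzzz ^[4].

~~I also wrote a Python multithreaded version, but after finishing it I remembered that Python has the GIL, which effectively confines the threads to a single CPU core, so that approach does nothing for performance~~

Below is the high-performance multithreaded version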

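It is worth noting that this code still has some problems:

There is a memory leak: memory usage keeps growing during training. I have not tracked down the cause and could not find it from reading the code alone.

Whether it can train a strong model is still in doubt, because so far I have not trained a model that performs well on an 11x11 board with 5 in a row.

Unlike @hijkzzz, I did not use SWIG; I used pybind11 instead.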

pybind11 environment setup

Environment setup gets its own section because pybind11 is easy to get wrong on Windows.

Option 1: Visual Studio

There is a detailed official tutorial ^[5]

Option 2: CLion + CMake

CMakeLists.txt

cmake_minimum_required(VERSION 3.29)
project(Gomuku LANGUAGES CXX)

set(CMAKE_CXX_STANDARD 20)

#set(PYTHON_EXECUTABLE "E:/Anaconda3/envs/paddle/python.exe")
set(pybind11_DIR "E:/Anaconda3/envs/paddle/Lib/site-packages/pybind11/share/cmake/pybind11")
set(PYBIND11_FINDPYTHON ON)

#find_package(Python3 COMPONENTS Interpreter Development)
message(DEBUG ${PYTHON_INCLUDE_PATH})
include_directories(${PYTHON_INCLUDE_PATH})
link_libraries(${Python3_LIBRARIES})

find_package(pybind11 REQUIRED)
include_directories("E:/Anaconda3/envs/paddle/Lib/site-packages/pybind11/include")


include_directories(src)
file(GLOB SOURCES "src/*.*")

# module
pybind11_add_module(
        library
        ${SOURCES}
)
#set_target_properties(library PROPERTIES MSVC_RUNTIME_LIBRARY "MultiThreaded$<$<CONFIG:Debug>:Debug>")
target_include_directories(
        library
        PRIVATE "src"
)