Abstract

  • Unlike Go, China Competitive Poker, also known as Dou dizhu (斗地主), is an imperfect-information game involving hidden information, randomness, and multi-agent cooperation and competition.

  • We introduce an approach to playing China Competitive Poker that uses a convolutional neural network (CNN) to predict actions. The network is trained by supervised learning on human game records.

1 Introduction

Unlike Go, China Competitive Poker (CCP) is an imperfect-information game that starts from a random state: a player is dealt different cards in each game and does not know how the remaining cards are distributed between the other two players. As in bridge, the Peasants must coordinate with each other; otherwise they can hardly beat the Landlord. These properties make CCP a good problem for game AI. We choose a CNN to play CCP for the following reasons:

  1. First, CNNs have achieved superhuman performance in perfect-information games.

  2. Second, CCP exhibits a semi-translational invariance. For example, consider two sets of cards in the same category but with different ranks, like “34567” and “45678” (see Section 3 for the terminology): adding one to each card’s rank turns “34567” into “45678”, which is the translational part. However, a player may play “45678” after another player has played “34567”, yet not the other way around; this one-directional legality is the reason for the “semi”. A minimal sketch follows this list.
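
The sketch below illustrates the property; the rank encoding and helper functions are illustrative assumptions, not part of DeepRocket.

```python
# A minimal sketch of semi-translational invariance in CCP straights, assuming
# a simple rank string (3 lowest, 2 highest; jokers cannot join straights).
RANKS = "3456789TJQKA2"  # 'T' stands for 10

def shift_straight(cards: str, delta: int = 1) -> str:
    """Shift every card's rank, e.g. "34567" -> "45678" (the invariant part)."""
    return "".join(RANKS[RANKS.index(c) + delta] for c in cards)

def beats(candidate: str, previous: str) -> bool:
    """A straight only beats one of the same length with a lower top rank."""
    return (len(candidate) == len(previous)
            and RANKS.index(candidate[-1]) > RANKS.index(previous[-1]))

print(shift_straight("34567"))   # "45678"
print(beats("45678", "34567"))   # True  - a legal response
print(beats("34567", "45678"))   # False - illegal when the order is swapped
```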

To our knowledge, no prior CCP AI has used a deep neural network, and it remains to be shown that a network can output a proper action from an imperfect-information input. CCP does share one difficulty with Go: when making a decision, it is very hard to evaluate the value of alternative actions. Two problems interest us in particular:

  1. As CCP involves teammates, one problem is to teach the network to cooperate, e.g., how to help a teammate whose cards are not good enough become the first to play out all of his cards.

  2. Another problem we are interested in is inference: whether the network can reason about hidden information the way human experts do.

DeepRocket, the system we developed, achieves state-of-the-art performance in CCP. We show that a network can learn cooperation and inference in an imperfect-information game.

3 Notations

The following terms are defined and used throughout: the Specified Player; Down Player; Up Player; Landlord (地主); Down Peasant (农民); Up Peasant; Public Cards; Round; Category; Main Group and Kicker Card; Set or Group; Active Mode; Passive Mode; Duplicate Mode.

4 DeepRocket Framework

The system consists of three parts:

  • the Bid Module: when a game starts, the Bid Module is called so that DeepRocket can output a bid score, if necessary.

  • the Policy Network: the Policy Network is called whenever DeepRocket needs to act; some of its output labels include a Kicker Card type (solo or pair).

  • the Kicker Network: if Kicker Cards need to be predicted, the Kicker Network is called, as shown in Figure 1 and Figure 2. A sketch of this control flow follows Figure 2.


Figure 1: DeepRocket game flow


Figure 2: The Policy Network and The Kicker Network work flow
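
The following is a minimal, runnable sketch of this control flow. The three stand-in functions are hypothetical placeholders for the modules above; they mimic only the interfaces, not the real predictors.

```python
def bid_module(hand):
    """Stand-in for the rule-based Bid Module: returns a bid score."""
    return 3

def policy_net(state):
    """Stand-in for the Policy Network: a Main Group plus a kicker spec."""
    return "333444", ("solo", 2)        # e.g. "333444" with two solo kickers

def kicker_net(state, chosen):
    """Stand-in for the Kicker Network: one kicker per call, no repeats."""
    for rank in "789TJQ":
        if rank not in chosen:
            return rank

def deeprocket_move(state, bidding_phase=False):
    if bidding_phase:
        return bid_module(state)        # Bid Module: only at the game's start
    main_group, (kind, count) = policy_net(state)
    kickers = ""
    for _ in range(count):              # Kicker Network: one call per kicker
        kickers += kicker_net(state, main_group + kickers)
    return main_group + kickers

print(deeprocket_move({"hand": "33344456789TJQKA2"}))   # "33344478"
```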

4.1 The Bid Module

First of all, after the players receive their cards, the game must determine the Landlord through bidding, so we designed a module for this. The key factor in bidding is the cards themselves. The Bid Module is rule-based: like human experts, its output depends on control cards (such as “A”, “2”, and the jokers) and tidiness (整齐度), i.e., how few sets the hand decomposes into. The module has been deployed on many commercial online platforms in China and has been proved to work well. Bidding is not used in Duplicate Mode, nor in our tests, so we do not introduce more details and focus on the other components of DeepRocket.
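
As a rough illustration of this kind of rule, here is a hedged sketch of a bid heuristic driven by control cards and tidiness. The thresholds, weights, and card encoding are invented for illustration; the production module’s actual rules are not published.

```python
CONTROLS = {"A", "2", "B", "R"}   # assumed encoding: 'B'/'R' = black/red joker

def tidiness(hand: str) -> int:
    """Crude proxy: fewer distinct ranks means fewer sets to play out."""
    return len(set(hand))

def bid(hand: str) -> int:
    # Illustrative weighting: reward controls, penalize an untidy hand.
    score = sum(c in CONTROLS for c in hand) - 0.3 * max(0, tidiness(hand) - 8)
    if score >= 4: return 3    # bid the maximum
    if score >= 3: return 2
    if score >= 2: return 1
    return 0                   # pass

print(bid("AA22BR33445566778"))   # strong hand with many controls -> 3
```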

4.2 The Policy Network


Figure 3: Example Game Record

If we want to learn the Down Peasant’s play, for example, the record is divided into state-action pairs.

The input to the Policy Network is a 15×19×21 three-dimensional binary tensor (we name the dimensions X, Y, and Z respectively). The X-axis represents the ranks of the cards, from 3 to the red joker. The Y-axis represents the number of cards of each rank (from 1 to 4) together with CCP features such as solo, pair, trio, etc. The Z-axis represents the sequential information of the rounds; this design, borrowed from AlphaGo, converts the variable-length history into a fixed-length input, and the most recent six rounds are kept. S_{i,j,k} denotes the binary feature of the current game state. Details are shown in Figure 4.


Figure 4: The meaning of Z-axis
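
A minimal sketch of building this input is shown below. The first Y rows are assumed to encode “holds at least n copies of the rank”, and plane 0 of the Z-axis is assumed to hold the current hand; the full layout of the remaining rows and planes follows Figure 4 and is not reproduced here.

```python
import numpy as np

RANKS = "3456789TJQKA2BR"   # X-axis: 15 ranks, 'B'/'R' = black/red joker

def encode_hand(hand: str) -> np.ndarray:
    """Fill one Z-plane of the 15x19x21 binary input with the current hand."""
    planes = np.zeros((15, 19, 21), dtype=np.float32)
    for rank in set(hand):
        n = hand.count(rank)                 # Y rows 0..3: at least 1..4 copies
        planes[RANKS.index(rank), :n, 0] = 1.0
    return planes

print(encode_hand("33445566778899TJQKA").shape)   # (15, 19, 21)
```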

After testing different models, 512 filters proved the most suitable, and a 10-layer CNN achieved the best performance; different strides are used in each layer.

With the Kicker Network added to DeepRocket, the Policy Network outputs 309 action probabilities; see Figure 5. It is noteworthy that a hand holds at most 20 cards, which restricts the longest possible category.


Figure 5: The output’s composition
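
A hedged PyTorch sketch of this architecture is given below: a 10-layer convolutional body with 512 filters and a 309-way head. Kernel sizes, the placement of the strided layers, and the pooling scheme are assumptions; the paper specifies only the depth, the filter count, the varying strides, and the output size. The Z-axis is treated as the channel dimension.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, in_planes=21, n_actions=309):
        super().__init__()
        layers, c = [], in_planes
        for i in range(10):                       # 10 convolutional layers
            stride = 2 if i in (3, 7) else 1      # assumed stride schedule
            layers += [nn.Conv2d(c, 512, kernel_size=3, stride=stride, padding=1),
                       nn.ReLU(inplace=True)]
            c = 512
        self.body = nn.Sequential(*layers)
        self.head = nn.Linear(512, n_actions)

    def forward(self, x):                         # x: (N, 21, 15, 19)
        h = self.body(x).mean(dim=(2, 3))         # global average pooling
        return self.head(h)                       # 309 action logits

print(PolicyNet()(torch.zeros(1, 21, 15, 19)).shape)   # torch.Size([1, 309])
```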

4.3 The Kicker Network

To keep the output space tractable, we add an extra network to predict the Kicker Cards. The Policy Network predicts the Main Group and the Kicker Card type, such as solo or pair, and the Kicker Network then predicts the concrete kicker cards. This lets the Policy Network output only 309 probabilities rather than thousands. For example, “333444555666789J” and “333444555666789Q” correspond to the same Policy Network output (the Main Group “333444555666” with solo kickers), while “789J” or “789Q” is predicted by the Kicker Network.
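
The factorization can be made concrete with a small, hypothetical label format (the paper’s actual encoding of the 309 labels is given in Figure 5):

```python
def to_policy_label(main_group: str, kicker_type: str, n_kickers: int) -> tuple:
    """Hypothetical label: Main Group plus kicker type and count, no kicker ranks."""
    return (main_group, kicker_type, n_kickers)

move_a = ("333444555666", "789J")   # airplane with solo kickers 7, 8, 9, J
move_b = ("333444555666", "789Q")   # same Main Group, different kickers

label_a = to_policy_label(move_a[0], "solo", len(move_a[1]))
label_b = to_policy_label(move_b[0], "solo", len(move_b[1]))
print(label_a == label_b)           # True: both moves share one of the 309 labels
```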

The input to the Kicker Network, which encodes the remaining cards and the output of the Policy Network, is a 15×9×3 three-dimensional binary tensor. The X-axis represents the card ranks, the same as in the Policy Network. Figure 6 shows the details of the Y-axis and Z-axis. The Kicker Network has 28 outputs: 15 kinds of solo and 13 kinds of pair.


Figure 6: The Kicker Network’s input channels
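
A minimal sketch of this encoding and output space is shown below. Channel 0 is assumed to hold the remaining cards and channel 1 the chosen Main Group; the exact Y/Z layout follows Figure 6 and is only approximated here.

```python
import numpy as np

RANKS = "3456789TJQKA2BR"   # 15 ranks; the jokers cannot form a pair

def encode_kicker_input(remaining: str, main_group: str) -> np.ndarray:
    """Fill an assumed 15x9x3 binary input for the Kicker Network."""
    planes = np.zeros((15, 9, 3), dtype=np.float32)
    for rank in set(remaining):                   # channel 0: remaining cards
        planes[RANKS.index(rank), :remaining.count(rank), 0] = 1.0
    for rank in set(main_group):                  # channel 1: chosen Main Group
        planes[RANKS.index(rank), :main_group.count(rank), 1] = 1.0
    return planes

# 28 output labels: 15 solo kickers (3 .. red joker) + 13 pair kickers (33 .. 22).
KICKER_LABELS = [("solo", r) for r in RANKS] + [("pair", r) for r in RANKS[:13]]
print(len(KICKER_LABELS))   # 28
```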

The Kicker Network consists of a 5-layer CNN and one fully connected layer, and outputs the probabilities of the Kicker Cards. It outputs one kicker per call: if the Policy Network predicts “333444” with 2 solo Kicker Cards, the Kicker Network is called twice.
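
Below is a hedged PyTorch sketch of this network and of the repeated calling; channel widths and kernel sizes are assumptions, and the masking of an already-chosen kicker before the second call is omitted.

```python
import torch
import torch.nn as nn

class KickerNet(nn.Module):
    def __init__(self, in_planes=3, n_outputs=28):
        super().__init__()
        layers, c = [], in_planes
        for _ in range(5):                        # 5 convolutional layers
            layers += [nn.Conv2d(c, 128, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            c = 128
        self.body = nn.Sequential(*layers)
        self.fc = nn.Linear(128, n_outputs)       # 1 fully connected layer

    def forward(self, x):                         # x: (N, 3, 15, 9)
        return self.fc(self.body(x).mean(dim=(2, 3)))   # 28 kicker logits

net = KickerNet()
x = torch.zeros(1, 3, 15, 9)
# "333444" with 2 solo kickers: one forward pass per kicker.
first, second = net(x).argmax(dim=1), net(x).argmax(dim=1)
print(first.shape, second.shape)                  # torch.Size([1]) each
```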