Garrick Fernandez

Deep Learning Dominion

Categories: Code
Date: December 2019

Benjamin Anderson and I developed this final project for a class on decision making and modeling uncertainty. I never thought that the card game he got hooked on (taking me along for the ride) years ago would come back and become our topic (taking me along for the ride again), but here we are!

We wanted to make an agent that could play Dominion. In Dominion, you start with a deck of small, weak cards and build up a more powerful deck over the course of the game in order to win. Cards serve different purposes: treasures allow you to acquire better and more powerful cards; victory cards contribute to your score at the end of the game; and actions let you do more during your turn, such as pruning your deck or attacking opponents.

To make a Dominion-playing agent, you first have to be able to describe the game to it, but it’s intractably difficult to model the full state space of the game at a given time step. For one thing, each player has their own deck, a hand drawn from that deck, the cards on the table available for purchase, and so on. Any game of Dominion has upwards of 20 unique card types with 10–60 cards per type, and certain piles of cards (kingdom cards) can be swapped out for others to make for different games and emergent strategies. The base game has 25 kingdom cards; further expansions bring that number to over 300.
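
To give a sense of what a compact encoding can look like, here is a minimal sketch of a count-based state representation. The card list, helper names, and choice of features are illustrative assumptions, not the exact encoding from our project.

    # Minimal sketch of a count-based state encoding (illustrative card list
    # and feature choices; not the exact representation we used).
    import numpy as np

    CARD_TYPES = ["Copper", "Silver", "Gold", "Estate", "Duchy", "Province",
                  "Smithy", "Village", "Market", "Militia"]
    CARD_INDEX = {name: i for i, name in enumerate(CARD_TYPES)}

    def encode_counts(cards):
        """Map a multiset of card names to a fixed-length count vector."""
        counts = np.zeros(len(CARD_TYPES), dtype=np.float32)
        for name in cards:
            counts[CARD_INDEX[name]] += 1
        return counts

    def encode_state(hand, deck, supply):
        """Concatenate count vectors for the hand, the deck, and the supply piles."""
        return np.concatenate([
            encode_counts(hand),
            encode_counts(deck),
            np.array([supply.get(c, 0) for c in CARD_TYPES], dtype=np.float32),
        ])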

Another issue is the large number of possible actions, from which the decision-making agent needs to choose the best one while filtering out irrelevant or suboptimal options. During the action phase you may play any action card from your hand, and during the buy phase you may buy any card you can afford. Playing actions changes the flow of the game, and in following the rules of an action card, any player may be prompted to make decisions to continue the game.
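
In code terms, the agent’s menu of choices on a given turn looks something like the sketch below; the Card, Pile, and State classes here are hypothetical stand-ins, not our game engine’s actual types.

    # Illustrative sketch of enumerating legal moves by phase. The data classes
    # are hypothetical stand-ins, not our engine's real API.
    from dataclasses import dataclass

    @dataclass
    class Card:
        name: str
        cost: int
        is_action: bool = False

    @dataclass
    class Pile:
        card: Card
        count: int

    @dataclass
    class State:
        phase: str          # "action" or "buy"
        hand: list
        supply: list
        coins: int = 0

    def legal_actions(state):
        """Enumerate the candidate moves the agent chooses among."""
        moves = []
        if state.phase == "action":
            # Any action card in hand may be played during the action phase.
            moves += [("play", card) for card in state.hand if card.is_action]
        elif state.phase == "buy":
            # Any non-empty pile the player can afford may be bought.
            moves += [("buy", pile.card) for pile in state.supply
                      if pile.count > 0 and pile.card.cost <= state.coins]
        moves.append(("pass", None))  # ending the phase is always allowed
        return moves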

To deal with the large state space and action space, we implemented SARSA with global approximation over a neural network. The input to the network is a compact encoding of state and action, and the output is an estimated Q value, representing the utility of taking that action in that state. As the network makes better and better guesses for Q values, a computer player can play Dominion well by choosing the action which maximizes the Q value.
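
Conceptually, the SARSA side of this looks something like the following sketch. It assumes a Keras-style model that maps a single concatenated (state, action) feature vector to one Q value; the hyperparameter values and the encode_features helper are illustrative assumptions rather than our exact setup.

    # Sketch of epsilon-greedy action selection and the one-step SARSA target
    # over a neural Q-function. Hyperparameters and helpers are illustrative.
    import random
    import numpy as np

    GAMMA = 0.99    # discount factor (assumed value)
    EPSILON = 0.1   # exploration rate (assumed value)

    def choose_action(model, state, actions, encode_features):
        """Pick a legal action epsilon-greedily by estimated Q value."""
        if random.random() < EPSILON:
            return random.choice(actions)
        q_values = [model.predict(encode_features(state, a)[None, :], verbose=0)[0, 0]
                    for a in actions]
        return actions[int(np.argmax(q_values))]

    def sarsa_target(model, reward, next_state, next_action, encode_features, done):
        """One-step SARSA target: r + gamma * Q(s', a') for the action actually taken."""
        if done:
            return reward
        q_next = model.predict(encode_features(next_state, next_action)[None, :],
                               verbose=0)[0, 0]
        return reward + GAMMA * q_next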

In the neural network approach, input features are passed through successive layers of neurons, each of which computes a linear combination of its inputs and applies a nonlinear activation function to the result. This is analogous to stacking perceptrons, and the advantage of a neural net is that it can synthesize, generalize, and infer information from a set of input features much smaller than an explicit full state/action encoding.
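
Written out in plain NumPy, a single layer in that description amounts to the following (the tanh default is just one example of a nonlinearity):

    import numpy as np

    def dense_layer(x, W, b, activation=np.tanh):
        """One layer: a linear combination of the inputs, then a nonlinearity."""
        return activation(W @ x + b)

    # Stacking layers feeds each output into the next layer's inputs:
    #   h1 = dense_layer(x, W1, b1)
    #   h2 = dense_layer(h1, W2, b2)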

Diagram of the model, showing how input is divided amongst towers (here, three analyzing the hand, table, and proposed action).

We designed a content-associative tower architecture inspired by the way human players group and process information. A player looking at their hand may derive some high-level strategy from it; this gets combined with strategic judgments about their deck or the state of the game, all of which contribute to a final decision about the best course of action. Accordingly, our input features are segmented and fed to several “towers” that each process their slice of the content and derive a latent representation of it. These representations are concatenated and fed through further dense layers to produce the Q value.

Some implementation details:

  • The network was small (12K parameters). We kept it that way to ease training time and to focus on the effectiveness of the content-associative approach rather than on sheer numbers of neurons and layers.
  • The hand features are one-hot vectors encoding up to 15 cards in the player’s hand. The corresponding tower uses a 1D convolution to apply a filter the size of the number of card types in the game to each of these slots, allowing parameters to be shared. These filters can be thought of as mini-evaluations of different aspects of the fitness and utility of the cards in hand.
  • The initial layers used a ReLU (rectified linear unit) activation, while later layers used a sigmoid activation.
  • The final layer consisted of one neuron with a linear activation.
  • The model was built in Keras and trained with the Adam optimizer (a more stable optimizer than plain gradient descent) to minimize mean squared error; a rough sketch of the model follows this list.
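
Putting those pieces together, a rough sketch of the model in the Keras functional API might look like this. The input dimensions, tower widths, and filter counts are placeholders chosen for illustration, not the exact values we trained.

    # Illustrative sketch of the content-associative tower model in Keras.
    # Dimensions, widths, and filter counts are placeholders.
    from tensorflow import keras
    from tensorflow.keras import layers

    NUM_CARD_TYPES = 33   # assumed size of the one-hot card encoding
    HAND_SLOTS = 15       # up to 15 cards in hand, as described above

    # Hand tower: a 1D convolution whose filter spans the one-hot card dimension,
    # applied to each hand slot so that parameters are shared across slots.
    hand_in = keras.Input(shape=(HAND_SLOTS, NUM_CARD_TYPES), name="hand")
    hand_tower = layers.Conv1D(filters=8, kernel_size=1, activation="relu")(hand_in)
    hand_tower = layers.Flatten()(hand_tower)

    # Table and proposed-action towers: dense layers over their feature slices.
    table_in = keras.Input(shape=(NUM_CARD_TYPES,), name="table")
    table_tower = layers.Dense(16, activation="relu")(table_in)

    action_in = keras.Input(shape=(NUM_CARD_TYPES,), name="action")
    action_tower = layers.Dense(16, activation="relu")(action_in)

    # Concatenate the latent representations, pass them through a later
    # sigmoid layer, and regress a single Q value with a linear output neuron.
    merged = layers.Concatenate()([hand_tower, table_tower, action_tower])
    hidden = layers.Dense(32, activation="sigmoid")(merged)
    q_value = layers.Dense(1, activation="linear")(hidden)

    model = keras.Model([hand_in, table_in, action_in], q_value)
    model.compile(optimizer="adam", loss="mse")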

While the neural network architecture helped with the intractable action space problem, it suffered from sparse rewards. Games consist of hundreds of decisions, and the reward for most of them is zero, so a neural network can cheat by guessing zero most of the time and do pretty well. To address this, we introduced eligibility traces to assign credit from the end-game reward to past decisions. To implement this, we trained the neural network only after an entire game, distributing the final reward (with decay) back through the rewards vector.
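
A minimal sketch of that reward-smearing step, with an assumed decay constant and hypothetical variable names:

    # Distribute the end-of-game reward backward with decay, then train once
    # on the whole game. The decay constant and names are illustrative.
    import numpy as np

    DECAY = 0.95  # assumed trace-decay factor

    def smear_final_reward(rewards):
        """Copy decayed credit from the final reward back onto earlier decisions."""
        rewards = np.asarray(rewards, dtype=np.float32).copy()
        final = rewards[-1]
        for t in range(len(rewards) - 2, -1, -1):
            rewards[t] += final * DECAY ** (len(rewards) - 1 - t)
        return rewards

    # After a full game, fit on every decision from that game at once:
    #   targets = smear_final_reward(raw_rewards)
    #   model.fit(game_features, targets, epochs=1, verbose=0)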