I'm happy to announce two new manuscripts on reinforcement learning and the prisoner's dilemma, with a number of exciting new results! These works are also a great story about open science, reproducible results, and sustainable software.

In Reinforcement Learning Produces Dominant Strategies for the Iterated Prisoner's Dilemma, we use evolutionary and particle swarm algorithms to create the best-performing strategies known for the iterated prisoner's dilemma. By evolving strategies to maximize their total payoff, we arrive at about a dozen strategies, based on neural networks, lookup tables, or finite state machines, that are generally cooperative, able to exploit weak strategies, and willing to defect against aggressive ones. These new strategies consistently win tournaments of over 150 strategies, including all of the best-known human-designed tournament strategies we could find in the literature, such as OmegaTFT and several zero determinant strategies.
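For readers who want to experiment, here is a minimal sketch of a round-robin tournament using the open-source Axelrod library described later in this post. The strategy classes shown are illustrative stand-ins for the 150+ strategies in the actual tournaments, and class names may differ slightly between library versions.

```python
import axelrod as axl

# A small pool of classic strategies; the tournaments in the paper use over 150.
players = [axl.TitForTat(), axl.Cooperator(), axl.Defector(),
           axl.Grudger(), axl.Random()]

# Round-robin: every pair plays a 200-turn match, repeated 10 times.
tournament = axl.Tournament(players, turns=200, repetitions=10)
results = tournament.play()

print(results.ranked_names)  # strategies ranked by normalized score
```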

Figure: Standard tournament score distributions.

Figure: Pairwise scores from the standard tournament.

In Evolution Reinforces Cooperation with the Emergence of Self-Recognition Mechanisms: an empirical study of the Moran process for the iterated Prisoner's dilemma, we evolve strategies to maximize their fixation probabilities in the Moran process, a population dynamic that models natural selection. To succeed in this evolutionary setting, strategies need to be able to invade populations of other strategies and to resist invasion by opponents.
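The Moran process is also implemented directly in the Axelrod library, so a small population experiment looks roughly like this (strategy choices are arbitrary, and the API may vary a little between versions):

```python
import axelrod as axl

# Start with one individual of each type. At each step one player is removed
# and replaced by a copy of another, chosen in proportion to payoff, until a
# single strategy takes over the whole population (fixation).
players = [axl.TitForTat(), axl.Defector(), axl.Cooperator(), axl.Grudger()]
mp = axl.MoranProcess(players, turns=100)
mp.play()

print(mp.winning_strategy_name)  # the strategy that reached fixation
```

Running many such processes from different starting populations gives an estimate of each strategy's fixation probability.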

Here's the big result: we find that the strategies that maximize their payoff are great invaders, whereas strategies that use some kind of self-recognition mechanism (such as an initial handshake sequence) are the best at resisting invasion. When we evolve strategies to maximize their fixation probability, they often naturally acquire handshaking mechanisms. This falls entirely out of the evolutionary algorithms -- we do not hard-code handshakes in any way.
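To make the handshake idea concrete, here is a hypothetical, stripped-down version of such a strategy (not one of the evolved strategies from the paper): it opens with a fixed sequence of moves, then cooperates only if the opponent played the same opening, i.e. appears to be a copy of itself.

```python
C, D = "C", "D"

def handshake_strategy(my_history, opponent_history, handshake=(C, D)):
    """Play a fixed opening; cooperate only with opponents who mirrored it."""
    turn = len(my_history)
    if turn < len(handshake):
        return handshake[turn]            # still performing the handshake
    if tuple(opponent_history[:len(handshake)]) == handshake:
        return C                          # opponent matched: likely a copy, cooperate
    return D                              # opponent failed the handshake: defect forever
```

Against a copy of itself this locks into mutual cooperation after the opening, so a resident population of copies is hard to invade; anything that fails the handshake is met with unconditional defection and cannot exploit it.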

Figure: Strategies ranked by invasion ability as a function of population size.

Figure: Strategies ranked by resistance to invasion as a function of population size.

This has important implications for many evolutionary processes -- the strategies that are best at invading populations as mutants are not the best at resisting invasion. Self-recognition mechanisms are less likely to reach fixation but are more evolutionarily stable once they do arise! This may have implications for social behaviors in humans (in-group / out-group effects), mating rituals, and many other scenarios. It also suggests that handshakes can prevent the fixation of a less advantageous variant, much like how traditional institutions are resistant to change.

As someone with a long-term fascination with all things evolutionary, as well as a machine learning engineer / data scientist by day, I'm very excited to have used evolution and other reinforcement learning techniques to create strategies that succeed at evolution, and to learn more about the evolutionary process.

I promised an open science story, so let me get to it. The Axelrod project was started with the goal of reproducing early iterated prisoner's dilemma tournaments (the project is named after Robert Axelrod, who coordinated and ran some very influential tournaments). The project has been active for just over two years; I joined a few months in as one of the core developers (along with Vince Knight and Owen Campbell) after adding several strategies (e.g. the zero determinant strategies). We've had many amazing contributions from over 50 individuals on GitHub, including professional programmers, researchers and biologists, university students, and even at least one high school student. The project is very welcoming to open source newcomers and even beginning programmers. The library is extensively tested by automated unit tests, is well-documented, and is completely open source.

Once the library started to accrue a sizeable number of strategies from the literature and various contributors, it became amenable to machine learning applications. Martin Jones contributed a strategy based on lookup tables trained with an evolutionary algorithm that became the top strategy in the library (and is still among the best despite the addition of approximately 100 strategies since then). Then Georgios Koutsovoulos made a stochastic version of Martin's strategy using a particle swarm algorithm, narrowly taking over the top spot. Not to be bested, Martin returned with a neural-network-based approach that became the new champion.
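A lookup-table strategy of the kind Martin trained can be pictured as a dictionary keyed on recent history, with an evolutionary algorithm repeatedly mutating the table and keeping the variants that score best. A toy sketch (the trained strategies condition on longer histories, including the opponent's opening moves):

```python
import random

C, D = "C", "D"

# (my last move, opponent's last move) -> my next move.
# This particular table happens to encode Tit For Tat.
table = {(C, C): C, (C, D): D, (D, C): C, (D, D): D}

def lookup_strategy(my_history, opponent_history, table=table, first_move=C):
    if not my_history:
        return first_move
    return table[(my_history[-1], opponent_history[-1])]

def mutate(table, rate=0.1):
    """One evolutionary-algorithm step: randomly flip some table entries."""
    return {key: (random.choice([C, D]) if random.random() < rate else move)
            for key, move in table.items()}
```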

Some time later I made a sublibrary and generalized their contributions so that more strategies could be trained, with flexible parameters and in many standard variations such as noisy tournaments. We had also added the Moran process to the main library, so we began training strategies for that context as well, and added finite state machines as a new strategy representation, along with a stochastic version based on hidden Markov models.
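A finite state machine strategy is just a transition table: given the machine's current state and the opponent's last move, it specifies the next state and the next action, and training amounts to searching over these tables. A minimal illustrative example (two states, again Tit-For-Tat-like, not one of the trained machines):

```python
C, D = "C", "D"

# (state, opponent's last move) -> (next state, my next move)
transitions = {
    (0, C): (0, C),   # stay friendly while the opponent cooperates
    (0, D): (1, D),   # punish a defection
    (1, C): (0, C),   # forgive once the opponent cooperates again
    (1, D): (1, D),   # keep defecting against a defector
}

def fsm_strategy(state, opponent_last_move):
    next_state, my_move = transitions[(state, opponent_last_move)]
    return next_state, my_move
```

Roughly speaking, the hidden Markov model variant replaces these deterministic transitions with probabilities, so the next state and action are sampled rather than fixed.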

Then we burned a lot of CPU cycles on a consumer-level desktop (quad-core @ ~4GHz) and some 16-core machines at Cardiff (where core developer and author Vince Knight is a professor) training new strategies (and retraining the originals mentioned above) for standard tournaments, noisy tournaments, and the Moran process. Vince also ran many iterations of the tournaments and Moran processes over all the strategies to create the datasets for our paper, while I worked on new strategies and we analyzed the data coming in.

Initially we had winners for both standard and noisy tournaments, which we added back into the library. Our study of the Moran process (computing all the pairwise fixation probabilities) was already underway, and we added a few specifically trained strategies to that pool, as well as the tournament winners. While we were writing up the results, Edouard Argenson contributed a complex (and not very well-known) strategy called Desired Belief Strategy (DBS). After each release of the Axelrod library we update a standard collection of tournament results, and we noticed that DBS was winning the noisy tournament! So we regenerated the data and rewrote parts of the paper to account for the success of DBS in noisy tournaments. Our trained strategies ended up in close second and third place.
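For reference, noisy tournaments can be reproduced with the library's noise parameter, which flips each intended action with the given probability. A small sketch (the player list is illustrative, and strategy class names such as DBS may differ by library version):

```python
import axelrod as axl

players = [axl.TitForTat(), axl.OmegaTFT(), axl.DBS(),
           axl.Grudger(), axl.Random()]

# Each action is flipped with probability 0.05 before being delivered.
noisy = axl.Tournament(players, turns=200, repetitions=10, noise=0.05)
results = noisy.play()
print(results.ranked_names)
```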

Bonus: See Vince Knight's blog post