Monday, February 12, 2018

A Data-Driven Strategy Guide for Through the Ages, Part 0

Through the Ages (TtA) is a multi-player, deep strategy game with partially hidden information and a setup that varies from game to game. All these properties make it hard to learn strategic lessons from statistical analysis.

I scraped data from 30k+ games at boardgaming-online.  I use a Support Vector Machine to predict the result of a game with up to 70% accuracy, and the skill of a player with up to 60% accuracy (out-of-sample performance).
Together with these predictions, we can learn strategic lessons about which aspects matter more during different stages of a game.
We can also learn which specific cards are helpful or harmful to your strategy.
Read more here.

Also, I found a way to remove player skill bias and strategic misconceptions from the data. You can read about that here.

A Data-Driven Strategy Guide for Through the Ages, Part 6

Index:

1. Introduction  (Link to Part 1)
2. Data Analysis
    2.1 Classification: Infrastructure Development (Link to Part 2)
    2.2 Classification: Cards Played (Link to Part 3)
    2.3 Separating players with TrueSkill (Current Article)
3. Analysis for Boardgamers
    3.1 Infrastructure Development (Link to Part 4)
    3.2 Cards Played (Link to Part 5)
    3.3 Mistakes made by Good Players (Link to Part 7)

2. Data Analysis

2.3 Separating players with TrueSkill

TrueSkill is a system invented to calculate the relative skills of players in games with non-deterministic outcomes. We use the default model in Python, which starts every player at skill = 25 with a standard deviation of 8.33. Since we will continue to use the approach of separating "good" and "bad" performances in a game at 90% of the winner's score, we update TrueSkill by the same standard: every player in a game with a "good" performance "defeats" everyone with a "bad" performance.
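As a minimal sketch of that update rule with the trueskill Python package (the player names, scores, and data layout below are placeholders, not the actual journal format):

```python
import trueskill

# Default environment: every player starts at mu = 25 with sigma = 25/3 (about 8.33).
env = trueskill.TrueSkill()
ratings = {}  # player name -> current Rating

def update_game(players, scores):
    """Update ratings for one finished game.
    players: list of player names; scores: final scores in the same order.
    Everyone above 90% of the winner's score ties at rank 0 ("good") and
    "defeats" everyone below the threshold (rank 1, "bad")."""
    winner_score = max(scores)
    ranks = [0 if s > 0.9 * winner_score else 1 for s in scores]  # lower rank = better
    groups = [(ratings.setdefault(p, env.create_rating()),) for p in players]
    new_groups = env.rate(groups, ranks=ranks)
    for p, (new_rating,) in zip(players, new_groups):
        ratings[p] = new_rating

# Hypothetical example game
update_game(["alice", "bob", "carol"], [214, 201, 150])
```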

We go through the results of 30k games to update every player's TrueSkill.
The final distribution of TrueSkill looks like a good bell curve with a few sharp noise spikes. In the meantime, it is also fun to see how many games each player has played.
See how this almost looks like an x*y = const. curve? That means that on average, every game has a roughly equal share of experienced and inexperienced players.

With the TrueSkill information, we can look into a few more interesting statistics. Cross-referencing these different statistics allows us to get more strategic insights. Note that I am using the final TrueSkill instead of the evolving TrueSkill along the way. The implicit assumption is that players who played many games stayed at the same skill level for the majority of those games. Thus the final state is roughly the equilibrium state, and the entire history of games I included is also in the equilibrium state.

First of all, since TtA has hidden information and multi-player interaction, "good moves" do not always win the game. Instead of classifying with respect to the outcome of a game, we can classify with respect to the skill of a player. This can answer the question of "what do better players do?" Maybe better players have already seen through the luck-dependent noise and know how to perform consistently well.
For example, in this chart we compare the prediction of outcome with the prediction of player skill, based on the infrastructure development. We can see that in the beginning, stronger players are already doing something measurably different from weaker players, even though the outcome of the game is still pretty unclear. Then, for a long stretch, almost to the end of the game, we cannot see further differences between stronger and weaker players. All we know is that as the game progresses, it becomes more and more clear who will win.

Also, as discussed earlier, classifying with respect to the outcome can sometimes be biased by the post-selection effect. A card can appear to be good simply because it is adding frosting on the cake, instead of being the true key to victory. With the above cross-referencing method and appropriate choices of card subsets, we can try to remove the post-selection effect. We can also recognize important lessons that even good players have not seen yet.

We can also calculate the average TrueSkill of the players in a game. If it is larger than 25, we say it is a "good game". Otherwise we say it is a "bad game". We can then perform the previous analysis on the "good games" to see if there is any difference. This can tell us how to win when you are playing with good players, and whether that is very different from the situation with typical players.
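A minimal sketch of that split (the DataFrame layout and column names are my own assumptions):

```python
import pandas as pd

def split_good_bad_games(results: pd.DataFrame):
    """results: one row per player per game, with columns 'game_id' and 'trueskill_mu'
    (column names are assumptions). A game whose average TrueSkill exceeds the prior
    mean of 25 is labeled a "good game"; everything else is a "bad game"."""
    avg_skill = results.groupby("game_id")["trueskill_mu"].transform("mean")
    good_games = results[avg_skill > 25]
    bad_games = results[avg_skill <= 25]
    return good_games, bad_games
```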


A Data-Driven Strategy Guide for Through the Ages, Part 7

Index:

1. Introduction  (Link to Part 1)
2. Data Analysis
    2.1 Classification: Infrastructure Development (Link to Part 2)
    2.2 Classification: Cards Played (Link to Part 3)
    2.3 Separating players with TrueSkill (Link to Part 6)
3. Analysis for Boardgamers
    3.1 Infrastructure Development (Link to Part 4)
    3.2 Cards Played (Link to Part 5)
    3.3 Mistakes made by Good Players (Current Article)

3. Analysis for Boardgamers

Disclaimers: 
(1) We can only learn correlations from data.  Whether these correlations actually imply causation is up to our interpretation.
(2) The data comes from 30k+ recent games at boardgaming-online . 

3.3 Mistakes made by Good Players 

As explained in Section 2.3, we calculated the TrueSkill of each player. This allows us to use joint statistics to answer quite a few questions which were unclear when looking at the game result alone. We have also gathered enough data that we can afford to separate 2er games from 3er and 4er games. The zero-sum nature and the relative quantity of hidden information might make 2er games quite different.

We will be showing similar charts as in Section 3.2, with a few improvements. First of all, the title of the chart will tell you whether it is for predicting the result of the game or for predicting the player's skill. The single number (less than 1) in the title is replaced by a percentage. It still tells you how good the prediction is. I have also already subtracted the performance of the trivial guess, so it is easier to see how good such a prediction really is.
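In other words, the percentage in each title is, as I read it, the validation accuracy minus the trivial-guess baseline; a minimal sketch assuming 0/1 labels:

```python
import numpy as np

def accuracy_above_trivial(y_true, y_pred):
    """Validation accuracy minus the accuracy of the trivial guess
    (always predicting the more common class)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accuracy = np.mean(y_true == y_pred)
    trivial = max(np.mean(y_true == 1), np.mean(y_true == 0))
    return accuracy - trivial
```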

Let us first look at all the cards during Age A and Age I for 2ers.
We should first note that the first chart here has a bad performance, consistent with 0% (trivial guess). That means the usage of Age A and Age I cards fails to predict the final outcome of the game. That is not very surprising. 2ers are probably decided by big military and/or culture swings which come much later in the game. However, we can still predict players' skill quite well. We know which cards tend to be played by stronger players (blue bars), and which tend to be played by weaker players (red bars).  We cannot directly know whether these choices help you win. All we know is that those choices appear to be related to player skill.

Next we do the same thing for 3ers and 4ers.
Now this is interesting. The prediction for the game outcome is no longer consistent with the trivial guess. It becomes very informative to compare these 2 charts.  The prediction for player skill is better. This is somewhat expected. Since it is quite early in the game, it will be difficult to predict the final outcome. However, good players tend to follow certain strategies, which might already be distinguishable in their early choices.

I can see 4 striking differences here.

Pyramids vs Library of Alexandria.
Stronger players slightly prefer Pyramids over Library, more than weaker players do. However, Library performs significantly better for winning the game. I cannot see any alternative interpretation here and must conclude that this is a mistake made by strong players. Note that strong players must have performed better in many other places to make up the difference. However, there is very little doubt that in a 1-to-1 comparison, Library is better. Start using it more!

I don't think that in the long run, Pyramids' ability is weaker. I looked at the statistics of how early people build Age A Wonders. Pyramids and Library tend to be built as early as possible, which means delaying the 2nd Philosophy. 1 CA is better than 1 Science later in the game. However, this early in the game, Science might be slightly better, or at least equal. 1 turn earlier into any Age I technology can have a compounding effect that snowballs your economy. Combined with the other benefits from Library, that makes it somewhat more powerful than Pyramids built at the same time.

Thus, maybe we should stop rushing to complete Pyramids. I know, it is a tempting package deal to get that CA early, delay population growth and get 1 extra food. But maybe it's not worth delaying the extra Science production. 

Age I Wonders.
Stronger players do not build Age I Wonders more often than weaker players. However, except for the Great Wall, all other Age I Wonders seem to win a lot. The interpretation here requires a few more steps. Recall that if we use Age III cards to predict the game result, Wonders perform exceptionally well. That does not mean those cards win the game. They simply "indicate" that the player has enough CA, science and resources to complete a big Wonder. Leveraging those into other developments probably wins the game, too.

We did not have to consider this "post-selection" effect for Age A Wonders, because everybody starts in roughly the same condition. For Age I, we should ask whether there could be significant differences in infrastructure, which determines a player's ability to build a Wonder.

That does not seem to be the case to me. Even if someone upgrades Iron early, it will take 3 rounds before an upgraded mine generates a net resource gain. During the prime time to build Age I Wonders, a few timely Yellow Cards probably give you more of a resource advantage, and even that is not much. Thus, it seems like these Wonders are really contributing to victory, instead of just "indicating" someone's existing advantage. Furthermore, if that were the case, Great Wall should have been an equally good "indicator", but it is not.

Thus, I again conclude that Taj Mahal, University, and Basilica are under-valued by good players. Investing your early resources and CA in them seems to be a good deal, compared to other things you might have done. This is probably because all other ways to use resources require population and science. You are likely short on either or both during the Age I-II transition.

Code of Law vs Warfare.
Good players value them almost equally. However, CoL appears to perform significantly better. I again do not see an alternative explanation here. They have the same number of cards, and the science costs differ by only 1. Thus again, good players seem to over-value Warfare. They are probably a bit paranoid and trying too hard to prevent early aggression. I won't say that's wrong. If you are indeed the better player in a game, then the most likely way for you to lose is probably a devastating early aggression. It is a good circumstantial strategy to be slightly paranoid about that and go with the mediocre Warfare. As long as a stronger player does not suffer from an early aggression, she can count on later moves to cover the lost ground.

Knights vs Swordsmen.
Good players value them almost equally, but Swordsmen perform better than Knights. This, even to me, is a surprise. Knights open up more Age I tactics, but Swordsmen are easier to discover and build. By Age II, such a difference is almost gone. I would have expected them to perform equally well. My only explanation is that on average, people do spend 15% more CA to take Knights from the card row, and that is bad enough to undermine Knights' record.


Thursday, February 1, 2018

A Data-Driven Strategy Guide for Through the Ages, Part 5

Index:

1. Introduction (Link to Part 1)
2. Data Analysis
    2.1 Classification: Infrastructure Development (Link to Part 2)
    2.2 Classification: Cards Played (Link to Part 3)
    2.3 Separating players with TrueSkill (Link to Part 6)
3. Analysis for Boardgamers
    3.1 Infrastructure Development (Link to Part 4)
    3.2 Cards Played (Current Article)
    3.3 Mistakes made by Good Players (Link to Part 7)

3. Analysis for Boardgamers

Disclaimers: 
(1) We can only learn correlations from data.  Whether these correlations actually imply causation is up to our interpretation.
(2) The data comes from 10k+ recent games at boardgaming-online . I did not separate the "level" of the games.  Thus, it represents the behavior of all players, not just good players.

3.2 Cards Played

Again, the actual data analysis procedure is described in Section 2.2. Here is the summary for boardgamers.

The most important information I will use here is whether a card is played or not. That includes elected Leaders, discovered Technologies, FINISHED Wonders, and new governments. The data does not include when such a card was played or taken, or how it was played (revolution or peaceful). It does not include yellow cards either.

Just like in the previous section, you will see a number (weight) associated with each card in several charts. Comparing those weights within the same chart tells us how important those cards are, relative to each other.  In addition, the title of a chart will contain a number; a larger number implies a larger impact if we follow what the chart teaches us. This title number will not be higher than 0.7. If it is smaller than 0.54, the chart probably has no impact.

Lesson One: Leaders, Leaders, Leaders!

This chart includes all Stage A and Stage I cards. We can see that all leaders have relatively higher weights than other cards available at the same time. Also, Age I leaders are significantly better than Age A ones. So, it is ok to miss out on a Wonder, or a Technology, but we should always plan to have a leader and maximize its benefit.

This should not be too surprising. The (opportunity) costs for these cards are not identical. Leaders have (strictly) the lowest cost. You only spend CA to take them, and you often don't even spend CA to play them. Everything else requires you to spend CA plus science/resources/population.

Lesson Two: Early CA is key.
Among the Age A leaders, Hammurabi clearly stands out. This echoes our observation in the previous section that early extra CA is important.  In fact, the only other two technology cards which are almost as good as an early leader are Code of Laws (CoL) and Monarchy. They both give you extra CA. Meanwhile, Warfare costs 1 less Science than CoL, but has an obviously lower weight (consistent with 0). This does not mean that 1 MA is not important, but it certainly is not as good as 1 CA with a similar cost. Pyramids being lower than Library is a bit surprising. I initially suspected the reason was that even average players tend to overspend their CA on Pyramids, but that does not seem to be the case.
Here is a chart for all Age A, I and II Wonders, regarding how often they are taken, and on average how many CA a player spends to get them. We can see that Pyramids is taken 20% more often than Library, but people only spend slightly more CA for it on average. Thus, maybe Library is really better.

Lesson Three: St. Peter's Basilica is the best happiness solution.
By the end of Age I, almost everyone will need at least 2 happy faces or some extra yellow tokens. While the 1st happy face can come from the free temple, we need to work on the 2nd one. The possible solutions are Theology, Theocracy, Bread & Circuses, Hanging Gardens, Great Wall, St. Peter's Basilica, Homer, and (Columbus + Vast Territory).  Only the last 3 options have significantly positive weights in the above chart. The opportunity cost for Columbus or Homer is another leader of the same age, which has a comparable weight. Thus, the only solution that stands out is Basilica.

In fact, we can compare across all sources of Happy Face here.
We can see that this chart has a low impact (title number close to 54%). However, the two Stage I Urban solutions for happiness are just terrible. Basilica not only stands out among contemporary cards, it even looks better than later-game cards which produce more culture.  If we cross-reference the previous chart, we can see that Basilica is taken with 2 CA, which is the minimum for your 2nd Wonder. Maybe it deserves to be taken with more CA.

Lesson Four: Be opportunistic.
This chart is for all Age I and II techs for farms, mines, urban buildings and military units. First we note that the impact is quite low (55%).  We should probably avoid Theology and Bread & Circuses, but almost everything else is fine. Let us again cross-reference this with how often they are taken, and how many CA are spent on them.

Iron is taken twice as often as Coal, but it does have 1 more card, so that difference is not huge. The same story goes for Alchemy vs. Scientific Method, and Cannon vs. others.  People get more CA to spend later in the game, thus it is not surprising that Age II techs cost more CA on average. Among Age I techs, people do take Iron, Knights and Alchemy with more CA.

I guess the conclusion here is that almost anything here will work. Nothing is a must-have, but almost nothing is bad either. You just get what you can, and adjust with yellow cards accordingly. The military situation is similar.  Any unit can shine with the right tactics, and the chance of matching the right tactics does not seem to vary significantly.

Lesson Five: Leaders again, but Strategy catches up.
This chart shows all cards in Age II. Leaders are still good, with Napoleon on top as expected. Strategy stands out among other cards and is almost as good as Bach. The 2 CA plus military power might decide whether you are the predator or the prey. The special technologies have roughly the 2nd-lowest cost on average. You only spend science, and usually an affordable amount. In fact, let us look at all the things that require science only.
Lesson Six: "Pure" techs are good.
The above chart compares all Governments and Special Techs. They all cost you only science.  First we note that almost all of them have positive weights. That is because if you spend the science on Urban Buildings or Military Units instead, you also need to spend population and resources to get the actual benefit. You don't always break even after those extra costs. For these cards, you start to get benefits right after spending your science.

Lesson Seven: Constitutional Monarchy sucks!!??
This incorrect impression was due to a bug in the code that confused Monarchy with Constitutional Monarchy. After fixing that bug, CM performs slightly better than Republic.

Lesson Eight: Many ways to seal your victory, but too late for your misery.
Finally, Age III is no longer dominated by leaders. With sufficient CA and resources, Wonders seem to be the most certain way to seal your victory. But you can also leverage your science or population advantage into Democracy or Air Force. It is also not too late to boost your CA with Civil Services. Note that everything here has a positive weight. This is the post-selection effect: "if you can afford these, you are probably winning anyway". Thus, it is probably only worth looking at weights significantly different from the average here.

In particular, we see that Gandhi is terrible compared to other leaders. We probably only take him as the last and usually futile defence against strong aggressors. Professional Sports also appears to be terrible, which suggests that we should have solved potential happiness problems earlier.

A Data-Driven Strategy Guide for Through the Ages, Part 4

Index:

1. Introduction (Link to Part 1)
2. Data Analysis
    2.1 Classification: Infrastructure Development (Link to Part 2) 
    2.2 Classification: Cards Played (Link to Part 3)
    2.3 Separating players with TrueSkill (Link to Part 6)
3. Analysis for Boardgamers
    3.1 Infrastructure Development (Current Article)
    3.2 Cards Played (Link to Part 5)
    3.3 Mistakes made by Good Players (Link to Part 7)

3. Analysis for Boardgamers

Disclaimers: 
(1) We can only learn correlations from data.  Whether these correlations actually imply causation is up to our interpretation.
(2) The data comes from 10k+ recent games at boardgaming-online . I did not separate the "level" of the games.  Thus, it represents the behavior of all players, not just good players.

3.1 Infrastructure Development

Please refer to Section 2.1 if you are interested in the methods I used. For boardgamers, here is a summary of what you will see here:

(1) I recorded 6 aspects of infrastructure in every round per player per game. They are:
(A) Number of Civil Actions (CA) used.  (take/play cards, build/disband things, revolution is counted as 1 CA, wasted CA does not count)
(B) Number of Military Actions (MA) used. (draw cards + build/disband units + play aggression/war/tactics + adopt tactics, wasted MA does not count. In this counting, a Robespierre revolution will drop this to 0 but add 1 to CA.)
(C) Resource produced. (ignore corruption)
(D) Science produced.
(E) Culture produced.
(F) Food produced (minus consumption).

(2) Up to a specified round, you will see 1 number (weight) per round per aspect. This number implies how strongly that aspect at that round is related to a good performance at the end of the game (>90% of the winner's score is classified as good). A positive weight means that being better than your opponents on this aspect is important. A negative weight means that being better on this aspect is actually bad.

We will try to determine whether some strategic lessons can be learned from these weights.

Here is what they look like from Round 0 to Round 4.
Lesson One: All curves except CA start at 0 in round 0. This is the card-selection round, so no one can do anything different. A small but positive weight for CA implies a small advantage to later players.

Lesson Two: At Round 1, the common action is to build the 3rd mine (73%). If you have Urban Growth, you can build the 2nd Lab instead (5%). They perform equally well. This resource or science advantage actually has almost the largest weight up to Round 4. Thus it is probably not a good idea to build a farm or work on a Wonder instead.

Lesson Three: At Round 1, the only way to get extra CA is Hammurabi. The negative weight suggests that it may not be a good idea to exercise his ability right away. Also, the only way to get an extra MA is Caesar. The negative weight suggests that he may be a relatively weaker leader. Although, it could also mean that typical players do not know how to use him properly. Since the MA weights are no longer negative after Round 2, maybe one should avoid electing Caesar at Round 1. There is a little puzzle here though. Presumably, the only disadvantage of using Hammurabi's power is 1 less Military Card. And the only advantage of electing Caesar is 1 more Military Card. It is not impossible, but quite amusing that both are bad.😏

Lesson Four: From Round 2 to Round 4, extra CA seems to be the best investment. There are three ways to get it: Hammurabi, Pyramids, or Code of Laws. It might be wise to prioritize them.

Lesson Five: Food is mostly irrelevant, probably because most people produce exactly the same amount anyway. A small exception happens in Round 2, during which you can delay growing a population to produce 1 more food. This usually means working on a Wonder, delaying the 2nd Lab, or risking missing an Age A event. The relatively lower weight of science at Round 2 also indicates that delaying the 2nd Lab for this may not be a bad idea.

Next, we move on to look at the same thing up to Round 10.
Lesson Six: Food is consistently irrelevant as expected. Most of the time, people dance near the upkeep and produce 1 or 0 food each round. Food matters only in combination with population footprints and happiness. Thus the long-term food production trend does not teach us interesting lessons.

Lesson Seven: Between Round 4 and 7, all the weights remain relatively small, with Resource production being slightly more important than the others. Before and after this quiescent period, CA and Science are highly valuable. This is the natural rhythm of TtA. Before Round 4, we are working to set up Age I technologies. CA and Science are the keys to getting better techs and discovering them earlier. We then need resources to reach the full potential of those techs. A few rounds later, Age II technology comes and we want Science and CA again.  The resource lead during the quiescent period can come from either an Iron upgrade or a 4th Bronze.

Lesson Eight: Culture production and MA stay underwhelming until Round 10.  In fact, to see their full potential, we have to look at the entire game duration.
Here, we divided every game into Round 0 plus 10 portions (so 1 portion is about 2 actual rounds). We can see that the importance of MA shoots up in the last 1/3 of the game. The importance of Culture increases steadily and also becomes quite relevant in the last 1/3. The fact that they both drop back in the last 1/10 of the game shows that this is a timing issue. Your opponents will catch up eventually, but you can win by setting up an earlier aggression or massive culture production.

Lesson Nine: CA becomes important again at the very end. You need it to grab essential technologies and Wonders. It also increases your ability to complete Wonders.

It is a bit tempting to interpret the negative CA weight at 4/10 of the game as "the prime time for revolution", since in my way of counting, the revolution player would have taken just 1 CA that round. I myself do not trust that interpretation too much, since the same feature did not show up in a round-by-round analysis. When looking at the entire duration of the game, where much larger weights appear in the late game, one needs to be careful about interpreting the meaning of the other weights.

In the next Section, I will use the same technique to analyze individual cards.

A Data-Driven Strategy Guide for Through the Ages, Part 3

Index:

1. Introduction (Link to Part 1)
2. Data Analysis
    2.1 Classification: Infrastructure Development (Link to Part 2)
    2.2 Classification: Cards Played (Current Article)
    2.3 Separating players with TrueSkill (Link to Part 6)
3. Analysis for Boardgamers
    3.1 Infrastructure Development (Link to Part 4) 
    3.2 Cards Played (Link to Part 5)
    3.3 Mistakes made by Good Players (Link to Part 7)

2. Data Analysis

2.2 Classification based on Key Cards played

Let us first recall this Figure from the previous section.
This shows how well we can predict the final outcome based on "infrastructure development" up to a certain round.  It grows monotonically, as it should.  However, it grows at obviously different rates.  With the error bars, we are certain that the slope between rounds 4 and 8 is almost half of the slopes before and after.  This implies that the status changes during these few rounds are less relevant to the final result.

Based on my experience with this game, there is a likely reason.  In order to develop any aspect of infrastructure, one must play certain cards.  Increasingly better cards become available in different stages of the game.  Thus, the biggest differences occur when players set up those cards.  That happens roughly at round 4 for Stage I cards, and round 8 for Stage II cards.  That is why those rounds have a higher impact on the final result. Here, we will try to learn strategic lessons from the usage of those cards.

There are 88 such cards in the game.  We again go through the scraped data to get an 88-dimensional vector per player per game.  Although we also recorded when and how these cards are played, we will not use that information yet. All we care about now is whether a card is played, so the value of each component of this 88-dimensional "Key Cards" vector is either 0 (not played) or 1 (played).
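A minimal sketch of how such a vector can be built (the card names shown are placeholders for the full list of 88):

```python
import numpy as np

# KEY_CARDS would list all 88 tracked card names; only a few are shown here.
KEY_CARDS = ["Hammurabi", "Pyramids", "Library of Alexandria", "Code of Laws"]  # ...
CARD_INDEX = {card: i for i, card in enumerate(KEY_CARDS)}

def key_cards_vector(played_cards):
    """Turn the set of cards one player actually played in one game into a 0/1 vector."""
    x = np.zeros(len(KEY_CARDS))
    for card in played_cards:
        if card in CARD_INDEX:
            x[CARD_INDEX[card]] = 1.0
    return x
```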

We repeat exactly the same procedure as in the previous section. An SVM with a linear kernel can classify "good" or "bad" performance based on the "Key Cards" vector.  We get a performance of 70% and a weight for each card.
Please do not squint at the above Figure.  It is not very useful, for the same reason that we did not want to use the last few rounds in the previous section. In the above Figure, a lot of cards with heavy weights are "big-late" cards.  They are played at the end of the game and cost a lot of resources and actions. By that time, if a player has those resources and actions to spare, most of the time that is already a "good" performance anyway. Therefore, they are just cashing in their lead.  These are good "indicators" that a player is doing well, but they are not the strategic reason why such a player is good.

A more careful analysis is required to coax causation from the observed correlations.  For example, we should focus on cards which are played earlier in the game. Ideally, we should also compare cards with comparable opportunity costs.  Again, GO is a biased example in that every single move has the same opportunity cost---another move.  In TtA, playing a card usually involves a combination of many types of resources, so no card has exactly the same opportunity cost as another. Thus, we need to be more clever about "asking the right questions" here, and may need to cross-reference some other statistics.

Note that even after we have selected an appropriate subset of cards to compare, we cannot simply refer to that subset in the above Figure.  A classifier is trained to classify.  In some sense, it will use the best clue first, and then, conditioned on that, it will consider less significant clues. (In a decision tree, this would be exactly true. For a linear SVM, we can draw a 2-D example to convince ourselves that a similar effect is still present.) In this particular example, our classifier is effectively asking whether a player has played those "big-late" cards, and using that as the main guidance for classification.  It will then consider earlier cards with smaller weights to fine-tune the prediction.  This is similar to a conditional probability in reverse-time order (conditioning on the result to compare the causes, also known as a post-selection effect). This is opposite to the usual strategic thinking, and will often lead to strange results at face value.

Let me give a concrete example here.  All those "big-late" cards cost a lot of resources.  However, all cards that help to produce more resources have small, or even negative, weights in the Figure. This sounds weird, but it makes perfect sense for the classifier.  It first gets a correlation between winning and big-late cards. Then, if a player won without playing those big-late cards, she had better not have invested too much in resource production.  This criterion helps the classifier recognize those rarer winning situations. We cannot conclude from the negative weights here that those resource-producing cards are bad in general.

In order to get a meaningful result for a subset of cards, we should train our classifier only on that subset.  Here is an example for all cards available early in the game.
First of all, the number on the top is the classifier performance. At 59%, it is comparable to using the data up to round 4 in the previous section.  Since "playing cards" is closer to individual moves than "improving infrastructure", that is quite good news.  It also teaches a clear lesson.  One particular type of card stands out in the above Figure. All the Leader cards have significantly higher weights than the other types. This is consistent with advice from experienced players, and also consistent with the fact that leaders have a strictly smaller opportunity cost than other cards, yet often provide more benefits.
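In code, restricting the classifier to a subset simply means selecting those card columns before fitting; a minimal sketch:

```python
from sklearn import svm

def weights_for_subset(X, y, subset_idx):
    """Refit the linear SVM using only the chosen card columns.
    X: (n_samples, 88) 0/1 key-card matrix, y: good/bad labels,
    subset_idx: column indices of the cards in the subset (e.g. Age A and Age I cards).
    Returns one weight per card in the subset."""
    clf = svm.SVC(kernel="linear")
    clf.fit(X[:, subset_idx], y)
    return clf.coef_.ravel()
```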

We will repeat this process for various other choices of subsets.  If the performance is higher than 54%, we will analyze the results in Section 3.2. We will need to select those subsets carefully, and often need to supplement the results with other statistics to get meaningful lessons.

Sometimes, the true power of a card does not manifest alone.  Seasoned gamers usually expect combos--2 or more cards that combine to have dramatically better effects.  There is a particularly simple way to detect the existence of combos: we can train an SVM with a polynomial kernel of degree X. If the performance turns out to be better than the linear kernel, that implies the existence of combos involving X or fewer cards.
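A sketch of that comparison with scikit-learn (the article uses a single disjoint train/validation split; the cross-validation here is just a convenient stand-in):

```python
from sklearn import svm
from sklearn.model_selection import cross_val_score

def combo_check(X, y, degree=2):
    """Compare out-of-sample accuracy of a linear kernel against a polynomial kernel.
    A degree-2 polynomial kernel that clearly beats the linear one would hint at
    2-card combos."""
    linear = cross_val_score(svm.SVC(kernel="linear"), X, y, cv=5).mean()
    poly = cross_val_score(svm.SVC(kernel="poly", degree=degree), X, y, cv=5).mean()
    return linear, poly
```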

Unfortunately, and a bit surprisingly, such a situation has not come up during our analysis. Actually, the validation performance for nonlinear kernels is usually worse than for the linear kernel. This is true even if we train on a subset of cards which seasoned players consider to have good combos. This is probably because the chance for a player to get both cards of a combo is very small. A card has a 10%-30% chance of being played by someone, so a particular 2-card combo only shows up about 4% of the time. Within 10k samples, there might be too much "noise" among those cases.

Note that this is not about over-fitting though. In fact, we have only talked about validation performances so far, but in all examples they are actually close to the in-sample error. Even with only 10k games, the VC dimensions of our models have always been small enough to avoid over-fitting. The "noise" here actually stops the classifier from recognizing any pattern associated with combos even in-sample. It is not very clear whether increasing the data set size will improve that.

A Data-Driven Strategy Guide for Through the Ages, Part 2

Index:

1. Introduction (Link to Part 1)
2. Data Analysis
    2.1 Classification: Infrastructure Development (Current Article)
    2.2 Classification: Cards Played (Link to Part 3)
    2.3 Separating Players with TrueSkill (Link to Part 6)
3. Analysis for Boardgamers
    3.1 Infrastructure Development (Link to Part 4)
    3.2 Cards Played (Link to Part 5)
    3.3 Mistakes made by Good Players (Link to Part 7)

2. Data Analysis

2.1 Classification based on Infrastructure Development

The first idea is a simple 2-type classification---"good" or "bad".  We can then learn from the good behaviors and avoid the bad ones.  We will say a "good" outcome is >90% of the winner's score, and a "bad" outcome is below that. This threshold is motivated by statistics.
If we remove one winner from every game, the scores of the remaining players follow the above distribution.  We can see that the median of non-winners is about 80%, which should not be defined as "good".  Choosing 90% means that about 53% of the results are "good" (the actual winner included), and 47% are "bad".  This is a pretty comfortable ratio for a classifier without special tuning.  Also note that the game allows resigning.  That explains the small bump around 0 in the above Figure.  Because they did not finish the game, their data are incomplete.  All resigned players are removed from our consideration.  From 10k+ games, we get 30k+ results to classify.
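As a sketch of the labeling step (the DataFrame layout and column names are my own assumptions):

```python
import pandas as pd

def label_results(results: pd.DataFrame):
    """results: one row per player per game with columns 'game_id', 'score', 'resigned'
    (column names are assumptions). Drops resigned players and labels everyone scoring
    above 90% of the winner's score as "good" (1), the rest as "bad" (0)."""
    finished = results[~results["resigned"]].copy()
    winner_score = finished.groupby("game_id")["score"].transform("max")
    finished["good"] = (finished["score"] > 0.9 * winner_score).astype(int)
    return finished
```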

Next, we look at the development of infrastructure.
This example shows the amount of food generated each round by each player in a 4-player game (note that grey resigned at round 15).  One can find the same information for the 5 other aspects of the infrastructure.  This game lasted 17 rounds, which means that the data dimension is 17*6 = 102.  We will not directly use the values in the above figure.  Since the final result is a relative quantity with respect to the other players, it only makes sense for the input to be relative quantities as well.  We will normalize this vector by the per-round mean of the game, indicating whether a player is doing relatively better or worse than the other players in the same game at the same round.
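Concretely, the normalization could look like this minimal sketch (whether to divide by or subtract the table mean is not spelled out in the text; this version divides):

```python
import numpy as np

def normalize_by_game_mean(infra):
    """infra: array of shape (n_players, n_rounds, 6) for a single game.
    Divide each entry by the per-round, per-aspect mean over the players at the table,
    so a value above 1 means "better than the table average that round"."""
    mean = infra.mean(axis=0, keepdims=True)          # shape (1, n_rounds, 6)
    return infra / np.where(mean == 0, 1.0, mean)      # guard against all-zero rounds
```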

We are not ready to throw this into a Machine Learning Classifier yet.  The number of rounds actually varies from game to game, but a typical classifier wants all data points to have the same dimensionality. 
We have two ways to circumvent this problem.

We first consider the entire game duration, but rescale it into 11 portions independent of how many rounds there are. This gives us a 66-dimensional "infrastructure development" vector per game per player, and we can classify the final result accordingly. We use the support vector machine classifier (svm.SVC) from scikit-learn.  It is trained on a random subset of N points, and validated on a disjoint subset of the same size. For 3k < N < 10k, the validation performance stays around 73%, and the linear kernel performs as well as nonlinear ones. (Exception: the sigmoid kernel performs no better than random guessing. I have not figured out why.)
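The rescaling step is not spelled out in detail; here is a minimal sketch under the assumption that each aspect is linearly interpolated onto 11 evenly spaced time points:

```python
import numpy as np

def rescale_to_portions(per_round, n_portions=11):
    """per_round: (n_rounds, 6) normalized aspects for one player in one game.
    Resample the variable-length game onto a fixed grid of n_portions time points,
    so every player contributes a vector of the same length (11 * 6 = 66)."""
    n_rounds, n_aspects = per_round.shape
    grid = np.linspace(0, n_rounds - 1, n_portions)
    columns = [np.interp(grid, np.arange(n_rounds), per_round[:, k])
               for k in range(n_aspects)]
    return np.stack(columns, axis=1).ravel()
```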

73% may not sound impressive, but it is already useful for our purpose. Unlike GO, TtA is not fully deterministic. There are hidden and random elements. Actually, our choice of 6 aspects does not even cover all deterministic information. The null model that always predicts "good" would have had a performance of 53%, with a standard deviation of less than 1% at N=3k. Thus, the classifier is performing quite well and has already learned some strategic lessons.  We train with the linear kernel 10 times and take the average of its coefficients.
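A minimal sketch of that training-and-averaging loop (the exact training-set size per run is an assumption):

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split

def averaged_linear_weights(X, y, n_runs=10, n_train=3000):
    """Fit svm.SVC(kernel='linear') on n_runs random training subsets, check each fit
    on a disjoint validation subset of the same size, and average the coefficients."""
    coefs = []
    for seed in range(n_runs):
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, train_size=n_train, test_size=n_train, random_state=seed)
        clf = svm.SVC(kernel="linear").fit(X_tr, y_tr)
        print(f"run {seed}: validation accuracy = {clf.score(X_val, y_val):.3f}")
        coefs.append(clf.coef_.ravel())
    return np.mean(coefs, axis=0)   # one averaged weight per (portion, aspect) entry
```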
These coefficients tell us at which aspects, and during which time of the game, being better than your opponents is more likely to help you win. We will look closer at these results and analyze them in the actual game context in Section 3.1.

We can see that in the above Figure, the coefficients tend to be larger in later portions of the game. That is expected, but also a bit problematic for our purpose. TtA simulates economic development.  Small investments early in the game can snowball into huge benefits later.  In the last few rounds, players are typically cashing in those benefits.  Monitoring those "cashing in" moves is the most accurate way to predict the outcome. However, that is not exactly what we want to learn here.  We want to know the subtle effects of early investments.

It is not safe to just look at the earlier portions in the above Figure. When we play this game, we make early decisions without knowing the later developments. The above classifier is already contaminated by information from the future. This creates a post-selection effect such that the coefficients on earlier portions can be misleading. In the next Section, we will provide an obvious example. For now, if we want to learn things from earlier portions, it is best to ask our classifier to ignore future information.

This brings us to our second method.  We will only use the information from the first X rounds, with X up to 11. Despite the variability of the actual duration of each game, the first 11 rounds are almost always the early-mid stages of the game.  We will take these (6*X)-dimensional vectors and feed them into the classifier.  After the same training and validation process, we get the performance as a function of X.
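A sketch of that truncation loop (the array layout and split sizes are assumptions):

```python
from sklearn import svm
from sklearn.model_selection import train_test_split

def performance_vs_rounds(per_round_features, y, max_rounds=11, n_train=3000, seed=0):
    """per_round_features: (n_samples, 11, 6) normalized per-round aspects;
    y: good/bad labels. For each X = 1..max_rounds, train only on the first X rounds
    and record the validation accuracy."""
    scores = []
    for x_rounds in range(1, max_rounds + 1):
        X = per_round_features[:, :x_rounds, :].reshape(len(y), -1)   # 6 * X dims
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, train_size=n_train, test_size=n_train, random_state=seed)
        clf = svm.SVC(kernel="linear").fit(X_tr, y_tr)
        scores.append(clf.score(X_val, y_val))
    return scores
```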
We can see that after the 1st round, the performance is better than blind guessing, and monotonically increasing.  With all 11 rounds of data, the classifier is correct 66% of the time, not too far from the 73% obtained using the full duration.  This implies that early developments already have small but measurable effects on the result.
For example, the above Figure shows the coefficients from round 0 to round 4. One can see clear differences between the aspects.  Military actions, food and culture are unimportant, or even bad.  Civil actions, science and resources are generally better. We will look closer at these results and analyze them in the actual game context in Section 3.1.

The development of infrastructure does provide strategic lessons.  It tells us when and which aspects are more important to victory. However, such lessons might be a bit vague.  In other words, the Intermediate Status chosen here might be a bit far away from individual moves.  In the next Section, I will consider a different choice.

A Data-Driven Strategy Guide for Through the Ages, Part 1

Index:

1. Introduction (Current Article)
2. Data Analysis
    2.1 Classification: Infrastructure Development (Link to Part 2)
    2.2 Classification: Cards Played (Link to Part 3)
    2.3 Separating players with TrueSkill (Link to Part 6)
3. Analysis for Boardgamers
    3.1 Infrastructure Development (Link to Part 4)
    3.2 Cards Played (Link to Part 5)
    3.3 Mistakes made by Good Players (Link to Part 7)

1. Introduction

1.1 About this game:

Through the Ages (TtA) is a very popular board game first published in 2006.  Its most recent revision (2015) is ranked in the top 3 in the world.  For people who are familiar with boardgames, this page should tell you everything you want to know about it.

For everyone else, here is a small diagram that introduces the concept of general Euro/Economy/Strategy games, and some specific features of this game.
Basically, every player will have access to some resources.  Throughout the game, they decide how to invest those resources.  One option is direct conversion into points, since the player with the most points wins the game in the end.  On the other hand, it is often wiser to invest resources in various infrastructure.  These are the things that continuously help you generate points and resources.  Choosing when and what to invest in is the key to improving your efficiency, and the key to victory.

TtA is a Civilization Simulation game.  You manage resources like food, ore and knowledge; you invest them to develop technologies that improve farms, mines, and various other aspects of your country. Finally, the country with the greatest cultural legacy is the winner.

Typically, a player has to make hundreds of decisions during a game.  The consequence of one decision often remains unclear until ten (or more) decisions later.  Thus, strategy manifests mostly as human intuition and high-level (vague) reasoning.  This is exactly the type of problem where modern data science might be useful.

1.2 A Data-Driven Strategy?

After the tremendous success of AlphaGo, the world knows that AI can play deep strategy games.  It turns out that how an AI plays a game is very similar to how a person does.  We both follow the middle flow chart in the following diagram.

For example, in TtA, you can make a single move to build a farm.  When you do that, it usually comes with a train of reasons: "This will produce food for me, which enables me to increase my population in a future move. Then I can use that population as miners/soldiers/... etc."  Such reasoning probably stops here, because you don't exactly know how that extra miner/soldier helps you win the game.  Therefore, you cannot exactly calculate the actual effect of this farm, nor its difference from your other choices.  Experience and intuition take over here, which gives you a rough feeling of how "good" a farm is, and allows you to move on and evaluate your other options.

The place where explicit derivation stops and intuition takes over is marked as Intermediate Status in the above diagram.  For human beings, the derivation from individual moves to the intermediate status is usually called "tactics".  Analysis from the intermediate status to the final result is often called "strategy".  An AI basically uses two algorithms to perform these two functions.  For example, a Monte-Carlo tree search can go through individual moves and see their outcomes; a Neural Network can learn from millions of examples of the intermediate status and tell you which ones are closer to victories.

I am not taking the right path all the way as a human, nor am I taking the left path all the way as an AI.  I wish to take a diagonal path that goes from top-right to bottom-left.  Therefore, a Data-Driven Strategy Guide should tell us which intermediate statuses are more likely to win, and a human player will be in charge of finding the best tactics to achieve such an intermediate status.

There is a very practical reason why I am doing this. In order for an algorithm to reason from individual moves, one must hard-code all the rules.  GO is a somewhat biased example in that the rules are extremely simple.  The rules of GO can be written in 10 sentences, while the rules for games like TtA are often booklets of 10+ pages.  The "rules" of a real-life problem may not even be fully captured by any finite number of words.  In these kinds of situations, human minds are still superior in creativity and thinking outside the box.  Thus, in general, I find it more natural for humans to come up with "what can be done", and then consult an AI to understand the final consequences of each option.  In other words, AI can give you answers, but we are in charge of asking the right questions.

1.3 Data Source

Boardgaming-online has a very nice implementation of Through the Ages (TtA).  It also keeps journals of all past games.  A journal is a pretty detailed record of a game. Although it is insufficient to reconstruct the entire game, it should contain enough information to offer a glimpse into the strategy.

Using the Python packages Requests and BeautifulSoup, I scraped the content of game journals into Pandas DataFrames. There are more than 100k stored games, and I have managed to scrape 10k+ games so far.  Hopefully these will be enough to provide some insights.
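For the curious, the scraping loop is conceptually very simple. The sketch below is purely illustrative: the journal URL and HTML structure in it are invented placeholders, not the real boardgaming-online layout.

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

JOURNAL_URL = "https://www.boardgaming-online.com/journal"  # hypothetical endpoint

def scrape_journal(game_id):
    """Download one game journal and collect its log lines into a DataFrame."""
    response = requests.get(JOURNAL_URL, params={"game": game_id}, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    rows = [{"game_id": game_id, "entry": tag.get_text(strip=True)}
            for tag in soup.find_all("li")]   # hypothetical: one log entry per <li>
    return pd.DataFrame(rows)
```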

A complete journal has about 5-10 pages, which represents 2-4 data points (depending on the number of players).  The first challenge is to use my knowledge of the game and the interface to parse the journals, in order to obtain simple information that can be fed into machine learning algorithms.  This is a tedious process involving not only standard selections on pandas DataFrames, but also quite a few customized parsing routines. I will not bother the readers with details here.  Let us jump ahead to some "clean" data I extracted from the journals.

1.4 Outline:

In Section 2, we will explore a few different choices of Intermediate Status, and see how Machine Learning can estimate the final result from them. Naively speaking, we want the Intermediate Status to be close enough to the final result, such that the Strategic Evaluation is accurate. On the other hand, we want it to be close enough to individual moves, such that the Tactical Derivation is not too difficult.  GO is again a biased example in that there is a clear choice of Intermediate Status: a 19x19 matrix with 3 possible values at each entry. In TtA, given the form of data we have, even choosing the form of the Intermediate Status is a major challenge.

I will first set up a classification problem and explain how to obtain the training/validation set from the raw data. I will also explain how the Machine Learning algorithms can help us formulate strategy. I will apply this classification problem in Sections 2.1 and 2.2. These are the main sections that Data Scientists might be interested in.


In Section 3, we will analyze the results of Section 2 in the actual context of the game.  This is the main section that boardgamers might be interested in.



First Major Update (02/12/2018)

(Link to Section 2.3)    (Link to Section 3.3)

In Section 2.3, I will introduce another algorithm--TrueSkill. TtA is not an entirely deterministic game. TrueSkill takes that into account and allows us to ask more relevant questions. In addition to classification based on individual game results, we can instead classify behavior based on players' TrueSkill.

In Section 3.3, we look into all cards during Age A and Age I. By cross-referencing the outcomes of individual games with players' skill, we discover a few interesting mistakes made by stronger players.