## Friday, January 5, 2018

### Opening analysis, all seasons, late stages

Earlier in the season I accidentally came across a game whose opening moves were identical to a game in a previous TCEC season. This led me to think about whether TCEC openings have special properties that increase the probability of such repeats. I have been reporting on the openings as part of my statistics reports for each stage since season 8, but I have never looked at all the seasons and stages together.

Designing the opening book for TCEC is an important and difficult task, especially in the later stages. The opening book has conflicting goals. On the one hand, the draw rate should not be too high, so that the games are exciting to watch. On the other hand, openings should not be too biased: games are played twice with the engines swapping colors, and if each engine wins once the result gives no indication of which engine is stronger. So the openings chosen should be biased by just the right amount, where a stronger engine has a better chance of finding a win when attacking, and of finding a draw when defending.

Cato has been responsible for the openings for the last 5 seasons, and lately Jeroen has assisted with the superfinal openings. Cato imposes other restrictions on the openings using his extensive game database, such as a minimum number of games played from the position, more than one reasonable continuation, a low draw rate, etc. It is reasonable to say that the 'TCEC opening space' is a 'small' subset of the possible starting positions, but how small is it? Has TCEC already covered a substantial part of it, and are we likely to see many repeats in the coming seasons?

After analyzing the data, my short answers to these questions are 'not so small' and 'no'.

For the analysis I used openings from the following stages:
• Season 6 stage 3 - 112 games
• Season 6 stage 4 - 96 games
• Season 6 superfinal - 64 games
• Season 7 stage 3 - 112 games
• Season 7 stage 4 - 72 games
• Season 7 superfinal - 64 games
• Season 8 stage 3 - 90 games
• Season 8 superfinal - 100 games
• Season 9 stage 3 - 224 games
• Season 9 superfinal - 100 games
• Season 10 stage 2 - 112 games
• Season 10 superfinal - 100 games
Total = 1246 games in 623 reverse pairs
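
As a quick sanity check on the totals, the stage sizes listed above can be summed in a few lines of Python. The dictionary simply restates the list; the variable names are illustrative only.

```python
# Games per stage, copied from the list above.
games_per_stage = {
    "Season 6 stage 3": 112,
    "Season 6 stage 4": 96,
    "Season 6 superfinal": 64,
    "Season 7 stage 3": 112,
    "Season 7 stage 4": 72,
    "Season 7 superfinal": 64,
    "Season 8 stage 3": 90,
    "Season 8 superfinal": 100,
    "Season 9 stage 3": 224,
    "Season 9 superfinal": 100,
    "Season 10 stage 2": 112,
    "Season 10 superfinal": 100,
}

total_games = sum(games_per_stage.values())
# Each opening is played twice with colors reversed, so pairs = games / 2.
reverse_pairs = total_games // 2

print(total_games, reverse_pairs)  # prints "1246 623"
```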

In the analysis I use the term entropy of a distribution, though the number I calculate is exp(entropy) as entropy is normally defined. Without going into detail, I use this as the effective size of a set of elements with a finite probability distribution. A set of size n with a uniform distribution has an entropy of n; if most of the distribution is concentrated on one element, the entropy is close to 1.
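
To make the definition concrete, here is a minimal sketch of the quantity described above: exp(H), where H = -sum(p_i * ln(p_i)) is the Shannon entropy of the observed frequencies. The function name `effective_size` and the sample lists are hypothetical and only serve to illustrate the two limiting cases mentioned in the text.

```python
import math
from collections import Counter

def effective_size(items):
    """Return exp(Shannon entropy) of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)

# Uniform distribution over 4 elements: the effective size is 4.
uniform = ["a", "b", "c", "d"]

# Distribution concentrated on one element: the effective size is close to 1.
skewed = ["a"] * 97 + ["b", "c", "d"]

print(round(effective_size(uniform), 2))
print(round(effective_size(skewed), 2))
```

Using the exponential rather than the raw entropy keeps the number on the same scale as a count, which is why it can be compared directly with the number of distinct elements.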

When considering ECO codes and opening names you should always keep in mind that the engines playing can influence these, especially when the book sequence is relatively short. For Cato's favorite length of 16 plys the opening is almost always determined by the book. So, mostly Cato's fault as it should be.

#### ECO, first letter

Entropy = 4.92
Close to a uniform distribution (with five letters the maximum is 5), with some preference for Sicilians and flank openings; closed games are a little under-represented.

#### ECO, full code

There are 500 different ECO codes. The entropy of the ECO codes in the list is 246.9, covering 312 different codes out of the possible 500. The most frequent ECO codes are:
• B90 Sicilian Najdorf, 20 times
• B12 Caro-Kann defense, 14 times
• B80 Sicilian Scheveningen, 14 times
• C02 French advance, 14 times
• C45 Scotch, 14 times
ECO codes are usually defined by short opening sequences, much shorter than most TCEC book sequences. The B90 code, for example, is defined by 10 plys, quite long for an ECO code. In TCEC the 10 game pairs with code B90 had book lengths of 11-21 plys. There were a total of 7 different named variants, one repeated twice and one three times. Only one book sequence was a full book repeat: Season 8 stage 3 games 14 and 29 (Gull-Protector) and Season 9 stage 3 games 2 and 30 (Andscacs-Rybka).

Among the under-represented openings we find:
• Gruenfeld defense (D80-D99), only 7/20 codes represented.
• King's Gambit (C30-C39), 4/10 represented.
• Ruy Lopez (C60-C99), 19/40 represented.
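
Coverage figures like the ones above can be obtained by counting how many codes of a range appear at least once. The following is a minimal sketch, assuming the ECO code of each game pair is available as a string such as "D85". The function names and the sample codes are hypothetical and are not taken from the TCEC data.

```python
def eco_range(letter, first, last):
    """All ECO codes from e.g. D80 to D99, inclusive."""
    return {f"{letter}{n:02d}" for n in range(first, last + 1)}

def coverage(seen_codes, letter, first, last):
    """Return (codes of the range that were seen, size of the range)."""
    wanted = eco_range(letter, first, last)
    return len(wanted & set(seen_codes)), len(wanted)

# Hypothetical sample, for illustration only.
seen = ["D80", "D85", "D85", "B90", "D97"]

represented, possible = coverage(seen, "D", 80, 99)
print(f"{represented}/{possible} codes represented")
```

Repeated codes are counted once because the comparison is done on sets, so the result measures breadth of coverage and not frequency.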

#### Opening, full name

Analyzing the full opening names including opening variants, we find there is a much more uniform distribution. The entropy is 573.3, with 618 distinct openings. The most repeated opening variants are:
• Dutch: 2.c4 Nf6 3.g3 e6 5.Nf3 d5 6.O-O Bd6, 8 times
• Trompowsky: 2...Ne4 3.Bf4 c5 4.f3 Qa5+ 5.c3 Nf6 6.d5, 8 times
• Neo-Gruenfeld: Alekhine's, 7.Be3 O-O, 6 times
• Pirc: 4.Be3, 150 Attack, 6 times
• Sicilian: Kan, Polugaevsky, 6.Nb3 Ba7, 6 times
• Sicilian: Najdorf, 6.f3, 6 times

#### Book sequences

If we look at the full book sequences, there are 618 distinct sequences out of 623 game pairs, with entropy 616.3. Only 5 book sequences are repeated in two game pairs.

I truncated the book sequences at fixed lengths to measure the expansion as the length increases. For short lengths I also list the most frequent sequences.
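
The truncation step can be sketched roughly as follows. This is an illustration under my own assumptions, not the exact script used: each book is taken to be a list of moves (one entry per ply), and a game pair whose book is shorter than the truncation length is skipped, which would be consistent with the number of game pairs shrinking as the length grows. The name `truncation_stats` and the toy books are hypothetical.

```python
import math
from collections import Counter

def truncation_stats(books, plies):
    """Cut every book to `plies` half-moves and summarize the result.

    Returns (distinct sequences, exp(entropy), game pairs counted).
    Books shorter than `plies` are skipped (an assumption of this sketch).
    """
    prefixes = [tuple(b[:plies]) for b in books if len(b) >= plies]
    counts = Counter(prefixes)
    total = len(prefixes)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return len(counts), math.exp(entropy), total

# Toy books for illustration only, not real TCEC data.
books = [
    ["e4", "c5", "Nf3", "d6"],
    ["e4", "c5", "Nf3", "e6"],
    ["e4", "e5", "Nf3", "Nc6"],
    ["d4", "Nf6"],
]

for plies in (2, 4):
    distinct, size, pairs = truncation_stats(books, plies)
    print(f"{plies} plies: {distinct} sequences, entropy {size:.2f}, {pairs} pairs")
```

In the toy data the first two books share their first two plys, so at length 2 there are 3 distinct sequences among 4 pairs; at length 4 the short book drops out and the remaining 3 are all distinct.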

After 2 plys the entropy is only 12.74, a total of 34 sequences for 622 game pairs. The leading sequences are:
• 1. d4 Nf6, 24.4%
• 1. e4 c5, 18.2%
• 1. e4 e5, 10.3%
• 1. d4 d5, 10.1%
• 1. e4 e6, 6.9%
After 4 plys the entropy is 50.0, a total of 133 sequences for 622 game pairs. The leading sequences are:
• 1. d4 Nf6 2. c4 g6, 8.2%
• 1. e4 c5 2. Nf3 d6, 8.0%
• 1. e4 e5 2. Nf3 Nc6, 7.6%
• 1. d4 Nf6 2. c4 e6, 7.1%
• 1. e4 e6 2. d4 d5, 6.1%
After 6 plys the entropy is 132.9, a total of 260 sequences for 620 game pairs. The leading sequences are:
• 1. e4 c5 2. Nf3 d6 3. d4 cxd4, 6.8%
• 1. d4 Nf6 2. c4 g6 3. Nc3 Bg7, 4.8%
• 1. d4 Nf6 2. c4 e6 3. Nc3 Bb4, 3.4%
• 1. e4 e5 2. Nf3 Nc6 3. Bb5 a6, 3.4%
• 1. e4 c5 2. Nf3 e6 3. d4 cxd4, 3.2%
• 1. e4 c5 2. Nf3 Nc6 3. d4 cxd4, 3.2%
After 8 plys the entropy is 230.2, a total of 360 sequences for 604 game pairs.
After 10 plys the entropy is 328.0, a total of 432 sequences for 587 game pairs.
After 12 plys the entropy is 442.5, a total of 487 sequences for 575 game pairs.
After 16 plys the entropy is 523.4, a total of 530 sequences for 543 game pairs.