Designing the opening book for TCEC is an important and difficult task, especially in the later stages. The opening book has conflicting goals. On the one hand, the draw rate should not be too high, so that the games are exciting to watch. On the other hand, openings should not be too biased: each opening is played twice with the engines swapping colors, and if each engine wins once, the result gives no indication of which engine is stronger. So the openings chosen should be biased by just the right amount, so that the stronger engine has a better chance of finding a win when attacking and of finding a draw when defending.
Cato has been responsible for the openings for the last 5 seasons, and lately Jeroen has assisted with the superfinal openings. Cato imposes other restrictions on the openings using his extensive game database, such as a minimum number of games played from the position, more than one reasonable continuation, a low draw rate, etc. It is reasonable to say that the 'TCEC opening space' is a 'small' subset of the possible starting positions, but how small is it? Has TCEC already covered a substantial part of it, and are we likely to see many repeats in the next seasons?
After analyzing the data, my short answers to these questions are 'not so small' and 'no'.
For the analysis I used openings from the following stages:
- Season 6 stage 3 - 112 games
- Season 6 stage 4 - 96 games
- Season 6 superfinal - 64 games
- Season 7 stage 3 - 112 games
- Season 7 stage 4 - 72 games
- Season 7 superfinal - 64 games
- Season 8 stage 3 - 90 games
- Season 8 superfinal - 100 games
- Season 9 stage 3 - 224 games
- Season 9 superfinal - 100 games
- Season 10 stage 2 - 112 games
- Season 10 superfinal - 100 games
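As a quick check on the sample size, the stage totals above can be added up. This is only a sketch: the dictionary keys are labels I made up for readability, and the pair count assumes every opening was played exactly twice with colors reversed. Under that assumption the result agrees with the number of game pairs quoted below for the full book sequences.

```python
# Game counts per stage, copied from the list above.
# The keys are illustrative labels, not official TCEC identifiers.
games_per_stage = {
    "S6 stage 3": 112,
    "S6 stage 4": 96,
    "S6 superfinal": 64,
    "S7 stage 3": 112,
    "S7 stage 4": 72,
    "S7 superfinal": 64,
    "S8 stage 3": 90,
    "S8 superfinal": 100,
    "S9 stage 3": 224,
    "S9 superfinal": 100,
    "S10 stage 2": 112,
    "S10 superfinal": 100,
}

total_games = sum(games_per_stage.values())

# Assumption: each opening is played exactly twice (once per color),
# so the number of game pairs is half the number of games.
game_pairs = total_games // 2

print(total_games, game_pairs)
```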
In the analysis I use the term entropy of a distribution, though the number I actually report is exp(H), where H is the entropy as it is normally defined. Without going into detail, I use this as the effective size of a set of elements with a finite probability distribution. A set of size n with a uniform distribution has an entropy of n; if most of the distribution is concentrated on one element, the entropy is close to 1.
When considering ECO codes and opening names, you should always keep in mind that the engines playing can influence these, especially when the book sequence is relatively short. For Cato's favorite length of 16 plies, the opening is almost always determined by the book. So it is mostly Cato's fault, as it should be.
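To make the definition concrete, here is a minimal sketch of the effective-size calculation. It assumes the "normal" entropy is the Shannon entropy with natural logarithm, H = -sum(p_i * ln p_i), estimated from observed frequencies. The function name `effective_size` and the toy inputs are mine, for illustration only.

```python
import math
from collections import Counter


def effective_size(items):
    """Return exp(H), where H is the Shannon entropy (natural log)
    of the empirical distribution of `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)


# Toy data, not TCEC data.
uniform = ["a", "b", "c", "d"]            # four equally likely elements
skewed = ["a"] * 97 + ["b", "c", "d"]     # almost everything on one element

print(effective_size(uniform))  # approximately 4, the size of the set
print(effective_size(skewed))   # only slightly above 1
```

The same function is what every "entropy" figure in this post would correspond to under that assumption, applied to first letters, ECO codes, opening names, or move sequences.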
ECO, first letter
Entropy = 4.92
Close to a uniform distribution over the five letters A to E, with some preference for Sicilians and flank openings; closed games are a little under-represented.
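The first-letter figure can be sketched as follows, assuming one ECO code per game pair and the exp(Shannon entropy) definition of effective size. The list `eco_codes` is a small made-up sample, not the real TCEC list, so the number it prints is not the 4.92 reported above.

```python
import math
from collections import Counter


def effective_size(items):
    """exp(Shannon entropy, natural log) of the empirical distribution."""
    counts = Counter(items)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)


# Hypothetical sample of ECO codes, one per game pair.
eco_codes = ["B90", "B12", "C45", "A10", "D85", "E60", "B80", "C02", "A45", "E20"]

# Group by the first letter (A to E) and measure how evenly they are used.
by_letter = Counter(code[0] for code in eco_codes)
letter_size = effective_size(code[0] for code in eco_codes)

print(dict(by_letter))
print(round(letter_size, 2))  # at most 5, reached only for a uniform split
```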
ECO, full code
There are 500 different ECO codes. The entropy of the ECO codes in the list is 246.9, and the list covers 312 of the 500 possible codes. The most frequent ECO codes are:
- B90 Sicilian Najdorf, 20 times
- B12 Caro-Kann defense, 14 times
- B80 Sicilian Scheveningen, 14 times
- C02 French advance, 14 times
- C45 Scotch, 14 times
Among the under-represented openings we find:
- Gruenfeld defense (D80-D99), only 7/20 codes represented.
- King's Gambit (C30-C39), 4/10 represented.
- Ruy Lopez (C60-C99), 19/40 represented.
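Coverage figures like the ones above can be computed by enumerating each ECO range and counting how many of its codes were seen. The sketch below assumes a range stays within a single letter. The helper names `eco_range` and `coverage` and the `seen` set are hypothetical, so the printed counts do not reproduce the 7/20, 4/10 and 19/40 above, only the denominators.

```python
def eco_range(first, last):
    """List all ECO codes from `first` to `last` inclusive (same letter)."""
    if first[0] != last[0]:
        raise ValueError("range must stay within one ECO letter")
    letter = first[0]
    return [f"{letter}{n:02d}" for n in range(int(first[1:]), int(last[1:]) + 1)]


def coverage(seen_codes, first, last):
    """Return (codes seen in the range, total codes in the range)."""
    codes = eco_range(first, last)
    hit = sum(1 for code in codes if code in seen_codes)
    return hit, len(codes)


# Hypothetical set of ECO codes that appeared in the books (not TCEC data).
seen = {"D80", "D85", "D97", "C33", "C65", "C88"}

for name, first, last in [
    ("Gruenfeld defense", "D80", "D99"),
    ("King's Gambit", "C30", "C39"),
    ("Ruy Lopez", "C60", "C99"),
]:
    hit, total = coverage(seen, first, last)
    print(f"{name} ({first}-{last}): {hit}/{total} codes represented")
```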
Opening, full name
Analyzing the full opening names, including opening variants, we find a much more uniform distribution. The entropy is 573.3, with 618 distinct openings. The most repeated opening variants are:
- Dutch: 2.c4 Nf6 3.g3 e6 5.Nf3 d5 6.O-O Bd6, 8 times
- Trompowsky: 2...Ne4 3.Bf4 c5 4.f3 Qa5+ 5.c3 Nf6 6.d5, 8 times
- Neo-Gruenfeld: Alekhine's, 7.Be3 O-O, 6 times
- Pirc: 4.Be3, 150 Attack, 6 times
- Sicilian: Kan, Polugaevsky, 6.Nb3 Ba7, 6 times
- Sicilian: Najdorf, 6.f3, 6 times
If we look at the full book sequences, there are 618 distinct sequences out of 623 game pairs, with entropy 616.3. Only 5 book sequences are repeated, each appearing in two game pairs.
I truncated the book sequences at fixed lengths to measure the expansion as the length increases. For short lengths I also list the most frequent sequences.
After 2 plies the entropy is only 12.74, with a total of 34 distinct sequences for 622 game pairs. The leading sequences are:
- 1. d4 Nf6, 24.4%
- 1. e4 c5, 18.2%
- 1. e4 e5, 10.3%
- 1. d4 d5, 10.1%
- 1. e4 e6, 6.9%
The leading sequences after 4 plies are:
- 1. d4 Nf6 2. c4 g6, 8.2%
- 1. e4 c5 2. Nf3 d6, 8.0%
- 1. e4 e5 2. Nf3 Nc6, 7.6%
- 1. d4 Nf6 2. c4 e6, 7.1%
- 1. e4 e6 2. d4 d5, 6.1%
The leading sequences after 6 plies are:
- 1. e4 c5 2. Nf3 d6 3. d4 cxd4, 6.8%
- 1. d4 Nf6 2. c4 g6 3. Nc3 Bg7, 4.8%
- 1. d4 Nf6 2. c4 e6 3. Nc3 Bb4, 3.4%
- 1. e4 e5 2. Nf3 Nc6 3. Bb5 a6, 3.4%
- 1. e4 c5 2. Nf3 e6 3. d4 cxd4, 3.2%
- 1. e4 c5 2. Nf3 Nc6 3. d4 cxd4, 3.2%
After 10 plies the entropy is 328.0, with a total of 432 distinct sequences for 587 game pairs.
After 12 plies the entropy is 442.5, with a total of 487 distinct sequences for 575 game pairs.
After 16 plies the entropy is 523.4, with a total of 530 distinct sequences for 543 game pairs.
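The truncation step can be sketched as below. Two things here are assumptions on my part: that a book shorter than the requested length is simply left out (which would explain why the number of game pairs shrinks as the length grows), and that the effective size is exp(Shannon entropy). The `books` list is a toy example, and `truncate_stats` is a hypothetical helper name.

```python
import math
from collections import Counter


def effective_size(items):
    """exp(Shannon entropy, natural log) of the empirical distribution."""
    counts = Counter(items)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)


def truncate_stats(books, plies):
    """Cut every book to its first `plies` half-moves and summarize.

    Returns (game pairs used, distinct prefixes, effective size, top prefixes).
    Assumption: books shorter than `plies` are dropped, not padded.
    """
    prefixes = [tuple(book[:plies]) for book in books if len(book) >= plies]
    counts = Counter(prefixes)
    return len(prefixes), len(counts), effective_size(prefixes), counts.most_common(3)


# Toy book sequences, one per game pair (not the real TCEC books).
books = [
    ["d4", "Nf6", "c4", "g6"],
    ["d4", "Nf6", "c4", "e6"],
    ["e4", "c5", "Nf3", "d6"],
    ["e4", "c5"],
    ["e4", "e5", "Nf3", "Nc6"],
]

for plies in (2, 4):
    pairs, distinct, size, top = truncate_stats(books, plies)
    print(f"After {plies} plies: {distinct} sequences for {pairs} game pairs, "
          f"effective size {size:.2f}")
```

In the toy data the 2-ply prefixes collapse five books into three sequences, while at 4 plies all remaining books are distinct, which is the same kind of expansion the figures above show.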