Precise Engine Testing - A new method of engine testing

Nov 5, 2009

The common way of testing engines is rather simple: set up the two candidate engines, load a balanced opening database such as the Nunn test suite or a balanced opening book with varied ECO codes, and run the match at tournament or blitz time control. This is what popular engine testing sites such as CEGT or CCRL do. The problem with this, of course, is the accuracy and the time the tournament takes. It takes almost a year to test an engine, and the result still has a +/-10 Elo error margin.

The new method takes a day or less and has a ~0.00 error margin. Yes, we will tackle the world of Micro-Elo :-)




As you see in the image, what I have done is test the engines at fixed, incrementally increasing depths and with a massive number of games: 90000. Now I will explain why this is precise.

If you test engines in a 10-game match and get, for example, +3=5-2, the result is of course very imprecise. To prove this, play another five 10-game matches: the probability of getting the same result in all sets is of course low (suppose we are testing engines independently of opening lines). If we play 500-game sets instead, the probability of getting close results is of course higher than with the 10-game sets. So to solve this we shrink the error margin down to micro-Elo accuracy by using a massive number of games like 90000; the probability of getting the same result across 90000-game sets is of course high.
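To make the effect of the game count concrete, here is a minimal Python sketch that turns a win/draw/loss record into an Elo difference with an approximate 95% margin. It assumes the standard logistic Elo model and a normal approximation for the score; the function names `elo_from_score` and `elo_with_margin` are my own for illustration and are not part of any testing tool.

```python
import math


def elo_from_score(score):
    """Convert a score fraction (strictly between 0 and 1) to an Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins, draws, losses, z=1.96):
    """Return (elo, margin); margin is an approximate 95% half-width in Elo.

    Assumption: games are independent and the mean score is roughly normal.
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    stderr = math.sqrt(var / n)
    # Clamp so the logistic conversion stays defined for tiny samples.
    eps = 1e-9
    low = min(max(score - z * stderr, eps), 1.0 - eps)
    high = min(max(score + z * stderr, eps), 1.0 - eps)
    margin = (elo_from_score(high) - elo_from_score(low)) / 2.0
    return elo_from_score(score), margin


if __name__ == "__main__":
    print(elo_with_margin(3, 5, 2))              # the 10-game example
    print(elo_with_margin(27000, 45000, 18000))  # same proportions, 90000 games
```

The margin shrinks roughly with the square root of the number of games, so with 90000 games it becomes very small, though not literally zero.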

The problem now is that such a huge number of games takes a massive amount of time to play under normal game conditions. We solve this by using the same fixed depth for both sides. Low depths are easy and fast to play: you can finish one game in 20-60 seconds per core. Of course a few low-depth games are insignificant, but the 90000 games compensate for this.
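The time cost is simple arithmetic, sketched below in Python. The helper names `wall_clock_hours` and `cores_needed` are hypothetical, and the sketch assumes games spread perfectly evenly across cores with no overhead.

```python
import math


def wall_clock_hours(games, seconds_per_game, cores):
    """Hours needed to play `games` games when `cores` games run in parallel."""
    return games * seconds_per_game / cores / 3600.0


def cores_needed(games, seconds_per_game, budget_hours):
    """Smallest number of cores that finishes the set within the budget."""
    return math.ceil(games * seconds_per_game / (budget_hours * 3600.0))


if __name__ == "__main__":
    for seconds in (20, 60):
        print(seconds, "s/game:",
              wall_clock_hours(90000, seconds, cores=1), "core-hours,",
              cores_needed(90000, seconds, budget_hours=24), "cores for one day")
```

At 20-60 seconds per game, one set of 90000 games costs 500 to 1500 core-hours, so finishing a set within a day needs roughly 21 to 63 cores.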

The only problem now is depth bias. The nature of the engine's play and of the search tree of course changes at higher plies. We solve this by playing the same game set again, this time at a higher ply, and seeing how the candidate engines behave. Repeat these incremental-ply game sets until the games take longer to finish than your time budget for the test permits.
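The incremental-ply loop can be sketched as follows. This is only an outline under my own assumptions: `play_game_set` is a hypothetical callback that plays one full game set at a given fixed depth and returns its (wins, draws, losses), and the stopping rule (stop when another set of the same duration would not fit) is one possible reading of "until the budget runs out".

```python
import time


def run_incremental_depths(play_game_set, start_depth, budget_seconds,
                           clock=time.monotonic):
    """Play game sets at increasing fixed depth until the time budget is used up.

    play_game_set(depth) is a hypothetical callback supplied by the caller.
    Returns a dict mapping depth -> whatever the callback returned.
    """
    results = {}
    started = clock()
    depth = start_depth
    while True:
        set_started = clock()
        results[depth] = play_game_set(depth)
        set_duration = clock() - set_started
        elapsed = clock() - started
        # A deeper set takes at least as long as the one just played, so stop
        # once another set of that duration would exceed the budget.
        if elapsed + set_duration > budget_seconds:
            break
        depth += 1
    return results
```

The `clock` parameter exists only so the loop can be tried out with a fake clock instead of waiting for real game sets.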

After all of those game sets, we take the resulting Elo values and average them. This is our final Elo, and it determines the strength gain from engine/version x to y (+7.3 Elo in the example above). Of course more game sets are always better. The chances that this final Elo is similar at higher depths and in normal engine testing games are also high (provided the number of games is also massive, and unless your engine has unique terms that only matter at higher depths).
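The averaging step is short enough to sketch in Python. The function names are mine, the conversion assumes the standard logistic Elo model, and the average is a plain unweighted mean over depths, which is how I read the description above.

```python
import math


def elo_from_wdl(wins, draws, losses):
    """Elo difference implied by one game set (logistic model).

    Undefined for a 0% or 100% score, which a large game set should not produce.
    """
    score = (wins + 0.5 * draws) / (wins + draws + losses)
    return -400.0 * math.log10(1.0 / score - 1.0)


def average_elo(results_by_depth):
    """Return (per-depth Elo dict, unweighted mean Elo over all depths).

    results_by_depth maps depth -> (wins, draws, losses) for that game set.
    """
    per_depth = {depth: elo_from_wdl(*wdl)
                 for depth, wdl in results_by_depth.items()}
    return per_depth, sum(per_depth.values()) / len(per_depth)
```

Keeping the per-depth values next to the mean makes it easy to spot a depth where the two engines behave very differently.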

This is good if you are testing minimal changes to an engine: you can track the changes with precise accuracy and log them. I currently use this method in tuning an engine.

In the future I will also write about a new method for opening books/sets in engine testing and about tuning engine eval values, and I will probably release the engine that my friends made and the one I am tuning right now. It is clusterable, so hopefully this will be the first cluster chess engine available to the public :-)


Elite Chess - 1337chess | Copyright © 2009 - 2013