
How can I give computer time to the Stockfish community?

Even if you don't know programming yet, you can help the Stockfish developers improve the chess engine by connecting your computer to the fishtest network. Your computer will then play chess games in the background to help the developers test ideas and improvements.

Instructions on how to connect your computer to the fishtest network are given here:

  • [[Running the worker]]
  • [[Running the worker on Windows]]
  • [[Running the worker on Linux]]
  • [[Running the worker on macOS]]
  • [[Running the worker in the Amazon AWS EC2 cloud]]

Can I take my computer off at any time without wasting work?

For SPRT tests, which are by far the most common type, the worker sends an update to fishtest every eight games, so on average you can expect to lose four games when quitting the worker. Four STC games on a 1-core worker represent about 2 minutes of work; four LTC games (which are less common) represent about 12 minutes of work.
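As a back-of-the-envelope check of these numbers (the batch size and the per-game durations below are just the averages quoted above):

Python
# Rough estimate of the work lost when a 1-core worker is stopped mid-batch.
BATCH_SIZE = 8            # games per update sent to fishtest for SPRT tests
STC_GAME_MINUTES = 0.5    # ~2 minutes for 4 games at short time control
LTC_GAME_MINUTES = 3.0    # ~12 minutes for 4 games at long time control

expected_lost_games = BATCH_SIZE / 2   # on average, half a batch is unreported
print(f"expected lost games: {expected_lost_games:.0f}")
print(f"expected lost STC work: {expected_lost_games * STC_GAME_MINUTES:.0f} minutes")
print(f"expected lost LTC work: {expected_lost_games * LTC_GAME_MINUTES:.0f} minutes")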

What is a "residual"?

The statistical models that Fishtest uses are based on the assumption that the pentanomial probabilities (a variation on the win, loss, draw probabilities) are the same for each worker. Therefore for each worker, a "residual" is shown on the overview page of every test. It is a measure of how far the worker deviates from the average. Small deviations are normally just due to statistical fluctuations and these will be colored green. However, if the deviation is exceptionally large then the residual will be colored yellow or even red. If this happens on a regular basis for a particular worker then this may be some cause for concern.
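As a rough illustration of the idea only (this is not Fishtest's actual residual computation; the plain z-score formula, the helper names and the example counts are assumptions made for the sketch):

Python
import math

# Toy illustration of a per-worker "residual": how far a worker's average
# pentanomial score deviates from the pooled average, measured in standard
# deviations. This is NOT Fishtest's exact formula, just the general idea.

# Pentanomial outcomes of a game pair, scored in points:
# 0 (two losses), 0.5 (loss+draw), 1 (win+loss or two draws), 1.5 (win+draw), 2 (two wins)
SCORES = [0.0, 0.5, 1.0, 1.5, 2.0]

def mean_and_var(pentanomial_counts):
    """Number of game pairs, mean and variance of the game-pair score."""
    n = sum(pentanomial_counts)
    mean = sum(s * c for s, c in zip(SCORES, pentanomial_counts)) / n
    var = sum(c * (s - mean) ** 2 for s, c in zip(SCORES, pentanomial_counts)) / n
    return n, mean, var

def residual(worker_counts, pooled_counts):
    """Standardized deviation of one worker from the pooled distribution."""
    n_w, mean_w, _ = mean_and_var(worker_counts)
    _, mean_all, var_all = mean_and_var(pooled_counts)
    std_err = math.sqrt(var_all / n_w)
    return (mean_w - mean_all) / std_err

# Hypothetical counts for the whole test and for one worker:
pooled = [120, 480, 2400, 520, 130]
worker = [2, 10, 60, 11, 2]
print(f"residual = {residual(worker, pooled):+.2f}  (values far outside +/-2 would be suspicious)")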


The following questions are more technical and aimed at potential Stockfish developers:

Can I program or run any test I want?

You should first check whether the test has already been run previously. You can look through the history of past tests by following the corresponding link on the left of Fishtest's main view.

What time-control/method should I use for my test?

Most tests should use the two-stage approach, starting with stage 1, and if that passes, using the reschedule button to create the stage 2 test.

Selecting the type of test according to the stage you are in will configure all the necessary options for you.

(Screenshots: the Stage 1 test creation form, the Reschedule button, and the Stage 2 test creation form.)

What is SPRT?

SPRT stands for sequential probability ratio test. In SPRT we have a null hypothesis that the two engines are equal in strength, and an alternative hypothesis that one of the engines is stronger. SPRT tests the hypothesis with the least expected number of games; that is, we don't fix the number of games to be played in advance. The parameters of the test control the Type 1 and Type 2 errors. Essentially, games are played sequentially, and after each game we update the log-likelihood ratio of the two hypotheses. The test terminates when this value falls below a lower-bound threshold or rises above an upper-bound threshold. The thresholds are calculated from the two error parameters given to the test (please read the paragraph "Testing methodology" on the page [[Creating my first test]] for details).
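For illustration, here is a heavily simplified SPRT loop (Fishtest's real implementation is a pentanomial GSPRT with Elo bounds; the trinomial win/draw/loss model, the fixed draw ratio and the helper names below are assumptions made for this sketch):

Python
import math
import random

# Simplified SPRT sketch: H0 says the new engine is elo0 Elo stronger, H1 says elo1.
# This toy version uses a plain trinomial (win/draw/loss) model with a fixed draw
# ratio; Fishtest's real implementation is a pentanomial GSPRT over game pairs.

def wdl_probs(elo, draw_ratio=0.6):
    """Win/draw/loss probabilities for a given Elo advantage."""
    expected_score = 1.0 / (1.0 + 10.0 ** (-elo / 400.0))
    win = expected_score - draw_ratio / 2.0
    loss = 1.0 - expected_score - draw_ratio / 2.0
    return win, draw_ratio, loss

def sprt(results, elo0=0.0, elo1=2.0, alpha=0.05, beta=0.05):
    """Return 'H1' (pass), 'H0' (fail) or 'continue' for a sequence of results."""
    lower = math.log(beta / (1.0 - alpha))   # accept H0 at or below this
    upper = math.log((1.0 - beta) / alpha)   # accept H1 at or above this
    p0, p1 = wdl_probs(elo0), wdl_probs(elo1)
    llr = 0.0
    for r in results:                        # r = 0 (win), 1 (draw), 2 (loss)
        llr += math.log(p1[r] / p0[r])
        if llr >= upper:
            return "H1"
        if llr <= lower:
            return "H0"
    return "continue"

# Example: simulate games from an engine that is really +1 Elo.
random.seed(0)
games = random.choices([0, 1, 2], weights=wdl_probs(1.0), k=100000)
print(sprt(games))

In practice Fishtest updates the log-likelihood ratio per game pair (pentanomial statistics) rather than per game, which corrects for the correlation between the two games played from the same opening.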

What if my test is tuning parameters?

You can use the NumGames stop rule with 20000 games at TC 10+0.1, and schedule a few tests around the direction you want to tune in. If you find a parameter set that looks good, you can then schedule a two-stage SPRT test.
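When deciding whether a tuning "looks good", compare the Elo estimates and error bars of the fixed-games runs. A minimal sketch of that calculation, using a trinomial approximation rather than Fishtest's pentanomial statistics (the W/D/L counts are made up):

Python
import math

# Elo estimate with a 95% error bar from the W/D/L counts of a fixed-games test.
def elo_with_error(wins, draws, losses):
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Variance of the per-game score, then standard error of the mean score.
    var = (wins * (1.0 - score) ** 2 + draws * (0.5 - score) ** 2 + losses * score ** 2) / n
    std_err = math.sqrt(var / n)

    def to_elo(s):
        return -400.0 * math.log10(1.0 / s - 1.0)

    elo = to_elo(score)
    lo, hi = to_elo(score - 1.96 * std_err), to_elo(score + 1.96 * std_err)
    return elo, (hi - lo) / 2.0

# Hypothetical results of two parameter settings, 20000 games each:
for name, (w, d, l) in {"tune_A": (5350, 9500, 5150), "tune_B": (5280, 9550, 5170)}.items():
    elo, err = elo_with_error(w, d, l)
    print(f"{name}: {elo:+.2f} +/- {err:.2f} Elo")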

How many tries on an idea are too many?

Generally, four or five tries is the limit. It's a good balance between exploring the change and not giving lucky tries too much of a chance to pass.

Can I test a fork of SF?

No. For various reasons, please base your tests on the current SF master.

What is a union patch?

A union is a bundle of patches that individually failed SPRT but finished with a positive or near-positive score. Sometimes retesting the union as a whole passes SPRT. Because of the nature of the approach, and because each individual patch has already failed, a union has some constraints:

  1. Maximum 2 patches per union
  2. Each patch shall be trivial, like a parameter tweak. Patches that add/remove a concept/idea/feature shall pass individually.

How can I test commits N-1, N-2, ... of a branch?

If your branch name is passed_pawn, you can enter passed_pawn^, passed_pawn^^, ... in the branch field of the test submission page at https://tests.stockfishchess.org/tests/run (each ^ refers to the parent of the preceding commit).

The diff of my test seems wrong?

This may happen with complicated git commit histories, most commonly with tests against a base other than master. It is generally caused by the new tag and the base tag sharing common code that was introduced in separate commits rather than in a common ancestor.

This has to do with the fact that git has two different ways of comparing commit ranges, "double dot" and "triple dot".

See https://stackoverflow.com/questions/462974/what-are-the-differences-between-double-dot-and-triple-dot-in-git-com

  • Three-dotted diff, aka ancestry diff: the diff between the common ancestor (merge base) of the two commits and the second commit; useful to see the effects of a merge.
  • Two-dotted diff, aka literal diff: the full diff between the two specified commits, ignoring ancestry (usually what is intended for a fishtest test).

GitHub defaults to the former display, whereas the latter is typically what fishtesters want.
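To check locally which of the two diffs you are looking at, you can run both forms yourself. A minimal sketch, assuming a local clone that contains both the base and the new branch (the branch names are placeholders):

Python
import subprocess

# Compare the "literal" (two-dot) and "ancestry" (three-dot) diffs between a base
# and a new branch. If their stats differ, GitHub's default three-dot view is not
# showing what fishtest will actually test. Branch names are placeholders.
BASE = "master"
NEW = "passed_pawn"

for label, spec in [("two-dot (literal)", f"{BASE}..{NEW}"),
                    ("three-dot (ancestry)", f"{BASE}...{NEW}")]:
    out = subprocess.run(["git", "diff", "--stat", spec],
                         capture_output=True, text=True, check=True).stdout
    print(f"--- {label}: git diff --stat {spec}")
    print(out)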

The conflict can be avoided by maintaining a clean git history. If a test's new tag and base tag share common code, both tags should inherit that common code from a common parent commit. If this holds in your commit history, then you shouldn't have a problem with this "dotted diff" business.

In graphical form, if we want to run a test with the new tag B against base tag A, then this is the correct history format:

Good ✓
master --> A --> B

Both the new tag B and the base tag A share the common code (master plus A), and both tags get that common code from the same commit (A). In this case the ancestry diff and the literal diff are identical.

This is a format that can cause problems:

Bad ❌
       ___ A
     / 
master
     \ ___ A' ___ B

In this case, commits A and A' contain identical code, but they are separate commits with no common ancestry. If you try to run a fishtest test with new tag B and base tag A, they share common code coming from an uncommon ancestor commit, and since GitHub defaults to the "triple dot" ancestry diff, it will show the diff between the two tags as A+B, even though the "double dot" literal diff of merely B is what the tester intends to test. This can be avoided by ensuring the new tag B includes the base tag A in its own history rather than on a separate branch.

How to disable NUMA?

Note for patch authors: when testing patches with more than 8 threads, it is necessary to disable "thread binding" in thread.cpp. Not doing so would have a negative effect on multi-NUMA-node (more than one physical CPU) Windows contributors' machines with more than 8 cores, due to the parallelization of our test scripts for fishtest, and this would bias the statistical value of the test.

The lines to comment out in thread.cpp are the following:

C++
if (Options["Threads"] > 8)
    WinProcGroup::bindThisThread(idx);

See for instance https://github.com/WOnder93/Stockfish/commit/97c95b7cf63ff9211544195f7621091ffbcbb459

Why is the regression test bad?

First, note that regression tests are not actually run to detect regressions. SF quality control is very stringent and regressive patches are very unlikely to make it into master. No, they are run to get an idea of SF's progress over time, which is impressive. See

https://github.com/official-stockfish/Stockfish/wiki/Regression-Tests

But still... what if the Elo outcome of a regression test is disappointingly low? Usually, there is little reason to worry.

  • First of all: wait till the test is finished. Drawing conclusions from an unfinished test is statistically meaningless.

  • Look at the error bars. The previous test may have been a lucky run, and the current one may be an unlucky one. Note that the error bar is for the Elo relative to the fixed release (base); the difference between two such Elo estimates has a statistical error about √2 ≈ 1.4 times larger (roughly 2-3 Elo).

  • SFdev vs SF11: NNUE vs classical evaluation is very sensitive to the hardware mix present at the time of testing. If a fleet of AVX512 workers is present/absent, the measured Elo will be larger/smaller.

  • Error bars are designed to be right 95% of the time. So, conversely, 1 in 20 tests will be an outlier.

  • Selection bias is a book-related effect: patches are more likely to be selected if they perform well with the testing book. When they are retested with a different book, their Elo score may be adversely affected.

  • Elo estimates of single patches (SPRT runs) typically come with large error bars; take this into account when adding up Elo estimates. Furthermore, the Elo estimates of passing patches are biased: SPRT Elo estimates are only unbiased if one takes all patches into account, both passed and failed ones. As a result, the Elo gain measured by a regression test will typically be less than the sum of the estimated Elo gains of the individual patches since the previous regression test.

How to re-enable Travis-CI results in pull requests?

If the Travis-CI results do not show up in pull requests, the maintainers can try points 1 and 3 suggested by user "javeme" in this post comment (revoking access and authorizing it again): https://travis-ci.community/t/github-pr-is-being-built-but-result-is-not-shown/7025/2 .

How to compare opening books

If a book is new, first make a PR against the Stockfish book repo https://github.com/official-stockfish/books and wait for a maintainer to commit it.

Then use the books to run time-odds tests of master vs itself with a fixed number of games and compare the normalized Elo estimates, taking into account the error bars. Don't make the time odds too large, since the aim is to approximate standard testing conditions; on the other hand, you cannot make them too small either, since in that case you will need many games to separate the books. I have had good experiences with tests of 60000 games with 30% time odds. Using this procedure it has been shown that unbalanced books are definitely better than balanced books for engine tests.

  • Do not run SPRT tests. They are a waste of resources for this application.

  • Do not run tests of master vs an earlier version. This may give misleading results as it favors the current book. This effect (selection bias) has been shown to exist several times.

  • This procedure can also be used to evaluate other testing changes (e.g. contempt). For changes that affect the amount of resources used (e.g. TC) one should take the resources into account (the amount of resources used by a test is ~ (game duration)/(normalized Elo)^2).
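For the resource rule of thumb in the last point, a small worked example (all numbers are made up for illustration):

Python
# Rule of thumb from above: the resources needed to resolve a testing change scale
# like (game duration) / (normalized Elo)^2. Toy comparison of two hypothetical
# setups; the durations and normalized Elo values are made up.
def relative_cost(game_duration_s, normalized_elo):
    return game_duration_s / normalized_elo ** 2

setups = {
    "setup A, TC 10+0.1": (30.0, 20.0),   # (average game duration in s, normalized Elo)
    "setup B, TC 20+0.2": (60.0, 25.0),
}

baseline = None
for name, (duration, nelo) in setups.items():
    cost = relative_cost(duration, nelo)
    baseline = baseline if baseline is not None else cost
    print(f"{name}: relative cost {cost / baseline:.2f}")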