Speed up like patch guidelines are not optimal and could be improved #2593

OuaisBla · 2020-03-22T17:52:56Z

Recently, many decent patch passed STC with honors but didn't make it to LTC with [0.25, 1.75] bounds.

Like some stated earlier, speed up like patch doesn't scale well for LTC test. Speed up enable SF to get to higher depth significantly more on low TC. On higher TC, only logics improvement can really make a difference.

Here are the actual guidelines:

Quoting from https://github.com/glinscott/fishtest/wiki/Creating-my-first-test#non-functional-changes :

We will verify the speed up, for 3 reasons:

The speedup needs to be statistically significant and not just random noise.
The speedup needs to be confirmed on different machines. Sometimes a speedup on one machine is a slowdown on another.
The speedup needs to be put in perspective with the code changes. If the code changes are invasive and/or ugly, only to achieve a small speedup, we will probably not accept the patch. This is subject to our appreciation.
If the above steps are not enough to clarify the opportunity to commit the patch, then the patch will go under normal STC+LTC fishtest tests. The rationale is that a speed-up is totally comparable to a normal patch: it adds complexity with the aim to improve ELO, so it makes sense to test under the same conditions.

So, for me, the guidelines regarding Speed up like patch are not optimal and should be improved. Otherwise, this type of patch won't make it anytime soon and SF will continue to lack of efficiency at low TC which is a very valid use case. Also, maintainer are doing a good job preventing regression in many area, but the lack of appropriate and optimal guidelines for thoses patch leads to unconstructive discussions based on a flawed, not optimal, criteria.

The proposition of some to have:

STC test with LTC bounds [0.25,175] for speed up patch

perfectly make sense and is a step in the right direction for those type of tests. We can also refine the definition of Speed up test to prevent relaxing the LTC rules for test that are not actual speed up to be something like:

"Code change that doesn't change the logic, hence the bench number of must remains the same" Probably stating "non functional" is also enough to clarify. Clearly, that must not be an actual logic improvement nor a simplication or a mix or all of that.

Sopel97 · 2020-03-23T18:44:44Z

Could the fishtest framework be extended to allow testing for speed? For example by running a bench (possibly on some new, chosen positions). This way we could get fast performance change measurments on multiple platforms. (noob's machines being similar and in high numbers may have to be accounted for.)

This would have 3 effects:

It would encourage more speedup attempts.
It would reduce the amount of work needed per patch.
It would allow including smaller gains that cannot be measured through sprt.

MichaelB7 · 2020-03-23T20:55:45Z

Excellent suggestion !

Vizvezdenec · 2020-03-23T21:03:22Z

I don't like measuring speedups with strict LTC bounds.
Speedups in general suck on LTC. I propose to do normal double SPRT tests for them but with STC on both bounds.

vondele · 2020-03-23T22:18:50Z

So, we have 3 recent speedup patches that I'd like to refer to in this issue (see references above). I do agree that we need some more clarity, so let's consider some of the arguments:

speedup patches typically add complexity... like many other patches. So, we should use the standard rules, as used for other Elo gaining patches.
speedup patches are generally safe... they do not change search / eval. So we should use relaxed acceptance rules (to be specified below).

The first argument leads to an easy rule to follow. Passing our STC + LTC tests is not easy, and doing so solidly shows that there is a benefit. This makes discussion on the merit of the patch a rare thing (only for very invasive patches). It is hard to achieve a speedup that manages this... but this is similar for many of the eval/search patches as well.

The second argument can be understood as well. A speedup will be good at any TC, and as such is risk-free, contrary to e.g. an eval/search patch. However, the benefit is likely to become smaller with increasing TC, as Elo grows more slowly with increased speed at higher TC. Also, while the risk wrt. playing strength is small, there is a risk concerning the maintenance of the code, the evolution of the code, or the chance of introducing subtle bugs.

So, the latter argument raises two questions:

how to establish a speedup? Measuring the time of a bench is an obvious approach. It turns out that measuring with the accuracy we're talking about (0.5 - 2.0%) is actually a hard thing. Requires careful testing, including many runs, dedicated hardware, etc, and quite often depends on the testing environment used (compiler version, OS, hardware, ...). I've personally done it wrong many times.. We could develop a better infrastructure for this (e.g. add to fishtest), but that actually requires quite some human time. Therefore, just using our established infrastructure, i.e. fishtest playing games, is an easy solution. How do we test reliably with games? Passing a single SPRT test is not reliable enough IMO. Right now, we have ~20% chance to pass a single {-0.5, 1.5} STC test with a 0 Elo patch. A double SPRT test reduces that significantly if we use the stricter {0.25, 1.75} bounds.
what's the right balance added speedup vs added complexity? Once we have reliably measured speedup, the question is, what % speedup is worth a given complexity? We could compare to our SPRT bounds for guidance... Speedups that are simplifications could be taken with roughly 0% speedup. Speedups that would add significant complexity would need 1 Elo at LTC. I don't have precise number on how this Elo vs speedup translates currently, but will try to measure it (using TC odds), it will be around 1-2% speedup ~ 1 Elo at LTC, I assume. At lower TC, 1-2% speedup will be worth more than 1Elo. We could argue that really easy speedups could pass with lower requirements, but we're not doing that for eval/search patches either.

Having written all this, I feel that our current guidelines do make sense... they give some flexibility to wave 'trivial' speedup patches through (keeping in mind the caveats of speedup measurements), while they apply standard rules in cases the patches are not considered trivial.

OuaisBla · 2020-03-24T07:09:13Z

Here's a nice reference to a previous discussion regarding speed up measurement. Nothing conclusive with many question left opened though but instructive. Its a reccuring topic of course.

#1422

Please note that Marco Costalba itself claimed test succeed with [-4,0] bounds for a .32 ELO gain. Mine was .72 ELO gain with [.25,.1.75] and was labelled as fail. Again, not arguing against the people judgement here nor complaining. But I want to highlight the fact that unappropriate guidelines lead to subjective appreciation that could go either way. Hence I reiterate the importance to try to clear that issue by improving the guidelines to the benefit of a better, stronger engine.

OuaisBla · 2020-03-24T07:35:21Z

Fishcooking is full of usefull info.

https://groups.google.com/forum/m/?fromgroups#!searchin/fishcooking/Speed$20up/fishcooking/gdfixOXME78

Specially in this one where it is stated that LTC are not required for speed up.

Why did that changes?

OuaisBla · 2020-03-24T07:43:40Z

One last:

Reliable speed comparison: some math required

https://groups.google.com/forum/?nomobile=true#!searchin/fishcooking/speed$20up$20marco%7Csort:date/fishcooking/XYsHQvLF6xk/H0clbsMNBgAJ

OuaisBla · 2020-03-24T07:49:44Z

Summarizing the actual discussion including previous one found on fishcooking, for speed up like patch, we have:

STC [-0.5, 1.50] Could lead to false positive up to 20% of the time
LTC[0.25,1.75] Never make it green

Previous guinelines clearly stated: no LTC required.

This lead to me to update the actual guidelines to:

STC [-0.5, 1.50], then STC [0.25, 1.75]
No LTC.

And we are good to go. Fair enough?

MJZ1977 · 2020-03-24T14:16:42Z

At least to have an information on LTC ELO gain and verify it is still OK, I will suggest
STC [-0.5, 1.5]
LTC [-1.5, 0.5] (like non-regression)

Here 2 tests of the same speed up patch with 2 different LTC bounds:
https://tests.stockfishchess.org/tests/view/5e73924fe42a5c3b3ca2e79a
Elo +0.62 [-1.48,2.64] (95%)
https://tests.stockfishchess.org/tests/view/5e74e4e6e42a5c3b3ca2e819
Elo +0.78 [-0.25,1.91] (95%)

The Elo assessment is still good with [-1.5, 0.5]

OuaisBla · 2020-03-24T14:24:12Z

I agree. The ELO assessment should give a clue on how the speed up scales.

This could be added as well. I still beleive that testing STC with LTC bounds could prevent unwanted false positive though.

MJZ1977 · 2020-03-24T15:26:53Z

STC [0.25, 1.75] is not easy.

Perhaps STC [0, 1.5] is balanced :
https://tests.stockfishchess.org/html/SPRTcalculator.html?elo-0=0&elo-1=1.5&draw-ratio=0.61&rms-bias=0

OuaisBla · 2020-03-24T15:47:53Z

Seems reasonable to me. +1

I'd like to add that having specifics test bounds for speed up patch clearly indicated that thoses are in a class of there own, not to be mixed with other type of patches. A plus.

vondele · 2020-03-24T16:12:24Z

I'm still in the process of getting some data to map speedups to Elo... I think this is a first step before we rush to establishing things. No need to rush.

OuaisBla · 2020-03-24T16:48:27Z

Agreed, no need to rush. Always best to analyse first. Just curious, what are you using as data set and methodology to map that? Can I help in any way?

vondele · 2020-03-24T17:02:59Z

I'll just be using games with time odds. So, tc=11.600+0.116 vs tc=10.000+0.100 is equivalent to a 16% speedup. That's right now the only data point I have, this is 32.4 +- 3.1 Elo.

vondele · 2020-03-24T17:04:14Z

I'll be doing that for 16%, 8% and 4% speedup both at STC and LTC (20k games at STC, maybe a bit less at LTC). This should allow to get some interpolation near ~1% speedup / 0.5 - 1 Elo.

OuaisBla · 2020-03-24T17:19:08Z

Thanks a lot.

joergoster · 2020-03-24T17:55:35Z

Just a thought: how about playing 200,000 games at STC? Or maybe even 400,000 games?
Drawing any serious elo assessment from SPRT tests is somewhat questionable.

vondele · 2020-03-25T07:27:07Z

so, results of the measurements in graphical form:

for the small speedups (<~5%) the linear estimate can be used that gives Elo gain as a function of speedup percentage (x) as:

Elo_stc(x) = 2.10 x
Elo_ltc(x) = 1.43 x

Thus, it seems that, to have 50% passing chance at STC{-0.5,1.5}, we need a 0.24% speedup, while at LTC{0.25,1.75} we need 0.70% speedup. A 1% speedup has nearly 85% passing chance at LTC.

and raw data:

tc 10+0.1:
16   32.42  3.06
 8   13.67  3.05
 4    8.99  3.04
 2    3.52  3.05

tc 60+0.6:
16   20.85  2.59
 8   12.20  2.57
 4    4.67  2.57

note that numbers will vary with hardware.

NKONSTANTAKIS · 2020-03-27T04:57:37Z

Time odds and speedup is not the same. Speedup only gains time when the engine is running, while extra time applies a benefit in addition to every possible non operating state due to internal and external delays (gui, OS, latencies etc).

Im sure that a fast test of speedup vs time odds will show it immediately. Logic and experience also cant believe an 1% speedup to be worth 1.3 LTC elo.

MJZ1977 · 2020-03-27T16:12:57Z

Thanks @vondele for these interessant results. But how can we conclude now ?

vondele · 2020-03-27T18:23:31Z

I propose we essentially stick to the current guidelines (which I'll try to formulate a little more cleanly in the wiki). Very simple speedups (obviously correct, no additional book-keeping, concise, etc) can be proposed on the basis of well done speedup measurements that are verified independently if they are in the range ~0.50%. Larger patches (and the three PRs linked to in this issue fall in this category), require standard testing (and thus speedups >0.7%). As with all patches, even passing that is no guarantee the PR will be merged, but it's a solid argument. Where to draw the line between these cases remains ultimately a decision of the maintainer.

OuaisBla · 2020-03-28T02:35:49Z

@vondele This is good work. The LTC curve seems a bit high to me, but without more data, I can't argue. The patch I submitted for instance shows around 1% speed up on my machine (Windows with BMI2 support compile with GCC 9.3.0 on Cygwin). It doesn't make it to fishtest though even with a STC showing up to 6 ELO gain. I can't explain this with your number. Maybe we are in the situation of confirmation bias case, but I have nothing more to add. So lets move forward with this. I'm not sure any speed up will be able to make it though from this day forward. Time will tell..

This was referenced Mar 23, 2020

Speedup OppositBishop calculations #2592

Closed

Speedup complexity #2591

Closed

vondele mentioned this issue Mar 23, 2020

Remplace slower function for pseudo_legal evaluation. #2581

Closed

OuaisBla closed this as completed Mar 28, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up like patch guidelines are not optimal and could be improved #2593

Speed up like patch guidelines are not optimal and could be improved #2593

OuaisBla commented Mar 22, 2020 •

edited

Loading

Sopel97 commented Mar 23, 2020 •

edited

Loading

MichaelB7 commented Mar 23, 2020

Vizvezdenec commented Mar 23, 2020

vondele commented Mar 23, 2020

OuaisBla commented Mar 24, 2020 •

edited

Loading

OuaisBla commented Mar 24, 2020

OuaisBla commented Mar 24, 2020

OuaisBla commented Mar 24, 2020

MJZ1977 commented Mar 24, 2020

OuaisBla commented Mar 24, 2020

MJZ1977 commented Mar 24, 2020 •

edited

Loading

OuaisBla commented Mar 24, 2020 •

edited

Loading

vondele commented Mar 24, 2020

OuaisBla commented Mar 24, 2020

vondele commented Mar 24, 2020

vondele commented Mar 24, 2020

OuaisBla commented Mar 24, 2020

joergoster commented Mar 24, 2020

vondele commented Mar 25, 2020

NKONSTANTAKIS commented Mar 27, 2020

MJZ1977 commented Mar 27, 2020

vondele commented Mar 27, 2020

OuaisBla commented Mar 28, 2020

Speed up like patch guidelines are not optimal and could be improved #2593

Speed up like patch guidelines are not optimal and could be improved #2593

Comments

OuaisBla commented Mar 22, 2020 • edited Loading

Sopel97 commented Mar 23, 2020 • edited Loading

MichaelB7 commented Mar 23, 2020

Vizvezdenec commented Mar 23, 2020

vondele commented Mar 23, 2020

OuaisBla commented Mar 24, 2020 • edited Loading

OuaisBla commented Mar 24, 2020

OuaisBla commented Mar 24, 2020

OuaisBla commented Mar 24, 2020

MJZ1977 commented Mar 24, 2020

OuaisBla commented Mar 24, 2020

MJZ1977 commented Mar 24, 2020 • edited Loading

OuaisBla commented Mar 24, 2020 • edited Loading

vondele commented Mar 24, 2020

OuaisBla commented Mar 24, 2020

vondele commented Mar 24, 2020

vondele commented Mar 24, 2020

OuaisBla commented Mar 24, 2020

joergoster commented Mar 24, 2020

vondele commented Mar 25, 2020

NKONSTANTAKIS commented Mar 27, 2020

MJZ1977 commented Mar 27, 2020

vondele commented Mar 27, 2020

OuaisBla commented Mar 28, 2020

OuaisBla commented Mar 22, 2020 •

edited

Loading

Sopel97 commented Mar 23, 2020 •

edited

Loading

OuaisBla commented Mar 24, 2020 •

edited

Loading

MJZ1977 commented Mar 24, 2020 •

edited

Loading

OuaisBla commented Mar 24, 2020 •

edited

Loading