Small speedup in incremental accumulator updates #5576

MinetaS · 2024-09-07T19:46:48Z

Instead of updating at most two accumulators, update all accumluators during incremental updates. Tests have shown that this change yields a small speedup of at least 0.5%, and up to 1% with shorter TC.

Passed STC:
LLR: 2.93 (-2.94,2.94) <0.00,2.00>
Total: 54368 W: 14179 L: 13842 D: 26347
Ptnml(0-2): 173, 6122, 14262, 6449, 178
https://tests.stockfishchess.org/tests/view/66db038a9de3e7f9b33d1ad9

Passed 5+0.05:
LLR: 2.98 (-2.94,2.94) <0.00,2.00>
Total: 55040 W: 14682 L: 14322 D: 26036
Ptnml(0-2): 303, 6364, 13856, 6664, 333
https://tests.stockfishchess.org/tests/view/66dbc325dc53972b68218ba7

Passed non-regression LTC:
LLR: 2.95 (-2.94,2.94) <-1.75,0.25>
Total: 57390 W: 14555 L: 14376 D: 28459
Ptnml(0-2): 37, 5876, 16683, 6069, 30
https://tests.stockfishchess.org/tests/view/66dbc30adc53972b68218ba5

No functional change

Instead of updating at most two accumulators, update all accumluators during incremental updates. Tests have shown that this change yields a small speedup of at least 0.5%, and up to 1% with shorter TC. Passed STC: LLR: 2.93 (-2.94,2.94) <0.00,2.00> Total: 54368 W: 14179 L: 13842 D: 26347 Ptnml(0-2): 173, 6122, 14262, 6449, 178 https://tests.stockfishchess.org/tests/view/66db038a9de3e7f9b33d1ad9 Passed 5+0.05: LLR: 2.98 (-2.94,2.94) <0.00,2.00> Total: 55040 W: 14682 L: 14322 D: 26036 Ptnml(0-2): 303, 6364, 13856, 6664, 333 https://tests.stockfishchess.org/tests/view/66dbc325dc53972b68218ba7 Passed non-regression LTC: LLR: 2.95 (-2.94,2.94) <-1.75,0.25> Total: 57390 W: 14555 L: 14376 D: 28459 Ptnml(0-2): 37, 5876, 16683, 6069, 30 https://tests.stockfishchess.org/tests/view/66dbc30adc53972b68218ba5 No functional change

peregrineshahin · 2024-09-08T02:53:27Z

Wow, congrats on the speedup!
I'm curious to ask a couple of questions...
Could you explain the rationale about how this idea came to mind, or works?
I wonder if this option is a new one...
Does such a change align well with what we think we know best about NNUE, or a surprising one to you?

peregrineshahin · 2024-09-08T03:35:24Z

@peregrineshahin unclear what you're even asking in your last two questions

the questions are general, I'm good with any reply about this patch.

xu-shawn · 2024-09-08T03:58:42Z

@peregrineshahin unclear what you're even asking in your last two questions

They make perfect sense to me

MinetaS · 2024-09-08T05:13:06Z

I was about to sum up the logics behind but I had to sleep first, thanks peregrine for the questions. :)

Could you explain the rationale about how this idea came to mind, or works?

This is the current update strategy:

Let's assume there are not computed accumulators between two accumulators (that we are trying to update), and only State[1] and State[5] are computed. Because SF search is limited-depth DFS, starting from ply = 6 accumulators will be updated in the following order:

2/6, 3/7, 4/8, 5/9, 10, 11

As you see, for a few iterations, SF has to write to two different accumulators that are not close to each other, which is possibly not cache-friendly. Now, this patch updates all accumluators up to pos.state() successively, which could be slower for one single position but benefits the overall search.

Note that this explanation is not what I had in my mind first; I'd just been trying different approaches to find something works.

Does such a change align well with what we think we know best about NNUE, or a surprising one to you?

It depends on the shape of search tree and update/refresh cost model. I think experimenting with update/refresh cost model further is a good direction to improve this patch.

cj5716 · 2024-09-08T07:06:50Z

I've been making some attempts to improve update behaviour for some time, and I think there are some optimisations we can make on this PR. I will try and launch some tests in the coming days. Good job on this PR!

PS. in my head the main saving is the fact that we save multiple adds and subs, i.e. if 1 is computed and current accumulator is 5, at the moment we will update 2 and 5. However, what is to note is that 3 and 4 are computed in the intermediate steps to update 5, ie. this PR saves some extra recomputation.

MinetaS · 2024-09-08T07:51:59Z

PS. in my head the main saving is the fact that we save multiple adds and subs, i.e. if 1 is computed and current accumulator is 5, at the moment we will update 2 and 5. However, what is to note is that 3 and 4 are computed in the intermediate steps to update 5, ie. this PR saves some extra recomputation.

Eventually 3 and 4 are computed in later evaluations so I thought the cost itself is equal. Also if the search tree is not wide (i.e. multiple moveCount=1 nodes and they are all beta cutoff), or the node is leaf, it actually becomes worse than what we do currently.

cj5716 · 2024-09-08T09:46:37Z

Eventually 3 and 4 are computed in later evaluations so I thought the cost itself is equal.

if you think about it, it's more like:
old:
update 2 and 5:
1 -> apply 2 -> store 2 -> apply 3 -> apply 4 -> apply 5 -> store 5
update 3 and 4:
2 -> apply 3 -> store 3 -> apply 4 -> store 4

if you look at it, "apply 3" and "apply 4" are called twice

vs patch:
1 -> apply 2 -> store 2 -> apply 3 -> store 3 -> apply 4 -> store 4 -> apply 5 -> store 5

although, this is assuming the worst case so it makes sense why the extra savings don't gain as much as one might think in the theoretical sense

MinetaS · 2024-09-08T09:57:50Z

if you think about it, it's more like: old: update 2 and 5: 1 -> apply 2 -> store 2 -> apply 3 -> apply 4 -> apply 5 -> store 5 update 3 and 4: 2 -> apply 3 -> store 3 -> apply 4 -> store 4

if you look at it, "apply 3" and "apply 4" are called twice

vs patch: 1 -> apply 2 -> store 2 -> apply 3 -> store 3 -> apply 4 -> store 4 -> apply 5 -> store 5

Yes that's correct as well, although I'm not sure if populating removed/added lists is slow to that extent. Might be worth to check where the gains are from to optimize this even further.

Instead of updating at most two accumulators, update all accumluators during incremental updates. Tests have shown that this change yields a small speedup of at least 0.5%, and up to 1% with shorter TC. Passed STC: LLR: 2.93 (-2.94,2.94) <0.00,2.00> Total: 54368 W: 14179 L: 13842 D: 26347 Ptnml(0-2): 173, 6122, 14262, 6449, 178 https://tests.stockfishchess.org/tests/view/66db038a9de3e7f9b33d1ad9 Passed 5+0.05: LLR: 2.98 (-2.94,2.94) <0.00,2.00> Total: 55040 W: 14682 L: 14322 D: 26036 Ptnml(0-2): 303, 6364, 13856, 6664, 333 https://tests.stockfishchess.org/tests/view/66dbc325dc53972b68218ba7 Passed non-regression LTC: LLR: 2.95 (-2.94,2.94) <-1.75,0.25> Total: 57390 W: 14555 L: 14376 D: 28459 Ptnml(0-2): 37, 5876, 16683, 6069, 30 https://tests.stockfishchess.org/tests/view/66dbc30adc53972b68218ba5 closes official-stockfish/Stockfish#5576 No functional change

vondele added simplification A simplification patch 🚀 gainer Gains elo no-functional-change labels Sep 8, 2024

Disservin added the to be merged Will be merged shortly label Sep 9, 2024

Disservin closed this in 2680c9c Sep 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Small speedup in incremental accumulator updates #5576

Small speedup in incremental accumulator updates #5576

MinetaS commented Sep 7, 2024

peregrineshahin commented Sep 8, 2024

peregrineshahin commented Sep 8, 2024 •

edited

Loading

xu-shawn commented Sep 8, 2024

MinetaS commented Sep 8, 2024

cj5716 commented Sep 8, 2024

MinetaS commented Sep 8, 2024

cj5716 commented Sep 8, 2024

MinetaS commented Sep 8, 2024

Small speedup in incremental accumulator updates #5576

Small speedup in incremental accumulator updates #5576

Conversation

MinetaS commented Sep 7, 2024

peregrineshahin commented Sep 8, 2024

peregrineshahin commented Sep 8, 2024 • edited Loading

xu-shawn commented Sep 8, 2024

MinetaS commented Sep 8, 2024

cj5716 commented Sep 8, 2024

MinetaS commented Sep 8, 2024

cj5716 commented Sep 8, 2024

MinetaS commented Sep 8, 2024

peregrineshahin commented Sep 8, 2024 •

edited

Loading