
fix(perf): benchmark processMutations/addList worst case performance #1300

Draft · wants to merge 5 commits into master from md-addList-worst-case-optimization

Conversation

mdellanoce (Contributor)

In reference to this slack thread: https://rrweb.slack.com/archives/C01BYDC5C93/p1693333759752269

This is meant as a starting point for discussion, not necessarily merge-ready code. I added a benchmark demonstrating the "bad" behavior, and a naive attempt at a fix.

Before fix: [benchmark screenshot]

After fix: [benchmark screenshot]

The issue with the fix as-is is that it makes performance slightly worse in the average case, so the extra pass through addList to re-order the list nodes probably needs to be done conditionally, or maybe there's a way to build the list more efficiently and avoid this altogether.

So let me know what you'd like to see from here. I could change this PR to only include the benchmark and open an issue to track the performance fix for this case, for example.

changeset-bot (bot) commented Sep 7, 2023

⚠️ No Changeset found

Latest commit: d5d3ebd

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets


wfk007 (Contributor) commented Sep 8, 2023

@mdellanoce Thanks for your excellent work; I was able to reproduce the issue with it. I'm wondering whether there is a way to optimize this without affecting the performance of the average case.

wfk007 (Contributor) commented Sep 9, 2023

@mdellanoce Hi Michael, I opened a new PR (#1302) that avoids this without affecting the performance of the average case. Does it work for you?

wfk007 (Contributor) commented Sep 9, 2023

The root cause: when a node is added to the DoubleLinkedList before its previous and next siblings have been added (because of the random sort and shuffle), its insertion position in the DoubleLinkedList ends up inaccurate.

Details

For example:

If we have 6 options in a select element: 0/1/2/3/4/5.
After processMutation, the addedSet will be (opt0 = the first option element, tx0 = the text node of opt0, and so on):

opt0
tx0
opt1
tx1
opt2
tx2
opt3
tx3
opt4
tx4
opt5
tx5

After the random sort and shuffle, suppose the options become 0/4/1/5/2/3 while we still keep the addedSet above. When adding to the DoubleLinkedList, things get strange.

We loop over the addedSet:

  1. after addNode(opt0), the DoubleLinkedList becomes: opt0
  2. after addNode(tx0): tx0 <=> opt0
  3. after addNode(opt1): opt1 <=> tx0 <=> opt0
  4. after addNode(tx1): tx1 <=> opt1 <=> tx0 <=> opt0
  5. after addNode(opt2): opt2 <=> tx1 <=> opt1 <=> tx0 <=> opt0

The DoubleLinkedList is then processed from tail to head (from right to left).

In fact, opt1 should be serialized before opt0, so it should be sorted to the right of opt0 in the DoubleLinkedList.

When we serialize the nodes in the DoubleLinkedList from tail to head:

  1. opt0: nextId(opt4) is not ready, jump
  2. tx0: parentId(opt0) is not ready, jump
  3. opt1: nextId(opt5) is not ready, jump
  4. tx1: parentId(opt1) is not ready, jump
  5. opt2: nextId(opt3) is not ready, jump
    We waste a lot of time on this useless traversal (a simplified sketch of this loop follows).
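
For illustration, here is a simplified sketch of that tail-to-head pass (not rrweb's actual code; `addList.tail`, `node.previous`, `mirror`, and `serializeNode` are placeholder names). It shows why the worst case degrades: each full pass may emit only a few nodes, so N nodes can take on the order of N passes.

// Simplified sketch: repeatedly walk the list from tail to head, emitting a
// node only once its parent and next sibling have already been serialized.
while (addList.length) {
  let emitted = 0;
  let node = addList.tail;
  while (node) {
    const prev = node.previous;
    const parentId = mirror.getId(node.value.parentNode as Node);
    const nextId = node.value.nextSibling ? mirror.getId(node.value.nextSibling) : null;
    if (parentId === -1 || nextId === -1) {
      node = prev; // parent or next sibling not serialized yet: skip this round
    } else {
      serializeNode(node.value, parentId, nextId); // placeholder emit
      addList.removeNode(node.value);
      emitted++;
      node = prev;
    }
  }
  if (emitted === 0) break; // guard so this sketch cannot loop forever
}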

Solutions

Once a round of traversal is complete and all the nodes have been added to the DoubleLinkedList (even though some may not be in the correct position), it is reasonable for a defragment step to force re-finding each node's insertion position. So your approach works.

The optimization I first thought of was:

public addNode(n: Node) {
  // ...
  if (n.previousSibling && isNodeInLinkedList(n.previousSibling)) {
    // existing logic: insert after the previous sibling
  } else if (n.nextSibling && isNodeInLinkedList(n.nextSibling) && n.nextSibling.__ln.previous) {
    // existing logic: insert before the next sibling
  } else {
+   this.tempNodes.push(n);
  }
}

+public defragment() {
+  while (this.tempNodes.length) {
+    const top = this.tempNodes.pop();
+    this.removeNode(top);
+    this.addNode(top);
+  }
+}

+ addList.defragment();

Another way is to ensure that the nodes added to addedSet are already ordered: #1302
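
As a rough illustration of that idea (not necessarily how #1302 implements it), the added nodes could be put into document order before insertion, for example with compareDocumentPosition:

// Hypothetical helper: order added nodes by document position so that a
// node's previous sibling is always handled before the node itself.
function sortByDocumentOrder(nodes: Node[]): Node[] {
  return [...nodes].sort((a, b) =>
    a.compareDocumentPosition(b) & Node.DOCUMENT_POSITION_FOLLOWING ? -1 : 1,
  );
}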

@YunFeng0817 @Juice10 @eoghanmurray any suggestions or ideas?

wfk007 (Contributor) commented Sep 9, 2023

Another interesting thing is that if you update the benchmark test to:

setTimeout(() => {
  // re-shuffle the options
  options.sort(() => Math.random() - 0.5);
  for (const o of options) {
    o.optionElement.parentNode.appendChild(o.optionElement);
  }
}, 100);

the performance issues are gone.

mdellanoce force-pushed the md-addList-worst-case-optimization branch from a1ce992 to e6e87d2 on September 11, 2023 14:20
mdellanoce (Contributor, Author)

@wfk007 seems a little bit slower than rebuilding the list for that specific case:
[benchmark screenshot]

However, yes, I can see that the other cases were less impacted, and overall it is still a massive improvement over the current behavior.

I amended my commit here to only include the benchmark.

I had noticed the setTimeout behavior as well. Separating the addition of the nodes from the reorder that way seems to put them in separate mutations, which helps a lot. Unfortunately, the customer I'm working with who experienced this issue does all of this without a setTimeout.

mdellanoce changed the title from "fix(perf): benchmark and optimize processMutations/addList worst case performance" to "fix(perf): benchmark processMutations/addList worst case performance" on Sep 11, 2023
wfk007 (Contributor) commented Sep 11, 2023

@mdellanoce I got 800+ ms with defragment in this benchmark test, but why is it 173.4 ms in yours?

+public defragment() {
+  let current = this.head;
+  while (current) {
+    const next = current.next;
+    this.removeNode(current.value);
+    this.addNode(current.value);
+    current = next;
+  }
+}

let candidate: DoubleLinkedListNode | null = null;
+ addList.defragment();
while (addList.length) { /* ... */ }
[benchmark screenshots]

mdellanoce (Contributor, Author)

@wfk007 it looks like the first time I ran the benchmark I had this other fix applied: #1294, which gives me the ~170 ms result.

If I do the same with your change, I get 160 ms, so that's even better... in which case, ignore my previous comment!

mdellanoce force-pushed the md-addList-worst-case-optimization branch from 8823119 to e6e87d2 on October 31, 2023 19:23
mdellanoce force-pushed the md-addList-worst-case-optimization branch from e6e87d2 to 41e146b on November 13, 2023 18:57
mdellanoce marked this pull request as draft on November 13, 2023 18:57
mdellanoce (Contributor, Author)

@wfk007 I converted this PR to "draft" and added 2 commits that I've been working on.

  1. On the recording side, I changed the serialization process for added nodes so that the worst case performance should be 2N rather than N^2; best case performance should still be N. My approach is to serialize every node immediately while processing the addedSet, even if the parent/next relationships cannot be determined at that time. If the parent/next cannot be determined, the node is added to the addList queue, which is processed after addedSet. Since all the adds have been serialized at that point, determining parent/next relationships should be quick lookups in the mirror. (A rough sketch of this idea follows this list.)
  2. The change above required a change on the replay side to handle what could be a drastically different ordering for the adds on each mutation. Again, the worst case performance here should be 2N, while best case is still N. The process is similar to the recording side: attempt to add each node directly to the replay document, but if that isn't possible, build up one or more "resolve trees" which are then traversed again to finalize the replay document. I also changed how the child nodes on the resolve trees are stored to optimize the building of the tree and ultimately the insertBefore call to build the document.
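
A rough sketch of the recording-side idea in point 1, using placeholder names (`serializeImmediately`, `deferred`, `adds`) rather than the actual implementation:

// Pass 1: serialize every added node right away, even when its parent/next
// relationships are not yet known; defer only the relationship lookup.
const deferred: Node[] = [];
for (const n of addedSet) {
  const sn = serializeImmediately(n); // placeholder: serializes n into the mirror
  const parentId = mirror.getId(n.parentNode as Node);
  const nextId = n.nextSibling ? mirror.getId(n.nextSibling) : null;
  if (parentId === -1 || nextId === -1) {
    deferred.push(n); // parent or next sibling not serialized yet
  } else {
    adds.push({ parentId, nextId, node: sn });
  }
}

// Pass 2: every added node is now in the mirror, so each lookup is O(1) and
// the whole mutation is processed in roughly 2N steps instead of N^2.
for (const n of deferred) {
  adds.push({
    parentId: mirror.getId(n.parentNode as Node),
    nextId: n.nextSibling ? mirror.getId(n.nextSibling) : null,
    node: mirror.getMeta(n), // placeholder lookup of the already-serialized node
  });
}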

This appears to have sped up the recording benchmarks across the board, though most are only minor improvements with the exception of the one I added in this PR. Of the 3 replay-fast-forward benchmarks, one was unchanged, one got significantly faster, and another got a bit slower (append 70x70x70) due to hitting worst case performance where previously the nodes were in a "good" order because recording ensured it.

I'm interested to hear your thoughts on this, and if it is worth pursuing further.

mdellanoce force-pushed the md-addList-worst-case-optimization branch from b9e2753 to 63006eb on December 12, 2023 16:00
mdellanoce (Contributor, Author)

Added another benchmark for node insertion order. This one does not currently perform poorly, but it does perform poorly with the fix from #1302, so it is meant to guard against fixes that simply move the worst case to a new scenario.

wfk007 (Contributor) commented Dec 13, 2023

@mdellanoce I considered this solution when optimizing recording performance in May this year (wfk007@05c7771) and discussed it with @YunFeng0817, but gave up because the impact was too large, so I ultimately chose the tail-pointer solution: #1220.

It would cause the following problems that I can think of:

  1. The generated data will cause a large number of test cases to fail because the ordering changes.
  2. It affects record/replay version compatibility: you cannot use an old version of the replayer with a new version of the recorder, and vice versa.
  3. It may affect downstream data consumption, such as analysis of recording results, E2E testing, etc.
  4. It seems to shift the performance pressure from the recording side to the playback side.

But I still look forward to this plan coming to fruition.

colingm (Contributor) commented Dec 13, 2023

@wfk007 can you think of any way to get us down to linear-time processing without also needing a corresponding change in the player? Not having linear-time processing on the recording side is a major problem; it seems to cause major slowdowns for roughly 15% of our customers who end up using recording, so it isn't really just an edge case anymore. It would be awesome if there were a way to do it that didn't require replay changes, but maybe that isn't possible. I definitely feel that a bit more performance pressure on the playback side is more reasonable than performance pressure on the recording side, where the effect is that the app freezes up for multiple seconds.

mdellanoce (Contributor, Author)

@wfk007 yes, point 2 was my major concern. I agree with @colingm that point 4 might be the correct trade-off. I've had the opportunity to see rrweb running side-by-side with some of the commercial leaders in the recording space, and they're shockingly fast by comparison when it comes to data collection. I'm not sure exactly how they do it, but it's hard to imagine they're doing more than one or two passes over the node lists to get the performance I'm seeing.

In the meantime, I've gone back to my original solution with the extra pass over addList to ensure the "good" order. Hopefully that'll work as a temporary solution.

wfk007 (Contributor) commented Dec 13, 2023

@colingm Don't discuss issues with the attitude of "if you can do it, do it, and if you can't, shut up." I'm just objectively laying out the problems this PR may cause. "Customers that end up using recording" is NONE OF MY BUSINESS, and it cannot be solved by merging this PR ONLY. You don't know what @mdellanoce and I have discussed and tried before. And from my personal point of view, I also agree we should have as little impact on users as possible on the recording side.

colingm (Contributor) commented Dec 13, 2023

@wfk007 wow, that was extremely hostile and, I feel, unwarranted. I am legitimately asking whether there is a way to get to linear processing without player changes, because obviously it is better if we don't have to do that. (Also, some context you might be missing: I am on @mdellanoce's team at work, so I am a bit more aware of what has been going on than might be immediately apparent.) I don't know if this solution is the best one; I am asking for input on whether there are other ways to handle this you can think of, since you have much more experience here.

wfk007 (Contributor) commented Dec 14, 2023

@colingm sorry, I misunderstood.
