Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RAC] Alerts view in Observability UI performance #110758

Closed
katrin-freihofner opened this issue Sep 1, 2021 · 31 comments
Closed

[RAC] Alerts view in Observability UI performance #110758

katrin-freihofner opened this issue Sep 1, 2021 · 31 comments
Assignees
Labels
bug Fixes for quality problems that affect the customer experience Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" Team:Threat Hunting Security Solution Threat Hunting Team Theme: rac label obsolete

Comments

@katrin-freihofner
Copy link
Contributor

katrin-freihofner commented Sep 1, 2021

...probably also affects Security

Kibana version:
Internal edge cluster; Chrome Version 92.0.4515.159 on Mac

Description of the problem including expected versus actual behavior:
There is a noticeable lag when interacting with the UI.

This video shows that when scrolling and hovering one of the alerts, the hover state takes ~2 sec to show up.
alerts-small

This video shows a lag when selecting multiple alerts.
multi-select

This has a negative effect on the usability of this view!

Steps to reproduce:

  1. set timeframe to last week

Proposed solution for 7.15
Set rows per page to 10. Remove the selector for rows per page underneath the alerts table.

@katrin-freihofner katrin-freihofner added bug Fixes for quality problems that affect the customer experience Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services Team:Threat Hunting Security Solution Threat Hunting Team Theme: rac label obsolete labels Sep 1, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-threat-hunting (Team:Threat Hunting)

@elasticmachine
Copy link
Contributor

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@weltenwort
Copy link
Member

🤔 As a first guess this likely comes down to the way the table and cells are rendered. We should investigate the causes for superfluous re-renders and reduce them using memoization and similar optimizations.

@weltenwort
Copy link
Member

What browser in which version were you using when making the recordings?

@sorenlouv
Copy link
Member

sorenlouv commented Sep 1, 2021

What browser in which version were you using when making the recordings?

I can also reproduce this behaviour in latest version of Chrome. Ran the react profiler when clicking a single checkbox and got this flamegraph:

image

What stands out to me is the number of empty react commits which can be seen in the top left corner of the screenshot. There are 71 commits in total but most of them (66 or so) are empty. Empty commits could introduce unnecessary delay as seen in the recording.

Just for comparisons sake, this is the flamegraph when clicking a checkbox in the Dashboard table
image

Here there are only 3 commits.

@weltenwort
Copy link
Member

@sqren thanks for reproducing. We'll probably have to dig into how the t-grid performs state management internally.

@katrin-freihofner
Copy link
Contributor Author

I tested on Chrome, I added the version to the issue description.

@snide
Copy link
Contributor

snide commented Sep 1, 2021

Just popping in to note that @XavierM and @andrew-goldstein have been meeting with @chandlerprall since last week on this issue. I'm not certain about timing, but it's even more prevalent in the Security grid because of the larger amount of cells and actions they render. My understanding is that there's optimizations to be made around tabable and the actions hover state (also sounded like there was some bad state being passed around).

Beyond optimizations around the components, the general advice is:

  1. Virtualize the grid (which would require setting a height similar to Discover / Lens usage).
  2. Limit the amount of rows rendered
  3. Limit the amount of columns rendered
  4. Limit the amount of content loaded in cells
    • removing actions
    • making tooltips title tags
    • removing heavy interactions that require a lot of dom
    • removing as much wrapping divs/spans from inner content within cells

The only quick "fix" I've seen with the implementation as it stands is to limit pagination to 10 rows.

@miltonhultgren
Copy link
Contributor

miltonhultgren commented Sep 1, 2021

(Likely over-communicating, sorry if so)

Beyond fixing the bad data re-renders and enabling virtualization again, keeping the amount of rows and columns down was really key at my last job (we built a table like this for other teams). To do that we needed:

  1. Data priority (allows smaller row counts by sorting on "do this first" use cases on page load, like high impact alerts, always nice to guide user action)
  2. Search (cross column search to quickly find what is most relevant for the current user's goal, makes it less "painful" to have less rows per page)
  3. Column filters (maybe not all columns are relevant at start, having easy was to toggle columns on/off is helpful)

@snide's comments feel all too familiar!

@weltenwort
Copy link
Member

@snide thanks for the cross-team information. I assume any optimizations made by the security team in the shared underlying t-grid should benefit us too.

@andrew-goldstein
Copy link
Contributor

My understanding is that there's optimizations to be made around tabable and the actions hover state (also sounded like there was some bad state being passed around).

The EUI team is tracking the tabable and hover action state bottlenecks in datagrid identified by @chandlerprall and @snide last week in the following issues:

@chandlerprall joined the Threat Hunting team meeting yesterday, and we agreed to the following plan:

  • The EUI team assigned a priority !urgent! to the issues above, and if they can be patched for 7.15, we will ship with the current pagination options
  • If fixes for the issues above are unavailable before the final BC, we will limit the page size as suggested by @snide

@katrin-freihofner
Copy link
Contributor Author

The proposed solution, for now, is to set rows per page to 10 and remove the selector for rows per page underneath the alerts table.

@miltonhultgren
Copy link
Contributor

Started looking at this. If I understand the current code I'll need to find a way to pass in the itemsPerPage setting instead of the itemsPerPageOptions setting. Will come back to this tomorrow!

@miltonhultgren miltonhultgren self-assigned this Sep 2, 2021
@andrew-goldstein
Copy link
Contributor

@miltonhultgren, FYI @chandlerprall just opened the following PR:

[EuiDataGrid] resolve performance issues when interacting with a large grid #5136 to resolve the issues above:

The description of the PR has some much-happier looking perf charts:

perf

and (also in the PR description), some promising results when tested in the Security Solution, (which as noted above has both more columns and more cell actions):

Result

Moving from 1-1.5s (2-5s in siem's table) stop-the-world when leaving a cell to ~40ms.

@snide
Copy link
Contributor

snide commented Sep 2, 2021

Just a reminder for those following those EUI PRs. The following will need to happen:

  1. Both PRs will need to merge in EUI (one is already in)
  2. We need to cut a backport release of EUI for kibana that only includes these fixes
  3. We will make a PR in kibana to upgrade EUI to that new version
  4. That PR will need to be tested and merged for Kibana
  5. That merge itself will need to be backported to 7.15

Upgrading EUI in Kibana is never straightforward and we all know how fast Kibana CI runs. Data Grid is used in more places than just RAC so we'll need to test in Kibana. We also have a holiday this weekend.

Keep that context in mind.

@chandlerprall
Copy link
Contributor

  1. We will make a PR in kibana to upgrade EUI to that new version

We can skip that step, getting EUI's backport release straight into the 7.15 branch. There are jest snapshot changes to expect, but otherwise these two should drop in cleanly.

Assuming the PR (elastic/eui#5136) merges quickly, I'll cut the backport release tomorrow & immediately begin upgrading Kibana's 7.15 branch.

@MikePaquette
Copy link

Wow, sounds like some amazing progress!

If those results hold true, we should be safe in going back to the preferred values for rows per page and keeping the selector, correct?

@chandlerprall
Copy link
Contributor

We'll definitely want to test the end result in a complete setup - Dave's suggestions around enabling virtualization may still be needed. The changes so far resolve what has been observed as the bottlenecks, but it's possible we've only dug a hole large enough to hit another layer. However, based on @kqualters-elastic's initial explorations I don't expect another set of immediate concerns.

@miltonhultgren
Copy link
Contributor

I opened #111100, how do we want to handle this?
We can merge my PR and if the EUI fixes land in time we can revert the limiting after we've tested.

miltonhultgren added a commit to miltonhultgren/kibana that referenced this issue Sep 3, 2021
@chandlerprall
Copy link
Contributor

EUI upgrade PR is now up at #111199 , letting CI flag affected snapshots then I will update those & take out of draft.

@chandlerprall
Copy link
Contributor

eui upgrade in 7.15 has merged

@miltonhultgren
Copy link
Contributor

@katrin-freihofner Can this issue be closed after the EUI upgrades?

@katrin-freihofner
Copy link
Contributor Author

@miltonhultgren is edge showing the latest version?

We still see performance issues there:
performance

@miltonhultgren
Copy link
Contributor

miltonhultgren commented Oct 6, 2021

(I'm guessing edge is the name of a cluster?)
I don't know, maybe @chandlerprall knows which version of Kibana/EUI is deployed there?

EDIT: @paulb-elastic sent me some links to the cluster, I'll check tomorrow on the performance as well as which EUI version is running and get back on this!

@miltonhultgren
Copy link
Contributor

miltonhultgren commented Oct 12, 2021

@katrin-freihofner I just checked the edge cluster and I see the same performance issues, selecting a row takes more than a second to update the UI and the cell actions have a delay before they are displayed. This isn't visible when only showing 10 rows.

As far as I can tell, edge is built from master which has version 38.0.1 of EUI which should include the fixes by @chandlerprall

@elastic/eui-design can we get some help in investigating this further?

It could be something we're doing wrong on our side that isn't visible in the Security side.
@elastic/security-threat-hunting Is your performance still degraded?

@kqualters-elastic
Copy link
Contributor

is that https://edge-oblt.elastic.dev/ ? I don't think has been updated in a bit, looks like a ton of calls are still being made to the problematic isUntouchable function, which should not be happening if it's updated.

@miltonhultgren
Copy link
Contributor

@kqualters-elastic That's the correct URL. I'll ask around to see if/when it was updated and get back

@miltonhultgren
Copy link
Contributor

miltonhultgren commented Oct 13, 2021

Based on https://edge-oblt.elastic.dev/api/status, the build_hash points to 9217be3287d9a4efa362900f8adcdd7aa7260459 which has this version of EUI:

"@elastic/eui": "37.6.0",

Which should include the fixes from this PR #111199

Let me try to verify also from my local master copy with the edge cluster data (CCS for the win)

@kqualters-elastic
Copy link
Contributor

@miltonhultgren
Copy link
Contributor

miltonhultgren commented Oct 13, 2021

I see, that would explain it then. I was confused by the linked PR since it mentions version 37.1.3.

I've also verified that locally from master the performance for cell action is much improved.
I've asked if the edge cluster can be updated from master since it is a bit outdated. (Update: It has now been updated to something much more recent and I can verify that the edge cluster now behaves like my local Kibana)
So this particular performance issue is resolved. @katrin-freihofner

I still feel like there is some delays in responsiveness when selecting rows:
With 50 rows:
50 rows
(Note how the checkbox gets the lightblue outline on click but there is a delay before it gets colored in)

With 10 rows:
10 rows

When changing the page from 50 to 10 rows:
Change page size
(I'm scrolling down, changing the page size, then the rows directly get cut down to 10 while the data is still loading and then it updates the checkbox selection states)

But these can likely be covered in another issue.

@jasonrhodes jasonrhodes added Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" and removed Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services labels Nov 1, 2021
@jasonrhodes
Copy link
Member

So this particular performance issue is resolved. @katrin-freihofner

I'm going to close this issue since it looks like the main problem here is resolved. @miltonhultgren if you can create a separate issue for what you describe in your last comment and assign it to Actionable Observability, that'd be great. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Fixes for quality problems that affect the customer experience Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" Team:Threat Hunting Security Solution Threat Hunting Team Theme: rac label obsolete
Projects
None yet
Development

No branches or pull requests