Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HOLD for payment 2024-05-06] [$500] [Wave Collect] [Ideal Nav] Chat List and Workspace List sometimes loads slowly & App sometimes hangs #35704

Closed
1 of 6 tasks
hayata-suenaga opened this issue Feb 2, 2024 · 77 comments
Assignees
Labels
Awaiting Payment Auto-added when associated PR is deployed to production Bug Something is broken. Auto assigns a BugZero manager. Engineering Internal Requires API changes or must be handled by Expensify staff Weekly KSv2

Comments

@hayata-suenaga
Copy link
Contributor

hayata-suenaga commented Feb 2, 2024

Action Performed:

Reported by David here, here, and here

This issue has two aspects:

The Chat List and Workspace List loads sometimes loads too slow

  1. Make sure you have an account with a lot of chats and workspaces
  2. Click the Chats button on the bottom tab
  3. Confirm that the list of chats load very slow
  4. Click the 🔧 Settings button in Expensify/All workspace > Workspaces
  5. Confirm that the list of workspaces load very slow

Another performance issue was reported by David
-> internal Slack thread where David posted the issue

  1. Make sure your account has several workspaces with a lot of chats/reports
  2. Open the Chat Switcher (Search Page) by pressing CMD + K
  3. Switch to a different chat/person (possibly one in a different workspace
  4. Observe that the LHN that displays the list of chats loads very slowly

The App sometimes hangs

This issue doesn't have any defined reproduction step. David reported that he sometimes get the browser popup error saying that the page is unresponsive. Please see the screenshot below. This issue was initially reported in this GH issue.

Screenshot 2024-02-05 at 3 00 37 PM

Workaround:

N/A

Platforms:

Which of our officially supported platforms is this issue occurring on?

  • Android: Native
  • Android: mWeb Chrome
  • iOS: Native
  • iOS: mWeb Safari
  • MacOS: Chrome / Safari
  • MacOS: Desktop

Screenshots/Videos

Screenshot -> https://expensify.slack.com/archives/C036QM0SLJK/p1706901348693069?thread_ts=1706887969.154719&cid=C036QM0SLJK

View all open jobs on GitHub

Upwork Automation - Do Not Edit
  • Upwork Job URL: https://www.upwork.com/jobs/~01e122c7e0ae0d76b5
  • Upwork Job ID: 1753562635025690624
  • Last Price Increase: 2024-04-23
  • Automatic offers:
    • cubuspl42 | Reviewer | 28142256
@hayata-suenaga hayata-suenaga added Daily KSv2 Bug Something is broken. Auto assigns a BugZero manager. labels Feb 2, 2024
@hayata-suenaga hayata-suenaga self-assigned this Feb 2, 2024
Copy link

melvin-bot bot commented Feb 2, 2024

Triggered auto assignment to @lschurr (Bug), see https://stackoverflow.com/c/expensify/questions/14418 for more details.

@lschurr
Copy link
Contributor

lschurr commented Feb 2, 2024

Hey @hayata-suenaga - should this be External for contributors to work on or are we keeping this Internal?

@hayata-suenaga hayata-suenaga added the External Added to denote the issue can be worked on by a contributor label Feb 2, 2024
@melvin-bot melvin-bot bot changed the title [Wave 8] [Ideal Nav] Loading is too slow for Chat List and Workspace List [$500] [Wave 8] [Ideal Nav] Loading is too slow for Chat List and Workspace List Feb 2, 2024
Copy link

melvin-bot bot commented Feb 2, 2024

Job added to Upwork: https://www.upwork.com/jobs/~01e122c7e0ae0d76b5

@melvin-bot melvin-bot bot added the Help Wanted Apply this label when an issue is open to proposals by contributors label Feb 2, 2024
Copy link

melvin-bot bot commented Feb 2, 2024

Triggered auto assignment to Contributor-plus team member for initial proposal review - @cubuspl42 (External)

@hayata-suenaga
Copy link
Contributor Author

Hey @hayata-suenaga - should this be External for contributors to work on or are we keeping this Internal?

I'll keep this external for now. Thank you for reminding me of it 😄

@melvin-bot melvin-bot bot added the Overdue label Feb 5, 2024
@hurali97
Copy link
Contributor

hurali97 commented Feb 5, 2024

@mountiny can you assign this to me ? 👋

@melvin-bot melvin-bot bot removed the Help Wanted Apply this label when an issue is open to proposals by contributors label Feb 5, 2024
Copy link

melvin-bot bot commented Feb 5, 2024

📣 @cubuspl42 🎉 An offer has been automatically sent to your Upwork account for the Reviewer role 🎉 Thanks for contributing to the Expensify app!

Offer link
Upwork job

@melvin-bot melvin-bot bot removed the Overdue label Feb 5, 2024
@hayata-suenaga hayata-suenaga changed the title [$500] [Wave 8] [Ideal Nav] Loading is too slow for Chat List and Workspace List [$500] [Wave 8] [Ideal Nav] Chat List and Workspace List sometimes loads slowly & App sometimes hangs Feb 5, 2024
@hayata-suenaga
Copy link
Contributor Author

hayata-suenaga commented Feb 5, 2024

@hurali97 I changed the issue description of the original post of this issue to include another related issue reported about the App becoming unresponsive. Please check the issue description again and let me know if you can handle the added issue. 🙇

@hurali97
Copy link
Contributor

hurali97 commented Feb 6, 2024

Hey @hayata-suenaga !

I took a look at the issue description and it's clear 👍 I tried to reproduce and locate the root cause but this one's tricky to reproduce. However, I found that switching between chat list and workspace is sometimes really slow. After locating the stack traces for these I found that most of the points are re-usable from the App Startup Audit on-going here. After adding the optimization code from these points, the switching between chat list and workspace is better than before by margins.

However, with the fixes I left the Expensify tab open for a few minutes switched to other tab and when I came back to Expensify and reloaded the page then switched to Workspace list, I got the Unresponsiveness Alert, even with the fixes applied from the above step. It's clear that the fixes doesn't solve the root cause which is apparently related to memory consumption.

I observed the memory usage whilst switching between chat list and workspaces a bunch of time the memory consumption gets to more than 2000 MB❗Upon analysing the memory allocation trace most of the allocation is consumed by getOrderedReportIDs.

I will analyse this further tomorrow and keep y'all posted 👍

@hayata-suenaga
Copy link
Contributor Author

Upon analysing the memory allocation trace most of the allocation is consumed by getOrderedReportIDs.

I don't fully remember what the getOrderedReportIDs function does. But I remember that the app identifies which workspace items should include RBR/GBR when the Workspace Switcher displays the list of workspaces (screenshot below). It does so by reviewing all reports for each workspace to determine if any reports or chats require the user's attention, signified by RBR/GBR. This process might consume a significant amount of memory.

Screenshot 2024-02-06 at 11 24 48 AM

@hurali97
Copy link
Contributor

hurali97 commented Feb 7, 2024

@hayata-suenaga The getOrderedReportIDs is the one which prepares the data for LHN. And as you explained about GBR there's some checks for GBR as well in the getOrderedReportIDs. Below is a piece of code from that function:

...
  reportsToDisplay.forEach((report) => {
        report.displayName = ReportUtils.getReportName(report);
        report.iouReportAmount = ReportUtils.getMoneyRequestSpendBreakdown(report, allReports).totalDisplaySpend;

        const isPinned = report.isPinned ?? false;
        const reportAction = ReportActionsUtils.getReportAction(report.parentReportID ?? '', report.parentReportActionID ?? '');

        // Checking for Require Attention
        if (isPinned || ReportUtils.requiresAttentionFromCurrentUser(report, reportAction)) {
            pinnedAndGBRReports.push(report);
        } else if (report.hasDraft) {
            draftReports.push(report);
        } else if (ReportUtils.isArchivedRoom(report)) {
            archivedReports.push(report);
        } else {
            nonArchivedReports.push(report);
        }
    });

    pinnedAndGBRReports.sort((a, b) => (a?.displayName && b?.displayName ? a.displayName.toLowerCase().localeCompare(b.displayName.toLowerCase()) : 0));

...

I will further investigate the cause of high memory consumption and then update with my findings here, thanks for the pointers 👍

@quinthar
Copy link
Contributor

quinthar commented Feb 7, 2024

Happened again; what can I do to help diagnose this?

image

@hurali97
Copy link
Contributor

hurali97 commented Feb 7, 2024

@quinthar I have some break through today as I was able to locate the root cause. I also tested the fix and things are looking way better now. Switching between chat list and workspaces is snappy and I didn't encounter any sort of lag OR unresponsiveness. I switched a lot of times between them.

I will post my findings here tomorrow 🤞

@hurali97
Copy link
Contributor

hurali97 commented Feb 8, 2024

Hey guys 👋

I want to share my findings on this issue with you. This issue was relatively hard to reproduce and even if I reproduced it, it was practically impossible for me to profile the memory leaks and garbage collector not able to free the memory. The reason for that is when we are profiling for memory allocations and the web page is unresponsive, everything stops at the moment, even the profiler. So there's no way forward in that area.

So I tried to profile this in chunks. To explain it, the flow this issue happens is when we are switching between chat list and workspaces. To take a chunk, we can put dummy views in chat list UI and use original view in workspaces. This gives us a chance to look for the memory issues whilst we are switching back and forth.

Following this approach I started the profile process and was able to locate a bunch of very important memory issues. Let's see them in detail:

  • The reports object:
Screenshot 2024-02-08 at 11 32 00 AM

Analysing this image from memory allocation instrument, we see that the report object still associated with detached HTML views and Garbage Collector is not able to retrieve that allocated memory. This happens in a couple of places in our profile. Exploring this further, we can see that it's bound to WorkSpacesListPage.tsx. The report object has around 15k reports in it. Take a look at the Retained Size section which means the amount of memory that will be freed once the object is removed alongside it's dependent objects. The detached HTML views takes about 2 mb of memory each and the list is relatively long.

The question is the web app became kind of usable when we placed dummy UI views in the chat list area, does it mean that the report object in other areas is also causing the issue?

The report object is supplied via withOnyx to the WorkSpacesListPage.tsx. We have report object supplied via withOnyx in 3 other places, namely ReportScreenIDSetter, SidebarLinksData and LHNOptionsList. Which means that we have more of the same problem in memory whenever we switch back and forth, hence our web app becomes slow and ultimately hangs. To explain it further let's see another issue:


  • The internals of withOnyx:

tempState variable:

Screenshot 2024-02-08 at 11 33 00 AM

state variable:

Screenshot 2024-02-08 at 11 33 08 AM

These two variables are part of withOnyx implementation and as far as I could understand, they both hold data for the supplied key. So this means for each key we have two variables holding the same data for some implementation details reasons. We have reports object used via withOnyx in 4 prominent places, which means 8 variables holding the reports object. The point to emphasise here is that the reports object is really huge and allowing it to have multiple instances in the React Tree is not ideal. Doing so slows down our web app when we are switching between chat list and workspaces.

The ideal solution in such cases is to share the single source of truth for such huge data. To explain it, we can leverage React.Context and store the reports object there using Onyx.connect. And later where we are using withOnyx for reports we will refactor that to now use a hook for the context, say useReports(). This fixes the bottleneck we have but it also gives us the advantage of immediately mounting the component instead of having to wait to parse it through withOnyx. This is achieved because React.Context already holds the data for us and whenever we ask, it will give us immediately and no extra variables are holding the data like tempState and state.


  • The FlashList in LHNOptionsList:

The underlying principle of FlashList is Recycling and it doesn't play very well if our renderItems are heavy. When we are switching back and forth with all the UI in place and with above report optimisation. The web app still sometimes hangs and the main reason for that is the usage of FlashList in LHNOptionsList.

I tried to profile the memory allocations by FlashList but since web page keeps getting stuck, I couldn't profile. I also tried with chunk approach by setting dummy UI in workspaces because LHNOptionsList is in chat list area but still the web page was hanging up.

Screenshot 2024-02-08 at 2 02 12 PM

The above image is from the docs of FlashList and as we can see that the list will perform poorly if it's slow to render. If we want to use FlashList, we will need to make the renderItem fast to render but it's hard since it's associated with lots of important data. Otherwise, we can switch to FlatList on web and still keep using FlashList on mobile platforms. Alternatively, we can use FlatList now and in a separate effort we first make renderItems fast to render and then bring back FlashList.

Only case where the web page did not hang up was when I replaced FlashList with FlatList and things were back to normal. I could also profile with original UI and no hang up in that as well.

Since we can't proceed with profiling, what I will deduce from this observation is point 1. Remember the HTML detached elements were not removed by GC and reports was still part of the memory? Since our LHN is based on rendering the reports, it seems to link back to the point 1. Also, since support for FlashList is in beta on web, we can think of replacing it with FlatList.

Just a note, if we are to replace FlashList with FlatList and we know that now it works very well, we still should dedicate some effort in making the renderItems light weight.


The above three step makes our web app a lot smoother, let's see that in action below:

memory.mp4

The above fixes the issue and makes our web app snappy but there are some areas that were observed during the profiling and are worth discussing over:

  • The report object being used to access data via Onyx.connect
Screenshot 2024-02-08 at 11 37 33 AM

We have Onyx connections in our util files more than 10 times. Most of them are only used for accessing the report data. In the profiling session, we see that more than 10 times the report object is holding the significant amount of memory. Above screenshot is just one example of it.

The ideal solution we can have is either Onyx.get() which helps us retrieve the updated value wherever we need it. If we can't do it for some reasons, we can create a single file with onyx connection and then read the data from it using a method, say Report.getAllReports().

I profiled the memory after closing all the connections and kept alive only a couple and the memory allocation for objects went down by margins.

  • The JSON.Stringify in getOrderedReportIDs:
Screenshot 2024-02-07 at 3 22 04 PM

We have cache mechanism in place for getOrderedReportIDs and to achieve that we generate a key through the passed arguments to the function. In the screenshot, we can see that the string key takes 25 mb at max and around ~130 mb in total . The reason is that the arguments which are used to generate the stringified key are really huge.

We should also discuss that the cache mechanism is not useful in getOrderedReportIDs. The reason is that due to such a diverse data which is prone to be non-unique because of it having a lots of keys. Which means that even if we place the cache mechanism in place, we have a limit of 5 and once that hits, we pop the oldest cache item to make room for the latest one. However, the cache is not really used because the data is different almost each time and we end up generating the key for it which is not used.

This is not only observed in this flow But in other flows as well. For example, as part of Callstack's audit, we analysed that in the App Startup, Send a message and Report screen load time, cache for getOrderedReportIDs is not used at all. In fact, it adds up to the execution time of the function by ~350 ms.


cc: @hayata-suenaga @quinthar

@hayata-suenaga
Copy link
Contributor Author

@hurali97
Thank you so much for the detailed analysis and the summary of the findings. I was really impressed ❤️

I have some questions and comments on the analysis you did. I gonna share them below.

Analysing this image from memory allocation instrument, we see that the report object still associated with detached HTML views and Garbage Collector is not able to retrieve that allocated memory.

By "detached HTML views", do you mean detached components? What is the reason that the object is still associated with the detached components?

The report object is supplied via withOnyx to the WorkSpacesListPage.tsx. We have report object supplied via withOnyx in 3 other places
we can leverage React.Context and store the reports object there using Onyx.connect.

I don't know why we didn't implement Onyx in a way that the data read from the Onyx storage is stored inside a single React Context. I agree that we can create a new Context for the reports object as a temporary fix. However, I believe the long time solution is to create a singular context for the Onyx data. Having multiple instances of a value under the same Onyx key seems like a waste of memory. We can consider this for future refactoring and add it to our backlog.

Otherwise, we can switch to FlatList on web and still keep using FlashList on mobile platforms.

I understand that FlatList has issues with our particular usage of it as the rendering of a list item is slow in our implementation. However, from the above comment, you seem to assume that the issue caused by the slow renderItem is only present on browsers. Could you explain why that is the case? Is that because FlashList is in beta on web?

if we are to replace FlashList with FlatList and we know that now it works very well, we still should dedicate some effort in making the renderItems light weight.

Why is the rendering of the chat item slow?

The ideal solution we can have is either Onyx.get() which helps us retrieve the updated value wherever we need it. If we can't do it for some reasons, we can create a single file with onyx connection and then read the data from it using a method, say Report.getAllReports()

Wouldn't this issue be resolved if we implement a context for reports as you mentioned earlier?

We have cache mechanism in place for getOrderedReportIDs

You're talking about this part of the code, right? I agree with you. The cache key seems to be using the entire list of chats, which probably is unique every time. Let's remove this cache mechanism. (Even if the cache were useful sometimes, the key should have been a hashed version of the stringified parameters.)

As a general question, why is the chats object (or array?) so huge? This is the object that is used to display the list of chats (i.e. reports) on the LHN, right? Why do we have so much data inside this object when we just need to display a few pieces of information for each chat item on LHN?

@mountiny
Copy link
Contributor

mountiny commented Feb 8, 2024

Really impressive, thank you! What would be the specific next steps which do not requite many Onyx changes in this case.

Regarding this FlashList, we have just migrated over to Flashlist due to its advantages in native side too, I think we want to going back to FlatList. We would like to focus on improving the render time

@hayata-suenaga
Copy link
Contributor Author

The migration to Fabric has finished. Removing the hold

@mallenexpensify mallenexpensify added Internal Requires API changes or must be handled by Expensify staff and removed Internal Requires API changes or must be handled by Expensify staff labels Apr 23, 2024
@melvin-bot melvin-bot bot changed the title [$500] [Wave Collect] [Ideal Nav] Chat List and Workspace List sometimes loads slowly & App sometimes hangs [$250] [Wave Collect] [Ideal Nav] Chat List and Workspace List sometimes loads slowly & App sometimes hangs Apr 23, 2024
Copy link

melvin-bot bot commented Apr 23, 2024

Triggered auto assignment to Contributor Plus for review of internal employee PR - @ntdiary (Internal)

@mallenexpensify
Copy link
Contributor

@ntdiary reassigning, please take over as C+. If you don't have bandwidth, unassign yourself. Thanks

@mallenexpensify mallenexpensify changed the title [$250] [Wave Collect] [Ideal Nav] Chat List and Workspace List sometimes loads slowly & App sometimes hangs [$500] [Wave Collect] [Ideal Nav] Chat List and Workspace List sometimes loads slowly & App sometimes hangs Apr 23, 2024
Copy link

melvin-bot bot commented Apr 23, 2024

Upwork job price has been updated to $500

@ntdiary
Copy link
Contributor

ntdiary commented Apr 24, 2024

@ntdiary reassigning, please take over as C+. If you don't have bandwidth, unassign yourself. Thanks

Unassign myself due to lack of bandwidth. :)

@ntdiary ntdiary removed their assignment Apr 24, 2024
@melvin-bot melvin-bot bot added Weekly KSv2 Awaiting Payment Auto-added when associated PR is deployed to production and removed Weekly KSv2 labels Apr 29, 2024
@melvin-bot melvin-bot bot changed the title [$500] [Wave Collect] [Ideal Nav] Chat List and Workspace List sometimes loads slowly & App sometimes hangs [HOLD for payment 2024-05-06] [$500] [Wave Collect] [Ideal Nav] Chat List and Workspace List sometimes loads slowly & App sometimes hangs Apr 29, 2024
Copy link

melvin-bot bot commented Apr 29, 2024

The solution for this issue has been 🚀 deployed to production 🚀 in version 1.4.67-7 and is now subject to a 7-day regression period 📆. Here is the list of pull requests that resolve this issue:

If no regressions arise, payment will be issued on 2024-05-06. 🎊

For reference, here are some details about the assignees on this issue:

  • @hurali97 does not require payment (Contractor)
  • @situchan requires payment (Needs manual offer from BZ)

Copy link

melvin-bot bot commented Apr 29, 2024

BugZero Checklist: The PR fixing this issue has been merged! The following checklist (instructions) will need to be completed before the issue can be closed:

  • [@hayata-suenaga] The PR that introduced the bug has been identified. Link to the PR:
  • [@hayata-suenaga] The offending PR has been commented on, pointing out the bug it caused and why, so the author and reviewers can learn from the mistake. Link to comment:
  • [@hayata-suenaga] A discussion in #expensify-bugs has been started about whether any other steps should be taken (e.g. updating the PR review checklist) in order to catch this type of bug sooner. Link to discussion:
  • [@hurali97 / @situchan] Determine if we should create a regression test for this bug.
  • [@hurali97 / @situchan] If we decide to create a regression test for the bug, please propose the regression test steps to ensure the same bug will not reach production again.
  • [@lschurr] Link the GH issue for creating/updating the regression test once above steps have been agreed upon:

@hayata-suenaga
Copy link
Contributor Author

hayata-suenaga commented Apr 30, 2024

[@hayata-suenaga] The PR that introduced the bug has been identified. Link to the PR:

The performance regression happens with different changes and due to past architectural decisions. One change that affected this was the introduction of a cache for the sorted report IDs that required the string cache key to be compared on each render. However, because the cache key changes almost all the time, this cache wasn't used at all.

[@hayata-suenaga] The offending PR has been commented on, pointing out the bug it caused and why, so the author and reviewers can learn from the mistake. Link to comment:

No specific PR.

[@hayata-suenaga] A discussion in #expensify-bugs has been started about whether any other steps should be taken (e.g. updating the PR review checklist) in order to catch this type of bug sooner. Link to discussion:

https://expensify.slack.com/archives/C049HHMV9SM/p1714496660088649

@lschurr
Copy link
Contributor

lschurr commented May 6, 2024

Payment summary:

@lschurr lschurr added Daily KSv2 and removed Weekly KSv2 labels May 7, 2024
@lschurr
Copy link
Contributor

lschurr commented May 7, 2024

Bump on that offer @situchan :)

@melvin-bot melvin-bot bot added the Overdue label May 10, 2024
@lschurr
Copy link
Contributor

lschurr commented May 10, 2024

Waiting on @situchan to accept.

@melvin-bot melvin-bot bot removed the Overdue label May 10, 2024
@lschurr lschurr added Weekly KSv2 and removed Daily KSv2 labels May 10, 2024
@lschurr
Copy link
Contributor

lschurr commented May 16, 2024

All paid!

@lschurr lschurr closed this as completed May 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Awaiting Payment Auto-added when associated PR is deployed to production Bug Something is broken. Auto assigns a BugZero manager. Engineering Internal Requires API changes or must be handled by Expensify staff Weekly KSv2
Projects
Archived in project
Status: Release 1: Ideal Nav & Collect Simplfied Profile, Members, Categories, Workflows (approvals)
Development

No branches or pull requests