Improve migration performance #3

Open
svenseeberg opened this issue Dec 22, 2023 · 20 comments
Labels
enhancement New feature or request

Comments

@svenseeberg
Member

Currently, the migration into the Matrix server runs as a single process because data needs to be imported in chronological order. We should investigate how performance can be increased with multithreading or multiprocessing.

@HerHde
Collaborator

HerHde commented Dec 22, 2023

I think a lot of this can be parallelised. The creation of users can run in parallel, as can the creation of rooms and the handling of non-threaded, non-status events (messages, but events might cause problems).

I haven't used a multi-process library for quite some time, but maybe even a bit more (but still limited) async in a single process might change a lot, as I suspect that the HTTP requests to Synapse are the bottleneck for the most part.

@chagai95

chagai95 commented Mar 5, 2024

Do you have any test data (maybe a dump of a test server that has been used a lot for testing) which I can use to test the performance? Or maybe some tips on how to write a script to generate some test data?

@janonym1

janonym1 commented Mar 5, 2024

I am not a programmer, so take my opinion with a grain of salt :)

Maybe I am overlooking something, but shouldn't it be possible to just split up the user data into roughly equal parts (~1/n) when exporting the DB and then run the script with the magic of GNU parallel (https://www.gnu.org/software/parallel/)?

@HerHde
Collaborator

HerHde commented Mar 15, 2024

Do you have any test data (maybe a dump of a test server that has been used a lot for testing) which I can use to test the performance?

As Rocket.Chat wants to force me to do things I don't want when I set up a test instance, I don't have one and rely on data from a real instance. I'm currently working on some test data, but it won't be enough to test performance.

split up the user data into roughly equal parts (~1/n) when exporting the DB and then run the script with the magic of GNU parallel

It's not that easy because of some cross-references, unfortunately. We need to have all users and rooms before creating the messages, so each process would have to wait until every other process has finished a step before proceeding (something like a semaphore).
Messages in threads are skipped if their root isn't mapped, so they need to be handled in sequence or the data split accordingly.
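
The ordering constraint described here could be sketched as a list of phases, where every job of one phase has to finish before the next phase starts, while the jobs inside a phase run concurrently. This is only an illustration under assumptions: `Phase` and `runPhases` are hypothetical names and not part of rocketchat2matrix, and the jobs stand in for whatever creates users, rooms and messages.

```typescript
// Hypothetical sketch: each phase acts as a barrier for the next one.
type Phase = { name: string; jobs: Array<() => Promise<void>> };

async function runPhases(phases: Phase[]): Promise<string[]> {
  const finished: string[] = [];
  for (const phase of phases) {
    // Start all jobs of this phase at once and wait for every one of them
    // before the loop moves on to the next phase.
    await Promise.all(phase.jobs.map((job) => job()));
    finished.push(phase.name);
  }
  return finished;
}
```

Thread replies would still need their root to be mapped first, so the message phase cannot simply run all of its jobs at once; it would need a further split or an ordering inside each room.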

Finally and most importantly, we don't know yet where the bottleneck is. It would probably reduce the overall runtime if entities of the same type were handled more concurrently, but maybe the HTTP requests to Synapse are the slow part, rendering such a change useless.
And we're using an SQLite DB to save the mappings; I don't know how well it handles parallel reads and writes.

@chagai95

chagai95 commented Apr 4, 2024

We started working on a script that generates test data:
https://git.fairkom.net/hosting/chat/rocketchatmatrixmigration/

We don't generate proper random timestamps yet, but that would be an important next step.

We tested this script and tried to understand where the bottleneck might be, but for now we have more questions than answers. We will keep investigating and just wanted to share this work-in-progress script.

@grvn-ht

grvn-ht commented Apr 4, 2024

Hi everyone,
As said before, migration performance is a big issue.
I also think there are two bottlenecks, one on the side of this tool and one on the Synapse side.
On the Synapse side I added some workers to help the main process handle events:
stream_writers: https://matrix-org.github.io/synapse/latest/workers.html#worker-configuration
On this side I divided the messages into batches to process them in parallel.
I don't spread messages from one room across different batches, in order to keep their order.
My workaround is:
first handle users and rooms
then divide the messages into batches and process them in parallel. I found two ways:

  • use npm run compile && concurrently "node dist/app.js <message_file1>" "node dist/app.js <message_file2>" ...
  • run one process for each batch inside different containers, copy and persist db.sqlite to each of them to be able to restart where we failed
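
The batching rule described above (all messages of a room stay in the same batch, in their original order) could look roughly like the following sketch. It assumes that the messages are already sorted chronologically and that each message carries its Rocket.Chat room ID in `rid`. `splitByRoom` is a hypothetical helper, not something from this project, and the balancing is deliberately naive: a room is assigned to whichever batch is smallest when the room first appears.

```typescript
// Hypothetical helper: split messages into batches without splitting a room.
function splitByRoom<T extends { rid: string }>(messages: T[], batchCount: number): T[][] {
  if (batchCount < 1) throw new Error('batchCount must be at least 1');
  const batches: T[][] = Array.from({ length: batchCount }, () => []);
  const batchOfRoom = new Map<string, number>();

  for (const message of messages) {
    let index = batchOfRoom.get(message.rid);
    if (index === undefined) {
      // First message of this room: pick the batch that is smallest right now.
      index = 0;
      for (let i = 1; i < batchCount; i++) {
        if (batches[i].length < batches[index].length) index = i;
      }
      batchOfRoom.set(message.rid, index);
    }
    // Iterating in input order keeps the order inside every room.
    batches[index].push(message);
  }
  return batches;
}
```

Each resulting batch could then be written to its own file and passed to one of the processes or containers mentioned above.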

I never worked with npm before this project, so I don't know whether it has an impact on performance.
It could be easier to handle parallelization with a Go program.

@chagai95

chagai95 commented Apr 4, 2024

Sounds great! Could we help you test this in any way?

@HerHde
Collaborator

HerHde commented Apr 4, 2024

Great stuff I'm reading here!

@chagai95 That script looks really helpful for testing performance. For edge cases we can use the manually added test data (I mentioned the commit adding it, f99771b, somewhere), so you don't need to worry about that in the script.

Have you considered using Faker to generate random data that resembles real content, instead of random strings? It offers many providers to generate names, texts, dates and a lot of other stuff. It can even wrap it in JSON conveniently.


@grvn-ht, you said:

first handle users and rooms then divide messages in batch and treat them in parallel

I think that's a sane approach. The number of messages should be orders of magnitude larger than the rest (my assumption for instances so big that they need higher performance).

I could help with some jq magic to split the messages, but grep could do the trick as well. And I think I need to implement the parts that read configurable files anyhow.

The question I haven't spent much time on yet is whether there is a convenient and performant way to parallelise within the app that would make it unnecessary to adapt the DB. The storage adapter/TypeORM allows us to use another DB for parallelised access, nonetheless.


Out of curiosity, I ask you all naively: why do you need a significantly lower execution time for such a one-off migration? Can you name any numbers yet? Do you intend to migrate different chats regularly? What's the story?

@rasos

rasos commented Apr 5, 2024

Why do we need some performance? We have to migrate at least two large RC instances with 10k+ users each. We simply need to avoid a downtime of weeks; a maximum of maybe 3 days (a weekend in summer) is acceptable for running the migration script. Unless we find some rsync-like mechanism that allows us to re-sync only what has changed. If that were possible, we could stop registering new users in RC, start syncing, and after a week or however long it takes, fetch only the newest messages again to have them all in Matrix.

@grvn-ht

grvn-ht commented Apr 5, 2024

Out of curiosity, I ask you all naively: why do you need a significantly lower execution time for such a one-off migration? Can you name any numbers yet? Do you intend to migrate different chats regularly? What's the story?

You're right, I only need to do it twice. The first is a small RC server: 500 users, 3000 rooms, 120 000 messages.
For this one I could use this project as is.
The second one, on the other hand, is much bigger: 1000 users, 5000 rooms, 4 500 000 messages.
Tell me if you found other values, but after some tests I remember being able to process 4 messages per second on one instance. For this server that would lead to about 13 days of migration (4 500 000 messages at 4 messages per second).
With parallelization I could do it in ~2 days.
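
The runtime estimate above can be reproduced with a few lines of arithmetic. The sketch below only counts messages and ignores users and rooms; `estimateDays` is a hypothetical helper name, and the figures are the ones quoted in this comment.

```typescript
const SECONDS_PER_DAY = 24 * 60 * 60;

// Days needed to push `messageCount` messages at a constant rate.
function estimateDays(messageCount: number, messagesPerSecond: number): number {
  return messageCount / messagesPerSecond / SECONDS_PER_DAY;
}

// 4 500 000 messages at 4 messages per second on a single instance.
const sequentialDays = estimateDays(4500000, 4);

// Throughput that would be needed to finish the same data in two days.
const requiredRate = 4500000 / (2 * SECONDS_PER_DAY);

console.log(sequentialDays.toFixed(1)); // prints "13.0"
console.log(requiredRate.toFixed(1)); // prints "26.0"
```

Under these assumptions, finishing in about two days needs roughly 26 messages per second overall, about 6.5 times the single-instance rate.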

Like @rasos, I would like to be able to run the migration script over a weekend to avoid synchronization issues between RC and Matrix. But it's true that I could also do the main migration and then migrate another time with fresh data on the same db.sqlite.

But I found it interesting to try to improve performance ^^

@HerHde
Collaborator

HerHde commented Apr 8, 2024

4 messages per second is pretty slow indeed, I understand the need.

My current approach is to let the script run multiple times (on the same DB, with different inputs). This works mostly fine as far as I can tell, but it doesn't detect any changes to already processed (and thus mapped) entities. Thus I'm not entirely happy with this solution, which is more like a crash-resistant design.

@HerHde
Collaborator

HerHde commented Apr 17, 2024

I experimented a bit with handling multiple rooms and messages concurrently, but I can't check the results for correctness for lack of tests. I suspect that it misses most threaded messages.
Anyhow, the run times are significantly reduced and vary depending on the number of entities handled in parallel. If anyone wants to have a look:
https://git.verdigado.com/NB-Public/rocketchat2matrix/pulls/101

@HerHde HerHde added the enhancement New feature or request label Jun 17, 2024
@HerHde
Collaborator

HerHde commented Jul 16, 2024

Now I've implemented concurrency with the aforementioned PR (or commits b48400a and b48400a) after testing it with end-to-end tests and fixing some bugs.

I would really appreciate your feedback on it, if you could test it. I didn't see any problems testing with our database, but maybe I missed some points ;-)

@HerHde HerHde self-assigned this Jul 16, 2024
@desto12

desto12 commented Jul 18, 2024

Hi, I ran the updated app. Rooms are working well, but when it tries to import messages, I get an error:
[image: screenshot of the error]
The same happens with the limit set to 1. I have a very big data set, ~5 GB of messages in the JSON file :)

@HerHde
Collaborator

HerHde commented Jul 18, 2024

Oh my, of course there has to be a problem with my simple approach. Thank you for reporting this!

So, I interpret this as the queue getting too long. My first thought would be to read the file incrementally and enqueue handling jobs only as fast as the queue is worked off.
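
One way to keep the queue from growing with the size of the input is to let a fixed number of workers pull the next item only when they are free, so that nothing is read ahead of what can be handled. The following is a minimal sketch of that idea, not the code from the PR. `processWithLimit` is a hypothetical name, and `source` is assumed to be any lazy iterable, for example a generator that yields one parsed line of the export at a time.

```typescript
// Hypothetical sketch: at most `limit` handlers are in flight, and the next
// item is only pulled from the source when a worker becomes free.
async function processWithLimit<T>(
  source: Iterable<T>,
  limit: number,
  handle: (item: T) => Promise<void>,
): Promise<void> {
  const iterator = source[Symbol.iterator]();

  async function worker(): Promise<void> {
    // All workers share one iterator, so each item is handled exactly once.
    for (let next = iterator.next(); !next.done; next = iterator.next()) {
      await handle(next.value);
    }
  }

  await Promise.all(Array.from({ length: limit }, () => worker()));
}
```

Note that this sketch handles items from the same room concurrently, so on its own it does not preserve the message order within a room.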

Or maybe another approach which uses more CPU cores 🤷

@flying-scorpio

Now I've implemented concurrency with the aforementioned PR (or commits b48400a and b48400a) after testing it with end-to-end tests and fixing some bugs.

I would really appreciate your feedback on it, if you could test it. I didn't see any problems testing with our database, but maybe I missed some points ;-)

As raised in #9 (comment), concurrency in messages messes with the ordering, as explained in the WARNING of https://spec.matrix.org/v1.11/application-service-api/#timestamp-massaging.
I don't know how it would be possible to implement concurrency of different rooms while removing concurrency for messages within a room/thread.

@HerHde
Collaborator

HerHde commented Jul 25, 2024

Problems with the message order are also raised in #22.
Thanks for mentioning the warning.

@HerHde
Collaborator

HerHde commented Jul 25, 2024

For now I reverted the concurrency in 698062c to allow a functioning migration.

I don't know how it would be possible to implement concurrency of different rooms while removing concurrency for messages within a room/thread.

I would also create a queue for each room, handling these messages sequentially and handling the rooms in parallel, as you mention. Maybe with a different library like Promise Pool.
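
A queue per room with parallel rooms could be sketched without any library as follows. This is an illustration under assumptions, not the implementation that later landed in this project: `migrateByRoom` and `send` are hypothetical names, messages are assumed to arrive in chronological order, and `rid` is assumed to hold the room ID. Since all messages of a room are sent one after another, a thread root is sent before its replies as long as both are in the same room and the input is ordered.

```typescript
// Hypothetical sketch: rooms run in parallel, messages inside a room do not.
async function migrateByRoom<T extends { rid: string }>(
  messages: T[],
  roomConcurrency: number,
  send: (message: T) => Promise<void>,
): Promise<void> {
  // Build one queue per room, keeping the input order inside each queue.
  const queues = new Map<string, T[]>();
  for (const message of messages) {
    const queue = queues.get(message.rid);
    if (queue) queue.push(message);
    else queues.set(message.rid, [message]);
  }

  const pending = Array.from(queues.values());

  async function worker(): Promise<void> {
    // Each worker takes a whole room and sends its messages one by one.
    for (let queue = pending.shift(); queue !== undefined; queue = pending.shift()) {
      for (const message of queue) {
        await send(message);
      }
    }
  }

  await Promise.all(Array.from({ length: roomConcurrency }, () => worker()));
}
```

With this layout the achievable speed-up is bounded by the number of rooms and by the largest room, because a single very active room is still processed sequentially.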

@flying-scorpio

Problems with the message order are also raised in #22. Thanks for mentioning the warning.

Indeed, I missed that!

I would also create a queue for each room, handling these messages sequentially and handling the rooms in parallel, as you mention. Maybe with a different library like Promise Pool.

I can try to tackle that tomorrow, but I'm no TypeScript expert!

@flying-scorpio

I implemented per-room concurrency with sequential messages inside each room in #31.
I generated RC data with 100 rooms, each holding 20 messages, to compare runs with and without concurrency. The conversion took 2:46s without concurrency and 1:46s with it.
This obviously needs more testing, but I'm not sure we can improve performance much further, as this is mostly useful when there are many rooms.
