
Support ingesting SST files generated by a live DB #12750

Closed
wants to merge 21 commits

Conversation

@cbi42 (Member) commented Jun 10, 2024

Summary: ... to enable use cases like using RocksDB to merge sort data for ingestion. A new file ingestion option IngestExternalFileOptions::allow_db_generated_files is introduced to allow users to ingest SST files generated by live DBs instead of by SstFileWriter. For now this only works if the SST files being ingested have zero as their largest sequence number AND do not overlap with any data in the DB (so we can assign seqno 0, which matches the seqno of all ingested keys).

The option is marked as experimental for now.
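A minimal usage sketch of the ingest side (my assumption based on the summary above; the paths are illustrative, and only IngestExternalFileOptions::allow_db_generated_files itself comes from this PR):

```cpp
// Minimal sketch of ingesting DB-generated SST files (assumed usage; the
// paths and file names below are placeholders).
#include <cassert>
#include <string>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/target_db", &db);
  assert(s.ok());

  // SST files copied from a live DB (not produced by SstFileWriter). All keys
  // must have sequence number 0 and must not overlap existing data in this DB.
  std::vector<std::string> files = {"/tmp/source_db/000012.sst"};

  rocksdb::IngestExternalFileOptions ifo;
  ifo.allow_db_generated_files = true;  // new EXPERIMENTAL option from this PR
  s = db->IngestExternalFile(files, ifo);
  assert(s.ok());

  delete db;
  return 0;
}
```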

Main changes needed to enable this:

  • ignore CF id mismatch during ingestion
  • ignore the missing external file version table property

The rest of the change is mostly new unit tests.

A previous attempt is in #5602.

Test plan:

  • new unit tests

@cbi42 marked this pull request as ready for review June 11, 2024 04:15
@cbi42 requested a review from ajkr June 11, 2024 04:15
@facebook-github-bot (Contributor): @cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor): @cbi42 has updated the pull request. You must reimport the pull request before landing.

@cbi42 requested a review from jowlyzhang June 11, 2024 19:51
@jowlyzhang (Contributor) left a comment:

Thanks for working on this feature, it's a very useful one. LGTM! Just some minor comments/questions.

// - ignore cf_id mismatch between cf_id in the files and the CF they are
// being ingested into.
// Warning: This ONLY works for SST files where all keys have sequence number
// zero and with no duplicated user keys (this should be guaranteed if the
@jowlyzhang (Contributor):

Not something in this PR, but maybe we can add a boolean flag to the table properties to indicate whether a file has duplicate user keys. That could make it easier to do a sanity check at ingestion time.

@cbi42 (Member Author):

Yes, I'm considering adding the largest seqno to the table properties. For files with no duplicate user keys but non-zero sequence numbers, I think it should work, but checking the largest seqno suffices for now and is easier to reason about.

We may add an API like MoveColumnFamily() that is called on the source CF side so that we can sanity check FileMetaData::fd::largest_seqno.

@jowlyzhang (Contributor):

This MoveColumnFamily API would be used to merge (ingest) one column family's data into another column family in the same DB, right? I wonder how snapshots held by other column families could affect the data preprocessing of the column family being prepared for ingestion. Since it needs to do a full compaction to get rid of duplicate user keys, does that mean all snapshots should be released before this full compaction? And maybe that requirement is not straightforward to guarantee when there are other column families.

@cbi42 (Member Author):

Yeah, snapshots can prevent such keys from being dropped. Maybe the temporary CF should be in a temporary DB.

@cbi42 (Member Author) left a comment:

Thanks for the quick review.

@cbi42 (Member Author) commented Jun 17, 2024

Since the feature is not forward-compatible, I marked the option as experimental for now, in case we want a more explicit opt-in, e.g., introducing MANIFEST versioning and requiring a new MANIFEST version to be set.

@ajkr (Contributor) commented Jul 17, 2024

For now this only works if the SST files being ingested have zero as their largest sequence number
...
add a new field ignore_seqno_in_file to FileMetaData (and persisted in MANIFEST) to let table reader know to use largest sequence number as the global sequence number for reading the file.

Sorry I'm confused. Won't the sequence number be zero regardless of whether we get it from the internal key footer, or from interpreting the largest_seqno as a global sequence number? Is this field meant to prepare for some future scenario where the internal key footer might not have the right sequence number?

@cbi42 (Member Author) commented Jul 18, 2024

Sorry I'm confused. Won't the sequence number be zero regardless of whether we get it from the internal key footer, or from interpreting the largest_seqno as a global sequence number? Is this field meant to prepare for some future scenario where the internal key footer might not have the right sequence number?

Sorry, by largest_seqno, I meant the field FileDescriptor::largest_seqno that is persisted in the MANIFEST. This is the global sequence number that's assigned when the file is ingested:

f.assigned_seqno, false, f.file_temperature, kInvalidBlobFileNumber,
and can be non-zero.

@ajkr (Contributor) commented Jul 18, 2024

Sorry I'm confused. Won't the sequence number be zero regardless of whether we get it from the internal key footer, or from interpreting the largest_seqno as a global sequence number? Is this field meant to prepare for some future scenario where the internal key footer might not have the right sequence number?

Sorry, by largest_seqno, I meant the field FileDescriptor::largest_seqno that is persisted in the MANIFEST. This is the global sequence number that's assigned when the file is ingested:

f.assigned_seqno, false, f.file_temperature, kInvalidBlobFileNumber,

and can be non-zero.

Got it, thanks. I thought we used to have some trick where setting smallest_seqno == largest_seqno == global_seqno would tell RocksDB to apply that seqno to all keys. It seems ok because in the rare case where it happens naturally (e.g., a file with one key), applying largest_seqno as a global seqno is not harmful. Do you think it works here?

@ajkr (Contributor) commented Jul 18, 2024

I thought we used to have some trick, like setting smallest_seqno == largest_seqno == global_seqno would tell RocksDB to apply that seqno to all keys. It seems ok because in the rare case where it happens naturally (e.g., a file with one key), applying largest_seqno as a global seqno is not harmful.

No, wait, that's worse for downgrade compatibility, as it'd silently not interpret the largest_seqno as a global seqno.

Something better for downgrade compatibility could be restricting this feature to ingesting with seqno zero.

@cbi42 (Member Author) commented Jul 19, 2024

No, wait, that's worse for downgrade compatibility, as it'd silently not interpret the largest_seqno as a global seqno.

Something better for downgrade compatibility could be restricting this feature to ingesting with seqno zero.

That makes sense. I guess then we won't need the ignore_seqno_in_file flag since all keys already have seqno 0.

@jowlyzhang (Contributor):

That makes sense. I guess then we won't need the ignore_seqno_in_file flag since all keys already have seqno 0.

BTW, Zippy has some plans to use bulk loading to ingest files from another DB into a column family to restore that column family. https://docs.google.com/document/d/1T7RyAD31zaDP-cPXqW7jKpThjyOI8JtbO05CDCNafxM/edit#heading=h.mygkiilv6ocy

One way that I can think of for them to do this is to do multiple bulk loads, one for each level of the original column family. For this to work, I think we need to support arbitrary sequence numbers from a live DB's files.

@ajkr (Contributor) commented Jul 19, 2024

That makes sense. I guess then we won't need the ignore_seqno_in_file flag since all keys already have seqno 0.

BTW, Zippy has some plans to use bulk loading to ingest files from another DB into a column family to restore that column family. https://docs.google.com/document/d/1T7RyAD31zaDP-cPXqW7jKpThjyOI8JtbO05CDCNafxM/edit#heading=h.mygkiilv6ocy

One way that I can think of for them to do this is to do multiple bulk loads, one for each level of the original column family. For this to work, I think we need to support arbitrary sequence numbers from a live DB's files.

Does CreateColumnFamilyWithImport() work for that use case?

@jowlyzhang (Contributor):

Does CreateColumnFamilyWithImport() work for that use case?

This seems like the closest thing that's already available for their in-place backup use case. I think they just need to do something like creating a new column family by importing the backed-up files to restore, and then dropping the old column family altogether. I will check it more and discuss with them.
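For reference, a rough sketch of that flow, assuming the existing Checkpoint::ExportColumnFamily() and DB::CreateColumnFamilyWithImport() APIs (the column family name and export directory below are just placeholders):

```cpp
// Sketch of an export-then-import restore (assumed workflow, not part of this PR).
#include <cassert>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/utilities/checkpoint.h"

void BackupAndRestoreCF(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* old_cf) {
  // Export the column family's SST files plus metadata to a backup directory.
  rocksdb::Checkpoint* checkpoint = nullptr;
  rocksdb::Status s = rocksdb::Checkpoint::Create(db, &checkpoint);
  assert(s.ok());
  rocksdb::ExportImportFilesMetaData* metadata = nullptr;
  s = checkpoint->ExportColumnFamily(old_cf, "/tmp/cf_backup", &metadata);
  assert(s.ok());

  // Restore by importing the exported files into a fresh column family.
  rocksdb::ColumnFamilyHandle* new_cf = nullptr;
  rocksdb::ImportColumnFamilyOptions import_opts;
  import_opts.move_files = false;  // copy the backup files instead of moving them
  s = db->CreateColumnFamilyWithImport(rocksdb::ColumnFamilyOptions(), "restored_cf",
                                       import_opts, *metadata, &new_cf);
  assert(s.ok());

  // Drop the old column family once the restored one is in place.
  s = db->DropColumnFamily(old_cf);
  assert(s.ok());

  delete metadata;
  delete checkpoint;
}
```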

@facebook-github-bot (Contributor): @cbi42 merged this pull request in 4384dd5.

@wwl2755 commented Aug 13, 2024

Hi @cbi42,

This feature looks quite interesting and promising in terms of performance optimization. I hope you don't mind if I have a couple of questions:

Is this feature already fully implemented in the pull request or is it just a prototype? It seems to "allow" users to ingest live SST files, but when I attempted to do so, the keys still could not be read in the new instance/column family, because a normal live SST usually has key sequence numbers larger than 0.

Additionally, I tried to implement this on my own but still encountered issues reading the ingested keys, even after unifying the SST file version and key sequence number. Could there be any restrictions on the reader side, such as a difference in session ID, that might be preventing the keys from being read?

Thank you for your time and assistance!

@cbi42 (Member Author) commented Aug 19, 2024

Hi @wwl2755.

Is this feature already fully implemented in the pull request or is it just a prototype?

This is implemented. It's in "EXPERIMENTAL" state, but should be usable.

when I attempted to do so, the keys still could not be read in the new instance/column family

This is not expected. The unit tests have several example usages. Did the ingestion return OK status?

Could there be any restrictions on the reader side, such as a difference in session ID, that might be preventing the keys from being read?

I don't think there are such restrictions. The main challenge in the read path is using a global sequence number for reading these ingested files, even though keys in such files do not carry this sequence number. Requiring all keys to have sequence number 0 makes the read path (like seeking into data blocks) easier. Also, we have the assumption that all keys have a unique user_key@seqno, so ingested files should not have duplicate user keys, since they would all be assigned the same sequence number.
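For what it's worth, a rough sketch of how the source side could satisfy that seqno-0 requirement (my assumption based on this discussion, not code from the PR): fully compact the source DB with no snapshots held so that bottommost-level keys get sequence number 0, then verify largest_seqno == 0 on the live files before handing them to IngestExternalFile().

```cpp
// Source-side preparation sketch (assumed workflow; not from this PR).
#include <cassert>
#include <string>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

std::vector<std::string> PrepareFilesForIngestion(rocksdb::DB* source_db) {
  // Compact everything to the bottommost level. With no snapshots held, keys
  // in the bottommost-level output files get their sequence numbers zeroed.
  rocksdb::CompactRangeOptions cro;
  cro.bottommost_level_compaction =
      rocksdb::BottommostLevelCompaction::kForceOptimized;
  rocksdb::Status s = source_db->CompactRange(cro, nullptr, nullptr);
  assert(s.ok());

  // Collect the live SST files and confirm each one satisfies the
  // largest_seqno == 0 requirement described above.
  std::vector<rocksdb::LiveFileMetaData> metadata;
  source_db->GetLiveFilesMetaData(&metadata);
  std::vector<std::string> files;
  for (const auto& meta : metadata) {
    assert(meta.largest_seqno == 0);
    files.push_back(meta.db_path + meta.name);
  }
  return files;
}
```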
