
Support ingesting SST files generated by a live DB #12750

Closed
wants to merge 21 commits

Conversation

@cbi42 (Member) commented Jun 10, 2024

Summary: ... to enable use cases like using RocksDB to merge sort data for ingestion. A new file ingestion option IngestExternalFileOptions::allow_db_generated_files is introduced to allow users to ingest SST files generated by live DBs instead of by SstFileWriter. For now this only works if the SST files being ingested have zero as their largest sequence number AND do not overlap with any data in the DB (so we can assign seqno 0, which matches the seqno of all ingested keys).

The option is marked as experimental for now.
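A minimal usage sketch of the ingest side (my assumption based on the summary above; the paths are illustrative, and only IngestExternalFileOptions::allow_db_generated_files itself comes from this PR):

```cpp
// Minimal sketch of ingesting DB-generated SST files (assumed usage; the
// paths and file names below are placeholders).
#include <cassert>
#include <string>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/target_db", &db);
  assert(s.ok());

  // SST files copied from a live DB (not produced by SstFileWriter). All keys
  // must have sequence number 0 and must not overlap existing data in this DB.
  std::vector<std::string> files = {"/tmp/source_db/000012.sst"};

  rocksdb::IngestExternalFileOptions ifo;
  ifo.allow_db_generated_files = true;  // new EXPERIMENTAL option from this PR
  s = db->IngestExternalFile(files, ifo);
  assert(s.ok());

  delete db;
  return 0;
}
```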

Main changes needed to enable this:

  • ignore CF id mismatch during ingestion
  • ignore the missing external file version table property

The rest of the change is mostly new unit tests.

A previous attempt is in #5602.

Test plan:

  • new unit tests

@cbi42 marked this pull request as ready for review June 11, 2024 04:15
@cbi42 requested a review from ajkr June 11, 2024 04:15
@facebook-github-bot (Contributor): @cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor): @cbi42 has updated the pull request. You must reimport the pull request before landing.

@cbi42 requested a review from jowlyzhang June 11, 2024 19:51
@jowlyzhang (Contributor) left a comment:

Thanks for working on this feature, it's a very useful one. LGTM! Just some minor comments/questions.

// - ignore cf_id mismatch between cf_id in the files and the CF they are
// being ingested into.
// Warning: This ONLY works for SST files where all keys have sequence number
// zero and with no duplicated user keys (this should be guaranteed if the
@jowlyzhang (Contributor):

Not something in this PR, but maybe we can add a boolean flag to the table properties to indicate whether a file has duplicate user keys. That could make it easier to do a sanity check at ingestion time.

@cbi42 (Member Author):

Yes, I'm considering adding the largest seqno to the table properties. For files with no duplicate user keys but non-zero sequence numbers, I think it should work, but checking the largest seqno suffices for now and is easier to reason about.

We may add an API like MoveColumnFamily() that is called on the source CF side so that we can sanity check FileMetaData::fd::largest_seqno.

@jowlyzhang (Contributor):

This MoveColumnFamily API would be used to merge (ingest) one column family's data into another column family in the same DB, right? I wonder how snapshots held by other column families could affect the data preprocessing of the column family being prepared for ingestion. Since it needs to do a full compaction to get rid of duplicate user keys, does that mean all snapshots should be released before this full compaction? And maybe that requirement is not straightforward to guarantee when there are other column families.

@cbi42 (Member Author):

Yeah, snapshots can prevent such keys from being dropped. Maybe the temporary CF should be in a temporary DB.

@cbi42 (Member Author) left a comment:

Thanks for the quick review.

@cbi42 (Member Author) commented Jun 17, 2024

Since the feature is not forward-compatible, I marked the option as experimental for now, in case we want a more explicit opt-in, e.g., introducing MANIFEST versioning and requiring a new MANIFEST version to be set.

@ajkr (Contributor) commented Jul 17, 2024

For now this only works if the SST files being ingested have zero as their largest sequence number
...
add a new field ignore_seqno_in_file to FileMetaData (and persisted in MANIFEST) to let table reader know to use largest sequence number as the global sequence number for reading the file.

Sorry I'm confused. Won't the sequence number be zero regardless of whether we get it from the internal key footer, or from interpreting the largest_seqno as a global sequence number? Is this field meant to prepare for some future scenario where the internal key footer might not have the right sequence number?

@cbi42 (Member Author) commented Jul 18, 2024

Sorry I'm confused. Won't the sequence number be zero regardless of whether we get it from the internal key footer, or from interpreting the largest_seqno as a global sequence number? Is this field meant to prepare for some future scenario where the internal key footer might not have the right sequence number?

Sorry, by largest_seqno, I meant the field FileDescriptor::largest_seqno that is persisted in the MANIFEST. This is the global sequence number that's assigned when the file is ingested:

f.assigned_seqno, false, f.file_temperature, kInvalidBlobFileNumber,
and can be non-zero.

@ajkr (Contributor) commented Jul 18, 2024

Sorry I'm confused. Won't the sequence number be zero regardless of whether we get it from the internal key footer, or from interpreting the largest_seqno as a global sequence number? Is this field meant to prepare for some future scenario where the internal key footer might not have the right sequence number?

Sorry, by largest_seqno, I meant the field FileDescriptor::largest_seqno that is persisted in the MANIFEST. This is the global sequence number that's assigned when the file is ingested:

f.assigned_seqno, false, f.file_temperature, kInvalidBlobFileNumber,

and can be non-zero.

Got it, thanks. I thought we used to have some trick where setting smallest_seqno == largest_seqno == global_seqno would tell RocksDB to apply that seqno to all keys. It seems ok because in the rare case where it happens naturally (e.g., a file with one key), applying largest_seqno as a global seqno is not harmful. Do you think it works here?

@ajkr (Contributor) commented Jul 18, 2024

I thought we used to have some trick, like setting smallest_seqno == largest_seqno == global_seqno would tell RocksDB to apply that seqno to all keys. It seems ok because in the rare case where it happens naturally (e.g., a file with one key), applying largest_seqno as a global seqno is not harmful.

No, wait, that's worse for downgrade compatibility, as it'd silently not interpret the largest_seqno as a global seqno.

Something better for downgrade compatibility could be restricting this feature to ingesting with seqno zero.

@cbi42 (Member Author) commented Jul 19, 2024

No, wait, that's worse for downgrade compatibility, as it'd silently not interpret the largest_seqno as a global seqno.

Something better for downgrade compatibility could be restricting this feature to ingesting with seqno zero.

That makes sense. I guess then we won't need the ignore_seqno_in_file flag since all keys already have seqno 0.

@jowlyzhang (Contributor):

That makes sense. I guess then we won't need the ignore_seqno_in_file flag since all keys already have seqno 0.

BTW, Zippy has some plans to use bulk loading to ingest files from another DB into a column family to restore that column family. https://docs.google.com/document/d/1T7RyAD31zaDP-cPXqW7jKpThjyOI8JtbO05CDCNafxM/edit#heading=h.mygkiilv6ocy

One way that I can think of for them to do this is to do multiple bulk loads, one for each level of the original column family. For this to work, I think we need to support arbitrary sequence numbers from a live DB's files.

@ajkr (Contributor) commented Jul 19, 2024

That makes sense. I guess then we won't need the ignore_seqno_in_file flag since all keys already have seqno 0.

BTW, Zippy has some plans to use bulk loading to ingest files from another DB into a column family to restore that column family. https://docs.google.com/document/d/1T7RyAD31zaDP-cPXqW7jKpThjyOI8JtbO05CDCNafxM/edit#heading=h.mygkiilv6ocy

One way that I can think of for them to do this is to do multiple bulk loads, one for each level of the original column family. For this to work, I think we need to support arbitrary sequence numbers from a live DB's files.

Does CreateColumnFamilyWithImport() work for that use case?

@jowlyzhang (Contributor):

Does CreateColumnFamilyWithImport() work for that use case?

This seems like the closest thing that's already available for their in-place backup use case. I think they just need to do something like creating a new column family by importing the backed-up files to restore, and then dropping the old column family altogether. I will check it more and discuss with them.
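For reference, a rough sketch of that flow, assuming the existing Checkpoint::ExportColumnFamily() and DB::CreateColumnFamilyWithImport() APIs (the column family name and export directory below are just placeholders):

```cpp
// Sketch of an export-then-import restore (assumed workflow, not part of this PR).
#include <cassert>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/utilities/checkpoint.h"

void BackupAndRestoreCF(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* old_cf) {
  // Export the column family's SST files plus metadata to a backup directory.
  rocksdb::Checkpoint* checkpoint = nullptr;
  rocksdb::Status s = rocksdb::Checkpoint::Create(db, &checkpoint);
  assert(s.ok());
  rocksdb::ExportImportFilesMetaData* metadata = nullptr;
  s = checkpoint->ExportColumnFamily(old_cf, "/tmp/cf_backup", &metadata);
  assert(s.ok());

  // Restore by importing the exported files into a fresh column family.
  rocksdb::ColumnFamilyHandle* new_cf = nullptr;
  rocksdb::ImportColumnFamilyOptions import_opts;
  import_opts.move_files = false;  // copy the backup files instead of moving them
  s = db->CreateColumnFamilyWithImport(rocksdb::ColumnFamilyOptions(), "restored_cf",
                                       import_opts, *metadata, &new_cf);
  assert(s.ok());

  // Drop the old column family once the restored one is in place.
  s = db->DropColumnFamily(old_cf);
  assert(s.ok());

  delete metadata;
  delete checkpoint;
}
```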

@facebook-github-bot (Contributor): @cbi42 merged this pull request in 4384dd5.

@wwl2755 commented Aug 13, 2024

Hi @cbi42,

This feature looks quite interesting and promising in terms of performance optimization. I hope you don't mind if I have a couple of questions:

Is this feature already fully implemented in the pull request or is it just a prototype? It seems to "allow" users to ingest live SST files, but when I attempted to do so, the keys still could not be read in the new instance/column family, because a normal live SST usually has key sequence numbers larger than 0.

Additionally, I tried to implement this on my own but still encountered issues reading the ingested keys, even after unifying the SST file version and key sequence number. Could there be any restrictions on the reader side, such as a difference in session ID, that might be preventing the keys from being read?

Thank you for your time and assistance!

@cbi42 (Member Author) commented Aug 19, 2024

Hi @wwl2755.

Is this feature already fully implemented in the pull request or is it just a prototype?

This is implemented. It's in "EXPERIMENTAL" state, but should be usable.

when I attempted to do so, the keys still could not be read in the new instance/column family

This is not expected. The unit tests have several example usages. Did the ingestion return OK status?

Could there be any restrictions on the reader side, such as a difference in session ID, that might be preventing the keys from being read?

I don't think there are such restrictions. The main challenge in the read path is using a global sequence number for reading these ingested files, even though keys in such files do not carry this sequence number. Requiring all keys to have sequence number 0 makes the read path (like seeking into data blocks) easier. Also, we have the assumption that all keys have a unique user_key@seqno, so ingested files should not have duplicate user keys, since they would all be assigned the same sequence number.
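For what it's worth, a rough sketch of how the source side could satisfy that seqno-0 requirement (my assumption based on this discussion, not code from the PR): fully compact the source DB with no snapshots held so that bottommost-level keys get sequence number 0, then verify largest_seqno == 0 on the live files before handing them to IngestExternalFile().

```cpp
// Source-side preparation sketch (assumed workflow; not from this PR).
#include <cassert>
#include <string>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

std::vector<std::string> PrepareFilesForIngestion(rocksdb::DB* source_db) {
  // Compact everything to the bottommost level. With no snapshots held, keys
  // in the bottommost-level output files get their sequence numbers zeroed.
  rocksdb::CompactRangeOptions cro;
  cro.bottommost_level_compaction =
      rocksdb::BottommostLevelCompaction::kForceOptimized;
  rocksdb::Status s = source_db->CompactRange(cro, nullptr, nullptr);
  assert(s.ok());

  // Collect the live SST files and confirm each one satisfies the
  // largest_seqno == 0 requirement described above.
  std::vector<rocksdb::LiveFileMetaData> metadata;
  source_db->GetLiveFilesMetaData(&metadata);
  std::vector<std::string> files;
  for (const auto& meta : metadata) {
    assert(meta.largest_seqno == 0);
    files.push_back(meta.db_path + meta.name);
  }
  return files;
}
```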
