Add dropOutOfOrderRecord config to drop out-of-order events #11811

Merged: 4 commits into apache:master on Oct 25, 2023

Conversation

tibrewalpratik17 (Contributor)

Labels: feature, upsert, release-notes

Related to issue: #11796

This patch adds a new upsert config, dropOutOfOrderRecord, which, when set to true, does not persist out-of-order events in the segment. This saves disk usage and also avoids any confusion when using skipUpsert.
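
For reference, a minimal sketch of enabling the new flag programmatically is shown below. It is not taken from this PR's diff: UpsertConfig is the existing Pinot SPI class, and the setter name setDropOutOfOrderRecord is assumed from the config key in the title.

import org.apache.pinot.spi.config.table.UpsertConfig;

public class DropOutOfOrderRecordExample {
  public static void main(String[] args) {
    // Full upsert mode for the table.
    UpsertConfig upsertConfig = new UpsertConfig(UpsertConfig.Mode.FULL);
    // Assumed setter derived from the dropOutOfOrderRecord config key:
    // when true, out-of-order events are not persisted in the consuming segment.
    upsertConfig.setDropOutOfOrderRecord(true);
  }
}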

codecov-commenter commented Oct 15, 2023

Codecov Report

Merging #11811 (3a5c650) into master (36be75f) will decrease coverage by 12.89%.
Report is 10 commits behind head on master.
The diff coverage is 88.23%.

@@              Coverage Diff              @@
##             master   #11811       +/-   ##
=============================================
- Coverage     63.16%   50.27%   -12.89%     
+ Complexity     1141        6     -1135     
=============================================
  Files          2343     2267       -76     
  Lines        126420   122690     -3730     
  Branches      19442    18886      -556     
=============================================
- Hits          79849    61686    -18163     
- Misses        40900    56878    +15978     
+ Partials       5671     4126     -1545     
Flag                  Coverage Δ
custom-integration1   <0.01% <0.00%> (ø)
integration           <0.01% <0.00%> (ø)
integration1          <0.01% <0.00%> (ø)
integration2          0.00% <0.00%> (ø)
java-11               50.21% <88.23%> (-12.92%) ⬇️
java-17               50.12% <88.23%> (-12.91%) ⬇️
java-20               50.13% <88.23%> (-12.89%) ⬇️
temurin               50.27% <88.23%> (-12.89%) ⬇️
unittests             67.36% <88.23%> (+4.21%) ⬆️
unittests1            67.36% <88.23%> (+0.02%) ⬆️
unittests2            ?

Flags with carried forward coverage won't be shown.

Files                                                    Coverage Δ
...cal/upsert/BasePartitionUpsertMetadataManager.java    40.78% <100.00%> (+0.16%) ⬆️
...t/local/upsert/BaseTableUpsertMetadataManager.java    76.76% <100.00%> (+0.23%) ⬆️
...t/ConcurrentMapPartitionUpsertMetadataManager.java    85.21% <100.00%> (+0.21%) ⬆️
...psert/ConcurrentMapTableUpsertMetadataManager.java    33.33% <ø> (ø)
...rg/apache/pinot/spi/config/table/UpsertConfig.java    86.53% <100.00%> (+0.82%) ⬆️
...local/indexsegment/mutable/MutableSegmentImpl.java    67.80% <60.00%> (-0.12%) ⬇️

... and 497 files with indirect coverage changes


xiangfu0 added the feature, release-notes, and upsert labels on Oct 16, 2023
tibrewalpratik17 changed the title from "Add upsert config - dropOutOfOrderRecord to drop out-of-order events" to "Add upsert config - dropOutOfOrderRecord config to drop out-of-order events" on Oct 16, 2023
tibrewalpratik17 changed the title from "Add upsert config - dropOutOfOrderRecord config to drop out-of-order events" to "Add dropOutOfOrderRecord config to drop out-of-order events" on Oct 16, 2023
canTakeMore = numDocsIndexed++ < _capacity;
_partitionUpsertMetadataManager.addRecord(this, recordInfo);
// if record doesn't need to be dropped, then persist in segment and update metadata hashmap
if (!_partitionUpsertMetadataManager.shouldDropRecord(recordInfo)) {
Contributor:

This check is already performed in doAddRecord(). I suggest modifying addRecord() to return a boolean indicating whether the record should be kept. Then we could change this part to:

Suggested change:
- if (!_partitionUpsertMetadataManager.shouldDropRecord(recordInfo)) {
+ if (_partitionUpsertMetadataManager.addRecord(this, recordInfo)) {
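
To make the suggested contract concrete, here is a small self-contained toy sketch. It is not Pinot code, and all class, field, and parameter names are invented for illustration: addRecord(...) updates the metadata map and returns whether the record should be kept, so the caller persists the row only when it returns true.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy sketch of the suggested contract: addRecord(...) updates the metadata map and
// returns true when the record should be persisted in the segment.
class ToyPartitionUpsertMetadataManager {
  private final Map<String, Long> _keyToComparisonValue = new ConcurrentHashMap<>();
  private final boolean _dropOutOfOrderRecord;

  ToyPartitionUpsertMetadataManager(boolean dropOutOfOrderRecord) {
    _dropOutOfOrderRecord = dropOutOfOrderRecord;
  }

  boolean addRecord(String primaryKey, long comparisonValue) {
    Long current = _keyToComparisonValue.get(primaryKey);
    boolean outOfOrder = current != null && comparisonValue < current;
    if (!outOfOrder) {
      // In-order record: point the metadata map at the new record.
      _keyToComparisonValue.put(primaryKey, comparisonValue);
    }
    // Out-of-order records are dropped only when dropOutOfOrderRecord is enabled.
    return !(outOfOrder && _dropOutOfOrderRecord);
  }
}

The call site then mirrors the suggested change above: the row is written to the segment only inside if (metadataManager.addRecord(...)) { ... }.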

Contributor Author:

Hey @Jackie-Jiang, at present _partitionUpsertMetadataManager.addRecord is called after addNewRow. Do you foresee any issues with changing the order here?

Contributor Author:

One scenario I can think of where this might cause an issue: we add a record to the metadata hashmap, but addNewRow then fails due to some indexing issue. At present, addNewRow throws an exception, which also skips the _partitionUpsertMetadataManager.addRecord call, which makes sense.

Contributor:

I checked and there should be no issue changing the order because the row is not added to the metadata manager if it is out of order.

Contributor Author:

> I checked and there should be no issue changing the order because the row is not added to the metadata manager if it is out of order.

What about a row which is not out of order? It gets added to the metadata manager, but the addNewRow call fails due to some issue. Will that not cause problems when further updates are received for that record?

Contributor:

If it throws an exception, _numDocsIndexed won't be updated in either case, and we will end up re-indexing the same docId. That needs to be solved separately, but very good point!

Contributor Author (tibrewalpratik17, Oct 16, 2023):

I think there can be a scenario something like this:

  • Key: A, Current Segment: S1, Valid Doc ID: 2, and say the addNewRow call fails for this record.
    MetadataMap: A -> RecordLocation (S1, 2)
  • Key: B, Current Segment: S1, Valid Doc ID: 2 (B also gets doc ID 2 because _numDocsIndexed didn't update).
    MetadataMap: A -> RecordLocation (S1, 2), B -> RecordLocation (S1, 2)
  • Now say we receive a record for key A again after some time, while we are processing segment S2.
    In that scenario, we will invoke a removeDocID command for (S1, 2), which will invalidate the doc ID for B and stop returning it in queries.

Did I miss anything? 😅
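
Below is a small self-contained toy (not Pinot code; the string encoding of record locations and all names are invented) that walks through the three bullets above and shows how B's record location ends up pointing at an invalidated doc id.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy reproduction of the scenario: when the doc id counter is not advanced after a failed
// addNewRow, keys A and B share doc id 2 in segment S1, and a later update for A invalidates
// the doc id that B still depends on.
class ToyDocIdCollision {
  public static void main(String[] args) {
    Map<String, String> metadataMap = new HashMap<>();  // key -> "segment:docId"
    Set<String> validDocIds = new HashSet<>();          // "segment:docId" entries
    int numDocsIndexed = 2;

    // Key A: metadata map updated, but the (pretend) addNewRow call fails,
    // so numDocsIndexed stays at 2.
    metadataMap.put("A", "S1:" + numDocsIndexed);
    validDocIds.add("S1:" + numDocsIndexed);

    // Key B: reuses doc id 2 because the counter never advanced.
    metadataMap.put("B", "S1:" + numDocsIndexed);
    validDocIds.add("S1:" + numDocsIndexed);  // same entry; B never gets a slot of its own

    // Later update for key A while consuming segment S2: the old location (S1, 2) is removed.
    String oldLocation = metadataMap.put("A", "S2:0");
    validDocIds.remove(oldLocation);
    validDocIds.add("S2:0");

    // B's record location still points at S1:2, but that doc id is no longer valid.
    System.out.println("B -> " + metadataMap.get("B")
        + ", still valid? " + validDocIds.contains(metadataMap.get("B")));  // prints false
  }
}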

Contributor:

The analysis is correct. What I meant is that currently this is not handled correctly anyway, and we should solve that problem in a separate PR. Assuming addNewRow() failed, the dictionary was already updated, and the first several columns might also have been added already. Currently it works because that call never fails. The proper fix should ensure that even if addNewRow() throws an exception, we continue with the other columns (e.g. put a default value for the failed column) and increase the doc id anyway.
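
A rough sketch of that idea follows; it is a toy, not the actual MutableSegmentImpl, and the per-column failure mode and helper names are invented. The point is that a failed column falls back to a default value and the doc id counter advances unconditionally, so later rows never reuse the doc id.

import java.util.LinkedHashMap;
import java.util.Map;

// Toy sketch of the proposed behavior: a per-column failure is absorbed with a default
// value so the row still occupies its doc id and the counter always moves forward.
class ToyRowIndexer {
  private int _numDocsIndexed = 0;
  private final Map<Integer, Map<String, Object>> _docs = new LinkedHashMap<>();

  int indexRow(Map<String, Object> row, Map<String, Object> defaultValues) {
    int docId = _numDocsIndexed;
    Map<String, Object> indexedValues = new LinkedHashMap<>();
    for (Map.Entry<String, Object> entry : row.entrySet()) {
      try {
        indexedValues.put(entry.getKey(), indexColumnValue(entry.getValue()));
      } catch (RuntimeException e) {
        // Instead of aborting the whole row, store the column's default value.
        indexedValues.put(entry.getKey(), defaultValues.get(entry.getKey()));
      }
    }
    _docs.put(docId, indexedValues);
    // Advance unconditionally so a later row can never collide with this doc id.
    _numDocsIndexed++;
    return docId;
  }

  /** Pretend per-column indexing step that can fail (e.g. on a null value). */
  private Object indexColumnValue(Object value) {
    if (value == null) {
      throw new IllegalArgumentException("cannot index null value");
    }
    return value;
  }
}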

Contributor Author:

Sounds good to me! Updated the patch based on this.

tibrewalpratik17 (Contributor Author):

Hey @Jackie-Jiang, I have updated the patch based on your suggestions. Can you review it once more?

xiangfu0 merged commit 91bba48 into apache:master on Oct 25, 2023
19 checks passed