Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for murmur3 as a partition function #12049

Merged
merged 12 commits into from
Dec 5, 2023

Conversation

aishikbh
Copy link
Contributor

@aishikbh aishikbh commented Nov 27, 2023

Currently pinot supports murmur2 as a partition function. This PR adds support for murmur3 as a partition function.

Release Notes

  • Added support for Murmur3 as a partition function.
  • Added support for adding optional field seed and variant for the hash in functionConfig field of columnPartitionMap for Murmur3.
  • default value for seed is 0.
  • Added support for 2 variants of Murmur3: x86_32 and x64_32 configurable using the variant field in functionConfig
    • x86_32 is Guava's version of the original implementation of Murmur3 32 bit hash for x86 architecture.
    • x64_32 was not part of the original implementation but is a variant which reduces the bits from x64_128 variant of the implementation. I have taken this implementation from Infinispan which is also used by Debezium.
    • If no variant is provided we choose to keep the x86_32 variant as it was part of the original implementation.
  • Examples of functionConfig;
    "tableIndexConfig": {
          ..
          "segmentPartitionConfig": {
            "columnPartitionMap": {
              "memberId": {
                "functionName": "Murmur3",
                "numPartitions": 3 
              },
              ..
            }
          }
    

Here there is no functionConfig configured, so the seed value will be 0 and variant will be x86_32.

```   
"tableIndexConfig": {
      ..
      "segmentPartitionConfig": {
        "columnPartitionMap": {
          "memberId": {
            "functionName": "Murmur3",
            "numPartitions": 3,
            "functionConfig": {
               "seed": "9001"
             },
          },
          ..
        }
      }
```

Here the seed is configured as 9001 but as no variant is provided, x86_32 will be picked up.

```   
"tableIndexConfig": {
      ..
      "segmentPartitionConfig": {
        "columnPartitionMap": {
          "memberId": {
            "functionName": "Murmur3",
            "numPartitions": 3,
            "functionConfig" :{
               "seed": "9001"
               "variant": "x64_32"
             },
          },
          ..
        }
      }
```

Here the variant is mentioned so Murmur3 will use the x64_32 variant with 9001 as seed.

  • Note on users using Debezium and Murmur3 as partitioning function :
    • The partitioning key should be set up on either of byte[], String or long[] columns.
    • On pinot variant should be set as x64_32 and seed should be set as 9001.

@aishikbh aishikbh changed the title Add support for murmur3 as partition function. Add support for murmur3 as partition function Nov 27, 2023
@codecov-commenter
Copy link

codecov-commenter commented Nov 27, 2023

Codecov Report

Attention: Patch coverage is 86.36364% with 27 lines in your changes missing coverage. Please review.

Project coverage is 61.66%. Comparing base (b25f7cf) to head (8101c43).
Report is 1293 commits behind head on master.

Files with missing lines Patch % Lines
...egment/spi/partition/Murmur3PartitionFunction.java 86.22% 20 Missing and 7 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #12049      +/-   ##
============================================
+ Coverage     61.63%   61.66%   +0.03%     
- Complexity     1151     1152       +1     
============================================
  Files          2389     2390       +1     
  Lines        129839   130036     +197     
  Branches      20085    20110      +25     
============================================
+ Hits          80020    80181     +161     
- Misses        43996    44023      +27     
- Partials       5823     5832       +9     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 61.61% <86.36%> (+0.03%) ⬆️
java-21 61.53% <86.36%> (+0.03%) ⬆️
skip-bytebuffers-false 61.63% <86.36%> (+0.03%) ⬆️
skip-bytebuffers-true 61.51% <86.36%> (+0.02%) ⬆️
temurin 61.66% <86.36%> (+0.03%) ⬆️
unittests 61.65% <86.36%> (+0.03%) ⬆️
unittests1 46.97% <86.36%> (+0.06%) ⬆️
unittests2 27.58% <0.00%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@aishikbh aishikbh force-pushed the addMurmur3PartitionFunctionv2 branch from c59fa5f to a604d31 Compare November 27, 2023 09:22
@aishikbh aishikbh changed the title Add support for murmur3 as partition function Add support for murmur3 as a partition function Nov 27, 2023
snleee
snleee previously approved these changes Nov 27, 2023
@snleee snleee dismissed their stale review November 27, 2023 19:14

i approved for testing a feature


@Override
public int getPartition(Object value) {
return (Hashing.murmur3_32_fixed(_hashSeed).hashBytes(value.toString().getBytes(UTF_8)).asInt() & Integer.MAX_VALUE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks that murmur3 has murmur3_32 / murmur3_128. Can we add TODO comment to add 128 bit support when needed in the future?

* </ul>
*/
@Test
public void testMurmur3Partitioner() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add the test comparing against some hand computed values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

* Implementation of {@link PartitionFunction} which partitions based on 32 bit murmur3 hash
*/
public class Murmur3PartitionFunction implements PartitionFunction {
private static final String NAME = "Murmur3";
Copy link
Contributor

@walterddr walterddr Nov 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interestingly we already have MURMUR3 in

* distributed with this work for additional information

i wonder where the impl is and if we can reuse :-/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see that we used the same implementation but for 128 bits here. So chose to use the same one, but the 32_x86 variant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Also Guava already provides the murmur3 implementation (used in HashUtils), do we need to copy it from another library?

Copy link
Contributor

@snleee snleee Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to add the support murmur3 that Debezium uses as default. Unfortunately, Debezium is using a special variant x64_32 that is not available for many libraries including Guava and Apache. I think that @aishikbh had to copy x64_32 variant implementation from infinispan that Debezium is directly depend upon
https://github.com/infinispan/infinispan/blob/main/commons/all/src/main/java/org/infinispan/commons/hash/MurmurHash3.java

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@snleee summarised it perfectly.

Just to add : murmur3 has 3 variants as part of the original implementation : x86 32 bits, x64 128 bits and x86 128 bits and Guava supports these implementations. Since we are generating partition numbers x86 32 bits should cover most of the use cases.

The issue comes from the users of Debezium. Debezium uses a variant of the x64 128 version reduced to 32 bits. I wanted to add that as well, as some users might want to use Debezium + Murmur3. I have kept the x86 32 bits variant as the default as that is part of the original implementation and x64 32 bits as the configurable one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@Jackie-Jiang Jackie-Jiang added feature documentation release-notes Referenced by PRs that need attention when compiling the next release notes labels Nov 30, 2023
@aishikbh aishikbh force-pushed the addMurmur3PartitionFunctionv2 branch from 148323c to b350dc1 Compare December 1, 2023 08:34
@snleee
Copy link
Contributor

snleee commented Dec 1, 2023

@aishikbh For release note, can you add some example configuration for hash function (including the seed)

@aishikbh
Copy link
Contributor Author

aishikbh commented Dec 1, 2023

@aishikbh For release note, can you add some example configuration for hash function (including the seed)

sure I will do that. I plan to add to the documentation as well.

@aishikbh aishikbh force-pushed the addMurmur3PartitionFunctionv2 branch 3 times, most recently from 94b9464 to 4203546 Compare December 4, 2023 12:09
Copy link
Contributor

@snleee snleee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we follow up on Jackie/Rong's question?

LGTM otherwise.

@Override
public int getPartition(Object value) {
if (_variant.equals("x86_32")) {
return (murmurHash332BitsX86(value.toString().getBytes(UTF_8), _hashSeed) & Integer.MAX_VALUE) % _numPartitions;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

murmur3Hash32BitsX64?

Copy link
Contributor Author

@aishikbh aishikbh Dec 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense. This name is better for readability.

@aishikbh aishikbh force-pushed the addMurmur3PartitionFunctionv2 branch from 968b3f8 to 8101c43 Compare December 5, 2023 08:12
@snleee
Copy link
Contributor

snleee commented Dec 5, 2023

@walterddr @Jackie-Jiang Are we good with this PR? @aishikbh answered on your question on why not reusing the function from guava lib. If there's no further question, I will merge this.

@Jackie-Jiang Jackie-Jiang merged commit 49d0ff0 into apache:master Dec 5, 2023
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation feature ingestion release-notes Referenced by PRs that need attention when compiling the next release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants