Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-8739: custom codec providing Zstandard compression/decompression #439

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

praveennish
Copy link

@praveennish praveennish commented Nov 15, 2021

Description

#9784

Lucene currently supports LZ4 and Zlib compression/decompression for StoredFieldsFormat. We propose Zstandard (https://facebook.github.io/zstd/) compression/decompression for StoredFieldsFormat for the following reasons:

Solution

  • Developed a custom codec to enable Zstandard compression/decompression support.
  • This custom codec also provides flexibility to add any other compression algorithms.

Tests

  • Added required test case for testing new custom codec. Ran with option -Dtests.codec for newly developed custom codec

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.

/** Stored field format used by plugaable codec */
public class Lucene90CustomCompressionStoredFieldsFormat extends StoredFieldsFormat {
// chunk size for zstandard
private static final int ZSTD_BLOCK_LENGTH = 10 * 8 * 1024;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use bigger blocks, like for BEST_COMPRESSION.

Suggested change
private static final int ZSTD_BLOCK_LENGTH = 10 * 8 * 1024;
private static final int ZSTD_BLOCK_LENGTH = 10 * 48 * 1024;

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to 48KB in PR

private static final int NUM_SUB_BLOCKS = 10;
private final int level;
public static final int defaultLevel = 3;
private static final int DICT_SIZE_FACTOR = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do 6 like for BEST_COMPRESSION?

Suggested change
private static final int DICT_SIZE_FACTOR = 2;
private static final int DICT_SIZE_FACTOR = 6;

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to 6 in PR

switch (mode) {
case ZSTD:
return new Lucene90CompressingStoredFieldsFormat(
"CustomCompressionStoredFieldsZSTD", ZSTD_MODE_NO_DICT, ZSTD_BLOCK_LENGTH, 1024, 10);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4096 docs per chunk at most, like for BEST_COMPRESSION

Suggested change
"CustomCompressionStoredFieldsZSTD", ZSTD_MODE_NO_DICT, ZSTD_BLOCK_LENGTH, 1024, 10);
"CustomCompressionStoredFieldsZSTD", ZSTD_MODE_NO_DICT, ZSTD_BLOCK_LENGTH, 4096, 10);

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to 4096 in PR


case ZSTD_DICT:
return new Lucene90CompressingStoredFieldsFormat(
"CustomCompressionStoredFieldsZSTD", ZSTD_MODE_DICT, ZSTD_BLOCK_LENGTH, 1024, 10);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"CustomCompressionStoredFieldsZSTD", ZSTD_MODE_DICT, ZSTD_BLOCK_LENGTH, 1024, 10);
"CustomCompressionStoredFieldsZSTD", ZSTD_MODE_DICT, ZSTD_BLOCK_LENGTH, 4096, 10);

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to 4096


final Lucene90CustomCompressionCodec.Mode mode;

private static int compressionLevel;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not make it static?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed static

moduleImplementation project(':lucene:core')
moduleImplementation 'com.github.luben:zstd-jni:1.5.0-4'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just minor suggestion:
current version is 1.5.1-1
could bring another few percent compression speed
https://github.com/facebook/zstd/releases/tag/v1.5.1

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Tobias,
I have tested 1.5.0-X and the data looks to be very promising with the current version. I have not tested 1.5.1-X JNI version but https://github.com/facebook/zstd/releases/tag/v1.5.1 looks good and can be picked up in the future since Zstd is very dynamic and we may see a lot of improvements over the current version!

Thanks!!

@ssigut
Copy link

ssigut commented Apr 19, 2022

Is this PR going to be merged? What release is this planned for?

@jpountz
Copy link
Contributor

jpountz commented Apr 19, 2022

See discussion on LUCENE-8739, this PR is unlikely going to get merged.

Copy link

github-actions bot commented Jan 9, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 9, 2024

// dictionary compression first
doCompress(bytes, off, dictLength, cctx, out);
cctx.loadDict(new ZstdDictCompress(bytes, off, dictLength, this.compressionLevel));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should close this ZstdDictCompress object, or it will cause a memory leak. I have used your code in Elasticsearch and found it to be useful, but this is a minor issue that needs to be fixed.

// decompress dictionary first
doDecompress(in, dctx, bytes, dictLength);

dctx.loadDict(new ZstdDictDecompress(bytes.bytes, 0, dictLength));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should close this ZstdDictDecompress object too.

@github-actions github-actions bot removed the Stale label Mar 8, 2024
Copy link

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Mar 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants