perf: speed up fsst decompression #2626

Merged 5 commits into lancedb:main on Jul 25, 2024

Contributor

@broccoliSpicy broccoliSpicy commented Jul 21, 2024

before:
Screenshot 2024-07-21 at 4 24 44 PM

after:
Screenshot 2024-07-21 at 4 28 49 PM

to reproduce, run the following in rust/lance-encoding/compression-algo/fsst:
cargo run --release --example benchmark

machine info:
11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
Linux 192 5.10.0-28-amd64 #1 SMP Debian 5.10.209-2 (2024-01-31) x86_64 GNU/Linux

@codecov-commenter

codecov-commenter commented Jul 21, 2024

Codecov Report

Attention: Patch coverage is 98.22485% with 3 lines in your changes missing coverage. Please review.

Project coverage is 79.39%. Comparing base (02294a1) to head (2a40120).
Report is 1 commits behind head on main.

Files Patch % Lines
...t/lance-encoding/compression-algo/fsst/src/fsst.rs 98.22% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2626      +/-   ##
==========================================
+ Coverage   79.35%   79.39%   +0.04%     
==========================================
  Files         213      213              
  Lines       62520    62706     +186     
  Branches    62520    62706     +186     
==========================================
+ Hits        49610    49784     +174     
- Misses       9996    10005       +9     
- Partials     2914     2917       +3     
Flag Coverage Δ
unittests 79.39% <98.22%> (+0.04%) ⬆️

@broccoliSpicy
Contributor Author

end-to-end test:

Screenshot 2024-07-22 at 5 34 13 PM

to reproduce:

  1. disable dictionary encoding
  2. download dataset
    wget https://huggingface.co/datasets/HuggingFaceFW/fineweb/resolve/main/data/CC-MAIN-2013-20/000_00000.parquet\?download\=true
  3. run the script below (set the LANCE_USE_FSST environment variable to enable FSST; unset it to disable FSST)
import datetime
import os

import pyarrow.parquet as pq
from lance.file import LanceFileReader, LanceFileWriter

parquet_file_path = "/home/x/data.parquet"
lance_file_path = "/home/x/lance-experiments/fineweb/output.lance"

# Convert the Parquet dataset to a Lance file.
data = pq.read_table(parquet_file_path)
with LanceFileWriter(lance_file_path) as writer:
    writer.write_batch(data)

# Time a full read of the Parquet file.
start = datetime.datetime.now()
tab = pq.read_table(parquet_file_path)
end = datetime.datetime.now()
elapsed = (end - start).total_seconds()
print(f"Parquet elapsed: {elapsed}s")

# Time a full read of the Lance file.
start = datetime.datetime.now()
tab2 = LanceFileReader(lance_file_path).read_all().to_table()
end = datetime.datetime.now()
elapsed = (end - start).total_seconds()

# Report on-disk sizes in whole MiB (1 MiB = 1048576 bytes).
lance_file_size = os.path.getsize(lance_file_path)
lance_file_size_mib = lance_file_size // 1048576
parquet_file_size = os.path.getsize(parquet_file_path)
parquet_file_size_mib = parquet_file_size // 1048576

# Only the Lance file depends on LANCE_USE_FSST; the Parquet file is the same in both runs.
label = "fsst" if os.getenv("LANCE_USE_FSST") is not None else "no fsst"
print(f"Parquet file size: {parquet_file_size_mib} MiB")
print(f"Lance file size({label}): {lance_file_size_mib} MiB")
print(f"Lv2({label}) elapsed: {elapsed}s")

# Both readers must return the same data.
assert tab == tab2

print("Tables are equal")
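Step 3 asks for two runs of the script, one with LANCE_USE_FSST set and one with it unset. A minimal sketch of a driver that does both runs in one go is shown here. The helper names fsst_env and run_both are hypothetical, and the value "1" is an assumption: the script above only checks whether the variable is present, and this thread does not say which values Lance itself accepts.

```python
import os
import subprocess
import sys


def fsst_env(enabled, base=None):
    """Return a copy of the environment with LANCE_USE_FSST set or removed."""
    env = dict(os.environ if base is None else base)
    if enabled:
        # Assumption: any value enables FSST; "1" is only a placeholder.
        env["LANCE_USE_FSST"] = "1"
    else:
        env.pop("LANCE_USE_FSST", None)
    return env


def run_both(script_path):
    """Run the benchmark script once without FSST and once with it."""
    for enabled in (False, True):
        subprocess.run(
            [sys.executable, script_path],
            env=fsst_env(enabled),
            check=True,  # stop if either run fails
        )
```

Calling run_both with the path to the saved script prints both sets of numbers back to back, which makes the with/without comparison easier to read.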

Contributor

@westonpace westonpace left a comment

Sorry for the delay in getting back to this. I don't yet fully understand FSST, so I don't understand exactly what this change does. However, the tests pass and FSST is still guarded by an environment variable, so let's merge this, and I will try to set some time aside next week to really dig through it in more detail.

@westonpace westonpace merged commit fbf7a4a into lancedb:main Jul 25, 2024
22 checks passed
@broccoliSpicy
Contributor Author

Ha, sorry that I didn't explain this PR. It is basically a Rust translation of the original C++ implementation of decompression, and it uses Rust's ptr::write_unaligned as a performant way to do the stores.
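For readers who, like the reviewer, are new to FSST, here is a rough Python model of the decompression idea. It is not the code in this PR. It assumes the usual FSST layout: a symbol table of up to 255 symbols of 1 to 8 bytes each, with code 255 reserved as an escape that is followed by one literal byte. The function names build_table and decompress are hypothetical. The 8-byte slice assignment stands in for the unaligned 8-byte store: every symbol is written as a full 8-byte word, and the output position advances only by the symbol's real length, so the next store overwrites the padding.

```python
ESCAPE = 255  # assumed FSST convention: the next byte is a literal


def build_table(symbols):
    """Pad each symbol to 8 bytes and keep its true length alongside."""
    if len(symbols) > 255:
        raise ValueError("at most 255 symbols")
    for s in symbols:
        if not 1 <= len(s) <= 8:
            raise ValueError("symbols must be 1 to 8 bytes long")
    padded = [s.ljust(8, b"\x00") for s in symbols]
    lengths = [len(s) for s in symbols]
    return padded, lengths


def decompress(codes, padded, lengths):
    """Decode a byte string of codes back into the original bytes."""
    # Worst case is 8 output bytes per code; the extra 8 bytes of slack
    # keep every full-width store inside the buffer.
    out = bytearray(len(codes) * 8 + 8)
    pos = 0
    i = 0
    while i < len(codes):
        code = codes[i]
        if code == ESCAPE:
            out[pos] = codes[i + 1]  # copy the literal byte as-is
            pos += 1
            i += 2
        else:
            out[pos:pos + 8] = padded[code]  # full 8-byte store
            pos += lengths[code]  # advance by the real symbol length only
            i += 1
    return bytes(out[:pos])


padded, lengths = build_table([b"http://", b"www.", b".com"])
print(decompress(bytes([0, 1, ESCAPE, ord("x"), 2]), padded, lengths))
```

Writing a fixed-width word per code removes the per-symbol variable-length copy from the hot loop, which is presumably where an unaligned store helps in the Rust version.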

3 participants