feat: add bitpack encoding for LanceV2 #2333

albertlockett · 2024-05-13T20:31:11Z

Work in progress

TODO

improve tests
support signed types
handle case where buffer is all 0s
handle case where num compressed bits = num uncompressed bits

github-actions · 2024-05-13T20:31:30Z

ACTION NEEDED

Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

westonpace

+1 Nice work so far. This looks like the correct general approach to me. Still some details to work out but nothing looks out of place.

westonpace · 2024-05-13T21:04:07Z

rust/lance-encoding/src/encodings/physical/bitpack.rs

+        _all_null: &mut bool,
+    ) {
+        // TODO -- not sure if this is correct
+        buffers[0].0 = self.uncompressed_bits_per_value / 8 * num_rows as u64;


This works as long as uncompressed_bits_per_value is a multiple of 8, which should always be the case for now. If we ever have to handle widths that aren't a multiple of 8, we will need to update this.

I've added a debug assert for now
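As a rough illustration of the point above, the byte count and its guard might look like the following. This is only a sketch: `uncompressed_bytes` is a hypothetical free function, whereas the PR computes the value inline and assigns it to `buffers[0].0`.

```rust
/// Number of bytes needed to hold `num_rows` decoded values.
/// Hypothetical helper for illustration; not part of the PR.
fn uncompressed_bytes(uncompressed_bits_per_value: u64, num_rows: u32) -> u64 {
    // Integer division would silently truncate for a width such as 12 bits,
    // so the assumption is made explicit in debug builds.
    debug_assert!(
        uncompressed_bits_per_value % 8 == 0,
        "uncompressed_bits_per_value must be a multiple of 8"
    );
    uncompressed_bits_per_value / 8 * num_rows as u64
}
```

For example, 7 rows of 32-bit values need 28 bytes.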

westonpace · 2024-05-13T21:04:59Z

rust/lance-encoding/src/encodings/physical/bitpack.rs

+    }
+
+    fn decode_into(&self, rows_to_skip: u32, num_rows: u32, dest_buffers: &mut [BytesMut]) {
+        let mut bytes_to_skip = rows_to_skip as u64 * self.bits_per_value / 8;


rows_to_skip * self.bits_per_value isn't always going to be a multiple of 8. What happens when it isn't?

Yeah this logic wasn't correct. Reworked the decode_into method
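One way to picture the problem raised here: the position of the first wanted value has to be tracked as a byte offset plus a bit offset inside that byte. The sketch below shows only that arithmetic; `skip_offsets` is a hypothetical helper and is not taken from the reworked decode_into.

```rust
/// Splits the position of the first wanted value into a whole-byte offset
/// and a bit offset within that byte. Hypothetical helper for illustration.
fn skip_offsets(rows_to_skip: u32, bits_per_value: u64) -> (u64, u64) {
    let bits_to_skip = rows_to_skip as u64 * bits_per_value;
    // Dividing by 8 alone drops the remainder, which is exactly the case
    // the review comment asks about.
    (bits_to_skip / 8, bits_to_skip % 8)
}
```

Skipping 3 rows of 3-bit values is 9 bits, so decoding would start at byte 1, bit 1, rather than at the start of byte 1.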

westonpace · 2024-05-13T21:06:14Z

rust/lance-encoding/src/encodings/physical/bitpack.rs

+        // pre-add enough capacity to the buffer to hold all the values we're about to put in it
+        let capacity_to_add = dst.capacity() as i64 - dst.len() as i64 + num_rows as i64;
+        if capacity_to_add > 0 {
+            let bytes_to_add =
+                capacity_to_add as usize * self.uncompressed_bits_per_value as usize / 8;
+            dst.extend((0..bytes_to_add).into_iter().map(|_| 0));
+        }


You shouldn't need to do this. As long as update_capacity is returning a valid value then you should be able to safely assume the capacity is already there.

That being said, it doesn't really hurt to have this code. Maybe simpler to just put a debug_assert checking that there is enough capacity.
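A capacity check along the lines suggested above could be sketched as follows. The assumptions are loud ones: `Vec<u8>` stands in for `BytesMut` so the example needs only the standard library, and both function names are hypothetical.

```rust
/// Returns true if `dst` can take `num_rows` more decoded values without
/// reallocating. `Vec<u8>` is a stand-in for `BytesMut` in this sketch.
fn has_room_for(dst: &Vec<u8>, num_rows: u32, uncompressed_bits_per_value: u64) -> bool {
    let bytes_needed = num_rows as usize * (uncompressed_bits_per_value as usize / 8);
    dst.capacity() - dst.len() >= bytes_needed
}

/// Hypothetical call site at the top of a decode routine.
fn check_capacity(dst: &Vec<u8>, num_rows: u32, uncompressed_bits_per_value: u64) {
    debug_assert!(
        has_room_for(dst, num_rows, uncompressed_bits_per_value),
        "update_capacity should already have reserved space for {num_rows} rows"
    );
}
```

The check costs nothing in release builds, which is the appeal over zero-filling the buffer up front.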

westonpace · 2024-05-13T21:07:03Z

rust/lance-encoding/src/encodings/physical/bitpack.rs

+        let mut mask = 0u64;
+        for _ in 0..self.bits_per_value {
+            mask = mask << 1 | 1;
+        }


I think this means you have a limit of 64 bits per value. This is probably fine but you should add a debug_assert somewhere verifying this.
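For illustration, the same mask can be built without a loop while asserting the 64-bit limit. `value_mask` is a hypothetical name; the branch for a width of 64 is needed because shifting a u64 by 64 overflows.

```rust
/// Builds a mask with the low `bits_per_value` bits set.
/// Hypothetical helper for illustration; not part of the PR.
fn value_mask(bits_per_value: u64) -> u64 {
    debug_assert!(
        bits_per_value <= 64,
        "bitpacking supports at most 64 bits per value"
    );
    if bits_per_value >= 64 {
        // `1u64 << 64` would overflow, so handle the full-width case directly.
        u64::MAX
    } else {
        (1u64 << bits_per_value) - 1
    }
}
```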

westonpace · 2024-05-13T21:08:32Z

rust/lance-encoding/src/encodings/physical/bitpack.rs

+    fn num_buffers(&self) -> u32 {
+        // TODO ask weston what this is about
+        1
+    }


1 is correct. There are some cases (e.g. dictionary encoding) where we encode 1 input buffer into 2 output buffers.

westonpace · 2024-05-13T21:48:37Z

protos/encodings.proto

+
+  // additional metadata that should be present if bitpacking is used
+  optional BitpackMeta bitpack_meta = 4;
+}


Minor nit: I think of bitpacking less as an extension of Flat and more as its own encoding that has another array encoding inside of it (like fixed_size_list). I don't know of any concrete reason that's better, but I like thinking of these as small composable pieces rather than one piece with lots of options.

good call, made this change

westonpace · 2024-05-13T21:50:21Z

rust/lance-encoding/src/encodings/physical/bitpack.rs

+        let mut dest = vec![BytesMut::new()];
+        unit.decode_into(0, 7, &mut dest);
+
+        println!("{:?}", dest);


Nit: convert to an assert when ready to move out of draft.

I deleted this. We have other tests covering this code path.

westonpace · 2024-05-13T21:51:41Z

rust/lance-encoding/src/encodings/physical/buffers.rs

+        let mut packed_arrays = vec![];
+        for arr in arrays {
+            let packed = pack_array(arr.clone(), num_bits)?;
+            packed_arrays.push(packed.into());
+        }
+
+        let data_type = arrays[0].data_type();
+        let bits_per_value = 8 * data_type.byte_width() as u64;
+
+        Ok(EncodedBuffer {
+            bits_per_value: num_bits,
+            parts: packed_arrays,
+            bitpack_meta: Some(pb::BitpackMeta {
+                uncompressed_bits_per_value: bits_per_value,
+            }),
+        })


Do we want to bitpack conditionally, based on whether num_bits is less than the "native" number of bits, if that makes sense? E.g. if a number is using the full range then don't bitpack.

Sure -- made this change
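The decision discussed here can be sketched as a small pure function. Everything below is hypothetical naming for illustration (the enum only loosely mirrors the Flat and Bitpacked encodings); it is not the code that was merged.

```rust
/// Outcome of the "is bitpacking worth it" decision. Hypothetical type.
#[derive(Debug, PartialEq)]
enum ChosenEncoding {
    Flat { bits_per_value: u64 },
    Bitpacked {
        compressed_bits_per_value: u64,
        uncompressed_bits_per_value: u64,
    },
}

/// Bitpack only when it actually saves space.
fn choose_encoding(num_bits: u64, native_bits: u64) -> ChosenEncoding {
    if num_bits < native_bits {
        ChosenEncoding::Bitpacked {
            compressed_bits_per_value: num_bits,
            uncompressed_bits_per_value: native_bits,
        }
    } else {
        // Values use the full range, so packing would only add overhead.
        ChosenEncoding::Flat { bits_per_value: native_bits }
    }
}
```

This also covers the TODO item about the compressed and uncompressed widths being equal: that case falls through to the flat encoding.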

westonpace · 2024-05-13T21:52:50Z

rust/lance-encoding/src/encodings/physical/buffers.rs

+    T: ArrowPrimitiveType,
+    T::Native: PrimInt + AsPrimitive<u64>,
+{
+    let max = arrow::compute::bit_or(arr);


Well this is convenient :)
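Why a bitwise OR is enough: the OR of all values has the same highest set bit as the largest value, so its bit length is the width every value fits in. A standard-library-only sketch of the idea, using a plain `&[u64]` slice in place of an Arrow array and a hypothetical function name:

```rust
/// Number of bits needed to represent every value in `values`.
/// Hypothetical helper; a plain slice stands in for an Arrow array.
fn num_bits_needed(values: &[u64]) -> u64 {
    // OR-ing everything together keeps the highest set bit of any value.
    let or_of_all = values.iter().fold(0u64, |acc, v| acc | v);
    64 - or_of_all.leading_zeros() as u64
}
```

Note that an all-zero input yields a width of 0, which is the "buffer is all 0s" case listed in the PR description's TODO.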

westonpace · 2024-05-13T21:53:59Z

rust/lance-encoding/src/encodings/physical/buffers.rs

+    let buffers = data.buffers();
+    let mut packed_buffers = vec![];
+    for buffer in buffers {
+        let packed_buffer = pack_bits(&buffer, num_bits, byte_len);
+        packed_buffers.push(packed_buffer);
+    }
+    packed_buffers.concat()


We only want to pack the values buffer. I think this will also try to pack the validity buffer.

I think we're actually OK here. This gets passed the result of array.to_data() here:

lance/rust/lance-encoding/src/encodings/physical/bitpack.rs
Lines 165 to 168 in d18b7df

    match arr.data_type() {
        DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64 => Ok(
            pack_buffers(arr.to_data(), num_bits, arr.data_type().byte_width()),
        ),

And the validity buffer doesn't get included in that result. For example:

    let arr = UInt16Array::from(vec![Some(1), None, Some(2)]);
    let data = arr.to_data();
    let buffers = data.buffers();
    for buffer in buffers {
        println!("{:?}", buffer);
    }

prints:

Buffer { data: Bytes { ptr: 0x124704e80, len: 6, data: [1, 0, 0, 0, 2, 0] }, ptr: 0x124704e80, length: 6 }

codecov-commenter · 2024-07-04T20:44:24Z

Codecov Report

Attention: Patch coverage is 94.77021% with 33 lines in your changes missing coverage. Please review.

Project coverage is 79.14%. Comparing base (b45393e) to head (5803461).
Report is 3 commits behind head on main.

Files	Patch %	Lines
...t/lance-encoding/src/encodings/physical/bitpack.rs	94.65%	12 Missing and 12 partials ⚠️
rust/lance-encoding/src/encoder.rs	85.71%	5 Missing and 1 partial ⚠️
...ust/lance-encoding/src/encodings/physical/value.rs	96.00%	0 Missing and 2 partials ⚠️
...ust/lance-encoding/src/encodings/physical/basic.rs	0.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2333      +/-   ##
==========================================
- Coverage   79.35%   79.14%   -0.22%     
==========================================
  Files         213      218       +5     
  Lines       62521    63615    +1094     
  Branches    62521    63615    +1094     
==========================================
+ Hits        49614    50345     +731     
- Misses       9996    10323     +327     
- Partials     2911     2947      +36

Flag	Coverage Δ
unittests	`79.14% <94.77%> (-0.22%)`	⬇️

westonpace

A few thoughts, only partway through.

westonpace · 2024-07-10T14:14:33Z

protos/encodings.proto

+// Items are bitpacked in a buffer
+message Bitpacked {
+  // the number of bits used for a value in the buffer
+  uint64 compressed_bits_per_value = 1;
+
+  // the number of bits of the uncompressed value. e.g. for a u32, this will be 32
+  uint64 uncompressed_bits_per_value = 2;
+
+  // The items in the list
+  Buffer buffer = 3;
+}


No change required, but, as an interesting aside, this forces Bitpacked to be a "terminal" encoding. I honestly don't know of any cases where it wouldn't be terminal, but in the BtrBlocks "pure tree of encodings" style it would be:

    message Bitpacked {
      // The number of bits of the uncompressed value. e.g. for a u32, this will be 32
      uint64 uncompressed_bits_per_value = 1;

      // The compressed bytes
      ArrayEncoding compressed_bytes = 2;
    }

E.g. this would open the door to weird things like using dictionary encoding to encode the compressed byte buffer (which would be a bad idea, since you should really apply dictionary encoding higher up, but I use it as an example nonetheless).

Even if the difference were more than a philosophical one, it isn't a trivial change to make, because you would then need to make sure all the ArrayEncoding encoders actually support a "bits_per_value" that isn't divisible by 8, which would be a headache.

westonpace · 2024-07-10T14:17:32Z

rust/lance-encoding/src/encoder.rs

+pub struct EncodedBufferMeta {
+    pub bits_per_value: u64,
+
+    pub bitpacked_bits_per_value: Option<u64>,
+
+    pub compression_scheme: Option<CompressionScheme>,


Can we change EncodedBuffer to contain these fields?

I wasn't sure about that. There are some places where we construct an EncodedBuffer where it's not clear what we'd set for these fields.

For example here in EncodedArray.into_parts:

lance/rust/lance-encoding/src/encoder.rs
Line 91 in 314d636

    .map(|b| EncodedBuffer { parts: b.parts })

And here in ZoneMapsFieldEncoder.maps_to_metadata:

lance/rust/lance-encoding-datafusion/src/zone.rs
Lines 500 to 501 in 314d636

    Ok(EncodedBuffer {
        parts: vec![Buffer::from(zone_maps_buffer)],

Wouldn't they both be None?

westonpace · 2024-07-10T14:19:41Z

rust/lance-encoding/src/encoder.rs

+                ValueEncoder::try_new(Arc::new(CoreBufferEncodingStrategy {
+                    compression_scheme: get_compression_scheme(),
+                }))?,


Could the buffer encoding strategy be a property of CoreArrayEncodingStrategy instead of ValueEncoder? In other words, could this be something like...

    _ => {
        let buffer_encoder = self.buffer_encoding_strategy.create_buffer_encoder(...);
        Ok(Box::new(BasicEncoder::new(Box::new(ValueEncoder::try_new(buffer_encoder)?))))
    }

It would require changing array_encoder_from_type (taking in a data type) to create_array_encoder (taking in a slice of array ref) but that change is fine and I think I end up making that change anyways in some pending PR.

Yeah, that would be straightforward.

We also instantiate ValueEncoder here for the ListOffsetsEncoder:

lance/rust/lance-encoding/src/encodings/logical/list.rs
Lines 831 to 844 in da28952

    impl ListOffsetsEncoder {
        fn new(cache_bytes: u64, keep_original_array: bool, column_index: u32) -> Self {
            Self {
                accumulation_queue: AccumulationQueue::new(
                    cache_bytes,
                    column_index,
                    keep_original_array,
                ),
                inner_encoder: Arc::new(BasicEncoder::new(Box::new(
                    ValueEncoder::try_new(Arc::new(CoreBufferEncodingStrategy {
                        compression_scheme: CompressionScheme::None,
                    }))
                    .unwrap(),
                ))),

Do you think it makes sense to pass the BufferEncodingStrategy through to this constructor from the constructor of ListFieldEncoder here? (e.g., make the BufferEncodingStrategy a property of ListOffsetsEncoder)

lance/rust/lance-encoding/src/encodings/logical/list.rs
Lines 1092 to 1098 in da28952

    impl ListFieldEncoder {
        pub fn new(
            items_encoder: Box<dyn FieldEncoder>,
            cache_bytes_per_columns: u64,
            keep_original_array: bool,
            column_index: u32,
        ) -> Self {

Then we could create the inner ValueEncoder inline in make_encode_task?

lance/rust/lance-encoding/src/encodings/logical/list.rs
Lines 876 to 877 in da28952

    fn make_encode_task(&self, arrays: Vec<ArrayRef>) -> EncodeTask {
        let inner_encoder = self.inner_encoder.clone();

westonpace · 2024-07-10T14:50:27Z

rust/lance-encoding/src/encoder.rs

+pub struct EncodedBufferMeta {
+    pub bits_per_value: u64,
+
+    pub bitpacked_bits_per_value: Option<u64>,


This feels like an odd property to be at this level. Should this be pb::ArrayEncoding instead?

This struct is the return value from BufferEncoder.encode, and then we use it to construct the pb::array_encoding::ArrayEncoding in the ValueEncoder. Is that OK?

lance/rust/lance-encoding/src/encodings/physical/value.rs
Lines 260 to 283 in da28952

    let array_encoding =
        if let Some(bitpacked_bits_per_value) = encoded_buffer_meta.bitpacked_bits_per_value {
            pb::array_encoding::ArrayEncoding::Bitpacked(pb::Bitpacked {
                compressed_bits_per_value: bitpacked_bits_per_value,
                uncompressed_bits_per_value: encoded_buffer_meta.bits_per_value,
                buffer: Some(pb::Buffer {
                    buffer_index: index,
                    buffer_type: pb::buffer::BufferType::Page as i32,
                }),
            })
        } else {
            pb::array_encoding::ArrayEncoding::Flat(pb::Flat {
                bits_per_value: encoded_buffer_meta.bits_per_value,
                buffer: Some(pb::Buffer {
                    buffer_index: index,
                    buffer_type: pb::buffer::BufferType::Page as i32,
                }),
                compression: encoded_buffer_meta
                    .compression_scheme
                    .map(|compression_scheme| pb::Compression {
                        scheme: compression_scheme.to_string(),
                    }),
            })
        };

Yes, let's proceed for now. I think I'll be doing a refactor at some point to introduce the concept of "data layouts" and I can take another look at this then. Let's leave it as is for now.

westonpace · 2024-07-10T14:56:45Z

rust/lance-encoding/src/encodings/physical/bitpack.rs

+fn count_items_to_pack(arrays: &[ArrayRef]) -> usize {
+    let mut count = 0;
+    for arr in arrays {
+        count += arr.len();
+    }
+
+    count
+}


Minor nit: arrays.iter().map(|arr| arr.len()).sum::<usize>() might be more compact (I doubt it'd make a perf diff).

Sounds good, made this change
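The iterator form suggested above, shown as a self-contained sketch. `Vec<u32>` is used as a stand-in for `ArrayRef` so the example compiles without Arrow; only the body of the function reflects the suggestion.

```rust
/// Total number of items across all arrays.
/// `Vec<u32>` is a stand-in for Arrow's `ArrayRef` in this sketch.
fn count_items_to_pack(arrays: &[Vec<u32>]) -> usize {
    arrays.iter().map(|arr| arr.len()).sum::<usize>()
}
```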

westonpace

Some minor suggestions but I think we're ready to include this.

I have one concern it would be nice to fix. The problem with environment variables and unit tests is that the unit tests often run in parallel and the environment variables are process-wide. So, if I run all of the tests in lance-encoding on my system, I regularly see the test_utf8 test fail. This can cause some noise in CI.

We could try and find a better approach than env variables but we could also just investigate why test_utf8 fails if LANCE_USE_BITPACKING is true. As long as all unit tests pass with LANCE_USE_BITPACKING=true I don't mind too much if some tests are running with bit packing sometimes and not other times. I'll try and investigate soon too.
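One possible way to reduce the test noise described above is to keep the environment lookup in a thin wrapper and put the decision in a pure function that tests can call with explicit values, so no test has to mutate process-wide state. This is only a sketch: the function names and the accepted spellings ("1" and "true") are assumptions, and only the variable name LANCE_USE_BITPACKING comes from this thread.

```rust
/// Interprets the raw value of an on/off environment variable.
/// The accepted spellings are an assumption for this sketch.
fn parse_flag(raw: Option<&str>) -> bool {
    matches!(
        raw.map(|s| s.trim().to_ascii_lowercase()).as_deref(),
        Some("1") | Some("true")
    )
}

/// Reads the process-wide setting. Only non-test code paths would call this;
/// tests would pass an explicit value to `parse_flag` instead.
fn bitpacking_enabled_from_env() -> bool {
    parse_flag(std::env::var("LANCE_USE_BITPACKING").ok().as_deref())
}
```

With this split, a test that needs bitpacking on or off states that directly instead of depending on whatever another parallel test set in the environment.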

westonpace · 2024-07-22T23:08:43Z

rust/lance-encoding/src/encoder.rs

+pub struct EncodedBufferMeta {
+    pub bits_per_value: u64,
+
+    pub bitpacked_bits_per_value: Option<u64>,


Yes, let's proceed for now. I think I'll be doing a refactor at some point to introduce the concept of "data layouts" and I can take another look at this then. Let's leave it as is for now.

westonpace · 2024-07-22T23:09:24Z

rust/lance-encoding/src/encoder.rs

+pub struct EncodedBufferMeta {
+    pub bits_per_value: u64,
+
+    pub bitpacked_bits_per_value: Option<u64>,
+
+    pub compression_scheme: Option<CompressionScheme>,


Wouldn't they both be None?

westonpace · 2024-07-22T23:11:27Z

rust/lance-encoding/src/encodings/physical/bitpack.rs

+        DataType::UInt16 => Some(num_bits_for_type::<UInt16Type>(arr.as_primitive())),
+        DataType::UInt32 => Some(num_bits_for_type::<UInt32Type>(arr.as_primitive())),
+        DataType::UInt64 => Some(num_bits_for_type::<UInt64Type>(arr.as_primitive())),
+        // TODO -- eventually we could support signed types as well


Can do in a follow-up but we also want the various temporal types to be encoded with bit packing too.

albertlockett requested a review from westonpace May 13, 2024 20:31

github-actions bot added the enhancement New feature or request label May 13, 2024

westonpace reviewed May 13, 2024

albertlockett changed the title from "feat: Add bitpac encoding for LanceV2" to "feat: Add bitpack encoding for LanceV2" May 13, 2024

westonpace mentioned this pull request May 14, 2024

Lance File Format Version 2 (technically v0.3) #1929

Open

30 tasks

broccoliSpicy reviewed May 18, 2024

rust/lance-encoding/src/encodings/physical/bitpack.rs

broccoliSpicy reviewed May 18, 2024

rust/lance-encoding/src/encodings/physical/bitpack.rs

albertlockett force-pushed the lw-compressions branch 2 times, most recently from d18b7df to 47afcba Compare May 28, 2024 23:23

albertlockett changed the title from "feat: Add bitpack encoding for LanceV2" to "feat: add bitpack encoding for LanceV2" May 28, 2024

albertlockett force-pushed the lw-compressions branch from 056e140 to e07e788 Compare May 30, 2024 13:59

albertlockett force-pushed the lw-compressions branch from 00b3b96 to 849209f Compare July 4, 2024 20:30

albertlockett marked this pull request as ready for review July 8, 2024 16:31

albertlockett force-pushed the lw-compressions branch from b57a121 to c5786da Compare July 8, 2024 17:07

westonpace reviewed Jul 10, 2024

albertlockett and others added 11 commits July 22, 2024 10:12

feat: Add bitpacking encoding

d0fad46

fixup

83b3088

lint and clippy

18a2f68

bit more of cleanup

4217a35

linter

a418c45

clippy

9da1058

fixup some TODOs

03ea541

fix bug where buffer too long

8ff51a9

PR feedback

deabd81

PR feedback 2

fc3d473

added env var to control bitpacking being enabled

f59c037

albertlockett force-pushed the lw-compressions branch from f434e73 to f59c037 Compare July 22, 2024 14:26

cargo fmt

481b1af

fix broken test

131a19b

westonpace approved these changes Jul 22, 2024

albertlockett added 3 commits July 23, 2024 15:04

PR feedback 3

8ccc70d

PR feedback 4

6684334

license

5803461

albertlockett merged commit 782350e into main Jul 23, 2024
25 checks passed

albertlockett deleted the lw-compressions branch July 23, 2024 20:10

niyue mentioned this pull request Sep 23, 2024

Support dictionary encoding with bitpacked indices #2922

Open
