Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement PackedPatternsV1 with packing and unpacking #5580

Merged
merged 27 commits into from
Sep 25, 2024

Conversation

sffc
Copy link
Member

@sffc sffc commented Sep 23, 2024

Part of #257

Replaces #5521

@sffc sffc marked this pull request as draft September 23, 2024 21:13
Manishearth
Manishearth previously approved these changes Sep 23, 2024
@sffc
Copy link
Member Author

sffc commented Sep 24, 2024

Still needs databake and serde impls, but the packing part is ready for review/landing.

For human-readable Serde, I was thinking of serializing something resembling the builder; maybe I will just derive Serde on that and use it directly.

@sffc sffc marked this pull request as ready for review September 24, 2024 18:29
@sffc
Copy link
Member Author

sffc commented Sep 24, 2024

The individual commits are not that useful; most of the changes are in just the one new file.

@sffc sffc changed the title Draft up docs and structure for PackedSkeletonDataV2 Implement PackedPatternsV1 with packing and unpacking Sep 24, 2024
Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reviewed builder, still reviewing others

///
/// The LMS value determines which pattern index is used for the first column:
///
/// | LMS Value | Long Index | Medium Index | Short Index |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's add a column that contains the entries "LMS", "L, MS", "LM, S", and "L, M, S" since it's easier to read.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 6b30bd8

/// | Sc | S + 6 | 18-20 | Sa |
///
/// As a result, if there are no variants, bits 2 and higher will be all zero,
/// making the header int suitable for varint packing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: varint packing, as used by postcard and other optimized binary serialization formats.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 4e94347

/// | Lb | S + 1 | 3-5 | La |
/// | Mb | S + 2 | 6-8 | Ma |
/// | Sb | S + 3 | 9-11 | Sa |
/// | Lc | S + 4 | 12-14 | La |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: so inheritence is always from a? I thought c inherits from b first?

we could mark Lc as inheriting from "Lb, then La" etc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed both variants to inherit from standard because it was easier to implement, and in practice, if both variant0 and variant1 are present, they will definitely be different from each other. The main inheritance that matters is variant0 from standard.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that's convincing.

/// The variants are currently used by year formatting:
///
/// - Standard: Year, which could be partial precision (2-digit Gregorain)
/// - Variant 0: Full Year, which is always full precision
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: Not a fan of calling them variant0 variant1 in code here; can we primarily use descriptive names and in the docs mention that this is also "variant0" and "variant1"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is your suggestion to call them full_year and with_era? Because I wanted these to be generalizable enough to other cases that might want to use this mechanism in the future. For instance, maybe some locale will want to use it for day periods or something ("6 in the evening", "7 in the evening", "8pm", "9pm").

{
if let Some(pattern) = pattern {
if pattern != fallback {
*chunk = match elements.iter().position(|p| p == pattern) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: potentially use a BTreeMap<pattern, index> to speed this up?

Copy link
Member Author

@sffc sffc Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No Ord impls available and I don't particularly want to add them; they would peculate everywhere (20+ inner types including all fields, etc). And we can't move the first 1-3 elements on the vec, anyway. So the linear search seems alright.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay for now. Worth checking how slow this codegen is.


// Check to see if we need to switch to Q=1 mode
#[allow(clippy::unwrap_used)] // the array is nonempty
if chunks.iter().max().unwrap() >= &0x8 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I don't think we need to use hex here. Also minor style nit would be to have the * on the LHS

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stylistically, I prefer writing in hex when I'm emphasizing that I care more about the bit representation than the numeric value. "8" means "the number 8" and "0x8" means "an integer with the fourth bit on and all others off"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I guess so.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made it a constant in 6b30bd8


/// A builder for a [`PackedPatternsV1`].
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PackedPatternsBuilder<'a> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: datagen-only?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, in the next PR, I use this type for human-readable deserialization, not just datagen.

}

// Check to see if we need to switch to Q=1 mode
#[allow(clippy::unwrap_used)] // the array is nonempty
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: if we make this datagen-only, the comment could say "the array is nonempty and this is datagen-only"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improved the comment in c01686e

I wish we would have array::max one of these days: rust-lang/rust#78504

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could also use one of the workarounds in that issue to avoid the unwrap

components/datetime/src/provider/packed_pattern.rs Outdated Show resolved Hide resolved
Manishearth
Manishearth previously approved these changes Sep 24, 2024
components/datetime/src/provider/packed_pattern.rs Outdated Show resolved Hide resolved
components/datetime/src/provider/packed_pattern.rs Outdated Show resolved Hide resolved
@sffc sffc merged commit 1141c8e into unicode-org:main Sep 25, 2024
28 checks passed
@sffc sffc deleted the packed3 branch September 25, 2024 00:25
Manishearth added a commit that referenced this pull request Sep 25, 2024
It is impossible for an IndexN array to need more than a length integer
of size N, anyway, the max index is always `>=` the length.


Part of #5523

Builds on #5593


We could in theory just have a `VZVFormatCombo<Index, Len>` type that
allows free selection, however I'm trying to keep this minimal. Overall
the main use case for that is picking things like "a small array of
;argely-sized elements" and we could just expose Index16Len8 for that. I
can see that being useful for things like
#5580, though it also feels
like a data microoptimization.


The "total" lines in fingerprints.csv are interspersed in giant diffs,
and this basically only gets a max of 2-6 byte wins per data, but the
overall data size went down by 200KB. Not amazing, not terrible.

```rust
[18:26:22] मanishearth@manishearth-glaptop2 ~/dev/icu4x/provider/data ^_^ 
$ rg total | awk '{ gsub(/B,/, "", $3); s +=$3} END{print s}' 
23501369
[18:26:08] मanishearth@manishearth-glaptop2 ~/dev/icu4x/provider/data ^_^ 
$ rg total | awk '{ gsub(/B,/, "", $3); s +=$3} END{print s}' 
23391499
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants