Data key thoughts #4991

robertbastian · 2024-06-03T15:24:19Z

Current architecture:

pub trait DataMarker { ... }
pub trait DynamicDataProvider<M: DataMarker> { load(&self, DataKey, DataRequest) }

pub trait KeyedDataMarker: DataMarker { const KEY: DataKey; }
pub trait DataProvider<M: KeyedDataMarker> { load(&self, DataRequest) }

Data keys do not appear anywhere in the DataProvider paths, they are only used in DynamicDataProvider
- BufferProvider/AnyProvider are functionally DynamicDataProvider<BufferMarker/AnyMarker>
Conceptually, they can be seen as dyn KeyedDataMarker
Users never see data keys, in the API bounds they see KeyedDataMarkers
- Yet, we make them select data keys in datagen

Proposal 1: Align data marker names with data provider names

pub trait DynamicDataMarker { ... }
pub trait DynamicDataProvider<M: DynamicDataMarker> { load(&self, DataKey, DataRequest) }

pub trait DataMarker: DynamicDataMarker { const KEY: DataKey; }
pub trait DataProvider<M: DataMarker> { load(&self, DataRequest) }

Proposal 2: Get rid of "data key" terminology

Instead, call it what it really is: an erased DataMarker (new terminology)

pub trait DynamicDataMarker { ... }
pub trait DynamicDataProvider<M: DynamicDataMarker> { load(&self, DataMarkerInfo, DataRequest) }

pub trait DataMarker: DynamicDataMarker { const INFO: DataMarkerInfo; }
pub trait DataProvider<M: DataMarker> { load(&self, DataRequest) }

This also includes replacing any "data key" terminology in datagen by "data marker", as we'll have DataMarkerInfos on the API

Proposal 3: `DataMarkerInfo.path` becomes `DataMarkerInfo.name`

To me the utility of the path is not clear (it probably originated from FS provider)
Instead of the paths we store the name of the DataMarker (new terminology) in the tagged string.
If a client sees AndListV1Marker on the API, they can tell datagen to generate "AndListV1Marker" (currently they have to figure out that it's "list/and@1")
In error messages, the marker name is more useful than the path, because mapping this in reverse is basically impossible for clients (and I struggle with this as well)
Blob provider doesn't use the path
Baked provider uses it for macro names, but it should have just used the marker names
FS provider uses this for directory structure, but that's not a functional requirement; data can live at AndListV1Marker/en.json instead of list/and@1/en.json.
We might want to use full paths for the name (icu_list::provider::AndListV1Marker), FS provider can then do icu_list/AndListV1Marker/en.json

Proposal 4: Split `DataMarkerInfo`'s fields (maybe)

pub trait DataMarker: DynamicDataMarker { 
  const NAME: DataMarkerName;
  const IS_SINGLETON: bool;
  const FALLBACK_CONFIG: LocaleFallbackConfig;
}

Having IS_SINGLETON closer to the type system could allow further optimisations
- Might require pub trait SingletonDataMarker: DataMarker instead
The NAMEs can be optimised out if nothing uses them. The hash would live inside the name, so blob provider would still use the name, and static slicing would still work. However, in non-dynamic-data-provider binaries without debug logging, the names can be removed
In the dynamic code path, we'd need to replace the DataKeyInfo parameter by three parameters, so that's a downside to proposal

The text was updated successfully, but these errors were encountered:

sffc · 2024-06-03T21:25:53Z

Initial thoughts:

OK with Proposal 1
This is a somewhat radical change to the mental model, but I can buy into it because I've never been a huge fan of the term "key", and you're right that "data marker" and "data key" refer to very similar concepts
I agree with the problems you laid out, but I also feel that the directory-like structure has been useful to us. I think if we did Proposal 3 we should do the variant where we use the fully qualified marker name, maybe eliding the "provider" since it will be the same for all markers. icu_list/AndListV1Marker
I think I do prefer inlining the fields to the DataMarker trait directly.

sffc · 2024-06-03T21:33:33Z

We could have both inlined const fields and a helper DataMarkerInfo

pub trait DataMarker: DynamicDataMarker { 
  const NAME: DataMarkerName;
  const IS_SINGLETON: bool;
  const FALLBACK_CONFIG: LocaleFallbackConfig;
}

#[non_exhaustive]
pub struct DataMarkerInfo {
  pub name: DataMarkerName;
  pub is_singleton: bool;
  pub fallback_config: LocaleFallbackConfig;
}

impl DataMarkerInfo {
  pub fn from_marker<M: DataMarker>() -> Self { /* ... */ }
}

pub trait DynamicDataProvider<M: DynamicDataMarker> { load(&self, DataMarkerInfo, DataRequest) }

But I guess if you do that you might as well just make DataMarkerInfo a const field, as in Proposal 3.

#4991

sffc · 2024-06-06T19:40:14Z

@Manishearth - Overall I like merging the names. But the "Marker" name is overloaded.
@sffc - Marker is a common name that's used for the type-system level objects in Rust. I think there's some risk of overloading "marker" but I'm not too worried. THere are ways to make clear whether a thing is a DataMarker or some other type of marker.
@sffc - Maybe we can talk about the path structure.
@robertbastian - https://github.com/unicode-org/icu4x/blob/main/provider/datagen/src/registry.rs
@sffc - The Rust types already have a hierarchy. It would be nice to reflect that hierarchy in the string representation. But we can probably cut off the icu:: and the ::provider:: and then snake case the rest.
@robertbastian - I don't like modifying the rust path, because if you then get such a string in an error message, you still have to reverse the mapping, even if it's a more regular mapping than now
@robertbastian - We can use the Rust stringify! macro if we keep the exact same path. It also makes them more searchable.
@robertbastian - There's also Rename provider modules to provider_unstable in 2.0, and DataMarker::Yokeable to DataMarker::DataStruct? #4193

Summary of additional discussion later:

@robertbastian values converging on a single identifier. The dueling identifiers (Rust type name and string path) harm understandability, usability, and searchability.
@sffc is favorable toward using a single identifier. @Manishearth is neutral or slightly favors having two identifiers since they service different use cases.
@sffc values hierarchy and brevity. The hierarchy enables use cases such as filtering and diagnostics, and it makes it easier and less error-prone to write custom datagen logic. The brevity directly impacts binary size.
@sffc puts a higher value on these properties than on the Rust names and would prefer names such as DateTime_Year_Chinese_V1 which are both hierarchical and brief, despite not being conventional Rust names. @sffc argues that these are weird types that only appear in trait bounds so it is fine to establish a new paradigm for how we name them. "V1Marker" is already a paradigm; we're swapping one for another.
@robertbastian and @Manishearth put a higher value on Rust conventional type names. @robertbastian points out that filtering can still be implemented using custom application logic to parse the camel case. @Manishearth acknowledges that filtering gets more difficult.
@robertbastian proposed icu::datetime::data::year::ChineseV1 as the identifier. @sffc is potentially okay with this compromise, since it has hierarchy and filterability, but is disappointed that it sacrifices brevity.

Manishearth · 2024-06-06T21:56:07Z

But the "Marker" name is overloaded

to expand on this a little bit (but not to rabbithole too deep because I don't care about "Marker" strongly): I find Marker to be a good default, but it's nice when it can be replaced with a useful concept name instead. I think we may have options here.

one thing I proposed in that discussion that isn't included here was encoding the heirarchy in the trait, so you can have something like:

impl DataMarker for DaySymbolsV1Marker {
   const COMPONENT = "datetime";
   const SUBCOMPONENT = Some("symbols");
}

#4991

sffc · 2024-06-25T16:20:36Z

Notes from the scratch pad:

pub fn try_new_unstable<D>(
    provider: &D,
    locale: &DataLocale,
    options: CollatorOptions
) -> Result<Self, DataError>
where
    D: DataProvider<CollationSpecialPrimariesV1Marker> + DataProvider<CollationDataV1Marker> + DataProvider<CollationDiacriticsV1Marker> + DataProvider<CollationJamoV1Marker> + DataProvider<CollationMetadataV1Marker> + DataProvider<CollationReorderingV1Marker> + DataProvider<CanonicalDecompositionDataV1Marker> + DataProvider<CanonicalDecompositionTablesV1Marker> + ?Sized,

pub fn try_new_unstable<D>(
    provider: &D,
    locale: &DataLocale,
    options: CollatorOptions
) -> Result<Self, DataError>
where
    D: DataProvider<Collation_SpecialPrimaries_V1> + DataProvider<Collation_Data_V1> + DataProvider<Collation_Diacritics_V1> + DataProvider<Collation_Jamo_V1> + DataProvider<Collation_Metadata_V1> + DataProvider<Collation_Reordering_V1> + DataProvider<Collator_CanonicalDecomposition_Data_V1> + DataProvider<Collation_CanonicalDecomposition_Tables_V1> + ?Sized,

pub fn try_new_unstable<D>(
    provider: &D,
    locale: &DataLocale,
    options: FixedDecimalFormatterOptions
) -> Result<Self, DataError>
where
    D: DataProvider<DecimalSymbolsV1Marker> + ?Sized

pub fn try_new_unstable<D>(
    provider: &D,
    locale: &DataLocale,
    options: FixedDecimalFormatterOptions
) -> Result<Self, DataError>
where
    D: DataProvider<Decimal_Symbols_V1> + ?Sized

Collation_CanonicalDecomposition_Tables_V1
TimeZone_IanaMapper_ToBcp47_V2 or TimezoneIanamapperTobcp47V1Marker
TimeZone_IanaMapper_ToIana_V1
TimeZone_Metazone_Period_V1
TimezoneIanaToBcp47V1, TimezoneBcpToIanaV1, icu::timezone::provider::iana_mapper::ToIanaV1Marker

Current: DangiDatePatternV1Marker + “datetime/patterns/dangi/date@1”
Shane Proposes: DateTime_Patterns_Date_Dangi_V1
Rob Proposes: DatetimePatternsDateDangiV1Marker, icu::datetime::provider::patterns::date::DangiV1Marker

where: P: DataProvider<patterns::date::DangiV1Marker> + DataProvider<patterns::date::GregorianV1Marker> ...

where: P: DataProvider<DateTime_Patterns_Date_Dangi_V1> + DataProvider<DateTime_Patterns_Date_Gregorian_V1> ...

icu::datetime::provider::patterns::date::DangiV1Marker

datetime/patterns/date/DangiV1Marker/en.json

datetime/patterns/date/dangi@1/en.json

sffc · 2024-09-14T01:14:24Z

Discussion:

@Manishearth Users typically don't use the string key, but it has its use cases and might show up in errors from time to time. I think the status quo is fine. I don't think it's a good idea to make our data loading path depend on the Rust path. I want to be able to move things around in code without changing the data path. I think we've made decent choices in each case.
@hsivonen Do the string keys get stored as the full path somewhere, so if the string key is even longer, we take a binary size hit?
@robertbastian In the latest proposal, ... we use the hash more. The places these show up are in error messages and in fs.
@robertbastian The reason I proposed this is that it is confusing to have to IDs for the same thing. They see in docs ListV1Marker but in datagen they need to select list/and@1. I fixed this by accepting both formulations in datagen. But they still see the path3 in errors.
@sffc We could/should do an audit of our marker and path names. We've landed them but never really took a holistic view. Then we could establish a convention between the Rust IDs and the string IDs.
@sffc If datagen accepts marker names, then the marker names are effectively stable.
@robertbastian Doesn't seem 2.0 critical, except for the audit of the paths.
@robertbastian I think we should consider anything around markers to be unstable.

Some proposals:

Proposal A: Consider marker names to be stable. They can't be renamed in minor releases. They are accepted in datagen.
Proposal B: Consider marker names to be unstable. They can be renamed as we like. The paths remain stable. They are accepted in datagen because if you use the marker names, you are subject to their stability policy.
Proposal C: Same as Proposal B but reject marker names in datagen.

More discussion:

@robertbastian We've never needed to rename a marker. I don't see why we should make our UX worse to reserve this possibility.
@sffc I'm leaning toward Proposal B but not actually renaming anything until we do a proper audit, and in some future release we can do that audit. If we do it in say 2.2, then the 2.2 release notes say so.
@robertbastian People don't usually read point release notes. Things just shouldn't break. If the audit is not important enough for 2.0, maybe it's just not important
@sffc Data slicing and data providers are part of the ICU4X value proposition, so we should make them nice. If we had more resources, I would like to do it in 2.0, but I agree it's a lower priority.
@Manishearth If you're using them, you're using them. No one looks at them holistically other than us.
@robertbastian We should at least do an audit in datetime because they live 5 traits away and the documentation should explain how to figure out which markers you'll need

Conclusion:

Don't rename markers in 2.x. Accept them in datagen.
Could revisit in 3.0 but it might not be a high priority even then.
Perform an audit on neo datetime markers.
Add this auditing to the graduation checklist. For now it is a vague requirement; could evolve into something more specific in the future.

LGTM: @sffc @robertbastian @Manishearth

robertbastian added the discuss-priority Discuss at the next ICU4X meeting label Jun 3, 2024

This was referenced Jun 4, 2024

Rename marker types #4999

Merged

Rename data key to data marker [info] #5003

Merged

sffc pushed a commit that referenced this issue Jun 4, 2024

Rename marker types (#4999)

c225499

#4991

robertbastian added a commit that referenced this issue Jun 6, 2024

Rename data key to data marker [info] (#5003)

0217a09

#4991

sffc added this to the ICU4X 2.0 ⟨P1⟩ milestone Jun 13, 2024

sffc assigned robertbastian Jun 13, 2024

sffc added the C-data-infra Component: provider, datagen, fallback, adapters label Jun 13, 2024

robertbastian mentioned this issue Jun 14, 2024

Allow using marker names in datagen #5060

Merged

robertbastian added a commit that referenced this issue Jun 14, 2024

Allow using marker names in datagen (#5060)

aade9e1

#4991

robertbastian removed the discuss-priority Discuss at the next ICU4X meeting label Jul 11, 2024

robertbastian removed their assignment Sep 17, 2024

sffc modified the milestones: ICU4X 2.0 ⟨P1⟩, ICU4X 2.0 Stretch ⟨P2⟩ Sep 17, 2024

sffc self-assigned this Sep 17, 2024

sffc modified the milestones: ICU4X 2.0 Stretch ⟨P2⟩, ICU4X 2.0 ⟨P1⟩ Sep 17, 2024

sffc added the S-tiny Size: Less than an hour (trivial fixes) label Sep 17, 2024

sffc mentioned this issue Sep 17, 2024

Should we adopt a more consistent naming convention for traits and marker types? #5276

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data key thoughts #4991

Data key thoughts #4991

robertbastian commented Jun 3, 2024 •

edited

Loading

sffc commented Jun 3, 2024

sffc commented Jun 3, 2024

sffc commented Jun 6, 2024

Manishearth commented Jun 6, 2024

sffc commented Jun 25, 2024

sffc commented Sep 14, 2024

Data key thoughts #4991

Data key thoughts #4991

Comments

robertbastian commented Jun 3, 2024 • edited Loading

Proposal 1: Align data marker names with data provider names

Proposal 2: Get rid of "data key" terminology

Proposal 3: DataMarkerInfo.path becomes DataMarkerInfo.name

Proposal 4: Split DataMarkerInfo's fields (maybe)

sffc commented Jun 3, 2024

sffc commented Jun 3, 2024

sffc commented Jun 6, 2024

Manishearth commented Jun 6, 2024

sffc commented Jun 25, 2024

sffc commented Sep 14, 2024

robertbastian commented Jun 3, 2024 •

edited

Loading

Proposal 3: `DataMarkerInfo.path` becomes `DataMarkerInfo.name`

Proposal 4: Split `DataMarkerInfo`'s fields (maybe)