Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data key thoughts #4991

Open
robertbastian opened this issue Jun 3, 2024 · 6 comments
Open

Data key thoughts #4991

robertbastian opened this issue Jun 3, 2024 · 6 comments
Assignees
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-tiny Size: Less than an hour (trivial fixes)

Comments

@robertbastian
Copy link
Member

robertbastian commented Jun 3, 2024

Current architecture:

pub trait DataMarker { ... }
pub trait DynamicDataProvider<M: DataMarker> { load(&self, DataKey, DataRequest) }

pub trait KeyedDataMarker: DataMarker { const KEY: DataKey; }
pub trait DataProvider<M: KeyedDataMarker> { load(&self, DataRequest) }
  • Data keys do not appear anywhere in the DataProvider paths, they are only used in DynamicDataProvider
    • BufferProvider/AnyProvider are functionally DynamicDataProvider<BufferMarker/AnyMarker>
  • Conceptually, they can be seen as dyn KeyedDataMarker
  • Users never see data keys, in the API bounds they see KeyedDataMarkers
    • Yet, we make them select data keys in datagen

Proposal 1: Align data marker names with data provider names

pub trait DynamicDataMarker { ... }
pub trait DynamicDataProvider<M: DynamicDataMarker> { load(&self, DataKey, DataRequest) }

pub trait DataMarker: DynamicDataMarker { const KEY: DataKey; }
pub trait DataProvider<M: DataMarker> { load(&self, DataRequest) }

Proposal 2: Get rid of "data key" terminology

  • Instead, call it what it really is: an erased DataMarker (new terminology)
pub trait DynamicDataMarker { ... }
pub trait DynamicDataProvider<M: DynamicDataMarker> { load(&self, DataMarkerInfo, DataRequest) }

pub trait DataMarker: DynamicDataMarker { const INFO: DataMarkerInfo; }
pub trait DataProvider<M: DataMarker> { load(&self, DataRequest) }
  • This also includes replacing any "data key" terminology in datagen by "data marker", as we'll have DataMarkerInfos on the API

Proposal 3: DataMarkerInfo.path becomes DataMarkerInfo.name

  • To me the utility of the path is not clear (it probably originated from FS provider)
  • Instead of the paths we store the name of the DataMarker (new terminology) in the tagged string.
  • If a client sees AndListV1Marker on the API, they can tell datagen to generate "AndListV1Marker" (currently they have to figure out that it's "list/and@1")
  • In error messages, the marker name is more useful than the path, because mapping this in reverse is basically impossible for clients (and I struggle with this as well)
  • Blob provider doesn't use the path
  • Baked provider uses it for macro names, but it should have just used the marker names
  • FS provider uses this for directory structure, but that's not a functional requirement; data can live at AndListV1Marker/en.json instead of list/and@1/en.json.
  • We might want to use full paths for the name (icu_list::provider::AndListV1Marker), FS provider can then do icu_list/AndListV1Marker/en.json

Proposal 4: Split DataMarkerInfo's fields (maybe)

pub trait DataMarker: DynamicDataMarker { 
  const NAME: DataMarkerName;
  const IS_SINGLETON: bool;
  const FALLBACK_CONFIG: LocaleFallbackConfig;
}
  • Having IS_SINGLETON closer to the type system could allow further optimisations
    • Might require pub trait SingletonDataMarker: DataMarker instead
  • The NAMEs can be optimised out if nothing uses them. The hash would live inside the name, so blob provider would still use the name, and static slicing would still work. However, in non-dynamic-data-provider binaries without debug logging, the names can be removed
  • In the dynamic code path, we'd need to replace the DataKeyInfo parameter by three parameters, so that's a downside to proposal
@robertbastian robertbastian added the discuss-priority Discuss at the next ICU4X meeting label Jun 3, 2024
@sffc
Copy link
Member

sffc commented Jun 3, 2024

Initial thoughts:

  1. OK with Proposal 1
  2. This is a somewhat radical change to the mental model, but I can buy into it because I've never been a huge fan of the term "key", and you're right that "data marker" and "data key" refer to very similar concepts
  3. I agree with the problems you laid out, but I also feel that the directory-like structure has been useful to us. I think if we did Proposal 3 we should do the variant where we use the fully qualified marker name, maybe eliding the "provider" since it will be the same for all markers. icu_list/AndListV1Marker
  4. I think I do prefer inlining the fields to the DataMarker trait directly.

@sffc
Copy link
Member

sffc commented Jun 3, 2024

We could have both inlined const fields and a helper DataMarkerInfo

pub trait DataMarker: DynamicDataMarker { 
  const NAME: DataMarkerName;
  const IS_SINGLETON: bool;
  const FALLBACK_CONFIG: LocaleFallbackConfig;
}

#[non_exhaustive]
pub struct DataMarkerInfo {
  pub name: DataMarkerName;
  pub is_singleton: bool;
  pub fallback_config: LocaleFallbackConfig;
}

impl DataMarkerInfo {
  pub fn from_marker<M: DataMarker>() -> Self { /* ... */ }
}

pub trait DynamicDataProvider<M: DynamicDataMarker> { load(&self, DataMarkerInfo, DataRequest) }

But I guess if you do that you might as well just make DataMarkerInfo a const field, as in Proposal 3.

sffc pushed a commit that referenced this issue Jun 4, 2024
@sffc
Copy link
Member

sffc commented Jun 6, 2024

Summary of additional discussion later:

  • @robertbastian values converging on a single identifier. The dueling identifiers (Rust type name and string path) harm understandability, usability, and searchability.
  • @sffc is favorable toward using a single identifier. @Manishearth is neutral or slightly favors having two identifiers since they service different use cases.
  • @sffc values hierarchy and brevity. The hierarchy enables use cases such as filtering and diagnostics, and it makes it easier and less error-prone to write custom datagen logic. The brevity directly impacts binary size.
  • @sffc puts a higher value on these properties than on the Rust names and would prefer names such as DateTime_Year_Chinese_V1 which are both hierarchical and brief, despite not being conventional Rust names. @sffc argues that these are weird types that only appear in trait bounds so it is fine to establish a new paradigm for how we name them. "V1Marker" is already a paradigm; we're swapping one for another.
  • @robertbastian and @Manishearth put a higher value on Rust conventional type names. @robertbastian points out that filtering can still be implemented using custom application logic to parse the camel case. @Manishearth acknowledges that filtering gets more difficult.
  • @robertbastian proposed icu::datetime::data::year::ChineseV1 as the identifier. @sffc is potentially okay with this compromise, since it has hierarchy and filterability, but is disappointed that it sacrifices brevity.

@Manishearth
Copy link
Member

But the "Marker" name is overloaded

to expand on this a little bit (but not to rabbithole too deep because I don't care about "Marker" strongly): I find Marker to be a good default, but it's nice when it can be replaced with a useful concept name instead. I think we may have options here.


one thing I proposed in that discussion that isn't included here was encoding the heirarchy in the trait, so you can have something like:

impl DataMarker for DaySymbolsV1Marker {
   const COMPONENT = "datetime";
   const SUBCOMPONENT = Some("symbols");
}

@sffc sffc added this to the ICU4X 2.0 ⟨P1⟩ milestone Jun 13, 2024
@sffc sffc added the C-data-infra Component: provider, datagen, fallback, adapters label Jun 13, 2024
@sffc
Copy link
Member

sffc commented Jun 25, 2024

Notes from the scratch pad:

pub fn try_new_unstable<D>(
    provider: &D,
    locale: &DataLocale,
    options: CollatorOptions
) -> Result<Self, DataError>
where
    D: DataProvider<CollationSpecialPrimariesV1Marker> + DataProvider<CollationDataV1Marker> + DataProvider<CollationDiacriticsV1Marker> + DataProvider<CollationJamoV1Marker> + DataProvider<CollationMetadataV1Marker> + DataProvider<CollationReorderingV1Marker> + DataProvider<CanonicalDecompositionDataV1Marker> + DataProvider<CanonicalDecompositionTablesV1Marker> + ?Sized,

pub fn try_new_unstable<D>(
    provider: &D,
    locale: &DataLocale,
    options: CollatorOptions
) -> Result<Self, DataError>
where
    D: DataProvider<Collation_SpecialPrimaries_V1> + DataProvider<Collation_Data_V1> + DataProvider<Collation_Diacritics_V1> + DataProvider<Collation_Jamo_V1> + DataProvider<Collation_Metadata_V1> + DataProvider<Collation_Reordering_V1> + DataProvider<Collator_CanonicalDecomposition_Data_V1> + DataProvider<Collation_CanonicalDecomposition_Tables_V1> + ?Sized,

pub fn try_new_unstable<D>(
    provider: &D,
    locale: &DataLocale,
    options: FixedDecimalFormatterOptions
) -> Result<Self, DataError>
where
    D: DataProvider<DecimalSymbolsV1Marker> + ?Sized

pub fn try_new_unstable<D>(
    provider: &D,
    locale: &DataLocale,
    options: FixedDecimalFormatterOptions
) -> Result<Self, DataError>
where
    D: DataProvider<Decimal_Symbols_V1> + ?Sized
  • Collation_CanonicalDecomposition_Tables_V1
  • TimeZone_IanaMapper_ToBcp47_V2 or TimezoneIanamapperTobcp47V1Marker
  • TimeZone_IanaMapper_ToIana_V1
  • TimeZone_Metazone_Period_V1
  • TimezoneIanaToBcp47V1, TimezoneBcpToIanaV1, icu::timezone::provider::iana_mapper::ToIanaV1Marker

Current: DangiDatePatternV1Marker + “datetime/patterns/dangi/date@1”
Shane Proposes: DateTime_Patterns_Date_Dangi_V1
Rob Proposes: DatetimePatternsDateDangiV1Marker, icu::datetime::provider::patterns::date::DangiV1Marker

where: P: DataProvider<patterns::date::DangiV1Marker> + DataProvider<patterns::date::GregorianV1Marker> ...

where: P: DataProvider<DateTime_Patterns_Date_Dangi_V1> + DataProvider<DateTime_Patterns_Date_Gregorian_V1> ...

icu::datetime::provider::patterns::date::DangiV1Marker

datetime/patterns/date/DangiV1Marker/en.json

datetime/patterns/date/dangi@1/en.json

@robertbastian robertbastian removed the discuss-priority Discuss at the next ICU4X meeting label Jul 11, 2024
@sffc
Copy link
Member

sffc commented Sep 14, 2024

Discussion:

  • @Manishearth Users typically don't use the string key, but it has its use cases and might show up in errors from time to time. I think the status quo is fine. I don't think it's a good idea to make our data loading path depend on the Rust path. I want to be able to move things around in code without changing the data path. I think we've made decent choices in each case.
  • @hsivonen Do the string keys get stored as the full path somewhere, so if the string key is even longer, we take a binary size hit?
  • @robertbastian In the latest proposal, ... we use the hash more. The places these show up are in error messages and in fs.
  • @robertbastian The reason I proposed this is that it is confusing to have to IDs for the same thing. They see in docs ListV1Marker but in datagen they need to select list/and@1. I fixed this by accepting both formulations in datagen. But they still see the path3 in errors.
  • @sffc We could/should do an audit of our marker and path names. We've landed them but never really took a holistic view. Then we could establish a convention between the Rust IDs and the string IDs.
  • @sffc If datagen accepts marker names, then the marker names are effectively stable.
  • @robertbastian Doesn't seem 2.0 critical, except for the audit of the paths.
  • @robertbastian I think we should consider anything around markers to be unstable.

Some proposals:

  • Proposal A: Consider marker names to be stable. They can't be renamed in minor releases. They are accepted in datagen.
  • Proposal B: Consider marker names to be unstable. They can be renamed as we like. The paths remain stable. They are accepted in datagen because if you use the marker names, you are subject to their stability policy.
  • Proposal C: Same as Proposal B but reject marker names in datagen.

More discussion:

  • @robertbastian We've never needed to rename a marker. I don't see why we should make our UX worse to reserve this possibility.
  • @sffc I'm leaning toward Proposal B but not actually renaming anything until we do a proper audit, and in some future release we can do that audit. If we do it in say 2.2, then the 2.2 release notes say so.
  • @robertbastian People don't usually read point release notes. Things just shouldn't break. If the audit is not important enough for 2.0, maybe it's just not important
  • @sffc Data slicing and data providers are part of the ICU4X value proposition, so we should make them nice. If we had more resources, I would like to do it in 2.0, but I agree it's a lower priority.
  • @Manishearth If you're using them, you're using them. No one looks at them holistically other than us.
  • @robertbastian We should at least do an audit in datetime because they live 5 traits away and the documentation should explain how to figure out which markers you'll need

Conclusion:

  • Don't rename markers in 2.x. Accept them in datagen.
  • Could revisit in 3.0 but it might not be a high priority even then.
  • Perform an audit on neo datetime markers.
  • Add this auditing to the graduation checklist. For now it is a vague requirement; could evolve into something more specific in the future.

LGTM: @sffc @robertbastian @Manishearth

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-tiny Size: Less than an hour (trivial fixes)
Projects
None yet
Development

No branches or pull requests

3 participants