Retrieve Canonical TZIDs #2909

Closed
nordzilla opened this issue Dec 20, 2022 · 11 comments · Fixed by #3498 or #4024
Labels
C-data-infra Component: provider, datagen, fallback, adapters · T-core Type: Required functionality · U-flutter User: Flutter Engine or Dart SDK · U-fuchsia User: Fuchsia · U-gecko User: Gecko · U-google User: Google 1st party

Comments

@nordzilla
Member

nordzilla commented Dec 20, 2022

For datetime formatting we need to have the canonical versions of the Time Zone identifiers (TZID). The CLDR time-zone identifiers never change and are not necessarily canonical, whereas the time zone identifiers from the IANA database are allowed to update and change over time.

Some examples are the change of names from Pacific/Ponape to Pacific/Pohnpei, and Asia/Calcutta to Asia/Kolkata.

We can achieve this in two ways. Once the initial TZDB data provider has landed, we could retrieve these IDs from the data provider; however, we could also retrieve these identifiers from the CLDR BCP47 data, which was added in #606.

@nordzilla nordzilla changed the title from "Retrieve Canonical TZIDs from CLDR bcp47 data" to "Retrieve Canonical TZIDs" Dec 20, 2022
@nordzilla nordzilla added the discuss Discuss at a future ICU4X-SC meeting label Dec 20, 2022
@sffc
Member

sffc commented Dec 20, 2022

Discussion: we want a many-to-one map from IANA to BCP47, and a one-to-one from BCP47 to IANA. We must read the CLDR sources to get the mappings to BCP47, but it seems like we should probably favor the TZIF sources for the most up-to-date IANA names.
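
As a rough illustration of the two mappings described above, here is a minimal Python sketch. The dictionary names and the `canonicalize` helper are hypothetical, not ICU4X data structures, and the entries are sample values built from the renames mentioned in this thread (`inccu` is assumed to be the BCP47 ID for Kolkata).

```python
# Many-to-one: every known IANA name (canonical or alias) -> BCP47 ID.
IANA_TO_BCP47 = {
    "Asia/Calcutta": "inccu",
    "Asia/Kolkata": "inccu",
    "Pacific/Ponape": "fmpni",
    "Pacific/Pohnpei": "fmpni",
}

# One-to-one: each BCP47 ID -> the single IANA name treated as canonical.
BCP47_TO_IANA = {
    "inccu": "Asia/Kolkata",
    "fmpni": "Pacific/Pohnpei",
}


def canonicalize(iana_id: str) -> str:
    """Round-trip an IANA name through BCP47 to reach its canonical name."""
    return BCP47_TO_IANA[IANA_TO_BCP47[iana_id]]
```

With these two tables, any alias resolves to a canonical name in two lookups, and the stable BCP47 ID is what would be persisted in between.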

@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label Dec 20, 2022
@nordzilla nordzilla added T-core Type: Required functionality C-data-infra Component: provider, datagen, fallback, adapters labels Dec 22, 2022
@nordzilla
Member Author

nordzilla commented Dec 22, 2022

This situation may be a bit more complicated due to the potential and inevitable out-of-sync-ness between IANA and CLDR. I'd like to bring up a particular example for discussion: a change to IANA between tzdb-2022e and tzdb-2022f.

I'll describe my initial plan for retrieving the canonical TZIDs and how the above change could cause issues given this approach.


Initial Plan

My initial plan was to simply generate the TZif data from the IANA database using the Makefile option BACKWARD=, which disables backward-compatible name aliases (i.e. the TZif files and TZIDs will include only the current, up-to-date IANA TZIDs).

The full data generation process would be as follows:

  1. From CLDR, generate a many-to-one ICU4X mapping from all known TZIDs (canonical and aliased) to BCP47 IDs using the CLDR BCP47 data. (This work is already complete.)

  2. Given the mapping generated in step 1), generate from the IANA database a new one-to-one ICU4X mapping from BCP47 IDs to up-to-date IANA IDs.
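
The two steps above can be sketched as follows. This is a minimal Python illustration, not the datagen implementation: `build_bcp47_to_iana` is a hypothetical helper, and the set of current IANA IDs is assumed to have been collected from a TZif build made with BACKWARD=.

```python
def build_bcp47_to_iana(iana_to_bcp47, current_iana_ids):
    """Invert a many-to-one IANA->BCP47 map, keeping only current IANA IDs."""
    bcp47_to_iana = {}
    for iana_id, bcp47_id in sorted(iana_to_bcp47.items()):
        if iana_id not in current_iana_ids:
            continue  # backward-compatible alias, not a current IANA ID
        if bcp47_id in bcp47_to_iana:
            raise ValueError(
                f"{bcp47_id} has two current IANA IDs: "
                f"{bcp47_to_iana[bcp47_id]} and {iana_id}"
            )
        bcp47_to_iana[bcp47_id] = iana_id
    return bcp47_to_iana


# Step 1 output (from CLDR), limited to sample entries for illustration.
IANA_TO_BCP47 = {
    "Pacific/Ponape": "fmpni",
    "Pacific/Pohnpei": "fmpni",
    "Pacific/Guadalcanal": "sbhir",
}

# Assumed input for step 2: the names present in the IANA build.
CURRENT_IANA_IDS = {"Pacific/Pohnpei", "Pacific/Guadalcanal"}

BCP47_TO_IANA = build_bcp47_to_iana(IANA_TO_BCP47, CURRENT_IANA_IDS)
```

Raising on a duplicate is one possible design choice; it makes a violation of the one-to-one assumption visible at datagen time instead of silently picking a name.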


The Problem

This method of generating data would work as long as the CLDR BCP47 data and the IANA data are up to date with each other.

However, as mentioned above, there is a change between the e and f versions of the IANA database that would cause problems in the above model: Pacific/Pohnpei was merged with Pacific/Guadalcanal.

Before
tzdb-2022e

TZID(Pacific/Pohnpei):
    Alias(Pacific/Ponape) 
    BCP47(fmpni)

TZID(Pacific/Guadalcanal):
    BCP47(sbhir)

After
tzdb-2022f

TZID(Pacific/Guadalcanal):
    Alias(Pacific/Pohnpei)
    Alias(Pacific/Ponape)
    BCP47(sbhir)

As such, it appears that the BCP47 ID fmpni will no longer be used; however, the CLDR BCP47 file is not up to date with this change.

This would cause an issue where the ICU4X data pulled from CLDR maps Pacific/Pohnpei and Pacific/Ponape to fmpni, which would have no corresponding IANA time-zone ID in the ICU4X data pulled from IANA itself.

What should happen is that Pacific/Pohnpei and Pacific/Ponape should be mapped to sbhir, which is linked to Pacific/Guadalcanal.

Using the above model, this won't happen until the CLDR BCP47 data is updated to match the current IANA data.
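
To make the mismatch concrete, here is a small Python sketch that checks for BCP47 IDs left without any current IANA name. The function name is hypothetical and the data is just the before/after example from this comment.

```python
# CLDR view (not yet updated for tzdb-2022f): IANA name -> BCP47 ID.
CLDR_IANA_TO_BCP47 = {
    "Pacific/Ponape": "fmpni",
    "Pacific/Pohnpei": "fmpni",
    "Pacific/Guadalcanal": "sbhir",
}

# Current (non-alias) IANA IDs in each tzdb release, per the example above.
CURRENT_2022E = {"Pacific/Pohnpei", "Pacific/Guadalcanal"}
CURRENT_2022F = {"Pacific/Guadalcanal"}


def dangling_bcp47_ids(iana_to_bcp47, current_iana_ids):
    """Return BCP47 IDs that no current IANA ID maps to, sorted."""
    covered = {
        bcp47_id
        for iana_id, bcp47_id in iana_to_bcp47.items()
        if iana_id in current_iana_ids
    }
    return sorted(set(iana_to_bcp47.values()) - covered)
```

Against the 2022e set nothing is dangling; against the 2022f set, `fmpni` is reported, which is the gap described above.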


Solutions

I would like to discuss some solutions:

  1. Manually parse more of the IANA database ourselves such that we can resolve these discrepancies in the data provider and ensure that the mapping is up to date. This would allow us to solve some issues, such as when two existing TZIDs get merged as in the case described above, but it wouldn't allow us to solve discrepancies if a new TZID is added to IANA and there is no entry for it at all in CLDR or BCP47.

  2. Determine that it is not our responsibility to resolve discrepancies between different versions of IANA and CLDR. Data should be generated with versions that were released around the same time, and when new changes to IANA are made, we just have to wait for the CLDR data to catch up.

  3. Other alternatives (open to discussion)

@nordzilla nordzilla added the discuss-priority Discuss at the next ICU4X meeting label Dec 22, 2022
@nordzilla
Member Author

@sffc I've labeled this as discuss priority, though we won't have another meeting for a while due to various holiday schedules.

If you have any thoughts about this async, I'd love to continue this discussion on here as well.

@sffc
Member

sffc commented Dec 22, 2022

How much more difficult is it to do (1) rather than (2)?

I think it's not the end of the world if we do (2) and just wait for CLDR releases in order to get the most up-to-date set of names, if that is the easiest solution. It seems that we could make (1) be a "nice to have" improvement. But, if (1) is easy (less than a few hours' work), then we should do it since I think we all agree it is the better solution.

@sffc
Member

sffc commented Jan 5, 2023

Discussion:

  • @sffc If IANA adds a new time zone, CLDR won't have the BCP-47 alias for it yet. Therefore, we should probably favor CLDR here. We aren't locking ourselves into anything since this is wholly a datagen-time configuration.

@sffc sffc removed the discuss-priority Discuss at the next ICU4X meeting label Jan 5, 2023
@sffc sffc added this to the 1.x Priority ⟨P2⟩ milestone Jan 5, 2023
@sffc sffc added the U-fuchsia User: Fuchsia label Jan 25, 2023
@Manishearth Manishearth added the U-flutter User: Flutter Engine or Dart SDK label Feb 2, 2023
@hsivonen hsivonen added the U-gecko User: Gecko label Feb 14, 2023
@justingrant

The CLDR time-zone identifiers never change and are not necessarily canonical, whereas the time zone identifiers from the IANA database are allowed to update and change over time.

Some examples are the change of names from Pacific/Ponape to Pacific/Pohnpei, and Asia/Calcutta to Asia/Kolkata.

@sffc pointed me to an encouraging sign: CLDR is investigating how to address the out-of-date alias problem. See CLDR-14453.

Note that AFAICT there are only 13 CLDR zones that use an out-of-date canonical identifier. So until CLDR figures out how to address the problem, maintaining a hard-coded list of <20 overrides (assuming the list grows a few per year) seems like a small price to pay to avoid the out-of-sync problem.
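
A hard-coded override list of this kind could look roughly like the Python sketch below. The table name and helper are hypothetical, and only the two renames quoted above are filled in; the actual list would have to be compiled from the CLDR and IANA data.

```python
# Overrides from an out-of-date CLDR canonical ID to the current IANA name.
# Only the two renames mentioned in this thread are listed here.
CANONICAL_OVERRIDES = {
    "Asia/Calcutta": "Asia/Kolkata",
    "Pacific/Ponape": "Pacific/Pohnpei",
}


def apply_overrides(cldr_canonical_id: str) -> str:
    """Return the overridden name if one exists, else the CLDR ID unchanged."""
    return CANONICAL_OVERRIDES.get(cldr_canonical_id, cldr_canonical_id)
```

Because unknown IDs pass through unchanged, the table only needs entries for the zones where CLDR and IANA disagree.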

Besides staying in sync, another reason to favor CLDR data is avoiding IANA's aggressive merging of unrelated time zones. Here are some of the canonicalizations in the latest (2022g) IANA data:

  • Most Balkan countries => Europe/Belgrade
  • Sweden, Denmark, and much of central Europe => Europe/Berlin
  • Atlantic/Reykjavik => Africa/Abidjan (a different continent!)
  • Various northern-Canada zones => America/Panama (?)
  • 20+ African countries => Africa/Maputo, Africa/Lagos, Africa/Johannesburg, or Africa/Abidjan
  • Almost every Caribbean country => America/Puerto_Rico

Here's an excerpt from a CLDR-14453 comment explaining the problem in more detail.

I found it helpful to classify Links in IANA TZDB as “synonyms” or “merges”.

  • Synonyms - these are equivalent zones, like Asia/Calcutta vs. Asia/Kolkata, or PRC vs. Asia/Shanghai. Regardless of how time zones change in the future, these will always be the same.
  • Merges - these are zones representing different locations that just happen to share the same time zone rules since 1970, but some future change might cause them to diverge.

Synonyms are good, but merges can be problematic for a few reasons:

  • Data loss - When canonical identifiers are persisted (e.g. timestamps like 2025-01-25T10:00[Africa/Abidjan] for a future meeting in Reykjavik) valuable metadata about that timestamp is lost, leaving it brittle if Côte d'Ivoire or Iceland changes its time zone.
  • Cultural sensitivity - Time zone identifiers shouldn’t be shown to end-users, but even among technical users it’s likely to cause lots of confusion and frustration (also trolling and bad press) if every Balkan zone redirects to Belgrade, most Central European zones redirect to Berlin, and 20+ African countries merge down to 4 zones.
  • Backwards compatibility - Only 10% of the merges in IANA TZDB are currently followed by CLDR. (To verify, go here and filter by merges.) If tomorrow suddenly all merges were followed, it’d probably break a lot of apps.

CLDR aliases are only synonyms, not merges. This makes CLDR's canonicalization avoid the problems above.

So I'd be hesitant to use the IANA source for canonicalization purposes.

@sffc sffc added the U-google User: Google 1st party label Apr 22, 2023
@sffc
Member

sffc commented Apr 22, 2023

Just noting that this issue now has 4 user tags on it, which means a lot of clients need it. We should prioritize implementing this issue in 1.3.

Note: I am assuming that two-way conversion between BCP-47 and IANA is in scope of this issue.

@justingrant

@sffc - I assume that you didn't mean to close this issue via #3498, right?

@justingrant

justingrant commented Jun 7, 2023

Also, while I'm here: there will likely need to be a way to retrieve the case-normalized variant of a time zone identifier, e.g. asia/ulaanbaatar => Asia/Ulaanbaatar. TZDB identifiers are case-insensitive, so any case should be accepted as input, but output should always be case-normalized to match TZDB, even if those outputs are not canonicalized to the canonical Zone name in TZDB.
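
A minimal Python sketch of that behavior, assuming a list of known identifiers is available in their TZDB spelling (the list and function name here are hypothetical):

```python
# Known identifiers in their TZDB spelling (sample values only).
KNOWN_IDS = ["Asia/Ulaanbaatar", "Asia/Kolkata", "Asia/Calcutta"]

# Lowercased spelling -> TZDB spelling, built once.
_BY_LOWERCASE = {tzid.lower(): tzid for tzid in KNOWN_IDS}


def normalize_case(tzid: str):
    """Return the TZDB spelling of tzid, or None if it is unknown.

    This only fixes letter case; it does not canonicalize aliases.
    """
    return _BY_LOWERCASE.get(tzid.lower())
```

Note that `Asia/Calcutta` stays `Asia/Calcutta` here: case normalization and canonicalization are kept as separate steps.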

@Manishearth Manishearth reopened this Jun 7, 2023
@sffc
Member

sffc commented Jun 7, 2023

OK I'm implementing this based on the CLDR data:

https://github.com/unicode-org/cldr-json/blob/main/cldr-json/cldr-bcp47/bcp47/timezone.json

I need to pick between "first in list" or "last in list". I'm currently using "last in list", which gives me Asia/Kolkata, but it gives me things like US/Central as the canonical ID for America/Chicago. I hope this is fixed up very soon with https://unicode-org.atlassian.net/browse/CLDR-14453
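
To show the trade-off, here is a small Python sketch. It assumes each entry carries its aliases as one space-separated string; the alias strings below are sample values consistent with the behavior described above, not a copy of timezone.json, and `pick_canonical` is a hypothetical helper.

```python
# BCP47 ID -> space-separated alias string (assumed sample values).
ALIASES = {
    "inccu": "Asia/Calcutta Asia/Kolkata",
    "uschi": "America/Chicago US/Central",
}


def pick_canonical(alias_attr: str, strategy: str) -> str:
    """Pick one name from a space-separated alias string."""
    names = alias_attr.split()
    if strategy == "first":
        return names[0]
    if strategy == "last":
        return names[-1]
    raise ValueError(f"unknown strategy: {strategy!r}")
```

With these samples, "last" yields Asia/Kolkata but also US/Central, while "first" yields America/Chicago but Asia/Calcutta, so neither rule alone gives the preferred name in both cases.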

there will likely need to be a way to retrieve the case-normalized variant of a time zone identifier

Out of scope of my current PR, but we should make a follow-up issue for this, if we think it is in scope of ICU4X.

@sffc
Member

sffc commented Sep 11, 2023

The new spec in unicode-org/cldr#3105 says:

To maintain the stability of "long" IDs (for those inherited from the tz database), a special rule applied to the alias attribute in the <type> element for "tz" - the first "long" ID is the CLDR canonical "long" time zone ID. In addition to this, iana attribute specifies the preferred ID in the tz database if it's different from the CLDR canonical "long" ID.

So I will change my PR to read the first element of the list, and when CLDR 44 is rolled in, we can update datagen to start consuming the new iana field.

The PR also changes the data so that not every row has an alias field (adding a preferred field instead), so further changes may be needed here in CLDR 44.
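
The rule quoted from the spec can be sketched in Python as follows. The entry dictionaries are a simplified, assumed representation of a `<type>` element's attributes, and both function names are hypothetical.

```python
def cldr_canonical_id(entry: dict):
    """First "long" ID in the alias attribute, or None if there is no alias."""
    aliases = entry.get("alias", "").split()
    return aliases[0] if aliases else None


def preferred_iana_id(entry: dict):
    """The iana attribute when present, else the CLDR canonical ID."""
    return entry.get("iana") or cldr_canonical_id(entry)


# Assumed sample entries, shaped after the attributes named in the spec text.
INDIA = {"alias": "Asia/Calcutta Asia/Kolkata", "iana": "Asia/Kolkata"}
CHICAGO = {"alias": "America/Chicago US/Central"}
```

Returning None for a row without an alias is a placeholder choice here, since the handling of such rows depends on the CLDR 44 data changes mentioned above.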
