Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add performance focused APIs #17

Open
BurntSushi opened this issue Jul 10, 2024 · 11 comments
Open

add performance focused APIs #17

BurntSushi opened this issue Jul 10, 2024 · 11 comments
Labels
help wanted Extra attention is needed question Further information is requested

Comments

@BurntSushi
Copy link
Owner

BurntSushi commented Jul 10, 2024

In building out Jiff, my focus has generally not been on performance. There are some Criterion benchmarks, but they are by no means exhaustive. I would love to have more added.

While I think there's always room to optimize the implementation (and if you have ideas that require major refactoring, please file an issue), this issue is more about API changes or additions that are needed to improve performance. For example, I'm thinking about things like "datetime arithmetic requires the use of Span, and Span is a very bulky type." At time of writing, its size is 64 bytes. I presume that this means that there is some non-trivial amount of time being spent in various operations on Span memcpying around this enormous bag of bytes.

Getting rid of Span or splitting it up isn't really an option as far as I'm concerned. So in order to make things go faster, I think we really just need to be able to provide APIs that operate on more primitive data types. Like, for example, an i64 of nanoseconds. This would come with various trade-offs, so I'd expect these APIs to be a bit more verbosely named.

Here's a probably non-exhaustive list of API design choices that likely have an impact on performance:

  • A Timestamp is always a 96-bit integer.
  • A Span is a whopping 64 bytes. A Span is Copy and most APIs accept a Span instead of a &Span.
  • Dealing with spans generally takes a lot of work. So when you add, say, a Span to a Timestamp, it's not just a simple matter of ~one integer addition. There's work needed to be done to collapse all of the clock units on Span down to a single integer number of nanoseconds, and then i128 math is used. Perhaps there is a way to optimize a Span for the "single unit" case, although doing this without indirection will make a Span even bigger, and doing it with indirection will remove the Copy bound. Another alternative might be leaving Span as-is, but adding non-Span APIs to operate on durations. But that has some design work needed as well. See add more integration with std::time::Duration? #21.
  • A Zoned embeds a TimeZone which means a Zoned is not Copy and cloning/dropping a Zoned is, while cheap, not as cheap as a small Copy type. (This also means most things take a &Zoned instead of a Zoned.)
  • A default TimeZoneDatabase uses a thread safe shared cache internally. This means time zone lookups by name require some kind of synchronization. (Usually this overhead can be worked around by asking for all the time zones you need up-front, but this is difficult if you're deserializing zoned datetime strings.)
  • The use of ranged integers internally makes some kinds of layout optimizations more difficult than they would be otherwise. For example, it might be tempting to bit pack the representation of some types, but if you do that, you'd have to either define a new ranged integer abstraction for bitpacking (plausible... but maybe complicated) or abdicate range checking altogether. You might think you can just convert to and from ranged integers as you need them, but the whole point of Jiff's ranged integer abstraction is that they track min/max values when debug_assertions are enabled. So if you drop the ranged integer type to do bit-packing and then re-create the ranged integer after unpacking, you will have lost the min/max values that were originally attached. Anyway, see consider ripping out ranged integer types #11 for more on this.

Anyway, before we get too focused on API design, I would like to see concrete use cases first in the form of executable code and/or benchmarks.

@BurntSushi BurntSushi added help wanted Extra attention is needed question Further information is requested labels Jul 10, 2024
@Diggsey
Copy link

Diggsey commented Jul 22, 2024

I think Zoned being !Copy isn't too much of an issue given what the type is designed to do. I would expect the Timestamp type to be far more commonly used in practice.

Timezones only really come into play in the front-end (which Rust is not typically used for...) or for some specific kinds of applications such as calendars, where events need to be scheduled at a "civil time" in the future.

Overall this crate seems very well designed, but if I have one criticism it's that I think it over-emphasizes the use of Zoned. Clearly there are cases when you need to store a runtime timezone, but IME they are few and far between (at least in the domains where Rust is most popular). Having this functionality, and having it interoperate perfectly with timestamps is very useful, but starting out with "The primary type in this crate is Zoned." is going to mislead some users I think.

If you intend this crate to displace chrono and time as "the" general purpose time crate, I would like to suggest:

  • Splitting the Timestamp type (and other non-timezone-aware functionality) into a jiff-core or similarly named crate. This way other crates (eg. database bindings) can provide integrations with jiff without needing to pull in all the timezone logic.

  • Adding a pure time interval version of Span for use with Timestamp.

  • Adding some guidance to the documentation about how to choose a particular date-time type depending on the use-case, so that users are directed to only store a dynamic timezone when actually necessary.

@BurntSushi
Copy link
Owner Author

BurntSushi commented Jul 22, 2024

I think (3) is #24.

I think (2) is probably #21.

I am unsure about (1). I'm not sure if I agree that Timestamp is the right "default" to emphasize. Perhaps time zone aware datetimes aren't being used as much as they should be. I'm not sure. With that said, I am open to a jiff-core-style crate. But I think the API design and split will be challenging. In particular, I think it will be hard to do without sacrifices to API comprehensibility, which I am loathe to do. But like I said, I am open to it. I think we just need to collect more data. Another option is making time zone support a crate feature. Then other libraries can depend on jiff with a stripped down set of features.

@Diggsey
Copy link

Diggsey commented Jul 23, 2024

Perhaps time zone aware datetimes aren't being used as much as they should be.

Heh, my experience is the opposite - in both JS and Python I've experienced significant extra complexity and bugs from their datetime types coming with a "packaged" timezone being used in places where timezones were not relevant.

In Python the main issue has been use of datetime where a simple timestamp would be more appropriate, since 99% of cases I'm dealing with a point in time, and the timezone is only relevant in so far as displaying that instant to the user in their preferred timezone. Storage should (in this case) be done in UTC and that's what the database expects.

In JS, only a couple of days ago I encountered a bug at work where a Date was being used to display a date (no time) but because the user's local timezone was offset it ended up showing the wrong date in the UI, despite being stored correctly in the backend. (Why did it use a JS Date? Because that's what the 3rd party date picker component was built around...) Variants of this bug are everywhere in JS. In this case a civil::Date with no timezone would have been more appropriate.

In contrast, I haven't had any issues with chrono or time in Rust since they don't really have the capacity to mess up the timezone. They just don't support the rare case where you do care about it.

It's honestly quite hard to think of use-cases where I'd want to store a datetime + timezone. Storing a future event in a calendar is the go-to example, but even then the only motiviation for that (vs storing a timestamp) is to deal with the edge-case where timezone data changes between the event being scheduled and it actually occurring, and doing that comes at a cost: it becomes impossible to schedule events at "ambiguous" times (at least with the default serialization implementation of Zoned).

If I want to build a best-possible calendar app, what's more likely? That the timezone data changes in a meaningful way, or that someone wants to schedule an event during a DST change... Both seem similarly likely.

(As an aside: maybe there's a case for changing the serialization to disambiguate these cases... It doesn't help if the timezone data changes and the event lies on a DST change, but that's a legitimate ambiguity in that case.)

Another option is making time zone support a crate feature.

Yeah that also works, I only didn't suggest it because I assumed you'd want to make the timezone functionality be default-enabled, and I really hate default features in Rust, since nobody remembers to turn them off.

@BurntSushi
Copy link
Owner Author

It is had to disentangle your examples of bugs with the underlying datetime library. Both of the datetime libraries you mention have very serious deficiencies.

it becomes impossible to schedule events at "ambiguous" times (at least with the default serialization implementation of Zoned).

I don't see why this would be true. You can't schedule an event at 2am on March 10, 2024 in New York because that clock time didn't exist. That seems right to me?

Yeah that also works, I only didn't suggest it because I assumed you'd want to make the timezone functionality be default-enabled, and I really hate default features in Rust, since nobody remembers to turn them off.

Yes, it would be enabled by default.

I think we overall need to collect more data. That will take some time and require users. I don't know that you are wrong, but I think you are.

@BurntSushi
Copy link
Owner Author

BurntSushi commented Jul 23, 2024

My very high level idea is that datetime libraries have sucked for so long, and this may have been inhibiting use cases or causing other problems that are falsely attributed. This is just an opinion though. So basically, I am excited to see what, if anything, Jiff unlocks for folks. IMO, its killer feature is not just Zoned but also Span. I think Span is going to take more time for folks to discover because I don't think anything like it has existed yet (until Temporal).

@Diggsey
Copy link

Diggsey commented Jul 23, 2024

I don't see why this would be true. You can't schedule an event at 2am on March 10, 2024 in New York because that clock time didn't exist. That seems right to me?

That's a non-existent time, not an ambiguous time. If I schedule an event at a time that occurs twice due to clocks being put backwards, then jiff's Zoned type appears to be able to correctly disambiguate the actual time, but if I serialize/deserialize it, then that information will be lost. I would expect the default serialization format to faithfully round-trip the value.

IMO, its killer feature is not just Zoned but also Span. I think Span is going to take more time for folks to discover because I don't think anything like it has existed yet (until Temporal).

It's nice, but it's definitely not new. The interval type in PostgreSQL has always worked this way.

@BurntSushi
Copy link
Owner Author

BurntSushi commented Jul 23, 2024

That's a non-existent time, not an ambiguous time.

Ah yeah, both Temporal and Jiff refer to it as ambiguous. The thing that is ambiguous is the offset.

but if I serialize/deserialize it, then that information will be lost

No, it is not! This is a key feature of Jiff:

use jiff::{civil::date, tz::TimeZone};

fn main() -> anyhow::Result<()> {
    // The ambiguous civil datetime:
    let dt = date(2024, 11, 3).at(1, 30, 0, 0);
    // By default, the "compatible" disambiguation strategy is used. Which
    // selects the later time for a gap and the earlier time for a fold:
    let zdt = date(2024, 11, 3).at(1, 30, 0, 0).intz("America/New_York")?;
    assert_eq!(zdt.to_string(), "2024-11-03T01:30:00-04:00[America/New_York]");
    // One can override and pick whichever strategy they want:
    let tz = TimeZone::get("America/New_York")?;
    let zdt = tz.to_ambiguous_zoned(dt).later()?;
    // Notice the offset change from above:
    assert_eq!(zdt.to_string(), "2024-11-03T01:30:00-05:00[America/New_York]");

    Ok(())
}

See jiff::tz::AmbiguousZoned for more details.

The fact that Jiff losslessly roundtrips Zoned values is a key comparison point between it and Chrono.

It's nice, but it's definitely not new. The interval type in PostgreSQL has always worked this way.

It doesn't though? I'm not intimately familiar with PostgreSQL's interval type, but from a quick glance I can immediately see several differences:

  • It doesn't keep track of each unit separately. So it can't distinguish between 120 minutes and 2 hours.
  • It doesn't seem to support rounding/balancing.
  • It doesn't seem to support using a reference date for adding spans, or at least lets you add spans together even when they have non-uniform units.

So for example, on that last point, this is what PostgreSQL does:

postgres=# SELECT INTERVAL '1 months 27 days' + INTERVAL '3 days';
   ?column?
---------------
 1 mon 30 days
(1 row)

postgres=# SELECT INTERVAL '1 months 27 days' + INTERVAL '4 days';
   ?column?
---------------
 1 mon 31 days
(1 row)

postgres=# SELECT INTERVAL '1 months 27 days' + INTERVAL '5 days';
   ?column?
---------------
 1 mon 32 days
(1 row)

Compare this with Jiff:

use jiff::{civil::date, ToSpan};

fn main() -> anyhow::Result<()> {
    let span = 1.month().days(27);
    // The span has non-uniform units (months), so a
    // reference date is required to do addition. Otherwise,
    // you get an error:
    assert!(span.checked_add(3.days()).is_err());
    // So pick a date:
    assert_eq!(
        span.checked_add((3.days(), date(2024, 6, 1)))?,
        1.month().days(30)
    );
    // So pick a different date and notice that the span
    // returned reflects the actual duration of the
    // corresponding months:
    assert_eq!(
        span.checked_add((3.days(), date(2024, 1, 1)))?,
        2.months().days(1)
    );

    Ok(())
}

This all comes from Temporal and RFC 5545.

@Diggsey
Copy link

Diggsey commented Jul 23, 2024

No, it is not! This is a key feature of Jiff:

Ah, my mistake - I didn't realise it would use the timezone offset as a disambiguator like that, nice!

It doesn't keep track of each unit separately. So it can't distinguish between 120 minutes and 2 hours.

It keeps track of units with variable ratios separately:

PostgreSQL stores interval values as months, days, and microseconds.

  • Seconds, minutes, hours always exist in exact ratio to microseconds, so there is no need to store them separately.
  • Weeks are always 7 days, so no need to store separately.
  • Years are always 12 months, so no need to store separately.

It doesn't keep track of each unit separately. So it can't distinguish between 120 minutes and 2 hours.

Because 120 minutes is always 2 hours, so it makes sense for them to be equal?

It doesn't seem to support rounding/balancing.

I can't say I've taken exhaustive stock of all the PostgreSQL date/time related functions, so perhaps this doesn't exist anywhere? I would be surprised though.

It doesn't seem to support using a reference date for adding spans, or at least lets you add spans together even when they have non-uniform units.

I'm not quite sure I understand this? Months are added separately to days, which are added separately to microseconds, so it doesn't matter what start date you have, date + (interval + interval) is always the same as (data + interval) + interval.

@BurntSushi
Copy link
Owner Author

Because 120 minutes is always 2 hours, so it makes sense for them to be equal?

I didn't say that it didn't make sense. But the fact that Jiff can distinguish them (but also compare them as equal via Span::compare) is a feature. Sometimes you want to express intervals in smaller units even when they can be balanced up into bigger units. It's use case dependent and there are no universal rules. So Jiff, inspired by Temporal, lets the caller choose which units are used. For example, if you wanted to show the length of a film, that is by convention expressed in units of minutes, even when it exceeds 1 hour.

I'm not quite sure I understand this? Months are added separately to days, which are added separately to microseconds, so it doesn't matter what start date you have, date + (interval + interval) is always the same as (data + interval) + interval.

The example I showed above demonstrates the issue. PostgreSQL can't balance days up into months. It does balance months up into years, because as you say, there's always 12 months in a year. But it can't do it for days. So you end up with things like 1 month 32 days instead of 2 months N days.

You're focused on the correctness of the result. A Span isn't just about correctness. It's about being able to freely move between any units of a duration and do it correctly. I'm not saying PostgreSQL is incorrect, I'm saying that it isn't nearly as flexible as what Jiff provides.

My understanding of the Temporal design goal here was to 1) satisfy RFC 5545 (iCalendar) and 2) put all the hard logic of dealing with durations into the datetime library such that localizing a duration becomes a much simpler task. I asked more about it here: tc39/proposal-temporal#2915

@Diggsey
Copy link

Diggsey commented Jul 23, 2024

PostgreSQL can't balance days up into months.

reference_date + interval - reference_date will round sub-days up to days although not up to months.

But I think we can agree that:

  1. The representation used by PostgreSQL does not prevent such functionality being provided.
  2. Jiff provides more functionality out of the box on top of that representation.

@BurntSushi
Copy link
Owner Author

Yes. And we haven't even gotten to the time zone aware aspects of Span when used with a Zoned in Jiff. From the PostgreSQL docs:

The default time zone is specified as a constant numeric offset from UTC. It is therefore impossible to adapt to daylight-saving time when doing date/time arithmetic across DST boundaries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants