Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

initial timezone provider (location based + fallbacks) #96

Merged
merged 4 commits into from
Oct 9, 2020

Conversation

karlicoss
Copy link
Owner

@karlicoss karlicoss commented Oct 8, 2020

@seanbreckenridge this is what I've mentioned in #90 (comment)

The idea is that you can use localize method against a timezone-unaware date, and it will determine it from google location (later can think of using other location provider too, I've got some gpslogger stuff). If it can't do it based on the precise location data, it determines based on my.location.home.

For now the resolution is day-based, but I'll experiment a bit more and later will implement it to be more precise.

So far, I've only been using it on caller sites, whereas would be kind of cool to do it from within HPI itself (e.g. with sms provider, we already know it's UTC so no point in forcing the user to figure that out). But I feel like it would be good to experiment a bit more first to make sure. Also would need to make it a bit more defensive, so if the user doesn't have location data, it's not crashing everything.

@seanbreckenridge
Copy link
Contributor

seanbreckenridge commented Oct 8, 2020

Cool, looks like a good starting point for something else I wanted to do.

I've parsed the google takeout a bit differently (so Im not sure if the dateinfo on mine is the same); using lxml so I could keep my sanity, specifying lots of the other attributes from the takeout; it does take about a minute to parse the first time I launch it; after that its behind cachew. I also only have data going 3 years back from there, while I have data from facebook/forums going back to 2013 (when I was in a different timezone).

So I'm unsure if I can just copy/paste this, I think I'll have to create some system that doesn't primarily use the google takeout.

I still like the idea of being able to localize a datetime from some source into the timezone I was in at the time.

Personally, I don't change timezones (i.e. physically move) that often, and I havent for the past few years.

So, related to location data, my plans were to:

  • Add a block in my location config/(somewhere else?, could always import from a relative private config file in ~/.config/my) where I could manually specify when I moved houses across timeszones/dates I was in different timezones for extended periods of time. That way, I have some way to manually specify timezones.
  • For future data, look into using an android/computer application that periodically saves my current location, and infer timezone from that. Not sure if you have something similar or have any ideas there.
  • Lots of my data/exports have IP addresses as well, so I wrote this to get location data from that, could probably also be used to infer timezone.

@karlicoss
Copy link
Owner Author

I've parsed the google takeout a bit differently

Ah I see! Makes sense, just somehow didn't occur to me to do it via Union. But anyway I guess would be fairly easy to filter out locations from just based on isinstance check, and then just plug it here. I'm also parsing and assuming UTC (which I'm pretty sure is the case), so should be compatible here.

Add a block in my location config

Yep, that's implemented here as well! https://github.com/karlicoss/HPI/pull/96/files#diff-17afdef3985dd7b19cb4ffc9fb86179dR71-R76 Timezones are kind of indirect (determined via coordinates), but I thought it makes sense because that way you have both location data and timezone (as opposed to if only the timezone was specified).

Not sure if you have something similar or have any ideas there.

Yeah, gpslogger is great, I just need to play with its outputs for a bit, and maybe I'll even switch from google location completely.

I wrote this to get location data from that, could probably also be used to infer timezone

Yep, I've seen it, great idea! Don't think I have any ip data going back further than the location data I have, but it's certainly a kind of interaction would be very cool to support.

If implemented, it would probably belong here (I'm thinking now that main probably makes more sense than all... but still not sure, maybe you have a better idea?)
So if you imagine ipgeocache data integrated, your my.time.tz.main could look like:

from . import via_location
from . import via_ip

def localize(dt):
    for provider in [via_location, via_ip]:
        dt = provider.localize(dt)
        if dt.tzinfo is not None:
            return dt
    return dt # or raise error? depending on what user prefers

Alternatively, ipgeocache could be integrated into the location provider, and then IP data would be automatically handled by via_location. Similarly, feels more straightforward to me, because if you have location, you can infer timezone, but not vice versa, so probably makes sense to integrate it in the location provider.

@seanbreckenridge
Copy link
Contributor

seanbreckenridge commented Oct 9, 2020

I'm also parsing and assuming UTC (which I'm pretty sure is the case), so should be compatible here.

Ah, nice. I'll do the isinstance check then.

    class home:
        current = (1.0, 1.0)
        past = [
            ((40.7128, -74.0060), '2005-12-04'), # NY
        ]

nice! this is pretty much what I was planning to do

I thought it makes sense because that way you have both location data and timezone

Yeah, I agree. I had looked at timezonefinder to convert the location to a timezone before writing ipgeocache, seems youre using the same.

gpslogger

👍

I'm thinking now that main probably makes more sense than all... but still not sure, maybe you have a better idea?

I think I'd agree, yeah. someone familiar with python scanning the directory for the first time is more likely to know main is the file they should look at as the 'entrypoint' as opposed to all. all does make sense as well, because its merging 'all' sources. Suppose it depends on whether you want it to make sense semantically or to someone looking at it for the first time. Can't think of anything much better than that.

Alternatively, ipgeocache could be integrated into the location provider, and then IP data would be automatically handled

Only thing to watch out with the IP -> Location -> Timezone is that some locations given by the IP geolocation seems to just be wrong. I don't know if thats because they're IPs from about a decade ago, or if facebook/google/blizzard (sources for my IPs) had something configured wrong. Some of my logged IPs are in Indonesia/Thailand, when I definitely know I wasn't there....

99% of the time the location data from the IP data is fine though.

Summarizing, in order, my 'resolution order' for location (and therefore timezone) for some timeframe (not sure if this would just be day-based, or maybe one could specify a datetime.timedelta for a duration?) would probably be:

  • manually marked location/time
  • gpslogger
  • google data/ip data

seanbreckenridge added a commit to seanbreckenridge/HPI that referenced this pull request Oct 9, 2020
@karlicoss
Copy link
Owner Author

some locations given by the IP geolocation seems to just be wrong.

Indeed, good point. In many cases it's also ISP address, so not super accurate too (still good enough for timezone). Can be combated with some sort of outlier detection, but either way possible to fine tune by filtering out the offending ip ranges.

manually marked location/time

Hm I guess there are two usecases for manually marked stuff. One is the 'default/home' which I've implemented in this PR, used as a fallback when you have a gap in automatic data.
Another (I guess that you have in mind) is an override, for example when you know for sure the automatically logged data was wrong and want to correct it. I guess both modes are useful, also should be possible to support with minimal code duplication (merely by passing different configs).

@karlicoss karlicoss merged commit 1f9be2c into master Oct 9, 2020
@karlicoss karlicoss deleted the tz_provider branch October 9, 2020 20:09
@seanbreckenridge
Copy link
Contributor

Had a look at my.location, unsure if we have a different concept on how to use the my.config.location.home.past block. I made some changes to mine:

diff --git a/my/location/home.py b/my/location/home.py
index 896ebca..641a872 100644
--- a/my/location/home.py
+++ b/my/location/home.py
@@ -56,7 +56,17 @@ def get_location(dt: datetime) -> LatLon:
     """
     Interpolates the location at dt
     """
+    if not config._past:
+        return config.current
+    prev_dt: datetime = datetime.now()
     for loc, pdt in config._past:
-        if dt <= pdt:
+        # iterating moving from today to the past,
+        # if this datetime is in between the last time reported
+        # and this one, return the location of the last time reported LatLon
+        # (prev_dt would be the next place I moved to)
+        if prev_dt >= dt and pdt < dt:
             return loc
+        prev_dt = pdt
+    from ..core.warnings import medium
+    medium("Don't have any location going back further than {}, using current location".format(prev_dt))
     return config.current

If one just did dt <= pdt, that would mean I'd need to have multiple entries for each place I've lived, right? Have to mark both the start/end point, instead of just marking the start point.

@karlicoss
Copy link
Owner Author

Ah! I've specified them in the order from the oldest to newest in my config, whereas seems that you specified from the newest to oldest? I guess the safest thing would be to sort in _past method to avoid any confusion.

@seanbreckenridge
Copy link
Contributor

Now that I look back at it, sorting from oldest to newest makes the loop a lot nicer, yeah. Removes both of my checks/the warning message.

Will add a sort to _past on my fork as well.

@seanbreckenridge
Copy link
Contributor

Hmm. I still feel like I need to have two pointers through the loop though.

Sorted and reverted back to your get_location function

In [4]: from my.location.home import get_location, config

In [5]: config
Out[5]: home(current='here_now', past=[('location_one', '2019-05-15'), ('location_two', '2018-09-01')])

In [6]: config._past
Out[6]:
[('location_two', datetime.datetime(2018, 9, 1, 0, 0)),
 ('location_one', datetime.datetime(2019, 5, 15, 0, 0))]

In [7]: late_2018 = datetime.replace(datetime.now(), year=2018)

In [8]: late_2019 = datetime.replace(datetime.now(), year=2019)

In [11]: late_2018
Out[11]: datetime.datetime(2018, 10, 9, 16, 27, 4, 143679)

In [12]: late_2019
Out[12]: datetime.datetime(2019, 10, 9, 16, 27, 12, 221254)

In [9]: get_location(late_2019) # should be location_one
Out[9]: 'here_now'

In [10]: get_location(late_2018) # should be location_two
Out[10]: 'location_one'

My past block specifies:

past=[('location_one', '2019-05-15'), ('location_two', '2018-09-01')])

So, in english:

  • I moved to 'location_two' on 2018-09-01
  • I moved to 'location_one' on 2019-05-15

So any time between 2018-09-01 and 2019-05-15 (which late_2018 is) should return location_two, but right now it returns location_one

I may be misunderstanding something, dates always confuse me.

@karlicoss
Copy link
Owner Author

karlicoss commented Oct 10, 2020

Ah. past=[('location_one', '2019-05-15'), ('location_two', '2018-09-01')]) for that, my interpretation would be

  • moved away from location_two on 2018-09-01
  • moved away from location_one on 2019-05-15
  • since then, was in current

I guess it's a little counter-intuitive, but it's the only 'natural' way of interpreting past + current without duplicating any information.
My rationale was that there is a value in letting the user only specify the current location to make the config simpler in they only want meaningful current location. Otherwise, there could just be a single list of tuples, interpreted the way you did (moved to), and I guess the last date in the tuple would be your birthday, or some arbitrary old date. What do you think?

IMO it makes it easier to think if you sort the entires and squash together entries: ('location_two', '2018-09-01') ('location_one', '2019-05-15') current. Any location now is 'sandwitched' between two dates, which can be treated as 'moved to/moved from'. The only two exceptions are the first one (left is missing, which is sort of 'always lived there'), and the last one (right is missing, which is sort of 'live there now').
Maybe it's a meaningful way to represent this in the config? E.g. locations = ['location_two', '2018-09-01', 'location_one', '2019-05-15', 'current']. Slightly nasty type-wise, but not too bad with a runtime check that the types alternate.

@seanbreckenridge
Copy link
Contributor

seanbreckenridge commented Oct 11, 2020

I guess it's a little counter-intuitive, but it's the only 'natural' way of interpreting past + current without duplicating any information.

I agree that this is technically the best way to represent it, but if I asked someone to write down 'places they've lived', they almost always list the dates as when they moved there, not when they moved away.

I don't know if that's some bias I have, or if you think the same way.

locations = ['location_two', '2018-09-01', 'location_one', '2019-05-15', 'current']

I think this makes it more confusing, switching between types. Sort of annoying to document/explain as well

My rationale was that there is a value in letting the user only specify the current location to make the config simpler in they only want meaningful current location

IMO it makes it easier to think if you sort the entries and squash together entries

What about listing it like that instead?

I think it is nice to have the current there, but maybe it shouldn't be included, because a lot of the confusion with whether or not to duplicate results/dates are the start/end come from there.

Could instead have an interface like:

class location:
    home = [
        ((42.342, 120.13), '....199X'), # i.e. when you were born, first house you want to list
        ((...,...), 2015-...) # moved to this place
        (43.934, 110.43), 2019-..) # moved to this place in 2019
    ]

The last item in that list is your current location.

That way there's no duplication.

If the user doesn't want to specify multiple items, but just their current location, they could instead do:

class location:
    home = (43.324, 120.32) 

and that would be treated as current is now.

Something like

LatLon = Tuple[float, float]
DateIsh = Union[datetime, str]
LocationEntry = Tuple[LatLon, DateIsh]
class location(user_config):
    # either just one LatLon (i.e. 'current'), or a list with dates of when you moved there
    home: Union[Sequence[LocationEntry], LatLon]

@karlicoss
Copy link
Owner Author

karlicoss commented Oct 11, 2020

Yeah, agree, seems like a very good compromise! Will implement.
A minor thing, but maybe makes sense to swap and make it Tuple[DateIsh, LatLon]? Somehow feels more natural to me, although I might be wrong

@seanbreckenridge
Copy link
Contributor

seanbreckenridge commented Oct 11, 2020

but maybe makes sense to swap and make it Tuple[DateIsh, LatLon]? Somehow feels more natural to me, although I might be wrong

Yeah, I think I agree, sort of like:

'on DateIsh, I moved to LatLon'

DateIsh the PK here, LatLon could be duplicated if you lived in the same place twice.

@karlicoss
Copy link
Owner Author

done! Here's the new format in the test
https://github.com/karlicoss/HPI/pull/99/files#diff-17afdef3985dd7b19cb4ffc9fb86179dR83-R90

@seanbreckenridge
Copy link
Contributor

seanbreckenridge commented Oct 21, 2020

Updates related to this; (changes here)

Did the isinstance check on the google data, worked fine:

In [4]: list(via_location._iter_local_dates())[-5:]
Out[4]:
[DayWithZone(day=datetime.date(2017, 12, 12), zone='America/Los_Angeles'),
 DayWithZone(day=datetime.date(2017, 12, 12), zone='America/Los_Angeles'),
 DayWithZone(day=datetime.date(2017, 12, 12), zone='America/Los_Angeles'),
 DayWithZone(day=datetime.date(2017, 12, 12), zone='America/Los_Angeles'),
 DayWithZone(day=datetime.date(2017, 12, 12), zone='America/Los_Angeles')]

Also created a basic ip provider in my.location, ipgeocache returns both the location and the timezone, so its easily added to _iter_local_dates.

Have started using gpslogger, will write a module for that soon.

Think it'd be nice if this was more flexible, accepting any date, not just naive/UTC ones.

A lot of my datetimes are 'by default' serialized from epoch time (what they're stored as) to UTC, so I don't confuse the naive datetimes with PST times, so all of my data has tz_info.

Not sure if you want to change that function, or add another one, which:

  • if dt.tz_info is not None (is timezone aware), convert form that timezone to the timezone specified by _get_tz
  • else, assume naive, (maybe send warning? or at least a debug log; perhaps change logger in module to 'info' if thats too spammy) and use UTC.

Not sure what your view on logs is, I don't like them showing up in hpi doctor if I'm not expecting them, so I've monkey patched an environment variable. I typically do HPI_LOGS=debug hpi doctor if I want to see logs, else they're all set to warning typically.

@karlicoss
Copy link
Owner Author

karlicoss commented Oct 21, 2020

Did the isinstance check on the google data, worked fin

Nice! Also glad you managed to plug in ipgeocache

Think it'd be nice if this was more flexible, accepting any date, not just naive/UTC ones.

Yeah, I wasn't really sure what would make sense at first, but now that I played with it a bit more, feels like it's getting in the way more often than not, so a hard assert is definitely too annoying. Perhaps it could be configurable (i.e. assert/warn & localize/warn & keep the timezone), but silently converting the timezone makes the most sense.

else, assume naive and use UTC

Not sure if I got what you mean here? The way it's implemented now is always assuming naive (i.e. 'local') timestamp, and attaching the timezone then. What I'm not sure about the function is this cause, which could indeed benefit from assuming UTC: https://github.com/seanbreckenridge/HPI/blob/14a5dabe73bd734c6da3e3e371e67aef75ce3432/my/time/tz/via_location.py#L163-L164
Actually the way it's implemented now, _get_tz would never return None because of the fallbacks in the config.. Maybe it's a reasonable assumption and best to just rely on it.

I don't like them showing up in hpi doctor if I'm not expecting them,

Yep, agree, it's a bit spammy for me at times as well! The main reason I like them is because it gives a sense of progress. I was also thinking of sort of 'progress bar' via tqdm with a custom logging handler hooked to it. So you get both a sense of progress, but at the same time it's only taking a single line of screen space. Overall just need to give it a bit of a think how to make it more consistent. An env variable to toggle debug logs definitely makes sense, agree! Personally, for me perhaps the only non-warning stuff I actually look at often is the log from cachew, i.e. whether it's hit the cache or recomputing. Maybe it could also be controlled with a separate env/config variable.

@seanbreckenridge
Copy link
Contributor

else, assume naive and use UTC

Oh, yeah, never mind, I think this was just me being confused a bit.

If it receives some naive timezone, it should assume its local to the timezone you were in at the time, so pytz timezone.localize should handle that properly.

Think it would be nice to put a comment here explaining:

If naive, assume you were in timezone specified by _get_tz
If aware, convert from tz_info to timezone returned by _get_tz

Maybe it could also be controlled with a separate env/config variable.

Yeah, I think cachew is a majority of the spam/logs when I enable HPI_LOGS=debug, so would be nice if that was configurable as well

For reference, here are my current logs when using HPI_LOGS=debug

sometimes I have to change some external logger after its been imported to respect HPI_LOGS as well, not sure how progress bars with an unknown amount of logs/logs from an external module.

@seanbreckenridge
Copy link
Contributor

Have continued to use gpslogger, seems to work great; Created my.location.gpslogger to parse the .gpx (XML) files.

@karlicoss
Copy link
Owner Author

Oh nice! I wonder if you can also use some library for gpx files... less code and might be useful for other data providers!

@seanbreckenridge
Copy link
Contributor

Ah, for some reason I thought this was GPSLogger specific, Will try optionally importing this and parsing with that, falling back to basic XML parsing if that isnt installed

seanbreckenridge added a commit to seanbreckenridge/HPI that referenced this pull request Oct 30, 2020
@karlicoss
Copy link
Owner Author

Damn, we discussed a lot of stuff in this pull request 😂

Yep, agree, it's a bit spammy for me at times as well! The main reason I like them is because it gives a sense of progress. I was also thinking of sort of 'progress bar' via tqdm with a custom logging handler hooked to it. So you get both a sense of progress, but at the same time it's only taking a single line of screen space

Actually gave it a go in Promnesia karlicoss/promnesia@01ab844

Looks kinda cool: https://twitter.com/karlicoss/status/1323122289517408259, but have a feeling that properly doing that, so it doesn't break for random people would be a lot of work, so not really worth it...
Although tqdm is portable and should 'just work', plus its progressbar might map nicely on what hpi doctor is doing (i.e. going through iterables). But perhaps as a separate mode + wouldn't want to make it a hard dependency anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants