-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
my.reddit: refactor into module that supports pushshift/gdpr #179
Conversation
With no pushshift installed, and then with pushshift: $ # for backwards compatability
[ ~/Repos/HPI-fork | reddit-merge ] $ hpi query my.reddit.comments | jq length
1020
$ # using .all
[ ~/Repos/HPI-fork | reddit-merge ] $ hpi query my.reddit.all.comments | jq length
/home/sean/Repos/HPI-fork/my/core/source.py:24: UserWarning: Module _pushshift_comments.<locals>.<lambda> could not be imported, or isn't configured propertly
warn(f"Module {factory.__qualname__} could not be imported, or isn't configured propertly")
1020
[ ~/Repos/HPI-fork | reddit-merge ] $ hpi module install my.reddit.pushshift
Running: /usr/bin/python3 -m pip install git+https://github.com/seanbreckenridge/pushshift_comment_export
...
Successfully built pushshift-comment-export
Installing collected packages: pushshift-comment-export
Successfully installed pushshift-comment-export-0.1.2
[ ~/Repos/HPI-fork | reddit-merge ] $ hpi query my.reddit.all.comments | jq length
4914 |
I don't love that these are lambdas, but it does mean that this is all typed Also, the other thing I wasn't sure on was how to make the nested config backwards compatible |
As for as Does mean its more verbose (would need to add another argument like I cant personally think of anything that balances the 'easy to add another source', and is mypy friendly without being this verbose... Though to be fair, reddit is a more complicated module than normal; not a lot of sources are merging three different sources with different amounts of data per source |
Ah, this is quite annoying. Even if you import
Maybe thats OK and it'd motivate people to actually switch, not sure if theres a workaround for the time while transitioning |
ok, I've tinkered a bit and came up with some madness :) however the source line where the module gets imported is available... so
and...
Pretty hacky, but considering we'll want to migrate other modules to proper namespace packages eventually, might worth it? Could make it generic later. Could also wrap it all in try/catch just to be super paranoid -- worst case it will just be an extra warning |
Overall looks great thanks! Up to you if you wanna add gdpr in this PR or in a separate one; I'm off to bed for now, will check in the morning :) |
Sounds good, still working on a few things Im not happy with; will let you know/convert if from a draft PR when I think its ready |
Can't believe using a decorator to handle wrapping the import didn't click in my head -- though is still casting |
will continue to work on this over the weekend probably, fixed most of the issues you brought up, have a couple other things I want to implement |
When the package isn't installed and nothing is in your config as far as
When not installed but suppressed in the user config, it doesn't send any warnings:
(When installed, it just works as normal with no warnings) |
TODO:
GDPR would all be in this module so it wouldn't add any complications as far as imports go -- once I'm done with the above this can be merged in, and will add |
Actually, accesses to Edit: adding the |
Also moved the As far as changes to promnesia, I think this should be fine? diff --git a/src/promnesia/sources/reddit.py b/src/promnesia/sources/reddit.py
index 01ab6d0..28f1f35 100644
--- a/src/promnesia/sources/reddit.py
+++ b/src/promnesia/sources/reddit.py
@@ -10,7 +10,7 @@ from ..common import Visit, Loc, extract_urls, Results, logger
def index(*, render_markdown: bool = False, renderer: Optional['RedditRenderer'] = None) -> Results:
from . import hpi
- from my.reddit import submissions, comments, saved, upvoted
+ from my.reddit.all import submissions, comments, saved, upvoted
if renderer is not None:
assert callable(renderer), f"{renderer} is not a callable (should be a subclass of RedditRenderer)"
@@ -164,20 +164,6 @@ class RedditRenderer:
# todo hmm, could do similar stuff in HPI?
import typing
if typing.TYPE_CHECKING:
- from my.reddit import Submission, Comment, Save, Upvote
+ # These are Protocol classes defined in HPI
+ from my.reddit.common import Submission, Comment, Save, Upvote, RedditThing
-
-from datetime import datetime
-if typing.TYPE_CHECKING:
- # TODO define in rexport?
- from typing_extensions import Protocol
- from typing import Dict, Any
- class RedditThing(Protocol):
- @property
- def raw(self) -> Dict[str, Any]: ...
- @property
- def url(self) -> str: ...
- @property
- def created(self) -> datetime: ...
- @property
- def text(self) -> str: ...
- Do you think I should make those imports safe and import the old stuff if it fails? Like:
I guess we can do that for now and add a warning, then deprecate it in a few months? Does create a couple typing issues supporting both the old and new style, but I suppose I can just |
Unsure how promnesia tests against this, maybe have to do a pypi release after this is merged before the PR there would succeed (havent made it yet, commit is here) This is done as far as this PR goes @karlicoss; feel free to review. Will do GDPR at some point in the future |
yeah, I guess makes sense to move here! Maybe a good opportunity to rename too while we're at it, maybe also perhaps other protocols in
yeah, good idea! I'd probably also check against the exception message (since ImportError is a bit broad and might make the error misleading)
Yeah.. at the moment it's running against pypi version of HPI. When I want to test against some HPI change, I usually mess with these lines https://github.com/karlicoss/promnesia/blob/281db5dfc35a3cb0f563686e1838088a621f69c2/setup.py#L84-L87 I think ideally would be nice to have promnesia CI running both against HPI in pypi and HPI from master, but so far haven't set it up. |
I would've thought so yeah, but when I tried it ended up with: diff --git a/my/reddit/common.py b/my/reddit/common.py
index 46634ca..ff4a0f2 100644
--- a/my/reddit/common.py
+++ b/my/reddit/common.py
@@ -21,7 +21,7 @@ else:
# common fields across all the Protocol classes, so generic code can be written
-class RedditThing(Protocol):
+class RedditBase(Protocol):
@property
def raw(self) -> Json: ...
@property
@@ -35,62 +35,23 @@ class RedditThing(Protocol):
# Note: doesn't include GDPR Save's since they don't have the same metadata
-class Save(Protocol):
- @property
- def raw(self) -> Json: ...
- @property
- def created(self) -> datetime_aware: ...
- @property
- def id(self) -> str: ...
- @property
- def url(self) -> str: ...
- @property
- def text(self) -> str: ...
+class Save(RedditBase):
@property
def subreddit(self) -> str: ...
# Note: doesn't include GDPR Upvote's since they don't have the same metadata
-class Upvote(Protocol):
- @property
- def raw(self) -> Json: ...
- @property
- def id(self) -> str: ...
- @property
- def created(self) -> datetime_aware: ...
- @property
- def url(self) -> str: ...
- @property
- def text(self) -> str: ...
+class Upvote(RedditBase):
@property
def title(self) -> str: ...
# From rexport, pushshift and the reddit GDPR export
-class Comment(Protocol):
- @property
- def raw(self) -> Json: ...
- @property
- def id(self) -> str: ...
- @property
- def created(self) -> datetime_aware: ...
- @property
- def url(self) -> str: ...
- @property
- def text(self) -> str: ...
+class Comment(RedditBase):
+ pass
# From rexport and the GDPR export
-class Submission(Protocol):
- @property
- def raw(self) -> Json: ...
- @property
- def id(self) -> str: ...
- @property
- def created(self) -> datetime_aware: ...
- @property
- def url(self) -> str: ...
- @property
- def text(self) -> str: ...
+class Submission(RedditBase):
@property
def title(self) -> str: ...
For whatever sub-protocols (subclasses?) don't work in that way, which after some digging seems to be part of the rejected ideas on the PEP Will rename it to
Ok, will bring this up in the promnesia PR when I create it |
In relation to the no-subclassing, I also thought and tried abstract base classes, but that isn't as extendible when it comes to new sources, and requires that someone subclass the class specifically -- goes a bit against the extensibility argument |
Could perhaps also be somewhere where typeguards could be used? (proably requires a backported type like |
huh, indeed. However this suggests it should work if you inherit subprotocols from
^ this doesn't work:
but if you change B to:
it passes! |
Yep, that does seem to work, and works in promnesia as well when it expects the |
nice, thanks so much! I'm happy to merge, but it says there are conflicts, can you rebase? |
2605dfe
to
53e3099
Compare
ah, was because of the macos ci stuff not used to rebasing so hopefully that was right. feel free to squash this all |
Ah, I don't know how this wasn't caught by mypy -- I should've actually run this
when actually running it can't do that Should I make a PR to revert that? |
hmm one issue I noticed, in my hpi config I have this
(I'm keeping all this legacy garbage to 'dogfood' config migrations) So currently it stumbles over trying to treat |
Is |
yeah, hopefully it would mean that they use an old config, so can just assume else branch? |
this was a hacked together reddit file that combined pushshift and rexport, but that has since been merged into karlicoss' HPI modules karlicoss/HPI#179
keeping it backwards compatible + conditional warning similar to #179 follow up for seanbreckenridge/HPI#18 for now without the __path__ hacking, will do it in bulk later too lazy to run test_import_warnings.sh on CI for now, but figured I'd commit it for the reference anyway
keeping it backwards compatible + conditional warning similar to #179 follow up for seanbreckenridge/HPI#18 for now without the __path__ hacking, will do it in bulk later too lazy to run test_import_warnings.sh on CI for now, but figured I'd commit it for the reference anyway
keeping it backwards compatible + conditional warning similar to #179 follow up for seanbreckenridge/HPI#18 for now without the __path__ hacking, will do it in bulk later too lazy to run test_import_warnings.sh on CI for now, but figured I'd commit it for the reference anyway
not complete yet, plan to add
reddit.gdpr
and merge that in with the other data sources here
However, I've made lots of decisions on the
structure of this already so just wanted
to push in case anything here is too off
base before I continue working on this
Related issue with lots of discussion
on how to implement this; #89
Related to this now being a non-namespace package
this is something we discussed a bit here; in summary:
In order to keep this backwards compatible,
__init__.py
is justfrom .rexport import *
, so this should work with anything usingmy/reddit.py
currently. If someone wants gdpr/pushshift, they have to opt in by changing imports
to
from my.reddit.all
and updating their config. In the future (once promnesia has supportedmy.reddit.all
for a while, probably a few months at least), the__init__.py
can be removedso that this turns into a namespace package again