Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

my.reddit: refactor into module that supports pushshift/gdpr #179

Merged
merged 20 commits into from
Oct 31, 2021

Conversation

seanbreckenridge
Copy link
Contributor

@seanbreckenridge seanbreckenridge commented Oct 28, 2021

not complete yet, plan to add reddit.gdpr
and merge that in with the other data sources here

However, I've made lots of decisions on the
structure of this already so just wanted
to push in case anything here is too off
base before I continue working on this

Related issue with lots of discussion
on how to implement this; #89

Related to this now being a non-namespace package
this is something we discussed a bit here; in summary:

In order to keep this backwards compatible, __init__.py is just
from .rexport import *, so this should work with anything using my/reddit.py
currently. If someone wants gdpr/pushshift, they have to opt in by changing imports
to from my.reddit.all and updating their config. In the future (once promnesia has supported
my.reddit.all for a while, probably a few months at least), the __init__.py can be removed
so that this turns into a namespace package again

@seanbreckenridge
Copy link
Contributor Author

seanbreckenridge commented Oct 28, 2021

With no pushshift installed, and then with pushshift:

$ # for backwards compatability
[ ~/Repos/HPI-fork | reddit-merge ] $ hpi query my.reddit.comments | jq length
1020
$ # using .all
[ ~/Repos/HPI-fork | reddit-merge ] $ hpi query my.reddit.all.comments | jq length
/home/sean/Repos/HPI-fork/my/core/source.py:24: UserWarning: Module _pushshift_comments.<locals>.<lambda> could not be imported, or isn't configured propertly
  warn(f"Module {factory.__qualname__} could not be imported, or isn't configured propertly")
1020
[ ~/Repos/HPI-fork | reddit-merge ] $ hpi module install my.reddit.pushshift
Running: /usr/bin/python3 -m pip install git+https://github.com/seanbreckenridge/pushshift_comment_export
...
Successfully built pushshift-comment-export
Installing collected packages: pushshift-comment-export
Successfully installed pushshift-comment-export-0.1.2
[ ~/Repos/HPI-fork | reddit-merge ] $ hpi query my.reddit.all.comments | jq length
4914

@seanbreckenridge
Copy link
Contributor Author

I don't love that these are lambdas, but it does mean that this is all typed

Also, the other thing I wasn't sure on was how to make the nested config backwards compatible

@seanbreckenridge seanbreckenridge changed the title initial pushshift/rexport merge implementation my.reddit: refactor into module that supports pushshift/gdpr Oct 28, 2021
@seanbreckenridge
Copy link
Contributor Author

seanbreckenridge commented Oct 28, 2021

As for as import_source_iter goes, if we want to supress the warning if the user has it in core.disabled_modules, would need to pass the module name since them being functions makes the imports not traceable to the module.

Does mean its more verbose (would need to add another argument like module_name), but I'm also not totally attached to this method; feel free to suggest something else

I cant personally think of anything that balances the 'easy to add another source', and is mypy friendly without being this verbose... Though to be fair, reddit is a more complicated module than normal; not a lot of sources are merging three different sources with different amounts of data per source

my/reddit/common.py Outdated Show resolved Hide resolved
my/reddit/common.py Outdated Show resolved Hide resolved
my/reddit/common.py Outdated Show resolved Hide resolved
my/reddit/rexport.py Outdated Show resolved Hide resolved
my/reddit/all.py Outdated Show resolved Hide resolved
@seanbreckenridge
Copy link
Contributor Author

Ah, this is quite annoying. Even if you import my.reddit.all, since the __init__.py file runs, it sends the warning anyways:

import my.reddit.all
/home/sean/Repos/HPI-fork/my/reddit/__init__.py:21: UserWarning: DEPRECATED! Instead of my.reddit, import from my.reddit.all instead.
  W.high("DEPRECATED! Instead of my.reddit, import from my.reddit.all instead.")

Maybe thats OK and it'd motivate people to actually switch, not sure if theres a workaround for the time while transitioning

@karlicoss
Copy link
Owner

karlicoss commented Oct 29, 2021

Even if you import my.reddit.all, since the init.py file runs, it sends the warning anyways

ok, I've tinkered a bit and came up with some madness :)
first tried going up the stack to extract the top level module from importlib internals, but it seems that it's not really available (neither in pdb nor in traceback objects)

however the source line where the module gets imported is available... so

./good.py
import my.reddit.all

./bad.py
import my.reddit

./my/reddit/all.py
print("all.py")

./my/reddit/__init__.py
print("importing __init__")

# search for offending import in source code...
import traceback
warn = False
for f in traceback.extract_stack():
    line = f.line or '' # just in case it's None, who knows..
    # maybe best to use a regex here?
    # need to carefylly handle stuff like import my.reddit.rexport
    if 'import my.reddit ' in (line + ' '):
        warn = True
    if 'from my import reddit' in line:
        warn = True
print("will warn: ", warn)

and...

$ python3 good.py
importing __init__
will warn:  False
all.py
$ python3 bad.py
importing __init__
will warn:  True

Pretty hacky, but considering we'll want to migrate other modules to proper namespace packages eventually, might worth it? Could make it generic later. Could also wrap it all in try/catch just to be super paranoid -- worst case it will just be an extra warning

@karlicoss
Copy link
Owner

Overall looks great thanks! Up to you if you wanna add gdpr in this PR or in a separate one; I'm off to bed for now, will check in the morning :)

@seanbreckenridge
Copy link
Contributor Author

seanbreckenridge commented Oct 29, 2021

Sounds good, still working on a few things Im not happy with; will let you know/convert if from a draft PR when I think its ready

@seanbreckenridge
Copy link
Contributor Author

Can't believe using a decorator to handle wrapping the import didn't click in my head -- though is still casting Any I believe, at least it looks reasonable in the all.py file

@seanbreckenridge
Copy link
Contributor Author

will continue to work on this over the weekend probably, fixed most of the issues you brought up, have a couple other things I want to implement

@seanbreckenridge
Copy link
Contributor Author

When the package isn't installed and nothing is in your config as far as core.disabled_modules

  /home/sean/Repos/HPI-fork/my/core/source.py:55: UserWarning: Module my.reddit.pushshift (_pushshift_comments) could not be imported, or isn't configured propertly
  To hide this message, add my.reddit.pushshift to your core config disabled_classes, like:

  class core:
      disabled_modules = ['my.reddit.pushshift']

When not installed but suppressed in the user config, it doesn't send any warnings:

hpi doctor --skip-config-check my.reddit.all
✅ OK  : my.reddit.all
✅     - stats: {'saved': {'count': 51, 'last': datetime.datetime(2016, 9, 26, 0, 13, 19, tzinfo=<UTC>)}, 'comments': {'count': 1020}, 'submissions': {'count': 39}, 'upvoted': {'count': 923}}

(When installed, it just works as normal with no warnings)

@seanbreckenridge
Copy link
Contributor Author

seanbreckenridge commented Oct 29, 2021

TODO:

  • deprecation warning when importing from my.reddit
  • Create instructions to migrate (may already be handled by the warnings though)
  • Update references of my.reddit across the docs to reflect the new structure
  • Create PR in promnesia to import from my.reddit.all, falling back on my.reddit in the case that that causes a ImportError

GDPR would all be in this module so it wouldn't add any complications as far as imports go -- once I'm done with the above this can be merged in, and will add reddit.gdpr at some later date

@seanbreckenridge
Copy link
Contributor Author

seanbreckenridge commented Oct 29, 2021

Actually, accesses to .raw here don't work with the protocol defined in my.reddit.common (since that doesn't have raw property classes, since will eventually use GDPR as source as well which has nothing like raw) -- I think it'd be better to move that logic to HPI (I also have it duplicated a bit here, could consolidate it into a helper here instead)

Edit: adding the raw property seems to be a fine compromise

@seanbreckenridge
Copy link
Contributor Author

seanbreckenridge commented Oct 29, 2021

Also moved the RedditThing from promnesia to here, since theres a bunch of Protocol's here anyways

As far as changes to promnesia, I think this should be fine?

diff --git a/src/promnesia/sources/reddit.py b/src/promnesia/sources/reddit.py
index 01ab6d0..28f1f35 100644
--- a/src/promnesia/sources/reddit.py
+++ b/src/promnesia/sources/reddit.py
@@ -10,7 +10,7 @@ from ..common import Visit, Loc, extract_urls, Results, logger
 
 def index(*, render_markdown: bool = False, renderer: Optional['RedditRenderer'] = None) -> Results:
     from . import hpi
-    from my.reddit import submissions, comments, saved, upvoted
+    from my.reddit.all import submissions, comments, saved, upvoted
 
     if renderer is not None:
         assert callable(renderer), f"{renderer} is not a callable (should be a subclass of RedditRenderer)"
@@ -164,20 +164,6 @@ class RedditRenderer:
 # todo hmm, could do similar stuff in HPI?
 import typing
 if typing.TYPE_CHECKING:
-    from my.reddit import Submission, Comment, Save, Upvote
+    # These are Protocol classes defined in HPI
+    from my.reddit.common import Submission, Comment, Save, Upvote, RedditThing
 
-
-from datetime import datetime
-if typing.TYPE_CHECKING:
-    # TODO define in rexport?
-    from typing_extensions import Protocol
-    from typing import Dict, Any
-    class RedditThing(Protocol):
-        @property
-        def raw(self) -> Dict[str, Any]: ...
-        @property
-        def url(self) -> str: ...
-        @property
-        def created(self) -> datetime: ...
-        @property
-        def text(self) -> str: ...
-

Do you think I should make those imports safe and import the old stuff if it fails? Like:

try:
	from my.reddit.all import submissions, comments, saved, upvoted
except ImportErorr:
	from my.reddit import submissions, comments, saved, upvoted  # type: ignore[misc,no-redef]

I guess we can do that for now and add a warning, then deprecate it in a few months?

Does create a couple typing issues supporting both the old and new style, but I suppose I can just # type: ignore those

@seanbreckenridge seanbreckenridge marked this pull request as ready for review October 29, 2021 19:20
@seanbreckenridge
Copy link
Contributor Author

seanbreckenridge commented Oct 29, 2021

Unsure how promnesia tests against this, maybe have to do a pypi release after this is merged before the PR there would succeed (havent made it yet, commit is here)

This is done as far as this PR goes @karlicoss; feel free to review. Will do GDPR at some point in the future

@karlicoss
Copy link
Owner

Also moved the RedditThing from promnesia to here, since theres a bunch of Protocol's here anyways

yeah, I guess makes sense to move here! Maybe a good opportunity to rename too while we're at it, maybe RedditBase or something like that?

also perhaps other protocols in common.py should inherit RedditThing? That should work right, and will result in less duplication?

Do you think I should make those imports safe and import the old stuff if it fails

yeah, good idea! I'd probably also check against the exception message (since ImportError is a bit broad and might make the error misleading)

Unsure how promnesia tests against this, maybe have to do a pypi release after this is merged before the PR

Yeah.. at the moment it's running against pypi version of HPI. When I want to test against some HPI change, I usually mess with these lines https://github.com/karlicoss/promnesia/blob/281db5dfc35a3cb0f563686e1838088a621f69c2/setup.py#L84-L87

I think ideally would be nice to have promnesia CI running both against HPI in pypi and HPI from master, but so far haven't set it up.

@seanbreckenridge
Copy link
Contributor Author

seanbreckenridge commented Oct 31, 2021

also perhaps other protocols in common.py should inherit RedditThing? That should work right, and will result in less duplication?

I would've thought so yeah, but when I tried it ended up with:

diff --git a/my/reddit/common.py b/my/reddit/common.py
index 46634ca..ff4a0f2 100644
--- a/my/reddit/common.py
+++ b/my/reddit/common.py
@@ -21,7 +21,7 @@ else:
 
 
 # common fields across all the Protocol classes, so generic code can be written
-class RedditThing(Protocol):
+class RedditBase(Protocol):
     @property
     def raw(self) -> Json: ...
     @property
@@ -35,62 +35,23 @@ class RedditThing(Protocol):
 
 
 # Note: doesn't include GDPR Save's since they don't have the same metadata
-class Save(Protocol):
-    @property
-    def raw(self) -> Json: ...
-    @property
-    def created(self) -> datetime_aware: ...
-    @property
-    def id(self) -> str: ...
-    @property
-    def url(self) -> str: ...
-    @property
-    def text(self) -> str: ...
+class Save(RedditBase):
     @property
     def subreddit(self) -> str: ...
 
 # Note: doesn't include GDPR Upvote's since they don't have the same metadata
-class Upvote(Protocol):
-    @property
-    def raw(self) -> Json: ...
-    @property
-    def id(self) -> str: ...
-    @property
-    def created(self) -> datetime_aware: ...
-    @property
-    def url(self) -> str: ...
-    @property
-    def text(self) -> str: ...
+class Upvote(RedditBase):
     @property
     def title(self) -> str: ...
 
 
 # From rexport, pushshift and the reddit GDPR export
-class Comment(Protocol):
-    @property
-    def raw(self) -> Json: ...
-    @property
-    def id(self) -> str: ...
-    @property
-    def created(self) -> datetime_aware: ...
-    @property
-    def url(self) -> str: ...
-    @property
-    def text(self) -> str: ...
+class Comment(RedditBase):
+    pass
 
 
 # From rexport and the GDPR export
-class Submission(Protocol):
-    @property
-    def raw(self) -> Json: ...
-    @property
-    def id(self) -> str: ...
-    @property
-    def created(self) -> datetime_aware: ...
-    @property
-    def url(self) -> str: ...
-    @property
-    def text(self) -> str: ...
+class Submission(RedditBase):
     @property
     def title(self) -> str: ...
my/reddit/all.py: note: In function "_rexport_comments":
my/reddit/all.py:18: error: Incompatible types in "yield from" (actual type "rexport.dal.Comment", expected type
"my.reddit.common.Comment")  [misc]
        yield from rexport.comments()
        ^
my/reddit/all.py: note: In function "_rexport_submissions":
my/reddit/all.py:23: error: Incompatible types in "yield from" (actual type "rexport.dal.Submission", expected type
"my.reddit.common.Submission")  [misc]
        yield from rexport.submissions()
        ^
my/reddit/all.py: note: In function "_rexport_saved":
my/reddit/all.py:28: error: Incompatible types in "yield from" (actual type "rexport.dal.Save", expected type
"my.reddit.common.Save")  [misc]
        yield from rexport.saved()
        ^
my/reddit/all.py: note: In function "_rexport_upvoted":
my/reddit/all.py:33: error: Incompatible types in "yield from" (actual type "rexport.dal.Upvote", expected type
"my.reddit.common.Upvote")  [misc]
        yield from rexport.upvoted()
        ^
my/reddit/all.py: note: In function "_pushshift_comments":
my/reddit/all.py:38: error: Incompatible types in "yield from" (actual type "PComment", expected type "Comment")  [misc]
        yield from pcomments()
        ^
Found 5 errors in 1 file (checked 5 source files)

For whatever sub-protocols (subclasses?) don't work in that way, which after some digging seems to be part of the rejected ideas on the PEP

Will rename it to RedditBase, sure

since ImportError is a bit broad and might make the error misleading

Ok, will bring this up in the promnesia PR when I create it

@seanbreckenridge
Copy link
Contributor Author

In relation to the no-subclassing, I also thought and tried abstract base classes, but that isn't as extendible when it comes to new sources, and requires that someone subclass the class specifically -- goes a bit against the extensibility argument

@seanbreckenridge
Copy link
Contributor Author

seanbreckenridge commented Oct 31, 2021

Could perhaps also be somewhere where typeguards could be used? (proably requires a backported type like typing_extensions though) That is mostly runtime though and would result in a slight performance hit to check if something matches the RedditBase

@karlicoss
Copy link
Owner

For whatever sub-protocols (subclasses?) don't work in that way, which after some digging seems to be part of the rejected ideas on the PEP

huh, indeed. However this suggests it should work if you inherit subprotocols from Protocol?

from typing import Protocol

class AP(Protocol):
    a: int

class BP(AP):
    b: int

class C:
    a = 1
    b = 2

a: AP = C()
b: BP = C()

^ this doesn't work:

a.py:14: error: Incompatible types in assignment (expression has type "C", variable has type "BP")  [assignment]
    b: BP = C()

but if you change B to:

class BP(Protocol, AP):
    b: int

it passes!

@seanbreckenridge
Copy link
Contributor Author

seanbreckenridge commented Oct 31, 2021

Yep, that does seem to work, and works in promnesia as well when it expects the RedditBase. Great stuff

@karlicoss
Copy link
Owner

nice, thanks so much! I'm happy to merge, but it says there are conflicts, can you rebase?

@seanbreckenridge
Copy link
Contributor Author

seanbreckenridge commented Oct 31, 2021

ah, was because of the macos ci stuff

not used to rebasing so hopefully that was right. feel free to squash this all

@karlicoss karlicoss merged commit 8422c6e into karlicoss:master Oct 31, 2021
@seanbreckenridge
Copy link
Contributor Author

seanbreckenridge commented Oct 31, 2021

Ah, I don't know how this wasn't caught by mypy -- I should've actually run this

~/Repos/HPI-fork/my/reddit/common.py in <module>
     36
     37 # Note: doesn't include GDPR Save's since they don't have the same metadata
---> 38 class Save(Protocol, RedditBase):
     39     @property
     40     def subreddit(self) -> str: ...

TypeError: Cannot create a consistent method resolution
order (MRO) for bases object, RedditBase

when actually running it can't do that

Should I make a PR to revert that?

@karlicoss
Copy link
Owner

hmm one issue I noticed, in my hpi config I have this

class reddit:
    @classproperty
    def export_dir(self):
        return _get_reddit_files()

    unrelated = False
    rexport: Optional[PathIsh] = code / 'rexport'

(I'm keeping all this legacy garbage to 'dogfood' config migrations)

So currently it stumbles over trying to treat rexport as class rexport (this is a remnant of the times when these repositories weren't proper python packages #79 ). I guess other people might have a similar legacy config, so would be good to handle it carefully here? I suppose checking for isinstance(.., (Path, str)) would be good enough?

@seanbreckenridge
Copy link
Contributor Author

Is rexport a path to the repository in that situation? I guess in that case you would still expect export_dir or export_path to have the JSON data files?

@karlicoss
Copy link
Owner

yeah, hopefully it would mean that they use an old config, so can just assume else branch?

@seanbreckenridge seanbreckenridge deleted the reddit-merge branch October 31, 2021 21:34
seanbreckenridge added a commit to seanbreckenridge/HPI that referenced this pull request Oct 31, 2021
this was a hacked together reddit file that combined
pushshift and rexport, but that has since been merged
into karlicoss' HPI modules

karlicoss/HPI#179
karlicoss added a commit that referenced this pull request Feb 2, 2022
keeping it backwards compatible + conditional warning similar to #179

follow up for seanbreckenridge/HPI#18
for now without the __path__ hacking, will do it in bulk later

too lazy to run test_import_warnings.sh on CI for now, but figured I'd commit it for the reference anyway
karlicoss added a commit that referenced this pull request Feb 2, 2022
keeping it backwards compatible + conditional warning similar to #179

follow up for seanbreckenridge/HPI#18
for now without the __path__ hacking, will do it in bulk later

too lazy to run test_import_warnings.sh on CI for now, but figured I'd commit it for the reference anyway
karlicoss added a commit that referenced this pull request Feb 2, 2022
keeping it backwards compatible + conditional warning similar to #179

follow up for seanbreckenridge/HPI#18
for now without the __path__ hacking, will do it in bulk later

too lazy to run test_import_warnings.sh on CI for now, but figured I'd commit it for the reference anyway
karlicoss added a commit that referenced this pull request Feb 5, 2022
karlicoss added a commit that referenced this pull request Feb 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants