Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Implement LoadContext to handle multiple instances #209

Merged
merged 18 commits into from
Nov 14, 2022

Conversation

E-Aho
Copy link
Collaborator

@E-Aho E-Aho commented Nov 4, 2022

Fixes #206.

PR to add ability to persist a single instance that is referenced in multiple places.

Notes:

  1. Currently, this does save the state of the object multiple times, but only initialises once. This is done due to the issue discussed in Add Method type in dispatch calls #195 around not knowing which instance will be loaded first, particularly in complex cases with interwoven dependencies.

  2. At the moment, I've only added __id__ in the places needed to solve the xfail test, but if others are happy with how this looks I will expand this to other get_state_* methods :) This is now global, done at the high level get_state and get_instance functions so works for all types.


Regarding 1: I looked into any ways to store DAGs in JSON and bumped into a relatively new addition to JSON syntax, JSON-LD, which also happens to have a Python implementation. I feel like if we want to only save these objects once, we would need some kind of DAG implementation to hold dependencies. This, however, would need a major rework and feels like it might be overkill for the time being, but it's worth holding in mind.

@BenjaminBossan
Copy link
Collaborator

I think this looks pretty good, like the simplest way to solve the problem.

if others are happy with how this looks I will expand this to other get_state_* methods

This would mean that we need to add something like this everywhere, right?

def get_instance_foo(state, load_state):
    saved_id = state.get("__id__")
    if saved_id and saved_id in load_state.memo:
        # an instance has already been loaded, just return the loaded instance
        return load_state.get_instance(saved_id)
    ...
    if saved_id:
        load_state.memoize(loaded_obj, saved_id)
    return loaded_obj

So is it something we want to move into a decorator to avoid the boilerplate?

  1. Currently, this does save the state of the object multiple times

Let's leave it like this for now and bother about the problem later.

I looked into any ways to store DAGs in JSON and bumped into a relatively new addition to JSON syntax, JSON-LD,

I only took a glance but I hope we can manage to solve it without such a solution.

@E-Aho
Copy link
Collaborator Author

E-Aho commented Nov 7, 2022

This would mean that we need to add something like this everywhere, right?

Thankfully no, because of the way I structured it, the memo check is at the top level get_instance function, before it gets sent to the specific get_instance_foo for whatever type that is, so it will be able to handle any memoized things with minimal boilerplate.

We could switch to a decorator pattern, but I felt it would be easier to just pass __id__ as one of the state parameters for any types we want to persist as a singular instance. So, to implement elsewhere, you'd just need to add res["__id__"] = id(obj) in get_state_foo.

I only took a glance but I hope we can manage to solve it without such a solution.

I hope there's a nice solution I've missed too, but it feels like once you start getting into edge cases with nested dependencies and such, you do need some kind of DAG implementation...

@BenjaminBossan
Copy link
Collaborator

Thankfully no, because of the way I structured it, the memo check is at the top level get_instance function

Ah yes, nice, I missed that.

you'd just need to add res["__id__"] = id(obj) in get_state_foo.

Okay, this might be something we would want to move out of the get_state_foo functions, since it's easy to forget and not statically checked.

I hope there's a nice solution I've missed too, but it feels like once you start getting into edge cases with nested dependencies and such, you do need some kind of DAG implementation...

Hmm, not sure, let's see. Maybe we can learn how pickle implements it and copy that approach.

@E-Aho
Copy link
Collaborator Author

E-Aho commented Nov 7, 2022

I've added in a decorator that can be used to wrap get_state functions that we want to be able to persist as single instances.

I initially tried having it in the high level get_state function, but ran into issues I think were being caused with IDs being garbage collected and reused during persist for things which were not equivalent.

@BenjaminBossan
Copy link
Collaborator

I initially tried having it in the high level get_state function, but ran into issues I think were being caused with IDs being garbage collected and reused during persist for things which were not equivalent.

Interesting, do you still have the code for this? It's not clear to me why moving to a decorator solves this.

Assuming that code looks similar to what you have now:

        result = func(obj, save_state)
        result["__id__"] = id(obj)

did you try if this works instead?

        __id__ = save_state.memoize(obj)
        result = func(obj, save_state)
        result["__id__"] = __id__

@E-Aho
Copy link
Collaborator Author

E-Aho commented Nov 8, 2022

Interesting, do you still have the code for this? It's not clear to me why moving to a decorator solves this.

What worked was less just moving to a decorator, more not persisting certain types with an id. When we start to persist __id__ for:

  • get_state_dict
  • get_state_list
  • get_state_ndarray

Some tests start failing in a flaky way (sometimes they would work fine on reruns). I jumped into the debugger and saw that objects that shouldn't have the same id did, and it seemed to be objects that were being created during get_state (dicts, lists etc).

My intuition was that CPython might be garbage collecting these, then reusing the memory address, giving the same id.

I'll play with what you suggested tonight, and might try using a more unique ID for objects as well :)

@BenjaminBossan
Copy link
Collaborator

My intuition was that CPython might be garbage collecting these, then reusing the memory address, giving the same id.

I think this is the most likely explanation, which is why I suggested to memoize the object first. This should prevent gc and make the id stable.

@E-Aho
Copy link
Collaborator Author

E-Aho commented Nov 8, 2022

I suppose either method would work, but memoizing every object would mean basically never garbage collecting,

My intuition was that CPython might be garbage collecting these, then reusing the memory address, giving the same id.

I think this is the most likely explanation, which is why I suggested to memoize the object first. This should prevent gc and make the id stable.

That should work, although it would also mean holding a lot more things in memory at all times, which might give a minor performance hit. If we don't need to actually hold the object, just get a unique id, it might be easier to just use a UUID, but I'm happy to just memoize the objects as well.

@BenjaminBossan
Copy link
Collaborator

I suppose either method would work, but memoizing every object would mean basically never garbage collecting,

The memory is cleared once dumping is over, so gc should work after that. But it's true that during the dumping process, we could potentially see a memory spike, though I assume that most objects are held in memory anyway. If I'm not missing something, It should only really matter for attributes that are generated on the fly and require a lot of memory.

@E-Aho
Copy link
Collaborator Author

E-Aho commented Nov 8, 2022

I suppose either method would work, but memoizing every object would mean basically never garbage collecting,

The memory is cleared once dumping is over, so gc should work after that. But it's true that during the dumping process, we could potentially see a memory spike, though I assume that most objects are held in memory anyway. If I'm not missing something, It should only really matter for attributes that are generated on the fly and require a lot of memory.

No that's true, I'm overthinking it and these on-the-fly objects being held in memory should be a negligable memory hit.

I'll rework it this eve and check if it helps solve the flaky tests

@E-Aho
Copy link
Collaborator Author

E-Aho commented Nov 8, 2022

Memoizing seems to have solved the problem without needing any UUID rework, good call @BenjaminBossan!

I've added a test that checks it works for a few different object types, let me know if there's anything else you think would be worth testing!

@E-Aho
Copy link
Collaborator Author

E-Aho commented Nov 8, 2022

RE: CodeCov, as far as I can tell from the report, this PR is at 100% coverage, and just failing on an "indirect coverage change", so I think the CodeCov/project check failing is just noise, but let me know if this isn't the case

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good to me, I just have some minor comments/questions.

Regarding the naming, it's a bit unfortunate that we now have state and LoadState. When adding SaveState, I didn't think about this issue. Maybe we can find a better name (for both for consistency)?

skops/io/tests/test_persist.py Outdated Show resolved Hide resolved
skops/io/_dispatch.py Outdated Show resolved Hide resolved
@E-Aho
Copy link
Collaborator Author

E-Aho commented Nov 9, 2022

On the names, how do you feel about changing them to SaveContext and LoadContext @BenjaminBossan?

@BenjaminBossan
Copy link
Collaborator

On the names, how do you feel about changing them to SaveContext and LoadContext @BenjaminBossan?

Yes, I prefer that. Pinging @adrinjalali wdyt?

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot, fantastic work.

@adrinjalali would you please also give this a look?

@E-Aho E-Aho changed the title FIX Implement LoadState to handle multiple instances FIX Implement LoadContext to handle multiple instances Nov 10, 2022
@E-Aho E-Aho force-pushed the FIX-multiple-instance-persistance branch from aeb5b12 to df3848a Compare November 10, 2022 16:39
Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the minor comments, looks brilliant.

skops/io/tests/test_persist.py Show resolved Hide resolved
skops/io/_utils.py Outdated Show resolved Hide resolved
@E-Aho
Copy link
Collaborator Author

E-Aho commented Nov 11, 2022

Thanks for the PR reviews guys ❤️ I think it should be ready to merge in!

@adrinjalali adrinjalali changed the title FIX Implement LoadContext to handle multiple instances ENH Implement LoadContext to handle multiple instances Nov 14, 2022
@adrinjalali adrinjalali merged commit 80be9ce into skops-dev:main Nov 14, 2022
@E-Aho E-Aho deleted the FIX-multiple-instance-persistance branch November 14, 2022 12:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Single object referenced in multiple places doesn't load as single instance
3 participants