Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bigrams omitted from save and load pickle methods #93

Closed
k-dal opened this issue Jul 28, 2021 · 5 comments · Fixed by #96
Closed

Bigrams omitted from save and load pickle methods #93

k-dal opened this issue Jul 28, 2021 · 5 comments · Fixed by #96
Labels
enhancement New feature or request

Comments

@k-dal
Copy link

k-dal commented Jul 28, 2021

I noticed I was getting better results when doing a fresh load of the included dictionary and bigrams_dictionary vs. when I was loading from a previously saved pickle. It turned out to be because self._bigrams data (as well as a number of other params & attributes) weren't being pickled and were therefore missing after loading from pickle. Adding the missing bigrams attribute to the pickling save/load utils resolved the issues I was having.

I'm hardly an NLP expert, so before I waste your time with a pull request I have to ask: Is there a reason some attributes are omitted when pickling?

@mammothb
Copy link
Owner

@k-dal Hi, sorry, I think it's because pickling was implemented before bigrams and I forgot to add it in when implementing bigrams. If you have managed to fix the issue on your end, would you like to submit a pull request so bigrams is included during pickling?

@k-dal
Copy link
Author

k-dal commented Jul 28, 2021

@mammothb Got it, that makes sense (and no need to apologize!).

Since you're open to a pull request, I'm thinking a handful of pickler-related alterations/upgrades might be good to do at the same time:

  1. Add missing attributes to pickle save & load functions. Specifically:
    1. bigrams
    2. below_threshold_words
    3. replaced_words
    4. max_dictionary_edit_distance
    5. prefix_length
    6. count_threshold
  2. An accessory function to raise a warning if load_pickle imports a state where the loaded params conflict with the SymSpell instance's existing params.
    1. Since it looks like the dict attribute entries are heavily influenced by the parameter values (e.g., prefix_length, etc) at time of creation, it seems worthwhile to warn the user if she initializes the class one way but changes parameter states when loading her pickle.
    2. Does this seem sensible from your perspective? I'm hesitant to omit the int params entirely - it feels like important context/info would be missing from the pickle without the params.
  3. A flag to call pickle.dumps() and pickle.loads() instead of the file/stream .dump() and .load() functions.
    1. My motivation here is that strings may be preferable if the pickle is going to be stored in a database rather than written to file.
    2. It may also make things easier when using symspellpy in a serverless environment since you don't always have a persistent file system available.

Anyway, thank you for the quick reply! Let me know what you think.

@mammothb
Copy link
Owner

@k-dal

  1. I think pickling was originally intended to avoid having to parse the dictionary files again as some users have large dictionary files and it takes very long to load those. replaced_words contains the replaced/corrected words while running lookup_compound. I am picturing the following use case:
    The user loads a dictionary file once and would like to skip the creation of dictionary entries and use the same dictionary for multiple projects in the future, perhaps similar to loading model weights for a deep learning model. replaced_words behaves like some sort of output in this case, so I don't think it is necessary include it in the pickle.

  2. I agree, the int params are necessary if we are planning to load additional dictionary files to a previously processed pickle. There is some form of version checking here, perhaps the accessory function could be place around there.

  3. I think the current state of how saving and loading pickle work is partly from the discussion at Too much time for loading large dictionary #31 (comment). If the new save and load functions do not break existing use cases, I am fine adding a flag to call pickle.dumps() and pickle.loads().

@k-dal
Copy link
Author

k-dal commented Jul 28, 2021

@mammothb

  1. Got it, that explanation helps a lot. You're definitely right so I'll plan to include everything except for replaced_words and will make note of that deliberate omission in the docstring(s).

  2. Ah okay, yes, putting it in this area makes sense. I wasn't sure what data_version meant -- since it was hardcoded, I figured it was probably referring to the version of the included .txt dictionary files. Sounds like there should also be some contextual reflection of the "version" when pickled too. Maybe a pickled_at timestamp or optional version_name that to capture some context. Other ideas welcome.

  3. This context helps. I had the same experience with compression causing slow-downs, so now I understand why the compression boolean flag is there. Adding pickle.dumps() and pickle.loads() won't break any existing use cases. We can treat it exactly the same way as compression, except the default will be False so any existing uses are unaffected.

Great, I'm looking forward to contributing here!

(P.S. I'm getting married this weekend, and time is a bit tight this week and next while family's in town. So it may take me a few weeks to submit a PR. Let me know if any additional thoughts come to mind in the meantime.)

@mammothb
Copy link
Owner

@k-dal Congratulations! There's no rush on the PR.

  1. I think data_version was added when we thought we should break compatibility with earlier versions of the pickle and force users to recreate the pickle. Perhaps version_name would be helpful since we want to track the fields in the pickle relative to the state of the codebase/library.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants