Bigrams omitted from save and load pickle methods #93

k-dal · 2021-07-28T00:16:24Z

I noticed I was getting better results when doing a fresh load of the included dictionary and bigrams_dictionary vs. when I was loading from a previously saved pickle. It turned out to be because self._bigrams data (as well as a number of other params & attributes) weren't being pickled and were therefore missing after loading from pickle. Adding the missing bigrams attribute to the pickling save/load utils resolved the issues I was having.

I'm hardly an NLP expert, so before I waste your time with a pull request I have to ask: Is there a reason some attributes are omitted when pickling?

The text was updated successfully, but these errors were encountered:

mammothb · 2021-07-28T00:44:58Z

@k-dal Hi, sorry, I think it's because pickling was implemented before bigrams and I forgot to add it in when implementing bigrams. If you have managed to fix the issue on your end, would you like to submit a pull request so bigrams is included during pickling?

k-dal · 2021-07-28T04:44:57Z

@mammothb Got it, that makes sense (and no need to apologize!).

Since you're open to a pull request, I'm thinking a handful of pickler-related alterations/upgrades might be good to do at the same time:

Add missing attributes to pickle save & load functions. Specifically:
1. bigrams
2. below_threshold_words
3. replaced_words
4. max_dictionary_edit_distance
5. prefix_length
6. count_threshold
An accessory function to raise a warning if load_pickle imports a state where the loaded params conflict with the SymSpell instance's existing params.
1. Since it looks like the dict attribute entries are heavily influenced by the parameter values (e.g., prefix_length, etc) at time of creation, it seems worthwhile to warn the user if she initializes the class one way but changes parameter states when loading her pickle.
2. Does this seem sensible from your perspective? I'm hesitant to omit the int params entirely - it feels like important context/info would be missing from the pickle without the params.
A flag to call pickle.dumps() and pickle.loads() instead of the file/stream .dump() and .load() functions.
1. My motivation here is that strings may be preferable if the pickle is going to be stored in a database rather than written to file.
2. It may also make things easier when using symspellpy in a serverless environment since you don't always have a persistent file system available.

Anyway, thank you for the quick reply! Let me know what you think.

mammothb · 2021-07-28T07:57:45Z

@k-dal

I think pickling was originally intended to avoid having to parse the dictionary files again as some users have large dictionary files and it takes very long to load those. replaced_words contains the replaced/corrected words while running lookup_compound. I am picturing the following use case:
The user loads a dictionary file once and would like to skip the creation of dictionary entries and use the same dictionary for multiple projects in the future, perhaps similar to loading model weights for a deep learning model. replaced_words behaves like some sort of output in this case, so I don't think it is necessary include it in the pickle.
I agree, the int params are necessary if we are planning to load additional dictionary files to a previously processed pickle. There is some form of version checking here, perhaps the accessory function could be place around there.
I think the current state of how saving and loading pickle work is partly from the discussion at Too much time for loading large dictionary #31 (comment). If the new save and load functions do not break existing use cases, I am fine adding a flag to call pickle.dumps() and pickle.loads().

k-dal · 2021-07-28T15:54:31Z

@mammothb

Got it, that explanation helps a lot. You're definitely right so I'll plan to include everything except for replaced_words and will make note of that deliberate omission in the docstring(s).
Ah okay, yes, putting it in this area makes sense. I wasn't sure what data_version meant -- since it was hardcoded, I figured it was probably referring to the version of the included .txt dictionary files. Sounds like there should also be some contextual reflection of the "version" when pickled too. Maybe a pickled_at timestamp or optional version_name that to capture some context. Other ideas welcome.
This context helps. I had the same experience with compression causing slow-downs, so now I understand why the compression boolean flag is there. Adding pickle.dumps() and pickle.loads() won't break any existing use cases. We can treat it exactly the same way as compression, except the default will be False so any existing uses are unaffected.

Great, I'm looking forward to contributing here!

(P.S. I'm getting married this weekend, and time is a bit tight this week and next while family's in town. So it may take me a few weeks to submit a PR. Let me know if any additional thoughts come to mind in the meantime.)

mammothb · 2021-07-29T00:41:16Z

@k-dal Congratulations! There's no rush on the PR.

I think data_version was added when we thought we should break compatibility with earlier versions of the pickle and force users to recreate the pickle. Perhaps version_name would be helpful since we want to track the fields in the pickle relative to the state of the codebase/library.

mammothb mentioned this issue Nov 21, 2021

Include bigram dictionary in save and load pickle methods #96

Merged

mammothb linked a pull request Nov 21, 2021 that will close this issue

Include bigram dictionary in save and load pickle methods #96

Merged

mammothb closed this as completed Nov 21, 2021

mammothb added the enhancement New feature or request label Nov 29, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bigrams omitted from save and load pickle methods #93

Bigrams omitted from save and load pickle methods #93

k-dal commented Jul 28, 2021

mammothb commented Jul 28, 2021

k-dal commented Jul 28, 2021

mammothb commented Jul 28, 2021

k-dal commented Jul 28, 2021 •

edited

Loading

mammothb commented Jul 29, 2021

Bigrams omitted from save and load pickle methods #93

Bigrams omitted from save and load pickle methods #93

Comments

k-dal commented Jul 28, 2021

mammothb commented Jul 28, 2021

k-dal commented Jul 28, 2021

mammothb commented Jul 28, 2021

k-dal commented Jul 28, 2021 • edited Loading

mammothb commented Jul 29, 2021

k-dal commented Jul 28, 2021 •

edited

Loading