Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Added CLI command to update skops files #343

Merged
merged 32 commits into from
Jun 27, 2023

Conversation

EdAbati
Copy link
Contributor

@EdAbati EdAbati commented Apr 10, 2023

Reference Issues/PRs

Closes #333

What does this implement/fix? Explain your changes.

Taking inspiration from the implementation from skops convert, this adds the update command to the CLI.
It allows to update a skops file by simply 'loading' it and 'dumping' it again.

I have also added MacOS files in the gitignore.

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for working on this.

Before going further, we need a few more things: tests and documentation (see docs/persistence.rst).

IMO it would also be nice to have a feature to check the protocol of the loaded file and if it's the same as the current protocol, to skip the conversion. If it's too complicated, we can also leave this open for a future PR, though.

@EdAbati EdAbati marked this pull request as ready for review April 19, 2023 16:03
@EdAbati
Copy link
Contributor Author

EdAbati commented Apr 19, 2023

Hi @BenjaminBossan

I finally moved this PR from "Draft" to "ready for review". 🙂

You can see that, for now, I have not implemented any extra checks that we discussed here and in the issue. I wanted to make sure I’m on the right track before adding anything else.

The interfaces of the functions are now slightly different from the ones in _convert. I am passing a logger argument to make testing easier.

@BenjaminBossan
Copy link
Collaborator

I did a quick pass and it looks good to me overall. @adrinjalali would you be fine with this PR as is, i.e. no extra checks etc.?

@adrinjalali
Copy link
Member

adrinjalali commented Apr 20, 2023

I'm happy as is, without the checks, but we should figure out the checks separately, we've needed them in a few places now.

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot, this is a very clean implementation, well done.

There are only a few minor code comments from my side, please take a look.

Regarding the more general points, I have two suggestions (not must haves):

  1. How about making it possible to call the command without -o to just overwrite the existing file? I.e. skops update my_model.skops would write the new, updated skops file to my_model.skops.

  2. There is no test for the actual work that is being done. So e.g. if the _update_file function would just create a copy, the tests would still pass fine. I know that creating a file in an old skops format to test the updating feature is not trivial. Maybe you could come up with something clever? Say, create the skops file, manually set the protocol to an older version, then checking after the update that the protocol is the current version now. WDYT?

skops/cli/_update.py Outdated Show resolved Hide resolved
skops/cli/_update.py Outdated Show resolved Hide resolved
skops/cli/tests/test_entrypoint.py Outdated Show resolved Hide resolved
@EdAbati
Copy link
Contributor Author

EdAbati commented Apr 23, 2023

Thank you very much for your review!

  1. How about making it possible to call the command without -o' to just overwrite the existing file? I.e. skops update my_model.skopswould write the new, updated skops file tomy_model.skops`.

I actually did that initially, but then I thought that if we did that by default, it could be a negative surprise for certain users that want to keep the original file. Another idea was to attach a prefix or a suffix to the file name provided as input if no output-file is provided. For example: my-model.skops could be updated by default to my-model-updated.skops. In the end, this is a design decision; I'm happy to do what you think is best :)

  1. There is no test for the actual work that is being done. So e.g. if the _update_file function would just create a copy, the tests would still pass fine. I know that creating a file in an old skops format to test the updating feature is not trivial. Maybe you could come up with something clever? Say, create the skops file, manually set the protocol to an older version, then checking after the update that the protocol is the current version now. WDYT?

You are right! I added the protocol check inside the update function so that we can also test it :) Let me know if it's what you had in mind

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for implementing the feedback. Clever way of patching the protocol.

I think there is a bug with one of the log messages, please take a look at my comment. Otherwise, I just left some comments on how to potentially tighten the tests.

I actually did that initially, but then I thought that if we did that by default, it could be a negative surprise for certain users that want to keep the original file. Another idea was to attach a prefix or a suffix to the file name provided as input if no output-file is provided. For example: my-model.skops could be updated by default to my-model-updated.skops. In the end, this is a design decision; I'm happy to do what you think is best :)

Good points. My intuition for being okay with allowing to overwrite the existing file is that other terminal commands also allow to just overwrite existing files without warning (say, mv). Adding a prefix/suffix would be a way to prevent accidents but has its own complications (what if my-model-updated.skops already exists?). @adrinjalali wanna break the tie?

skops/cli/_update.py Outdated Show resolved Hide resolved
skops/cli/tests/test_update.py Outdated Show resolved Hide resolved
skops/cli/tests/test_update.py Show resolved Hide resolved
@adrinjalali
Copy link
Member

re: overwriting the existing file

I think kinda what black does would be my ideal:

only print info about the diffs by default, no change in any file, no new files created

-i would change file in place

-o creates a new file

WDYT?

@BenjaminBossan
Copy link
Collaborator

only print info about the diffs by default, no change in any file, no new files created

Hmm, how exactly would that diff look like for a skops file?

-i would change file in place

-o creates a new file

I can get behind the idea to allow inplace but requiring an extra argument. Not sure about -i, as it is often used to mark the input (especially when -o is used for the output). If there is some precedence for a given argument name, we should probably use that. Maybe -f or --force to force overwriting but IMO it doesn't quite fit here. cp has -n (no clobber) to prevent overwriting an existing file, making overwriting the default, but I don't like that too much. If nothing else comes up, we could require the full --inplace.

@adrinjalali
Copy link
Member

Hmm, how exactly would that diff look like for a skops file?

we can start by reporting if there'd be any change at all or not, we can improve the diff maybe later if needed. It could be something like the diff of the json schema, and reporting any other saved file (like numpy files) if they're changed.

I can get behind the idea to allow inplace but requiring an extra argument. Not sure about -i, as it is often used to mark the input (especially when -o is used for the output). If there is some precedence for a given argument name, we should probably use that. Maybe -f or --force to force overwriting but IMO it doesn't quite fit here. cp has -n (no clobber) to prevent overwriting an existing file, making overwriting the default, but I don't like that too much. If nothing else comes up, we could require the full --inplace.

Input is usually just an argument, w/o a parameter. grep uses -i for case insensitive for instance. But I don't really mind, I'm also happy with --inplace.

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fixes. I missed the test for the error, my bad. There is a merge conflict, apart from that, the PR is ready to be merged.

@adrinjalali
Copy link
Member

I'm gonna wait for this one before pushing out a release :)

@EdAbati
Copy link
Contributor Author

EdAbati commented May 23, 2023

Hi both so sorry for the delay, I have been very busy lately. 🥲

I checked all the remaining comments and fixed the conflicts. Let me know if anything else should be updated.

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, thanks a lot for all your work. I don't have any blockers for merging, just two comments, please take a look. If the general thought is that they don't require changes, I'm fine with you merging @adrinjalali

)
return None

dump(input_model, output_file)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an extreme edge case, but I wanted to bring it up: If the user cancels the process during the dumping operation, could they end up with a broken file? Since we write to the zip archive progressively, I think it could happen. Sometimes, I would start a command and see I did a mistake and would cancel it, so it's not impossible.

Now when the user uses inplace, they could even bust their original file, ending up with no working version. It is an extreme edge case for this to happen, not sure if we want to take measures against it. @adrinjalali WDYT?

To prevent this behavior, we could write the output to a temp file and then move it to output_file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think is quite major, and you're right. We should simply create a temp file beside the existing file, and then replace the existing file with what's created if conversion is successful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi both :) is this what you had in mind? 3a30721

parser_subgroup.add_argument(
"--inplace",
help="Update and overwrite the input file in place.",
action="store_true",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That behavior would be fine. It would also be acceptable IMO to update inplace by default, since it would only be done if the current version is higher, but leaving it like this is also good.

Just one question: Right now, if using the defaults (no change to log level), a user would typically not get feedback at all, right? Do we want the default behavior to be not do anything, nor message anything?

Comment on lines 76 to 81
tmp_output_file = f"{output_file}.tmp"
logger.debug(f"Writing updated skops file to temporary path: {tmp_output_file}")
dump(input_model, tmp_output_file)
logger.debug(f"Moving updated skops file to output path: {output_file}")
os.replace(tmp_output_file, output_file)
logger.info(f"Updated skops file written to {output_file}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably don't need so much debug logging.

This should use a temp file from (https://docs.python.org/3/library/tempfile.html), and in a context manager to protect from random OS failures. Otherwise LGTM.

Copy link
Contributor Author

@EdAbati EdAbati Jun 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it now. I used tempfile.TemporaryDirectory because I understood that tempfile.NamedTemporaryFile cannot be moved when opened on Windows ("Whether the name can be used to open the file a second time, while the named temporary file is still open, varies across platforms (it can be so used on Unix; it cannot on Windows).").

Am I overcomplicating this? 🤔

@adrinjalali
Copy link
Member

The issues on the CI are numpy / persistence related. Merging.

@adrinjalali adrinjalali enabled auto-merge (squash) June 21, 2023 12:50
@adrinjalali
Copy link
Member

@BenjaminBossan somehow here it shows you've requested changes

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the changes, this LGTM now.

@BenjaminBossan
Copy link
Collaborator

@EdAbati Could you please rebase on/merge with main to ensure the tests pass (#373 should have fixed the issues)? It's probably fine now, I just want to be sure.

@EdAbati
Copy link
Contributor Author

EdAbati commented Jun 26, 2023

Done now! 🤞
Thank you for the feedback and apologies for taking a bit long to update the PR :)

@EdAbati
Copy link
Contributor Author

EdAbati commented Jun 26, 2023

There is still a small issue with SciPy, that it shouldn't be caused by this PR

@EdAbati
Copy link
Contributor Author

EdAbati commented Jun 26, 2023

Made a separate PR in an attempt to fix the issues with the test_persist.py #374

@adrinjalali
Copy link
Member

The failing tests are unrelated to this PR, so I'll merge.

@adrinjalali adrinjalali merged commit 70185a9 into skops-dev:main Jun 27, 2023
9 of 12 checks passed
@EdAbati EdAbati deleted the cli-uupdate-skops-file branch June 27, 2023 13:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CLI: Command to update a skops file
3 participants