Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT CLI conversion #249

Merged
merged 48 commits into from
Jan 23, 2023
Merged

FEAT CLI conversion #249

merged 48 commits into from
Jan 23, 2023

Conversation

E-Aho
Copy link
Collaborator

@E-Aho E-Aho commented Dec 9, 2022

Addresses #241

This PR adds in an entrypoint, skops convert, which allows users to automatically convert files from the command line.

I didn't end up adding inspect, as it felt odd to me to inspect a .pkl file which would require loading it into skops format based on how we currently have the safety checks implemented. Happy to implement an entrypoint to do it if other maintainers think it would be useful, it would essentially just need to not save the output skops in the current conversion method.

@E-Aho E-Aho changed the title Feat cli conversion FEAT: CLI conversion Dec 9, 2022
@BenjaminBossan
Copy link
Collaborator

BenjaminBossan commented Dec 12, 2022

No detailed review from me yet, just some more general things to discuss:

  1. It would be nice to give some examples how to actually use the CLI :)
  2. I feel like using logging is overkill here, do we really need it? I have been burnt by the logging module in the past, so I would avoid it if possible :D. If print and print(..., file=sys.stderr) are not enough, we could consider using rich. If we want to support different "log levels", we could have action='count' for verbose to allow -v, -vv etc. (at least that's what I think it does)
  3. Usage-wise, I think allowing to (optionally) specify an output file name makes more sense than an output folder. That's more in line with the majority of CLI tools and gives the user more control.
  4. Regarding the testing, I think I'm fine with the general approach. Just throwing out there that we could also create real pickle files and shell out to call the CLI, which is a bit more end-to-end.

@E-Aho E-Aho marked this pull request as ready for review December 12, 2022 21:08
@E-Aho
Copy link
Collaborator Author

E-Aho commented Dec 12, 2022

Ready for review @skops-dev/maintainers :)

@E-Aho
Copy link
Collaborator Author

E-Aho commented Dec 12, 2022

Addressing @BenjaminBossan's comments:

  1. I definitely will include some examples in documentation, but for here, the usage looks like:
    skops convert abc.pkl will convert the abc.pkl file to abc.skops at the moment.

  2. I used logging as I felt it was the default, 'pythonic' way to allow varied verbosity of print statements to stdout. I thought it would be nice to allow for users to have slightly better visibility if something was not working, and it felt extensible to me. I haven't used rich much, but if that's something you and the other devs are happier with I'm happy to change to that.

  3. I think it would make sense to add support for that too. The reason I initially wasn't too keen to go in that direction was it might make the usage pattern for multiple files clunkier.
    skops convert a.pkl b.pkl c.pkl --output-dir my-skops-dir felt tidier than having to specify a file name/path for each file specifically, but we could also support different usage patterns for single vs batch file conversions.

  4. Happy to add in some integration tests on actual files for the main entrypoint to have visibility of the whole workflow.

@BenjaminBossan
Copy link
Collaborator

My impression is that some of the things we discuss depend on what the general expectations of a command line tool are.

I've never seen the use of logging for CLI, but maybe that's just me. Still, I'd argue we should at least evaluate how necessary it is and if we can reduce complexity overall by removing it, we should do that. For instance, we could just not print anything when there is no error (which is the standard) and if users want to see more, like unknown packages, they can do -v.

(To be clear, I don't think we need rich, it's just in case we want to make the output prettier.)

Regarding output dir, we could keep it (but I wonder if -t --target-directory would be more standard, but then we need something else for trusted types). But IMHO we could also do without it since it's easy to chain commands to achieve the same result:

$ skops convert *.pkl && mv *.skops my-target-dir/

On the other hand, having a way to indicate the output file name is something I'd expect from similar tools, that's why I would have added it (but we could do that later).

Regarding the --trusted argument, do we want to allow more fine grained control, i.e. letting users indicate the trusted types? If so, the warning should exactly indicate how to specify them.

@adrinjalali
Copy link
Member

As for logging, I'm in favor of using it rather than pure print. I find it much cleaner, and migrating to logging can become really hard later. In scikit-learn, we've wanted to move to logging for ages, but backward compatibility has been keeping us from doing it.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need an empty __init__.py file, do we?

Nothing major here, looks pretty good to me. I don't have a strong opinion on print vs logging, but if we're using logging, we should certainly get an instance for this library, and under that a CLI one, instead of using the global instance.

skops/io/_cli.py Outdated Show resolved Hide resolved
skops/io/_cli.py Outdated Show resolved Hide resolved
skops/io/_cli.py Outdated Show resolved Hide resolved
skops/io/_cli.py Outdated Show resolved Hide resolved
@E-Aho
Copy link
Collaborator Author

E-Aho commented Dec 17, 2022

Addressed your PR comments, feel free to have another look @adrinjalali :)

As it stands:

  1. Users can pass in multiple input files and output file names
  2. If a user does not specify an output file name for a given input file, the default is to save a model abc.xyz as abc.skops in the current working directory

I still want to add in documentation to cover the additions, I could maybe use someone pointing me in the right direction to write up changes in the docs folder as I've not got much experience writing documentation :')

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks quite good, but there still some points to discuss. (Interesting how many design decisions still have to be made for such a seemingly simple task.)

but if we're using logging, we should certainly get an instance for this library,

I think this has not been addressed. IIUC, Adrin means that we should have a logger instance, i.e. logger = logging.getLogger(...) and call logger.info(...) etc.

Personally, I still find it strange to use logging here, but if you two agree, it's fine by me.

Users can pass in multiple input files and output file names

Really not sure if we need this or whether it makes our code more complicated then needs be? If users want to convert multiple files, there are already ways to achieve this in the command line, e.g. through globbing or a small bash script.

I still want to add in documentation to cover the additions, I could maybe use someone pointing me in the right direction to write up changes in the docs folder as I've not got much experience writing documentation :')

I'm sure you can do it, read the existing docs for persistence and then try to put yourself in the shoes of a user, what information would be most important to them? The outcome doesn't have to be perfect on first try, we can iterate in this review.

Could you please also add an entry to changes.rst?

One more minor thing, when I call skops --help, I get this error:

$ skops --help
usage: Skops {convert}
Skops: error: the following arguments are required: function

skops/cli/__init__.py Outdated Show resolved Hide resolved
skops/cli/_convert.py Outdated Show resolved Hide resolved
skops/cli/_convert.py Outdated Show resolved Hide resolved
skops/cli/_convert.py Outdated Show resolved Hide resolved
skops/cli/_convert.py Outdated Show resolved Hide resolved
skops/cli/_convert.py Outdated Show resolved Hide resolved
skops/cli/_convert.py Outdated Show resolved Hide resolved
skops/cli/_convert.py Outdated Show resolved Hide resolved
skops/cli/tests/test_cli.py Outdated Show resolved Hide resolved
skops/cli/tests/test_entrypoint.py Outdated Show resolved Hide resolved
@adrinjalali
Copy link
Member

On holidays, so I'll either review next year or happy for this to be merged, but one think which I think is important is for us to have the cli module as private so that we can change things easy at least for now. I'd rename skops.cli to skops._cli.

@BenjaminBossan
Copy link
Collaborator

I'll be on holiday too until next year, but will take a look from time to time when I have the opportunity.

@E-Aho
Copy link
Collaborator Author

E-Aho commented Jan 10, 2023

Ok, phew, should now be ready to review again @skops-dev/maintainers :)

A couple of notes:

  • Current behaviour is to overwrite when writing a file if it already exists, is this the expected behaviour?
  • I reintroduced the empty__init__.py in cli because Pytest was failing to initilaise because there is now a _utils file in .cli as well as in .io
  • I reintroduced the test_entrypoint test because of the new added complexity with parsers and subparsers
  • I didn't create a full integration test for converting a file from the command line, might be a good test to add for a new contributor, or something I can add at a later date
  • Let me know if I should add more/ reword the documentation at all, wasn't sure how much I should add there.

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good, I tested it locally and it works as expected. Thanks also for adding to the docs.

Btw. I checked and it seems we already have a transient dependency on click, so it would theoretically not cost us anything to build our CLI with it. I assume that after all this work, you're not interested in rewriting it with click, which is fine, I just wanted to mention it.

There still a few points left that I commented on, please take a look. Also, I wonder about this comment:

but if we're using logging, we should certainly get an instance for this library,

I think this has not been addressed. IIUC, Adrin means that we should have a logger instance, i.e. logger = logging.getLogger(...) and call logger.info(...) etc.

docs/persistence.rst Outdated Show resolved Hide resolved
docs/persistence.rst Outdated Show resolved Hide resolved
docs/persistence.rst Outdated Show resolved Hide resolved
docs/persistence.rst Show resolved Hide resolved
docs/persistence.rst Show resolved Hide resolved
"""Takes in verbosity from a CLI entrypoint (number of times -v specified),
and sets the logger to the required log level"""

all_levels = [logging.WARNING, logging.INFO, logging.DEBUG]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, warning level logs would always be shown, even if I don't add -v. IMO, this should not be the case. So I would prefer that without -v, the log level should be ERROR or CRITICAL.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the python docs:

The default level is WARNING, which means that only events of this level and above will be tracked, unless the logging package is configured to do otherwise.

I followed this because it's the default for python packages.

We can adjust the default level for our CLI, or we could also add a -q, --quiet option to reduce the log level

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is true for logging in general, but for CLIs, there is the convention to not print anything if there is no issue. Ideally, I would like to avoid having a --quiet option, because it increases complexity and because it's unclear how it interacts with -v.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good point.

I would raise that it could be considered an issue if untrusted types were present.

If a user had a file with untrusted types in it, and we converted it silently, they wouldn't know they needed to specify anything as trusted when using it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, this is more of an issue for the person loading the file, and they will know (unless they use trusted=True). Personally, I would not report it by default, but I can live with the current way.

skops/cli/tests/test_convert.py Outdated Show resolved Hide resolved
skops/cli/tests/test_convert.py Outdated Show resolved Hide resolved
skops/cli/tests/test_convert.py Show resolved Hide resolved
skops/cli/tests/test_convert.py Outdated Show resolved Hide resolved
E-Aho and others added 10 commits January 17, 2023 11:17
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
@E-Aho
Copy link
Collaborator Author

E-Aho commented Jan 17, 2023

Thanks for the review @BenjaminBossan, I've made adjustments along the lines of most of your comments, let me know if there's anything I've missed.

WRT logging, I've moved to using a logger instance, but still set a basic config at the top level for our selected entrypoint, as I felt that the config should be set at the top level for whatever entrypoint we're using.

That way, if we do implement any different logging elsewhere in our codebase, it automatically keeps the config specified for whatever command we're calling.

Let me know if you think there's a better way to achieve this;

Also, the log format currently looks like:

INFO : No unknown types found in abc.

How does this look to all? We can add things like time, function call, etc at a later date if we need it.

Also, when we implement logging somewhere else, might make sense to move this formatting out to a common module.

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for implementing the changes, this looks great to me now. Let's wait for one more review before merging.

WRT logging, I've moved to using a logger instance, but still set a basic config at the top level for our selected entrypoint, as I felt that the config should be set at the top level for whatever entrypoint we're using.

Also, when we implement logging somewhere else, might make sense to move this formatting out to a common module.

I think it's good for the time being.

Also, the log format currently looks like:
INFO : No unknown types found in abc.
How does this look to all? We can add things like time, function call, etc at a later date if we need it.

Yes, let's keep it simple for now.


Parameters
----------
input_file : os.PathLike
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard to say if a user who doesn't know the type would guess that str is a path-like. Anyway, I won't object to leaving it as is, especially since it means that the type annotation and docstring are identical, I just wanted to bring up the possible friction here.

f"While converting {model_name}, "
"the following unknown types were found: "
f"{untrusted_str}. "
f"When loading {output_file}, add 'trusted=True' to the skops.load call. "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this touches on a more general point of how we should deal with trusted=True. I posted on discord about it, which is a better place to discuss than here :)

"""Takes in verbosity from a CLI entrypoint (number of times -v specified),
and sets the logger to the required log level"""

all_levels = [logging.WARNING, logging.INFO, logging.DEBUG]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is true for logging in general, but for CLIs, there is the convention to not print anything if there is no issue. Ideally, I would like to avoid having a --quiet option, because it increases complexity and because it's unclear how it interacts with -v.

skops/cli/tests/test_convert.py Outdated Show resolved Hide resolved
def _convert_file(
input_file: os.PathLike,
output_file: os.PathLike,
logger: logging.Logger = logging.getLogger(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should instantiate the logger just once at the root of the module, and then just set logger=logger here. Is there any disadvantage to that?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, we could, although that would mean it gets initialised even if we import functions from the file to use elsewhere.

I don't think it would really impact much, I just personally dislike writing anything other than definitions at the top level

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would mean it gets initialised even if we import functions from the file to use elsewhere.

That's true. On the other hand it's now being initialized each time the function is called. Given that CLI usage consists of creating a new process anyway, it makes indeed no difference either way. If we ever get to a point where we use the logger elsewhere, we might have to rethink this.

docs/persistence.rst Show resolved Hide resolved
def _convert_file(
input_file: os.PathLike,
output_file: os.PathLike,
logger: logging.Logger = logging.getLogger(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would mean it gets initialised even if we import functions from the file to use elsewhere.

That's true. On the other hand it's now being initialized each time the function is called. Given that CLI usage consists of creating a new process anyway, it makes indeed no difference either way. If we ever get to a point where we use the logger elsewhere, we might have to rethink this.

skops/cli/_convert.py Outdated Show resolved Hide resolved
"""Takes in verbosity from a CLI entrypoint (number of times -v specified),
and sets the logger to the required log level"""

all_levels = [logging.WARNING, logging.INFO, logging.DEBUG]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, this is more of an issue for the person loading the file, and they will know (unless they use trusted=True). Personally, I would not report it by default, but I can live with the current way.

@E-Aho
Copy link
Collaborator Author

E-Aho commented Jan 20, 2023

@BenjaminBossan do we still want one more review before merging this in?

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for the hard work. This LGTM and can be merged if @adrinjalali is also fine with it.

As mentioned in some of the comments, there are some things that I think can be improved, but they are not blocking this PR, as they can be added later without BC breakage.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @E-Aho . I think we might need improve docs when we get back to this, but happy to merge this and have it out.

@adrinjalali adrinjalali changed the title FEAT: CLI conversion FEAT CLI conversion Jan 23, 2023
@adrinjalali adrinjalali merged commit e9f6e19 into skops-dev:main Jan 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants