Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make validate_docstrings.py ready for the CI #23514

Merged
merged 8 commits into from
Nov 7, 2018
Merged

Make validate_docstrings.py ready for the CI #23514

merged 8 commits into from
Nov 7, 2018

Conversation

datapythonista
Copy link
Member

This will allow us to start having validation rules for the docstings in the CI, for example, we can have:

$ ./scripts/validate_docstrings.py --prefix="pandas.read_" --format=azure --errors=EX03
##vso[task.logissue type=error;sourcepath=pandas/util/_decorators.py;linenumber=281;code=EX03;]pandas.read_excel: flake8 error: E231 missing whitespace after ',' (3 times)
##vso[task.logissue type=error;sourcepath=pandas/util/_decorators.py;linenumber=169;code=EX03;]pandas.read_stata: flake8 error: F821 undefined name 'do_something'

Which will finish with an exit status different than 0, validate only certain docstrings (the ones starting with pandas.read_), and validate only specific errors (in this case only EX03 which validates pep8 in the docstrings). The output format can be a JSON file, simple messages, or the messages formatted to be highlighted in azure.

@jorisvandenbossche @gfyoung if you can review, and give this a bit of priority... there will be conflicts soon, as there are several issues that add new rules. Thanks!

@pep8speaks
Copy link

Hello @datapythonista! Thanks for submitting the PR.

Copy link
Member

@gfyoung gfyoung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good! Let's start with these comments and go from there.

scripts/validate_docstrings.py Outdated Show resolved Hide resolved
scripts/validate_docstrings.py Outdated Show resolved Hide resolved
scripts/validate_docstrings.py Outdated Show resolved Hide resolved
scripts/validate_docstrings.py Outdated Show resolved Hide resolved
scripts/validate_docstrings.py Outdated Show resolved Hide resolved
scripts/validate_docstrings.py Outdated Show resolved Hide resolved
scripts/validate_docstrings.py Show resolved Hide resolved
scripts/validate_docstrings.py Outdated Show resolved Hide resolved
@TomAugspurger
Copy link
Contributor

Which will finish with an exit status different than 0, validate only certain docstrings (the ones starting with pandas.read_

In your opinion, what does our end goal look like? I'm just wondering whether we should have a list of prefixes that are succeeding, or a list of allowed failures. The benefit of the allowed failures is that new APIs (outside the existing prefixes) are automatically checked by default. If we go with a list of "good" prefixes, then we need to remember to add the new API to the list.

@datapythonista
Copy link
Member Author

@TomAugspurger that's a very good point, it was discussed a bit in #23481.

Ideally we want to run for all errors and all docstrings, and fail the CI if something is wrong. We have around 9,000 errors at the moment (and still adding new validations).

I agree excluding docstrings (like what we do with the doctests) is likely to be very useful. But I prefer to keep that for a second PR, so this one is not too complex, and I can play a bit with this before thinking on strategies to validate as much as possible in the best way.

So far with this we should be able to validate for specific errors (like pep8 in examples, or parameter mismatches) in an incremental way (start by Series.str, then Series, then Series and DataFrame...).

@TomAugspurger
Copy link
Contributor

I agree excluding docstrings (like what we do with the doctests) is likely to be very useful. But I prefer to keep that for a second PR, so this one is not too complex,

completely agree. Just thinking towards the end goal, but we're a ways from that right now :)

@datapythonista
Copy link
Member Author

@gfyoung I made the changes. I was thinking on an additional change, but I want to know your opinion. It'd be having all the error messages in a dictionary:

ERR_MSG = {
    ...
    'PR04': 'Parameter "{param}" has no type',
    ...

def validate():
    ...
    if some_cond:
        # may be the tuple (code, msg) could be created with a function, so this looks clearer
        errs.append(('PR04', ERR_MSG['PR04'].format(param=param)))
    ...

I'm unsure, because we will loose the error messages next to the conditions. But without the long messages in between all the conditions I think the code will be neater in that part (the most critical of this script). And we also have the reference of error codes to error messages clearer in one place (and it can be used externally, to map codes to messages, for example if I generate a report of the most frequent errors.

What do you think?

@gfyoung
Copy link
Member

gfyoung commented Nov 6, 2018

@datapythonista : I think putting it in a dictionary seems reasonable. That way the implementation of the function doesn't expand with more error messages, only with logic changes, if that makes sense.

@datapythonista
Copy link
Member Author

thanks for the feedback, I'll change it then, and we can see in practice

@jorisvandenbossche
Copy link
Member

+1 on gathering all codes and (base) error messages together in a dict, if possible. That would make it easier to see which codes/errors we have (but of course might make it a bit less clear in the code what the message is where it is raised)

@datapythonista
Copy link
Member Author

Made the changes. I do think it makes things clearer.

@codecov
Copy link

codecov bot commented Nov 6, 2018

Codecov Report

Merging #23514 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #23514   +/-   ##
=======================================
  Coverage   92.25%   92.25%           
=======================================
  Files         161      161           
  Lines       51169    51169           
=======================================
  Hits        47207    47207           
  Misses       3962     3962
Flag Coverage Δ
#multiple 90.64% <ø> (ø) ⬆️
#single 42.28% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c6366f5...f0ab4c3. Read the comment docs.

@datapythonista
Copy link
Member Author

All green. If there are no more changes required, could someone take another look and merge? There is some other work going on on adding validations, and would be nice to have this merged soon.

@jorisvandenbossche jorisvandenbossche merged commit 5938ce1 into pandas-dev:master Nov 7, 2018
@jorisvandenbossche
Copy link
Member

Thanks!

@datapythonista
Copy link
Member Author

Thanks @jorisvandenbossche. I just detected a typo in an error that wasn't tested. I'm fixing it now, and adding the test, will open a separate PR.

JustinZhengBC pushed a commit to JustinZhengBC/pandas that referenced this pull request Nov 14, 2018
* validate_docstrings.py to exit with status code as the number of errors (so, 0 for no errors)

* Implemented different output types for the validate_all, and a prefix to filter which docstrings are validated

* Codifying errors

* Adding --errors parameter to be able to validate only specific errors
tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
* validate_docstrings.py to exit with status code as the number of errors (so, 0 for no errors)

* Implemented different output types for the validate_all, and a prefix to filter which docstrings are validated

* Codifying errors

* Adding --errors parameter to be able to validate only specific errors
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
* validate_docstrings.py to exit with status code as the number of errors (so, 0 for no errors)

* Implemented different output types for the validate_all, and a prefix to filter which docstrings are validated

* Codifying errors

* Adding --errors parameter to be able to validate only specific errors
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
* validate_docstrings.py to exit with status code as the number of errors (so, 0 for no errors)

* Implemented different output types for the validate_all, and a prefix to filter which docstrings are validated

* Codifying errors

* Adding --errors parameter to be able to validate only specific errors
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI Continuous Integration Docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make validate_docstrings.py ready for the CI
5 participants