Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Use flake8 to check for PEP8 violations in doctests #23399

Merged
merged 13 commits into from
Nov 4, 2018

Conversation

FHaase
Copy link
Contributor

@FHaase FHaase commented Oct 28, 2018

  • tests added
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff

Improvement to PR #23381 regarding issue #23154

Enables --doctests flag, resulting in doctests being checked by flake8
Uses flake8.api.legacy to invoke the tests.

Downsides are:

  • Flake8 3.0.0 presently does not have a public, stable Python API.

  • provided options in flake8.get_style_guide don't have an effect. Thus setup.cfg had to be enhanced.
  • returns the count of each error within the complete file, not just the requested function. [Might be confusing]

python ./scripts/validate_docstrings.py pandas.read_excel

                [...]
                Parameter "convert_float" description should finish with "."
        Errors in doctest sections
                2 F721 syntax error in doctest
                2 F821 undefined name '_make_signature'

Help in the form of Use "flake8 pandas/util/_decorators.py" for further information. could be added to the output:

pandas/util/_decorators.py:320:26: F721 syntax error in doctest
pandas/util/_decorators.py:321:13: F721 syntax error in doctest
pandas/util/_decorators.py:322:15: F821 undefined name '_make_signature'
pandas/util/_decorators.py:322:31: F821 undefined name 'f'

Signed-off-by: Fabian Haase <haase.fabian@gmail.com>
@pep8speaks
Copy link

Hello @FHaase! Thanks for submitting the PR.

@gfyoung
Copy link
Member

gfyoung commented Oct 29, 2018

@FHaase : Thanks for the PR! Given your first comment in the description, I would recommend opening an issue first. That's generally what we would recommend instead of jumping right to the PR.

cc @datapythonista

@gfyoung gfyoung added Docs Code Style Code style, linting, code_checks labels Oct 29, 2018
@datapythonista
Copy link
Member

Thanks @FHaase, this looks much better. Can you mention the issue you close, and the other PR related to this in the description please.

I just saw couple of things. Doesn't feel like we should be using a module named legacy. I guess you checked it, but doesn't flake8 have a newer module if they call that one legacy?

Also, feels like when we validate a docstring, we validate all the pep8 problems in the file, right? We should find a way to validate a single docstring.

When this is done, we should also implement tests for this, as the rest of validations.

In any case, good work, this will be very helpful once is done.

@FHaase
Copy link
Contributor Author

FHaase commented Oct 29, 2018

I just saw couple of things. Doesn't feel like we should be using a module named legacy. I guess you checked it, but doesn't flake8 have a newer module if they call that one legacy?

flake8.api is empty besides the legacy submodule. __init.py__ states:

Module containing all public entry-points for Flake8.
This is the only submodule in Flake8 with a guaranteed stable API. All other
submodules are considered internal only and are subject to change.

Also, feels like when we validate a docstring, we validate all the pep8 problems in the file, right? We should find a way to validate a single docstring.

Yes I mentioned it. The only way of fixing this might be to create a temp-file just with the docstring in it and run flake8 on that?

Signed-off-by: Fabian Haase <haase.fabian@gmail.com>
setup.cfg Outdated
./pandas/computation
./pandas/core
./pandas/errors
./pandas/io
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a number of these have been removed recently, can you edit

* updated folders excluded from flake8 --doctests
* fixed failing doctests

Signed-off-by: Fabian Haase <haase.fabian@gmail.com>
* Examples in `test_validate_docstrings.py` lacked imports
* Removed signature from function in generated file as not really needed
  at this point.
* Used `clean_doc` instead of `raw_doc` to get correct indentation.
* Added missing try: finally: block in `_file_representation()`

Signed-off-by: Fabian Haase <haase.fabian@gmail.com>
@FHaase FHaase force-pushed the feature/validate_docstrings branch from 9d7b896 to 762de5d Compare October 30, 2018 17:58
@codecov
Copy link

codecov bot commented Oct 30, 2018

Codecov Report

Merging #23399 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #23399   +/-   ##
=======================================
  Coverage   92.23%   92.23%           
=======================================
  Files         161      161           
  Lines       51197    51197           
=======================================
  Hits        47220    47220           
  Misses       3977     3977
Flag Coverage Δ
#multiple 90.61% <ø> (ø) ⬆️
#single 42.27% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/datetimes.py 96.41% <0%> (ø) ⬆️
pandas/core/generic.py 96.81% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 047242b...1178cae. Read the comment docs.

@FHaase FHaase changed the title [WIP] ENH: Use flake8 to check for PEP8 violations in doctests ENH: Use flake8 to check for PEP8 violations in doctests Nov 1, 2018
@FHaase FHaase force-pushed the feature/validate_docstrings branch 2 times, most recently from 46cd0a4 to 844065c Compare November 2, 2018 13:50
@FHaase
Copy link
Contributor Author

FHaase commented Nov 2, 2018

I just saw couple of things. Doesn't feel like we should be using a module named legacy. I guess you checked it, but doesn't flake8 have a newer module if they call that one legacy?

Dug down into the code and basically recreated the legacy api.

@FHaase FHaase force-pushed the feature/validate_docstrings branch 2 times, most recently from 8b5c95d to 9c09bf1 Compare November 2, 2018 14:02
* added test
* simplified temporary file usage
* included flake8 by calling application directly
* removed doctests from setup.cfg

Signed-off-by: Fabian Haase <haase.fabian@gmail.com>
@FHaase FHaase force-pushed the feature/validate_docstrings branch from 9c09bf1 to 9bc7f65 Compare November 2, 2018 14:25
Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments, mainly about code clarity and simplicity. But looks good, I think we're almost there.

scripts/tests/test_validate_docstrings.py Outdated Show resolved Hide resolved
@@ -584,6 +587,29 @@ def prefix_pandas(self):
pass


class BadExamples(object):

def doctest(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think doctest is not a very descriptive name if this test is to check whether pep8 should fail because of unknown np. The rest of the example is very verbose to just check that. I'd use the description to explain what is this test about. And in the examples just with >>> np.nan could be enough.

Then, couple more tests could be added, to test the typical pep8 issues we have, like >>> foo = 1+2 (missing spaces)...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason to use flake8 to check for pep8 issues was to not have to write all these tests. flake8 should be tested enough to work properly.

This test basically only verifies that flake8 is actually testing the docstrings.

import tempfile
from contextlib import contextmanager

from flake8.main.application import Application as Flake8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it complicates things unnecessarily importing objects other than modules. import contextlib and import flake8.main.application seems like the simplest and best way. And then use accordinly @contextlib.contextmanager and app = flake8.main.application.Application. This way, just by reading those lines it's obvious what is the module and what is being used, and there is not need to go to the imports and start decoding aliases.

]

@contextmanager
def _file_representation(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is not very descriptive. I think this is the examples as temporary file, so something like _examples_as_temp_file seems much clearer.

Not sure if it's worth having two separate methods, I'd generate the file in the method that validates pep8. But if we have two methods, I'd have this first, as this is being used by the other (pep8_violations).

Temporarily creates file with current function inside.
The signature and body are **not** included.

:returns file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use numpydoc, not this syntax

lines = self.clean_doc.split("\n")
indented_lines = [(' ' * 4) + line if line else ''
for line in lines[1:]]
doc = '\n'.join([lines[0], *indented_lines])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused with this. I guess I'm missing something. If in our docstring we have something like:

Some desc.

Examples
--------
>>> val = 1+2
>>> print(val)

I'd expect to have in the temporary file to validate:

val = 1+2
print(val)

As we won't validate the description, or anything else. I think numpydoc returns the lines with code, so writing to the file should be a single line of code. Is there any advantage generating a function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well flake8 supports doctests by adding --doctests as an argument. And the Examples follow the doctest standard.

Examples
An optional section for examples, using the doctest format.

Having to carve out the examples is not trivial and would require writing multiple tests. So creating a file with a dummy-function is a simple hack, getting ugly cause it has to restore the indentation of the docstring. One test of the BadExamples class ensures it is working.

@@ -446,7 +486,7 @@ def validate_one(func_name):
if doc.summary != doc.summary.lstrip():
errs.append('Summary contains heading whitespaces.')
elif (doc.is_function_or_method
and doc.summary.split(' ')[0][-1] == 's'):
and doc.summary.split(' ')[0][-1] == 's'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it'd be better to not do unrelated changes

return [
"{} {} {}".format(s.count, s.error_code, s.message)
for s in stats.statistics_for('')
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd personally prefer to yield from application.guide.stats.statistics_for('') instead of returning the string. In case the information is used in more than one place, having the components can be useful.

Also, it's an opinion, but I wouldn't use a property here, as this feels more like an action than an attribute. A method validate_pep8 seems easier to understand.

Signed-off-by: Fabian Haase <haase.fabian@gmail.com>
Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great. Still few comments. And we'll probably have to wait to #23161 to reuse the code extraction from the examples, but it should be merged very soon.


Examples
--------
>>> import pandas as pd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't realize before, but there is something tricky here. The convention is to assume numpy and pandas are always imported before the examples. I personally would be explicit, but that's how all the examples are made.

This means two things:

  • pep8 will fail because of this, unless we add the imports when validating
  • we should use different examples for the tests (may be not imorting math?)

And as I said in the previous review, I'd like to have more examples of pep8 failures, like missing spaces around an operator. Also, these examples are too long for what is being tested.

@@ -24,6 +24,10 @@
import inspect
import importlib
import doctest
import tempfile

from flake8.main import application as flake8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please, just import flake8.main.application, and then use app = flake8.main.application.Application(). It makes things confusing to use aliases, and even more if the name of the alias is the name of the root module of what is being imported.

def validate_pep8(self):
create_function = 'def {name}():\n' \
' """\n{examples}\n """\n' \
' pass\n'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As said, I have a preference to have only the examples code in the temporary file. Extracting the code from the examples is trivial, and once #23161 is merged we'll have a property in this class that returns the lines of code.

What you will have to do is before writing the code in the examples, write import numpy as np and same for pandas, so flake8 does not complain because of that.


application = flake8.Application()
application.initialize(["--doctests", "--quiet"])
application.run_checks([file.name])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you just leave the writing to the file and this line inside the context manager, and creating the app object, initializing... have it outside?

scripts/validate_docstrings.py Outdated Show resolved Hide resolved
@@ -490,6 +514,13 @@ def validate_one(func_name):
for param_err in param_errs:
errs.append('\t{}'.format(param_err))

pep8_errs = [error for error in doc.validate_pep8()]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is easier to read:

Suggested change
pep8_errs = [error for error in doc.validate_pep8()]
pep8_errs = list(doc.validate_pep8())

datapythonista and others added 2 commits November 3, 2018 17:45
Co-Authored-By: FHaase <haase.fabian@gmail.com>
Signed-off-by: Fabian Haase <haase.fabian@gmail.com>
Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks quite good, added a couple of comments about what to expect from #23161, and some style preferences, but we're almost there.

I'm really looking forward to have this merged. After that I'll add this validation to the CI, so we don't need to keep reviewing the pep8 of examples manually, which is a significant amount of work.

Thanks for all the work on this.

"""
pass

def pandas_imported(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the misunderstanding. numpy_imported and pandas_imported and the corresponding tests are not needed, as that is already tested in #23161.

As mentioned in the previous review, forget about these two imports in the examples in this PR. You are repeating what's implemented in #23161 which will be merged earlier.

' are imported as pd or np',)),
('BadExamples', 'pandas_imported',
('F811 It\'s assumed that pandas and numpy'
' are imported as pd or np',)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these two examples need to be removed

"""
content = ''.join(('import numpy as np; '
'import pandas as pd # noqa: F401,E702\n',
*self.examples_source_code))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't it be simpler to join by \n instead, and remove the semicolon and the noqa?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • regarding E702: with one line the great thing was to achieve intercepting of the error message.
  • regarding F401: All examples not using both: np and pd would fail because of an unused import.

The implementation of self.examples_source_code of PR #23161 returns an array with \n at the end, so joining with \n isn't possible.

if statistic.message.endswith('from line 1'):
statistic.message = "It's assumed that pandas and numpy" \
" are imported as pd or np"
yield statistic
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess with the original yield from we may get a pep8 error of duplicate import, or something similar. I'm fine with that. Besides the pep8 errors, we'll have the error of the other section implemented in #23161, which should make it very easy for users to identify the problem.

So, I'd simply leave the yield from ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found F811 redefinition of unused 'np' from line 1 could be confusing when read first.

application = flake8.main.application.Application()
application.initialize(["--quiet"])
application.run_checks([file.name])
application.report()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I wasn't clear in my previous review about this. This is what I'd do:

application = flake8.main.application.Application()
application.initialize(["--quiet"])
with tempfile.NamedTemporaryFile(mode='w') as f:
    f.write(content)
    f.flush()
    application.run_checks([f.name])
application.report()

I think it's a good practice to have inside the context manager, the code that requires to be there.

Also, as this file is going quite significantly, I think it's better to keep all this functionality in a single function, so each validation is somehow isolated. I think in another context it makes more sense what you did of splitting, but I don't think we'll reuse the temporary file creation, and I think the improvement in simplicity makes it worth.

yield file

def validate_pep8(self):
if self.examples:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small thing, but my preference is having:

if not self.examples:
    return

this avoids having the rest of the function indented, and in my opinion increases readability.

Signed-off-by: Fabian Haase <haase.fabian@gmail.com>
…docstrings

Signed-off-by: Fabian Haase <haase.fabian@gmail.com>
Signed-off-by: Fabian Haase <haase.fabian@gmail.com>
Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great, thanks @FHaase

Just one suggestion, but ready to be merged on green I think. @jreback you requested changed, if you want to take a look.

if not self.examples:
return

content = ''.join(('import numpy as np # noqa: F401\n',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be we can add a comment before this line explaining why the F401 is required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well the content basically never shows as it's only existing within tmp files. And if someone thinks it's unnecessary the tests would fail.
So I don't think commenting is needed there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment was not for the temp file, but our code, and I think it's useful to know what the F401 is about when reading this part of code (without having to remove it and see that fails). Anyway, this is merged now. :)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge when satisfied

@datapythonista datapythonista merged commit b9fc22d into pandas-dev:master Nov 4, 2018
@datapythonista
Copy link
Member

Thanks @FHaase, extremely useful PR

@FHaase FHaase deleted the feature/validate_docstrings branch November 4, 2018 19:19
JustinZhengBC pushed a commit to JustinZhengBC/pandas that referenced this pull request Nov 14, 2018
tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Code Style Code style, linting, code_checks Docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants