
Add pandas support #684

Merged
merged 4 commits on Sep 6, 2018
Conversation

znicholls
Contributor

@znicholls znicholls commented Aug 28, 2018

This pull request adds pandas support to pint (hence is related to #645, #401 and pandas-dev/pandas#10349).

An example can be seen in example-notebooks/basic-example.ipynb.

It's a little bit hackish; feedback would be greatly appreciated by me and @andrewgsavage. One obvious example is that we have to run all the interface tests with pytest to fit with the pandas test suite, which introduces a dependency for the CI and currently gives us this awkward testing setup (see the alterations we had to make to testsuite). This also means that our code coverage tests are fiddly too.

If you'd like us to squash the commits, that can be done.

If pint has a linter, it would be good to run that over this pull request too as we're a little bit all over the place re style.

Things to discuss:

  • general feedback and changes
  • test setup, especially need for pytest for pandas tests and hackish way to get around automatic discovery
  • squashing/rebasing
  • linting/other code style (related to Code Quality #664 and Added .editorconfig #628: we're happy with whatever, I've found using an automatic linter e.g. black and/or flake8 has made things much simpler in other projects)
  • including notebooks in the repo (if we want to, I'm happy to put them under CI so we can make sure they run)
  • setting up the docs correctly
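One common way to handle the pytest/unittest discovery clash mentioned above is a conftest.py guard; the sketch below is hypothetical (the test module name is an assumption, not taken from this PR):

```python
# conftest.py (hypothetical sketch): keep pytest from collecting the
# pandas interface tests when pandas is not installed, so the plain
# unittest jobs are unaffected by the extra dependency.
collect_ignore = []

try:
    import pandas  # noqa: F401
except ImportError:
    # the name of the pandas-interface test module is an assumption
    collect_ignore.append("test_pandas_interface.py")
```

pytest reads `collect_ignore` from conftest.py at collection time, so the pandas tests simply disappear from runs where pandas is absent.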

@znicholls znicholls force-pushed the master branch 2 times, most recently from 0fa2a51 to 29296db on August 29, 2018
@znicholls
Contributor Author

znicholls commented Aug 29, 2018

I don't understand why this is showing a drop in coverage, or how to fix the coverage if there actually is a drop.

@andrewgsavage
Collaborator

Is it because more lines have been added, but only tested in the pandas job, so the non-pandas job coverages have fallen?

That's what I get from comparing the most recent build https://coveralls.io/builds/18732747 to an older build https://coveralls.io/builds/18028346

@znicholls
Contributor Author

Is it because more lines have been added, but only tested in the pandas job, so the non-pandas job coverages have fallen?

Nice, found a solution.

@jondoesntgit
Contributor

Good work guys!

Yes, coverage is a ratio of (tested lines)/(total lines). That goes down if you write lines of code without adding tests.
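That ratio can be made concrete with a toy calculation (the numbers here are invented for illustration):

```python
def coverage_ratio(tested_lines, total_lines):
    """Line coverage as a fraction: tested statements / total statements."""
    return tested_lines / total_lines

# A fully covered 200-line module...
before = coverage_ratio(200, 200)   # 1.0
# ...after adding 50 new lines with no tests: coverage drops to 80%.
after = coverage_ratio(200, 250)    # 0.8
```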

@znicholls
Contributor Author

@hgrecco any possibility you could let us know whether this is even a feasible merge or whether we should start building a separate repo please?

@znicholls
Contributor Author

Yes, coverage is a ratio of (tested lines)/(total lines). That goes down if you write lines of code without adding tests.

Thanks, I think we found a solution in the end. Basically making sure that coverage only considers the pandas stuff if the pandas integration is being tested.
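The "only count the pandas code when the pandas integration is being tested" approach described above is commonly done with a coverage omit rule; this .coveragerc fragment is a hypothetical sketch (the module path is an assumption, not taken from the PR):

```ini
# Hypothetical .coveragerc for the non-pandas CI jobs: exclude the pandas
# interface so its lines do not drag down the coverage ratio there.
[run]
omit =
    pint/pandas_interface/*
```

The pandas job would then run without this omit rule, so those lines are measured exactly once.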

@hgrecco
Owner

hgrecco commented Sep 2, 2018

This is a long awaited feature, so I am eager to merge it. In general, I like the code and the API. But it is a long PR and I need an extra couple of days to play with it.

@znicholls
Contributor Author

Great, thanks. Take as long as you need! For our research group, if we have a rough timeline then we can work around it (trying to work with no information is always the most difficult).

@jondoesntgit
Contributor

Will probably want to include some docs. The example ipynb file is good, but folks who read the readthedocs page might not know to look for it on the GitHub repo.

@znicholls
Contributor Author

Will probably want to include some docs

Yep sounds good! Neither @andrewgsavage nor I could get our heads around how the docs work, do you have a resource you can point to/can you explain where the best place to add them is?

@dalito
Contributor

dalito commented Sep 3, 2018

To add to the current documentation structure, I suggest creating a new file pint/docs/pandas.rst and referencing it in pint/docs/index.rst. You may use pint/docs/numpy.rst as a guideline for the new pandas.rst.
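That suggestion maps onto Sphinx's toctree mechanism; a minimal sketch of the index.rst change (the entry names assume the new page sits next to the existing numpy one):

```rst
.. in docs/index.rst, extend the existing toctree with the new page:

.. toctree::
   :maxdepth: 1

   numpy
   pandas
```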

@znicholls
Contributor Author

znicholls commented Sep 3, 2018

To add to the current documentation structure, I suggest creating a new file pint/docs/pandas.rst and referencing it in pint/docs/index.rst. You may use pint/docs/numpy.rst as a guideline for the new pandas.rst.

Cheers, have had a crack. I don't really understand how the doctests work so they may blow up... Some other things are also not that nice but hopefully with some guidance we can turn this into a nice pull request. @andrewgsavage jump in with any additions you want!

@andrewgsavage
Collaborator

Nice, docs! There are a few parts I should probably explain better. Will do so when I have a chance.

Have you tried using df.to_html? That might display the dfs like jupyter does.

Noticed that the dfs are now showing with the unit in the column. I think this is because of a change made to the array without changing _formatting_values. I'm not sure which I prefer: it's nice to see the units, but that reduces the space available for columns/is distracting.

znicholls added a commit to znicholls/pandas that referenced this pull request Sep 3, 2018
We are working on upgrading pint to be compatible with pandas, see hgrecco/pint#684

I am guessing that the line in the docs, 'If you’re building a library that implements the interface, please publicize it on Extension Data Types.', meant something like this pull request. If that's completely wrong, apologies.
@jondoesntgit
Contributor

Good job with the docs!

I think I’m partial to showing units in the column header. Perhaps on a line below in parentheses (maybe printing using a newline character). That keeps the data frame horizontally compact.

I think there are pros and cons to wherever you put the units though.

@znicholls
Contributor Author

znicholls commented Sep 3, 2018

Have you tried using df.to_html? That might display the dfs like jupyter does.

Yep I thought about that. I didn't know what this 'doctest' stuff was doing though so got nervous and didn't...

I think I’m partial to showing units in the column header. Perhaps on a line below in parentheses (maybe printing using a newline character). That keeps the data frame horizontally compact.

This sounds super nice. If we follow that convention then we force users to only have a single unit for a given column. My only question was whether we want that to be the case (I think it's the most sensible, but it's not the only option). We could, for example, allow a column to have different units within the column (e.g. the first value could be in metres, the second in centimetres, the third in millimetres, etc.). That may quickly become a nightmare though...
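The difference between the two options can be sketched without pint at all; this hypothetical helper (conversion table and names invented, not the PintArray implementation) shows what mixed units within a column would force every operation to do first, which is why a single unit per column is simpler:

```python
# Hypothetical sketch, not the PintArray implementation: with mixed units
# in one column, every operation must first normalise to a common unit.
TO_METRE = {"m": 1.0, "cm": 0.01, "mm": 0.001}

def normalise(values, units, target="m"):
    """Convert (value, unit) pairs to a single target unit."""
    factor = TO_METRE[target]
    return [v * TO_METRE[u] / factor for v, u in zip(values, units)]

# first value in metres, second in centimetres, third in millimetres
normalise([1, 200, 3000], ["m", "cm", "mm"])  # [1.0, 2.0, 3.0]
```

With one unit per column, that normalisation happens once on construction instead of on every access.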

@hgrecco
Owner

hgrecco commented Sep 3, 2018

I agree that the docs are very nice. A few suggestions regarding the docs, but also the API:

  • In my experience, loading from a package resource brings trouble. Either load it from a URL or, better, import the example dataframe (this can be done using the Python machinery for doing so, which is better than hard-coding the location), then explain how to save it and then load it.

  • Does changing the default formatter affect how it is shown in the dataframe? If so, it would be nice to show it. If not, maybe we should implement it.

@andrewgsavage
Collaborator

andrewgsavage commented Sep 3, 2018

I think I’m partial to showing units in the column header. Perhaps on a line below in parentheses (maybe printing using a newline character). That keeps the data frame horizontally compact.

Same, but I don't believe this is possible. The .dequantify() adds an extra row to the columns, which was the closest I could get.
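The extra row .dequantify() adds is a second column level; this standalone pandas snippet (data and unit names invented) emulates what that output looks like:

```python
import pandas as pd

# Hypothetical data: emulate dequantify-style output, where a second
# column level carries the unit for each column.
df = pd.DataFrame({"distance": [1.0, 2.0], "force": [3.0, 4.0]})
df.columns = pd.MultiIndex.from_arrays(
    [df.columns, ["meter", "newton"]], names=[None, "unit"]
)
```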

Will need to look into the formatting again, as it might be possible to show the units in the first row of data only, e.g.

meter * Newton
123

Is this desirable?

This sounds super nice. If we follow that convention then we force users to only have a single unit for a given column

Data is stored as a pint quantity, which handles the unit converting and such. When you set a value in the PintArray to a different unit to the underlying quantity (in ._data), the value will be converted to the original unit.
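That set-then-convert behaviour can be illustrated with a minimal stand-in class (a sketch, not the real PintArray; the conversion table is invented):

```python
# Sketch of "setting a value in a different unit converts it to the
# underlying unit" -- not the actual PintArray code.
FACTORS = {"m": 1.0, "cm": 0.01}

class UnitArraySketch:
    def __init__(self, data, unit):
        self.unit = unit      # the single unit of the underlying data
        self.data = list(data)

    def __setitem__(self, i, value_and_unit):
        value, unit = value_and_unit
        # convert the incoming value into this array's unit before storing
        self.data[i] = value * FACTORS[unit] / FACTORS[self.unit]

arr = UnitArraySketch([1.0, 2.0], "m")
arr[0] = (150, "cm")  # stored as 1.5 (metres)
```

In the real implementation the underlying pint quantity handles the conversion; the sketch only shows the shape of the behaviour.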

@znicholls
Contributor Author

Data is stored as a pint quantity, which handles the unit converting and such. When you set a value in the PintArray to a different unit to the underlying quantity (in ._data), the value will be converted to the original unit.

Awesome

@jondoesntgit
Contributor

jondoesntgit commented Sep 3, 2018

It might require redefining __repr__ or to_html

@andrewgsavage
Collaborator

Does changing the default formatter affect how it is shown in the dataframe? If so, it would be nice to show it. If not, maybe we should implement it.

What do you mean by the default formatter? Pandas uses a separate formatter that we pass text/values into, _formatting_values.

@andrewgsavage
Collaborator

Awesome

Hmm, if you're questioning that, I take it it isn't obvious; maybe this needs an example/reference to the pint tutorial?

@znicholls
Contributor Author

znicholls commented Sep 3, 2018

Hmm, if you're questioning that, I take it it isn't obvious; maybe this needs an example/reference to the pint tutorial?

Reference to the pint tutorial, I think; I've never done it (can I admit that now..?)

@znicholls
Contributor Author

In my experience, loading from a package resource brings trouble. Either load it from a URL or, better, import the example dataframe (this can be done using the Python machinery for doing so, which is better than hard-coding the location), then explain how to save it and then load it.

Latest commit should fix this I believe

@znicholls
Contributor Author

In my experience, loading from a package resource brings trouble. Either load it from a URL or, better, import the example dataframe (this can be done using the Python machinery for doing so, which is better than hard-coding the location), then explain how to save it and then load it.

are you ok with having the test file in pint/testsuite/test-data or do you want that gone too?

@andrewgsavage
Collaborator

It might require redefining __repr__ or to_html

I'm assuming you're referring to pandas' __repr__/.to_html. These aren't things ExtensionArrays can change, so I can't see any way of going about it.

It might be worth submitting an issue to pandas to allow ExtensionArrays to display metadata in column headers.

@znicholls
Contributor Author

OK, I think I've done all the steps we wanted now. The last big problem is that, in the process of rebasing, I seem to have removed any mention of @andrewgsavage's contribution. I'm not very comfortable with this, as all of the interface code is his (I've basically only done the test code). Do you have a suggestion for how to fix this whilst also doing such a mega rebase?

What I do not know is how to include the pytests in a unittest.TestSuite. If we achieve that, then running python setup.py test will run all tests

I don't think that's possible either...

@znicholls
Contributor Author

@hgrecco this is good from our end (tests pending)

@znicholls znicholls closed this Sep 6, 2018
@znicholls znicholls reopened this Sep 6, 2018
@hgrecco
Owner

hgrecco commented Sep 6, 2018

bors r+

bors bot added a commit that referenced this pull request Sep 6, 2018
684: Add pandas support r=hgrecco a=znicholls

Co-authored-by: Zebedee Nicholls <zebedee.nicholls@climate-energy-college.org>
Co-authored-by: andrewgsavage <andrewgsavage@gmail.com>
@hgrecco
Owner

hgrecco commented Sep 6, 2018

I have asked bors to merge it. Big thanks and congratulations for such wonderful work, and thanks to everybody else who helped by reviewing and commenting on the code. I think that now we deserve a new release.

@bors
Contributor

bors bot commented Sep 6, 2018

Build succeeded

@bors bors bot merged commit d17c745 into hgrecco:master Sep 6, 2018
@hgrecco
Owner

hgrecco commented Sep 6, 2018

Done! I have opened #693 to track the new release

znicholls added a commit to znicholls/pandas that referenced this pull request Sep 6, 2018
znicholls added a commit to znicholls/pandas that referenced this pull request Sep 6, 2018
@esip-lab esip-lab mentioned this pull request Sep 7, 2018
@TomAugspurger

Would you all be OK with adding Pint's extension array to the pandas ecosystem page at http://pandas.pydata.org/pandas-docs/version/0.24/ecosystem.html#extension-data-types?

@hgrecco
Owner

hgrecco commented Jan 14, 2019

Sure. Please add something like beta, prerelease, or provisional. I think we need to test it better on our side via broader use.

@znicholls
Contributor Author

I think we have this here pandas-dev/pandas#22582 so maybe just need to re-open and update?

@TomAugspurger

TomAugspurger commented Jan 14, 2019 via email

@znicholls
Contributor Author

@hgrecco @andrewgsavage to be clear, pint no longer has the pandas support and we're moving it into https://github.com/hgrecco/pint-pandas?

@hgrecco
Owner

hgrecco commented Jan 14, 2019

True. But it would be an extension of pint. As soon as we think it is stable, we can add it to pint/setup.py as optional.

@znicholls
Contributor Author

Ok great. I'll update pandas-dev/pandas#22582 later today
