
Add pandas support #684

Merged
merged 4 commits on Sep 6, 2018
Conversation

znicholls
Contributor

@znicholls znicholls commented Aug 28, 2018

This pull request adds pandas support to pint (hence is related to #645, #401 and pandas-dev/pandas#10349).

An example can be seen in example-notebooks/basic-example.ipynb.

It's a little bit hackish; feedback would be greatly appreciated by me and @andrewgsavage. One obvious example is that we have to run all the interface tests with pytest to fit with the pandas test suite, which introduces a dependency for the CI and currently gives us this awkward testing setup (see the alterations we had to make to testsuite). This also means that our code coverage tests are fiddly too.

If you'd like us to squash the commits, that can be done.

If pint has a linter, it would be good to run that over this pull request too as we're a little bit all over the place re style.

Things to discuss:

  • general feedback and changes
  • test setup, especially need for pytest for pandas tests and hackish way to get around automatic discovery
  • squashing/rebasing
  • linting/other code style (related to Code Quality #664 and Added .editorconfig #628: we're happy with whatever, I've found using an automatic linter e.g. black and/or flake8 has made things much simpler in other projects)
  • including notebooks in the repo (if we want to, I'm happy to put them under CI so we can make sure they run)
  • setting up the docs correctly
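One common way to handle the pytest/unittest discovery clash mentioned above is a conftest.py guard; the sketch below is hypothetical (the test module name is an assumption, not taken from this PR):

```python
# conftest.py (hypothetical sketch): keep pytest from collecting the
# pandas interface tests when pandas is not installed, so the plain
# unittest jobs are unaffected by the extra dependency.
collect_ignore = []

try:
    import pandas  # noqa: F401
except ImportError:
    # the name of the pandas-interface test module is an assumption
    collect_ignore.append("test_pandas_interface.py")
```

pytest reads `collect_ignore` from conftest.py at collection time, so the pandas tests simply disappear from runs where pandas is absent.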

@znicholls znicholls force-pushed the master branch 2 times, most recently from 0fa2a51 to 29296db on August 29, 2018
@znicholls
Contributor Author

znicholls commented Aug 29, 2018

I don't understand why this is showing a drop in coverage, or how to fix the coverage if there actually is a drop.

@andrewgsavage
Collaborator

Is it because more lines have been added, but only tested in the pandas job, so the non-pandas job coverages have fallen?

That's what I get from comparing the most recent build https://coveralls.io/builds/18732747 to an older build https://coveralls.io/builds/18028346

@znicholls
Contributor Author

Is it because more lines have been added, but only tested in the pandas job, so the non-pandas job coverages have fallen?

Nice, found a solution.

@jondoesntgit
Contributor

Good work guys!

Yes, coverage is a ratio of (tested lines)/(total lines). That goes down if you write lines of code without adding tests.
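That ratio can be made concrete with a toy calculation (the numbers here are invented for illustration):

```python
def coverage_ratio(tested_lines, total_lines):
    """Line coverage as a fraction: tested statements / total statements."""
    return tested_lines / total_lines

# A fully covered 200-line module...
before = coverage_ratio(200, 200)   # 1.0
# ...after adding 50 new lines with no tests: coverage drops to 80%.
after = coverage_ratio(200, 250)    # 0.8
```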

@znicholls
Contributor Author

@hgrecco any possibility you could let us know whether this is even a feasible merge or whether we should start building a separate repo please?

@znicholls
Contributor Author

Yes, coverage is a ratio of (tested lines)/(total lines). That goes down if you write lines of code without adding tests.

Thanks, I think we found a solution in the end. Basically making sure that coverage only considers the pandas stuff if the pandas integration is being tested.
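The "only count the pandas code when the pandas integration is being tested" approach described above is commonly done with a coverage omit rule; this .coveragerc fragment is a hypothetical sketch (the module path is an assumption, not taken from the PR):

```ini
# Hypothetical .coveragerc for the non-pandas CI jobs: exclude the pandas
# interface so its lines do not drag down the coverage ratio there.
[run]
omit =
    pint/pandas_interface/*
```

The pandas job would then run without this omit rule, so those lines are measured exactly once.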

@hgrecco
Owner

hgrecco commented Sep 2, 2018

This is a long awaited feature, so I am eager to merge it. In general, I like the code and the API. But it is a long PR and I need an extra couple of days to play with it.

@znicholls
Contributor Author

Great, thanks. Take as long as you need! For our research group, if we have a rough timeline then we can work around it (trying to work with no information is always the most difficult).

@jondoesntgit
Contributor

Will probably want to include some docs. The example ipynb file is good, but folks who read the readthedocs page might not know to look for it on the GitHub repo.

@znicholls
Contributor Author

Will probably want to include some docs

Yep sounds good! Neither @andrewgsavage nor I could get our heads around how the docs work, do you have a resource you can point to/can you explain where the best place to add them is?

@dalito
Contributor

dalito commented Sep 3, 2018

To add to the current documentation structure, I suggest creating a new file pint/docs/pandas.rst and referencing it in pint/docs/index.rst. You may use pint/docs/numpy.rst as a guideline for the new pandas.rst.
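That suggestion maps onto Sphinx's toctree mechanism; a minimal sketch of the index.rst change (the entry names assume the new page sits next to the existing numpy one):

```rst
.. in docs/index.rst, extend the existing toctree with the new page:

.. toctree::
   :maxdepth: 1

   numpy
   pandas
```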

@znicholls
Contributor Author

znicholls commented Sep 3, 2018

To add to the current documentation structure, I suggest creating a new file pint/docs/pandas.rst and referencing it in pint/docs/index.rst. You may use pint/docs/numpy.rst as a guideline for the new pandas.rst.

Cheers, have had a crack. I don't really understand how the doctests work so they may blow up... Some other things are also not that nice but hopefully with some guidance we can turn this into a nice pull request. @andrewgsavage jump in with any additions you want!

@andrewgsavage
Collaborator

Nice, docs! There are a few parts I should probably explain better. Will do so when I have a chance.

Have you tried using df.to_html? That might display the dfs like jupyter does.

Noticed that the dfs are now showing with the unit in the column. I think this is because of a change made to the array without changing _formatting_values. I'm not sure which I prefer: it's nice to see the units, but that reduces the space available for columns/is distracting.

znicholls added a commit to znicholls/pandas that referenced this pull request Sep 3, 2018
We are working on upgrading pint to be compatible with pandas, see hgrecco/pint#684

I am guessing that the line in the docs, 'If you’re building a library that implements the interface, please publicize it on Extension Data Types.', meant something like this pull request. If that's completely wrong, apologies.
@jondoesntgit
Contributor

Good job with the docs!

I think I’m partial to showing units in the column header. Perhaps on a line below in parentheses (maybe printing using a newline character). That keeps the data frame horizontally compact.

I think there are pros and cons to wherever you put the units though.

@znicholls
Contributor Author

znicholls commented Sep 3, 2018

Have you tried using df.to_html? That might display the dfs like jupyter does.

Yep I thought about that. I didn't know what this 'doctest' stuff was doing though so got nervous and didn't...

I think I’m partial to showing units in the column header. Perhaps on a line below in parentheses (maybe printing using a newline character). That keeps the data frame horizontally compact.

This sounds super nice. If we follow that convention then we force users to only have a single unit for a given column. My only question was whether we want that to be the case (I think it's the most sensible, but it's not the only option). We could, for example, allow a column to have different units within the column (e.g. the first value could be in metres, the second in centimetres, the third in millimetres, etc.). That may quickly become a nightmare though...
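The difference between the two options can be sketched without pint at all; this hypothetical helper (conversion table and names invented, not the PintArray implementation) shows what mixed units within a column would force every operation to do first, which is why a single unit per column is simpler:

```python
# Hypothetical sketch, not the PintArray implementation: with mixed units
# in one column, every operation must first normalise to a common unit.
TO_METRE = {"m": 1.0, "cm": 0.01, "mm": 0.001}

def normalise(values, units, target="m"):
    """Convert (value, unit) pairs to a single target unit."""
    factor = TO_METRE[target]
    return [v * TO_METRE[u] / factor for v, u in zip(values, units)]

# first value in metres, second in centimetres, third in millimetres
normalise([1, 200, 3000], ["m", "cm", "mm"])  # [1.0, 2.0, 3.0]
```

With one unit per column, that normalisation happens once on construction instead of on every access.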

@hgrecco
Owner

hgrecco commented Sep 3, 2018

I agree that the docs are very nice. A few suggestions regarding the docs, but also the API:

  • In my experience, loading from a package resource brings trouble. Either load it from a URL or, better, import the example dataframe (this can be done using the Python machinery for doing so, which is better than hard-coding the location), then explain how to save it and then load it.

  • Does changing the default formatter affect how it is shown in the dataframe? If so, it would be nice to show it. If not, maybe we should implement it.

@andrewgsavage
Collaborator

andrewgsavage commented Sep 3, 2018

I think I’m partial to showing units in the column header. Perhaps on a line below in parentheses (maybe printing using a newline character). That keeps the data frame horizontally compact.

Same, but I don't believe this is possible. The .dequantify() adds an extra row to the columns, which was the closest I could get.
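The extra row .dequantify() adds is a second column level; this standalone pandas snippet (data and unit names invented) emulates what that output looks like:

```python
import pandas as pd

# Hypothetical data: emulate dequantify-style output, where a second
# column level carries the unit for each column.
df = pd.DataFrame({"distance": [1.0, 2.0], "force": [3.0, 4.0]})
df.columns = pd.MultiIndex.from_arrays(
    [df.columns, ["meter", "newton"]], names=[None, "unit"]
)
```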

Will need to look into the formatting again, as it might be possible to show the units in the first row of data only, e.g.

meter * Newton
123

Is this desirable?

This sounds super nice. If we follow that convention then we force users to only have a single unit for a given column

Data is stored as a pint quantity, which handles the unit converting and such. When you set a value in the PintArray to a different unit to the underlying quantity (in ._data), the value will be converted to the original unit.
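That set-then-convert behaviour can be illustrated with a minimal stand-in class (a sketch, not the real PintArray; the conversion table is invented):

```python
# Sketch of "setting a value in a different unit converts it to the
# underlying unit" -- not the actual PintArray code.
FACTORS = {"m": 1.0, "cm": 0.01}

class UnitArraySketch:
    def __init__(self, data, unit):
        self.unit = unit      # the single unit of the underlying data
        self.data = list(data)

    def __setitem__(self, i, value_and_unit):
        value, unit = value_and_unit
        # convert the incoming value into this array's unit before storing
        self.data[i] = value * FACTORS[unit] / FACTORS[self.unit]

arr = UnitArraySketch([1.0, 2.0], "m")
arr[0] = (150, "cm")  # stored as 1.5 (metres)
```

In the real implementation the underlying pint quantity handles the conversion; the sketch only shows the shape of the behaviour.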

@znicholls
Contributor Author

Data is stored as a pint quantity, which handles the unit converting and such. When you set a value in the PintArray to a different unit to the underlying quantity (in ._data), the value will be converted to the original unit.

Awesome

@jondoesntgit
Contributor

jondoesntgit commented Sep 3, 2018

It might require redefining __repr__ or to_html

@andrewgsavage
Collaborator

Does changing the default formatter affect how it is shown in the dataframe? If so, it would be nice to show it. If not, maybe we should implement it.

What do you mean by the default formatter? Pandas uses a separate formatter that we pass text/values into, _formatting_values.

@andrewgsavage
Collaborator

Awesome

Hmm, if you're questioning that, I take it it isn't obvious; maybe this needs an example/reference to the pint tutorial?

@znicholls
Contributor Author

znicholls commented Sep 3, 2018

Hmm, if you're questioning that, I take it it isn't obvious; maybe this needs an example/reference to the pint tutorial?

Reference to the pint tutorial, I think; I've never done it (can I admit that now..?)

@znicholls
Contributor Author

In my experience, loading from a package resource brings trouble. Either load it from a URL or, better, import the example dataframe (this can be done using the Python machinery for doing so, which is better than hard-coding the location), then explain how to save it and then load it.

Latest commit should fix this I believe

@znicholls
Contributor Author

In my experience, loading from a package resource brings trouble. Either load it from a URL or, better, import the example dataframe (this can be done using the Python machinery for doing so, which is better than hard-coding the location), then explain how to save it and then load it.

are you ok with having the test file in pint/testsuite/test-data or do you want that gone too?

@andrewgsavage
Collaborator

It might require redefining __repr__ or to_html

I'm assuming you're referring to pandas' __repr__/.to_html. These aren't things ExtensionArrays can change, so I can't see any way of going about it.

It might be worth submitting an issue to pandas to allow ExtensionArrays to display metadata in column headers.

@znicholls
Contributor Author

OK, I think I've done all the steps we wanted now. The last big problem is that, in the process of rebasing, I seem to have removed any mention of @andrewgsavage's contribution. I'm not very comfortable with this, as all of the interface code is his (I've basically only done the test code). Do you have a suggestion for how to fix this whilst also doing such a mega rebase?

What I do not know is how to include the pytests in a unittest.TestSuite. If we achieve that, then running python setup.py test will run all tests

I don't think that's possible either...

@znicholls
Contributor Author

@hgrecco this is good from our end (tests pending)

@znicholls znicholls closed this Sep 6, 2018
@znicholls znicholls reopened this Sep 6, 2018
@hgrecco
Owner

hgrecco commented Sep 6, 2018

bors r+

bors bot added a commit that referenced this pull request Sep 6, 2018
684: Add pandas support r=hgrecco a=znicholls

Co-authored-by: Zebedee Nicholls <zebedee.nicholls@climate-energy-college.org>
Co-authored-by: andrewgsavage <andrewgsavage@gmail.com>
@hgrecco
Owner

hgrecco commented Sep 6, 2018

I have asked bors to merge it. Big thanks and congratulations for such wonderful work, and thanks to everybody else who helped by reviewing and commenting on the code. I think that now we deserve a new release.

@bors
Contributor

bors bot commented Sep 6, 2018

Build succeeded

@bors bors bot merged commit d17c745 into hgrecco:master Sep 6, 2018
@hgrecco
Owner

hgrecco commented Sep 6, 2018

Done! I have opened #693 to track the new release

znicholls added a commit to znicholls/pandas that referenced this pull request Sep 6, 2018
znicholls added a commit to znicholls/pandas that referenced this pull request Sep 6, 2018
@esip-lab esip-lab mentioned this pull request Sep 7, 2018
@TomAugspurger

Would you all be OK with adding Pint's extension array to the pandas ecosystem page at http://pandas.pydata.org/pandas-docs/version/0.24/ecosystem.html#extension-data-types?

@hgrecco
Owner

hgrecco commented Jan 14, 2019

Sure. Please add something like beta, prerelease, or provisional. I think we need to test it better on our side via broader use.

@znicholls
Contributor Author

I think we have this here pandas-dev/pandas#22582 so maybe just need to re-open and update?

@TomAugspurger

TomAugspurger commented Jan 14, 2019 via email

@znicholls
Contributor Author

@hgrecco @andrewgsavage to be clear, pint no longer has the pandas support and we're moving it into https://github.com/hgrecco/pint-pandas?

@hgrecco
Owner

hgrecco commented Jan 14, 2019

True. But it would be an extension of pint. As soon as we think it is stable, we can add it to pint/setup.py as optional.

@znicholls
Contributor Author

Ok great. I'll update pandas-dev/pandas#22582 later today
