BUG: Fix Series/DataFrame.rank(pct=True) with more than 2**24 rows #23688

jschendel · 2018-11-14T07:27:34Z

closes BUG: Series.rank(pct=True).max() != 1 for a large series of floats #18271
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry
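
For context, the 2**24 threshold in the title matches the point where a single-precision float stops representing consecutive integers. The sketch below uses NumPy's `float32` as a stand-in for a C `float` counter. The variable names are illustrative, and this is an assumption about the failure mode drawn from the typing discussion in this thread, not the pandas implementation.

```python
import numpy as np

# float32 carries a 24-bit significand, so integers are exact only up to 2**24.
limit = 2**24

# A float32 counter stalls once it reaches 2**24: adding 1 rounds back down.
count32 = np.float32(limit)
stalled = bool((count32 + np.float32(1)) == count32)

# The same counter held in float64 keeps counting.
count64 = np.float64(limit)
advanced = bool((count64 + np.float64(1)) == limit + 1)

# Hypothetical consequence for a percentile rank: if the denominator stalls
# at 2**24 while the largest rank is n_rows, the maximum exceeds 1.
n_rows = limit + 10
max_pct_with_stalled_count = n_rows / float(count32)
max_pct_with_exact_count = n_rows / float(n_rows)

print(stalled, advanced)
print(max_pct_with_stalled_count > 1.0, max_pct_with_exact_count == 1.0)
```

With a double-precision counter the denominator stays exact far beyond any realistic row count, which is consistent with the typing change discussed below.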

pep8speaks · 2018-11-14T07:27:37Z

Hello @jschendel! Thanks for submitting the PR.

There are no PEP8 issues in the file pandas/tests/frame/test_rank.py !
There are no PEP8 issues in the file pandas/tests/series/test_rank.py !
There are no PEP8 issues in the file pandas/tests/test_algos.py !

WillAyd · 2018-11-14T07:39:33Z

FWIW good to have this typing consistent, though I'm surprised it's required given the Cython docs say that a float gets mapped to a double:

http://docs.cython.org/en/latest/src/tutorial/caveats.html

@jbrockmendel any insights? Does that seem like a bug in Cython?

codecov · 2018-11-14T08:03:53Z

Codecov Report

Merging #23688 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #23688   +/-   ##
=======================================
  Coverage   92.24%   92.24%           
=======================================
  Files         161      161           
  Lines       51318    51318           
=======================================
  Hits        47339    47339           
  Misses       3979     3979

| Flag | Coverage Δ | |
|---|---|---|
| #multiple | `90.63% <ø> (ø)` | ⬆️ |
| #single | `42.31% <ø> (ø)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a197837...a6abe28.

jreback · 2018-11-14T14:26:27Z

thanks!

jbrockmendel · 2018-11-14T14:46:42Z

> though I'm surprised it's required given the Cython docs say that a float gets mapped to a double

@WillAyd no idea; we recently changed all usages of "double" and "double_t" to "float64_t" largely so I didn't have to keep double-checking that they mean the same thing. Maybe @scoder can offer some insight?

scoder · 2018-11-16T11:01:28Z

> though I'm surprised it's required given the Cython docs say that a float gets mapped to a double

Note how that page says "Python's float type", not "C's float type". Whenever you say "cdef" in Cython, what follows is a C declaration. We specifically document that Python types like "int" and "float" are not (directly) usable in type declarations because they are shadowed by the more useful C types with the same name. There is no advantage at all in declaring a variable with those Python types, but there is good value in doing that with the C types.

In short, if you want a specific C type, name it.
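
The difference between the two types can be checked from plain Python with the standard-library `ctypes` module, which exposes both C types. This only illustrates the two widths, not Cython's code generation; the variable names are chosen for the example.

```python
import ctypes
import sys

# A Python float; CPython stores it as a C double.
value = float(2**24 + 1)

# Round-tripping through a C float (single precision) drops the final unit.
as_c_float = ctypes.c_float(value).value
# Round-tripping through a C double preserves the value exactly.
as_c_double = ctypes.c_double(value).value

print(sys.float_info.mant_dig)  # significand bits of Python's float
print(as_c_float == value)
print(as_c_double == value)
print(ctypes.sizeof(ctypes.c_float), ctypes.sizeof(ctypes.c_double))
```

A variable declared with the C `float` type would behave like `as_c_float` here, which is why naming the intended C type explicitly matters.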

scoder · 2018-11-16T11:45:45Z

> we recently changed all usages of "double" and "double_t" to "float64_t" largely so I didn't have to keep double-checking that they mean the same thing

In theory, they do not. The C standard does not guarantee specific behaviour for the double type, only that the precision is at least twice as high as for the single precision "float" type, i.e. it gives you minimum precision guarantees.

However, as long as your code is compiled in an IEEE-754 floating point environment, which applies to pretty much all relevant system types these days, the behaviour will be exactly that of a 64-bit IEEE-754 double precision binary floating point number.

Now, in practice the float64_t type is an alias for double for exactly these reasons, which means that while it might look more exact in code, it can also suggest a false safety. For example, enabling -ffast-math in gcc will allow it to diverge from the IEEE-754 standard for optimisation purposes, and float64_t will probably not save you from that. But at least it could, in the sense that float64_t, not being a standard C type, might simply be undefined (and your code would then fail to compile) if strict 64-bit IEEE-754 semantics are not available. (Although, again strictly speaking, float64_t does not enforce IEEE-754 compliance, so a hypothetical non-IEEE-754 64-bit floating point type could still qualify. ¯_(ツ)_/¯ )

Anyway, explicit is better than implicit. If you want exact 64-bit float semantics, saying so in your code is probably a good idea. If you can live with C double semantics, and that is what CPython does internally, for example, saying so is probably also a good idea.
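
On a typical IEEE-754 platform, the equivalence described above can be observed from NumPy, whose `float64` dtype is the Python-side counterpart of the `float64_t` name used in Cython code. A small sketch, assuming such a platform:

```python
import numpy as np

f64 = np.finfo(np.float64)
f32 = np.finfo(np.float32)

# np.float64 is 64 bits wide with a 52-bit stored mantissa (IEEE-754 binary64).
print(f64.bits, f64.nmant)
# np.float32 is 32 bits wide with a 23-bit stored mantissa (IEEE-754 binary32).
print(f32.bits, f32.nmant)

# The dtype character "d" (C double) and np.float64 name the same dtype,
# as do "f" (C float) and np.float32.
same_double = np.dtype("d") == np.dtype(np.float64)
same_single = np.dtype("f") == np.dtype(np.float32)
print(same_double, same_single)
```

This checks only widths and naming; it says nothing about compiler flags such as `-ffast-math`, which affect how arithmetic on those types is carried out.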

jbrockmendel · 2018-11-16T16:04:14Z

@scoder thanks for clarifying. Have I mentioned recently how nice it is to not have to worry about these things in python and 90+% of the time in cython?

@chris-b1 @jreback this is above my pay grade. Is there any chance that we need to revert parts of #23583 to use double/double_t instead of float64_t?

scoder · 2018-11-16T16:21:58Z

I might not have enough insight into the details here, but I don't think reverting is necessary. It's a bit more verbose that way, but it probably also reflects what your code does. Specifically, if you are programming against NumPy, then NumPy's data type API is more relevant than CPython's internals or C's double type here, so matching np.float64 in Python with float64_t in C/Cython seems right.

BUG: Fix Series/DataFrame.rank(pct=True) with more than 2**24 rows

a6abe28

jschendel added the Bug, Numeric Operations (Arithmetic, Comparison, and Logical operations), and Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) labels Nov 14, 2018

jschendel added this to the 0.24.0 milestone Nov 14, 2018

WillAyd approved these changes Nov 14, 2018

jreback merged commit 4476962 into pandas-dev:master Nov 14, 2018

jschendel deleted the rank-pct-max branch November 14, 2018 15:31

JustinZhengBC pushed a commit to JustinZhengBC/pandas that referenced this pull request Nov 14, 2018: 2688cbe

jbrockmendel mentioned this pull request Nov 15, 2018: CI/BUG?: test_pct_max_many_rows crashes on travis-27. #23726 (Closed)

tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018: 3f9db65

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019: 1088390

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019: 2b96967
