Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix Series/DataFrame.rank(pct=True) with more than 2**24 rows #23688

Merged
merged 1 commit into from
Nov 14, 2018

Conversation

jschendel
Copy link
Member

@jschendel jschendel added Bug Numeric Operations Arithmetic, Comparison, and Logical operations Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Nov 14, 2018
@jschendel jschendel added this to the 0.24.0 milestone Nov 14, 2018
@pep8speaks
Copy link

Hello @jschendel! Thanks for submitting the PR.

@WillAyd
Copy link
Member

WillAyd commented Nov 14, 2018

FWIW good to have this typing consistent, though I'm surprised it's required given the Cython docs say that a float gets mapped to a double:

http://docs.cython.org/en/latest/src/tutorial/caveats.html

@jbrockmendel any insights? Does that seem like a bug in Cython?

@codecov
Copy link

codecov bot commented Nov 14, 2018

Codecov Report

Merging #23688 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #23688   +/-   ##
=======================================
  Coverage   92.24%   92.24%           
=======================================
  Files         161      161           
  Lines       51318    51318           
=======================================
  Hits        47339    47339           
  Misses       3979     3979
Flag Coverage Δ
#multiple 90.63% <ø> (ø) ⬆️
#single 42.31% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a197837...a6abe28. Read the comment docs.

@jreback jreback merged commit 4476962 into pandas-dev:master Nov 14, 2018
@jreback
Copy link
Contributor

jreback commented Nov 14, 2018

thanks!

@jbrockmendel
Copy link
Member

though I'm surprised it's required given the Cython docs say that a float gets mapped to a double

@WillAyd no idea; we recently changed all usages of "double" and "double_t" to "float64_t" largely so I didn't have to keep double-checking that they mean the same thing. Maybe @scoder can offer some insight?

@jschendel jschendel deleted the rank-pct-max branch November 14, 2018 15:31
thoo added a commit to thoo/pandas that referenced this pull request Nov 15, 2018
* upstream/master: (25 commits)
  DOC: Delete trailing blank lines in docstrings. (pandas-dev#23651)
  DOC: Change release and whatsnew (pandas-dev#21599)
  DOC: Fix format of the See Also descriptions (pandas-dev#23654)
  DOC: update pandas.core.groupby.DataFrameGroupBy.resample docstring. (pandas-dev#20374)
  ENH: Allow export of mixed columns to Stata strl (pandas-dev#23692)
  CLN: Remove unnecessary code (pandas-dev#23696)
  Pin flake8-rst version (pandas-dev#23699)
  Implement _most_ of the EA interface for DTA/TDA (pandas-dev#23643)
  CI: raise clone depth limit on CI
  BUG: Fix Series/DataFrame.rank(pct=True) with more than 2**24 rows (pandas-dev#23688)
  REF: Move Excel names parameter handling to CSV (pandas-dev#23690)
  DOC: Accessing files from a S3 bucket. (pandas-dev#23639)
  Fix errorbar visualization (pandas-dev#23674)
  DOC: Surface / doc mangle_dupe_cols in read_excel (pandas-dev#23678)
  DOC: Update is_sparse docstring (pandas-dev#19983)
  BUG: Fix read_excel w/parse_cols & empty dataset (pandas-dev#23661)
  Add to_flat_index method to MultiIndex (pandas-dev#22866)
  CLN: Move to_excel to generic.py (pandas-dev#23656)
  TST: IntervalTree.get_loc_interval should return platform int (pandas-dev#23660)
  CI: Allow to compile docs with ipython 7.11 pandas-dev#22990 (pandas-dev#23655)
  ...
thoo added a commit to thoo/pandas that referenced this pull request Nov 15, 2018
…fixed

* upstream/master:
  DOC: Delete trailing blank lines in docstrings. (pandas-dev#23651)
  DOC: Change release and whatsnew (pandas-dev#21599)
  DOC: Fix format of the See Also descriptions (pandas-dev#23654)
  DOC: update pandas.core.groupby.DataFrameGroupBy.resample docstring. (pandas-dev#20374)
  ENH: Allow export of mixed columns to Stata strl (pandas-dev#23692)
  CLN: Remove unnecessary code (pandas-dev#23696)
  Pin flake8-rst version (pandas-dev#23699)
  Implement _most_ of the EA interface for DTA/TDA (pandas-dev#23643)
  CI: raise clone depth limit on CI
  BUG: Fix Series/DataFrame.rank(pct=True) with more than 2**24 rows (pandas-dev#23688)
  REF: Move Excel names parameter handling to CSV (pandas-dev#23690)
  DOC: Accessing files from a S3 bucket. (pandas-dev#23639)
  Fix errorbar visualization (pandas-dev#23674)
  DOC: Surface / doc mangle_dupe_cols in read_excel (pandas-dev#23678)
  DOC: Update is_sparse docstring (pandas-dev#19983)
  BUG: Fix read_excel w/parse_cols & empty dataset (pandas-dev#23661)
  Add to_flat_index method to MultiIndex (pandas-dev#22866)
  CLN: Move to_excel to generic.py (pandas-dev#23656)
  TST: IntervalTree.get_loc_interval should return platform int (pandas-dev#23660)
@scoder
Copy link

scoder commented Nov 16, 2018

though I'm surprised it's required given the Cython docs say that a float gets mapped to a double

Note how that page says "Python's float type", not "C's float type". Whenever you say "cdef" in Cython, what follows is a C declaration. We specifically document that Python types like "int" and "float" are not (directly) usable in type declarations because they are shadowed by the more useful C types with the same name. There is no advantage at all in declaring a variable with those Python types, but there is good value in doing that with the C types.

In short, if you want a specific C type, name it.

@scoder
Copy link

scoder commented Nov 16, 2018

we recently changed all usages of "double" and "double_t" to "float64_t" largely so I didn't have to keep double-checking that they mean the same thing

In theory, they do not. The C standard does not guarantee specific behaviour for the double type, only that the precision is at least twice as high as for the single precision "float" type, i.e. it gives you minimum precision guarantees.

However, as long as your code is compiled in an IEEE-754 floating point environment, which applies to pretty much all relevant system types these days, the behaviour will be exactly that of a 64-bit IEEE-754 double precision binary floating point number.

Now, in practice the float64_t type is an alias for double for exactly these reasons, which means that while it might look more exact in code, it can also suggest a false safety. For example, enabling -ffast-math in gcc will allow it to diverge from the IEEE-754 standard for optimisation purposes, and float64_t will probably not save you from that. But at least it could, in the sense that float64_t, not being a standard C type, might simply be undefined (and your code would then fail to compile) if strict 64-bit IEEE-754 semantics are not available. (Although, again strictly speaking, float64_t does not enforce IEEE-754 compliance, so a hypothetical non-IEEE-754 64-bit floating point type could still qualify. ¯_(ツ)_/¯ )

Anyway, explicit is better than implicit. If you want exact 64-bit float semantics, saying so in your code is probably a good idea. If you can live with C double semantics, and that is what CPython does internally, for example, saying so is probably also a good idea.

@jbrockmendel
Copy link
Member

@scoder thanks for clarifying. Have I mentioned recently how nice it is to not have to worry about these things in python and 90+% of the time in cython?

@chris-b1 @jreback this is above my pay grade. Is there any chance that we need to revert parts of #23583 to use double/double_t instead of float64_t?

@scoder
Copy link

scoder commented Nov 16, 2018

I might not have enough insight into the details here, but I don't think reverting is necessary. It's a bit more verbose that way, but it probably also reflects what your code does. Specifically, if you are programming against NumPy, then NumPy's data type API is more relevant than CPython's internals or C's double type here, so matching np.float64 in Python with float64_t in C/Cython seems right.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Series.rank(pct=True).max() != 1 for a large series of floats
6 participants