Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Introducing hash tables for complex64 and complex128 #38179

Merged
merged 16 commits into from
Dec 30, 2020

Conversation

realead
Copy link
Contributor

@realead realead commented Nov 30, 2020

  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Next step for #33287

The complex case is somewhat less straight forward, because Cython defines complex128/complex64 as:

ctypedef float complex  complex64_t
ctypedef double complex complex128_t

where double complex actually means a Cython-datatype defined here https://github.com/cython/cython/blob/master/Cython/Utility/Complex.c#L56

Thus a conversion between numpy's complex and Cython's complex must happens somewhere.

@realead
Copy link
Contributor Author

realead commented Nov 30, 2020

FYI @jbrockmendel

Some polishing is still needed and probably additional testing, but this is how it could look like.

Maybe you know a better solution for the whole numpy's <->cython's complex-business.

There are also some details, which maybe need some tinkering:

  • Right now, nan+1j and 1+nan*j are considered not-a-number. However, they aren't considered equivalent. Not sure that doesn't make any problems.
  • inf +nan*j is considered not-a-number, maybe it makes more sense to see it as infinity and not not-a-number

@@ -606,14 +681,15 @@ cdef class {{name}}HashTable(HashTable):
ignore_na=True, return_inverse=True)
return labels

{{if dtype == 'int64'}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why adding this only for int64?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is used just in int64-version:

table = hashtable.Int64HashTable(size_hint)
group_index = ensure_int64(group_index)
# note, group labels come out ascending (ie, 1,2,3 etc)
comp_ids, obs_group_ids = table.get_labels_groupby(group_index)

The logic in this function:

# specific for groupby
{{if dtype != 'uint64'}}
if val < 0:
labels[i] = -1
continue
{{endif}}

IIUC, negative numbers are special and should be skipped, but this is only a convention for the meaning of values group_index. For other types there is no such convention. Instead of inventing something, I've decided to delete unused functions/versions.

@realead realead changed the title WIP: PERF: Introducing hash tables for complex64 and complex128 PERF: Introducing hash tables for complex64 and complex128 Dec 1, 2020
@realead
Copy link
Contributor Author

realead commented Dec 1, 2020

Failures seem to be unrelated to my changes...

@jbrockmendel
Copy link
Member

Failures seem to be unrelated to my changes...

Yes. The npdev failure I think should be fixed by #38209 and the other one just annoyingly happens sometimes because hypothesis is slow. Don't worry about either.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good a few questions.

@@ -80,6 +82,8 @@ def test_map(self, table_type, dtype):
table = table_type()
keys = np.arange(N).astype(dtype)
vals = np.arange(N).astype(np.int64) + N
keys.flags.writeable = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize all of these on writeable True/False

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: 073222e


{{for c_type in float_types}}

cdef bint is_nan_{{c_type}}({{c_type}} val) nogil:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't define this for int types (as a no-op)? would it make the logic simpler if we did this (then we can always call a parameterized is_nan ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we already do. I've reshuffled the code, so it becomes more obvious.

@jreback jreback added Complex Complex Numbers Performance Memory or execution speed performance labels Dec 2, 2020
@jreback
Copy link
Contributor

jreback commented Dec 29, 2020

can you merge master and will look again

@jreback jreback added this to the 1.3 milestone Dec 29, 2020
@jreback
Copy link
Contributor

jreback commented Dec 29, 2020

ok looks good. this complex 64/128 are available on all platforms?

@realead
Copy link
Contributor Author

realead commented Dec 30, 2020

ok looks good. this complex 64/128 are available on all platforms?

@jreback It should work on platforms which have 64bit- and 32bit floating numbers. As we don't really use complex arithmetic it is not important whether C-compiler supports complex-numbers or not.

@jreback jreback merged commit f1d4026 into pandas-dev:master Dec 30, 2020
@jreback
Copy link
Contributor

jreback commented Dec 30, 2020

thanks @realead

@realead realead deleted the complex_for_khash2 branch December 31, 2020 08:01
luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Complex Complex Numbers Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants