WIP - Bool Extension Array #22226

Closed
wants to merge 1 commit into from

Member

@WillAyd WillAyd commented Aug 6, 2018

This is nowhere near complete, as I have a ton of broken tests that need to be resolved, but it is theoretically progress towards #21778 as I get my feet wet with EAs.

My thought here was to leverage the masking operations used by the Integer EAs to implement an easy Boolean EA on top of that. I've essentially copied over all of the integer tests as well, though someone may have thoughts on a better way to structure all of this.

Any and all direction greatly appreciated

@jbrockmendel
Member

Is the idea to make this a bit-per-entry? If not, I'm not clear on what the benefit of this is.

@WillAyd
Member Author

WillAyd commented Aug 7, 2018

No this would be using int8 underneath - I don't think a bit-per-entry is possible since that's not an addressable unit.

Benefit would be to give users an easy way to cast to and store boolean data with the same masking technique that we are using for integers to denote missing data, albeit the actual implementation underneath uses int8. I figure that would be easier than completely reimplementing this with a dedicated bool subtype, though I'm also looking for feedback on that front.
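To make the masking idea concrete, here is a minimal sketch, assuming an int8 data buffer paired with a boolean mask that marks missing entries. The class name `MaskedBoolSketch` and its methods are hypothetical and written only for illustration; this is not the code in this PR, nor the Integer EA implementation.

```python
import numpy as np


class MaskedBoolSketch:
    """Hypothetical illustration: int8 values plus a mask for missing data."""

    def __init__(self, values):
        # The mask is True wherever the input value is missing (None).
        self._mask = np.array([v is None for v in values], dtype=bool)
        # Store 0/1 as int8; missing slots get a placeholder 0 that the
        # mask hides from callers.
        self._data = np.array(
            [0 if v is None else int(bool(v)) for v in values], dtype=np.int8
        )

    def __len__(self):
        return len(self._data)

    def isna(self):
        # Return a copy so callers cannot mutate the internal mask.
        return self._mask.copy()

    def to_object(self):
        # Materialize as an object array with None for missing entries.
        out = self._data.astype(bool).astype(object)
        out[self._mask] = None
        return out

    def __invert__(self):
        # Logical NOT on the data; the mask carries over unchanged,
        # so missing entries stay missing.
        result = MaskedBoolSketch.__new__(MaskedBoolSketch)
        result._data = (1 - self._data).astype(np.int8)
        result._mask = self._mask.copy()
        return result


arr = MaskedBoolSketch([True, None, False])
print(arr.to_object())     # missing entry shown as None
print((~arr).to_object())  # negation keeps the missing entry
```

The point of the design is that operations only ever touch the fixed-width int8 buffer, and the mask decides which results are meaningful.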

@jreback
Contributor

jreback commented Aug 7, 2018

haven’t looked at the impl

but the clear win for this is efficient boolean arrays with missing values

right now you get nice boolean arrays but as soon as you have a NaN you coerce to object (or worse to float)
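The coercion described here can be reproduced with a short snippet. This is only an illustration of default dtype inference in pandas and NumPy; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# A boolean Series without missing values keeps the compact bool dtype.
clean = pd.Series([True, False, True])
print(clean.dtype)  # bool

# Adding a NaN makes pandas fall back to object dtype.
with_nan = pd.Series([True, False, np.nan])
print(with_nan.dtype)  # object

# Plain NumPy promotes the same data to float64, turning True/False into 1.0/0.0.
as_float = np.array([True, False, np.nan])
print(as_float.dtype)  # float64
```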

@jbrockmendel
Member

> Benefit would be to give users an easy way to cast to and store boolean data with the same masking technique that we are using for integers to denote missing data

BoolNA makes sense to me, thanks for clarifying.

> No this would be using int8 underneath - I don't think a bit-per-entry is possible since that's not an addressable unit.

Yah, this would take some behind-the-scenes trickery: something like a length-N bool array being backed by a length-N/8 int8 array.
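A rough sketch of that kind of bit-per-entry storage, assuming NumPy's `np.packbits` and `np.unpackbits` as the packing primitives. The helper names `pack_bools` and `unpack_bools` are hypothetical. Note that `np.packbits` returns uint8 rather than int8, and the packed length is N/8 rounded up, so the original length has to be stored alongside the buffer.

```python
import numpy as np


def pack_bools(values):
    """Pack a boolean sequence into bytes, eight entries per byte."""
    values = np.asarray(values, dtype=bool)
    # Return the original length too, since the last byte may be padded.
    return np.packbits(values), len(values)


def unpack_bools(packed, length):
    """Recover the first `length` booleans from a packed uint8 buffer."""
    return np.unpackbits(packed)[:length].astype(bool)


flags = np.array([True, False, True, True, False, False, True, False, True, True])
packed, n = pack_bools(flags)

print(packed.nbytes, "bytes packed vs", flags.nbytes, "bytes unpacked")
print(np.array_equal(unpack_bools(packed, n), flags))
```

The trade-off is that individual entries are no longer directly addressable, so every element access goes through shifting and masking (or a full unpack, as in this sketch).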

@gfyoung gfyoung added the ExtensionArray and Enhancement labels Aug 7, 2018
@jreback
Contributor

jreback commented Aug 7, 2018

i made an issue about using bitarray as an impl detail for integer NA; would obviously be useful here as well (so this would then be really cheap from a memory perspective)


@cache_readonly
def is_unsigned_integer(self):
    return False
Member

For what it's worth, you should be able to get is_signed_integer and is_unsigned_integer for free without needing to override since np.dtype(np.bool).kind returns 'b'. This does save performing a single comparison and is more explicit though.
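As a quick check of the `kind` argument, here is a small sketch, assuming the signed and unsigned checks compare the NumPy dtype kind against `'i'` and `'u'`. The two helper functions are hypothetical stand-ins, not the properties defined in this PR. The snippet spells the boolean type as `np.bool_`, which is available across NumPy versions.

```python
import numpy as np


def is_signed_integer_kind(dtype):
    # Hypothetical helper: signed integer dtypes report kind 'i'.
    return np.dtype(dtype).kind == "i"


def is_unsigned_integer_kind(dtype):
    # Hypothetical helper: unsigned integer dtypes report kind 'u'.
    return np.dtype(dtype).kind == "u"


print(np.dtype(np.bool_).kind)             # b
print(is_signed_integer_kind(np.bool_))    # False
print(is_unsigned_integer_kind(np.bool_))  # False
print(is_signed_integer_kind(np.int8))     # True
print(is_unsigned_integer_kind(np.uint8))  # True
```

Under that assumption, a bool-backed dtype already answers False to both checks, so an explicit override only adds clarity.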

@jreback
Contributor

jreback commented Nov 21, 2018

agree this is a nice idea and we should do it but closing as stale

@jreback jreback closed this Nov 21, 2018
@fuglede

fuglede commented Jan 27, 2019

If this ends up being possible, you may want to also provide it as an answer to this StackOverflow question.

@jreback jreback mentioned this pull request Feb 22, 2019
5 tasks
@WillAyd WillAyd deleted the bool-ext branch January 16, 2020 00:34
Labels
Enhancement, ExtensionArray
6 participants