Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: Fix docs on merging categoricals. #28185

Merged
merged 4 commits into from
Nov 8, 2019

Conversation

ivirshup
Copy link
Contributor

Fixes #28166.

Updated docs about merging categorical to reflect current behavior.

@@ -813,16 +813,16 @@ but the categories of these categoricals need to be the same:
res
res.dtypes

In this case the categories are not the same, and therefore an error is raised:
If the categories are not exactly the same, merging will coerce the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this wording a little hard to digest - we are just coercing the categorical to their values as part of this right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it could be more clear, but I'm struggling with trying to keep the terminology consistent with existent docs. How about this?

Otherwise the result's dtype will be determined by promotion of the mergands' categories' dtypes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about

If the categories are not identical, categorical columns will be cast to a regular array with
the dtype of the underlying categories, which will likely have higher memory usage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good. Should there be an example which doesn't cast to object to make that clear? Something like this:

In [101]: pd.concat([ 
     ...:     pd.Series([1, 2], dtype="category"), 
     ...:     pd.Series([3., 4.], dtype="category") 
     ...: ])                                                                                        
Out[101]: 
0    1.0
1    2.0
0    3.0
1    4.0
dtype: float64

@WillAyd WillAyd added the Docs label Aug 28, 2019
@WillAyd WillAyd added this to the 1.0 milestone Aug 28, 2019
@ivirshup
Copy link
Contributor Author

ivirshup commented Sep 2, 2019

I've just noticed there is another section on Concatenation, as opposed to Merging, in the categorical docs. These docs describe different behavior, but still don't quite fit with what actually happens. Having two sections on this seems redundant to me, could these be collapsed to a single section in this PR?

@WillAyd
Copy link
Member

WillAyd commented Sep 2, 2019

I would be OK with that. It's what we seem to have set up in the 10 minutes to pandas:

https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#merge

@jreback
Copy link
Contributor

jreback commented Oct 6, 2019

@ivirshup can you merge master and we will look again

* Added an examples where categoricals are concatenated which results in a numeric dtype.
* Removed a table of examples which seemed confusion (most entries were equivalent, gave misleading typing info).
@ivirshup
Copy link
Contributor Author

@jreback, I've merged master (albeit a couple weeks ago now) and consolidated the sections on concatenation and merging.

union_categoricals([s1.array, s3.array])


Following table summarizes the results of ``Categoricals`` related concatenations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you keep this table? I think a good summary of what happens

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though update to reflect current status

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be useful it it said a bit more about the return types, though it might be better to just point to type promotion docs for that. How about something like this?

+----------+--------------------------------------------------------+----------------------------+		
 | arg1     | arg2                            |      identical       | result                     |		
 +==========+========================================================+============================+		
 | category | category                        | True                 | category                   |		
 +----------+--------------------------------------------------------+----------------------------+		
 | category (object)| category (object)       | False                | object (dtype is inferred) |		
 +----------+--------------------------------------------------------+----------------------------+		
 | category (int) | category (float)          | False                | float (dtype is inferred) |		
 +----------+--------------------------------------------------------+----------------------------+		

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems logical. Not sure if we have better verbiage than saying category (object) to refer to the categories of the categorical; @TomAugspurger might have thoughts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd like to keep it close to the repr, since that should make it easier to relate to practice

>>> pd.Categorical(list("abc"))                                                                 
[a, b, c]
Categories (3, object): [a, b, c]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay @WillAyd, I thought we were waiting on @TomAugspurger. Just pushed an update. Hopefully I got the table right.

@WillAyd
Copy link
Member

WillAyd commented Nov 8, 2019

@ivirshup can you address comment on table deletion? I think this is close

@WillAyd
Copy link
Member

WillAyd commented Nov 8, 2019

lgtm @jreback or @TomAugspurger if you want to look

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 8, 2019 via email

@WillAyd WillAyd merged commit 3b58f48 into pandas-dev:master Nov 8, 2019
@WillAyd
Copy link
Member

WillAyd commented Nov 8, 2019

Thanks @ivirshup

keechongtan added a commit to keechongtan/pandas that referenced this pull request Nov 11, 2019
…ndexing-1row-df

* upstream/master: (109 commits)
  stronger typing in libreduction (pandas-dev#29502)
  API: rename labels to codes (pandas-dev#29509)
  CLN: remove unnecessary type checks (pandas-dev#29517)
  implement _BaseGrouper (pandas-dev#29520)
  CLN: F-string formatting in pandas/_libs/*.pyx (pandas-dev#29527)
  Fixed more SS03 errors (pandas-dev#29540)
  consolidate dim checks (pandas-dev#29536)
  REF: separate out _get_cython_func_and_vals (pandas-dev#29537)
  remove unnecessary exception (pandas-dev#29538)
  TST:Add test to check single category col returns series with single row slice (pandas-dev#29521)
  Make color validation more forgiving (pandas-dev#29122)
  DOC: update bottleneck repo and documentation urls (pandas-dev#29516)
  TST: add test for df construction from dict with tuples (pandas-dev#29497)
  add test for pd.melt dtypes preservation (pandas-dev#29510)
  updated DataFrame.equals docstring (pandas-dev#29496)
  Resolved merge conflicts (pandas-dev#29506)
  DOC: Improved pandas/compact/__init__.py (pandas-dev#29507)
  DOC: Update performance comparison section of io docs (pandas-dev#28890)
  TST: add test for df.where() with category dtype (pandas-dev#29454)
  DOC: Fix docs on merging categoricals. (pandas-dev#28185)
  ...
Reksbril pushed a commit to Reksbril/pandas that referenced this pull request Nov 18, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect docs for merging categoricals
4 participants