ENH: Implement DataFrame.astype('category') #18099

Merged
merged 3 commits into from
Mar 1, 2018

Conversation

jschendel
Member

@jschendel jschendel commented Nov 3, 2017

df['A'].dtype
df['B'].dtype

Note that this behavior is different than instantiating a ``DataFrame`` with categorical dtype, which will only assign

Contributor

hmm I am not sure I like this difference; it's not different for any other dtypes and I would be willing to break it here, to give the new behavior (all values determine categories).

Member Author

yep, was just writing a follow-up post to illustrate this and ask for people's input on this inconsistency.

dtype_with_cat = (isinstance(dtype, CategoricalDtype) and
                  dtype.categories is not None)
if not dtype_with_cat:
    categories = kwargs.get('categories', None)

Contributor

I think this logic should be in a utility function in categorical.py.
we should also have a unique2d function (sort of trivial, but it puts the logic & tests with the appropriate algos).

cc @TomAugspurger

@jreback jreback added Categorical Categorical Data Type Enhancement labels Nov 3, 2017
@jschendel
Member Author

One thing I've noticed is that this seems to lead to an inconsistency between DataFrame(data).astype('category') and DataFrame(data, dtype='category').

Specifically, per the issue specifications, using astype sets all unique labels across all columns as categories, even if they don't appear in a given column:

In [2]: df1 = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['c', 'd', 'e']}).astype('category')

In [3]: df1['A'].dtype
Out[3]: CategoricalDtype(categories=['a', 'b', 'c', 'd', 'e'], ordered=False)

Whereas instantiating with dtype='category' does things column by column:

In [4]: df2 = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['c', 'd', 'e']}, dtype='category')

In [5]: df2['A'].dtype
Out[5]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)

Is this type of inconsistency acceptable? Or should one method be modified to match the other?

@codecov

codecov bot commented Nov 4, 2017

Codecov Report

Merging #18099 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #18099      +/-   ##
==========================================
- Coverage   91.25%   91.23%   -0.02%     
==========================================
  Files         163      163              
  Lines       50120    50129       +9     
==========================================
  Hits        45737    45737              
- Misses       4383     4392       +9
Flag Coverage Δ
#multiple 89.05% <100%> (ø) ⬆️
#single 40.31% <9.09%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 92.45% <100%> (+0.03%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 86e9dcc...081d533. Read the comment docs.

@codecov

codecov bot commented Nov 4, 2017

Codecov Report

Merging #18099 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #18099      +/-   ##
==========================================
+ Coverage   91.68%   91.69%   +<.01%     
==========================================
  Files         150      150              
  Lines       48976    48978       +2     
==========================================
+ Hits        44906    44910       +4     
+ Misses       4070     4068       -2
Flag Coverage Δ
#multiple 90.07% <100%> (ø) ⬆️
#single 41.86% <25%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/generic.py 95.89% <100%> (ø) ⬆️
pandas/util/testing.py 83.85% <0%> (+0.2%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dfe9d4a...293dee5. Read the comment docs.


.. versionadded:: 0.22.0

:meth:`DataFrame.astype` supports simultaneously setting multiple columns as categorical. When setting multiple

Contributor

FWIW, @jcrist assumed that the factorization / categorization would be done column-wise, not table-wise. It's not clear to me which one is more obvious.

Given DataFrames are columnar, I'd slightly prefer having it be done column-wise.

Contributor

column-wise is trivial; taking the uniques across all values and then factorizing is the more useful factorization (e.g. this one)

@TomAugspurger
Contributor

TomAugspurger commented Nov 5, 2017 via email

@jreback
Contributor

jreback commented Nov 5, 2017

> "More useful" as in more difficult for a user to accomplish, or more people
> expect / want?

more useful in that this is generally what you want (e.g. conceptually similar to get_dummies), and it is non-trivial to do. per-column is so easy, e.g. df.apply(lambda x: x.astype('category')), that having this as the default doesn't make sense. But the most important issue here is that ALL other dtypes coerce when passed to the constructor AND with astype. Diverging here would be a huge problem.
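
For readers following along, the per-column idiom mentioned in this comment can be tried as a minimal sketch. The sample frame is the one used earlier in the thread; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['c', 'd', 'e']})

# Each column is converted on its own, so it only sees its own labels.
per_column = df.apply(lambda col: col.astype('category'))

cats_a = list(per_column['A'].cat.categories)
cats_b = list(per_column['B'].cat.categories)
print(cats_a)
print(cats_b)
```

With this idiom, label 'd' is not a category of column A, which is the behavior the table-wise proposal wanted to change.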

@jreback
Contributor

jreback commented Nov 5, 2017

@jschendel pls change the constructor to do essentially this (you will want to move the actual coercing routine to pandas.core.categorical as I mentioned above)

@jschendel
Member Author

Updated to address the comments:

  • Modified the DataFrame constructor to be consistent with DataFrame.astype
    • Modified categorical.rst to reflect this
    • Added this to the breaking API changes section of the whatsnew
  • Moved logic to the _get_categorical_dtype_2d helper function in categorical.py
  • Created unique2d and _ensure_arraylike2d functions in algorithms.py
  • Added tests for the three items above

Regarding _ensure_arraylike2d, the reason I created this was because I was running into an issue with _ensure_arraylike where a list of lists gets returned as a numpy array of lists instead of a 2d numpy array:

In [22]: values = [['a', 'b', 'c', 'a'], ['b', np.nan, 'd', 'd']]

In [23]: _ensure_arraylike(values)
Out[23]: array([list(['a', 'b', 'c', 'a']), list(['b', nan, 'd', 'd'])], dtype=object)

whereas I'd like to get output along the lines of:

In [24]: np.array([_ensure_arraylike(x) for x in values])
Out[24]:
array([['a', 'b', 'c', 'a'],
       ['b', nan, 'd', 'd']], dtype=object)

Should I just try patching _ensure_arraylike instead of creating _ensure_arraylike2d? Or is what I did fine?
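
As a rough illustration of the idea behind the two helpers, the sketch below builds a genuine 2-D object array from a list of lists and takes the uniques over all elements. It is not the code added in this PR: `unique_2d_sketch` is a hypothetical name, and only public NumPy and pandas calls are used.

```python
import numpy as np
import pandas as pd

values = [['a', 'b', 'c', 'a'], ['b', np.nan, 'd', 'd']]

# Converting row by row yields a real 2-D object array, not an array of lists.
arr = np.array([np.asarray(row, dtype=object) for row in values])


def unique_2d_sketch(arr2d):
    """Hypothetical helper: uniques over all elements, in order of appearance."""
    flat = np.asarray(arr2d, dtype=object).ravel()
    return pd.unique(flat)


uniques = unique_2d_sketch(arr)
# Missing values are dropped here because they cannot be categories.
labels = [u for u in uniques if not pd.isna(u)]
print(arr.shape, labels)
```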

@jorisvandenbossche
Member

I am -1 on having df.astype('category') work on the full dataframe values instead of column-wise. Yes, working column-wise is also trivial with apply (it is the frame-wise case that is more difficult out of the box), but IMO that's not a good reason to choose a different default.
Doing it column-wise a) feels more consistent and b) is IMO what users would expect it to do as the default if they see such a line of code.
IMO frame-wise is also not 'generally what users want', I think this very much depends on the use case.

To facilitate the more difficult frame-wise case, I would just make sure that we add a clear example on how to achieve this.
I suppose it is something like:

df.astype(pd.api.types.CategoricalDtype(categories=np.unique(df.values.ravel())))

I would say that is exactly a nice use case for which we made CategoricalDtype publicly available.
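
A self-contained version of this frame-wise approach might look like the following sketch. It assumes the frame holds only strings with no missing values, since `np.unique` has to sort the labels; the sample data and variable names are illustrative.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['c', 'd', 'e']})

# Collect every label in the frame and use it as one shared category set.
all_labels = np.unique(df.values.ravel())
shared = CategoricalDtype(categories=all_labels)

# Every column is converted with the same explicit categories.
df_shared = df.astype(shared)
print(df_shared['A'].cat.categories)
```

Because the categories are passed explicitly, both columns end up with the same dtype even though neither column contains all five labels.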

@jreback
Contributor

jreback commented Nov 10, 2017

> Doing it column-wise a) feels more consistent and b) is IMO what users would expect it to do as the default if they see such a line of code

a) is only true because of the existing behavior.
b) not sure that is the case.

The point here is that it's not clear what a user expects. In a 2-d case maybe we should just refuse to guess, IOW raise if dtype='category' is passed in both construction of a DataFrame and in .astype (meaning categories=None is passed), rather than picking a default that is unclear.

Of course both examples would be useful in the docs (and even in the error message).

@jcrist
Contributor

jcrist commented Nov 10, 2017

I agree with @jorisvandenbossche and @TomAugspurger here. In my mind, df.astype is semantically the same as df.apply(lambda x: x.astype(...)), even if in implementation the conversion is done as one operation by the block manager. As a user I find the unification of categories here surprising and non-intuitive. I also did a brief office survey and others found this equally surprising and non-intuitive (cc @mcg1969).

If it's not an easy one-liner I'd prefer a separate function for doing the category unification and conversion instead of using astype. Or perhaps a keyword argument to astype, with unification as the non-default option.

@mcg1969

mcg1969 commented Nov 10, 2017

I appreciate the cc: here. I'm with @jcrist et al regarding what the default behavior should be. The one addition I offered, that I don't think Jim agreed with, was the idea of offering the full-dataframe conversion as a kwarg option.

@jorisvandenbossche
Member

The "one-liner" currently is

df.astype(pd.api.types.CategoricalDtype(categories=np.unique(df.values.ravel())))

which is of course not that easy.
I was thinking we could add a constructor for CategoricalDtype from just values (and categories get inferred), but that is also not much shorter:

df.astype(pd.api.types.CategoricalDtype.from_values(df.values.ravel()))

(although we could implement it so that the .ravel() is not needed to be done by the user)

A keyword argument to astype feels a bit strange to me as this seems to be specific for categoricals. Or are there other dtypes where this would give a different result?

@jschendel
Member Author

I don't have a particularly strong opinion either way; did this PR more out of finding it interesting than an actual want/need for the feature. Have been in situations where it'd be the behavior I'd want, though not necessarily sure it's what I'd expect.

I can't think of any other dtypes where a keyword argument would yield different results, but if we still want to use one it seems like axis could work with behavior similar to how ndarray.sum works in numpy, i.e.:

  • axis=None: over all elements, i.e. what this PR currently does
  • axis=0: column-wise
  • axis=1: ??? Not really sure. Seems like it could just raise a NotImplementedError equivalent to what 2d structures currently do, e.g. pd.Categorical(np.random.random((3,3)))

Still seems kind of strange to have a keyword argument that's only applicable to categoricals, but at least axis seems somewhat logically consistent, since it seems like axis would conceptually give the same result regardless of what value is passed for other dtypes, so could minimize any confusion.

Happy to modify the PR however is deemed fit, once a clear path has been established.
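
To make the `axis` idea concrete, here is a hypothetical standalone helper. It is not a pandas API and not what was merged; `to_category` and its behavior are assumptions made only to illustrate the proposal.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def to_category(df, axis=0):
    """Hypothetical helper, not part of pandas.

    axis=0    -> each column gets its own categories
    axis=None -> all columns share the labels found anywhere in the frame
    """
    if axis == 0:
        return df.astype('category')
    if axis is None:
        labels = pd.unique(df.values.ravel())
        # Drop missing values and sort so the result does not depend on order.
        labels = sorted(x for x in labels if not pd.isna(x))
        return df.astype(CategoricalDtype(categories=labels))
    raise NotImplementedError("only axis=0 and axis=None are sketched here")


df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['c', 'd', 'e']})
by_column = to_category(df, axis=0)
whole_frame = to_category(df, axis=None)
```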

@pep8speaks

pep8speaks commented Nov 20, 2017

Hello @jschendel! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 28, 2018 at 08:13 Hours UTC

@jreback
Contributor

jreback commented Jan 21, 2018

ok I will reverse my earlier stance on this, and support this working column-wise, ensuring that DataFrame.astype('category') and DataFrame(data, dtype='category') work the same. We should add a small section in the docs showing whole-dataframe coding as an example.

@jschendel if you want to update (and move notes to 0.23.0) we could get this in.
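
The consistency asked for here can be checked with a short sketch, assuming a pandas version that includes this change (0.23.0 or later). The data is the example used earlier in the thread.

```python
import pandas as pd

data = {'A': ['a', 'b', 'c'], 'B': ['c', 'd', 'e']}

# Both routes convert column by column, so each column keeps only its own labels.
via_astype = pd.DataFrame(data).astype('category')
via_constructor = pd.DataFrame(data, dtype='category')

for col in data:
    print(col,
          list(via_astype[col].cat.categories),
          list(via_constructor[col].cat.categories))
```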

@jschendel
Member Author

jschendel commented Feb 24, 2018

Updated to perform columnwise conversion:

  • Changed the issue number to reference this PR instead of the original issue
    • Wanted to avoid confusion since the original issue is in regards to tablewise conversion
  • Added this behavior to categorical.rst
    • Shuffled some examples and remarks around to make things more coherent
    • Nothing should have been removed, just moved to a slightly different location
  • Added a small example to the whatsnew

Contributor

@jreback jreback left a comment

small doc comments. lgtm otherwise. @jorisvandenbossche @TomAugspurger


.. ipython:: python

df.dtypes
df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
cat_type = CategoricalDtype(categories=list('abcd'),

Contributor

you might want to add a ::note section to show how to gain all of the uniques a-priori on a DataFrame to pass to CDT

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:meth:`DataFrame.astype` can now perform columnwise conversion to ``Categorical`` by supplying the string ``'category'`` or a :class:`~pandas.api.types.CategoricalDtype`.
Previously, attempting this would raise a ``NotImplementedError``. (:issue:`18099`)

Contributor

you can put a :ref: to the added doc-section as well.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:meth:`DataFrame.astype` can now perform columnwise conversion to ``Categorical`` by supplying the string ``'category'`` or a :class:`~pandas.api.types.CategoricalDtype`.
Previously, attempting this would raise a ``NotImplementedError``. (:issue:`18099`)

Contributor

add the original issue number here as well

df['A'].dtype
df['B'].dtype

See the :ref:`categorical.objectcreation` section of the documentation for more details and examples.

Contributor

I see you have the ref, put this at the top section then

@jschendel
Member Author

Made the requested doc changes.

elif is_categorical_dtype(dtype) and self.ndim > 1:
    # GH 18099: columnwise conversion to categorical
    results = (self[col].astype(dtype, copy=copy) for col in self)
    return pd.concat(results, axis=1, copy=False)

Member

I don't think using concat is the best way here. For example, this will lose information about the columns object (its type, metadata such as its name).
Can't we in a loop assign it back into the frame? (or is that less efficient?)

Contributor

this is idiomatic and how things are handled elsewhere.

> For example, this will lose information about the columns object (its type, metadata such as its name)

it will preserve as these are all the same index. of course if you have a counter-example

Member

I am speaking about the columns, not index. So just take an example with CategoricalIndex, RangeIndex, .. or one with a name, and the resulting columns will not be the same.

Contributor

hmm, we do this elsewhere in astype, so this is obviously not tested. ok. In any event, using concat is vastly more performant. I suppose you just set the columns after.

Member Author

I've opened #19920 to address the (an?) existing occurrence of this bug. A quick check makes it look like @jreback's suggestion of setting the columns afterwards should fix things in both places, e.g.

result = pd.concat(results, axis=1, copy=False)
result.columns = self.columns
return result

Since this is relatively small, I'm planning to make the fix/add tests in this PR. Can address it in a separate PR if that would be preferable though.
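
The suggested fix can be tried outside pandas internals with the sketch below. It assumes a frame whose columns index carries a name; the name 'letters' and the sample data are made up for illustration, and the `copy=False` argument from the snippet above is left out for brevity.

```python
import pandas as pd

df = pd.DataFrame(
    [['a', 'c'], ['b', 'd']],
    columns=pd.Index(['A', 'B'], name='letters'),
)

# Convert each column separately, then stitch the pieces back together.
pieces = [df[col].astype('category') for col in df]
result = pd.concat(pieces, axis=1)

# Reattach the original columns object so that its name and type survive,
# whatever concat happened to build from the Series names.
result.columns = df.columns
print(result.columns)
```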

Member

Both (here or in other PR) are fine for me. However, if we also want to fix this for the other cases (not only for categorical), it might be cleaner as a separate PR.

Contributor

yes this is fine, I see you opened another one which is good.

.. note::

In contrast to R's `factor` function, there is currently no way to assign/change labels at
creation time. Use `categories` to change the categories after creation time.

Member

I would personally move those back a bit more below. They are specifically about the creation of Categoricals, so I would put it at the end of the 'Object creation' section

DataFrame Creation
~~~~~~~~~~~~~~~~~~

Columns in a ``DataFrame`` can be batch converted to categorical, either at the time of construction

Member

Columns -> All columns

And maybe start this section by saying that a single column can be converted similarly to a Series (df['col'] = df['col'].astype('category')), and then go on to what to do if you want multiple / all columns to be categorical

Contributor

  • Remove the comma after "categorical"

  • "at the time of" -> "during"

Member Author

The previous "Series Creation" section actually does single column conversion, so I just mentioned that section instead of repeating the code/examples.

.. note::

In contrast to R's `factor` function, categorical data is not converting input values to
strings and categories will end up the same data type as the original values.

Contributor

Remove the "and" and split this in two sentences? Currently this could be read as two differences from R, while it's just 1 difference (not stringified), and the consequence (same dtype).

Member Author

Replaced the "and" with a semicolon instead of making separate sentences.

~~~~~~~~~~~~~~~~~~

Columns in a ``DataFrame`` can be batch converted to categorical, either at the time of construction
or after construction. The conversion to categorical is done on a column by column basis; labels present

Contributor

"on a column by column basis" -> "column by column".

Contributor

Or you can maybe remove this sentence. It's conveyed below.

df_cat = df.astype('category')
df_cat.dtypes

This conversion is likewise done on a column by column basis:

Contributor

"on a column by column basis" -> "column by column"

@jreback jreback added this to the 0.23.0 milestone Feb 28, 2018
Contributor

@jreback jreback left a comment

lgtm. ping after docs updated.

@jschendel
Member Author

Made the doc updates. Will fix the related issue in a follow-up PR.

@jreback jreback merged commit 96b8bb1 into pandas-dev:master Mar 1, 2018
@jreback
Contributor

jreback commented Mar 1, 2018

very nice @jschendel. pls fix the .astype concat issues in a follow-up.

@jschendel jschendel deleted the df-astype-category branch March 1, 2018 01:51
harisbal added a commit to harisbal/pandas that referenced this pull request Mar 12, 2018
commit df2e361
Author: Jeff Reback <jeff@reback.net>
Date:   Sun Mar 11 18:33:25 2018 -0400

    LINT: fixing

commit f1c0b7c
Author: David Polo <delkk0@users.noreply.github.com>
Date:   Sun Mar 11 22:54:27 2018 +0100

    DOC: Improved the docstring of pandas.plotting._core.FramePlotMethods… (pandas-dev#20157)

    * DOC: Improved the docstring of pandas.plotting._core.FramePlotMethods.barh()
    - Added examples section
    - Added extended summary
    - Added argument explanation

    * DOC: Improved the docstring of pandas.plotting._core.FramePlotMethods.barh()
    - Correcting PR comments

    * DOC: Improved the docstring of pandas.plotting._core.FramePlotMethods.barh()
    - Adding defaults for variables.

    * Update reference

commit 0780193
Author: Jonas Schulze <jonas.schulze7@t-online.de>
Date:   Sun Mar 11 22:37:37 2018 +0100

    DOC: update the pandas.DataFrame.plot.density docstring (pandas-dev#20236)

    * DOC: update the pandas.DataFrame.plot.kde and pandas.Series.plot.kde docstrings

    Unfortunately, I was not able to compute a kernel estimate of a
    two-dimensional random variable. Hence, the example is more of an
    analysis of some independent data series.

    * DOC: extract similarities of kde docstrings

    The `DataFrame.plot.kde` and `Series.plot.kde` now use a common
    docstring, for which the differences are inserted.

commit 2718984
Author: Cihan Ceyhan <chncyhn@gmail.com>
Date:   Sun Mar 11 21:48:08 2018 +0100

    DOC: Update the pandas.Series.dt.round/floor/ceil docstrings (pandas-dev#20187)

    * DOC: Update the pandas.Series.dt.round/floor/ceil docstrings

    * DOC: review points fixed.

    * Add series

commit 0d86742
Author: Antonio Molina <aydevosotros@gmail.com>
Date:   Sun Mar 11 18:57:37 2018 +0100

    DOC: Improved pandas.plotting.bootstrap_plot docstring (pandas-dev#20166)

    * Improved documentation on bootstrap_plot

    * Improved documentation on bootstrap_plot

    * Doc bootstrap_plot: Fixed some comments on pull requests

    * Added reference to wikipedia

    * Changed kwds for **kwds

    * Removed ** from kwds becuase of validation iuses

    * Fixed forgotten break line. I think that the kwds paramater now fits what expected @TomAugspurger. If not, sorry and indicate how it should be

    * Fixed warnings on compilation

    * Moved reference to extended description

commit a2910ad
Author: András Novoszáth <nocibambi@gmail.com>
Date:   Sun Mar 11 18:56:01 2018 +0100

    DOC: update the Index.get_values docstring (pandas-dev#20231)

    * DOC: update the Index.get_values docstring

    * Corrections

    * Corrected extended summary and quotes

    * Correcting spaces, extended summary, multiIndex example

    * See also correction

    * Multi ndim

commit afa6c42
Author: Marc <mlafore05@gmail.com>
Date:   Sun Mar 11 10:42:35 2018 -0400

    DOC: update the pandas.DataFrame.all docstring (pandas-dev#20216)

commit a44bae3
Author: Victor Villas <villasv@outlook.com>
Date:   Sun Mar 11 11:41:12 2018 -0300

    DOC: update the Series.view docstring (pandas-dev#20220)

commit 233103f
Author: David Adrián Cañones Castellano <davidarcano@gmail.com>
Date:   Sun Mar 11 15:40:02 2018 +0100

    DOC: update the docstring of pandas.DataFrame.from_dict (pandas-dev#20259)

commit 62bddec
Author: csfarkas <csaba.farkas95@gmail.com>
Date:   Sun Mar 11 15:33:54 2018 +0100

    DOC: add docstring for Index.get_duplicates (pandas-dev#20223)

commit 8c77238
Author: adatasetaday <32177771+adatasetaday@users.noreply.github.com>
Date:   Sun Mar 11 10:17:05 2018 -0400

    Docstring pandas.series.diff (pandas-dev#20238)

commit 4271757
Author: Aly Sivji <4369343+alysivji@users.noreply.github.com>
Date:   Sun Mar 11 08:51:25 2018 -0500

    DOC: update `pandas/core/ops.py` docstring template to accept examples (pandas-dev#20246)

commit 080ef0c
Author: akosel <aaronjkosel@gmail.com>
Date:   Sun Mar 11 12:43:10 2018 +0000

    DOC: update the DataFrame.iat[] docstring (pandas-dev#20219)

    * DOC: update the DataFrame.iat[] docstring

    * Update based on PR comments

    * Update based on PR comments

    * Singular not plural

    * Update to account for use with Series. Add example using Series.

    * Update indexing.py

    * PEP8

commit 302fda4
Author: adatasetaday <32177771+adatasetaday@users.noreply.github.com>
Date:   Sun Mar 11 08:36:21 2018 -0400

    DOC: update the pandas.DataFrame.diff docstring (pandas-dev#20227)

    * DOC: update the pandas.DataFrame.diff  docstring

    * DOC: update the pandas.DataFrame.diff docstring

    * DOC: update the pandas.DataFrame.diff docstring

    * DOC: update the pandas.DataFrame.diff docstring

    * DOC: update the pandas.DataFrame.diff docstring

    * DOC: update the pandas.DataFrame.diff  docstring

    * DOC: update the pandas.DataFrame.diff  docstring

    * DOC: update the pandas.DataFrame.diff  docstring

    * DOC: update the pandas.DataFrame.diff docstring

    * Cleanup

commit c791a84
Author: Pietro Battiston <me@pietrobattiston.it>
Date:   Sun Mar 11 13:07:01 2018 +0100

    DOC: pd.core.window.Expanding.kurt docstring (split from pd.core.Rolling.kurt) (pandas-dev#20064)

commit b3d6ce6
Author: Nipun Sadvilkar <nipunsadvilkar@gmail.com>
Date:   Sun Mar 11 17:29:33 2018 +0530

    DOC: update the pandas.date_range() docstring (pandas-dev#20143)

    * DOC: Improved the docstring of pandas.date_range()

    * Change date strings to iso format

    * Removed import pands in Examples docstring

    * Add See Also Docstring

    * Update datetimes.py

    * Doctests

commit 6d7272a
Author: Samuel Sinayoko <samuelsinayoko@bmlltech.com>
Date:   Sun Mar 11 11:58:09 2018 +0000

    DOC: update DataFrame.to_records (pandas-dev#20191)

    * Update to_records docstring.

    - Minor changes (missing dots, newlines) to make tests pass.
    - More examples.

    * Fix html docs.

    Missing newlines.

    * Reword datetime type information.

    * flake8 errors

    * Fix typo (duplicated type)

    * Remove unwanted blank line after Examples.

    * Fix doctests.

    ```
    (pandas_dev) sinayoks@landade:~/dev/pandas/ $ pytest --doctest-modules pandas/core/frame.py -k to_record
    ========================================================================================== test session starts ==========================================================================================
    platform darwin -- Python 3.6.4, pytest-3.4.2, py-1.5.2, pluggy-0.6.0
    rootdir: /Users/sinayoks/dev/pandas, inifile: setup.cfg
    plugins: xdist-1.22.1, forked-0.2, cov-2.5.1
    collected 43 items

    pandas/core/frame.py .                                                                                                                                                                            [100%]

    ========================================================================================== 42 tests deselected ==========================================================================================
    ```

    * Few more changes

commit 636335a
Author: Gabriel de Maeztu <gabriel.maeztu@gmail.com>
Date:   Sun Mar 11 12:56:48 2018 +0100

    DOC: Improved the docstring of pandas.plotting.radviz (pandas-dev#20169)

commit fbebc7f
Author: jen w <j.e.weiss@gmail.com>
Date:   Sun Mar 11 06:50:54 2018 -0500

    DOC: Update pandas.DataFrame.tail docstring (pandas-dev#20225)

commit c2864d7
Author: Stephen Childs <sechilds@gmail.com>
Date:   Sun Mar 11 07:50:39 2018 -0400

    DOC: update the DataFrame.cov docstring (pandas-dev#20245)

    * DOC: Revise docstring of DataFrame cov method

    Update the docstring with some examples from
    elsewhere in the pandas documentation.

    Some of the examples use randomly generated time series
    because we need to get covariance between long series.
    Used a random seed to ensure that the results are the
    same each time.

    * DOC: Fix See Also and min_periods explanation.

    Responding to comments on PR. See also section will link
    properly and number of periods explanation clearer.

commit 90e31b9
Author: jen w <j.e.weiss@gmail.com>
Date:   Sun Mar 11 06:50:18 2018 -0500

    DOC: update pandas.DataFrame.head docstring (pandas-dev#20262)

commit fb556ed
Author: Israel Saeta Pérez <dukebody@gmail.com>
Date:   Sat Mar 10 22:33:42 2018 +0100

    DOC: Improve pandas.Series.plot.kde docstring and kwargs rewording for whole file (pandas-dev#20041)

commit c3d491a
Author: Andy R. Terrel <andy.terrel@gmail.com>
Date:   Sat Mar 10 11:48:13 2018 -0800

    DOC: update the DataFrame.head()  docstring (pandas-dev#20206)

commit dd7f567
Author: DataOmbudsman <DataOmbudsman@users.noreply.github.com>
Date:   Sat Mar 10 20:15:48 2018 +0100

    DOC: update the Index.shift docstring (pandas-dev#20192)

    * DOC: updating docstring of Index.shift

    * Add See Also section to shift

    * Update link to Series.shift

commit 5b0caf4
Author: Eric O. LEBIGOT (EOL) <lebigot@users.noreply.github.com>
Date:   Sat Mar 10 17:32:20 2018 +0100

    DOC: update the Series.memory_usage() docstring (pandas-dev#20086)

commit 9fb7ac9
Author: Carol Willing <carolcode@willingconsulting.com>
Date:   Sat Mar 10 08:28:54 2018 -0800

    DOC: Edit contributing to docs section (pandas-dev#20190)

commit d8181a5
Author: DaanVanHauwermeiren <DaanVanHauwermeiren@users.noreply.github.com>
Date:   Sat Mar 10 17:25:20 2018 +0100

    DOC: update the Series.isin docstring (pandas-dev#20175)

commit ec631ce
Author: Riccardo Magliocchetti <riccardo.magliocchetti@gmail.com>
Date:   Sat Mar 10 17:12:41 2018 +0100

    DOC: update the pandas.Series.tail docstring (pandas-dev#20176)

commit e5e4ae9
Author: DaanVanHauwermeiren <DaanVanHauwermeiren@users.noreply.github.com>
Date:   Sat Mar 10 16:41:58 2018 +0100

    DOC: update the pandas.Index.drop_duplicates and pandas.Series.drop_duplicates docstring (pandas-dev#20114)

commit d7bcb22
Author: Riccardo Magliocchetti <riccardo.magliocchetti@gmail.com>
Date:   Sat Mar 10 15:49:31 2018 +0100

    DOC: update the MultiIndex.swaplevel docstring (pandas-dev#20105)

commit 8497029
Author: Gjelt <math-and-data@users.noreply.github.com>
Date:   Sat Mar 10 15:41:17 2018 +0100

    DOC: Improved the docstring of pandas.DataFrame.values (pandas-dev#20065)

commit 840d432
Author: Jordi Contestí <25779507+jcontesti@users.noreply.github.com>
Date:   Sat Mar 10 13:24:35 2018 +0100

    DOC: Improved the docstring of Series.str.findall (pandas-dev#19982)

commit 2a0d23b
Author: Jeff Reback <jeff@reback.net>
Date:   Sat Mar 10 06:54:19 2018 -0500

    DOC: lint

commit bf0dcb5
Author: Kate Surta <kate.surta@gmail.com>
Date:   Sat Mar 10 14:42:52 2018 +0300

    BUG: Check for wrong arguments in index subclasses constructors (pandas-dev#20017)

commit 4131149
Author: Stijn Van Hoey <stijnvanhoey@gmail.com>
Date:   Sat Mar 10 10:15:41 2018 +0100

    DOC: Extend docstring pandas core index to_frame method (pandas-dev#20036)

commit 52cffa3
Author: William Ayd <william.ayd@icloud.com>
Date:   Fri Mar 9 18:06:43 2018 -0800

    Cythonized GroupBy pct_change (pandas-dev#19919)

commit da6f827
Author: William Ayd <william.ayd@icloud.com>
Date:   Fri Mar 9 18:03:50 2018 -0800

    Refactored GroupBy ASVs (pandas-dev#20043)

commit bd31f71
Author: William Ayd <william.ayd@icloud.com>
Date:   Fri Mar 9 17:53:34 2018 -0800

    Added 'displayed_only' option to 'read_html' (pandas-dev#20047)

commit ed96567
Author: Ksenia <bobrovaksenia@gmail.com>
Date:   Sat Mar 10 02:40:10 2018 +0100

    TST: series/indexing tests parametrization + moving test methods (pandas-dev#20059)

commit 7c14e4f
Author: Kyle Barron <kylebarron2@gmail.com>
Date:   Fri Mar 9 11:31:14 2018 -0500

    DOC: Add syntax highlighting to SAS code blocks in comparison_with_sas.rst (pandas-dev#20080)

    * Add syntax highlighting to SAS code blocks

    * Fix typo

commit 731d971
Author: Matthew Roeschke <emailformattr@gmail.com>
Date:   Fri Mar 9 03:30:22 2018 -0800

    Fix typo in apply.py (pandas-dev#20058)

commit cc1b934
Author: Matthew Roeschke <emailformattr@gmail.com>
Date:   Fri Mar 9 03:13:50 2018 -0800

    BUG: Retain timezone dtype with cut and qcut (pandas-dev#19890)

commit c730d08
Author: William Ayd <william.ayd@icloud.com>
Date:   Fri Mar 9 02:37:27 2018 -0800

    DOC: Update Kurt Docstr (pandas-dev#20044)

commit 9119d07
Author: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Date:   Fri Mar 9 10:03:44 2018 +0100

    Temporary github PR template for sprint (pandas-dev#20055)

commit 747501a
Author: Aly Sivji <4369343+alysivji@users.noreply.github.com>
Date:   Fri Mar 9 02:19:59 2018 -0600

    DOC: Improve docstring for pandas.Index.repeat (pandas-dev#19985)

commit 1d73cf3
Author: Rouz Azari <rouzazari@users.noreply.github.com>
Date:   Thu Mar 8 16:54:53 2018 -0800

    BUG: Dense ranking with percent now uses 100% basis (pandas-dev#15639)

commit f9fd540
Author: William Ayd <william.ayd@icloud.com>
Date:   Thu Mar 8 16:36:23 2018 -0800

    Added flake8 to DEV requirements (pandas-dev#20063)

commit b669112
Author: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Date:   Thu Mar 8 14:09:12 2018 +0100

    DOC: require returns section in validation script (pandas-dev#19994)

commit 024d8b4
Author: Jeff Reback <jeff@reback.net>
Date:   Thu Mar 8 07:08:57 2018 -0500

    TST: xfail test_time on py2 & mpl 1.4.3 (pandas-dev#20053)

commit b85f6c1
Author: Marc Garcia <garcia.marc@gmail.com>
Date:   Thu Mar 8 11:07:08 2018 +0000

    DOC: update docstring validation script + replace api coverage script (pandas-dev#20025)

    * Improvements to validate_docstrings script: adding sections to summary, validating type and description of parameters

    * DOC: Improvements to validate docstring script (added api_coverage functionality, sections in csv and extra validations)

commit 9273bf5
Author: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Date:   Thu Mar 8 11:14:05 2018 +0100

    DOC/CI: temp pin matplotlib for doc build (pandas-dev#20045)

commit 63ce781
Author: Jeff Reback <jeff@reback.net>
Date:   Wed Mar 7 17:01:38 2018 -0500

    TST: xfail mpl 2.2 tests

    xref pandas-dev#20031

commit 7c7bd56
Author: Daniel Frank <danfrankj@gmail.com>
Date:   Wed Mar 7 13:54:46 2018 -0800

    enable multivalues insert (pandas-dev#19664)

commit f33e84c
Author: Ksenia <bobrovaksenia@gmail.com>
Date:   Wed Mar 7 22:09:42 2018 +0100

    Moving tests in series/indexing to fixtures (pandas-dev#20014.1) (pandas-dev#20034)

commit 2532a49
Author: Liam3851 <david.krych@gmail.com>
Date:   Wed Mar 7 13:04:22 2018 -0500

    BUG: Fixes to msgpack support. (pandas-dev#19975)

commit fd010de
Author: Guilherme Beltramini <guilherme.beltramini@nubank.com.br>
Date:   Wed Mar 7 11:33:09 2018 -0300

    to_sql also accepts Series (pandas-dev#20004)

commit 8d462ed
Author: Paul Reidy <paul_reidy@outlook.com>
Date:   Wed Mar 7 14:32:12 2018 +0000

    ENH: Implement method argument for DataFrame.replace (pandas-dev#19894)

commit d14fae8
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Wed Mar 7 06:19:21 2018 -0800

    cleanup ops (pandas-dev#19972)

commit 776f2be
Author: William Ayd <william.ayd@icloud.com>
Date:   Wed Mar 7 05:59:39 2018 -0800

    Added .pytest_cache to gitignore (pandas-dev#20021)

commit 460941f
Author: jschendel <jschendel@users.noreply.github.com>
Date:   Wed Mar 7 06:57:51 2018 -0700

    Fix typos in test_interval_new (pandas-dev#20026)

commit 5782ab8
Author: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Date:   Wed Mar 7 14:57:17 2018 +0100

    DOC: enable matplotlib plot_directive to include figures in docstrings (pandas-dev#20015)

commit dd2b224
Author: DataOmbudsman <DataOmbudsman@users.noreply.github.com>
Date:   Wed Mar 7 14:56:49 2018 +0100

    DOC: updating docstring of Index.shift (pandas-dev#19996)

commit 09c416c
Author: William Ayd <william.ayd@icloud.com>
Date:   Wed Mar 7 05:56:16 2018 -0800

    DOC: Updated kurt docstring (for pandas sprint) (pandas-dev#19999)

commit ad15f80
Author: Kate Surta <kate.surta@gmail.com>
Date:   Wed Mar 7 16:55:48 2018 +0300

    TST: Fix wrong argument in TestDataFrameAlterAxes.test_set_index_dst (pandas-dev#20019)

commit f6ee9ac
Author: Jeff Reback <jeff@reback.net>
Date:   Wed Mar 7 08:55:33 2018 -0500

    TST: xfail clip tests under numpy-dev (pandas-dev#20035)

    xref pandas-dev#19976

commit 397e296
Author: Jeff Reback <jeff@reback.net>
Date:   Wed Mar 7 08:15:49 2018 -0500

    TST: xfail some tests for mpl 2.2 compat (pandas-dev#20033)

    xref pandas-dev#20031

commit 56939b4
Author: luzpaz <luzpaz@users.noreply.github.com>
Date:   Wed Mar 7 06:10:39 2018 -0500

    DOC: misc typos (pandas-dev#20029)

commit 01b91c2
Author: alinde1 <32714875+alinde1@users.noreply.github.com>
Date:   Tue Mar 6 22:47:45 2018 +0100

    DOC: is confusing for ddof parameter of sem, var and std functions (pandas-dev#19986)

commit db82165
Author: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Date:   Tue Mar 6 22:42:41 2018 +0100

    CLN/DOC: cache_readonly: remove allow_setting + preserve docstring (pandas-dev#19991)

commit e02f737
Author: Tom Augspurger <TomAugspurger@users.noreply.github.com>
Date:   Tue Mar 6 09:38:32 2018 -0600

    DOC: add doc on ExtensionArray and extending pandas (pandas-dev#19936)

commit 0ca77b3
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Tue Mar 6 04:27:21 2018 -0800

    Datetimelike add/sub catch cases more explicitly, tests (pandas-dev#19912)

commit 0038bad
Author: Matthew Roeschke <emailformattr@gmail.com>
Date:   Tue Mar 6 04:25:55 2018 -0800

    month_name/day_name warnings followup (pandas-dev#20010)

commit fd63c90
Author: Ksenia <bobrovaksenia@gmail.com>
Date:   Tue Mar 6 13:25:37 2018 +0100

    TST: split series/test_indexing.py (pandas-dev#18614) (pandas-dev#20006)

commit 6366bf0
Author: Jeff Reback <jeff@reback.net>
Date:   Tue Mar 6 07:25:17 2018 -0500

    TST: clean deprecation warnings for xref pandas-dev#19980 (pandas-dev#20013)

    xfail some mpl > 2.1.2 tests

commit fe61299
Author: William Ayd <william.ayd@icloud.com>
Date:   Tue Mar 6 00:30:13 2018 -0800

    DOC: fixed dynamic import mechanics of make.py (pandas-dev#20005)

commit 8a084eb
Author: Grant Smith <grantsmith@gmail.com>
Date:   Tue Mar 6 03:29:26 2018 -0500

    CLN: deprecate the pandas.tseries.plotting.tsplot function (GH18627) (pandas-dev#19980)

commit aedbd94
Author: Jeff Reback <jeff@reback.net>
Date:   Mon Mar 5 06:36:41 2018 -0500

    TST: text correction, xref pandas-dev#19987

commit cbffd19
Author: Bhavesh Poddar <bhavesh13103507@gmail.com>
Date:   Mon Mar 5 06:34:59 2018 -0500

    fixed pytest deprecation warning (pandas-dev#19987)

commit 058a16c
Author: Matthew Roeschke <emailformattr@gmail.com>
Date:   Mon Mar 5 03:23:49 2018 -0800

    CLN: Use generators in builtin functions (pandas-dev#19989)

commit 607910b
Author: Matthew Roeschke <emailformattr@gmail.com>
Date:   Sun Mar 4 12:15:37 2018 -0800

    Add month names (pandas-dev#18164)

commit 2fad756
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Sun Mar 4 12:00:39 2018 -0800

    transition period_helper to use pandas_datetimestruct (pandas-dev#19918)

commit 53606ff
Author: Liam3851 <david.krych@gmail.com>
Date:   Sun Mar 4 14:58:22 2018 -0500

    BUG: Compat for pre-0.20 TimedeltaIndex and Float64Index pickles pandas-dev#19939 (pandas-dev#19943)

commit 0bfb61b
Author: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Date:   Fri Mar 2 22:35:45 2018 +0100

    DOC: small updates to make.py script (pandas-dev#19951)

    * enable passing verbosity flag to sphinx

    * alias api for api.rst

commit d1f3689
Author: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Date:   Fri Mar 2 22:33:48 2018 +0100

    DOC: fix some sphinx syntax warnings (pandas-dev#19962)

commit 49f09cc
Author: Tom Augspurger <TomAugspurger@users.noreply.github.com>
Date:   Fri Mar 2 15:20:28 2018 -0600

    API: Added ExtensionArray constructor from scalars (pandas-dev#19913)

commit d30d165
Author: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Date:   Fri Mar 2 22:18:10 2018 +0100

    DOC: update docstring validation script (pandas-dev#19960)

commit a7a7f8c
Author: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Date:   Fri Mar 2 13:49:59 2018 +0100

    DOC: clarify version of ActivePython that includes pandas (pandas-dev#19964)

commit b167483
Author: Gina <Dr-G@users.noreply.github.com>
Date:   Fri Mar 2 05:33:49 2018 -0600

    DOC: update install.rst to include ActivePython distribution (pandas-dev#19908)

commit e6c7dea
Author: topper-123 <terji78@gmail.com>
Date:   Fri Mar 2 11:19:07 2018 +0000

    ENH: Let initialisation from dicts use insertion order for python >= 3.6 (part III) (pandas-dev#19884)

commit d615f86
Author: Marc Garcia <garcia.marc@gmail.com>
Date:   Fri Mar 2 09:39:45 2018 +0000

    DOC: Adding script to validate docstrings, and generate list of all functions/methods with state (pandas-dev#19898)

commit 5f271eb
Author: Yian <yian.shang@gmail.com>
Date:   Fri Mar 2 00:13:58 2018 +0100

    BUG: Adding skipna as an option to groupby cumsum and cumprod (pandas-dev#19914)

commit 072545d
Author: David C Hall <davidchall@users.noreply.github.com>
Date:   Thu Mar 1 15:06:20 2018 -0800

    ENH: Add option to disable MathJax (pandas-dev#19824). (pandas-dev#19856)

commit d44a6ec
Author: Yian <yian.shang@gmail.com>
Date:   Fri Mar 2 00:02:31 2018 +0100

    Making to_datetime('today') and Timestamp('today') consistent (pandas-dev#19937)

commit 87fefe2
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Thu Mar 1 14:54:42 2018 -0800

    dispatch Series[datetime64] comparison ops to DatetimeIndex (pandas-dev#19800)

commit 9242248
Author: Matthew Roeschke <emailformattr@gmail.com>
Date:   Thu Mar 1 14:50:35 2018 -0800

    BUG: DataFrame.diff(axis=0) with DatetimeTZ data (pandas-dev#19773)

commit c5a1ef1
Author: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Date:   Thu Mar 1 22:48:39 2018 +0100

    DOC: remove empty attribute/method lists from class docstrings html page (pandas-dev#19949)

commit 9958ce6
Author: jschendel <jschendel@users.noreply.github.com>
Date:   Thu Mar 1 04:14:19 2018 -0700

    BUG: Preserve column metadata with DataFrame.astype (pandas-dev#19948)

commit 3b4eb8d
Author: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Date:   Thu Mar 1 12:12:35 2018 +0100

    CLN: remove redundant clean_fill_method calls (pandas-dev#19947)

commit c8859b5
Author: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Date:   Thu Mar 1 10:35:05 2018 +0100

    DOC: script to build single docstring page (pandas-dev#19840)

commit 52559f5
Author: Matthew Roeschke <emailformattr@gmail.com>
Date:   Wed Feb 28 17:32:24 2018 -0800

    ENH: Allow Timestamp to accept Nanosecond argument (pandas-dev#19889)

commit 4a27697
Author: William Ayd <william.ayd@icloud.com>
Date:   Wed Feb 28 17:30:18 2018 -0800

    Cythonized GroupBy any (pandas-dev#19722)

commit 96b8bb1
Author: jschendel <jschendel@users.noreply.github.com>
Date:   Wed Feb 28 18:07:15 2018 -0700

    ENH: Implement DataFrame.astype('category') (pandas-dev#18099)

commit 6ef4be3
Author: Liam3851 <david.krych@gmail.com>
Date:   Wed Feb 28 06:14:11 2018 -0500

    ENH: Allow literal (non-regex) replacement using .str.replace pandas-dev#16808 (pandas-dev#19584)

commit 318a287
Author: README Bot <35302948+codetriage-readme-bot@users.noreply.github.com>
Date:   Wed Feb 28 05:07:28 2018 -0600

    Add CodeTriage badge to pandas-dev/pandas (pandas-dev#19928)

    Adds a badge showing the number of people helping this repo on CodeTriage.

commit 14a38a6
Author: Chris Catalfo <ccatalfo@users.noreply.github.com>
Date:   Wed Feb 28 03:14:23 2018 -0500

    DOC: fixes pipe example in basics.rst due to statsmodel changes (pandas-dev#19923)

commit dfe9d4a
Author: Phil Ngo <ngo.phil@gmail.com>
Date:   Wed Feb 28 00:05:56 2018 -0800

    DOC: fix Series.reset_index example (pandas-dev#19930)

commit 9bdc5c8
Author: William Ayd <william.ayd@icloud.com>
Date:   Tue Feb 27 16:16:48 2018 -0800

    Consistent Timedelta Writing for all Excel Engines (pandas-dev#19921)

commit 61211a8
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Tue Feb 27 16:11:47 2018 -0800

    Assorted _libs cleanups (pandas-dev#19887)
Labels
Categorical Categorical Data Type Enhancement
Development

Successfully merging this pull request may close these issues.

ENH: support .astype('category') on DataFrame / aka co-factorization
7 participants