
Commit

Merge remote-tracking branch 'upstream/master' into io_csv_docstring_fixed

* upstream/master:
  DOC: Enhancing pivot / reshape docs (pandas-dev#21038)
  TST: Fix xfailing DataFrame arithmetic tests by transposing (pandas-dev#23620)
  BUILD: Simplifying contributor dependencies (pandas-dev#23522)
  BUG/REF: TimedeltaIndex.__new__ (pandas-dev#23539)
  BUG: Casting tz-aware DatetimeIndex to object-dtype ndarray/Index (pandas-dev#23524)
  BUG: Delegate more of Excel parsing to CSV (pandas-dev#23544)
  API: DataFrame.__getitem__ returns Series for sparse column (pandas-dev#23561)
  CLN: use float64_t consistently instead of double, double_t (pandas-dev#23583)
  DOC: Fix Order of parameters in docstrings (pandas-dev#23611)
  TST: Unskip some Categorical Tests (pandas-dev#23613)
  TST: Fix integer ops comparison test (pandas-dev#23619)
thoo committed Nov 12, 2018
2 parents 3f5fbcd + dcb8b6a commit 4e6f3a0
Showing 75 changed files with 1,939 additions and 1,325 deletions.
20 changes: 15 additions & 5 deletions ci/code_checks.sh
@@ -9,16 +9,19 @@
# In the future we may want to add the validation of docstrings and other checks here.
#
# Usage:
-# $ ./ci/code_checks.sh            # run all checks
-# $ ./ci/code_checks.sh lint       # run linting only
-# $ ./ci/code_checks.sh patterns   # check for patterns that should not exist
-# $ ./ci/code_checks.sh doctests   # run doctests
# $ ./ci/code_checks.sh               # run all checks
# $ ./ci/code_checks.sh lint          # run linting only
# $ ./ci/code_checks.sh patterns      # check for patterns that should not exist
# $ ./ci/code_checks.sh doctests      # run doctests
# $ ./ci/code_checks.sh dependencies  # check that dependencies are consistent

echo "inside $0"
[[ $LINT ]] || { echo "NOT Linting. To lint use: LINT=true $0 $1"; exit 0; }
[[ -z "$1" || "$1" == "lint" || "$1" == "patterns" || "$1" == "doctests" ]] || { echo "Unknown command $1. Usage: $0 [lint|patterns|doctests]"; exit 9999; }
[[ -z "$1" || "$1" == "lint" || "$1" == "patterns" || "$1" == "doctests" || "$1" == "dependencies" ]] \
|| { echo "Unknown command $1. Usage: $0 [lint|patterns|doctests|dependencies]"; exit 9999; }

source activate pandas
BASE_DIR="$(dirname $0)/.."
RET=0
CHECK=$1

@@ -172,4 +175,11 @@ if [[ -z "$CHECK" || "$CHECK" == "doctests" ]]; then

fi

### DEPENDENCIES ###
if [[ -z "$CHECK" || "$CHECK" == "dependencies" ]]; then
MSG='Check that requirements-dev.txt has been generated from environment.yml' ; echo $MSG
$BASE_DIR/scripts/generate_pip_deps_from_conda.py --compare
RET=$(($RET + $?)) ; echo $MSG "DONE"
fi

exit $RET
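Conceptually, the ``--compare`` mode regenerates the pip requirements from
``environment.yml`` and fails when the checked-in ``requirements-dev.txt``
differs. A rough sketch of that idea (a hypothetical simplification, not the
actual ``scripts/generate_pip_deps_from_conda.py``):

.. code-block:: python

   import re
   import sys

   import yaml  # assumption: PyYAML is available

   def conda_to_pip(dep):
       # A lone '=' is a conda pin; pip spells the same pin '=='.
       return re.sub(r"(?<![<>=!~])=(?!=)", "==", dep)

   with open("environment.yml") as f:
       env = yaml.safe_load(f)

   # Keep plain package strings; skip nested sections such as a pip: block.
   generated = sorted(conda_to_pip(d)
                      for d in env["dependencies"] if isinstance(d, str))

   with open("requirements-dev.txt") as f:
       checked_in = sorted(line.strip() for line in f
                           if line.strip() and not line.startswith("#"))

   # Exit non-zero so CI fails when the two files have drifted apart.
   sys.exit(0 if generated == checked_in else 1)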
20 changes: 0 additions & 20 deletions ci/environment-dev.yaml

This file was deleted.

28 changes: 0 additions & 28 deletions ci/requirements-optional-conda.txt

This file was deleted.

16 changes: 0 additions & 16 deletions ci/requirements_dev.txt

This file was deleted.

11 changes: 3 additions & 8 deletions doc/source/contributing.rst
@@ -170,7 +170,7 @@ We'll now kick off a three-step process:
.. code-block:: none
# Create and activate the build environment
-conda env create -f ci/environment-dev.yaml
conda env create -f environment.yml
conda activate pandas-dev
# or with older versions of Anaconda:
@@ -180,9 +180,6 @@ We'll now kick off a three-step process:
python setup.py build_ext --inplace -j 4
python -m pip install -e .
-# Install the rest of the optional dependencies
-conda install -c defaults -c conda-forge --file=ci/requirements-optional-conda.txt
At this point you should be able to import pandas from your locally built version::

$ python # start an interpreter
@@ -221,14 +218,12 @@ You'll need to have at least python3.5 installed on your system.
. ~/virtualenvs/pandas-dev/bin/activate
# Install the build dependencies
-python -m pip install -r ci/requirements_dev.txt
python -m pip install -r requirements-dev.txt
# Build and install pandas
python setup.py build_ext --inplace -j 4
python -m pip install -e .
-# Install additional dependencies
-python -m pip install -r ci/requirements-optional-pip.txt
Creating a branch
-----------------

29 changes: 28 additions & 1 deletion doc/source/io.rst
@@ -2861,7 +2861,13 @@ to be parsed.
read_excel('path_to_file.xls', 'Sheet1', usecols=2)
-If `usecols` is a list of integers, then it is assumed to be the file column
You can also specify a comma-delimited set of Excel columns and ranges as a string:

.. code-block:: python
read_excel('path_to_file.xls', 'Sheet1', usecols='A,C:E')
If ``usecols`` is a list of integers, then it is assumed to be the file column
indices to be parsed.

.. code-block:: python
@@ -2870,6 +2876,27 @@ indices to be parsed.
Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.

.. versionadded:: 0.24
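For example, both of the following calls parse the same two columns in the
same order (a sketch reusing the hypothetical ``path_to_file.xls`` from the
examples above):

.. code-block:: python

   read_excel('path_to_file.xls', 'Sheet1', usecols=[0, 1])
   read_excel('path_to_file.xls', 'Sheet1', usecols=[1, 0])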

If ``usecols`` is a list of strings, it is assumed that each string corresponds
to a column name provided either by the user in ``names`` or inferred from the
document header row(s). Those strings define which columns will be parsed:

.. code-block:: python
read_excel('path_to_file.xls', 'Sheet1', usecols=['foo', 'bar'])
Element order is ignored, so ``usecols=['baz', 'joe']`` is the same as ``['joe', 'baz']``.

.. versionadded:: 0.24

If ``usecols`` is callable, the callable will be evaluated against the column
names, and only the names for which it returns ``True`` will be parsed.

.. code-block:: python
read_excel('path_to_file.xls', 'Sheet1', usecols=lambda x: x.isalpha())
Parsing Dates
+++++++++++++

110 changes: 104 additions & 6 deletions doc/source/reshaping.rst
@@ -17,6 +17,8 @@ Reshaping and Pivot Tables
Reshaping by pivoting DataFrame objects
---------------------------------------

.. image:: _static/reshaping_pivot.png

.. ipython::
:suppress:

@@ -33,8 +35,7 @@ Reshaping by pivoting DataFrame objects

In [3]: df = unpivot(tm.makeTimeDataFrame())

-Data is often stored in CSV files or databases in so-called "stacked" or
-"record" format:
Data is often stored in so-called "stacked" or "record" format:

.. ipython:: python
@@ -66,8 +67,6 @@ To select out everything for variable ``A`` we could do:
df[df['variable'] == 'A']
-.. image:: _static/reshaping_pivot.png

But suppose we wish to do time series operations with the variables. A better
representation would be where the ``columns`` are the unique variables and an
``index`` of dates identifies individual observations. To reshape the data into
@@ -87,7 +86,7 @@ column:
.. ipython:: python
df['value2'] = df['value'] * 2
-pivoted = df.pivot('date', 'variable')
pivoted = df.pivot(index='date', columns='variable')
pivoted
You can then select subsets from the pivoted ``DataFrame``:
@@ -99,6 +98,12 @@ You can then select subsets from the pivoted ``DataFrame``:
Note that this returns a view on the underlying data in the case where the data
are homogeneously-typed.

.. note::
:func:`~pandas.pivot` will error with a ``ValueError: Index contains duplicate
entries, cannot reshape`` if the index/column pair is not unique. In this
case, consider using :func:`~pandas.pivot_table`, which is a generalization
of pivot that can handle duplicate values for one index/column pair.
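As a minimal sketch of the difference (``d`` below is a hypothetical frame
with a repeated index/column pair), ``pivot_table`` aggregates the duplicates
that would make ``pivot`` raise:

.. code-block:: python

   d = pd.DataFrame({'date': ['d1', 'd1', 'd2'],
                     'variable': ['A', 'A', 'B'],
                     'value': [1.0, 2.0, 3.0]})

   # d.pivot(index='date', columns='variable') raises ValueError here,
   # because the pair ('d1', 'A') appears twice.
   d.pivot_table(index='date', columns='variable', values='value')  # averages the duplicates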

.. _reshaping.stacking:

Reshaping by stacking and unstacking
@@ -704,10 +709,103 @@ handling of NaN:
In [3]: np.unique(x, return_inverse=True)[::-1]
Out[3]: (array([3, 3, 0, 4, 1, 2]), array([nan, 3.14, inf, 'A', 'B'], dtype=object))
.. note::
If you just want to handle one column as a categorical variable (like R's factor),
you can use ``df["cat_col"] = pd.Categorical(df["col"])`` or
``df["cat_col"] = df["col"].astype("category")``. For full docs on :class:`~pandas.Categorical`,
see the :ref:`Categorical introduction <categorical>` and the
:ref:`API documentation <api.categorical>`.
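A minimal sketch of those two equivalent spellings (``df_cat`` is a
hypothetical frame):

.. code-block:: python

   df_cat = pd.DataFrame({'col': ['a', 'b', 'a']})

   df_cat['cat_col'] = pd.Categorical(df_cat['col'])     # explicit constructor
   df_cat['cat_col'] = df_cat['col'].astype('category')  # equivalent astype form

   df_cat['cat_col'].dtype  # CategoricalDtype(categories=['a', 'b'], ordered=False)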

Examples
--------

In this section, we will review frequently asked questions and examples. The
column names and relevant column values are named to correspond with how this
DataFrame will be pivoted in the answers below.

.. ipython:: python
np.random.seed([3, 1415])
n = 20
cols = np.array(['key', 'row', 'item', 'col'])
df = cols + pd.DataFrame((np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str))
df.columns = cols
df = df.join(pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val'))
df
Pivoting with Single Aggregations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Suppose we wanted to pivot ``df`` such that the ``col`` values become the
columns, the ``row`` values become the index, and the values are the mean of
``val0``. In particular, the resulting DataFrame should look like:

.. code-block:: ipython
col   col0   col1   col2   col3  col4
row
row0  0.77  0.605    NaN  0.860  0.65
row2  0.13    NaN  0.395  0.500  0.25
row3   NaN  0.310    NaN  0.545   NaN
row4   NaN  0.100  0.395  0.760  0.24
This solution uses :func:`~pandas.pivot_table`. Also note that
``aggfunc='mean'`` is the default. It is included here to be explicit.

.. ipython:: python
df.pivot_table(
values='val0', index='row', columns='col', aggfunc='mean')
Note that we can also fill the missing values using the ``fill_value``
parameter.

.. ipython:: python
df.pivot_table(
values='val0', index='row', columns='col', aggfunc='mean', fill_value=0)
We can pass in other aggregation functions as well, for example ``sum``.

.. ipython:: python
df.pivot_table(
values='val0', index='row', columns='col', aggfunc='sum', fill_value=0)
Another aggregation we can perform is calculating the frequency with which the
columns and rows occur together, a.k.a. "cross tabulation". To do this, we can
pass ``size`` to the ``aggfunc`` parameter.

.. ipython:: python
df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
Pivoting with Multiple Aggregations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can also perform multiple aggregations. For example, to perform both a
``sum`` and ``mean``, we can pass in a list to the ``aggfunc`` argument.

.. ipython:: python
df.pivot_table(
values='val0', index='row', columns='col', aggfunc=['mean', 'sum'])
Note that to aggregate over multiple value columns, we can pass a list to the
``values`` parameter.

.. ipython:: python
df.pivot_table(
values=['val0', 'val1'], index='row', columns='col', aggfunc=['mean'])
Note that to subdivide over multiple columns, we can pass a list to the
``columns`` parameter.

.. ipython:: python
df.pivot_table(
values=['val0'], index='row', columns=['item', 'col'], aggfunc=['mean'])
10 changes: 10 additions & 0 deletions doc/source/whatsnew/v0.24.0.txt
@@ -238,6 +238,7 @@ Other Enhancements
- Added :meth:`Interval.overlaps`, :meth:`IntervalArray.overlaps`, and :meth:`IntervalIndex.overlaps` for determining overlaps between interval-like objects (:issue:`21998`)
- :func:`~DataFrame.to_parquet` now supports writing a ``DataFrame`` as a directory of parquet files partitioned by a subset of the columns when ``engine = 'pyarrow'`` (:issue:`23283`)
- :meth:`Timestamp.tz_localize`, :meth:`DatetimeIndex.tz_localize`, and :meth:`Series.tz_localize` have gained the ``nonexistent`` argument for alternative handling of nonexistent times. See :ref:`timeseries.timezone_nonexistent` (:issue:`8917`)
- :meth:`read_excel()` now accepts ``usecols`` as a list of column names or callable (:issue:`18273`)

.. _whatsnew_0240.api_breaking:

@@ -246,6 +247,7 @@ Backwards incompatible API changes

- A newly constructed empty :class:`DataFrame` with integer as the ``dtype`` will now only be cast to ``float64`` if ``index`` is specified (:issue:`22858`)
- :meth:`Series.str.cat` will now raise if `others` is a `set` (:issue:`23009`)
- Passing scalar values to :class:`DatetimeIndex` or :class:`TimedeltaIndex` will now raise ``TypeError`` instead of ``ValueError`` (:issue:`23539`)

.. _whatsnew_0240.api_breaking.deps:

@@ -562,6 +564,7 @@ changes were made:
- The result of concatenating a mix of sparse and dense Series is a Series with sparse values, rather than a ``SparseSeries``.
- ``SparseDataFrame.combine`` and ``DataFrame.combine_first`` no longer supports combining a sparse column with a dense column while preserving the sparse subtype. The result will be an object-dtype SparseArray.
- Setting :attr:`SparseArray.fill_value` to a fill value with a different dtype is now allowed.
- ``DataFrame[column]`` is now a :class:`Series` with sparse values, rather than a :class:`SparseSeries`, when slicing a single column with sparse values (:issue:`23559`).

Some new warnings are issued for operations that require or are likely to materialize a large dense array:

@@ -967,6 +970,7 @@ Deprecations
- The class ``FrozenNDArray`` has been deprecated. When unpickling, ``FrozenNDArray`` will be unpickled to ``np.ndarray`` once this class is removed (:issue:`9031`)
- Deprecated the `nthreads` keyword of :func:`pandas.read_feather` in favor of
`use_threads` to reflect the changes in pyarrow 0.11.0. (:issue:`23053`)
- Constructing a :class:`TimedeltaIndex` from ``datetime64``-dtyped data is deprecated and will raise ``TypeError`` in a future version (:issue:`23539`)

.. _whatsnew_0240.deprecations.datetimelike_int_ops:

@@ -1126,6 +1130,9 @@ Datetimelike
- Bug in :class:`PeriodIndex` with attribute ``freq.n`` greater than 1 where adding a :class:`DateOffset` object would return incorrect results (:issue:`23215`)
- Bug in :class:`Series` that interpreted string indices as lists of characters when setting datetimelike values (:issue:`23451`)
- Bug in :class:`Timestamp` constructor which would drop the frequency of an input :class:`Timestamp` (:issue:`22311`)
- Bug in :class:`DatetimeIndex` where calling ``np.array(dtindex, dtype=object)`` would incorrectly return an array of ``long`` objects (:issue:`23524`)
- Bug in :class:`Index` where passing a timezone-aware :class:`DatetimeIndex` and ``dtype=object`` would incorrectly raise a ``ValueError`` (:issue:`23524`)
- Bug in :class:`Index` where calling ``np.array(dtindex, dtype=object)`` on a timezone-naive :class:`DatetimeIndex` would return an array of ``datetime`` objects instead of :class:`Timestamp` objects, potentially losing nanosecond portions of the timestamps (:issue:`23524`)

Timedelta
^^^^^^^^^
@@ -1172,6 +1179,7 @@ Offsets
- Bug in :class:`FY5253` where date offsets could incorrectly raise an ``AssertionError`` in arithmetic operations (:issue:`14774`)
- Bug in :class:`DateOffset` where keyword arguments ``week`` and ``milliseconds`` were accepted and ignored. Passing these will now raise ``ValueError`` (:issue:`19398`)
- Bug in adding :class:`DateOffset` with :class:`DataFrame` or :class:`PeriodIndex` incorrectly raising ``TypeError`` (:issue:`23215`)
- Bug in comparing :class:`DateOffset` objects with non-DateOffset objects, particularly strings, raising ``ValueError`` instead of returning ``False`` for equality checks and ``True`` for not-equal checks (:issue:`23524`)

Numeric
^^^^^^^
@@ -1299,6 +1307,8 @@ Notice how we now instead output ``np.nan`` itself instead of a stringified form
- Bug in :meth:`HDFStore.append` when appending a :class:`DataFrame` with an empty string column and ``min_itemsize`` < 8 (:issue:`12242`)
- Bug in :meth:`read_csv()` in which :class:`MultiIndex` index names were being improperly handled in the cases when they were not provided (:issue:`23484`)
- Bug in :meth:`read_html()` in which the error message was not displaying the valid flavors when an invalid one was provided (:issue:`23549`)
- Bug in :meth:`read_excel()` in which ``index_col=None`` was not being respected and parsing index columns anyway (:issue:`20480`)
- Bug in :meth:`read_excel()` in which ``usecols`` was not being validated for proper column names when passed in as a string (:issue:`20480`)

Plotting
^^^^^^^^