What's new in 1.3.0 (July 2, 2021)

These are the changes in pandas 1.3.0. See :ref:`release` for a full changelog including other versions of pandas.

{{ header }}

Warning

When reading new Excel 2007+ (.xlsx) files, the default argument engine=None to :func:`read_excel` will now result in using the openpyxl engine in all cases when the option :attr:`io.excel.xlsx.reader` is set to "auto". Previously, some cases would use the xlrd engine instead. See :ref:`What's new 1.2.0 <whatsnew_120>` for background on this change.
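If you want the engine choice to be explicit rather than resolved through options, it can be pinned directly (a minimal sketch; "data.xlsx" is a placeholder path and openpyxl must be installed):

In [1]: df = pd.read_excel("data.xlsx", engine="openpyxl")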

Enhancements

Custom HTTP(s) headers when reading csv or json files

When reading from a remote URL that is not handled by fsspec (e.g. HTTP and HTTPS) the dictionary passed to storage_options will be used to create the headers included in the request. This can be used to control the User-Agent header or send other custom headers (:issue:`36688`). For example:

In [1]: headers = {"User-Agent": "pandas"}
In [2]: df = pd.read_csv(
   ...:     "https://download.bls.gov/pub/time.series/cu/cu.item",
   ...:     sep="\t",
   ...:     storage_options=headers
   ...: )

Read and write XML documents

We added I/O support to read and render shallow versions of XML documents with :func:`read_xml` and :meth:`DataFrame.to_xml`. Using lxml as the parser, both XPath 1.0 and XSLT 1.0 are available (:issue:`27554`).

In [1]: xml = """<?xml version='1.0' encoding='utf-8'?>
   ...: <data>
   ...:  <row>
   ...:     <shape>square</shape>
   ...:     <degrees>360</degrees>
   ...:     <sides>4.0</sides>
   ...:  </row>
   ...:  <row>
   ...:     <shape>circle</shape>
   ...:     <degrees>360</degrees>
   ...:     <sides/>
   ...:  </row>
   ...:  <row>
   ...:     <shape>triangle</shape>
   ...:     <degrees>180</degrees>
   ...:     <sides>3.0</sides>
   ...:  </row>
   ...:  </data>"""

In [2]: df = pd.read_xml(xml)
In [3]: df
Out[3]:
      shape  degrees  sides
0    square      360    4.0
1    circle      360    NaN
2  triangle      180    3.0

In [4]: df.to_xml()
Out[4]:
<?xml version='1.0' encoding='utf-8'?>
<data>
  <row>
    <index>0</index>
    <shape>square</shape>
    <degrees>360</degrees>
    <sides>4.0</sides>
  </row>
  <row>
    <index>1</index>
    <shape>circle</shape>
    <degrees>360</degrees>
    <sides/>
  </row>
  <row>
    <index>2</index>
    <shape>triangle</shape>
    <degrees>180</degrees>
    <sides>3.0</sides>
  </row>
</data>
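The xpath argument can select a subset of nodes to parse; a minimal sketch against the document above, where the expression keeps only the circle row:

In [5]: pd.read_xml(xml, xpath="//row[shape='circle']")
Out[5]:
    shape  degrees  sides
0  circle      360    NaN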

For more, see :ref:`io.xml` in the user guide on IO tools.

Styler enhancements

We provided some focused development on :class:`.Styler`. See also the Styler documentation which has been revised and improved (:issue:`39720`, :issue:`39317`, :issue:`40493`).
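As a minimal illustration of the API surface (a sketch not tied to any single 1.3.0 change), styling chains off DataFrame.style and renders to HTML:

.. ipython:: python

    df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
    styled = df.style.highlight_max(color="yellow")  # returns a Styler
    html = styled.render()  # render the styled table to an HTML string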

DataFrame constructor honors copy=False with dict

When passing a dictionary to :class:`DataFrame` with copy=False, a copy will no longer be made (:issue:`32960`).

.. ipython:: python

    arr = np.array([1, 2, 3])
    df = pd.DataFrame({"A": arr, "B": arr.copy()}, copy=False)
    df

df["A"] remains a view on arr:

.. ipython:: python

    arr[0] = 0
    assert df.iloc[0, 0] == 0

The default behavior when not passing copy will remain unchanged, i.e. a copy will be made.
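For contrast, a minimal sketch of that unchanged default:

.. ipython:: python

    arr = np.array([1, 2, 3])
    df = pd.DataFrame({"A": arr})  # copy not passed, so the array is copied
    arr[0] = 0
    assert df.iloc[0, 0] == 1  # df does not see the mutation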

PyArrow backed string data type

We've enhanced the :class:`StringDtype`, an extension type dedicated to string data. (:issue:`39908`)

It is now possible to specify a storage keyword option to :class:`StringDtype`. Use pandas options or specify the dtype using dtype='string[pyarrow]' to allow the StringArray to be backed by a PyArrow array instead of a NumPy array of Python objects.

The PyArrow backed StringArray requires pyarrow 1.0.0 or greater to be installed.

Warning

string[pyarrow] is currently considered experimental. The implementation and parts of the API may change without warning.

.. ipython:: python

   pd.Series(['abc', None, 'def'], dtype=pd.StringDtype(storage="pyarrow"))

You can use the alias "string[pyarrow]" as well.

.. ipython:: python

   s = pd.Series(['abc', None, 'def'], dtype="string[pyarrow]")
   s

You can also create a PyArrow backed string array using pandas options.

.. ipython:: python

    with pd.option_context("string_storage", "pyarrow"):
        s = pd.Series(['abc', None, 'def'], dtype="string")
    s

The usual string accessor methods work. Where appropriate, the return type of the Series or columns of a DataFrame will also have string dtype.

.. ipython:: python

   s.str.upper()
   s.str.split('b', expand=True).dtypes

String accessor methods returning integers will return a value with :class:`Int64Dtype`.

.. ipython:: python

   s.str.count("a")

Centered datetime-like rolling windows

When performing rolling calculations on DataFrame and Series objects with a datetime-like index, a centered datetime-like window can now be used (:issue:`38780`). For example:

.. ipython:: python

    df = pd.DataFrame(
        {"A": [0, 1, 2, 3, 4]}, index=pd.date_range("2020", periods=5, freq="1D")
    )
    df
    df.rolling("2D", center=True).mean()


Other enhancements

Notable bug fixes

These are bug fixes that might have notable behavior changes.

Categorical.unique now always maintains same dtype as original

Previously, when calling :meth:`Categorical.unique` with categorical data, unused categories in the new array would be removed, making the dtype of the new array different from that of the original (:issue:`18291`).

As an example of this, given:

.. ipython:: python

        dtype = pd.CategoricalDtype(['bad', 'neutral', 'good'], ordered=True)
        cat = pd.Categorical(['good', 'good', 'bad', 'bad'], dtype=dtype)
        original = pd.Series(cat)
        unique = original.unique()

Previous behavior:

In [1]: unique
['good', 'bad']
Categories (2, object): ['bad' < 'good']
In [2]: original.dtype == unique.dtype
False

New behavior:

.. ipython:: python

        unique
        original.dtype == unique.dtype

:meth:`DataFrame.combine_first` will now preserve dtypes (:issue:`7509`)

.. ipython:: python

   df1 = pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, index=[0, 1, 2])
   df1
   df2 = pd.DataFrame({"B": [4, 5, 6], "C": [1, 2, 3]}, index=[2, 3, 4])
   df2
   combined = df1.combine_first(df2)

Previous behavior:

In [1]: combined.dtypes
Out[1]:
A    float64
B    float64
C    float64
dtype: object

New behavior:

.. ipython:: python

   combined.dtypes

Groupby methods agg and transform no longer change the return dtype for callables

Previously the methods :meth:`.DataFrameGroupBy.aggregate`, :meth:`.SeriesGroupBy.aggregate`, :meth:`.DataFrameGroupBy.transform`, and :meth:`.SeriesGroupBy.transform` might cast the result dtype when the argument func is callable, possibly leading to undesirable results (:issue:`21240`). The cast would occur if the result is numeric and casting back to the input dtype does not change any values as measured by np.allclose. Now no such casting occurs.

.. ipython:: python

    df = pd.DataFrame({'key': [1, 1], 'a': [True, False], 'b': [True, True]})
    df

Previous behavior:

In [5]: df.groupby('key').agg(lambda x: x.sum())
Out[5]:
        a  b
key
1    True  2

New behavior:

.. ipython:: python

    df.groupby('key').agg(lambda x: x.sum())

Previously, these methods could result in different dtypes depending on the input values. Now, these methods will always return a float dtype. (:issue:`41137`)

.. ipython:: python

    df = pd.DataFrame({'a': [True], 'b': [1], 'c': [1.0]})

Previous behavior:

In [5]: df.groupby(df.index).mean()
Out[5]:
        a  b    c
0    True  1  1.0

New behavior:

.. ipython:: python

    df.groupby(df.index).mean()

Try operating inplace when setting values with loc and iloc

When setting an entire column using loc or iloc, pandas will try to insert the values into the existing data rather than create an entirely new array.

.. ipython:: python

   df = pd.DataFrame(range(3), columns=["A"], dtype="float64")
   values = df.values
   new = np.array([5, 6, 7], dtype="int64")
   df.loc[[0, 1, 2], "A"] = new

In both the new and old behavior, the data in values is overwritten, but in the old behavior the dtype of df["A"] changed to int64.

Previous behavior:

In [1]: df.dtypes
Out[1]:
A    int64
dtype: object
In [2]: np.shares_memory(df["A"].values, new)
Out[2]: False
In [3]: np.shares_memory(df["A"].values, values)
Out[3]: False

In pandas 1.3.0, df continues to share data with values.

New behavior:

.. ipython:: python

   df.dtypes
   np.shares_memory(df["A"], new)
   np.shares_memory(df["A"], values)


Never operate inplace when setting frame[keys] = values

When setting multiple columns using frame[keys] = values, new arrays will replace the pre-existing arrays for those keys, which will not be overwritten (:issue:`39510`). As a result, the columns will retain the dtype(s) of values, never casting to the dtypes of the existing arrays.

.. ipython:: python

   df = pd.DataFrame(range(3), columns=["A"], dtype="float64")
   df[["A"]] = 5

In the old behavior, 5 was cast to float64 and inserted into the existing array backing df:

Previous behavior:

In [1]: df.dtypes
Out[1]:
A    float64
dtype: object

In the new behavior, we get a new array, and retain an integer-dtyped 5:

New behavior:

.. ipython:: python

   df.dtypes


Consistent casting with setting into Boolean Series

Setting non-boolean values into a :class:`Series` with dtype=bool now consistently casts to dtype=object (:issue:`38709`).

In [1]: orig = pd.Series([True, False])

In [2]: ser = orig.copy()

In [3]: ser.iloc[1] = np.nan

In [4]: ser2 = orig.copy()

In [5]: ser2.iloc[1] = 2.0

Previous behavior:

In [1]: ser
Out[1]:
0    1.0
1    NaN
dtype: float64

In [2]: ser2
Out[2]:
0    True
1     2.0
dtype: object

New behavior:

In [1]: ser
Out[1]:
0    True
1     NaN
dtype: object

In [2]: ser2
Out[2]:
0    True
1     2.0
dtype: object

DataFrameGroupBy.rolling and SeriesGroupBy.rolling no longer return grouped-by column in values

The group-by column will now be dropped from the result of a groupby.rolling operation (:issue:`32262`).

.. ipython:: python

    df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
    df

Previous behavior:

In [1]: df.groupby("A").rolling(2).sum()
Out[1]:
       A    B
A
1 0  NaN  NaN
  1  2.0  1.0
2 2  NaN  NaN
3 3  NaN  NaN

New behavior:

.. ipython:: python

    df.groupby("A").rolling(2).sum()

Removed artificial truncation in rolling variance and standard deviation

:meth:`.Rolling.std` and :meth:`.Rolling.var` will no longer artificially truncate results that are less than ~1e-8 and ~1e-15 respectively to zero (:issue:`37051`, :issue:`40448`, :issue:`39872`).

However, floating point artifacts may now exist in the results when rolling over larger values.

.. ipython:: python

   s = pd.Series([7, 5, 5, 5])
   s.rolling(3).var()
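To see the kind of artifact in question, shift the same series to larger magnitudes (a sketch; the exact residue depends on the values and platform):

.. ipython:: python

   s2 = pd.Series([7, 5, 5, 5]) + 1e9
   s2.rolling(3).var()  # identical spreads, but tiny floating point residue may appear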

DataFrameGroupBy.rolling and SeriesGroupBy.rolling with MultiIndex no longer drop levels in the result

:meth:`DataFrameGroupBy.rolling` and :meth:`SeriesGroupBy.rolling` will no longer drop levels of a :class:`DataFrame` with a :class:`MultiIndex` in the result. This can lead to a perceived duplication of levels in the resulting :class:`MultiIndex`, but this change restores the behavior that was present in version 1.1.3 (:issue:`38787`, :issue:`38523`).

.. ipython:: python

   index = pd.MultiIndex.from_tuples([('idx1', 'idx2')], names=['label1', 'label2'])
   df = pd.DataFrame({'a': [1], 'b': [2]}, index=index)
   df

Previous behavior:

In [1]: df.groupby('label1').rolling(1).sum()
Out[1]:
          a    b
label1
idx1    1.0  2.0

New behavior:

.. ipython:: python

    df.groupby('label1').rolling(1).sum()


Backwards incompatible API changes

Increased minimum versions for dependencies

Some minimum supported versions of dependencies were updated. If installed, we now require:

Package          Minimum Version  Required  Changed
numpy            1.17.3           X         X
pytz             2017.3           X
python-dateutil  2.7.3            X
bottleneck       1.2.1
numexpr          2.7.0                      X
pytest (dev)     6.0                        X
mypy (dev)       0.812                      X
setuptools       38.6.0                     X

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package         Minimum Version  Changed
beautifulsoup4  4.6.0
fastparquet     0.4.0            X
fsspec          0.7.4
gcsfs           0.6.0
lxml            4.3.0
matplotlib      2.2.3
numba           0.46.0
openpyxl        3.0.0            X
pyarrow         0.17.0           X
pymysql         0.8.1            X
pytables        3.5.1
s3fs            0.4.0
scipy           1.2.0
sqlalchemy      1.3.0            X
tabulate        0.8.7            X
xarray          0.12.0
xlrd            1.2.0
xlsxwriter      1.0.2
xlwt            1.3.0
pandas-gbq      0.12.0

See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.
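To check which required and optional dependency versions are installed in a given environment, pandas includes a built-in report:

In [1]: pd.show_versions()  # prints Python, OS, pandas, and dependency versions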

Other API changes

Build

  • Documentation in .pptx and .pdf formats is no longer included in wheels or source distributions (:issue:`30741`)

Deprecations

Deprecated dropping nuisance columns in DataFrame reductions and DataFrameGroupBy operations

When calling a reduction (e.g. .min, .max, .sum) on a :class:`DataFrame` with numeric_only=None (the default), columns where the reduction raises a TypeError are silently ignored and dropped from the result.

This behavior is deprecated. In a future version, the TypeError will be raised, and users will need to select only valid columns before calling the function.

For example:

.. ipython:: python

   df = pd.DataFrame({"A": [1, 2, 3, 4], "B": pd.date_range("2016-01-01", periods=4)})
   df

Old behavior:

In [3]: df.prod()
Out[3]:
A    24
dtype: int64

Future behavior:

In [4]: df.prod()
...
TypeError: 'DatetimeArray' does not implement reduction 'prod'

In [5]: df[["A"]].prod()
Out[5]:
A    24
dtype: int64

Similarly, when applying a function to :class:`DataFrameGroupBy`, columns on which the function raises TypeError are currently silently ignored and dropped from the result.

This behavior is deprecated. In a future version, the TypeError will be raised, and users will need to select only valid columns before calling the function.

For example:

.. ipython:: python

   df = pd.DataFrame({"A": [1, 2, 3, 4], "B": pd.date_range("2016-01-01", periods=4)})
   gb = df.groupby([1, 1, 2, 2])

Old behavior:

In [4]: gb.prod(numeric_only=False)
Out[4]:
    A
1   2
2  12

Future behavior:

In [5]: gb.prod(numeric_only=False)
...
TypeError: datetime64 type does not support prod operations

In [6]: gb[["A"]].prod(numeric_only=False)
Out[6]:
    A
1   2
2  12

Other Deprecations

Performance improvements

Bug fixes

Categorical

Datetimelike

Timedelta

Timezones

  • Bug in different tzinfo objects representing UTC not being treated as equivalent (:issue:`39216`)
  • Bug in dateutil.tz.gettz("UTC") not being recognized as equivalent to other UTC-representing tzinfos (:issue:`39276`); see the sketch below
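A minimal sketch of the objects involved (the fixes concern pandas' internal timezone-equivalence checks, so mixing these tzinfos now behaves like a same-zone operation):

In [1]: import datetime as dt

In [2]: from dateutil import tz

In [3]: ts1 = pd.Timestamp("2021-01-01", tz=dt.timezone.utc)

In [4]: ts2 = pd.Timestamp("2021-01-01", tz=tz.gettz("UTC"))

In [5]: ts1 == ts2  # same instant; the two UTC tzinfos are now also treated as equivalent
Out[5]: True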

Numeric

Conversion

Strings

Interval

Indexing

Missing

MultiIndex

I/O

Period

Plotting

Groupby/resample/rolling

Reshaping

Sparse

ExtensionArray

Styler

Other

Contributors

.. contributors:: v1.2.5..v1.3.0