ENH: add NA scalar for missing value indicator, use in StringArray. #29597

Merged
merged 25 commits into from
Dec 1, 2019
Commits
03f83bd
ENH: add NA scalar for missing value indicator
jorisvandenbossche Nov 12, 2019
c1797d5
add np.nan to arithmetic/comparison tests
jorisvandenbossche Nov 13, 2019
3339eaa
use id(self) for hash
jorisvandenbossche Nov 13, 2019
e9d4d6a
fix api test
jorisvandenbossche Nov 13, 2019
4450d2d
move to cython
jorisvandenbossche Nov 13, 2019
1849a23
add examples to isna/notna docstring
jorisvandenbossche Nov 14, 2019
c72e3ee
Use NA scalar in string dtype (#1)
TomAugspurger Nov 14, 2019
3a97782
Merge remote-tracking branch 'upstream/master' into NA-scalar
jorisvandenbossche Nov 14, 2019
2302661
fix doctest
jorisvandenbossche Nov 14, 2019
2ab592a
small edits
jorisvandenbossche Nov 14, 2019
018399e
fix NA in repr
jorisvandenbossche Nov 15, 2019
31290b9
Merge remote-tracking branch 'upstream/master' into NA-scalar
TomAugspurger Nov 19, 2019
33fd3e0
remove redundant test
TomAugspurger Nov 19, 2019
289c885
remove dead code
TomAugspurger Nov 19, 2019
22de7cd
Merge remote-tracking branch 'upstream/master' into NA-scalar
jorisvandenbossche Nov 20, 2019
f8208db
fix divmod
jorisvandenbossche Nov 21, 2019
371eeeb
Merge remote-tracking branch 'upstream/master' into NA-scalar
jorisvandenbossche Nov 21, 2019
1cadeda
Merge remote-tracking branch 'upstream/master' into NA-scalar
jorisvandenbossche Nov 25, 2019
1fcf4b7
NA -> C_NA
jorisvandenbossche Nov 25, 2019
f6798e5
start some docs
jorisvandenbossche Nov 26, 2019
14c1434
futher doc updates
jorisvandenbossche Nov 27, 2019
788a2c2
Merge remote-tracking branch 'upstream/master' into NA-scalar
jorisvandenbossche Nov 27, 2019
1bcbab2
doc fixup
jorisvandenbossche Nov 27, 2019
775cdfb
Merge remote-tracking branch 'upstream/master' into NA-scalar
jorisvandenbossche Nov 27, 2019
589a961
further doc updates
jorisvandenbossche Nov 28, 2019
149 changes: 143 additions & 6 deletions doc/source/user_guide/missing_data.rst
@@ -12,10 +12,10 @@ pandas.
.. note::

    The choice of using ``NaN`` internally to denote missing data was largely
-   for simplicity and performance reasons. It differs from the MaskedArray
-   approach of, for example, :mod:`scikits.timeseries`. We are hopeful that
-   NumPy will soon be able to provide a native NA type solution (similar to R)
-   performant enough to be used in pandas.
+   for simplicity and performance reasons.
+   Starting from pandas 1.0, some optional data types start experimenting
+   with a native ``NA`` scalar using a mask-based approach. See
+   :ref:`here <missing_data.NA>` for more.

See the :ref:`cookbook<cookbook.missing_data>` for some advanced strategies.

@@ -110,7 +110,7 @@ pandas objects provide compatibility between ``NaT`` and ``NaN``.
.. _missing.inserting:

Inserting missing data
-----------------------
+~~~~~~~~~~~~~~~~~~~~~~

You can insert missing values by simply assigning to containers. The
actual missing value used will be chosen based on the dtype.
@@ -135,9 +135,10 @@ For object containers, pandas will use the value given:
s.loc[1] = np.nan
s

.. _missing_data.calculations:

Calculations with missing data
-------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Missing values propagate naturally through arithmetic operations between pandas
objects.
@@ -771,3 +772,139 @@ the ``dtype="Int64"``.
s

See :ref:`integer_na` for more.


.. _missing_data.NA:

Experimental ``NA`` scalar to denote missing values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::

   Experimental: the behaviour of ``pd.NA`` can still change without warning.

.. versionadded:: 1.0.0

Starting from pandas 1.0, an experimental ``pd.NA`` value (singleton) is
available to represent scalar missing values. At this moment, it is used in
the nullable :doc:`integer <integer_na>`, boolean and
:ref:`dedicated string <text.types>` data types as the missing value indicator.

The goal of ``pd.NA`` is to provide a "missing" indicator that can be used
consistently across data types (instead of ``np.nan``, ``None`` or ``pd.NaT``
depending on the data type).

For example, when having missing values in a Series with the nullable integer
dtype, it will use ``pd.NA``:

.. ipython:: python

   s = pd.Series([1, 2, None], dtype="Int64")
   s
   s[2]
   s[2] is pd.NA

Currently, pandas does not yet use those data types by default (when creating
a DataFrame or Series, or when reading in data), so you need to specify
the dtype explicitly.
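
As a minimal sketch of what specifying the dtype explicitly means in practice
(assuming pandas 1.0 or later; the variable names are illustrative only), the
same input yields a different missing-value marker depending on whether the
nullable dtype is requested:

```python
import numpy as np
import pandas as pd

values = [1, 2, None]

# Default inference: None becomes np.nan and the integers are upcast to float64.
s_default = pd.Series(values)

# Explicit nullable integer dtype: the integers are kept and None becomes pd.NA.
s_nullable = pd.Series(values, dtype="Int64")

print(s_default.dtype, np.isnan(s_default[2]))
print(s_nullable.dtype, s_nullable[2] is pd.NA)
```

The default path is unchanged by this pull request; only the opt-in dtypes use
``pd.NA``.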

Propagation in arithmetic and comparison operations
---------------------------------------------------

In general, missing values *propagate* in operations involving ``pd.NA``. When
one of the operands is unknown, the outcome of the operation is also unknown.

For example, ``pd.NA`` propagates in arithmetic operations, similarly to
``np.nan``:

.. ipython:: python

   pd.NA + 1
   "a" * pd.NA

In equality and comparison operations, ``pd.NA`` also propagates. This deviates
from the behaviour of ``np.nan``, where comparisons with ``np.nan`` always
return ``False``.

.. ipython:: python

   pd.NA == 1
   pd.NA == pd.NA
   pd.NA < 2.5

To check whether a value is ``pd.NA``, the :func:`isna` function can be
used:

.. ipython:: python

   pd.isna(pd.NA)
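
The following sketch (assuming pandas 1.0 or later) contrasts the ways of
testing for a missing value. Because ``pd.NA == pd.NA`` itself evaluates to
``pd.NA``, an equality test cannot detect missingness, whereas an identity
check against the singleton or :func:`isna` can:

```python
import numpy as np
import pandas as pd

value = pd.NA

# Equality propagates: the result is pd.NA, not True.
eq_result = value == pd.NA

# np.nan, by contrast, compares unequal to everything, including itself.
nan_eq_result = np.nan == np.nan

# Reliable checks: identity with the singleton, or pd.isna.
is_missing_identity = value is pd.NA
is_missing_isna = pd.isna(value)

print(eq_result, nan_eq_result, is_missing_identity, is_missing_isna)
```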

*Reductions* (such as the mean or the minimum) are an exception to this basic
propagation rule: there, pandas defaults to skipping missing values. See
:ref:`above <missing_data.calculations>` for more.
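
A small sketch of the difference between element-wise propagation and
reductions, assuming pandas 1.0 or later. The exact sentinel returned by a
reduction with ``skipna=False`` may differ between pandas versions, so the
example only checks it with :func:`isna`:

```python
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

# Element-wise arithmetic propagates the missing value ...
shifted = s + 1

# ... whereas reductions skip it by default (skipna=True).
total = s.sum()
average = s.mean()

# With skipna=False the missing value is no longer skipped,
# so the result of the reduction is itself missing.
total_strict = s.sum(skipna=False)

print(shifted)
print(total, average, pd.isna(total_strict))
```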

Logical operations
------------------

For logical operations, ``pd.NA`` follows the rules of
`three-valued logic <https://en.wikipedia.org/wiki/Three-valued_logic>`__ (or
*Kleene logic*, similarly to R, SQL and Julia). Under this logic, missing
values only propagate when it is logically required.

For example, for the logical "or" operation (``|``), if one of the operands
is ``True``, we already know the result will be ``True``, regardless of the
other value (that is, regardless of whether the missing value would be
``True`` or ``False``). In this case, ``pd.NA`` does not propagate:

.. ipython:: python

   True | False
   True | pd.NA
   pd.NA | True

On the other hand, if one of the operands is ``False``, the result depends
on the value of the other operand. Therefore, in this case ``pd.NA``
propagates:

.. ipython:: python

   False | True
   False | False
   False | pd.NA

The behaviour of the logical "and" operation (``&``) can be derived using
similar logic (where now ``pd.NA`` will not propagate if one of the operands
is already ``False``):

.. ipython:: python

   False & True
   False & False
   False & pd.NA

.. ipython:: python

   True & True
   True & False
   True & pd.NA
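
The same rules can be seen element-wise on a nullable boolean array. This is a
sketch that assumes pandas 1.0 or later, where the ``"boolean"`` dtype
implements Kleene logic; the variable names are illustrative:

```python
import pandas as pd

# A nullable boolean array holding True, False and a missing value.
a = pd.array([True, False, None], dtype="boolean")

or_true = a | True     # True decides the result: no NA remains
or_false = a | False   # result depends on the other operand: NA propagates
and_false = a & False  # False decides the result: no NA remains
and_true = a & True    # result depends on the other operand: NA propagates

for name, result in [("a | True", or_true), ("a | False", or_false),
                     ("a & False", and_false), ("a & True", and_true)]:
    print(name, "->", list(result))
```

Only the two operations whose outcome genuinely depends on the unknown operand
keep a missing value in the last position.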


``NA`` in a boolean context
---------------------------

[Review comment, Contributor: I think we have a section gotchas.truth that is very similar, could link.]

Since the actual value of an NA is unknown, it is ambiguous to convert NA
to a boolean value. The following raises an error:

.. ipython:: python
   :okexcept:

   bool(pd.NA)

This also means that ``pd.NA`` cannot be used in a context where it is
evaluated to a boolean, such as ``if condition: ...`` where ``condition`` can
potentially be ``pd.NA``. In such cases, :func:`isna` can be used to check
for ``pd.NA``. Alternatively, ``condition`` being ``pd.NA`` can be avoided
altogether, for example by filling missing values beforehand.

A similar situation occurs when using Series or DataFrame objects in ``if``
statements, see :ref:`gotchas.truth`.
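
Both workarounds can be sketched as follows, assuming pandas 1.0 or later.
The ``describe`` helper is hypothetical and only handles scalar inputs:

```python
import pandas as pd


def describe(condition):
    """Hypothetical helper: branch on a scalar that may be pd.NA."""
    # Check for missing first; `if condition:` alone would raise for pd.NA.
    if pd.isna(condition):
        return "unknown"
    return "yes" if condition else "no"


error_message = None
try:
    bool(pd.NA)
except TypeError as exc:
    # Converting pd.NA to a boolean is ambiguous and raises TypeError.
    error_message = str(exc)

print("bool(pd.NA) raised:", error_message)
print(describe(True), describe(False), describe(pd.NA))

# Alternative: remove the ambiguity up front by filling missing values.
flags = pd.Series(pd.array([True, None, False], dtype="boolean"))
filled = flags.fillna(False)
print(list(filled))
```

Which default to fill with (``False`` here) is a modelling decision rather
than something pandas can choose for you.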
44 changes: 44 additions & 0 deletions doc/source/whatsnew/v1.0.0.rst
@@ -102,6 +102,50 @@ String accessor methods returning integers will return a value with :class:`Int6
We recommend explicitly using the ``string`` data type when working with strings.
See :ref:`text.types` for more.

.. _whatsnew_100.NA:

Experimental ``NA`` scalar to denote missing values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A new ``pd.NA`` value (singleton) is introduced to represent scalar missing
values. Until now, ``np.nan`` was used for float data, ``np.nan`` or
``None`` for object-dtype data and ``pd.NaT`` for datetime-like data. The
goal of ``pd.NA`` is to provide a "missing" indicator that can be used
consistently across data types. For now, the nullable integer and boolean
data types and the new string data type make use of ``pd.NA`` (:issue:`28095`).

.. warning::

   Experimental: the behaviour of ``pd.NA`` can still change without warning.

For example, creating a Series using the nullable integer dtype:

.. ipython:: python

   s = pd.Series([1, 2, None], dtype="Int64")
   s
   s[2]

Compared to ``np.nan``, ``pd.NA`` behaves differently in certain operations.
In addition to arithmetic operations, ``pd.NA`` also propagates as "missing"
or "unknown" in comparison operations:

.. ipython:: python

   np.nan > 1
   pd.NA > 1

For logical operations, ``pd.NA`` follows the rules of the
`three-valued logic <https://en.wikipedia.org/wiki/Three-valued_logic>`__ (or
*Kleene logic*). For example:

.. ipython:: python

   pd.NA | True

For more, see the :ref:`NA section <missing_data.NA>` in the user guide on
missing data.

.. _whatsnew_100.boolean:

Boolean data type with missing values support
1 change: 1 addition & 0 deletions pandas/__init__.py
@@ -70,6 +70,7 @@
StringDtype,
BooleanDtype,
# missing
NA,
isna,
isnull,
notna,
5 changes: 3 additions & 2 deletions pandas/_libs/lib.pyx
@@ -58,7 +58,7 @@ from pandas._libs.tslibs.timedeltas cimport convert_to_timedelta64
from pandas._libs.tslibs.timezones cimport get_timezone, tz_compare

from pandas._libs.missing cimport (
-    checknull, isnaobj, is_null_datetime64, is_null_timedelta64, is_null_period
+    checknull, isnaobj, is_null_datetime64, is_null_timedelta64, is_null_period, C_NA
)


@@ -161,6 +161,7 @@ def is_scalar(val: object) -> bool:
or PyTime_Check(val)
# We differ from numpy, which claims that None is not scalar;
# see np.isscalar
or val is C_NA
or val is None
or isinstance(val, (Fraction, Number))
or util.is_period_object(val)
@@ -1502,7 +1503,7 @@ cdef class Validator:
f'must define is_value_typed')

cdef bint is_valid_null(self, object value) except -1:
-        return value is None or util.is_nan(value)
+        return value is None or value is C_NA or util.is_nan(value)

cdef bint is_array_typed(self) except -1:
return False
5 changes: 5 additions & 0 deletions pandas/_libs/missing.pxd
@@ -9,3 +9,8 @@ cpdef ndarray[uint8_t] isnaobj(ndarray arr)
cdef bint is_null_datetime64(v)
cdef bint is_null_timedelta64(v)
cdef bint is_null_period(v)

cdef class C_NAType:
    pass

cdef C_NAType C_NA
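
From Python, the effect of the ``pandas/__init__.py`` and ``is_scalar``
changes above can be checked with a short sketch like the following. It
assumes pandas 1.0 or later, where ``NA`` is exported at the top level, and
uses the public ``pandas.api.types.is_scalar`` wrapper rather than the Cython
module directly:

```python
import pandas as pd
from pandas.api.types import is_scalar

# NA is exposed at the top level of the pandas namespace.
na = pd.NA

# With the `val is C_NA` check in is_scalar, NA counts as a scalar,
# just like None does.
na_is_scalar = is_scalar(na)
none_is_scalar = is_scalar(None)

# isna and notna both treat it as a missing value.
print(na_is_scalar, none_is_scalar, pd.isna(na), pd.notna(na))
```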