ENH: add NA scalar for missing value indicator, use in StringArray. (p…
jorisvandenbossche authored and proost committed Dec 19, 2019
1 parent 93f38af commit a2cec79
Showing 16 changed files with 530 additions and 40 deletions.
149 changes: 143 additions & 6 deletions doc/source/user_guide/missing_data.rst
@@ -12,10 +12,10 @@ pandas.
.. note::

The choice of using ``NaN`` internally to denote missing data was largely
for simplicity and performance reasons. It differs from the MaskedArray
approach of, for example, :mod:`scikits.timeseries`. We are hopeful that
NumPy will soon be able to provide a native NA type solution (similar to R)
performant enough to be used in pandas.
for simplicity and performance reasons.
Starting from pandas 1.0, some optional data types start experimenting
with a native ``NA`` scalar using a mask-based approach. See
:ref:`here <missing_data.NA>` for more.

See the :ref:`cookbook<cookbook.missing_data>` for some advanced strategies.

@@ -110,7 +110,7 @@ pandas objects provide compatibility between ``NaT`` and ``NaN``.
.. _missing.inserting:

Inserting missing data
----------------------
~~~~~~~~~~~~~~~~~~~~~~

You can insert missing values by simply assigning to containers. The
actual missing value used will be chosen based on the dtype.
@@ -135,9 +135,10 @@ For object containers, pandas will use the value given:
s.loc[1] = np.nan
s
.. _missing_data.calculations:

Calculations with missing data
------------------------------
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Missing values propagate naturally through arithmetic operations between pandas
objects.
@@ -771,3 +772,139 @@ the ``dtype="Int64"``.
s
See :ref:`integer_na` for more.


.. _missing_data.NA:

Experimental ``NA`` scalar to denote missing values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::

Experimental: the behaviour of ``pd.NA`` can still change without warning.

.. versionadded:: 1.0.0

Starting from pandas 1.0, an experimental ``pd.NA`` value (singleton) is
available to represent scalar missing values. At this moment, it is used in
the nullable :doc:`integer <integer_na>`, boolean and
:ref:`dedicated string <text.types>` data types as the missing value indicator.

The goal of ``pd.NA`` is to provide a "missing" indicator that can be used
consistently across data types (instead of ``np.nan``, ``None`` or ``pd.NaT``
depending on the data type).

For example, when having missing values in a Series with the nullable integer
dtype, it will use ``pd.NA``:

.. ipython:: python

   s = pd.Series([1, 2, None], dtype="Int64")
   s
   s[2]
   s[2] is pd.NA

Currently, pandas does not yet use those data types by default (when creating
a DataFrame or Series, or when reading in data), so you need to specify
the dtype explicitly.
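
As an illustrative sketch of the point above, the following compares the
dtype pandas infers on its own with the explicitly requested nullable dtype.
The variable names ``default`` and ``nullable`` are chosen here for the
example and do not appear in the original documentation.

```python
import pandas as pd

# Without an explicit dtype, the missing value forces a float column with NaN.
default = pd.Series([1, 2, None])

# Requesting the nullable integer dtype keeps integers and stores pd.NA.
nullable = pd.Series([1, 2, None], dtype="Int64")

print(default.dtype)         # float64
print(nullable.dtype)        # Int64
print(nullable[2] is pd.NA)  # True
```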

Propagation in arithmetic and comparison operations
---------------------------------------------------

In general, missing values *propagate* in operations involving ``pd.NA``. When
one of the operands is unknown, the outcome of the operation is also unknown.

For example, ``pd.NA`` propagates in arithmetic operations, similarly to
``np.nan``:

.. ipython:: python

   pd.NA + 1
   "a" * pd.NA

In equality and comparison operations, ``pd.NA`` also propagates. This deviates
from the behaviour of ``np.nan``, where comparisons with ``np.nan`` always
return ``False``.

.. ipython:: python

   pd.NA == 1
   pd.NA == pd.NA
   pd.NA < 2.5

To check if a value is equal to ``pd.NA``, the :func:`isna` function can be
used:

.. ipython:: python

   pd.isna(pd.NA)

An exception to this basic propagation rule is *reductions* (such as the
mean or the minimum), where pandas skips missing values by default. See
:ref:`above <missing_data.calculations>` for more.
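
To make the propagation rule and the reduction exception concrete, here is a
toy model in plain Python. It is only a sketch of the semantics described
above, not the pandas implementation: the ``_NAType`` class, the ``NA``
object and the ``mean_skipna`` helper are hypothetical names invented for
this illustration.

```python
class _NAType:
    """Toy missing-value singleton (illustration only, not pandas code)."""

    _instance = None

    def __new__(cls):
        # Always hand back the same object so identity checks work.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __repr__(self):
        return "NA"

    def _propagate(self, other):
        # Any operation involving an unknown value is itself unknown.
        return self

    __add__ = __radd__ = __mul__ = __rmul__ = _propagate
    __eq__ = __lt__ = __gt__ = _propagate
    __hash__ = object.__hash__


NA = _NAType()


def mean_skipna(values):
    """Mean of the known values; NA when no value is known."""
    known = [v for v in values if v is not NA]
    return sum(known) / len(known) if known else NA
```

With this model, ``NA + 1`` and ``NA == NA`` both evaluate to ``NA``, while
``mean_skipna([1, NA, 3])`` skips the missing entry, mirroring the default
behaviour of reductions.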

Logical operations
------------------

For logical operations, ``pd.NA`` follows the rules of
`three-valued logic <https://en.wikipedia.org/wiki/Three-valued_logic>`__ (or
*Kleene logic*, similarly to R, SQL and Julia). Under this logic, missing
values propagate only when it is logically required.

For example, for the logical "or" operation (``|``), if one of the operands
is ``True``, we already know the result will be ``True``, regardless of the
other value (so regardless of whether the missing value would be ``True`` or
``False``). In this case, ``pd.NA`` does not propagate:

.. ipython:: python

   True | False
   True | pd.NA
   pd.NA | True

On the other hand, if one of the operands is ``False``, the result depends
on the value of the other operand. Therefore, in this case ``pd.NA``
propagates:

.. ipython:: python

   False | True
   False | False
   False | pd.NA

The behaviour of the logical "and" operation (``&``) can be derived using
similar logic (where now ``pd.NA`` will not propagate if one of the operands
is already ``False``):

.. ipython:: python

   False & True
   False & False
   False & pd.NA

.. ipython:: python

   True & True
   True & False
   True & pd.NA

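
The rules for "or" and "and" can be summarised in a few lines of plain
Python. This is a minimal sketch of Kleene logic, not how pandas implements
it; the ``NA`` sentinel and the ``kleene_or`` / ``kleene_and`` functions are
hypothetical names used only for this example.

```python
NA = object()  # hypothetical stand-in for an unknown truth value


def kleene_or(a, b):
    """Three-valued 'or' over True, False and NA."""
    # A single True decides the result, whatever the other operand is.
    if a is True or b is True:
        return True
    # Otherwise an unknown operand leaves the result unknown.
    if a is NA or b is NA:
        return NA
    return False


def kleene_and(a, b):
    """Three-valued 'and' over True, False and NA."""
    # A single False decides the result, whatever the other operand is.
    if a is False or b is False:
        return False
    if a is NA or b is NA:
        return NA
    return True
```

For instance, ``kleene_or(True, NA)`` is ``True`` and ``kleene_and(True, NA)``
is ``NA``, matching the outputs of the ``pd.NA`` examples above.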
``NA`` in a boolean context
---------------------------

Since the actual value of an NA is unknown, it is ambiguous to convert NA
to a boolean value. The following raises an error:

.. ipython:: python
   :okexcept:

   bool(pd.NA)

This also means that ``pd.NA`` cannot be used in a context where it is
evaluated to a boolean, such as ``if condition: ...`` where ``condition`` can
potentially be ``pd.NA``. In such cases, either use :func:`isna` to check
for ``pd.NA`` first, or avoid ``condition`` being ``pd.NA`` altogether, for
example by filling missing values beforehand.

A similar situation occurs when using Series or DataFrame objects in ``if``
statements, see :ref:`gotchas.truth`.
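
The two workarounds mentioned above might look as follows. This is a hedged
sketch: ``describe_flag`` and the sample ``flags`` Series are hypothetical and
exist only for this illustration, while ``pd.isna`` and ``Series.fillna`` are
the regular pandas functions.

```python
import pandas as pd


def describe_flag(condition):
    """Return a label for a flag that may be True, False or pd.NA."""
    # Check for the missing value first: `if pd.NA:` would raise TypeError.
    if pd.isna(condition):
        return "unknown"
    return "yes" if condition else "no"


flags = pd.Series([True, None, False], dtype="boolean")

# Option 1: test each value with pd.isna before using it as a condition.
labels = [describe_flag(flag) for flag in flags]

# Option 2: fill the missing values beforehand so no pd.NA remains.
filled = flags.fillna(False)
```

Which option fits depends on whether "unknown" is meaningful downstream or
can safely be treated as ``False``.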
44 changes: 44 additions & 0 deletions doc/source/whatsnew/v1.0.0.rst
@@ -102,6 +102,50 @@ String accessor methods returning integers will return a value with :class:`Int6
We recommend explicitly using the ``string`` data type when working with strings.
See :ref:`text.types` for more.

.. _whatsnew_100.NA:

Experimental ``NA`` scalar to denote missing values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A new ``pd.NA`` value (singleton) is introduced to represent scalar missing
values. Until now, ``np.nan`` was used for this for float data, ``np.nan`` or
``None`` for object-dtype data and ``pd.NaT`` for datetime-like data. The
goal of ``pd.NA`` is to provide a "missing" indicator that can be used
consistently across data types. For now, the nullable integer and boolean
data types and the new string data type make use of ``pd.NA`` (:issue:`28095`).

.. warning::

Experimental: the behaviour of ``pd.NA`` can still change without warning.

For example, creating a Series using the nullable integer dtype:

.. ipython:: python

   s = pd.Series([1, 2, None], dtype="Int64")
   s
   s[2]

Compared to ``np.nan``, ``pd.NA`` behaves differently in certain operations.
In addition to arithmetic operations, ``pd.NA`` also propagates as "missing"
or "unknown" in comparison operations:

.. ipython:: python

   np.nan > 1
   pd.NA > 1

For logical operations, ``pd.NA`` follows the rules of the
`three-valued logic <https://en.wikipedia.org/wiki/Three-valued_logic>`__ (or
*Kleene logic*). For example:

.. ipython:: python

   pd.NA | True

For more, see the :ref:`NA section <missing_data.NA>` in the user guide on
missing data.

.. _whatsnew_100.boolean:

Boolean data type with missing values support
1 change: 1 addition & 0 deletions pandas/__init__.py
@@ -70,6 +70,7 @@
StringDtype,
BooleanDtype,
# missing
NA,
isna,
isnull,
notna,
5 changes: 3 additions & 2 deletions pandas/_libs/lib.pyx
@@ -57,7 +57,7 @@ from pandas._libs.tslibs.timedeltas cimport convert_to_timedelta64
from pandas._libs.tslibs.timezones cimport get_timezone, tz_compare

from pandas._libs.missing cimport (
checknull, isnaobj, is_null_datetime64, is_null_timedelta64, is_null_period
checknull, isnaobj, is_null_datetime64, is_null_timedelta64, is_null_period, C_NA
)


@@ -160,6 +160,7 @@ def is_scalar(val: object) -> bool:
or PyTime_Check(val)
# We differ from numpy, which claims that None is not scalar;
# see np.isscalar
or val is C_NA
or val is None
or isinstance(val, (Fraction, Number))
or util.is_period_object(val)
@@ -1494,7 +1495,7 @@ cdef class Validator:
f'must define is_value_typed')

cdef bint is_valid_null(self, object value) except -1:
return value is None or util.is_nan(value)
return value is None or value is C_NA or util.is_nan(value)

cdef bint is_array_typed(self) except -1:
return False
5 changes: 5 additions & 0 deletions pandas/_libs/missing.pxd
@@ -9,3 +9,8 @@ cpdef ndarray[uint8_t] isnaobj(ndarray arr)
cdef bint is_null_datetime64(v)
cdef bint is_null_timedelta64(v)
cdef bint is_null_period(v)

cdef class C_NAType:
pass

cdef C_NAType C_NA