ENH: add NA scalar for missing value indicator, use in StringArray. (pandas-dev#29597)
proost · Dec 19, 2019 · a2cec79 · a2cec79
1 parent 93f38af
commit a2cec79
Showing 16 changed files with 530 additions and 40 deletions.
diff --git a/doc/source/user_guide/missing_data.rst b/doc/source/user_guide/missing_data.rst
@@ -12,10 +12,10 @@ pandas.
 .. note::
 
     The choice of using ``NaN`` internally to denote missing data was largely
-    for simplicity and performance reasons. It differs from the MaskedArray
-    approach of, for example, :mod:`scikits.timeseries`. We are hopeful that
-    NumPy will soon be able to provide a native NA type solution (similar to R)
-    performant enough to be used in pandas.
+    for simplicity and performance reasons.
+    Starting from pandas 1.0, some optional data types start experimenting
+    with a native ``NA`` scalar using a mask-based approach. See
+    :ref:`here <missing_data.NA>` for more.
 
 See the :ref:`cookbook<cookbook.missing_data>` for some advanced strategies.
 
@@ -110,7 +110,7 @@ pandas objects provide compatibility between ``NaT`` and ``NaN``.
 .. _missing.inserting:
 
 Inserting missing data
-----------------------
+~~~~~~~~~~~~~~~~~~~~~~
 
 You can insert missing values by simply assigning to containers. The
 actual missing value used will be chosen based on the dtype.
@@ -135,9 +135,10 @@ For object containers, pandas will use the value given:
    s.loc[1] = np.nan
    s
 
+.. _missing_data.calculations:
 
 Calculations with missing data
-------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Missing values propagate naturally through arithmetic operations between pandas
 objects.
@@ -771,3 +772,139 @@ the ``dtype="Int64"``.
    s
 
 See :ref:`integer_na` for more.
+
+
+.. _missing_data.NA:
+
+Experimental ``NA`` scalar to denote missing values
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. warning::
+
+   Experimental: the behaviour of ``pd.NA`` can still change without warning.
+
+.. versionadded:: 1.0.0
+
+Starting from pandas 1.0, an experimental ``pd.NA`` value (singleton) is
+available to represent scalar missing values. At this moment, it is used in
+the nullable :doc:`integer <integer_na>`, boolean and
+:ref:`dedicated string <text.types>` data types as the missing value indicator.
+
+The goal of ``pd.NA`` is to provide a "missing" indicator that can be used
+consistently across data types (instead of ``np.nan``, ``None`` or ``pd.NaT``
+depending on the data type).
+
+For example, when having missing values in a Series with the nullable integer
+dtype, it will use ``pd.NA``:
+
+.. ipython:: python
+
+    s = pd.Series([1, 2, None], dtype="Int64")
+    s
+    s[2]
+    s[2] is pd.NA
+
+Currently, pandas does not yet use those data types by default (when creating
+a DataFrame or Series, or when reading in data), so you need to specify
+the dtype explicitly.
+
+Propagation in arithmetic and comparison operations
+---------------------------------------------------
+
+In general, missing values *propagate* in operations involving ``pd.NA``. When
+one of the operands is unknown, the outcome of the operation is also unknown.
+
+For example, ``pd.NA`` propagates in arithmetic operations, similarly to
+``np.nan``:
+
+.. ipython:: python
+
+   pd.NA + 1
+   "a" * pd.NA
+
+In equality and comparison operations, ``pd.NA`` also propagates. This deviates
+from the behaviour of ``np.nan``, where comparisons with ``np.nan`` always
+return ``False``.
+
+.. ipython:: python
+
+   pd.NA == 1
+   pd.NA == pd.NA
+   pd.NA < 2.5
+
+To check whether a value is ``pd.NA``, use the :func:`isna` function rather
+than ``==``:
+
+.. ipython:: python
+
+   pd.isna(pd.NA)
+
+*Reductions* (such as the mean or the minimum) are an exception to this basic
+propagation rule: pandas defaults to skipping missing values there. See
+:ref:`above <missing_data.calculations>` for more.
+
+Logical operations
+------------------
+
+For logical operations, ``pd.NA`` follows the rules of the
+`three-valued logic <https://en.wikipedia.org/wiki/Three-valued_logic>`__ (or
+*Kleene logic*, similarly to R, SQL and Julia). Under this logic, missing
+values propagate only when it is logically required.
+
+For example, for the logical "or" operation (``|``), if one of the operands
+is ``True``, we already know the result will be ``True``, regardless of the
+other value (whether the missing value would be ``True`` or ``False``).
+In this case, ``pd.NA`` does not propagate:
+
+.. ipython:: python
+
+   True | False
+   True | pd.NA
+   pd.NA | True
+
+On the other hand, if one of the operands is ``False``, the result depends
+on the value of the other operand. Therefore, in this case ``pd.NA``
+propagates:
+
+.. ipython:: python
+
+   False | True
+   False | False
+   False | pd.NA
+
+The behaviour of the logical "and" operation (``&``) can be derived using
+similar logic (where now ``pd.NA`` will not propagate if one of the operands
+is already ``False``):
+
+.. ipython:: python
+
+   False & True
+   False & False
+   False & pd.NA
+
+.. ipython:: python
+
+   True & True
+   True & False
+   True & pd.NA
+
+
+``NA`` in a boolean context
+---------------------------
+
+Since the actual value of an NA is unknown, it is ambiguous to convert NA
+to a boolean value. The following raises an error:
+
+.. ipython:: python
+   :okexcept:
+
+   bool(pd.NA)
+
+This also means that ``pd.NA`` cannot be used in a context where it is
+evaluated to a boolean, such as ``if condition: ...`` where ``condition`` can
+potentially be ``pd.NA``. In such cases, either check for ``pd.NA`` with
+:func:`isna` first, or avoid ``condition`` being ``pd.NA`` altogether, for
+example by filling missing values beforehand.
+
+A similar situation occurs when using Series or DataFrame objects in ``if``
+statements, see :ref:`gotchas.truth`.
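
The boolean-context restriction described at the end of the user guide section above can be handled with an explicit `isna` check before the value reaches an `if`. The following is a minimal sketch, assuming pandas 1.0 or later is installed; the helper name `is_true` is hypothetical and not part of pandas, and mapping "unknown" to `False` is one possible choice, not a pandas rule.

```python
import pandas as pd


def is_true(condition):
    """Return True only if ``condition`` is not missing and is truthy.

    Hypothetical helper (not a pandas function): it checks for pd.NA with
    pd.isna before the value is evaluated in a boolean context.
    """
    if pd.isna(condition):
        return False  # assumption: treat "unknown" as "not satisfied"
    return bool(condition)


# bool(pd.NA) is ambiguous, so pandas raises a TypeError.
try:
    bool(pd.NA)
    raised = False
except TypeError:
    raised = True

s = pd.Series([1, 2, None], dtype="Int64")
# For the missing element, ``value > 1`` evaluates to pd.NA rather than False.
flags = [is_true(value > 1) for value in s]
```

Whether a missing condition should count as `False`, as `True`, or raise depends on the application; filling missing values beforehand, as the text suggests, avoids the question entirely.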
diff --git a/doc/source/whatsnew/v1.0.0.rst b/doc/source/whatsnew/v1.0.0.rst
@@ -102,6 +102,50 @@ String accessor methods returning integers will return a value with :class:`Int6
 We recommend explicitly using the ``string`` data type when working with strings.
 See :ref:`text.types` for more.
 
+.. _whatsnew_100.NA:
+
+Experimental ``NA`` scalar to denote missing values
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A new ``pd.NA`` value (singleton) is introduced to represent scalar missing
+values. Until now, ``np.nan`` has been used for this for float data, ``np.nan``
+or ``None`` for object-dtype data and ``pd.NaT`` for datetime-like data. The
+goal of ``pd.NA`` is to provide a "missing" indicator that can be used
+consistently across data types. For now, the nullable integer and boolean
+data types and the new string data type make use of ``pd.NA`` (:issue:`28095`).
+
+.. warning::
+
+   Experimental: the behaviour of ``pd.NA`` can still change without warning.
+
+For example, creating a Series using the nullable integer dtype:
+
+.. ipython:: python
+
+    s = pd.Series([1, 2, None], dtype="Int64")
+    s
+    s[2]
+
+Compared to ``np.nan``, ``pd.NA`` behaves differently in certain operations.
+In addition to arithmetic operations, ``pd.NA`` also propagates as "missing"
+or "unknown" in comparison operations:
+
+.. ipython:: python
+
+    np.nan > 1
+    pd.NA > 1
+
+For logical operations, ``pd.NA`` follows the rules of the
+`three-valued logic <https://en.wikipedia.org/wiki/Three-valued_logic>`__ (or
+*Kleene logic*). For example:
+
+.. ipython:: python
+
+    pd.NA | True
+
+For more, see the :ref:`NA section <missing_data.NA>` in the user guide on
+missing data.
+
 .. _whatsnew_100.boolean:
 
 Boolean data type with missing values support
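
The three-valued (Kleene) logic referenced in the whatsnew entry above can be modelled in a few lines of plain Python. This is an illustrative sketch only: the class `Unknown` and the name `NA` below are hypothetical stand-ins and are not the implementation added by this commit. It shows how a decisive operand (`True` for "or", `False` for "and") short-circuits the unknown, and why `True | NA` works through the reflected operator.

```python
class Unknown:
    """Hypothetical stand-in for a missing-value singleton (not pandas code)."""

    def __repr__(self):
        return "<NA>"

    def __or__(self, other):
        # A True operand decides the result; otherwise it stays unknown.
        if other is True:
            return True
        if other is False or other is self:
            return self
        return NotImplemented

    # "or" is symmetric, so ``True | NA`` (reflected call) takes the same path.
    __ror__ = __or__

    def __and__(self, other):
        # A False operand decides the result; otherwise it stays unknown.
        if other is False:
            return False
        if other is True or other is self:
            return self
        return NotImplemented

    __rand__ = __and__

    def __bool__(self):
        # Converting an unknown value to True/False would be a guess.
        raise TypeError("boolean value of an unknown is ambiguous")


NA = Unknown()
```

With this model, `True | NA` and `False & NA` return plain booleans, while `False | NA` and `True & NA` return the `NA` object itself.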

diff --git a/pandas/__init__.py b/pandas/__init__.py
@@ -70,6 +70,7 @@
     StringDtype,
     BooleanDtype,
     # missing
+    NA,
     isna,
     isnull,
     notna,

diff --git a/pandas/_libs/lib.pyx b/pandas/_libs/lib.pyx
@@ -57,7 +57,7 @@ from pandas._libs.tslibs.timedeltas cimport convert_to_timedelta64
 from pandas._libs.tslibs.timezones cimport get_timezone, tz_compare
 
 from pandas._libs.missing cimport (
-    checknull, isnaobj, is_null_datetime64, is_null_timedelta64, is_null_period
+    checknull, isnaobj, is_null_datetime64, is_null_timedelta64, is_null_period, C_NA
 )
 
 
@@ -160,6 +160,7 @@ def is_scalar(val: object) -> bool:
             or PyTime_Check(val)
             # We differ from numpy, which claims that None is not scalar;
             # see np.isscalar
+            or val is C_NA
             or val is None
             or isinstance(val, (Fraction, Number))
             or util.is_period_object(val)
@@ -1494,7 +1495,7 @@ cdef class Validator:
                                   f'must define is_value_typed')
 
     cdef bint is_valid_null(self, object value) except -1:
-        return value is None or util.is_nan(value)
+        return value is None or value is C_NA or util.is_nan(value)
 
     cdef bint is_array_typed(self) except -1:
         return False
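
The `is_scalar` hunk above adds `C_NA` to the values treated as scalars. A short check of the user-visible effect, assuming a pandas version that includes this change (1.0 or later) and using the public `pandas.api.types.is_scalar` entry point:

```python
import pandas as pd
from pandas.api.types import is_scalar

# pd.NA is reported as a scalar, alongside None and ordinary numbers;
# a list is not a scalar.
results = {
    "NA": is_scalar(pd.NA),
    "None": is_scalar(None),
    "int": is_scalar(1),
    "list": is_scalar([1, 2]),
}
```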

diff --git a/pandas/_libs/missing.pxd b/pandas/_libs/missing.pxd
@@ -9,3 +9,8 @@ cpdef ndarray[uint8_t] isnaobj(ndarray arr)
 cdef bint is_null_datetime64(v)
 cdef bint is_null_timedelta64(v)
 cdef bint is_null_period(v)
+
+cdef class C_NAType:
+    pass
+
+cdef C_NAType C_NA
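
The declarations above expose a single `C_NA` object so that Cython code such as `is_valid_null` can test for it with `is`. A rough pure-Python analogue of that pattern is sketched below. The names `NAType`, `NA`, and this `is_valid_null` are hypothetical illustrations, and the float `isnan` check is only an assumed stand-in for `util.is_nan`, whose exact behaviour is not shown in this diff.

```python
import math


class NAType:
    """Hypothetical pure-Python analogue of a singleton missing-value type."""

    _instance = None

    def __new__(cls):
        # Always hand back the same object so identity checks are reliable.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __repr__(self):
        return "NA"


NA = NAType()


def is_valid_null(value):
    """Mirror the shape of the check in the diff: None, the NA singleton, or NaN."""
    return (
        value is None
        or value is NA
        # Assumption: a float NaN test stands in for util.is_nan here.
        or (isinstance(value, float) and math.isnan(value))
    )
```

Because only one instance ever exists, `value is NA` is sufficient and avoids calling `==`, which for a propagating missing value would not return a plain boolean.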