-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: add NA scalar for missing value indicator, use in StringArray. #29597
Changes from 24 commits
03f83bd
c1797d5
3339eaa
e9d4d6a
4450d2d
1849a23
c72e3ee
3a97782
2302661
2ab592a
018399e
31290b9
33fd3e0
289c885
22de7cd
f8208db
371eeeb
1cadeda
1fcf4b7
f6798e5
14c1434
788a2c2
1bcbab2
775cdfb
589a961
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -12,10 +12,10 @@ pandas. | |||||
.. note:: | ||||||
|
||||||
The choice of using ``NaN`` internally to denote missing data was largely | ||||||
for simplicity and performance reasons. It differs from the MaskedArray | ||||||
approach of, for example, :mod:`scikits.timeseries`. We are hopeful that | ||||||
NumPy will soon be able to provide a native NA type solution (similar to R) | ||||||
performant enough to be used in pandas. | ||||||
for simplicity and performance reasons. | ||||||
Starting from pandas 1.0, some optional data types start experimenting | ||||||
with a native ``NA`` scalar using a mask-based approach. See | ||||||
:ref:`here <missing_data.NA>` for more. | ||||||
|
||||||
See the :ref:`cookbook<cookbook.missing_data>` for some advanced strategies. | ||||||
|
||||||
|
@@ -110,7 +110,7 @@ pandas objects provide compatibility between ``NaT`` and ``NaN``. | |||||
.. _missing.inserting: | ||||||
|
||||||
Inserting missing data | ||||||
---------------------- | ||||||
~~~~~~~~~~~~~~~~~~~~~~ | ||||||
|
||||||
You can insert missing values by simply assigning to containers. The | ||||||
actual missing value used will be chosen based on the dtype. | ||||||
|
@@ -135,9 +135,10 @@ For object containers, pandas will use the value given: | |||||
s.loc[1] = np.nan | ||||||
s | ||||||
|
||||||
.. _missing_data.calculations: | ||||||
|
||||||
Calculations with missing data | ||||||
------------------------------ | ||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||||||
|
||||||
Missing values propagate naturally through arithmetic operations between pandas | ||||||
objects. | ||||||
|
@@ -771,3 +772,132 @@ the ``dtype="Int64"``. | |||||
s | ||||||
|
||||||
See :ref:`integer_na` for more. | ||||||
|
||||||
|
||||||
.. _missing_data.NA: | ||||||
|
||||||
Experimental ``NA`` scalar to denote missing values | ||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||||||
|
||||||
.. warning:: | ||||||
|
||||||
Experimental: the behaviour of ``pd.NA`` can still change without warning. | ||||||
|
||||||
.. versionadded:: 1.0.0 | ||||||
|
||||||
Starting from pandas 1.0, an experimental ``pd.NA`` value (singleton) is | ||||||
available to represent scalar missing values. At this moment, it is used in | ||||||
the nullable integer and boolean data types and the dedicated string data type | ||||||
as the missing value indicator. | ||||||
|
||||||
The goal of ``pd.NA`` is provide a "missing" indicator that can be used | ||||||
consistently accross data types (instead of ``np.nan``, ``None`` or ``pd.NaT`` | ||||||
depending on the data type). | ||||||
|
||||||
For example, when having missing values in a Series with the nullable integer | ||||||
dtype, it will use ``pd.NA``: | ||||||
|
||||||
.. ipython:: python | ||||||
|
||||||
s = pd.Series([1, 2, None], dtype="Int64") | ||||||
s | ||||||
s[2] | ||||||
s[2] is pd.NA | ||||||
|
||||||
Propagation in arithmetic and comparison operations | ||||||
--------------------------------------------------- | ||||||
|
||||||
In general, missing values *propagate* in operations involving ``pd.NA``. When | ||||||
one of the operands is unknown, the outcome of the operation is also unknown. | ||||||
|
||||||
For example, ``pd.NA`` propagates in arithmetic operations, similarly to | ||||||
``np.nan``: | ||||||
|
||||||
.. ipython:: python | ||||||
|
||||||
pd.NA + 1 | ||||||
"a" * pd.NA | ||||||
|
||||||
In equality and comparison operations, ``pd.NA`` also propagates. This deviates | ||||||
from the behaviour of ``np.nan``, where comparisons with ``np.nan`` always | ||||||
return ``False``. | ||||||
|
||||||
.. ipython:: python | ||||||
|
||||||
pd.NA == 1 | ||||||
pd.NA == pd.NA | ||||||
pd.NA < 2.5 | ||||||
|
||||||
To check if a value is equal to ``pd.NA``, the :func:`isna` function can be | ||||||
used: | ||||||
|
||||||
.. ipython:: python | ||||||
|
||||||
pd.isna(pd.NA) | ||||||
|
||||||
An exception on this basic propagation rule are *reductions* (such as the | ||||||
mean or the minimum), where pandas defaults to skipping missing values. See | ||||||
:ref:`above <missing_data.calculations>` for more. | ||||||
|
||||||
Logical operations | ||||||
------------------ | ||||||
|
||||||
For logical operations, ``pd.NA`` follows the rules of the | ||||||
`three-valued logic <https://en.wikipedia.org/wiki/Three-valued_logic>`__ (or | ||||||
*Kleene logic*, similarly to R, SQL and Julia). This logic means to only | ||||||
propagate missing values when it is logically required. | ||||||
|
||||||
For example, for the logical "or" operation (``|``), if one of the operands | ||||||
is ``True``, we already know the result will be ``True``, regardless of the | ||||||
other value (so regardless the missing value would be ``True`` or ``False``). | ||||||
In this case, ``pd.NA`` does not propagate: | ||||||
|
||||||
.. ipython:: python | ||||||
|
||||||
True | False | ||||||
True | pd.NA | ||||||
pd.NA | True | ||||||
|
||||||
On the other hand, if one of the operands is ``False``, the result depends | ||||||
on the value of the other operand. Therefore, in thise case ``pd.NA`` | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. thise -> this |
||||||
propagates: | ||||||
|
||||||
.. ipython:: python | ||||||
|
||||||
False | True | ||||||
False | False | ||||||
False | pd.NA | ||||||
|
||||||
The behaviour of the logical "and" operation (``&``) can be derived using | ||||||
similar logic (where now ``pd.NA`` will not propagate if one of the operands | ||||||
is already ``False``): | ||||||
|
||||||
.. ipython:: python | ||||||
|
||||||
False & True | ||||||
False & False | ||||||
False & pd.NA | ||||||
|
||||||
.. ipython:: python | ||||||
|
||||||
True & True | ||||||
True & False | ||||||
True & pd.NA | ||||||
|
||||||
|
||||||
``NA`` in a boolean context | ||||||
--------------------------- | ||||||
|
||||||
Since the actual value of an NA is unknown, it is ambiguous to convert NA | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we have a section gotchas.truth that is very similiar, could link. |
||||||
to a boolean value. The following raises an error: | ||||||
|
||||||
.. ipython:: python | ||||||
:okexcept: | ||||||
|
||||||
bool(pd.NA) | ||||||
|
||||||
This also means that ``pd.NA`` cannot be used in a context where it is | ||||||
evaluated to a boolean, such as ``if condition: ...`` where ``condition`` can | ||||||
potentially be ``pd.NA``. In such cases, :func:`isna` can be used to check | ||||||
for ``pd.NA`` or it could be prevented that ``condition`` can be ``pd.NA`` | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
(can't apply this directly, since it wraps two lines) |
||||||
by for example filling missing values beforehand. |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -70,6 +70,7 @@ | |
StringDtype, | ||
BooleanDtype, | ||
# missing | ||
NA, | ||
isna, | ||
isnull, | ||
notna, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use commas rather than 3 ands, ideally if you can link to those sections?