BUG: Assignment of Timestamp Scalar uses microsecond precision, Series uses nano #55487

Closed
3 tasks done
Tracked by #55564
WillAyd opened this issue Oct 11, 2023 · 6 comments · Fixed by #55901
Labels
Bug Non-Nano datetime64/timedelta64 with non-nanosecond resolution

Comments

WillAyd (Member) commented Oct 11, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

```python
import pandas as pd
ts = pd.Timestamp.now()
df = pd.DataFrame({"a": [1]})
df["direct_assignment"] = ts
df["series_assignment"] = pd.Series(ts)
df.dtypes
```


yields

```
a                             int64
direct_assignment    datetime64[us]
series_assignment    datetime64[ns]
dtype: object
```

Issue Description

I was surprised to see the dtype mismatch here.

Expected Behavior

At least for backwards compatibility, we might want to make the scalar assignment still yield nanosecond resolution.
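Until the underlying behaviour is settled, one stopgap (not part of the original report, just a sketch assuming pandas 2.x) is to cast the affected columns to a common resolution explicitly after assignment:

```python
import pandas as pd

ts = pd.Timestamp.now()
df = pd.DataFrame({"a": [1]})
df["direct_assignment"] = ts
df["series_assignment"] = pd.Series(ts)

# Stopgap: force both columns to one resolution so the dtypes agree,
# regardless of which unit each assignment path inferred.
for col in ("direct_assignment", "series_assignment"):
    df[col] = df[col].astype("datetime64[ns]")

print(df.dtypes)
```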

Installed Versions

INSTALLED VERSIONS

commit : c2cd90a
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 6.2.0-33-generic
Version : #33-Ubuntu SMP PREEMPT_DYNAMIC Tue Sep 5 14:49:19 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.0dev0+341.gc2cd90ac54
numpy : 1.24.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.2.1
Cython : 0.29.33
pytest : 7.4.2
hypothesis : 6.87.1
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : 3.1.6
lxml.etree : 4.9.3
html5lib : 1.1
pymysql : 1.4.6
psycopg2 : 2.9.7
jinja2 : 3.1.2
IPython : 8.16.1
pandas_datareader : None
bs4 : 4.12.2
bottleneck : 1.3.7
dataframe-api-compat: None
fastparquet : 2023.8.0
fsspec : 2023.9.2
gcsfs : 2023.9.2
matplotlib : 3.7.3
numba : 0.57.1
numexpr : 2.8.7
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : 1.2.3
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2023.9.2
scipy : 1.11.3
sqlalchemy : 2.0.21
tables : 3.8.0
tabulate : 0.9.0
xarray : 2023.9.0
xlrd : 2.0.1
zstandard : 0.21.0
tzdata : 2023.3
qtpy : None
pyqt5 : None

@WillAyd WillAyd added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 11, 2023
@jbrockmendel jbrockmendel added the Non-Nano datetime64/timedelta64 with non-nanosecond resolution label Oct 24, 2023
behrenhoff (Contributor) commented Oct 25, 2023

I just ran into a similar issue:

```python
df1 = pd.DataFrame(data={"x": [1], "d": [pd.Timestamp("2020-01-01")]})
df2 = pd.DataFrame(data={"x": [1], "d": pd.Timestamp("2020-01-01")})
```

Question: which resulting dtypes do df1 and df2 have?

Answer:

```python
>>> df1.dtypes
x             int64
d    datetime64[ns]
dtype: object

>>> df2.dtypes
x            int64
d    datetime64[s]
dtype: object
```

...and the resulting DataFrames are incompatible: they cannot be concatenated because of the mismatched dtypes!

I think it boils down to `pd.Timestamp("2020-01-01")` deciding on an internal resolution automatically. The `unit` argument does nothing here (it is used for interpreting the input value, not for the resulting internal dtype), and there seems to be no parameter to switch the automatic inference off. So I think `Timestamp` should always be "ns" unless you specify something like `Timestamp(..., resolution="s")` explicitly; otherwise we get different, incompatible dtypes depending on the input string (which might come from external sources). The only current workaround seems to be `Timestamp("2020-01-01").as_unit("ns")`. Then my example from above works.
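The automatic inference and the `as_unit` workaround described above can be observed directly on a Timestamp's `unit` attribute (a sketch assuming pandas >= 2.0, where non-nanosecond resolutions exist):

```python
import pandas as pd

# The string "2020-01-01" needs only second resolution, so the
# constructor infers unit "s" rather than "ns".
ts = pd.Timestamp("2020-01-01")
print(ts.unit)  # "s"

# Normalising to "ns" up front makes the two construction paths agree.
ts_ns = ts.as_unit("ns")
df1 = pd.DataFrame(data={"x": [1], "d": [ts_ns]})
df2 = pd.DataFrame(data={"x": [1], "d": ts_ns})
print(df1.dtypes["d"] == df2.dtypes["d"])  # True
```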

ziadk (Contributor) commented Oct 26, 2023

Hello @WillAyd,

I would love to work on this.

I have found that the direct assignment passes through the `infer_dtype_from_scalar()` function in the `cast.py` file. Inside this function, the cast `val = val.to_datetime64()` is what gives the `us` precision.

To recap, the direct assignment follows this call trace to the problem: `DataFrame.__setitem__()` -> `DataFrame._set_item()` -> `DataFrame._sanitize_column()` -> `construction.sanitize_array()` -> `dtypes.cast.construct_1d_arraylike_from_scalar()` -> `dtypes.cast.infer_dtype_from_scalar()`. Inside this method, these lines of code are the source of our problem:

```python
elif isinstance(val, (np.datetime64, dt.datetime)):
    ...
    if val is NaT or val.tz is None:
        val = val.to_datetime64()
        dtype = val.dtype
```
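The effect of that `to_datetime64()` call can be seen in isolation: the returned `np.datetime64` keeps whatever unit the `Timestamp` itself carries, so the inferred dtype follows the scalar's unit (a sketch assuming pandas >= 2.0):

```python
import pandas as pd

# A Timestamp parsed from a date-only string carries unit "s";
# to_datetime64() preserves that unit in the resulting np.datetime64.
ts_s = pd.Timestamp("2020-01-01")
print(ts_s.to_datetime64().dtype)   # datetime64[s]

# After as_unit("ns"), the same call yields a nanosecond dtype.
ts_ns = ts_s.as_unit("ns")
print(ts_ns.to_datetime64().dtype)  # datetime64[ns]
```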

I am just beginning with this kind of open source work, so please do not hesitate to give me any guidance. I would also be happy to follow any specific guidelines you have for working on the problem.

Thank you

@jbrockmendel jbrockmendel removed the Needs Triage Issue that has not been reviewed by a pandas team member label Nov 1, 2023
davetapley (Contributor):
ValueRaider:
@davetapley How is pd.Timestamp.now() a Python datetime?

davetapley (Contributor):
@ValueRaider I'm not sure I follow?

It is literally a datetime in the sense that:

```python
class Timestamp(datetime):
```

i.e.:

```python
>>> import pandas as pd
>>> from datetime import datetime

>>> isinstance(pd.Timestamp.now(), datetime)
True
```

Re: my specific linking of #55014 as a possible dupe: the issues are linked because they both have the same symptom, as identified in #55014 (comment):

  • for scalars, the resolution is preserved (so for stdlib datetime, it becomes 'us', because that's the resolution of the python stdlib)
  • for a list, the resolution is 'ns' by default

ValueRaider commented Dec 30, 2023

@davetapley It does appear similar, but my concern is that that thread is handling the bug as a low-priority edge case. I think a conversation is needed regarding the expected behaviour in pandas 2 when instantiating a DataFrame with columns of type dt.datetime.

That this happens using pure Pandas API should raise the urgency.
