
BUG: DataFrame constructor raises error if specify tz dtype dtype='datetime64[ns, UTC]' #12513

Closed · BranYang opened this issue Mar 2, 2016 · 13 comments · Fixed by #30507

Labels: Bug, Constructors, Dtype Conversions, Reshaping, Timezones

Comments

BranYang (Contributor) commented Mar 2, 2016

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
array_dim2 = np.arange(10).reshape((5, 2))
df = pd.DataFrame(array_dim2 , dtype='datetime64[ns, UTC]') # doesn't work

The error:

TypeError                                 Traceback (most recent call last)
<ipython-input-4-7101cf798aa3> in <module>()
----> 1 df = pd.DataFrame(array_dim2 , dtype='datetime64[ns, UTC]')

C:\D\Projects\Github\pandas\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
    252             else:
    253                 mgr = self._init_ndarray(data, index, columns, dtype=dtype,
--> 254                                          copy=copy)
    255         elif isinstance(data, (list, types.GeneratorType)):
    256             if isinstance(data, types.GeneratorType):

C:\D\Projects\Github\pandas\pandas\core\frame.py in _init_ndarray(self, values, index, columns, dtype, copy)
    412
    413         if dtype is not None:
--> 414             if values.dtype != dtype:
    415                 try:
    416                     values = values.astype(dtype)

TypeError: data type not understood

Expected Output

In [5]: df = pd.DataFrame(array_dim2 , dtype='datetime64[ns, UTC]')

In [6]: df
Out[6]:
                              0                                           1
0 1970-01-01 00:00:00.000000000+00:00 1970-01-01 00:00:00.000000001+00:00
1 1970-01-01 00:00:00.000000002+00:00 1970-01-01 00:00:00.000000003+00:00
2 1970-01-01 00:00:00.000000004+00:00 1970-01-01 00:00:00.000000005+00:00
3 1970-01-01 00:00:00.000000006+00:00 1970-01-01 00:00:00.000000007+00:00
4 1970-01-01 00:00:00.000000008+00:00 1970-01-01 00:00:00.000000009+00:00
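
For pandas versions where the constructor call above still raises, a minimal workaround sketch that reproduces the expected output (assuming the integers are meant as nanoseconds since the epoch):

import numpy as np
import pandas as pd

array_dim2 = np.arange(10).reshape((5, 2))

# Cast the integers to naive datetime64[ns] first, then localize each column to UTC
df = pd.DataFrame(array_dim2.astype('datetime64[ns]')).apply(
    lambda col: col.dt.tz_localize('UTC')
)

Each column then carries the datetime64[ns, UTC] dtype shown in the expected output.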

output of pd.show_versions()

python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.0rc1+66.gce3ac93
nose: 1.3.7
pip: 8.0.2
setuptools: 19.2
Cython: 0.23.4
numpy: 1.10.1
scipy: None
statsmodels: None
xarray: None
IPython: 4.0.2
sphinx: 1.3.1
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: 2.3.3
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: None
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
BranYang (Contributor, Author) commented Mar 2, 2016

Trying to look into this.

@jreback added the Bug, Reshaping, Dtype Conversions, and Timezones labels on Mar 3, 2016
@jreback added this to the 0.18.1 milestone on Mar 3, 2016
@jreback modified the milestones: 0.18.1, 0.18.2 on Apr 25, 2016
@jorisvandenbossche modified the milestones: Next Major Release, 0.19.0 on Aug 21, 2016
John-Boik commented
Is there any workaround for the `dtype='datetime64[ns, UTC]'` problem? Any suggestions?

jreback (Contributor) commented Oct 5, 2017

what are you trying to do?

jreback (Contributor) commented Oct 5, 2017

This is a reasonable way to deal with this:

In [15]: array_dim2 = np.arange(10).reshape((5, 2))
    ...: df = pd.DataFrame(array_dim2)

In [16]: df
Out[16]: 
   0  1
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9

In [17]: df.apply(lambda x: pd.to_datetime(x, unit='D').dt.tz_localize('UTC'))
Out[17]: 
                          0                         1
0 1970-01-01 00:00:00+00:00 1970-01-02 00:00:00+00:00
1 1970-01-03 00:00:00+00:00 1970-01-04 00:00:00+00:00
2 1970-01-05 00:00:00+00:00 1970-01-06 00:00:00+00:00
3 1970-01-07 00:00:00+00:00 1970-01-08 00:00:00+00:00
4 1970-01-09 00:00:00+00:00 1970-01-10 00:00:00+00:00

In [18]: df.apply(lambda x: pd.to_datetime(x, unit='D').dt.tz_localize('UTC')).dtypes
Out[18]: 
0    datetime64[ns, UTC]
1    datetime64[ns, UTC]
dtype: object

John-Boik commented Oct 5, 2017

Thanks. I see the error when using the Ibis framework, when I query a table that has null values in a timestamp-with-timezone field. I did use something like that as a fix, but it was very slow on queries against large tables.

jreback (Contributor) commented Oct 5, 2017

@John-Boik that doesn't make sense: the apply is only iterating over the columns, so it should not be slow unless you have millions of columns (which would be completely non-performant anyhow).

John-Boik commented
The error occurs within Ibis, which calls pandas, which raises the error in ~lib/python3.5/site-packages/pandas/core/internals.py, near line 573: `dtype = np.dtype(dtype)`. The error is something like "dtype not understood". If I change the database field to timestamp without timezone, the error is not raised. Nor is it raised if the values are non-null. I am now using an older version of Ibis, where the error is not raised.

John-Boik commented
My crude fix was:

            # Substitute a plain numpy datetime64[ns] dtype for the tz-aware string;
            # the reshape/astype dance is just a roundabout np.dtype('datetime64[ns]')
            if dtype == 'datetime64[ns, UTC]':
                dtype = np.arange(2).reshape((1, 2)).astype('datetime64[ns]').dtype
            else:
                assert False

But as I said, it was too slow to work for big tables.
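
For context, a minimal sketch of the dtype-parsing failure that both the traceback above and this crude fix are working around: NumPy's dtype constructor cannot parse the tz-aware string, while pandas' own resolver, pd.api.types.pandas_dtype, returns the DatetimeTZDtype extension dtype.

import numpy as np
import pandas as pd

# numpy has no notion of a timezone-aware datetime dtype, so parsing the string fails
try:
    np.dtype('datetime64[ns, UTC]')
except (TypeError, ValueError) as exc:  # "data type not understood" on the versions above
    print('numpy:', exc)

# pandas' resolver understands the string and returns DatetimeTZDtype('ns', 'UTC')
print('pandas:', pd.api.types.pandas_dtype('datetime64[ns, UTC]'))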

@TomAugspurger changed the title from "BUG: Construct DataFrame raise error if specify dtype='datetime64[ns, UTC]'" to "BUG: DataFrame constructor raises error if specify tz dtype dtype='datetime64[ns, UTC]'" on Apr 27, 2018
zhuoqiang commented Jan 2, 2019

Pandas also fails on view() with a tz dtype:

import pandas as pd

df = pd.DataFrame({'a': pd.date_range('2018-01-01', '2018-01-03', tz='Asia/Shanghai')})
da = df['a'].view('int64')
da.view(df['a'].dtype)

will generate TypeError: data type not understood

Traceback (most recent call last)
<ipython-input-62-58aa88ef59a7> in <module>
      3 df = pd.DataFrame({'a': pd.date_range('2018-01-01', '2018-01-03', tz='Asia/Shanghai')})
      4 da = df['a'].view('int64')
----> 5 da.view(df['a'].dtype)

~/python3.7/site-packages/pandas/core/series.py in view(self, dtype)
    632         dtype: int8
    633         """
--> 634         return self._constructor(self._values.view(dtype),
    635                                  index=self.index).__finalize__(self)
    636 

TypeError: data type not understood

I have to use the following view_as() to make it work:

def view_as(s, dtype):
    try:
        return s.view(dtype)
    except TypeError as e:
        if isinstance(dtype, str):
            dtype = pd.core.dtypes.dtypes.DatetimeTZDtype.construct_from_string(dtype)
        if isinstance(dtype, pd.core.dtypes.dtypes.DatetimeTZDtype):
            s = s.view(f'datetime64[{dtype.unit}]')
            if dtype.tz:
                s = s.dt.tz_localize('utc').dt.tz_convert(dtype.tz)
            return s
        raise e

Actually, the full version of view_as() can also handle categorical data:

def view_as(s, dtype):
    try:
        if isinstance(s.dtype, pd.core.dtypes.dtypes.CategoricalDtype):
            s = s.cat.codes.values
        if isinstance(dtype, pd.core.dtypes.dtypes.CategoricalDtype):
            return pd.Categorical.from_codes(s, dtype.categories)
        else:
            return s.view(dtype)
    except TypeError as e:
        if isinstance(dtype, str):
            dtype = pd.core.dtypes.dtypes.DatetimeTZDtype.construct_from_string(dtype)
        if isinstance(dtype, pd.core.dtypes.dtypes.DatetimeTZDtype):
            s = s.view(f'datetime64[{dtype.unit}]')
            if dtype.tz:
                s = s.dt.tz_localize('utc').dt.tz_convert(dtype.tz)
            return s
        raise e
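
A hypothetical usage sketch of the view_as() helper above, round-tripping the int64 view from the earlier snippet back to a tz-aware series (assuming a pandas version contemporary with this comment, where Series.view is still available):

df = pd.DataFrame({'a': pd.date_range('2018-01-01', '2018-01-03', tz='Asia/Shanghai')})
ints = df['a'].view('int64')             # tz-aware -> int64 nanoseconds works fine
restored = view_as(ints, df['a'].dtype)  # instead of ints.view(df['a'].dtype), which raises
print(restored.dtype)                    # datetime64[ns, Asia/Shanghai]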

tswast (Contributor) commented Feb 22, 2019

While working on googleapis/python-bigquery-pandas#247, I'm able to construct a DataFrame (and Series) with dtype="datetime64[ns, UTC]" in the latest packages on pip, but it fails with the pre-release wheels with the following:

_ TestReadGBQIntegration.test_should_properly_handle_timestamp_unix_epoch[env] _

self = <tests.system.test_gbq.TestReadGBQIntegration object at 0x7f3225515588>
project_id = 'pandas-gbq-tests'

    def test_should_properly_handle_timestamp_unix_epoch(self, project_id):
        query = 'SELECT TIMESTAMP("1970-01-01 00:00:00") AS unix_epoch'
        df = gbq.read_gbq(
            query,
            project_id=project_id,
            credentials=self.credentials,
>           dialect="legacy",
        )

tests/system/test_gbq.py:310: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas_gbq/gbq.py:842: in read_gbq
    final_df = connector.run_query(query, configuration=configuration)
pandas_gbq/gbq.py:486: in run_query
    df = rows_iter.to_dataframe(dtypes=nullsafe_dtypes)
/opt/conda/envs/test-environment/lib/python3.6/site-packages/google/cloud/bigquery/table.py:1429: in to_dataframe
    return self._to_dataframe_tabledata_list(dtypes)
/opt/conda/envs/test-environment/lib/python3.6/site-packages/google/cloud/bigquery/table.py:1333: in _to_dataframe_tabledata_list
    frames.append(self._to_dataframe_dtypes(page, column_names, dtypes))
/opt/conda/envs/test-environment/lib/python3.6/site-packages/google/cloud/bigquery/table.py:1325: in _to_dataframe_dtypes
    columns[column] = pandas.Series(columns[column], dtype=dtypes[column])
/opt/conda/envs/test-environment/lib/python3.6/site-packages/pandas/core/series.py:248: in __init__
    raise_cast_failure=True)
/opt/conda/envs/test-environment/lib/python3.6/site-packages/pandas/core/series.py:2967: in _sanitize_array
    subarr = _try_cast(data, False)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

arr = [datetime.datetime(1970, 1, 1, 0, 0, tzinfo=<UTC>)]
take_fast_path = False

    def _try_cast(arr, take_fast_path):
    
        # perf shortcut as this is the most common case
        if take_fast_path:
            if maybe_castable(arr) and not copy and dtype is None:
                return arr
    
        try:
            subarr = maybe_cast_to_datetime(arr, dtype)
            if not is_extension_type(subarr):
>               subarr = np.array(subarr, dtype=dtype, copy=copy)
E               TypeError: data type not understood

I'm not sure what the pip packages are doing differently from the latest pre-release wheel. In the meantime, I'll use timezone-naive datetimes in pandas-gbq.

tswast (Contributor) commented Mar 23, 2019

Update: passing a timezone as part of the dtype string was officially deprecated in #23990; construct a DatetimeTZDtype instead.

I believe this issue can be closed.
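
A minimal sketch of that recommendation, assuming a pandas version where pd.DatetimeTZDtype is public API: build the dtype object explicitly instead of embedding the timezone in the dtype string.

import pandas as pd

# Explicit extension dtype rather than the 'datetime64[ns, UTC]' string
utc = pd.DatetimeTZDtype(unit='ns', tz='UTC')

# date_range with tz= already produces data of exactly this dtype
idx = pd.date_range('1970-01-01', periods=4, freq='D', tz='UTC')
df = pd.DataFrame({'ts': idx})

assert df['ts'].dtype == utc
print(df.dtypes)  # ts    datetime64[ns, UTC]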

@jbrockmendel added the Constructors label on Jul 23, 2019
jbrockmendel added a commit to jbrockmendel/pandas that referenced this issue on Dec 27, 2019
@jreback removed this from the Contributions Welcome milestone on Jan 1, 2020
JoshZastrow commented May 17, 2021

Is this issue closed? I'm getting a new error related to how numpy handles datetime64[ns, UTC] types:

>>> schema_file = '{"date": "datetime64[ns, UTC]"}'
>>> import json
>>> schema = json.loads(schema_file)

Working Example:

>>> working_df = pd.DataFrame(columns=schema).astype(schema)
>>> working_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype              
---  ------  --------------  -----              
 0   date    0 non-null      datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1)
memory usage: 0.0+ bytes

Not Working Example # 1

>>> no_work_df = pd.DataFrame(columns=schema, dtype=schema)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/joshua.zastrow/.pyenv/versions/3.7.9/envs/dynamic_pricing/lib/python3.7/site-packages/pandas/core/frame.py", line 513, in __init__
    dtype = self._validate_dtype(dtype)
  File "/Users/joshua.zastrow/.pyenv/versions/3.7.9/envs/dynamic_pricing/lib/python3.7/site-packages/pandas/core/generic.py", line 345, in _validate_dtype
    dtype = pandas_dtype(dtype)
  File "/Users/joshua.zastrow/.pyenv/versions/3.7.9/envs/dynamic_pricing/lib/python3.7/site-packages/pandas/core/dtypes/common.py", line 1799, in pandas_dtype
    npdtype = np.dtype(dtype)
  File "/Users/joshua.zastrow/.pyenv/versions/3.7.9/envs/dynamic_pricing/lib/python3.7/site-packages/numpy/core/_internal.py", line 61, in _usefields
    names, formats, offsets, titles = _makenames_list(adict, align)
  File "/Users/joshua.zastrow/.pyenv/versions/3.7.9/envs/dynamic_pricing/lib/python3.7/site-packages/numpy/core/_internal.py", line 31, in _makenames_list
    raise ValueError("entry not a 2- or 3- tuple")
ValueError: entry not a 2- or 3- tuple

Not Working Example # 2

>>> no_work_df_2 = pd.DataFrame(columns=schema, dtype=working_df.dtypes.to_dict())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/joshua.zastrow/.pyenv/versions/3.7.9/envs/dynamic_pricing/lib/python3.7/site-packages/pandas/core/frame.py", line 513, in __init__
    dtype = self._validate_dtype(dtype)
  File "/Users/joshua.zastrow/.pyenv/versions/3.7.9/envs/dynamic_pricing/lib/python3.7/site-packages/pandas/core/generic.py", line 345, in _validate_dtype
    dtype = pandas_dtype(dtype)
  File "/Users/joshua.zastrow/.pyenv/versions/3.7.9/envs/dynamic_pricing/lib/python3.7/site-packages/pandas/core/dtypes/common.py", line 1799, in pandas_dtype
    npdtype = np.dtype(dtype)
  File "/Users/joshua.zastrow/.pyenv/versions/3.7.9/envs/dynamic_pricing/lib/python3.7/site-packages/numpy/core/_internal.py", line 61, in _usefields
    names, formats, offsets, titles = _makenames_list(adict, align)
  File "/Users/joshua.zastrow/.pyenv/versions/3.7.9/envs/dynamic_pricing/lib/python3.7/site-packages/numpy/core/_internal.py", line 29, in _makenames_list
    n = len(obj)
TypeError: object of type 'DatetimeTZDtype' has no len()
INSTALLED VERSIONS
------------------
commit           : f2c8480af2f25efdbd803218b9d87980f416563e
python           : 3.7.9.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.6.0
Version          : Darwin Kernel Version 19.6.0: Mon Apr 12 20:57:45 PDT 2021; root:xnu-6153.141.28.1~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.2.3
numpy            : 1.20.1
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 21.1.1
setuptools       : 47.1.0
Cython           : None
pytest           : 6.2.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.8.6 (dt dec pq3 ext lo64)
jinja2           : 2.11.3
IPython          : 7.21.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : 0.9.0
fastparquet      : None
gcsfs            : 0.8.0
matplotlib       : 3.3.3
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : 0.14.1
pyarrow          : 3.0.0
pyxlsb           : None
s3fs             : None
scipy            : 1.6.0
sqlalchemy       : 1.3.20
tables           : None
tabulate         : 0.8.9
xarray           : None
xlrd             : None
xlwt             : None
numba            : 0.52.0

jreback (Contributor) commented May 17, 2021

You would have to try on master, and if it still fails you can open a new issue.
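
For reference, a minimal sketch of the distinction the two failing examples above run into: the DataFrame constructor's dtype argument accepts only a single dtype for the whole frame, so a dict (or a .dtypes mapping) is rejected, while .astype() does accept a per-column mapping, which is why the first pattern works.

import json
import pandas as pd

schema = json.loads('{"date": "datetime64[ns, UTC]"}')

# Per-column dtypes go through .astype(); dtype= in the constructor must be a single dtype
df = pd.DataFrame(columns=schema).astype(schema)
print(df.dtypes)  # date    datetime64[ns, UTC]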
