TST: dtype for BigQuery TIMESTAMP unexpectedly using datetime64[ns, UTC] dtype #261

Closed
tswast opened this issue Mar 22, 2019 · 6 comments

@tswast
Collaborator

tswast commented Mar 22, 2019

$ pytest 'tests/system/test_gbq.py::TestReadGBQIntegration::test_return_correct_types[env-current_timestamp()-datetime64[ns]]'
=============================== test session starts ===============================
platform darwin -- Python 3.6.4, pytest-4.2.0, py-1.8.0, pluggy-0.8.1
rootdir: /Users/swast/src/pandas/pandas-gbq, inifile:
collected 1 item                                                                  

tests/system/test_gbq.py F                                                  [100%]

==================================== FAILURES =====================================
_ TestReadGBQIntegration.test_return_correct_types[env-current_timestamp()-datetime64[ns]] _

self = <tests.system.test_gbq.TestReadGBQIntegration object at 0x10c3277b8>
project_id = 'swast-scratch', expression = 'current_timestamp()'
type_ = 'datetime64[ns]'

    @pytest.mark.parametrize(
        "expression, type_",
        [
            ("current_date()", "<M8[ns]"),
            ("current_timestamp()", "datetime64[ns]"),
            ("current_datetime()", "<M8[ns]"),
            ("TRUE", bool),
            ("FALSE", bool),
        ],
    )
    def test_return_correct_types(self, project_id, expression, type_):
        """
        All type checks can be added to this function using additional
        parameters, rather than creating additional functions.
        We can consolidate the existing functions here in time
    
        TODO: time doesn't currently parse
        ("time(12,30,00)", "<M8[ns]"),
        """
        query = "SELECT {} AS _".format(expression)
        df = gbq.read_gbq(
            query,
            project_id=project_id,
            credentials=self.credentials,
            dialect="standard",
        )
>       assert df["_"].dtype == type_
E       AssertionError: assert datetime64[ns, UTC] == 'datetime64[ns]'
E        +  where datetime64[ns, UTC] = 0   2019-03-22 22:35:32.398261+00:00\nName: _, dtype: datetime64[ns, UTC].dtype

tests/system/test_gbq.py:392: AssertionError
============================ 1 failed in 2.68 seconds =============================

It's odd that we explicitly specify the datetime64[ns] dtype, but it comes back as datetime64[ns, UTC] on the latest pandas version. I know to_dataframe from google-cloud-bigquery returns datetime objects with the UTC timezone, but I'd expect an explicit dtype of datetime64[ns] to take precedence.

@tswast added the "type: process" label Mar 22, 2019
@tswast self-assigned this Mar 22, 2019
@tswast
Collaborator Author

tswast commented Mar 22, 2019

I can reproduce this in pandas (development version, 0.24.0+, but not 0.23.4) with this minimal example:

import datetime

import pandas as pd
import pytz


dates = [
    datetime.datetime(2019, 1, 1, 12, tzinfo=pytz.utc),
    datetime.datetime(2018, 4, 1, 17, 13, tzinfo=pytz.utc),
]

df = pd.DataFrame({"dates": dates})
print(df.dtypes)

df2 = pd.DataFrame({"dates": dates}, dtype="datetime64[ns]")
print(df2.dtypes)

It prints:

# df
dates    datetime64[ns, UTC]  <-- I expect this.
dtype: object

# df2
dates    datetime64[ns, UTC]  <-- I didn't expect this.
dtype: object

There do appear to be a lot of changes to datetime64 behavior in the changelog for 0.24.0 (http://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.24.0.html), so maybe this is intended behavior? Maybe the distinction between datetime64[ns, UTC] and datetime64[ns] isn't meant to be meaningful when you pass in an explicit dtype?
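If a timezone-naive column is what a caller actually wants, one option is to drop the timezone explicitly rather than relying on the dtype argument. The following is only a sketch of that idea, not what pandas-gbq does; the variable names are illustrative, and it uses datetime.timezone.utc from the standard library in place of pytz.utc.

```python
import datetime

import pandas as pd

# Timezone-aware values, similar to what google-cloud-bigquery returns
# for TIMESTAMP columns (per the discussion above).
dates = [
    datetime.datetime(2019, 1, 1, 12, tzinfo=datetime.timezone.utc),
    datetime.datetime(2018, 4, 1, 17, 13, tzinfo=datetime.timezone.utc),
]

# pandas infers a timezone-aware dtype from the values.
aware = pd.Series(dates, name="dates")

# tz_convert(None) converts to UTC and then removes the timezone,
# leaving a timezone-naive datetime64 column.
naive = aware.dt.tz_convert(None)

print(aware.dtype)
print(naive.dtype)
```

The wall-clock values stay in UTC; only the timezone annotation is removed.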

@max-sixty
Contributor

Should we put this upstream? That second example is a good one...

@tswast
Collaborator Author

tswast commented Mar 22, 2019

Filed upstream at pandas-dev/pandas#25843

@tswast
Collaborator Author

tswast commented Mar 23, 2019

From the test results for #262.

/root/project/tests/system/test_gbq.py:1377: DeprecationWarning: parsing timezone aware datetimes is deprecated; this will raise an error in the future

Hrm. Hopefully this doesn't also affect the DataFrame or Series constructors. If it did, to_dataframe in google-cloud-bigquery would likely break (and thus break this library), because google-cloud-bigquery parses BigQuery TIMESTAMP columns into timezone-aware datetime objects.

@tswast
Collaborator Author

tswast commented Mar 23, 2019

Re: warning. I think it's pandas-dev/pandas#23579 which should just affect individual values, not the DataFrame / Series constructors.

Re: ignoring UTC in dtype string. I think that's pandas-dev/pandas#23990 in which passing units / tzinfo in dtype string for datetime64 is deprecated. I think the fix is to pass a DatetimeTZDtype object instead of a string.
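As a rough illustration of the suggested fix, the sketch below passes a pd.DatetimeTZDtype object as the dtype instead of a string. It assumes a pandas version where pd.DatetimeTZDtype is exposed at the top level; the variable names are made up for the example, and it is not taken from the actual change in this repository.

```python
import datetime

import pandas as pd

dates = [
    datetime.datetime(2019, 1, 1, 12, tzinfo=datetime.timezone.utc),
    datetime.datetime(2018, 4, 1, 17, 13, tzinfo=datetime.timezone.utc),
]

# Describe the timezone-aware dtype with an object rather than a string.
utc_dtype = pd.DatetimeTZDtype(unit="ns", tz="UTC")

# Pass the dtype object explicitly when building the column.
series = pd.Series(dates, dtype=utc_dtype, name="dates")
df = pd.DataFrame({"dates": series})

print(df.dtypes)
```

A test could then compare the column dtype against utc_dtype (or check isinstance(dtype, pd.DatetimeTZDtype)) instead of the string "datetime64[ns]".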

@tswast
Collaborator Author

tswast commented Apr 3, 2019

Now that #269 is merged, this behavior is no longer unexpected; it's intentional. :-)
