BUG: PeriodIndex inconsistent deserialization with HDF5 - PyTables #41978

Open
3 tasks done
ra1nty opened this issue Jun 13, 2021 · 5 comments
Labels
Bug, IO HDF5 (read_hdf, HDFStore), Period (Period data type)

Comments


ra1nty commented Jun 13, 2021

  • I have checked that this issue has not already been reported.
    An issue from 5 years ago mentioned that .to_hdf() behaves inconsistently across Python 2 and 3 for a PeriodIndex in fixed format:
    DataFrame with PeriodIndex written in Python2 gets an Int64Index when read back in Python3 #16781

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.
    The bug exists, but the behavior is different; see the next comment.


I noticed that deserializing a pandas Series/DataFrame with a PeriodIndex from an HDF5 file is inconsistent when using the PyTables (table) format: the index of the retrieved Series/DataFrame is an Int64Index instead of a PeriodIndex. See the code below for an example.

import pandas as pd
store = pd.HDFStore('test.h5')
series = pd.Series(index=pd.date_range(start='2015-01', end='2016-01', freq='M'), data=0).to_period('M')
df = pd.DataFrame(index=pd.date_range(start='2015-01', end='2016-01', freq='M'), data=0, columns=['a']).to_period('M')
store.put('/a/a', series, format='table')
store.put('/a/b', df, format='table')
store.select('/a/a')

Output:

540    0
541    0
542    0
543    0
544    0
545    0
546    0
547    0
548    0
549    0
550    0
551    0
dtype: int64

store.select('/a/b').index

Output:

Int64Index([540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551], dtype='int64')
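The integers 540 to 551 in the output above are consistent with monthly period ordinals, that is, the number of months elapsed since January 1970. A minimal sketch of that arithmetic, using only the standard library (the helper names `month_ordinal` and `ordinal_to_month` are hypothetical and not part of pandas):

```python
from typing import Tuple


def month_ordinal(year: int, month: int) -> int:
    """Months elapsed since January 1970 (hypothetical helper, not a pandas API)."""
    return (year - 1970) * 12 + (month - 1)


def ordinal_to_month(ordinal: int) -> Tuple[int, int]:
    """Inverse of month_ordinal: map an ordinal back to (year, month)."""
    years, month0 = divmod(ordinal, 12)
    return 1970 + years, month0 + 1


# The index values read back from the store span 2015-01 .. 2015-12.
first = month_ordinal(2015, 1)    # 540
last = month_ordinal(2015, 12)    # 551
decoded = [ordinal_to_month(o) for o in range(first, last + 1)]
```

So the period values themselves survive the round trip; what appears to be lost is the frequency needed to interpret them as periods.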

Problem description

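Writing a Series/DataFrame with a PeriodIndex to an HDF5 file in table format and reading it back returns an Int64Index rather than the original PeriodIndex.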

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 2cb9652
python : 3.9.1.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : AMD64 Family 25 Model 33 Stepping 0, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United States.1252

pandas : 1.2.4
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 20.3.1
setuptools : 51.0.0.post20201207
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.4.3
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.24.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 4.0.1
pyxlsb : None
s3fs : None
scipy : 1.6.3
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
numba : None

ra1nty added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels Jun 13, 2021
@ra1nty
Author

ra1nty commented Jun 13, 2021

So I have figured out the issue:
_get_data_and_dtype_name in
https://github.com/pandas-dev/pandas/blob/v1.2.4/pandas/io/pytables.py#L5070
uses Index.asi8 to store the int64 values of the PeriodIndex,

but this case is not handled in DataCol.convert and IndexCol.convert:
https://github.com/pandas-dev/pandas/blob/v1.2.4/pandas/io/pytables.py#L2400
https://github.com/pandas-dev/pandas/blob/v1.2.4/pandas/io/pytables.py#L3644

On the master branch, the issue still exists, but a TypeError is raised instead, because DataCol.convert and IndexCol.convert do not use the correct index factory:
https://github.com/pandas-dev/pandas/blob/master/pandas/io/pytables.py#L2077

The fixed format in both master and v1.2.4 has no problem with PeriodIndex and handles the conversion correctly.
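To illustrate the asymmetry described above outside of the HDF5 code path, here is a rough sketch using only public pandas APIs. It assumes pandas and NumPy are installed, does not touch pandas/io/pytables.py, and is not the actual conversion code; it only shows that the int64 values from asi8 need the frequency to become a PeriodIndex again.

```python
import pandas as pd

# A monthly PeriodIndex like the one in the report.
periods = pd.period_range("2015-01", periods=12, freq="M")

# What gets written: the raw int64 ordinals (540 .. 551).
ordinals = periods.asi8

# Reading the ordinals back without the frequency gives a plain integer index.
plain = pd.Index(ordinals)

# Reading them back together with the frequency restores the PeriodIndex.
restored = pd.PeriodIndex(
    pd.arrays.PeriodArray(ordinals, dtype=pd.PeriodDtype("M"))
)
```

The same reconstruction could serve as a user-side stopgap after store.select(...), provided the frequency is known or stored separately.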

@ra1nty
Author

ra1nty commented Jun 13, 2021

For example, a simple but not clean fix would be to add a special case in IndexCol.convert when constructing the index factory:
https://github.com/pandas-dev/pandas/blob/master/pandas/io/pytables.py#L2077

factory = Index
if is_datetime64_dtype(values.dtype) or is_datetime64tz_dtype(values.dtype):
    factory = DatetimeIndex
elif "freq" in kwargs:
    # workaround for PeriodIndex
    def f(values, freq=None, **kwargs):
        parr = PeriodArray._simple_new(values, freq=freq)
        return PeriodIndex._simple_new(parr, **kwargs)
    factory = f

From my understanding, TimedeltaIndex and DatetimeIndex are covered by the first if branch, since the correct dtype is already set for them. If 'freq' is still in kwargs at that point, then it is for a PeriodIndex. The workaround works on my local machine for now, but I haven't had a chance to look into the pandas codebase in depth.

@ra1nty
Author

ra1nty commented Jun 13, 2021

I also noticed that neither the fixed nor the table format can store values whose underlying array is a PeriodArray: the fixed format raises a readable TypeError, while the table format raises a TypeError without clear information. I think this should be fixed as well.
Code to reproduce:

series_p = pd.Series(data=pd.date_range(start='2015-01', end='2016-01', freq='M').to_period('M'))
store.put('/a/c', series_p, format='fixed')
store.put('/a/d', series_p, format='table')

Output (master & v1.2.4):
Fixed

TypeError: objects of type ``PeriodArray`` are not supported in this context, sorry; supported objects are: NumPy array, record or scalar; homogeneous list or tuple, integer, float, complex or bytes

PyTables

TypeError: int() argument must be a string, a bytes-like object or a number, not 'Period'
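Until period-valued columns are supported, one possible user-side stopgap (an assumption on my part, not something confirmed in this thread) is to convert the period values to timestamps before storing and convert back after reading. The sketch below shows only the conversion round trip and does not call HDFStore; it assumes pandas is installed.

```python
import pandas as pd

# A Series whose values (not index) are monthly periods, as in the report.
series_p = pd.Series(pd.period_range("2015-01", periods=12, freq="M"))

# Before writing: convert period values to datetime64 timestamps
# (start of each month by default).
as_timestamps = series_p.dt.to_timestamp()

# After reading: convert back to periods; the frequency must be known.
restored = as_timestamps.dt.to_period("M")
```

This keeps the data lossless for monthly periods, but the frequency string has to be tracked by the caller.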

ra1nty changed the title from "BUG: DataFrame/Series with PeriodIndex inconsistent deserialization with HDF5 - PyTables" to "BUG: PeriodIndex inconsistent deserialization with HDF5 - PyTables" Jun 13, 2021
mroeschke added the IO HDF5 (read_hdf, HDFStore) and Period (Period data type) labels and removed the Needs Triage label Aug 21, 2021
@ra1nty
Author

ra1nty commented Jan 20, 2022

@mroeschke Is it OK if I start working on this, since it's confirmed? I was able to patch my local pandas last year but haven't had time to revisit it since then.

@mroeschke
Member

Sure, go for it @ra1nty
