BUG: parsing multi-column headers in read_csv (GH6051) #6170

Closed
wants to merge 4 commits into from


waitingkuo
Contributor

closes #6051

Bug: mangle_dupe_cols does not work when the header is a list.

Originally, has_mi_columns was set to 1 whenever the header was a list, and the mangle_dupe_cols logic was skipped when has_mi_columns == 1.

This pull request does the following:

  1. For a list with a single element, for example [0], has_mi_columns is set to 0, and the header goes through the original mangle_dupe_cols logic, as if the header were a single integer.
  2. For a list with more than one element, the sequence number is appended to the last element of each duplicated multi-column.

For example

Male,Male,Male,Male,Male,Female,Female
A,B,B,B,C,D,D
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7

would be converted to

In [2]: pd.read_csv('woo.csv', header=[0,1])
Out[2]: 
   Male                  Female     
      A  B  B.1  B.2  C       D  D.1
0     1  2    3    4  5       6    7
1     1  2    3    4  5       6    7
2     1  2    3    4  5       6    7
3     1  2    3    4  5       6    7
4     1  2    3    4  5       6    7
5     1  2    3    4  5       6    7

[6 rows x 7 columns]
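The renaming in case 2 can be sketched in plain Python. This is only an illustration of the idea, not the code in this pull request: the function name `mangle_multi_columns` is hypothetical, and the sketch does not guard against a generated name such as `B.1` colliding with a name already present in the header.

```python
from collections import Counter


def mangle_multi_columns(columns):
    """Append ".N" to the last level of every repeated column tuple.

    Hypothetical helper for illustration; the first occurrence of a
    tuple is kept as-is, later occurrences get .1, .2, ... appended.
    """
    seen = Counter()
    result = []
    for col in columns:
        count = seen[col]
        seen[col] += 1
        if count > 0:
            # Only the last level is renamed; upper levels stay intact.
            col = col[:-1] + ("%s.%d" % (col[-1], count),)
        result.append(col)
    return result


# The two header rows from the example CSV, paired column by column.
header = list(zip(
    ["Male"] * 5 + ["Female"] * 2,
    ["A", "B", "B", "B", "C", "D", "D"],
))
mangled = mangle_multi_columns(header)
print(mangled)
```

Only the last level is changed, so the grouping under Male and Female in the first header row is preserved.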

@jreback
Contributor

jreback commented Jan 29, 2014

@waitingkuo IIUC, the last case is to handle an 'incorrect' multi-index?

@waitingkuo
Contributor Author

@jreback Yes, in the original version the sequence number could not be appended to multi-index columns.

@jreback
Contributor

jreback commented Jan 29, 2014

I think at the very least this should be a warning (and then effectively mangle the names as you have indicated), or maybe it should even raise.

@waitingkuo
Contributor Author

Should it also give a warning for the single column case?

@jreback
Contributor

jreback commented Jan 29, 2014

can you post what that case is?

@jreback
Contributor

jreback commented Feb 14, 2014

@waitingkuo can you rebase and see if there are any more issues?

@jreback
Contributor

jreback commented Feb 15, 2014

can you rebase this?

@jreback
Contributor

jreback commented Mar 9, 2014

@waitingkuo what's the state of this?

@waitingkuo
Contributor Author

@jreback,
do you mean we still need to give a warning even when mangle_dupe_cols is set to True?

@jreback
Contributor

jreback commented Mar 13, 2014

I think you need to give a warning for your case 2) above

@waitingkuo
Contributor Author

What if I use pd.read_csv('woo.csv', header=[0,1], mangle_dupe_cols=True)? Should we still give a warning?

@jreback
Contributor

jreback commented Mar 13, 2014

I think you should still emit a warning, because you are 'expecting' a multi-index on the columns, but you won't get one because of the incorrect format. Mangling is the default and is kind of an older option (now that multi-column headers can be parsed). It was supposed to default to False in future versions, but that broke things, so we didn't go forward with that.

@waitingkuo
Contributor Author

Now it warns if duplicated columns have been mangled.

woo.csv

A,B,B,B,C,D,D
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7

woo_mul.csv

Male,Male,Male,Male,Male,Female,Female
A,B,B,B,C,D,D
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7

In [1]: import pandas as pd

In [2]: pd.read_csv('woo.csv', header=[0])
/Users/willy/git/pandas/pandas/io/parsers.py:972: DtypeWarning: Duplicated columns have been mangled
  self._reader = _parser.TextReader(src, **kwds)
Out[2]: 
   A  B  B.1  B.2  C  D  D.1 
0  1  2    3    4  5  6    7   
1  1  2    3    4  5  6    7   
2  1  2    3    4  5  6    7   
3  1  2    3    4  5  6    7   
4  1  2    3    4  5  6    7   
5  1  2    3    4  5  6    7   

[6 rows x 7 columns]

In [3]: pd.read_csv('woo_mul.csv', header=[0,1])
Out[3]: 
   Male                  Female     
      A  B  B.1  B.2  C       D  D.1 
0     1  2    3    4  5       6    7   
1     1  2    3    4  5       6    7   
2     1  2    3    4  5       6    7   
3     1  2    3    4  5       6    7   
4     1  2    3    4  5       6    7   
5     1  2    3    4  5       6    7   

[6 rows x 7 columns]

4,5,6
7,8,9
"""
df = self.read_csv(StringIO(data), header=[0])
Contributor

Use assert_produces_warning here (around the read_csv call) to assert that the warning is actually shown.
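The kind of check being requested can be sketched with the standard library alone. This is a minimal illustration, not the pandas test helper: `DupeColumnsWarning` and `read_header` are hypothetical stand-ins for the warning class and the parser call in this pull request.

```python
import warnings


class DupeColumnsWarning(UserWarning):
    """Hypothetical stand-in for the warning class used in the PR."""


def read_header(names):
    """Hypothetical stand-in for the parser: warn on duplicate names."""
    if len(set(names)) != len(names):
        warnings.warn("Duplicated columns have been mangled",
                      DupeColumnsWarning)
    return names


# Record every warning emitted inside the block instead of printing it.
with warnings.catch_warnings(record=True) as caught:
    # "always" avoids the default once-per-location suppression.
    warnings.simplefilter("always")
    read_header(["A", "B", "B"])

# The test passes only if the expected warning class was recorded.
assert any(issubclass(w.category, DupeColumnsWarning) for w in caught)
```

Wrapping only the call under test keeps the assertion from picking up warnings emitted by unrelated setup code.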

@jreback
Contributor

jreback commented Mar 13, 2014

This also needs a one-liner in the release notes under API changes (and the same line in v0.14.0.txt).

@waitingkuo
Contributor Author

Could you give me a hand?

I used assert_produces_warning as follows:

2607         with tm.assert_produces_warning(DtypeWarning):
2608             df = self.read_csv(StringIO(data), header=[0,1], mangle_dupe_cols=True)

But it doesn't seem to catch my warning:

(pandas-dev)appledembp-2:tests willy$ nosetests test_parsers.py
....................../Users/willy/git/pandas/pandas/io/parsers.py:972: DtypeWarning: Duplicated columns have been mangled
  self._reader = _parser.TextReader(src, **kwds)
............................................................................S......................................................F.................................................................................S..........S...............................................................................................................
======================================================================
FAIL: test_list_of_multiple_headers_with_duplicated_column_pairs (pandas.io.tests.test_parsers.TestCParserLowMemory)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/willy/git/pandas/pandas/io/tests/test_parsers.py", line 2608, in test_list_of_multiple_headers_with_duplicated_column_pairs
    df = self.read_csv(StringIO(data), header=[0,1], mangle_dupe_cols=True)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/Users/willy/git/pandas/pandas/util/testing.py", line 1505, in assert_produces_warning
    % expected_warning.__name__)
AssertionError: Did not see expected warning of class 'DtypeWarning'.

----------------------------------------------------------------------
Ran 358 tests in 10.321s

FAILED (SKIP=3, failures=1)

Did anything go wrong? The warning is emitted in parser.pyx, a Cython file. Does that matter?

@jreback
Contributor

jreback commented Mar 13, 2014

Add this to the setUp method:

warnings.filterwarnings(action='error', category=DtypeWarning)

This will then raise an error on the warning, and you can see where it's coming from.
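The effect of that filter can be seen in a small standalone sketch. Everything except the `warnings` and `traceback` calls is hypothetical: `DupeColumnsWarning` and `parse` stand in for the warning class and the parser, and the filter is scoped with `catch_warnings` here rather than set in a test's setUp.

```python
import traceback
import warnings


class DupeColumnsWarning(UserWarning):
    """Hypothetical stand-in for the warning class used in the PR."""


def parse(names):
    """Hypothetical stand-in for the parser: warn on duplicate names."""
    if len(set(names)) != len(names):
        warnings.warn("Duplicated columns have been mangled",
                      DupeColumnsWarning)
    return names


raised = None
trace_text = ""

# Keep the filter change local so it does not leak into other code.
with warnings.catch_warnings():
    # Turn this warning category into an exception.
    warnings.filterwarnings(action="error", category=DupeColumnsWarning)
    try:
        parse(["A", "B", "B"])
    except DupeColumnsWarning as exc:
        raised = exc
        # The traceback shows which call emitted the warning.
        trace_text = traceback.format_exc()

print(trace_text)
```

Because the warning becomes an exception, the traceback identifies the call that emitted it, which is the information needed to see why a test is not catching it.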

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 9, 2014
@jreback
Contributor

jreback commented Apr 9, 2014

@waitingkuo update on this?

@jreback
Contributor

jreback commented Apr 22, 2014

@waitingkuo can you address this soon?

@jreback jreback modified the milestones: 0.14.1, 0.15.0 May 30, 2014
@jreback
Contributor

jreback commented May 30, 2014

looks good, can you rebase, and add a whatsnew (0.14.1) entry (bug fixes section)

@jreback
Contributor

jreback commented Jun 10, 2014

can you address the comments above. thanks

@jreback jreback modified the milestones: 0.15.0, 0.14.1, 0.15.1 Jun 30, 2014
@jreback
Contributor

jreback commented Sep 4, 2014

@waitingkuo status?

@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 4, 2014
@jreback
Contributor

jreback commented Jan 18, 2015

closing as stale

@jreback jreback closed this Jan 18, 2015
@gfyoung
Member

gfyoung commented Jul 20, 2017

@jreback : I'd like to maybe resurrect this given that I've had my share of fun with duplicate columns and indices (which appears to be the cause of this behavior). What do you think of these changes now?

@gfyoung gfyoung modified the milestones: No action, 0.16.0 Jul 20, 2017
@jreback
Contributor

jreback commented Jul 22, 2017

i thought this is much better nowadays - sure go for revival

@gfyoung
Member

gfyoung commented Jul 22, 2017

i thought this is much better nowadays

Sadly not...the test written here fails on master

@gfyoung
Member

gfyoung commented Jul 24, 2017

Further testing indicates that this PR was not as ready to be merged as previously thought. It ran into issues when nested duplicates were encountered. Also, no support for the Python engine was provided.

Labels
Bug IO CSV read_csv, to_csv
Development

Successfully merging this pull request may close these issues.

BUG: read multi-index column csv with index_col=False borks
3 participants