BUG: parsing multi-column headers in read_csv (GH6051) #6170

waitingkuo · 2014-01-29T15:47:23Z

Bug: mangle_dupe_cols does not work when the header is a list.

Originally, has_mi_columns is set to 1 whenever the header is a list, and the mangle_dupe_cols logic is skipped when has_mi_columns == 1.

This pull request does two things:

1) For a list with a single element, for example [0], has_mi_columns is set to 0, and the header goes through the original mangle_dupe_cols logic, as if the header were a single integer.
2) For a list with more than one element, a sequence number is appended to the last element of each duplicated multi-column.

For example

Male,Male,Male,Male,Male,Female,Female
A,B,B,B,C,D,D
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7

would be converted to

In [2]: pd.read_csv('woo.csv', header=[0,1])
Out[2]: 
   Male                  Female     
      A  B  B.1  B.2  C       D  D.1
0     1  2    3    4  5       6    7
1     1  2    3    4  5       6    7
2     1  2    3    4  5       6    7
3     1  2    3    4  5       6    7
4     1  2    3    4  5       6    7
5     1  2    3    4  5       6    7

[6 rows x 7 columns]
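
The renaming described above can be sketched in plain Python. This is only an illustration of the idea, not the code in this pull request or in pandas: the function name `mangle_dupe_tuples` is hypothetical, and the sketch assumes that a mangled name such as `B.1` never collides with a column that already carries that name.

```python
from collections import Counter


def mangle_dupe_tuples(columns):
    """Append ".N" to the last level of each repeated column tuple.

    Hypothetical helper for illustration; it does not guard against a
    mangled name clashing with an existing column.
    """
    seen = Counter()
    result = []
    for col in columns:
        count = seen[col]  # how many times this exact tuple appeared before
        seen[col] += 1
        if count:
            # Only the last level changes; the outer levels are kept as-is.
            col = col[:-1] + ("{}.{}".format(col[-1], count),)
        result.append(col)
    return result


# The two header rows from the example file above.
header_rows = [
    ["Male", "Male", "Male", "Male", "Male", "Female", "Female"],
    ["A", "B", "B", "B", "C", "D", "D"],
]
columns = list(zip(*header_rows))  # one tuple per column
print(mangle_dupe_tuples(columns))
```

Counting is keyed on the original tuple, so the first occurrence keeps its name and later ones get `.1`, `.2`, and so on, which matches the output shown above.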

jreback · 2014-01-29T16:59:05Z

@waitingkuo IIUC, the last case is to handle an 'incorrect' multi-index?

waitingkuo · 2014-01-29T17:12:23Z

@jreback Yes. In the original version, a sequence number cannot be appended to multi-index columns.

jreback · 2014-01-29T17:24:00Z

I think at the very least this should be a warning (and then effectively mangle the names as you have indicated), or maybe even raise.

waitingkuo · 2014-01-29T19:37:13Z

Should it also give a warning for the single column case?

jreback · 2014-01-29T19:40:21Z

can you post what that case is?

jreback · 2014-02-14T03:40:33Z

@waitingkuo can you rebase and see if there are any more issues?

jreback · 2014-02-15T21:06:52Z

can you rebase this?

jreback · 2014-03-09T15:09:26Z

@waitingkuo what's the state of this?

waitingkuo · 2014-03-13T07:16:21Z

@jreback,
do you mean we still need to give a warning even when mangle_dupe_cols is set to True?

jreback · 2014-03-13T10:25:07Z

I think you need to give a warning for your case 2) above

waitingkuo · 2014-03-13T11:45:39Z

What if I use pd.read_csv('woo.csv', header=[0,1], mangle_dupe_cols=True)? Should we still give a warning?

jreback · 2014-03-13T12:24:12Z

I think you should still warn, because you are 'expecting' a multi-index on the columns, but you won't get one because of the incorrect format. Mangling is the default; it is kind of an older option (now that we can parse multi-column headers). It was supposed to default to False in future versions, but that broke things, so we didn't go forward with that.

waitingkuo · 2014-03-13T15:09:55Z

Now it warns if duplicated columns have been mangled.

woo.csv

A,B,B,B,C,D,D
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7

woo_mul.csv

Male,Male,Male,Male,Male,Female,Female
A,B,B,B,C,D,D
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7

In [1]: import pandas as pd

In [2]: pd.read_csv('woo.csv', header=[0])
/Users/willy/git/pandas/pandas/io/parsers.py:972: DtypeWarning: Duplicated columns have been mangled
  self._reader = _parser.TextReader(src, **kwds)
Out[2]: 
   A  B  B.1  B.2  C  D  D.1 
0  1  2    3    4  5  6    7   
1  1  2    3    4  5  6    7   
2  1  2    3    4  5  6    7   
3  1  2    3    4  5  6    7   
4  1  2    3    4  5  6    7   
5  1  2    3    4  5  6    7   

[6 rows x 7 columns]

In [3]: pd.read_csv('woo_mul.csv', header=[0,1])
Out[3]: 
   Male                  Female     
      A  B  B.1  B.2  C       D  D.1 
0     1  2    3    4  5       6    7   
1     1  2    3    4  5       6    7   
2     1  2    3    4  5       6    7   
3     1  2    3    4  5       6    7   
4     1  2    3    4  5       6    7   
5     1  2    3    4  5       6    7   

[6 rows x 7 columns]

jreback · 2014-03-13T15:12:38Z

pandas/io/tests/test_parsers.py

+4,5,6
+7,8,9
+"""
+        df = self.read_csv(StringIO(data), header=[0])


Use assert_produces_warning here (around the read_csv call) to assert that the warning is actually being shown.
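
For readers unfamiliar with this kind of check, here is a rough standard-library sketch of what asserting on a warning involves. It is not the pandas helper itself: `MangledColumnsWarning` and `read_header` are hypothetical stand-ins for the warning category and the parser call.

```python
import warnings


class MangledColumnsWarning(UserWarning):
    """Hypothetical stand-in for the warning category under test."""


def read_header(names):
    # Hypothetical stand-in for the parser call; warns on duplicate names.
    if len(set(names)) != len(names):
        warnings.warn("Duplicated columns have been mangled",
                      MangledColumnsWarning)
    return names


with warnings.catch_warnings(record=True) as caught:
    # "always" keeps the once-per-location default from hiding a repeat.
    warnings.simplefilter("always")
    read_header(["A", "B", "B"])

saw_warning = any(issubclass(w.category, MangledColumnsWarning)
                  for w in caught)
print(saw_warning)
```

Only warnings raised inside the `with` block are recorded, so the call under test has to sit inside it.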

jreback · 2014-03-13T15:16:41Z

Also need a one-liner in the release notes, under API changes (and the same line in v0.14.0.txt).

waitingkuo · 2014-03-13T16:26:38Z

Could you give me a hand?

I use assert_produces_warning as

2607         with tm.assert_produces_warning(DtypeWarning):
2608             df = self.read_csv(StringIO(data), header=[0,1], mangle_dupe_cols=True)

But it seems it doesn't catch my warning

(pandas-dev)appledembp-2:tests willy$ nosetests test_parsers.py
....................../Users/willy/git/pandas/pandas/io/parsers.py:972: DtypeWarning: Duplicated columns have been mangled
  self._reader = _parser.TextReader(src, **kwds)
............................................................................S......................................................F.................................................................................S..........S...............................................................................................................
======================================================================
FAIL: test_list_of_multiple_headers_with_duplicated_column_pairs (pandas.io.tests.test_parsers.TestCParserLowMemory)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/willy/git/pandas/pandas/io/tests/test_parsers.py", line 2608, in test_list_of_multiple_headers_with_duplicated_column_pairs
    df = self.read_csv(StringIO(data), header=[0,1], mangle_dupe_cols=True)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/Users/willy/git/pandas/pandas/util/testing.py", line 1505, in assert_produces_warning
    % expected_warning.__name__)
AssertionError: Did not see expected warning of class 'DtypeWarning'.

----------------------------------------------------------------------
Ran 358 tests in 10.321s

FAILED (SKIP=3, failures=1)

Did anything go wrong? The warning is raised in parser.pyx, a Cython file. Does that matter?

jreback · 2014-03-13T16:43:20Z

add this to the setUp method

warnings.filterwarnings(action='error', category=DtypeWarning)

This turns the warning into an exception, so you can see where it's coming from.
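
A minimal standard-library sketch of the effect of the "error" action follows. The warning class and the `parse` function are hypothetical stand-ins, and the filter is scoped with `catch_warnings` here rather than set in a test's setUp.

```python
import warnings


class MangledColumnsWarning(UserWarning):
    """Hypothetical stand-in for the warning category being debugged."""


def parse(names):
    # Hypothetical stand-in for the parser; warns on duplicate names.
    if len(set(names)) != len(names):
        warnings.warn("Duplicated columns have been mangled",
                      MangledColumnsWarning)
    return names


raised = False
message = None
with warnings.catch_warnings():
    # With action="error", a matching warning is raised as an exception.
    warnings.filterwarnings(action="error", category=MangledColumnsWarning)
    try:
        parse(["A", "B", "B"])
    except MangledColumnsWarning as exc:
        raised = True
        message = str(exc)

print(raised, message)
```

Left uncaught, the exception carries a traceback to the `warnings.warn` call, which is what makes the source of the warning visible.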

jreback · 2014-04-09T13:05:04Z

@waitingkuo update on this?

jreback · 2014-04-22T15:28:32Z

@waitingkuo can you address this soon?

jreback · 2014-05-30T14:32:04Z

looks good, can you rebase, and add a whatsnew (0.14.1) entry (bug fixes section)

jreback · 2014-06-10T15:40:59Z

can you address the comments above. thanks

jreback · 2014-09-04T00:35:02Z

@waitingkuo status?

jreback · 2015-01-18T21:40:11Z

closing as stale

gfyoung · 2017-07-20T08:54:06Z

@jreback : I'd like to maybe resurrect this given that I've had my share of fun with duplicate columns and indices (which appears to be the cause of this behavior). What do you think of these changes now?

jreback · 2017-07-22T14:00:17Z

i thought this is much better nowadays - sure go for revival

gfyoung · 2017-07-22T19:17:33Z

i thought this is much better nowadays

Sadly not...the test written here fails on master

gfyoung · 2017-07-24T04:02:34Z

Further testing indicates that this PR was not as ready to be merged as previously thought. It ran into issues when nested duplicates were encountered. Also, no support for the Python engine was provided.

waitingkuo added 3 commits March 13, 2014 14:44

TST: add test cases for parsing duplicated multiple-column header

7f86033

BUG: fix the bug when parsing multiple-column header

08ed426

BUG: use lzip instead of zip to fix py3 compatible issue

b7079a6

EHN: give warning if duplicated columns have been found

6c35663

jreback reviewed Mar 13, 2014

jreback modified the milestones: 0.15.0, 0.14.0 Apr 9, 2014

jreback modified the milestones: 0.14.1, 0.15.0 May 30, 2014

jreback modified the milestones: 0.15.0, 0.14.1, 0.15.1 Jun 30, 2014

jreback modified the milestones: 0.15.1, 0.15.0 Sep 4, 2014

jreback closed this Jan 18, 2015

gfyoung modified the milestones: No action, 0.16.0 Jul 20, 2017

gfyoung mentioned this pull request Jul 25, 2017

BUG: Thoroughly dedup columns in read_csv #17060

Merged
