DataFrame and DataMatrix column ordering #12

Closed
hector13 opened this issue Oct 3, 2010 · 2 comments

hector13 commented Oct 3, 2010

First, thank you for the pandas package -- it's incredibly useful and well done.

I know that one of the fundamental concepts behind the data structures is that column ordering doesn't matter. As long as one only uses pandas' data access/manipulation functions (e.g., sum(), ewma()), this works fine. But it's often useful to access the underlying values as a numpy array for more complicated data manipulation. The values attribute (or the values() method for a Series) does this, but it's not always obvious what order the values come back in.

For example:

In [1]: dm = DataMatrix(np.arange(2*3).reshape(2,3), index=[1,0], columns=['B', 'A', 'C' ])

In [2]: dm 
Out[2]: 
     B              A              C  
1    0              1              2  
0    3              4              5              

In [3]: df = DataFrame(np.arange(2*3).reshape(2,3), index=[1,0], columns=['B', 'A', 'C' ])

In [4]: df 
Out[4]: 
     A              B              C  
1    1              0              2  
0    4              3              5              

In [5]: df.values
Out[5]: 
array([[1, 0, 2],
       [4, 3, 5]])

In [6]: dm.values
Out[6]: 
array([[0, 1, 2],
       [3, 4, 5]])

DataMatrix seems to respect the passed-in ordering of columns, while DataFrame does not. I know this is documented, and it's not the biggest deal in the world, but it does seem to cause quite a bit of confusion for some. Is it possible to have both data types keep the ordering that's passed in? If a user passes in the same column name twice, could this just throw an exception? Something still needs to be done when an operation is performed on two DataFrames (e.g., combining them), but instead of reordering in alphabetical order, how about preserving the column ordering from left to right?
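The two rules proposed above (reject duplicate column names, and keep left-to-right order when combining) can be sketched in plain Python. This is only an illustration of the idea: `check_unique_columns` and `combine_column_order` are hypothetical helper names, not pandas functions, and nothing here reflects how pandas actually implements column handling.

```python
def check_unique_columns(columns):
    """Return the columns as a list in the given order, or raise on a duplicate.

    Hypothetical helper illustrating the "throw an exception" proposal.
    """
    seen = set()
    for name in columns:
        if name in seen:
            raise ValueError(f"duplicate column name: {name!r}")
        seen.add(name)
    return list(columns)


def combine_column_order(left, right):
    """Union of two column lists, ordered by first appearance, left to right.

    Hypothetical helper: dict.fromkeys keeps insertion order and drops
    repeated labels, so columns from `left` come first, followed by any
    columns that appear only in `right`.
    """
    return list(dict.fromkeys(list(left) + list(right)))


# Columns keep the order they were passed in, rather than being sorted.
print(check_unique_columns(["B", "A", "C"]))
# Combining keeps left-to-right order instead of alphabetical order.
print(combine_column_order(["B", "A", "C"], ["D", "A"]))
```

With these rules, combining `['B', 'A', 'C']` with `['D', 'A']` gives `['B', 'A', 'C', 'D']` rather than the alphabetical `['A', 'B', 'C', 'D']`.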

Anyway, my bigger concern is actually the following:

In [7]: dm.reindex(columns=['C','B','A']).values
Out[7]: 
array([[2, 0, 1],
       [5, 3, 4]])

In [8]: df.reindex(columns=['C','B','A']).values
Out[8]: 
array([[1, 0, 2],
       [4, 3, 5]])

Regardless of the ordering of the columns after creating a DataFrame/DataMatrix, a naive user (i.e., me) would expect that calling reindex and then values returns an ndarray with the columns in the order that was requested. But it looks like this only happens for DataMatrix objects (and I'm not even sure that's always guaranteed).

Member

wesm commented Oct 3, 2010

You make a good point, and I think it might be time well-invested to have a set ordering in DataFrame for the reason that you mention. Otherwise you have this inconsistent behavior.

Member

wesm commented Dec 11, 2010

I refactored DataFrame to have a set column ordering. A few more changes might be needed for 100% support, but the basic use cases (e.g. what you listed above) work and are now consistent with DataMatrix. All the unit tests pass, but there still might be some "bugs" (inconsistencies) that I will locate over the next few weeks or so.
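For readers who want to check the ordering behavior themselves, here is a minimal sketch that assumes a recent pandas release rather than the 2010 code discussed in this thread. DataMatrix no longer exists in current pandas, so only DataFrame is shown; the data and labels are the ones from the report above.

```python
import numpy as np
import pandas as pd

# Same data and labels as in the original report.
df = pd.DataFrame(
    np.arange(2 * 3).reshape(2, 3),
    index=[1, 0],
    columns=["B", "A", "C"],
)

# The columns keep the order they were passed in.
print(list(df.columns))

# .values follows that same column order.
print(df.values)

# After reindexing the columns, .values follows the requested order.
reordered = df.reindex(columns=["C", "B", "A"])
print(reordered.values)
```

In current pandas the reindexed array comes back in `C, B, A` order, matching what `dm.reindex(columns=['C','B','A']).values` returned in the report.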

This issue was closed.