fetch_openml: Add an option to which returns a DataFrame #11818

Closed
jnothman opened this issue Aug 15, 2018 · 17 comments · Fixed by #13902
Labels
Enhancement, help wanted, Moderate (Anything that requires some knowledge of conventions and best practices)

Comments

@jnothman
Member

fetch_openml currently rejects STRING-valued attributes and ordinal-encodes all NOMINAL attributes, in order to return an array or sparse matrix of floats by default.

We should have a parameter that instead returns a DataFrame of features as the 'data' entry in the returned Bunch. This would (by default) keep nominals as pd.Categorical and strings as objects. Columns would have names determined from the ARFF attribute names / OpenML metadata. Perhaps we would also set the DataFrame's index corresponding to the is_row_identifier attribute in OpenML.

See #10733 for the general issue of an API for returning DataFrames in sklearn.datasets.
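For concreteness, a minimal sketch of how such an option might look from the user's side. The `as_frame` parameter name and the exact Bunch layout are assumptions here, not part of `fetch_openml` at the time this issue was opened:

```python
from sklearn.datasets import fetch_openml

# Hypothetical usage: `as_frame` is an assumed parameter name for the option
# proposed above, not an existing part of the fetch_openml signature.
bunch = fetch_openml("titanic", version=1, as_frame=True)

X = bunch.data                 # pandas DataFrame, columns named from the ARFF attributes
print(X.dtypes)                # nominal attributes as 'category', string attributes as object
print(bunch.target.head())     # target kept alongside, e.g. as a Categorical Series
```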

@jnothman jnothman added the Enhancement, Moderate, and help wanted labels on Aug 15, 2018
@jnothman
Member Author

An extra challenge may be supporting SparseDataFrame output. But first we need to confirm that Scikit-learn generally supports SparseDataFrames....

@rth
Member

rth commented Aug 15, 2018

An extra challenge may be supporting SparseDataFrame output. But first we need to confirm that Scikit-learn generally supports SparseDataFrames....

@jorisvandenbossche would know more, but as far as I understand SparseDataFrames are not really sparse in the sense of encoding only non-null values, but are roughly equivalent to compressed dense dataframes (cf. docs).

At least for text processing with a typical sparse document-term matrix, in my experience the performance is not even remotely comparable to using CSR (as in not usable). IMO, sparse xarrays (pydata/xarray#1375) would be closer to what's needed for text processing, but that's not really in scope.

Though maybe SparseDataFrame would do just fine, e.g. for one-hot-encoded data with a relatively small number of features.

@jorisvandenbossche
Member

as far as I understand SparseDataFrames are not really sparse in the sense of encoding only non-null values, but are roughly equivalent to compressed dense dataframes (cf. docs).

I am not sure there is necessarily a difference (I didn't write the docs :-)), it's just that pandas uses NaN as the default "fill value" instead of 0, but you can easily choose 0 when having mostly zero data.
And then it is very similar to the sparse COO format, with the main difference that it is limited to 1D (so in a sparse DataFrame, each column stores a 1D sparse 'array').

At least for text processing with a typical sparse document-term matrix, in my experience the performance is not even remotely comparable to using CSR (as in not usable).

Compared to CSR, I assume the main (inherent) limitation comes from the fact that you have a huge number of 1D sparse arrays instead of one 2D sparse array (and of course, in addition, the pandas implementation may not be the most optimized either).

So I think SparseDataFrame is simply not really fitting the needs for use cases with a huge number of features.
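For concreteness, a minimal sketch contrasting the two layouts. It uses the `DataFrame.sparse.from_spmatrix` accessor, which only exists in later pandas releases, so take the exact API as an assumption relative to the pandas available when this thread was written:

```python
import pandas as pd
import scipy.sparse as sp

# A 2D sparse matrix: one contiguous CSR structure holding all non-zero entries.
X_csr = sp.random(1000, 500, density=0.01, format="csr", random_state=0)

# The pandas counterpart stores one 1D sparse array per column, with a fill
# value (0 here) instead of materialising the zeros.
df = pd.DataFrame.sparse.from_spmatrix(X_csr)
print(df.sparse.density)   # fraction of explicitly stored values, ~0.01
print(df.dtypes.iloc[0])   # Sparse[float64, 0]
```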


That said, I would not consider the sparse functionality the most stable part of pandas (I never used it in practice myself, and there is currently some refactoring being done: pandas-dev/pandas#22325 (review)).

So I think a good first step would already be to support dense DataFrames.

@jnothman jnothman changed the title from "Add an option to fetch_openml which returns a DataFrame" to "fetch_openml: Add an option to which returns a DataFrame" on Aug 15, 2018
@KOLANICH

Some datasets have binary and numerical features incorrectly marked as categorical, so we should not rely only on the metadata but heuristically convert the data. I just call set() and, if the len(...) is 2, it should be bool; then pds.loc[fn] = (pds.loc[fn] == cat[1]).
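A minimal sketch of the heuristic being described, written out for clarity; the DataFrame argument and column handling are illustrative placeholders (and, as the following comments argue, relying on such a heuristic instead of the metadata is questionable):

```python
import pandas as pd

def booleanize_two_valued(df: pd.DataFrame) -> pd.DataFrame:
    # Heuristic from the comment above: any column with exactly two distinct
    # values is treated as boolean by comparing against one of them.
    out = df.copy()
    for fn in out.columns:
        cats = sorted(out[fn].dropna().unique())
        if len(cats) == 2:
            out[fn] = out[fn] == cats[1]
    return out
```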

@amueller
Member

We shouldn't be using heuristics here. What if the two values are "2.7" and "3.4" as strings? Your heuristic would make them boolean, but they might be floats. It's impossible to know without metadata. Scikit-learn tries not to be too magic (which means it sometimes requires the user to be very explicit).

@KOLANICH

KOLANICH commented Nov 27, 2018

Your heuristic would make them boolean, but they might be floats.

It is better to process them as booleans in this case, rather than floats, isn't it?

@amueller
Member

Why? How would you know that? That depends entirely on the semantics of the data and your model, right? What if the test set has 3.5? Is that a different category or just a bit larger than 3.4?

@KOLANICH

KOLANICH commented Nov 27, 2018

What if the test set has 3.5

We don't get test sets separately, we get whole datasets and split them ourselves. We decide based on all the data.

And there is no difference between 0 and 1 and 2.7 + 0*0.7 and 2.7 + 1*0.7; this is just a linear dependency, and any sensible model would capture it.
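To illustrate the point being made, a minimal sketch on a toy regression problem showing that recoding a binary feature from {0, 1} to {2.7, 3.4} is an affine map (x ↦ 2.7 + 0.7·x) and so does not change a linear model's predictions (the toy data itself is an assumption for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x01 = rng.randint(0, 2, size=100).astype(float)   # feature encoded as {0, 1}
x_raw = 2.7 + 0.7 * x01                           # same feature encoded as {2.7, 3.4}
y = 3.0 * x01 + rng.normal(scale=0.1, size=100)

pred01 = LinearRegression().fit(x01[:, None], y).predict(x01[:, None])
pred_raw = LinearRegression().fit(x_raw[:, None], y).predict(x_raw[:, None])
print(np.allclose(pred01, pred_raw))              # True: the two encodings are equivalent
```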

@amueller
Member

amueller commented Nov 27, 2018

We don't get test sets separately, we get whole datasets and split them ourselves. We decide based on all the data.

This is not a good assumption for machine learning and not the assumption scikit-learn is based on.

@KOLANICH

This is not a good assumption for machine learning and not the assumption scikit-learn is based on.

We are currently speaking not about ML in general, but about fetching datasets from OpenML.

@amueller
Member

Fair.

But what if the dataset has three values? Then you don't do it? That's pretty unexpected behavior.

@amueller
Member

Btw, #12502 is somewhat related.

@KOLANICH

KOLANICH commented Nov 27, 2018

But what if the dataset has three values? Then you don't do it? That's pretty unexpected behavior.

Yes. But why should I be surprised? The whole point of having machine-readable datasets is to apply models to them without any human intervention, so I shouldn't even know that something has changed; my library will do the preprocessing for me.

@amueller
Member

You're suggesting to actually discard the machine-readable information, i.e. the meta-data.

@KOLANICH

misinformation

@jnothman
Member Author

jnothman commented Nov 28, 2018 via email

@rth
Member

rth commented Feb 4, 2019

Just to give some feedback on this as a user. I tried to load https://www.openml.org/d/1461, which is a heterogeneous dataset, with fetch_openml. It can be represented nicely with a DataFrame and can be read from the original CSV with one line of pd.read_csv.

When using the fetch_openml function, categorical features are encoded as ordinals and then cast to float (since we want an array). From my perspective, on this dataset that prevents one from doing anything useful with it. The python-openml package also doesn't support loading data as a DataFrame until openml/openml-python#548 is merged.

In terms of usability of OpenML datasets, returning DataFrames would be really nice.
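For comparison, the one-liner referred to above, as a minimal sketch assuming a local export of that dataset; the filename and the semicolon separator are assumptions about that particular CSV:

```python
import pandas as pd

# Hypothetical local export of https://www.openml.org/d/1461; the filename and
# the separator are assumptions, not something fetch_openml provides.
df = pd.read_csv("bank-marketing.csv", sep=";")
print(df.dtypes)   # string/categorical columns stay as object instead of being ordinal-encoded floats
```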
