
[MRG] Adds fetch_openml pandas dataframe support #13902

Merged
merged 43 commits into from
Jul 12, 2019

Conversation

thomasjpfan
Member

Reference Issues/PRs

Fixes #11819
Fixes #11818
Supersedes #11875
Supersedes #13177

What does this implement/fix? Explain your changes.

This PR adds return_frame to fetch_openml. Some design decisions were made to get this to work (when return_frame is True):

  1. The data attribute will contain the whole dataframe. A pandas dataframe is usually used to explore the data, with groupbys, etc. Having the target in the dataframe helps with this.
  2. The target attribute will be set to None. (The target is in the pandas dataframe.)
  3. A new target_names attribute was added. Together with feature_names, this allows the pandas dataframe to be split into features and targets later.
  4. String data types will always be objects.
  5. Nominal data types will always be categories.
  6. A real, numeric, or integer column with missing values will always be floats.
  7. Integer columns without missing values will be ints.
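
The dtype rules above can be sketched as a small helper. This is illustrative only, not the PR's actual code; the key names `data_type` and `number_of_missing_values` are assumed to follow OpenML's feature metadata:

```python
def feature_dtype(feature):
    """Map an OpenML feature description to a pandas-compatible dtype.

    ``feature`` is assumed to be a dict shaped like OpenML's feature
    metadata, e.g. {'data_type': 'integer', 'number_of_missing_values': '0'}.
    """
    data_type = feature['data_type']
    has_missing = int(feature['number_of_missing_values']) > 0
    if data_type == 'string':
        return object                        # strings stay object dtype
    if data_type == 'nominal':
        return 'category'                    # nominal -> pandas categorical
    if data_type in ('real', 'numeric'):
        return float
    if data_type == 'integer':
        return float if has_missing else int  # NaN forces float
    raise ValueError('Unsupported feature: {}'.format(feature))
```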

There will be an unrelated error from test_scale_and_stability.

@jnothman
Member

The data attribute will contain the whole dataframe. A pandas dataframe is usually used to explore the data, with groupbys, etc. Having the target in the dataframe helps with this.

What about if return_X_y=True?

I agree that including the target in the frame corresponds to common pandas usage, but it doesn't correspond to Scikit-learn convention. I'd rather leave the target out unless you have a more compelling reason otherwise.

check_pandas_support('fetch_openml with return_frame=True')
import pandas as pd

df = pd.DataFrame(arrf_data['data'], columns=list(features_dict.keys()),
Member

arff_data['data'] is a generator, right? Does that work with all supported versions of Pandas? Can you make sure that it does not just do list(data) and explode the memory usage?

@glemaitre May 22, 2019
Member

Otherwise we can pass it to pd.DataFrame.from_records(), which supports generators since 0.12

Actually @jorisvandenbossche mentioned that it should not be implemented in from_records either.

Member Author

pd.DataFrame does call list(data), which we need to avoid.

Otherwise we can pass it to pd.DataFrame.from_records()

Just tested from_records out and it seems to work. :)
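
That check is easy to reproduce (on a recent pandas; behavior across the old versions discussed above is not verified here):

```python
import pandas as pd

# from_records accepts a generator of row tuples directly
rows = ((i, chr(97 + i)) for i in range(3))
df = pd.DataFrame.from_records(rows, columns=['num', 'letter'])
```

Note the later review comment: from_records still expands the generator into a list internally, so it accepts generators but does not stream them.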

sklearn/datasets/openml.py
@thomasjpfan
Member Author

I agree that splitting up the dataframe into target and data matches the scikit-learn convention more. If we break up data and target, we would need to guide users to recombine the dataset to do any exploratory data analysis that needs the target in the dataframe. This is a little counter to what is expected when loading a dataframe from a csv, where the target is included. Users are used to breaking up the dataframe themselves.

In the end, I am okay with going either way on this matter.

@jnothman
Member

jnothman commented May 22, 2019 via email

-------
df : pd.DataFrame
"""
check_pandas_support('fetch_openml with return_frame=True')
Member

Do you think that it would be nice to have directly something like:

pd = check_pandas_support(...)

Member Author

Yup, that would be nice to have. I'll include it and see what others think.


@jorisvandenbossche left a comment
Member

@thomasjpfan thanks for taking this up! Added a few comments

except ImportError as e:
raise ImportError(
"{} requires pandas. You can install pandas with "
"`pip install pandas`".format(caller_name)
Member

I personally wouldn't suggest to pip install pandas (eg for all conda users, this is the wrong suggestion)

df : pd.DataFrame
"""
pd = check_pandas_support('fetch_openml with return_frame=True')
df = pd.DataFrame.from_records(arrf_data['data'],
Member

For some reason GitHub prevents me from replying to the previous comments about this.
But I would personally just use pd.DataFrame(..). from_records is not more efficient (it expands the generator first as well) and is less well maintained than the main constructor.

Member

What might work is to load the data in chunks to limit memory consumption
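
The chunked-loading idea could look roughly like this (an illustrative helper under assumed names, not the PR's implementation; it assumes the ARFF source yields row tuples):

```python
import pandas as pd
from itertools import islice

def frame_from_generator(gen, columns, chunksize=5000):
    # Materialize at most ``chunksize`` rows as Python objects at a time,
    # then concatenate the partial frames into one result.
    chunks = []
    while True:
        rows = list(islice(gen, chunksize))
        if not rows:
            break
        chunks.append(pd.DataFrame(rows, columns=columns))
    return pd.concat(chunks, ignore_index=True)

rows = ((i, i * i) for i in range(10))
df = frame_from_generator(rows, ['n', 'n_squared'], chunksize=3)
```

Peak memory for the raw Python row objects is then bounded by the chunk size rather than the full dataset.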

dtype = pd.CategoricalDtype(attributes[column])
dtypes[column] = dtype

return df.astype(dtypes)
Member

In my original PR, I had the special case of all numeric columns (that prevents yet another copy for that specific case).
Just mentioning, but I think it might be worth it (didn't check again with a timing though)

@thomasjpfan
Member Author

This PR has been updated with:

  1. An nrows parameter in fetch_openml to control how many rows to consume at a time when constructing the dataframe.
  2. Less copying in _convert_arff_data_dataframe.
  3. Features that OpenML marks as is_ignore or is_row_identifier are removed from the dataframe. If a dataset contains such columns, removing them copies the dataframe, which forces fetch_openml to return a non-view dataframe.
  4. Raises an error when return_X_y=True and return_frame=True. To correctly follow scikit-learn convention, X could still be a dataframe, but y would have to be reshaped into a 1-D array (if there is one target). Letting y be a numpy array appears to contradict return_frame=True.

arrf_data_gen = _chunk_iterable(arrf['data'], nrows)
dfs = [pd.DataFrame(list(data), columns=arrf_columns, dtype=object)
for data in arrf_data_gen]
df = pd.concat(dfs, copy=False)
Member

You can forget the copy=False here :) (that's never possible along this axis)

If True, returns a Bunch where the data attribute is a pandas
DataFrame.

nrows : int, default=5000
Member

Maybe rather chunksize, as you called it above?
For an nrows argument, I would expect it to return only that many (first) rows.

arrf_columns = list(attributes)

arrf_data_gen = _chunk_iterable(arrf['data'], nrows)
dfs = [pd.DataFrame(list(data), columns=arrf_columns, dtype=object)
Member

Is this object dtype needed?

@jnothman left a comment
Member

I'm still not convinced by the sense of returning the target in the frame. I see it as incompatible with the rest of sklearn.datasets (apart from making it easy to leak y to an estimator).


_monkey_patch_webbased_functions(monkeypatch, data_id, True)

bunch = fetch_openml(data_id=data_id, return_frame=True, cache=False)
Member

I think it would be best to compare the bunches returned with return_frame=False vs True. This is especially needed in case we were to add something to the bunch later...


chunksize : int, default=5000
Number of rows to read at a time when constructing a dataframe.
Only used when ``return_frame`` is True.
Member

we could be using this more generally instead of the messy np.fromiter code.

Member Author

What do you find messy about the np.fromiter code?

DataFrame.

chunksize : int, default=5000
Number of rows to read at a time when constructing a dataframe.
Member

Is it possible to estimate against working_memory instead?

Member Author

It is possible. We would need to estimate the amount of memory used per row.
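
One way to estimate that per-row cost is to build a one-row frame and measure it, then derive a row count from a memory budget. A minimal sketch, assuming a fixed working-memory budget rather than scikit-learn's working_memory config (the helper name is hypothetical):

```python
import pandas as pd

def estimate_chunksize(first_row, columns, working_memory_mib=64):
    # Measure bytes per row via a single-row frame, then compute how many
    # rows fit in the working-memory budget (at least one row per chunk).
    first_df = pd.DataFrame([first_row], columns=columns)
    row_bytes = first_df.memory_usage(deep=True).sum()
    return max(int(working_memory_mib * 2**20 // row_bytes), 1)

chunk = estimate_chunksize((1.0, 'setosa'), ['sepal_length', 'species'])
```

This mirrors the first_df.memory_usage(deep=True) approach that appears in the diff later in this thread.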

@thomasjpfan
Member Author

This PR was updated with an additional bunch entry, dataframe, which still includes the target. When return_frame=True, both data and target are None. A user would need to use target_columns and feature_columns to split features from the target. Adding a dataframe entry helps distinguish this API from the rest of the sklearn.datasets API.

I think the common use case for using a dataframe is to do EDA (using the target), and then use the target_columns and feature_columns to split the data. The user would also need to be careful with the shape of y, which is already the case when working with dataframes.

Let's say we split the dataframe into data and target: the target would need to be reshaped, losing its pandas categorical dtype information and its column name. If we want a user to reconstruct the original dataframe, they would need to manually rebuild the categorical dtype and put the original column name back. All of this work would be required just to do EDA.

Currently the sklearn.datasets API is designed to return data and target that can be directly fed into a scikit-learn estimator. I am envisioning a new dataframe entry for users who want to explore the data first, before building a model.

@jnothman
Member

jnothman commented May 25, 2019 via email

@jnothman
Member

jnothman commented May 25, 2019 via email

@thomasjpfan
Member Author

(although even there the user should probably not look at the target - and
arguably the features - on their test set)

There are some plots that benefit from having the target in the dataframe. For example, plotting the iris dataset with seaborn uses the target, species, to control the hue.
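
The same point holds for plain per-class EDA, not just plotting. A toy example on a hypothetical iris-like frame (column names assumed for illustration):

```python
import pandas as pd

# Hypothetical iris-like frame with the target column included
df = pd.DataFrame({
    'sepal_length': [5.1, 7.0, 6.3, 4.9],
    'species': ['setosa', 'versicolor', 'virginica', 'setosa'],
})

# Per-class summaries are one-liners when the target lives in the frame;
# plotting libraries can likewise use the target column for hue/color.
means = df.groupby('species')['sepal_length'].mean()
```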

We are forcing users to learn a new way of
interacting with datasets (whether changing to fetch_openml from load_iris,
or vice versa) despite them being functionally required to use return_frame
in many cases. Does that make sense?

This makes sense. Keeping the datasets API consistent is important. Okay, I will update this PR to split the features and the target.

I should also point out that return_frame='auto' could be a nice feature,
for datasets with any non-numeric features. The current approach precludes
that possibility.

This sounds reasonable. It does not feel like too much magic.

@thomasjpfan
Member Author

This PR was updated with:

  1. Renames data_column to data_columns in the implementation.
  2. Uses get_chunk_n_rows to get the chunksize.
  3. Creates the Bunch object in only one place.
  4. Filters out columns as the dataframe is being read.

@jnothman left a comment
Member

I've only checked the diff since last review, but I think this is almost there.

first_row = next(arrf['data'])
first_df = pd.DataFrame([first_row], columns=arrf_columns)

row_bytes = first_df.memory_usage(deep=True).sum()
Member

Maybe just comment to that effect. In practice the temporary list would have a further intp overhead per feature per sample if I'm not mistaken.

@@ -582,8 +569,11 @@ def fetch_openml(name=None, version='active', data_id=None, data_home=None,
below for more information about the `data` and `target` objects.

as_frame : boolean, default=False
If True, returns a Bunch where the data attribute is a pandas
DataFrame.
If True, where the data is a pandas DataFrame including columns with
Member

"where the" -> "the returned"

Member

Not addressed?

If True, where the data is a pandas DataFrame including columns with
appropriate dtypes (numeric, string or categorical). The target is
a pandas DataFrame or Series depending on the number of target_columns.
If ``return_X_y`` is True, then ``(data, target)`` will be pandas
Member

I am not sure this is necessary if the previous statement does not mention a Bunch. You could just say "this applies regardless of return_X_y"

appropriate dtypes (numeric, string or categorical). The target is
a pandas DataFrame or Series depending on the number of target_columns.
If ``return_X_y`` is True, then ``(data, target)`` will be pandas
DataFrames or Series as describe above.
Member

*described

@jnothman left a comment
Member

I still wouldn't mind a test that essentially checked the diff between the as_frame=True and the as_frame=False bunch. But otherwise LGTM. Please add a what's new entry. And perhaps open a follow-up issue towards as_frame='auto'.

@thomasjpfan
Member Author

PR updated with:

  1. Test to compare as_frame=True and as_frame=False on a purely numerical dataset.

@glemaitre glemaitre self-requested a review July 2, 2019 15:00
@glemaitre left a comment
Member

I think that this is almost done. Can you add a what's new entry?

raise ValueError('Unsupported feature: {}'.format(feature))


def _chunk_generator(gen, chunksize):
Member

I would move this function in the utils next to the other generator.
I don't think that we can factorize it with the current one.
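
For reference, a minimal sketch of what such a chunking helper could look like (the PR's actual body isn't shown in this thread, so this is an assumption based on the name and usage):

```python
from itertools import islice

def _chunk_generator(gen, chunksize):
    """Yield successive lists of up to ``chunksize`` items from ``gen``."""
    while True:
        chunk = list(islice(gen, chunksize))
        if not chunk:
            return
        yield chunk

chunks = list(_chunk_generator(iter(range(7)), 3))
```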

@glemaitre left a comment
Member

LGTM. I would like to hear from @ogrisel and @jorisvandenbossche on whether they agree with the API.
I think this is in line with what we discussed earlier, but just to confirm before clicking the green button ;)

doc/whats_new/v0.22.rst
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@@ -71,9 +71,6 @@
clf = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', LogisticRegression())])

X = data.drop('survived', axis=1)
Member

I would like to document this idiom somewhere. It's sooo common and I'm not sure if it's anywhere else in the docs.
Maybe give it as an alternative here?

Member Author

Added a comment using the frame attribute.

Member

thanks!

Member

I would like to document this idiom somewhere.

Is y = X.pop('survived') a better idiom? Or not, because it modifies X in-place?

Member

I personally prefer frame.drop(columns='survived') instead of specifying the axis=1
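
The idioms discussed in this thread, side by side on a toy frame (column names hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'age': [22, 38], 'fare': [7.25, 71.28], 'survived': [0, 1]})

# Non-mutating: df keeps all of its columns
X = df.drop(columns='survived')
y = df['survived']

# Mutating alternative: pop removes the column from the frame in place
df2 = df.copy()
y2 = df2.pop('survived')
```

drop(columns=...) avoids the less readable axis=1, while pop is shorter but mutates the frame, which is why it is the more debatable idiom to document.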

@@ -489,26 +557,38 @@ def fetch_openml(name=None, version='active', data_id=None, data_home=None,
If True, returns ``(data, target)`` instead of a Bunch object. See
below for more information about the `data` and `target` objects.

as_frame : boolean, default=False
If True, where the data is a pandas DataFrame including columns with
appropriate dtypes (numeric, string or categorical). The target is
Member

It doesn't mention the frame attribute?

details : dict
More metadata from OpenML
frame : pandas DataFrame
Member

"Only present when as_frame=True"?

@glemaitre
Member

@thomasjpfan Thanks. Maybe we can have some examples that could benefit from it now.

@rth
Member

rth commented Jul 12, 2019

I'm really excited that this was added! Thanks to everyone who worked on it!

"""
pd = check_pandas_support('fetch_openml with as_frame=True')

attributes = OrderedDict(arrf['attributes'])
Member

typos, should be arff

@jorisvandenbossche
Member

@thomasjpfan thanks for taking this to the finish line!
