
[MRG] Adds fetch_openml pandas dataframe support #13902

Merged
merged 43 commits into from
Jul 12, 2019

Conversation

thomasjpfan
Member

Reference Issues/PRs

Fixes #11819
Fixes #11818
Supersedes #11875
Supersedes #13177

What does this implement/fix? Explain your changes.

This PR adds return_frame to fetch_openml. Some design decisions were made to get this to work (when return_frame is True):

  1. The data attribute will contain the whole dataframe. A pandas dataframe is usually used to explore the data, with groupbys, etc. Having the target in the dataframe helps with this.
  2. The target attribute will be set to None. (The target is in the pandas dataframe.)
  3. A new target_names attribute was added. Together with feature_names, this allows the pandas dataframe to be split into features and targets later.
  4. String data types will always be objects.
  5. Nominal data types will always be categories.
  6. A real, numeric, or integer column with missing values will always be floats.
  7. Integer columns without missing values will be ints.
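
The dtype rules above can be sketched as a small helper. This is illustrative only, not the PR's actual code; the key names `data_type` and `number_of_missing_values` are assumed to follow OpenML's feature metadata:

```python
def feature_dtype(feature):
    """Map an OpenML feature description to a pandas-compatible dtype.

    ``feature`` is assumed to be a dict shaped like OpenML's feature
    metadata, e.g. {'data_type': 'integer', 'number_of_missing_values': '0'}.
    """
    data_type = feature['data_type']
    has_missing = int(feature['number_of_missing_values']) > 0
    if data_type == 'string':
        return object                        # strings stay object dtype
    if data_type == 'nominal':
        return 'category'                    # nominal -> pandas categorical
    if data_type in ('real', 'numeric'):
        return float
    if data_type == 'integer':
        return float if has_missing else int  # NaN forces float
    raise ValueError('Unsupported feature: {}'.format(feature))
```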

There will be an unrelated error from test_scale_and_stability.

@jnothman
Member

The data attribute will contain the whole dataframe. A pandas dataframe is usually used to explore the data, with groupbys, etc. Having the target in the dataframe helps with this.

What about if return_X_y=True?

I agree that including the target in the frame corresponds to common pandas usage, but it doesn't correspond to Scikit-learn convention. I'd rather leave the target out unless you have a more compelling reason otherwise.

check_pandas_support('fetch_openml with return_frame=True')
import pandas as pd

df = pd.DataFrame(arrf_data['data'], columns=list(features_dict.keys()),
Member

arff_data['data'] is a generator, right? Does that work with all supported versions of Pandas? Can you make sure that it does not just do list(data) and explode the memory usage?

@glemaitre May 22, 2019
Member

Otherwise we can pass it to pd.DataFrame.from_records(), which supports generators since 0.12

Actually @jorisvandenbossche mentioned that it should not be implemented in from_records either.

Member Author

pd.DataFrame does call list(data), which we need to avoid.

Otherwise we can pass it to pd.DataFrame.from_records()

Just tested from_records out and it seems to work. :)
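
That check is easy to reproduce (on a recent pandas; behavior across the old versions discussed above is not verified here):

```python
import pandas as pd

# from_records accepts a generator of row tuples directly
rows = ((i, chr(97 + i)) for i in range(3))
df = pd.DataFrame.from_records(rows, columns=['num', 'letter'])
```

Note the later review comment: from_records still expands the generator into a list internally, so it accepts generators but does not stream them.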

sklearn/datasets/openml.py
@thomasjpfan
Member Author

I agree that splitting up the dataframe into target and data matches the scikit-learn convention more. If we break up data and target, we would need to guide users to recombine the dataset to do any exploratory data analysis that needs the target in the dataframe. This is a little counter to what is expected when loading a dataframe from a csv, where the target is included. Users are used to breaking up the dataframe themselves.

In the end, I am okay with going either way on this matter.

@jnothman
Member

jnothman commented May 22, 2019 via email

-------
df : pd.DataFrame
"""
check_pandas_support('fetch_openml with return_frame=True')
Member

Do you think that it would be nice to have directly something like:

pd = check_pandas_support(...)

Member Author

Yup, that would be nice to have. I'll include it and see what others think.


@jorisvandenbossche left a comment
Member

@thomasjpfan thanks for taking this up! Added a few comments

except ImportError as e:
raise ImportError(
"{} requires pandas. You can install pandas with "
"`pip install pandas`".format(caller_name)
Member

I personally wouldn't suggest to pip install pandas (eg for all conda users, this is the wrong suggestion)

df : pd.DataFrame
"""
pd = check_pandas_support('fetch_openml with return_frame=True')
df = pd.DataFrame.from_records(arrf_data['data'],
Member

For some reason GitHub prevents me from replying to the previous comments about this.
But I would personally just use pd.DataFrame(..). from_records is not more efficient (it expands the generator first as well) and is less well maintained than the main constructor.

Member

What might work is to load the data in chunks to limit memory consumption
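
The chunked-loading idea could look roughly like this (an illustrative helper under assumed names, not the PR's implementation; it assumes the ARFF source yields row tuples):

```python
import pandas as pd
from itertools import islice

def frame_from_generator(gen, columns, chunksize=5000):
    # Materialize at most ``chunksize`` rows as Python objects at a time,
    # then concatenate the partial frames into one result.
    chunks = []
    while True:
        rows = list(islice(gen, chunksize))
        if not rows:
            break
        chunks.append(pd.DataFrame(rows, columns=columns))
    return pd.concat(chunks, ignore_index=True)

rows = ((i, i * i) for i in range(10))
df = frame_from_generator(rows, ['n', 'n_squared'], chunksize=3)
```

Peak memory for the raw Python row objects is then bounded by the chunk size rather than the full dataset.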

dtype = pd.CategoricalDtype(attributes[column])
dtypes[column] = dtype

return df.astype(dtypes)
Member

In my original PR, I had the special case of all numeric columns (that prevents yet another copy for that specific case).
Just mentioning, but I think it might be worth it (didn't check again with a timing though)

@thomasjpfan
Member Author

This PR has been updated with:

  1. An nrows parameter in fetch_openml to control how many rows to consume at a time when constructing the dataframe.
  2. Less copying in _convert_arff_data_dataframe.
  3. Features that OpenML marks as is_ignore or is_row_identifier are removed from the dataframe. If a dataset contains such columns, removing them copies the dataframe, which forces fetch_openml to return a non-view dataframe.
  4. Raises an error when return_X_y=True and return_frame=True. To correctly follow scikit-learn convention, X could still be a dataframe, but y would have to be reshaped into a 1-D array (if there is one target). Letting y be a numpy array appears to contradict return_frame=True.

arrf_data_gen = _chunk_iterable(arrf['data'], nrows)
dfs = [pd.DataFrame(list(data), columns=arrf_columns, dtype=object)
for data in arrf_data_gen]
df = pd.concat(dfs, copy=False)
Member

You can forget the copy=False here :) (that's never possible along this axis)

If True, returns a Bunch where the data attribute is a pandas
DataFrame.

nrows : int, default=5000
Member

Maybe rather chunksize, as you called it above?
For an nrows argument, I would expect it to return only that many (first) rows.

arrf_columns = list(attributes)

arrf_data_gen = _chunk_iterable(arrf['data'], nrows)
dfs = [pd.DataFrame(list(data), columns=arrf_columns, dtype=object)
Member

Is this object dtype needed?

@jnothman left a comment
Member

I'm still not convinced by the sense of returning the target in the frame. I see it as incompatible with the rest of sklearn.datasets (apart from making it easy to leak y to an estimator).


_monkey_patch_webbased_functions(monkeypatch, data_id, True)

bunch = fetch_openml(data_id=data_id, return_frame=True, cache=False)
Member

I think it would be best to compare the bunches returned with return_frame=False vs True. This is especially needed in case we were to add something to the bunch later...


chunksize : int, default=5000
Number of rows to read at a time when constructing a dataframe.
Only used when ``return_frame`` is True.
Member

we could be using this more generally instead of the messy np.fromiter code.

Member Author

What do you find messy about the np.fromiter code?

DataFrame.

chunksize : int, default=5000
Number of rows to read at a time when constructing a dataframe.
Member

Is it possible to estimate against working_memory instead?

Member Author

It is possible. We would need to estimate the amount of memory used per row.
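
One way to estimate that per-row cost is to build a one-row frame and measure it, then derive a row count from a memory budget. A minimal sketch, assuming a fixed working-memory budget rather than scikit-learn's working_memory config (the helper name is hypothetical):

```python
import pandas as pd

def estimate_chunksize(first_row, columns, working_memory_mib=64):
    # Measure bytes per row via a single-row frame, then compute how many
    # rows fit in the working-memory budget (at least one row per chunk).
    first_df = pd.DataFrame([first_row], columns=columns)
    row_bytes = first_df.memory_usage(deep=True).sum()
    return max(int(working_memory_mib * 2**20 // row_bytes), 1)

chunk = estimate_chunksize((1.0, 'setosa'), ['sepal_length', 'species'])
```

This mirrors the first_df.memory_usage(deep=True) approach that appears in the diff later in this thread.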

@thomasjpfan
Member Author

This PR was updated with an additional bunch entry, dataframe, which still includes the target. When return_frame=True, both data and target are None. A user would need to use target_columns and feature_columns to split features from the target. Adding a dataframe entry helps distinguish this API from the rest of the sklearn.datasets API.

I think the common use case for using a dataframe is to do EDA (using the target), and then use the target_columns and feature_columns to split the data. The user would also need to be careful with the shape of y, which is already the case when working with dataframes.

Let's say we split the dataframe into data and target: the target would need to be reshaped, losing its pandas categorical dtype information and its column name. If we want a user to reconstruct the original dataframe, they would need to manually rebuild the categorical dtype and put the original column name back. All of this work would be required just to do EDA.

Currently the sklearn.datasets API is designed to return data and target that can be directly fed into a scikit-learn estimator. I am envisioning a new dataframe entry for users who want to explore the data first, before building a model.

@jnothman
Member

jnothman commented May 25, 2019 via email

@jnothman
Member

jnothman commented May 25, 2019 via email

@thomasjpfan
Member Author

(although even there the user should probably not look at the target - and
arguably the features - on their test set)

There are some plots that benefit from having the target in the dataframe. For example, plotting the iris dataset with seaborn uses the target, species, to control the hue.
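
The same point holds for plain per-class EDA, not just plotting. A toy example on a hypothetical iris-like frame (column names assumed for illustration):

```python
import pandas as pd

# Hypothetical iris-like frame with the target column included
df = pd.DataFrame({
    'sepal_length': [5.1, 7.0, 6.3, 4.9],
    'species': ['setosa', 'versicolor', 'virginica', 'setosa'],
})

# Per-class summaries are one-liners when the target lives in the frame;
# plotting libraries can likewise use the target column for hue/color.
means = df.groupby('species')['sepal_length'].mean()
```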

We are forcing users to learn a new way of
interacting with datasets (whether changing to fetch_openml from load_iris,
or vice versa) despite them being functionally required to use return_frame
in many cases. Does that make sense?

This makes sense. Keeping the datasets API consistent is important. Okay, I will update this PR to split the features and the target.

I should also point out that return_frame='auto' could be a nice feature,
for datasets with any non-numeric features. The current approach precludes
that possibility.

This sounds reasonable. It does not feel like too much magic.

@thomasjpfan
Member Author

This PR was updated with:

  1. Renames data_column to data_columns in the implementation.
  2. Uses get_chunk_n_rows to get the chunksize.
  3. Creates the Bunch object in only one place.
  4. Filters out columns as the dataframe is being read.

@jnothman left a comment
Member

I've only checked the diff since last review, but I think this is almost there.

first_row = next(arrf['data'])
first_df = pd.DataFrame([first_row], columns=arrf_columns)

row_bytes = first_df.memory_usage(deep=True).sum()
Member

Maybe just comment to that effect. In practice the temporary list would have a further intp overhead per feature per sample if I'm not mistaken.

@@ -582,8 +569,11 @@ def fetch_openml(name=None, version='active', data_id=None, data_home=None,
below for more information about the `data` and `target` objects.

as_frame : boolean, default=False
If True, returns a Bunch where the data attribute is a pandas
DataFrame.
If True, where the data is a pandas DataFrame including columns with
Member

"where the" -> "the returned"

Member

Not addressed?

If True, where the data is a pandas DataFrame including columns with
appropriate dtypes (numeric, string or categorical). The target is
a pandas DataFrame or Series depending on the number of target_columns.
If ``return_X_y`` is True, then ``(data, target)`` will be pandas
Member

I am not sure this is necessary if the previous statement does not mention a Bunch. You could just say "this applies regardless of return_X_y"

appropriate dtypes (numeric, string or categorical). The target is
a pandas DataFrame or Series depending on the number of target_columns.
If ``return_X_y`` is True, then ``(data, target)`` will be pandas
DataFrames or Series as describe above.
Member

*described

@jnothman left a comment
Member

I still wouldn't mind a test that essentially checked the diff between the as_frame=True and the as_frame=False bunch. But otherwise LGTM. Please add a what's new entry. And perhaps open a follow-up issue towards as_frame='auto'.

@thomasjpfan
Member Author

PR updated with:

  1. Test to compare as_frame=True and as_frame=False on a purely numerical dataset.

@glemaitre glemaitre self-requested a review July 2, 2019 15:00
@glemaitre left a comment
Member

I think that this is almost done. Can you add a what's new entry?

raise ValueError('Unsupported feature: {}'.format(feature))


def _chunk_generator(gen, chunksize):
Member

I would move this function in the utils next to the other generator.
I don't think that we can factorize it with the current one.
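
For reference, a minimal sketch of what such a chunking helper could look like (the PR's actual body isn't shown in this thread, so this is an assumption based on the name and usage):

```python
from itertools import islice

def _chunk_generator(gen, chunksize):
    """Yield successive lists of up to ``chunksize`` items from ``gen``."""
    while True:
        chunk = list(islice(gen, chunksize))
        if not chunk:
            return
        yield chunk

chunks = list(_chunk_generator(iter(range(7)), 3))
```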

@glemaitre left a comment
Member

LGTM. I would like to hear from @ogrisel and @jorisvandenbossche on whether they agree with the API.
I think this is in line with what we discussed earlier, but just to confirm before clicking the green button ;)

doc/whats_new/v0.22.rst
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@@ -71,9 +71,6 @@
clf = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', LogisticRegression())])

X = data.drop('survived', axis=1)
Member

I would like to document this idiom somewhere. It's sooo common and I'm not sure if it's anywhere else in the docs.
Maybe give it as an alternative here?

Member Author

Added a comment using the frame attribute.

Member

thanks!

Member

I would like to document this idiom somewhere.

Is y = X.pop('survived') a better idiom? Or not, because it modifies X in-place?

Member

I personally prefer frame.drop(columns='survived') instead of specifying the axis=1
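
The idioms discussed in this thread, side by side on a toy frame (column names hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'age': [22, 38], 'fare': [7.25, 71.28], 'survived': [0, 1]})

# Non-mutating: df keeps all of its columns
X = df.drop(columns='survived')
y = df['survived']

# Mutating alternative: pop removes the column from the frame in place
df2 = df.copy()
y2 = df2.pop('survived')
```

drop(columns=...) avoids the less readable axis=1, while pop is shorter but mutates the frame, which is why it is the more debatable idiom to document.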

@@ -489,26 +557,38 @@ def fetch_openml(name=None, version='active', data_id=None, data_home=None,
If True, returns ``(data, target)`` instead of a Bunch object. See
below for more information about the `data` and `target` objects.

as_frame : boolean, default=False
If True, where the data is a pandas DataFrame including columns with
appropriate dtypes (numeric, string or categorical). The target is
Member

It doesn't mention the frame attribute?

details : dict
More metadata from OpenML
frame : pandas DataFrame
Member

"Only present when as_frame=True"?

@glemaitre
Member

@thomasjpfan Thanks. Maybe we can have some examples that could benefit from it now.

@rth
Member

rth commented Jul 12, 2019

I'm really excited that this was added! Thanks to everyone who worked on it!

"""
pd = check_pandas_support('fetch_openml with as_frame=True')

attributes = OrderedDict(arrf['attributes'])
Member

typos, should be arff

@jorisvandenbossche
Member

@thomasjpfan thanks for taking this to the finish line!
