
Optimize array-from-ctypes in basic.py #3927

Merged: 1 commit into microsoft:master on Feb 17, 2021

Conversation

asford
Copy link
Contributor

@asford asford commented Feb 8, 2021

Approximately 80% of the runtime when loading "low column count, high row
count" DataFrames into Datasets is consumed in `np.fromiter`, called
as part of the `Dataset.get_field` method.

This is a particularly pernicious hotspot: unlike the other ctypes-based
methods, it is a hot loop over a Python iterator and causes
significant GIL contention in multi-threaded applications.

Replace `np.fromiter` with a direct call to `np.ctypeslib.as_array`,
which allows a single-shot copy of the underlying array.

This reduces the load time of a ~35 million row categorical dataframe
with 1 column from ~5 seconds to ~1 second, and allows multi-threaded
execution.

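For illustration, the old and new code paths boil down to roughly the following (a minimal standalone sketch using a synthetic ctypes buffer, not LightGBM's actual code):

```python
import ctypes

import numpy as np

# A synthetic ctypes float buffer standing in for the pointer LightGBM
# receives from its C API.
n = 5
buf = (ctypes.c_float * n)(*range(n))
cptr = ctypes.cast(buf, ctypes.POINTER(ctypes.c_float))

# Old path: np.fromiter walks the pointer element by element through the
# Python iterator protocol, one object per item.
slow = np.fromiter(cptr, dtype=np.float32, count=n)

# New path: view the buffer directly, then copy it in a single shot.
fast = np.ctypeslib.as_array(cptr, shape=(n,)).copy()

assert slow.dtype == fast.dtype == np.float32
assert (slow == fast).all()
```
Both paths produce an owning copy with the same dtype; the difference is purely how many trips through the Python interpreter the copy takes.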
@asford
Copy link
Contributor Author

asford commented Feb 9, 2021

Profile shot, v3.1.1 via conda-forge:

[profile image]

Post-patch, v3.1.1:

[profile image]

@jameslamb
Copy link
Collaborator

Thanks so much for taking the time to contribute @asford! I think we'll need a little time to research this and might have some followup questions, but the benchmarking graphic and the fact that all the tests are passing are really encouraging.

@asford
Copy link
Contributor Author

asford commented Feb 10, 2021

My pleasure, and thanks for all the effort on y'all's part maintaining the tool.

Not sure what would be useful from a review perspective, but this is actually quite easy to evaluate by monkeypatching the associated functions. If it'd be useful, I'd be happy to set up a Colab or Binder with a head-to-head demo, or set up a pytest-benchmark test case between the two code paths.

@guolinke
Copy link
Collaborator

Will the copy increase the memory usage?

@asford
Copy link
Contributor Author

asford commented Feb 14, 2021

The existing `np.fromiter`-based implementation already creates a copy of the underlying float array, just in a highly inefficient way, by iterating over the pointer via the ctypes interface. This change preserves the same semantics by copying the contents of the internal buffer.

There could be a minor reduction in memory consumption from returning a view, rather than a copy, of the ctypes pointer, but that opens a huge can of worms wrt the lifetime of the underlying buffer. I've no idea how LightGBM's internal memory management strategy is structured, but it's far safer to perform the copy so that the standard numpy memory management semantics are obeyed (i.e. the array contents are valid for the lifetime of the ndarray object).
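The copy-vs-view distinction can be illustrated with a small standalone sketch (plain ctypes and numpy, not LightGBM's actual code):

```python
import ctypes

import numpy as np

# A ctypes buffer standing in for the pointer returned by the C library.
buf = (ctypes.c_double * 3)(1.0, 2.0, 3.0)
cptr = ctypes.cast(buf, ctypes.POINTER(ctypes.c_double))

view = np.ctypeslib.as_array(cptr, shape=(3,))  # shares memory with buf
copied = view.copy()                            # owns its own buffer

buf[0] = 99.0
assert view[0] == 99.0   # the view reflects mutation of the ctypes buffer
assert copied[0] == 1.0  # the copy is decoupled from the buffer's lifetime
```
The `.copy()` is what keeps the returned ndarray valid even after the C-side buffer is freed or reused.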

As you can see from the second, post-fix profiling result, there are still big gains to be had in tuning the "Dataset from DataFrame" story. For example, the current categorical normalization performs at least one full-size copy of the source dataframe and performs a slow single-threaded normalization pass.

@jameslamb given your work with dask dataframes you may find this strategy useful. I was seeing some GME-circa-January '21 gains in dataset construction performance. 🚀

Code snippet for threaded, multi-column Dataset construction
from typing import Callable

import dask
import lightgbm as lgb
import pandas as pd


def merge_dataset(target: lgb.Dataset, other: lgb.Dataset) -> lgb.Dataset:
    """In-place merge of LightGBM Dataset features.
    Args:
        target: Dataset, updated in place.
        other: Dataset, features added to target.
    Returns:
        In-place updated target.
    """

    # LightGBM contains a booster reference, in C, which is handled via
    # "add_features_from".

    # pandas-categorical metadata, needed to ensure proper cat encoding,
    # is stored on the python-level object and must be manually merged

    if target.pandas_categorical or other.pandas_categorical:
        pandas_categorical = (
            target.pandas_categorical if target.pandas_categorical else []
        ) + (other.pandas_categorical if other.pandas_categorical else [])
    else:
        pandas_categorical = None

    categorical_column = target.params.get("categorical_column", []) + [
        c + target.num_feature() for c in other.params.get("categorical_column", [])
    ]

    target.add_features_from(other)

    target.pandas_categorical = pandas_categorical
    if categorical_column:
        target.params["categorical_column"] = categorical_column

    return target


def construct_dataset_threaded(
    df: pd.DataFrame, pre_apply: Callable[[pd.DataFrame], pd.DataFrame] = lambda df: df
) -> lgb.Dataset:
    """Multi-threaded Dataset setup.
    Args:
        df: Target dataframe, multi-threaded construct on a per-column basis.
        pre_apply: Pre-process a single-column frame before Dataset constructor.
    Returns:
        Dataset, with feature-per-column.
    """

    @dask.delayed
    def d_construct(col_df: pd.DataFrame) -> lgb.Dataset:
        return lgb.Dataset(pre_apply(col_df)).construct()

    with dask.config.set(scheduler="threads"):
        (d_lgb_datasets,) = dask.persist([d_construct(df[[c]]) for c in df.columns])

        dataset = d_lgb_datasets[0].compute()
        for d_other in d_lgb_datasets[1:]:
            merge_dataset(dataset, d_other.compute())

        return dataset

@StrikerRUS
Copy link
Collaborator

@asford Thanks a lot for your explanation!

For reference, the source code of the `fromiter` function:
https://github.com/numpy/numpy/blob/b4491833d636344ef295f1609bcf4742c1d94d10/numpy/core/src/multiarray/multiarraymodule.c#L2206-L2231
https://github.com/numpy/numpy/blob/72be9ca10b5391fc036c4cd61830e168712b51ea/numpy/core/src/multiarray/ctors.c#L3731-L3848

My only concern is that we might lose dtype information with the new interface. Will numpy preserve the original type of `cptr`, and not cast, for example, `float` to the default `float64`? According to the source code of `as_array` it will be preserved, but I'd like to be 100% sure.

@asford
Copy link
Contributor Author

asford commented Feb 15, 2021

As far as I can tell, the requested types in these functions were probably there to work around the fact that `np.fromiter` may disregard the type of the input ctypes pointer.

As you mention, `as_array` infers the array dtype from the source pointer, so the ctypes dtype is passed through. You could be extremely defensive and assert on the expected dtype before returning; however, I'm not a yuge fan of asserts "on the way out".
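For instance, a standalone check (plain numpy and ctypes, not proposed for the patch itself) confirms the dtype pass-through:

```python
import ctypes

import numpy as np

# as_array infers the dtype from the ctypes pointer type, with no
# implicit cast to the default float64.
for ctype, expected in [
    (ctypes.c_float, np.float32),
    (ctypes.c_double, np.float64),
    (ctypes.c_int32, np.int32),
]:
    buf = (ctype * 4)(*range(4))
    cptr = ctypes.cast(buf, ctypes.POINTER(ctype))
    arr = np.ctypeslib.as_array(cptr, shape=(4,))
    assert arr.dtype == np.dtype(expected)
```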

The pull is set to "Allow edits by maintainers". I'm totally fine with edits here to match the expectations of the project; I will likely not be online to update this pull for several days.

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment


LGTM based on great comments from @asford !

Copy link
Collaborator

@jameslamb jameslamb left a comment


great contribution, thanks so much for the time and effort you put into improving LightGBM!

Copy link
Collaborator

@guolinke guolinke left a comment


Thank you!

@jameslamb jameslamb merged commit de8c610 into microsoft:master Feb 17, 2021
@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023