Filtering w/o Pandas? #407

Open
lilyminium opened this issue Nov 16, 2021 · 3 comments
@lilyminium
Contributor

lilyminium commented Nov 16, 2021

Have you considered filtering data sets without converting to Pandas under the hood? It can be difficult to hold both in memory at the same time, especially as the dataframe is wide format. This would also allow for progress bars. My intuition is that creating an index mask array from the properties directly and creating a new dataset from that will be substantially faster than converting to and from pandas.

Edit: happy to implement this and benchmark it, but I'm not sure how many users are relying on using dataframes for filtering.

Edit: It's also kind of odd that the filters don't take units, given the emphasis placed on units elsewhere by OpenFF. While this is clearly because of the units imposed by the dataframe, it's not at all clear if you start the filtering with a dataset directly. e.g.

pressure_filter = FilterByPressure.apply(
    data_set,
    FilterByPressureSchema(minimum_pressure=101.224, maximum_pressure=101.426)
)
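A mask-based filter along the lines suggested above could look like this sketch. It is pure Python over toy pressure values; how the real Evaluator properties expose their values is an assumption here, not the actual API.

```python
# Toy pressure values (kPa); a real data set would expose one per property.
pressures = [101.0, 101.3, 101.325, 101.5, 102.0]

def pressure_mask(values, minimum, maximum):
    """Build an index mask directly from the property values,
    skipping the to_pandas()/from_pandas() round trip."""
    return [minimum <= v <= maximum for v in values]

mask = pressure_mask(pressures, 101.224, 101.426)
filtered = [v for v, keep in zip(pressures, mask) if keep]
print(filtered)  # [101.3, 101.325]
```

The mask could then be used to build a new dataset from the surviving properties without ever materialising a wide-format dataframe.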
@lilyminium
Contributor Author

lilyminium commented Nov 16, 2021

Using this function from StackOverflow (code below), a dataframe of a mixed data set is about twice the size of the actual data set (the data set in question is deduplicated ThermoML, so 704447 properties).

>>> df = data_set.to_pandas()
>>> get_obj_size(data_set)
2940669776
>>> get_obj_size(df)
6402499431

Code:

import gc
import sys

def get_obj_size(obj):
    marked = {id(obj)}
    obj_q = [obj]
    sz = 0

    while obj_q:
        sz += sum(map(sys.getsizeof, obj_q))

        # Look up all the objects referred to by the objects in obj_q.
        # See: https://docs.python.org/3.7/library/gc.html#gc.get_referents
        all_refr = ((id(o), o) for o in gc.get_referents(*obj_q))

        # Filter out objects that are already marked.
        # Using dict notation prevents counting repeated objects twice.
        new_refr = {o_id: o for o_id, o in all_refr if o_id not in marked and not isinstance(o, type)}

        # The new obj_q will be the ones that were not marked,
        # and we will update marked with their ids so we will
        # not traverse them again.
        obj_q = new_refr.values()
        marked.update(new_refr.keys())

    return sz
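The recursive walk is the point of the function: `sys.getsizeof` on its own is shallow and ignores everything an object refers to, so it badly undercounts containers. A quick standard-library illustration:

```python
import sys

big_string = "x" * 10_000
container = [big_string]

# The list's shallow size is just its header plus one pointer; it
# does not include the 10 kB string the list actually holds.
assert sys.getsizeof(container) < sys.getsizeof(big_string)
print(sys.getsizeof(container), sys.getsizeof(big_string))
```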

@lilyminium
Contributor Author

lilyminium commented Nov 16, 2021

I thought the disparity would be smaller with a one-property-type data set, but the dataframe number is suspiciously close to the one for all of ThermoML. There are only 2912 properties in this data set. Hmmmm.

>>> get_obj_size(solvation_data)
13313813
>>> get_obj_size(solvation_data_df)
6402502040
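One way to cross-check a suspicious number like this is pandas' own accounting, `DataFrame.memory_usage(deep=True)`, which sums the per-column buffers (plus the Python objects held by object-dtype columns) rather than walking arbitrary references; the gc-based walk above can additionally reach shared class and module objects, which may inflate the total. Toy DataFrame below, not the actual `solvation_data_df`:

```python
import pandas as pd

# Small stand-in frame with one numeric and one object-dtype column.
df = pd.DataFrame({"value": [1.0, 2.0], "substance": ["O", "CO"]})

# deep=True also counts the Python string objects in "substance";
# deep=False counts only the 8-byte pointers to them.
deep_bytes = df.memory_usage(deep=True).sum()
shallow_bytes = df.memory_usage(deep=False).sum()
print(deep_bytes, shallow_bytes)
```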

@mattwthompson
Member

Ref #506 (though I'm not caught up enough to understand this one)
