Filtering w/o Pandas? #407

Open
lilyminium opened this issue Nov 16, 2021 · 3 comments
@lilyminium
Contributor

lilyminium commented Nov 16, 2021

Have you considered filtering data sets without converting to Pandas under the hood? It can be difficult to hold both in memory at the same time, especially as the dataframe is wide format. This would also allow for progress bars. My intuition is that creating an index mask array from the properties directly and creating a new dataset from that will be substantially faster than converting to and from pandas.

Edit: happy to implement this and benchmark it, but I'm not sure how many users are relying on using dataframes for filtering.

Edit: It's also kind of odd that the filters don't take units, given the emphasis placed on units elsewhere by OpenFF. While this is clearly because of the units imposed by the dataframe, it's not at all clear if you start the filtering with a dataset directly. e.g.

pressure_filter = FilterByPressure.apply(
    data_set,
    FilterByPressureSchema(minimum_pressure=101.224, maximum_pressure=101.426)
)
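A mask-based filter along the lines suggested above could look like this sketch. It is pure Python over toy pressure values; how the real Evaluator properties expose their values is an assumption here, not the actual API.

```python
# Toy pressure values (kPa); a real data set would expose one per property.
pressures = [101.0, 101.3, 101.325, 101.5, 102.0]

def pressure_mask(values, minimum, maximum):
    """Build an index mask directly from the property values,
    skipping the to_pandas()/from_pandas() round trip."""
    return [minimum <= v <= maximum for v in values]

mask = pressure_mask(pressures, 101.224, 101.426)
filtered = [v for v, keep in zip(pressures, mask) if keep]
print(filtered)  # [101.3, 101.325]
```

The mask could then be used to build a new dataset from the surviving properties without ever materialising a wide-format dataframe.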
@lilyminium
Contributor Author

lilyminium commented Nov 16, 2021

Using this function from StackOverflow (code below), a dataframe of a mixed data set is about twice the size of the actual data set (the data set in question is deduplicated ThermoML, so 704447 properties).

>>> df = data_set.to_pandas()
>>> get_obj_size(data_set)
2940669776
>>> get_obj_size(df)
6402499431

Code:

import gc
import sys

def get_obj_size(obj):
    marked = {id(obj)}
    obj_q = [obj]
    sz = 0

    while obj_q:
        sz += sum(map(sys.getsizeof, obj_q))

        # Look up all the objects referred to by the objects in obj_q.
        # See: https://docs.python.org/3.7/library/gc.html#gc.get_referents
        all_refr = ((id(o), o) for o in gc.get_referents(*obj_q))

        # Filter out objects that are already marked.
        # Using dict notation prevents counting repeated objects twice.
        new_refr = {o_id: o for o_id, o in all_refr if o_id not in marked and not isinstance(o, type)}

        # The new obj_q will be the ones that were not marked,
        # and we will update marked with their ids so we will
        # not traverse them again.
        obj_q = new_refr.values()
        marked.update(new_refr.keys())

    return sz
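The recursive walk is the point of the function: `sys.getsizeof` on its own is shallow and ignores everything an object refers to, so it badly undercounts containers. A quick standard-library illustration:

```python
import sys

big_string = "x" * 10_000
container = [big_string]

# The list's shallow size is just its header plus one pointer; it
# does not include the 10 kB string the list actually holds.
assert sys.getsizeof(container) < sys.getsizeof(big_string)
print(sys.getsizeof(container), sys.getsizeof(big_string))
```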

@lilyminium
Contributor Author

lilyminium commented Nov 16, 2021

I thought the disparity would be smaller with a one-property-type data set, but the dataframe number is suspiciously close to the one for all of ThermoML. There are only 2912 properties in this data set. Hmmmm.

>>> get_obj_size(solvation_data)
13313813
>>> get_obj_size(solvation_data_df)
6402502040
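One way to cross-check a suspicious number like this is pandas' own accounting, `DataFrame.memory_usage(deep=True)`, which sums the per-column buffers (plus the Python objects held by object-dtype columns) rather than walking arbitrary references; the gc-based walk above can additionally reach shared class and module objects, which may inflate the total. Toy DataFrame below, not the actual `solvation_data_df`:

```python
import pandas as pd

# Small stand-in frame with one numeric and one object-dtype column.
df = pd.DataFrame({"value": [1.0, 2.0], "substance": ["O", "CO"]})

# deep=True also counts the Python string objects in "substance";
# deep=False counts only the 8-byte pointers to them.
deep_bytes = df.memory_usage(deep=True).sum()
shallow_bytes = df.memory_usage(deep=False).sum()
print(deep_bytes, shallow_bytes)
```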

@mattwthompson
Member

Ref #506 (though I'm not caught up enough to understand this one)
