ENH: minimal support for dask.dataframe query planning (dask-expr) #285

Merged: 12 commits into geopandas:main on May 6, 2024

Conversation

@jorisvandenbossche (Member) commented Mar 11, 2024

Tackling #284, for now in a minimal way, i.e. keeping as much as possible of the current code based on "legacy" dask and then turning the result into an expression.

For example, for sjoin/clip/file IO, we manually build up the graph to create a collection using HighLevelGraph; for now, this graph is just turned into an expression with from_graph. Long term, we can turn each of those into proper expression classes.
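
For illustration, a minimal sketch of that pattern (hypothetical names throughout; the from_graph signature has varied across dask-expr versions, so the exact call below is an assumption rather than this PR's literal code):

import pandas as pd
import dask_expr as dx
from dask.highlevelgraph import HighLevelGraph

def make_part(i):  # hypothetical task producing one partition
    return pd.DataFrame({"x": [i, i + 1]})

name = "manual-example"
meta = pd.DataFrame({"x": pd.Series([], dtype="int64")})

# hand-built task graph, one task per output partition, mirroring how the
# legacy code constructs graphs for sjoin/clip/file IO
dsk = {(name, i): (make_part, 2 * i) for i in range(2)}
graph = HighLevelGraph.from_collections(name, dsk, dependencies=[])

# wrap the raw graph as an expression-backed collection; long term this
# would become a proper Expr subclass instead of a from_graph wrapper
# (the argument order here is an assumption about the API at the time)
result = dx.from_graph(graph, meta, (None, None, None), name)
print(result.compute())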

I am not yet fully sure how to best organize the code. For now, the new file expr.py essentially duplicates a lot of core.py. I organized the commits such that the first commit just does this copy, and the second commit shows the actual changes that were needed to core.py (in expr.py) to make it work with dask-expr -> d28521f

The problem is that I can't make those changes in core.py directly, because we want to keep supporting legacy dask for a while as well (our code actually also still uses it for intermediate results).
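
To make that dual support concrete, the toggle between the two parallel implementations can look roughly like this (a sketch only; the QUERY_PLANNING_ON flag shows up later in this thread, but the exact detection logic here is an assumption):

import dask

try:
    import dask_expr  # noqa: F401
    _HAS_DASK_EXPR = True
except ImportError:
    _HAS_DASK_EXPR = False

# respect dask's own opt-out switch for query planning
QUERY_PLANNING_ON = _HAS_DASK_EXPR and dask.config.get(
    "dataframe.query-planning", None
) is not False

# the package then exposes one of the two parallel implementations
if QUERY_PLANNING_ON:
    from dask_geopandas.expr import GeoDataFrame, GeoSeries
else:
    from dask_geopandas.core import GeoDataFrame, GeoSeries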

Some other remarks:

  • There is one actual test failure that I haven't yet been able to debug/fix (the split_out keyword in dissolve); this is xfailed for now.
  • The spatial_partitions handling is not yet very robust. For now this just takes the same approach as in core.py of attaching them to the collection, but longer term we should attach them to the underlying expression of the collection. The current version "works" (the tests are passing) as long as you only do an operation that needs them directly after an operation that sets them (see the sketch after this list).
  • All the element-wise geospatial methods that we add to the classes are still done using map_partitions like before. Longer term, we should turn those into custom expressions (which would then also allow better optimizations in the expression tree).
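
To illustrate the fragility of the spatial_partitions point above (a hypothetical session; whether the attribute survives the intermediate step is exactly what is version-dependent here):

import geopandas
import dask_geopandas
from shapely.geometry import Point

gdf = geopandas.GeoDataFrame(
    {"value": [1, 2, 3, 4]},
    geometry=[Point(x, x) for x in range(4)],
)
dgdf = dask_geopandas.from_geopandas(gdf, npartitions=2)

dgdf.calculate_spatial_partitions()  # stored on the collection object itself
print(dgdf.spatial_partitions)       # GeoSeries with one geometry per partition

# every operation returns a *new* collection, so the attribute does not
# automatically travel along; an operation that needs spatial_partitions
# must come directly after the one that set them
subset = dgdf[dgdf["value"] > 1]
print(subset.spatial_partitions)     # may be None here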

Closes #287

@martinfleis (Member)

@jorisvandenbossche can you ping me once you want a review of this? I'll do some reading on dask-expr in the meantime.

@jorisvandenbossche (Member, Author)

I think this should be ready enough for a review

@TomAugspurger (Contributor)

I spent a bit of time looking at the split-out failure, but didn't make much progress. I'll take another look later.

@ReptarK commented Apr 17, 2024

Hi, thanks for the efforts on supporting dask-expr.
Any updates on the integration?

@TomAugspurger (Contributor)

Some discussion at dask/dask-expr#1024.

@TomAugspurger (Contributor) commented May 4, 2024

@jorisvandenbossche I can't push to your branch, but the diff below fixes the test test_from_dask_dataframe_with_dask_geoseries. That turned up an error in from_dask_dataframe with geometry=dask_series, which is also fixed in that commit (and I added a test for it directly).

diff --git a/dask_geopandas/expr.py b/dask_geopandas/expr.py
index 99d71d2..136953d 100644
--- a/dask_geopandas/expr.py
+++ b/dask_geopandas/expr.py
@@ -200,6 +200,7 @@ class _Frame(dx.FrameBase, OperatorMethodMixin):
     @classmethod
     def _bind_elemwise_operator_method(cls, name, op, original, *args, **kwargs):
         """bind operator method like GeoSeries.distance to this class"""
+
         # name must be explicitly passed for div method whose name is truediv
         def meth(self, other, *args, **kwargs):
             meta = _emulate(op, self, other)
@@ -505,7 +506,6 @@ class _Frame(dx.FrameBase, OperatorMethodMixin):
         return distances
 
     def geohash(self, as_string=True, precision=12):
-
         """
         Calculate geohash based on the middle points of the geometry bounds
         for a given precision.
@@ -842,7 +842,7 @@ def from_dask_dataframe(df, geometry=None):
     # it via a keyword-argument due to https://github.com/dask/dask/issues/8308.
     # Instead, we assign the geometry column using regular dataframe operations,
     # then refer to that column by name in `map_partitions`.
-    if isinstance(geometry, dd.core.Series):
+    if isinstance(geometry, dx.Series):
         name = geometry.name if geometry.name is not None else "geometry"
         return df.assign(**{name: geometry}).map_partitions(
             geopandas.GeoDataFrame, geometry=name
diff --git a/dask_geopandas/tests/test_core.py b/dask_geopandas/tests/test_core.py
index b28a0c7..fbde582 100644
--- a/dask_geopandas/tests/test_core.py
+++ b/dask_geopandas/tests/test_core.py
@@ -390,27 +390,44 @@ def test_rename_geometry_error(geodf_points):
         dask_obj.rename_geometry("value1")
 
 
-# TODO to_dask_dataframe is now defined on the dask-expr collection, converting
-# to an old-style dd.core.DataFrame (so doing something different as we did here)
-@pytest.mark.xfail(
-    dask_geopandas.backends.QUERY_PLANNING_ON, reason="Need to update test for expr"
-)
 def test_from_dask_dataframe_with_dask_geoseries():
     df = pd.DataFrame({"x": [0, 1, 2, 3], "y": [1, 2, 3, 4]})
     dask_obj = dd.from_pandas(df, npartitions=2)
     dask_obj = dask_geopandas.from_dask_dataframe(
         dask_obj, geometry=dask_geopandas.points_from_xy(dask_obj, "x", "y")
     )
-    # Check that the geometry isn't concatenated and embedded a second time in
-    # the high-level graph. cf. https://github.com/geopandas/dask-geopandas/issues/197
-    k = next(k for k in dask_obj.dask.dependencies if k.startswith("GeoDataFrame"))
-    deps = dask_obj.dask.dependencies[k]
-    assert len(deps) == 1
+
+    if dask_geopandas.backends.QUERY_PLANNING_ON:
+        deps = dask_obj.expr.dependencies()
+        assert len(deps) == 1
+        dep = deps[0]
+        other = list(dask_obj.dask.values())[0][3]["geometry"].dependencies()[0]
+        assert dep is other
+
+    else:
+        # Check that the geometry isn't concatenated and embedded a second time in
+        # the high-level graph. cf. https://github.com/geopandas/dask-geopandas/issues/197
+        k = next(k for k in dask_obj.dask.dependencies if k.startswith("GeoDataFrame"))
+        deps = dask_obj.dask.dependencies[k]
+        assert len(deps) == 1
 
     expected = df.set_geometry(geopandas.points_from_xy(df["x"], df["y"]))
+    dask_obj.geometry.compute()
     assert_geoseries_equal(dask_obj.geometry.compute(), expected.geometry)
 
 
+def test_set_geometry_to_dask_series():
+    df = pd.DataFrame({"x": [0, 1, 2, 3], "y": [1, 2, 3, 4]})
+
+    dask_obj = dd.from_pandas(df, npartitions=2)
+    dask_obj = dask_geopandas.from_dask_dataframe(
+        dask_obj, geometry=dask_geopandas.points_from_xy(dask_obj, "x", "y")
+    )
+    expected = geopandas.GeoDataFrame(df, geometry=geopandas.points_from_xy(df.x, df.y))
+    result = dask_obj.geometry.compute()
+    assert_geoseries_equal(result, expected.geometry)
+
+
 def test_from_dask_dataframe_with_column_name():
     df = pd.DataFrame({"x": [0, 1, 2, 3], "y": [1, 2, 3, 4]})
     df["geoms"] = geopandas.points_from_xy(df["x"], df["y"])

@TomAugspurger (Contributor)

I opened a PR to your branch at jorisvandenbossche#1 that includes that diff and fixes for a couple more issues.

The final xfail with dask-expr is from split_out. I haven't had a chance to follow up on dask/dask-expr#1024, but it seems like it might be related to shuffling in dask-expr.

@jorisvandenbossche (Member, Author)

@TomAugspurger thanks for the fixes!
(And it's good you mention it here, because apparently you don't watch your own fork by default, so I didn't get any notification for your PR on my repo...)

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review May 6, 2024 11:32
def to_legacy_dataframe(self, optimize: bool = True, **optimize_kwargs):
def to_dask_dataframe(self):
@jorisvandenbossche (Member, Author)

@TomAugspurger I renamed this back to the original name, but now this is fine because dask-expr upstream renamed the to_dask_dataframe they added to to_legacy_dataframe, so there is no longer a name clash.

to_legacy_dataframe will convert to a legacy dask.dataframe collection, which is needed for certain code paths that still use the original implementation (e.g. dask still does this for parquet IO right now).

Our to_dask_dataframe, on the other hand, is meant to convert your dask-geopandas object to a dask object (regardless of whether that is a legacy collection or a new expression-backed one).

(And naming this implementation to_legacy_dataframe actually broke the parquet tests, because it then doesn't do what dask expects: it doesn't return a legacy collection, just the same kind of object but with partitions that are pd.DataFrames instead of GeoDataFrames.)
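
A short illustration of the distinction (a sketch; it assumes a dask-geopandas build from this PR running with dask-expr):

import geopandas
import dask_geopandas
from shapely.geometry import Point

gdf = geopandas.GeoDataFrame({"v": [1, 2]}, geometry=[Point(0, 0), Point(1, 1)])
dgdf = dask_geopandas.from_geopandas(gdf, npartitions=1)

# dask-geopandas method: same kind of collection as dgdf, but the
# partitions become plain pandas.DataFrames instead of GeoDataFrames
ddf = dgdf.to_dask_dataframe()

# dask-expr method: convert to a legacy (HighLevelGraph-based)
# dask.dataframe collection, which e.g. the parquet IO path still expects
legacy = dgdf.to_legacy_dataframe()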

@TomAugspurger (Contributor)

Ah, makes sense, thanks.

@jorisvandenbossche (Member, Author)

@martinfleis unless you have time in the short term, I would suggest we merge this already so we have a working main branch with the latest dask; further fixes/improvements can be done in follow-ups.

@martinfleis (Member)

@jorisvandenbossche Go ahead

@jorisvandenbossche jorisvandenbossche merged commit cc9076f into geopandas:main May 6, 2024
12 checks passed
Successfully merging this pull request may close: BUG: to_parquet() failing with dask=2024.4.1