Remove Dask and Spark DataFrame Support #2705

thehomebrewnerd · 2024-04-15T15:59:03Z

Removes support for creating EntitySets from Dask and Spark dataframes.
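
For callers who still hold data in Dask or Spark, one possible migration path is to materialize it as pandas before building an EntitySet. The sketch below is not part of this PR. `to_pandas_like` is a hypothetical helper that duck-types on the conversion methods those libraries expose (`compute()` on Dask collections, `to_pandas()` on pandas-on-Spark DataFrames, `toPandas()` on `pyspark.sql` DataFrames), and it assumes the data fits in memory on one machine.

```python
def to_pandas_like(frame):
    """Hypothetical helper: return an in-memory object for `frame`.

    Tries the conversion methods exposed by Dask (`compute`),
    pandas-on-Spark (`to_pandas`) and pyspark.sql (`toPandas`).
    Anything without those methods (e.g. a pandas DataFrame) is
    returned unchanged.
    """
    for method_name in ("compute", "to_pandas", "toPandas"):
        method = getattr(frame, method_name, None)
        if callable(method):
            # Materializes the full dataset in local memory.
            return method()
    return frame
```

The returned pandas DataFrame could then be passed to `EntitySet.add_dataframe` as usual. For data that does not fit in memory, this approach would not apply directly.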

codecov · 2024-04-15T16:14:33Z

Codecov Report

Attention: Patch coverage is 99.85141%, with 1 line in your changes missing coverage. Please review.

Project coverage is 99.38%. Comparing base (21d0bf0) to head (dcee907).

Files	Patch %	Lines
...s/computational_backends/feature_set_calculator.py	87.50%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #2705       +/-   ##
===========================================
+ Coverage   86.97%   99.38%   +12.41%     
===========================================
  Files         404      397        -7     
  Lines       24230    22292     -1938     
===========================================
+ Hits        21073    22156     +1083     
+ Misses       3157      136     -3021
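
The percentages in the diff above can be reproduced from the Hits and Lines rows. This is a small sketch, assuming the report truncates (rather than rounds) to two decimals; that display rule is inferred from the numbers, not confirmed by the report. The function names are hypothetical.

```python
def coverage_hundredths(hits, lines):
    """Coverage in hundredths of a percent, truncated (assumed display rule)."""
    return hits * 10000 // lines


def as_percent(hundredths):
    """Format hundredths of a percent, e.g. 9938 -> '99.38%'."""
    return f"{hundredths // 100}.{hundredths % 100:02d}%"


base = coverage_hundredths(21073, 24230)  # main
head = coverage_hundredths(22156, 22292)  # this PR

print(as_percent(base), as_percent(head), as_percent(head - base))
```

Hits plus Misses equals Lines on both sides (21073 + 3157 = 24230 and 22156 + 136 = 22292), so the table is internally consistent.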

tamargrey · 2024-05-01T15:02:13Z

A couple places that probably still need to be updated:

- __dask_tokenize__ still exists in entityset.py - probably can be removed
- The Parallel Computation by Partitioning Data section in the performance.ipynb file has some references to spark - Not sure if we want to just remove all references entirely or leave some subset of it (maybe just reference that an old version of featuretools can be used if you want to use it with Spark), but we should definitely remove the link to the Feature Engineering on Spark Notebook, which will no longer exist.

thehomebrewnerd · 2024-05-01T15:10:18Z

> The Parallel Computation by Partitioning Data section in the performance.ipynb file has some references to spark - Not sure if we want to just remove all references entirely or leave some subset of it (maybe just reference that an old version of featuretools can be used if you want to use it with Spark), but we should definitely remove the link to the Feature Engineering on Spark Notebook, which will no longer exist.

@tamargrey I think those references are still valid since they relate to manually partitioning data and not doing it via pyspark dataframes in an EntitySet. The Feature Engineering on Spark Notebook will still exist here: https://github.com/alteryx/predict-customer-churn/blob/main/churn/4.%20Feature%20Engineering%20on%20Spark.ipynb. The linked article in that section is gone though, so I'll remove that reference.
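
To illustrate what "manually partitioning data" can mean here, the following is a minimal sketch and not the code from performance.ipynb or the linked notebook. `partition_rows` is a hypothetical helper, and the choice of `zlib.crc32` as a stable hash is an assumption made for the example. It groups rows by a key so that all rows for one ID land in the same partition.

```python
import zlib
from collections import defaultdict


def partition_rows(rows, key, n_partitions):
    """Group row dicts into buckets using a stable hash of row[key].

    Rows sharing the same key value always fall into the same bucket,
    so each bucket can be processed independently.
    """
    partitions = defaultdict(list)
    for row in rows:
        # crc32 is deterministic across runs, unlike the built-in hash() for str.
        bucket = zlib.crc32(str(row[key]).encode("utf-8")) % n_partitions
        partitions[bucket].append(row)
    return dict(partitions)


rows = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 2, "amount": 5.5},
    {"customer_id": 1, "amount": 7.25},
    {"customer_id": 3, "amount": 1.0},
]
parts = partition_rows(rows, key="customer_id", n_partitions=2)
```

Each partition could then be loaded into pandas DataFrames and its own EntitySet, with the per-partition feature matrices combined afterward. How the partitions are scheduled (for example with Dask or Spark as a task runner) is a separate concern from the DataFrame support removed in this PR.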

featuretools/computational_backends/feature_set_calculator.py

featuretools/primitives/utils.py

featuretools/primitives/standard/aggregation/any_primitive.py

tamargrey

lgtm!

Nate Parsons added 3 commits April 15, 2024 10:00

mass deletion

b0db117

cleanup tests

b87bc42

fix

dcee907

thehomebrewnerd self-assigned this Apr 15, 2024

thehomebrewnerd marked this pull request as draft April 15, 2024 15:59

Nate Parsons added 17 commits April 15, 2024 12:52

try to fix unit tests

faa1507

fix ww main test ci yaml

e1852f2

more ci work

f82757e

fix fixture

b8e86ac

update release notes

2af8460

update miniconda hash

4e01885

more cleanup

8a9d000

docs cleanup

0f92825

update release notes

3eaa781

revert 3.12 change

55f9072

remove sql and update checker

c96b9a9

fix test

22cd27f

try install test fix

27ff467

remove dask references

b9e8922

remove sql from complete install due to psycopg2 issue

bc23a37

more install fixes

a57a239

lint

194ac3b

thehomebrewnerd marked this pull request as ready for review April 30, 2024 20:41

thehomebrewnerd requested review from sbadithe, tamargrey, jeff-hernandez, dvreed77 and jeremyliweishih April 30, 2024 20:52

fix complete install

ce00784

remove dask_tokenize

398cee6

Nate Parsons added 2 commits May 1, 2024 10:12

remove outdated link

de0b893

revert dask_tokenize change

5526327

tamargrey reviewed May 1, 2024

Nate Parsons added 3 commits May 1, 2024 14:29

remove isinstance checks

c70f489

remove agg_type

1bb560b

remove use of cache for install test

1065317

tamargrey approved these changes May 1, 2024

thehomebrewnerd merged commit 12ad75a into main May 1, 2024
25 checks passed

thehomebrewnerd deleted the remove-dask-spark-support branch May 1, 2024 20:53

thehomebrewnerd mentioned this pull request May 14, 2024

v1.31.0 #2728

Merged
