Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow users to create table and image parameters from local data #1077

Merged
merged 38 commits into from
Apr 1, 2023

Conversation

Fanjia-Yan
Copy link
Contributor

@Fanjia-Yan Fanjia-Yan commented Mar 11, 2023

Describe your changes and why you are making these changes

This PR adds feature to use local data by creating a parameter data = client.create_param("data", default="path/to/data.csv", as_type=ArtifactType.TABLE, format="csv"). The created param will behave normally as a parameter artifact.

The current size of local data is set to have a maximum of 32MB. The reason for limiting the local data is because our server has a default limit for parsing http request.

Testing

Add 3 Test suite of integration tests:

  1. Basic Functionality of using table local data: Publishing workflow using parameter from local csv, parquet, and json file
  2. Basic Functionality of using image local data: Publishing workflow using parameter from local image
  3. Error checking test: Check if invalid file path, mismatch artifact type can be detected

Existing Problem

Currently previewing and publishing workflow using local data parameter is extremely slow around (10 - 20 seconds) for a small(1 MB) of data. We have figured out that was due to serializing using json on Jupiter and json unmarshall the data on server.

Related issue number (if any)

https://linear.app/aqueducthq/issue/ENG-2584/allow-users-to-create-table-and-image-parameters-from-local-data

Loom demo (if any)

Checklist before requesting a review

  • I have created a descriptive PR title. The PR title should complete the sentence "This PR...".
  • I have performed a self-review of my code.
  • I have included a small demo of the changes. For the UI, this would be a screenshot or a Loom video.
  • If this is a new feature, I have added unit tests and integration tests.
  • I have run the integration tests locally and they are passing.
  • I have run the linter script locally (See python3 scripts/run_linters.py -h for usage).
  • All features on the UI continue to work correctly.
  • Added one of the following CI labels:
    • run_integration_test: Runs integration tests
    • skip_integration_test: Skips integration tests (Should be used when changes are ONLY documentation/UI)

@Fanjia-Yan Fanjia-Yan requested a review from kenxu95 March 11, 2023 01:22
@Fanjia-Yan Fanjia-Yan marked this pull request as ready for review March 11, 2023 01:44
@Fanjia-Yan Fanjia-Yan added the run_integration_test Triggers integration tests label Mar 13, 2023
Copy link
Contributor

@kenxu95 kenxu95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you create a follow-up task for making this more performant and solving the sizing bug? 1MB is a pretty restrictive bar.

@@ -38,6 +38,10 @@ runs:
shell: bash
run: pip3 install SQLAlchemy==1.4.46

- name: Install pyArrow for reading parquet table
shell: bash
run: pip3 install pyarrow==7.0.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we add pyarrow to our SDK requirements.txt then? I'd rather we make our dependencies very explicit cc: @cw75

integration_tests/sdk/aqueduct_tests/param_test.py Outdated Show resolved Hide resolved
integration_tests/sdk/aqueduct_tests/param_test.py Outdated Show resolved Hide resolved
integration_tests/sdk/aqueduct_tests/param_test.py Outdated Show resolved Hide resolved
sdk/aqueduct/utils/local_data.py Outdated Show resolved Hide resolved
sdk/aqueduct/utils/local_data.py Outdated Show resolved Hide resolved
sdk/aqueduct/utils/local_data.py Outdated Show resolved Hide resolved
sdk/aqueduct/utils/serialization.py Outdated Show resolved Hide resolved
sdk/aqueduct/artifacts/create.py Outdated Show resolved Hide resolved
@kenxu95
Copy link
Contributor

kenxu95 commented Mar 21, 2023

@Fanjia-Yan just confirming that this is still being worked on and doesn't require a review from me yet

@Fanjia-Yan
Copy link
Contributor Author

@Fanjia-Yan just confirming that this is still being worked on and doesn't require a review from me yet

There is one thing I would like to try before requesting a review. The File limit is currently set to 1MB which I am getting more and more disturbed by the value and I hope to get this thing straight before merging into main. I observe that the when we are parsing http request from server, large json file(>32MB), the parsing is faulty but no error is returned. This might be some issue of max file size we can set or work around in golang http request library. I am still investigating that.
Screen Shot 2023-03-21 at 3 20 00 PM

@Fanjia-Yan Fanjia-Yan requested a review from kenxu95 March 23, 2023 16:26
Copy link
Contributor

@kenxu95 kenxu95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good! I do think this means we have add pyarrow as a requirement then to our SDK right? Want to get Chengang's confirmation on this, but could you add this requirement to our SDK? We're going to need to do it when we transition to parquet from json anyways.

sdk/aqueduct/backend/api_client.py Show resolved Hide resolved
artifact_metadata.from_local_data for artifact_metadata in list(dag.artifacts.values())
):
raise InvalidUserActionException(
"Local Data after serialization are too large. Aqueduct uses json serialization. The maximum size of workflow with local data is %sMB, the current size is %sMB."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "Local data after serialization is too large".

@kenxu95 kenxu95 self-requested a review March 23, 2023 23:38
@Fanjia-Yan Fanjia-Yan requested a review from kenxu95 March 27, 2023 22:02
Copy link
Contributor

@kenxu95 kenxu95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's very close, but not quite production quality yet. The big issue I have is with the overly aggressive restriction on pyarrow version. Let's always set this to the latest if we can get away with it.

@@ -7,6 +7,7 @@ IPython<=8.9.0
croniter<=1.3.8
pydantic>=1.9.0,<=1.10.4
plotly<=5.13.0
pyarrow <=7.0.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait this is way too restrictive for users. The latest version of pyarrow is v11.0.0 right? Can we set that as the upper bound? (After you test it against v11, ofc :)

description:
A description of what this parameter represents.

use_local:
whether this parameter uses local data source or not.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please capitalize.

default: Any,
description: str = "",
use_local: bool = False,
as_type: Optional[ArtifactType] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just call this as?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think as is a python keyword, overriding it with a variable might be a bit tricky. Will something like _as or astype suffice?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, let's just keep it as_type then

sdk/aqueduct/client.py Show resolved Hide resolved
@@ -188,19 +196,39 @@ def create_param(self, name: str, default: Any, description: str = "") -> BaseAr
The name to assign this parameter.
default:
The default value to give this parameter, if no value is provided.
Every parameter must have a default.
Every parameter must have a default. If decided to use local data,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grammer: "if we decide"

use_local:
whether this parameter uses local data source or not.
as_type:
The expected type of the local data. Currently Local Data has support for ArtifactType.TABLE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we replace the second sentence with "Only supported types are ArtifactType.TABLE and ArtifactType.IMAGE

sdk/aqueduct/client.py Show resolved Hide resolved
return create_param_artifact(
self._dag,
name,
default,
description,
explicitly_named=True,
is_local_data=is_local_data,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just use use_local here?

and ArtifactType.IMAGE.
format:
If local data type is ArtifactType.TABLE, the user has to specify the table format.
We currently support "JSON", "CSV", and "Parquet".
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep the captialization consistent between the three examples: eg: "json", "csv", "parquet"

@Fanjia-Yan Fanjia-Yan requested a review from kenxu95 March 30, 2023 19:47
@Fanjia-Yan
Copy link
Contributor Author

Fixed all the grammar and error message suggestion. The newest pyarrow 11.0.0 behaves correctly so I have set the requirement to that :)

Copy link
Contributor

@kenxu95 kenxu95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! lgtm just one super nit :)

if use_local:
if not isinstance(default, str):
raise InvalidUserArgumentException(
"the default value must be a path to local data."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit: Capitalize "the"

@Fanjia-Yan Fanjia-Yan merged commit d9d782e into main Apr 1, 2023
@vsreekanti vsreekanti deleted the eng-2584-allow-users-to-create-table-and-image branch April 4, 2023 18:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run_integration_test Triggers integration tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants