-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow users to create table and image parameters from local data #1077
Allow users to create table and image parameters from local data #1077
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you create a follow-up task for making this more performant and solving the sizing bug? 1MB is a pretty restrictive bar.
@@ -38,6 +38,10 @@ runs: | |||
shell: bash | |||
run: pip3 install SQLAlchemy==1.4.46 | |||
|
|||
- name: Install pyArrow for reading parquet table | |||
shell: bash | |||
run: pip3 install pyarrow==7.0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we add pyarrow
to our SDK requirements.txt then? I'd rather we make our dependencies very explicit cc: @cw75
@Fanjia-Yan just confirming that this is still being worked on and doesn't require a review from me yet |
There is one thing I would like to try before requesting a review. The File limit is currently set to 1MB which I am getting more and more disturbed by the value and I hope to get this thing straight before merging into main. I observe that the when we are parsing http request from server, large json file(>32MB), the parsing is faulty but no error is returned. This might be some issue of max file size we can set or work around in golang http request library. I am still investigating that. |
…ub.com:aqueducthq/aqueduct into eng-2584-allow-users-to-create-table-and-image
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good! I do think this means we have add pyarrow as a requirement then to our SDK right? Want to get Chengang's confirmation on this, but could you add this requirement to our SDK? We're going to need to do it when we transition to parquet from json anyways.
sdk/aqueduct/backend/api_client.py
Outdated
artifact_metadata.from_local_data for artifact_metadata in list(dag.artifacts.values()) | ||
): | ||
raise InvalidUserActionException( | ||
"Local Data after serialization are too large. Aqueduct uses json serialization. The maximum size of workflow with local data is %sMB, the current size is %sMB." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "Local data after serialization is too large".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's very close, but not quite production quality yet. The big issue I have is with the overly aggressive restriction on pyarrow
version. Let's always set this to the latest if we can get away with it.
sdk/requirements/python-3-9.txt
Outdated
@@ -7,6 +7,7 @@ IPython<=8.9.0 | |||
croniter<=1.3.8 | |||
pydantic>=1.9.0,<=1.10.4 | |||
plotly<=5.13.0 | |||
pyarrow <=7.0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait this is way too restrictive for users. The latest version of pyarrow is v11.0.0 right? Can we set that as the upper bound? (After you test it against v11, ofc :)
sdk/aqueduct/client.py
Outdated
description: | ||
A description of what this parameter represents. | ||
|
||
use_local: | ||
whether this parameter uses local data source or not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please capitalize.
default: Any, | ||
description: str = "", | ||
use_local: bool = False, | ||
as_type: Optional[ArtifactType] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just call this as
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think as
is a python keyword, overriding it with a variable might be a bit tricky. Will something like _as
or astype
suffice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, let's just keep it as_type
then
sdk/aqueduct/client.py
Outdated
@@ -188,19 +196,39 @@ def create_param(self, name: str, default: Any, description: str = "") -> BaseAr | |||
The name to assign this parameter. | |||
default: | |||
The default value to give this parameter, if no value is provided. | |||
Every parameter must have a default. | |||
Every parameter must have a default. If decided to use local data, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
grammer: "if we decide"
sdk/aqueduct/client.py
Outdated
use_local: | ||
whether this parameter uses local data source or not. | ||
as_type: | ||
The expected type of the local data. Currently Local Data has support for ArtifactType.TABLE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we replace the second sentence with "Only supported types are ArtifactType.TABLE and ArtifactType.IMAGE
sdk/aqueduct/client.py
Outdated
return create_param_artifact( | ||
self._dag, | ||
name, | ||
default, | ||
description, | ||
explicitly_named=True, | ||
is_local_data=is_local_data, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we just use use_local
here?
sdk/aqueduct/client.py
Outdated
and ArtifactType.IMAGE. | ||
format: | ||
If local data type is ArtifactType.TABLE, the user has to specify the table format. | ||
We currently support "JSON", "CSV", and "Parquet". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the captialization consistent between the three examples: eg: "json", "csv", "parquet"
Fixed all the grammar and error message suggestion. The newest pyarrow 11.0.0 behaves correctly so I have set the requirement to that :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool! lgtm just one super nit :)
sdk/aqueduct/client.py
Outdated
if use_local: | ||
if not isinstance(default, str): | ||
raise InvalidUserArgumentException( | ||
"the default value must be a path to local data." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super nit: Capitalize "the"
…llow-users-to-create-table-and-image
Describe your changes and why you are making these changes
This PR adds feature to use local data by creating a parameter
data = client.create_param("data", default="path/to/data.csv", as_type=ArtifactType.TABLE, format="csv")
. The created param will behave normally as a parameter artifact.The current size of local data is set to have a maximum of 32MB. The reason for limiting the local data is because our server has a default limit for parsing http request.
Testing
Add 3 Test suite of integration tests:
Existing Problem
Currently previewing and publishing workflow using local data parameter is extremely slow around (10 - 20 seconds) for a small(1 MB) of data. We have figured out that was due to serializing using json on Jupiter and json unmarshall the data on server.
Related issue number (if any)
https://linear.app/aqueducthq/issue/ENG-2584/allow-users-to-create-table-and-image-parameters-from-local-data
Loom demo (if any)
Checklist before requesting a review
python3 scripts/run_linters.py -h
for usage).run_integration_test
: Runs integration testsskip_integration_test
: Skips integration tests (Should be used when changes are ONLY documentation/UI)