feat: add proposal for data support #650

VoVAllen · 2022-07-22T05:33:48Z

Data function support

Summary

Provide data mounting support for envd.

Goals

Design a unified, declarative interface and underlying architecture that provides datasets to the development environment in a scalable way.

Non-goals:

Support Git-like version control for data

Common Scenarios

Possible sources

Local files
Object storage (AWS S3)
NFS-like system (AWS EFS, AWS FSx for OpenZFS)
Block storage (Ceph)
HDFS
Lustre
API endpoint (HTTP path)
SQL results
Other distributed file systems (Alluxio, JuiceFS)
Python SDK

Possible form

Images
Text
Embedding binaries
CSV

Access Pattern

The access pattern of most datasets is write once, read many times, often concurrently. Therefore

Possible versions/tags

Version by number, V1, V2, V3
Version by scale, sample dataset vs full dataset
Version by time, query range of user activity (7d, 30d) from feature store

We could define a new standard for versioning data, similar to semver.
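As a rough illustration of what a semver-like scheme could look like, the sketch below parses strings such as "0.0.1-sample" into numeric parts plus an optional tag. The `DatasetVersion` type, the `parse_version` helper, and the accepted pattern are assumptions made for this example, not part of envd.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern: MAJOR.MINOR.PATCH with an optional "-tag" suffix.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?$")


class DatasetVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    tag: Optional[str]  # e.g. "sample" or "30d"; None when no tag is given


def parse_version(text: str) -> DatasetVersion:
    """Parse a semver-like dataset version string (assumed format)."""
    match = _VERSION_RE.match(text)
    if match is None:
        raise ValueError(f"not a semver-like dataset version: {text!r}")
    major, minor, patch, tag = match.groups()
    return DatasetVersion(int(major), int(minor), int(patch), tag)
```

Under this assumed format, the scale-based and time-based variants listed above would be carried in the tag, while a plain "V1" would be rejected.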

Proposal

Each version of a dataset is immutable. Because the data is immutable, we can cache it and replicate it easily, which increases read throughput in multiple ways.
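A minimal sketch of why immutability helps caching: if a (name, version) pair never changes, a cache entry never needs to be invalidated. `ImmutableDatasetCache` and its `fetch` callback are hypothetical names used only for illustration; the real design may cache files on disk rather than bytes in memory.

```python
from typing import Callable, Dict, Tuple


class ImmutableDatasetCache:
    """Cache keyed by (name, version); entries are never invalidated."""

    def __init__(self, fetch: Callable[[str, str], bytes]) -> None:
        self._fetch = fetch  # hypothetical loader that reads from the source
        self._entries: Dict[Tuple[str, str], bytes] = {}
        self.fetch_count = 0  # how many times the source was actually read

    def get(self, name: str, version: str) -> bytes:
        key = (name, version)
        if key not in self._entries:
            # Safe to store forever: the content of a version cannot change.
            self.fetch_count += 1
            self._entries[key] = self._fetch(name, version)
        return self._entries[key]
```

Repeated reads of the same version hit the cache, so only the first read touches the source. The same reasoning is what would make replicas interchangeable.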

Usage

Users need to create the dataset beforehand, then declare the mount in the build.envd file.

envd data add -f mnist.yaml

Users can create multiple datasets with the same name, but they must have different versions.
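The uniqueness rule above could be enforced along these lines. This is only a sketch: `DatasetRegistry` is a hypothetical in-memory stand-in for whatever `envd data add` would store, and the source list is kept as plain strings.

```python
from typing import Dict, Iterable, List, Tuple


class DatasetRegistry:
    """In-memory registry that treats (name, version) as the unique key."""

    def __init__(self) -> None:
        self._datasets: Dict[Tuple[str, str], Tuple[str, ...]] = {}

    def add(self, name: str, version: str, sources: Iterable[str]) -> None:
        key = (name, version)
        if key in self._datasets:
            # Versions are immutable, so re-adding one is rejected.
            raise ValueError(f"dataset {name!r} version {version!r} already exists")
        self._datasets[key] = tuple(sources)

    def versions(self, name: str) -> List[str]:
        """Return all registered versions of a dataset, sorted as strings."""
        return sorted(v for (n, v) in self._datasets if n == name)
```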

mnist.yaml

ApiVersion: V1alpha
name: mnist
version: "0.0.1-sample"
sources:
    - type: local # The first source is treated as the primary source; the others are replicas of it
      path: ~/.torch/mnist
    - type: s3
      path: xxx
validation:
    checksum:
        - name: MD5
          value: xxxx
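The checksum entry could be verified roughly as follows. The proposal does not say how a checksum is computed for a directory source, so this sketch assumes the dataset is a single file; `md5_of_file` and `validate` are hypothetical helpers.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate(path: Path, expected_md5: str) -> bool:
    """Compare the file's digest with the value declared in the YAML."""
    return md5_of_file(path) == expected_md5.lower()
```

Because each version is immutable, a check like this would only need to run once per replica, for example when a source is first added.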

build.envd

def data():
    return [d.mount("mnist", target="./data")] # Users can mount multiple datasets

Signed-off-by: Jinjing.Zhou <allenzhou@tensorchord.ai>

kemingy · 2022-07-22T08:58:04Z

docs/proposals/data.md

+```
+envd data add -f mnist.yaml
+```


I wonder if we can make it as an envd target so we can get rid of yaml?

terrytangyuan · 2022-07-29T02:07:12Z

docs/proposals/data.md

+
+### Access Pattern
+
+The access pattern of most dataset is write once, read multiple times, and concurrently. Therefore 


Missing some additional context after "therefore"?

terrytangyuan · 2022-07-29T02:08:07Z

docs/proposals/data.md

+The access pattern of most dataset is write once, read multiple times, and concurrently. Therefore 
+
+### Possible versions/tags
+- Version by number, V1, V2, V3


How about semantic versioning?

terrytangyuan · 2022-07-29T02:09:06Z

docs/proposals/data.md

+
+mnist.yaml
+```yaml=
+ApiVersion: V1alpha


What is this version for? How is this different from the version below?

terrytangyuan · 2022-07-29T02:09:55Z

docs/proposals/data.md

+version: "0.0.1-sample"
+sources:


What if there are multiple versions for different sources?

terrytangyuan · 2022-07-29T02:14:38Z

docs/proposals/data.md

+
+## Common Scenarios
+
+### Possible sources


Is there any implementation plan for this? Are you aware of any existing solutions that could support multiple sources? This might be helpful as a reference for the range of sources: https://kubernetes.io/docs/concepts/storage/volumes/#volume-types

terrytangyuan · 2022-07-29T02:16:02Z

docs/proposals/data.md

+```
+def data():
+    return [d.mount("mnist", target="./data")] # User can specify mount multiple datasets


What are the different purposes of the YAML above and this envd syntax?

VoVAllen added 2 commits July 22, 2022 13:23

add rfc

d827bf0

add

b0c4677

Signed-off-by: Jinjing.Zhou <allenzhou@tensorchord.ai>

kemingy reviewed Jul 22, 2022

terrytangyuan reviewed Jul 29, 2022

gaocegege self-requested a review as a code owner August 28, 2022 11:08

VoVAllen mentioned this pull request Dec 22, 2022

feat(lang): Support data and code integration in envd-server runner #530

Open
