feat: add proposal for data support #650

Open
wants to merge 2 commits into
base: main

VoVAllen
Member

Data function support

Summary

Provide data-mounting support for envd.

Goals

Design a unified, declarative interface and an underlying architecture that provide datasets to the development environment in a scalable way.

Non-goals

  • Support Git-like version control for data

Common Scenarios

Possible sources

  • Local files
  • Object storage (AWS S3)
  • NFS-like system (AWS EFS, AWS FSx for OpenZFS)
  • Block storage (Ceph)
  • HDFS
  • Lustre
  • API endpoint (http path)
  • SQL results
  • Other distributed file systems (Alluxio, JuiceFS)
  • Python SDK

Possible forms

  • Images
  • Text
  • Embedding binaries
  • CSV

Access Pattern

Most datasets are written once and then read many times, often concurrently. Therefore

Possible versions/tags

  • Version by number, V1, V2, V3
  • Version by scale, sample dataset vs full dataset
  • Version by time, query range of user activity (7d, 30d) from feature store

We could define a standard for versioning data, similar to semver.
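
As one reading of the semver idea, here is a minimal Python sketch. It assumes dataset versions follow `MAJOR.MINOR.PATCH` with an optional `-label` suffix, like the `0.0.1-sample` value used in this proposal. `DatasetVersion` and `parse_version` are hypothetical names for illustration, not part of envd.

```python
import re
from typing import NamedTuple, Optional

# Assumed format: MAJOR.MINOR.PATCH with an optional "-label" suffix.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?$")


class DatasetVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    label: Optional[str]  # e.g. "sample"; None when no label is given


def parse_version(text: str) -> DatasetVersion:
    """Parse a semver-like dataset version string (hypothetical helper)."""
    match = _VERSION_RE.match(text)
    if match is None:
        raise ValueError(f"not a semver-like version: {text!r}")
    major, minor, patch, label = match.groups()
    return DatasetVersion(int(major), int(minor), int(patch), label)
```

Under this assumption the label could carry the "scale" or "time range" dimension (for example `-sample` or `-30d`), while the numeric part tracks revisions of the data itself.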

Proposal

Each version of a dataset is immutable. Because the data is assumed to be immutable, we can cache it and replicate it easily, which increases read throughput in several ways.
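
To illustrate why immutability simplifies caching, here is a minimal Python sketch of a local cache keyed by name and version. The function name `cached_dataset_path` and the directory layout are assumptions made for this example; the proposal does not define a cache layout.

```python
import shutil
from pathlib import Path


def cached_dataset_path(name: str, version: str, source: Path, cache_root: Path) -> Path:
    """Return a local cached copy of an immutable dataset version.

    Because a (name, version) pair is assumed never to change, an existing
    cache entry is reused without re-checking the source.
    """
    entry = cache_root / name / version
    if entry.exists():
        return entry  # cache hit: immutability means no invalidation is needed
    entry.parent.mkdir(parents=True, exist_ok=True)
    # Copy into a staging directory first so a failed copy never leaves a
    # partial entry that a later call would mistake for a complete one.
    staging = entry.with_name(entry.name + ".partial")
    if staging.exists():
        shutil.rmtree(staging)
    shutil.copytree(source, staging)
    staging.rename(entry)
    return entry
```

The same reasoning would apply to replicas: any copy of a given version can serve reads, since no copy can become stale.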

Usage

Users need to create the dataset beforehand, then declare the mount in the build.envd file.

envd data add -f mnist.yaml

Users can create multiple datasets with the same name, but each must have a different version.
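
The uniqueness rule could be enforced roughly as follows. This is a sketch only: `DatasetRegistry` is a hypothetical in-memory stand-in for whatever store `envd data add` would write to, and the manifest is assumed to be the parsed YAML as a dictionary.

```python
from typing import Dict, List, Tuple


class DatasetRegistry:
    """Hypothetical in-memory registry keyed by (name, version)."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, str], dict] = {}

    def add(self, manifest: dict) -> None:
        key = (manifest["name"], manifest["version"])
        if key in self._entries:
            # Versions are immutable, so re-adding the same pair is an error.
            raise ValueError(f"dataset {key[0]!r} version {key[1]!r} already exists")
        self._entries[key] = manifest

    def versions(self, name: str) -> List[str]:
        """List the registered versions of a dataset, sorted as plain strings."""
        return sorted(v for (n, v) in self._entries if n == name)
```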

mnist.yaml

ApiVersion: V1alpha
name: mnist
version: "0.0.1-sample"
sources:
    - type: local # The first source is treated as the primary; the others are replicas of it
      path: ~/.torch/mnist
    - type: s3
      path: xxx
validation:
    checksum:
        - name: MD5
          value: xxxx
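
The `validation.checksum` entry could be checked along these lines. This is a hedged sketch: the proposal does not say how a directory such as `~/.torch/mnist` is hashed, so the convention below (MD5 over each file's relative path and contents, in sorted order) is an assumption, and `directory_md5` and `validate` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def directory_md5(root: Path) -> str:
    """MD5 over every file under root, visited in sorted path order.

    Assumption: mixing in each file's relative path and contents is one
    possible convention; the proposal does not specify one.
    """
    digest = hashlib.md5()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large files are not loaded into memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


def validate(root: Path, expected_md5: str) -> bool:
    """Compare the computed digest with the manifest's checksum value."""
    return directory_md5(root) == expected_md5.lower()
```

Since each version is immutable, such a check would only need to run once per source or replica, for example when the dataset is added or first cached.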

build.envd

def data():
    return [d.mount("mnist", target="./data")]  # Users can mount multiple datasets

Signed-off-by: Jinjing.Zhou <allenzhou@tensorchord.ai>
Comment on lines +56 to +58
```
envd data add -f mnist.yaml
```
Member

I wonder if we can make it an envd target so we can get rid of the YAML?


### Access Pattern

The access pattern of most dataset is write once, read multiple times, and concurrently. Therefore
Member

Missing some additional context after "therefore"?

The access pattern of most dataset is write once, read multiple times, and concurrently. Therefore

### Possible versions/tags
- Version by number, V1, V2, V3
Member

How about semantic versioning?


mnist.yaml
```yaml=
ApiVersion: V1alpha
Member

What is this version for? How is this different from the version below?

Comment on lines +66 to +67
version: "0.0.1-sample"
sources:
Member

What if there are multiple versions for different sources?


## Common Scenarios

### Possible sources
Member

Is there any implementation plan for this? Are you aware of any existing solutions that could support multiple sources? This might be helpful as a reference for the range of sources: https://kubernetes.io/docs/concepts/storage/volumes/#volume-types

Comment on lines +79 to +81
```
def data():
    return [d.mount("mnist", target="./data")] # User can specify mount multiple datasets
Member

What are the different purposes of the YAML above and this envd syntax?

3 participants