Support datasets from huggingface #2351

Open
SingL3 opened this issue Jul 6, 2023 · 3 comments
Labels
enhancement New (engineering) enhancements, such as features or API changes.

Comments

@SingL3

SingL3 commented Jul 6, 2023

🚀 Feature Request

This package can currently load data from a local path, HTTP, or S3. Is there a plan to support Hugging Face datasets as well?

Motivation

Some datasets are distributed in formats other than JSONL, such as Parquet.

[Optional] Implementation

Additional context

@SingL3 SingL3 added the enhancement New (engineering) enhancements, such as features or API changes. label Jul 6, 2023
@dakinggg
Contributor

The composer Trainer accepts an arbitrary train_dataloader, so I'm not sure what you mean here. Could you please clarify?

@SingL3
Author

SingL3 commented Jul 11, 2023

As you can see here:

dataset = load_dataset('json', data_files=destination_path, split='train', streaming=False)

This means the data file has to be available locally, and formats other than JSON, such as Parquet, are not supported.
One benefit of the datasets library is that it can download the data automatically when there is no local file.
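
To make the request concrete, here is a minimal sketch of format-aware loading. It assumes the Hugging Face datasets library is installed and relies on its packaged "json", "parquet", and "csv" builders. The helper names infer_builder and load_any are hypothetical and not part of this package.

```python
from pathlib import Path

# Map file suffixes to the names of the packaged builders in `datasets`.
_BUILDERS = {".json": "json", ".jsonl": "json", ".parquet": "parquet", ".csv": "csv"}


def infer_builder(path: str) -> str:
    """Return the builder name for a local data file (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix not in _BUILDERS:
        raise ValueError(f"unsupported data file extension: {suffix!r}")
    return _BUILDERS[suffix]


def load_any(path_or_repo: str, split: str = "train"):
    """Load a local file by extension, else treat the argument as a Hub dataset id."""
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    if Path(path_or_repo).suffix.lower() in _BUILDERS:
        return load_dataset(
            infer_builder(path_or_repo),
            data_files=path_or_repo,
            split=split,
            streaming=False,
        )
    # Assumption: anything without a known data-file extension is a Hub repository id.
    return load_dataset(path_or_repo, split=split, streaming=False)
```

With something like this, load_any(destination_path) would keep the current behavior for JSONL files, while a Parquet file or a Hub dataset id would go through the same call.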

@dakinggg
Contributor

Ah, got it! The code actually does automatically download from object store (get_file(dataset_uri, destination_path, overwrite=True)), and the ICL classes expect the data to be in a particular format that probably isn't super common on the HF hub, but we can look more into supporting that!
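
Until that is supported, one possible workaround is to convert rows from another source into a JSONL file that the existing JSON loading path can read. The sketch below uses only the standard library. The function name rows_to_jsonl and the field names in the usage note are hypothetical, since the format the ICL classes expect is not spelled out in this thread.

```python
import json
from typing import Dict, Iterable, Mapping, Optional


def rows_to_jsonl(
    rows: Iterable[Mapping],
    out_path: str,
    rename: Optional[Dict[str, str]] = None,
) -> int:
    """Write rows as JSON Lines, renaming keys per `rename`; return the row count."""
    rename = rename or {}
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for row in rows:
            # Keys missing from `rename` are kept unchanged.
            record = {rename.get(key, key): value for key, value in row.items()}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

For example, rows_to_jsonl(rows, "converted.jsonl", rename={"question": "context"}) would write one JSON object per line with the "question" key renamed to "context". The resulting path could then be passed as the dataset URI.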
