Fix bug in dataset loading #284

Merged
merged 2 commits into from
Sep 27, 2023

ethanhs
Contributor

@ethanhs ethanhs commented Jul 17, 2023

This fixes a bug when loading datasets. `d.data_files` is a list, so it cannot be passed directly to `hf_hub_download`, which takes only a single filename.
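
To illustrate the failure mode described above, here is a minimal sketch, not the actual patch. It assumes `hf_hub_download` from `huggingface_hub`, which accepts one `filename` per call. The helper name `download_data_files` and its injectable `download_fn` parameter are hypothetical, added only so the loop can be exercised without network access.

```python
from typing import Callable, List, Optional


def download_data_files(
    repo_id: str,
    data_files: List[str],
    download_fn: Optional[Callable[..., str]] = None,
) -> List[str]:
    """Fetch each file separately, since the downloader takes one filename per call."""
    if download_fn is None:
        # Imported lazily so the sketch can be tested without the package installed.
        from huggingface_hub import hf_hub_download

        download_fn = hf_hub_download

    local_paths = []
    for filename in data_files:
        # repo_type="dataset" points the hub client at a dataset repository.
        local_paths.append(
            download_fn(repo_id=repo_id, filename=filename, repo_type="dataset")
        )
    return local_paths
```

Passing the whole list as `filename` in a single call is the situation this PR describes; looping keeps each call to one file.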

@NanoCode012
Collaborator

Hey, in my case, I use `data_files` as a string. If you would like to add this, could you check the type of the variable (`str` or `list`) and handle each case appropriately?
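
The type check requested here could look roughly like the sketch below. It is an illustration under stated assumptions, not the code that was merged: the function name `normalize_data_files` is hypothetical, and treating a missing value (`None`) as an empty list is an assumption made for the example.

```python
from typing import List, Sequence, Union


def normalize_data_files(data_files: Union[str, Sequence[str], None]) -> List[str]:
    """Return data_files as a list of filenames, whichever form the config used."""
    if data_files is None:
        # Assumption for this sketch: no data_files means nothing to fetch.
        return []
    if isinstance(data_files, str):
        # Check str first: a string is itself a sequence of characters.
        return [data_files]
    if isinstance(data_files, (list, tuple)):
        return list(data_files)
    raise TypeError(
        f"data_files must be a str or a list, got {type(data_files).__name__}"
    )
```

With this shape, a config that sets `data_files` to one filename and a config that lists several both reach the download step as a list.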

@winglian
Collaborator

@ethanhs there is a syntax error in this changeset. Can you make sure the pre-commit hooks get run when committing please?

black....................................................................Failed
- hook id: black
- exit code: 123

error: cannot format src/axolotl/utils/data.py: Cannot parse: 146:16:                 ds = load_dataset(

Oh no! 💥 💔 💥
37 files left unchanged, 1 file failed to reformat.

isort....................................................................Passed
flake8...................................................................Failed
- hook id: flake8
- exit code: 1

src/axolotl/utils/data.py:146:17: E999 SyntaxError: invalid syntax

pylint...................................................................Failed
- hook id: pylint
- exit code: 2

************* Module src.axolotl.utils.data
src/axolotl/utils/data.py:146:17: E0001: Parsing failed: 'invalid syntax (<unknown>, line 146)' (syntax-error)
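
All three failing hooks point at line 146 of the same file, which suggests a single parse error rather than three separate problems. One quick way to confirm that locally is to parse the source with Python's standard `ast` module, as in the sketch below. The helper name `find_syntax_error` is hypothetical and is not part of pre-commit or this repository.

```python
import ast
from typing import Optional


def find_syntax_error(source: str, filename: str = "<unknown>") -> Optional[str]:
    """Return 'file:line:col: message' if the source does not parse, else None."""
    try:
        ast.parse(source, filename=filename)
    except SyntaxError as err:
        return f"{filename}:{err.lineno}:{err.offset}: {err.msg}"
    return None
```

Reading a file and passing its text to this function reports the first location the parser rejects, which is usually enough to find the offending line before rerunning the hooks.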

@ethanhs
Contributor Author

ethanhs commented Jul 21, 2023

Fixed/ran pre-commit, and checked the type of data_files as requested!

@NanoCode012
Collaborator

Hey, could you please clarify the original bug? Could you provide an example and the error you saw?

accelerate launch scripts/finetune.py examples/openllama-3b/lora.yml --prepare_ds_only --debug
datasets:
  - path: teknium/GPT4-LLM-Cleaned
    type: alpaca
    data_files: 
      - alpaca_gpt4_data_unfiltered.json
      - unnatural_instructions_unfiltered.json

I can run this fine on the current master.

@winglian
Collaborator

winglian commented Aug 6, 2023

@NanoCode012 I think the code he is modifying is fallback code. Typically dataset loading from hf hub happens in the conditional block before this one. I honestly don't know when this else clause might get called, but it's probably worth having for that edge case.

@NanoCode012
Collaborator


Sounds good!

@winglian winglian merged commit 8fe0e63 into axolotl-ai-cloud:main Sep 27, 2023
3 checks passed
@ethanhs ethanhs deleted the patch-1 branch October 2, 2023 12:39
mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023
* Fix bug in dataset loading

This fixes a bug when loading datasets. `d.data_files` is a list, so it cannot be directly passed to `hf_hub_download`

* Check type of data_files, and load accordingly