Fix bug in dataset loading #284
Conversation
This fixes a bug when loading datasets. `d.data_files` is a list, so it cannot be directly passed to `hf_hub_download`.
Hey, in my case I use `data_files` as a string. If you would like to add this, could you check the type of the variable (whether `str` or `list`) and handle it appropriately?
@ethanhs there is a syntax error in this changeset. Can you make sure the pre-commit hooks get run when committing please?
Fixed/ran pre-commit, and checked the type of `data_files`.
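The type check requested above can be sketched roughly like this. This is a minimal illustration, not the actual axolotl code: the helper names are hypothetical, and only the core idea from the PR is assumed, namely that `hf_hub_download` accepts a single `filename` per call, so a list of `data_files` must be fetched one file at a time.

```python
def normalize_data_files(data_files):
    """Accept either a single filename (str) or a list of filenames."""
    if isinstance(data_files, str):
        return [data_files]
    if isinstance(data_files, (list, tuple)):
        return list(data_files)
    raise TypeError(f"unsupported data_files type: {type(data_files)!r}")

def download_data_files(repo_id, data_files):
    # hf_hub_download takes a single `filename` per call, so iterate
    # over the normalized list rather than passing the list directly
    from huggingface_hub import hf_hub_download
    return [
        hf_hub_download(repo_id=repo_id, filename=f, repo_type="dataset")
        for f in normalize_data_files(data_files)
    ]
```

With this shape, both the string case reported above and the list case from the original bug go through the same code path.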
Hey, could you please clarify the original bug? Could you provide an example and the error? I ran:

```
accelerate launch scripts/finetune.py examples/openllama-3b/lora.yml --prepare_ds_only --debug
```

with this config:

```yaml
datasets:
  - path: teknium/GPT4-LLM-Cleaned
    type: alpaca
    data_files:
      - alpaca_gpt4_data_unfiltered.json
      - unnatural_instructions_unfiltered.json
```

I can run this fine on the current master.
@NanoCode012 I think the code he is modifying is fallback code. Typically, dataset loading from the HF Hub happens in the conditional block before this one. I honestly don't know when this else clause might get called, but it's probably worth keeping for that edge case.
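The control flow described above (normal Hub loading, with the fixed branch as a fallback) could be sketched roughly as follows. All names here are hypothetical, and the loader functions are passed in as parameters so the flow can be exercised without network access; the real axolotl code differs.

```python
def load_with_fallback(path, data_files, load_from_hub, download_one, load_local):
    """Try the usual Hub loading path first; on failure, fall back to
    fetching the raw data files one by one (the branch this PR fixes).

    load_from_hub, download_one, and load_local stand in for
    datasets.load_dataset and huggingface_hub.hf_hub_download.
    """
    try:
        # Typical path: the datasets library resolves the repo itself
        return load_from_hub(path, data_files)
    except Exception:
        # Fallback: the downloader takes one filename per call, so a
        # list of data_files must be fetched in a loop, not passed whole
        files = [data_files] if isinstance(data_files, str) else list(data_files)
        local_paths = [download_one(path, f) for f in files]
        return load_local(local_paths)
```

This also shows why the bug only surfaces rarely: the fallback branch runs only when the primary loading path raises.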
Sounds good!
* Fix bug in dataset loading

  This fixes a bug when loading datasets. `d.data_files` is a list, so it cannot be directly passed to `hf_hub_download`.

* Check type of data_files, and load accordingly
This fixes a bug when loading datasets. `d.data_files` is a list, so it cannot be directly passed to `hf_hub_download`, which only takes a single filename.