
improve: Enhance code readability of prompt_tokenizers.py #707

Merged: 4 commits, Oct 19, 2023

Conversation

seungduk-yanolja (Contributor)

Description

Motivation and Context

As more tokenizer variations are added, the module's complexity keeps increasing, and it is hard to understand what each code block does.
We should clean up the duplicated code and improve readability.

How has this been tested?

I don't have CUDA devices, so I cannot run the tests locally. It would be great if someone could run them for me.

Changes

  1. Removed Duplication:

    • Removed the redundant _tokenize method, which was present twice.
    • Unified the logic for handling user and assistant roles during tokenization.
  2. Simplified Empty Text Handling:

    • Streamlined the tokenization of empty text.
  3. Code Clarifications:

    • Simplified the checks for adding EOS tokens and stripping BOS tokens, making them more readable.
    • Unified on the constant IGNORE_INDEX, replacing scattered literals such as -100.
  4. Handling Unexpected Cases:

    • Improved error handling for unexpected roles.
  5. Clean-up:

    • Removed unnecessary imports.
    • Simplified several parts of the code to improve readability.
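The IGNORE_INDEX unification in item 3 can be sketched as follows. This is a simplified illustration, not the PR's actual code; the helper name and values are hypothetical:

```python
# IGNORE_INDEX is the label value that PyTorch's CrossEntropyLoss skips by
# default (its ignore_index parameter defaults to -100), which is why the
# literal -100 tends to get scattered around tokenization code.
IGNORE_INDEX = -100

def build_labels(input_ids, prompt_len):
    """Mask the prompt portion so loss is computed only on the response tokens."""
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]

labels = build_labels([1, 2, 3, 4, 5], prompt_len=2)
# labels == [-100, -100, 3, 4, 5]
```

Using a single named constant makes it obvious that every masked position means "ignore in the loss", rather than leaving readers to guess what a bare -100 does.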


Types of changes

Readability improvement

for i in range(0, len(val), self.sequence_len):
res[key].append(val[i : i + self.sequence_len])
for i in range(0, len(val), self.max_length):
res[key].append(val[i : i + self.max_length])
@winglian (Collaborator), Oct 9, 2023

There is a difference between sequence_len and max_length in the completion prompts; they aren't the same or interchangeable in this case. The completion prompter tokenizes at a longer length and then splits the text into chunks of the correct length.
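The splitting step described here can be sketched in isolation. This is a stand-alone simplification of the loop quoted above, not the project's actual code:

```python
def split_into_chunks(ids, sequence_len):
    """Split an over-long token list into training-sized chunks of sequence_len."""
    return [ids[i : i + sequence_len] for i in range(0, len(ids), sequence_len)]

# Ten tokens split at sequence_len=4; the final chunk may be shorter.
chunks = split_into_chunks(list(range(10)), sequence_len=4)
# chunks == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

This is why the two loops in the diff cannot simply be unified: one iterates in steps of sequence_len, the other in steps of max_length, and the completion strategy relies on the longer limit being applied first.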

Collaborator

Perhaps we should also document this (as comments?), since it would be confusing to others reading the code.

@seungduk-yanolja (Contributor, Author), Oct 10, 2023

Could you please provide more details? I'm still having trouble understanding. I will leave a comment here if needed. If the difference exists only in the completion tokenizing strategy, shouldn't it live here rather than in the parent class?

@seungduk-yanolja (Contributor, Author)

Reverted the change that unified sequence_len and max_length. PTAL.

@seungduk-yanolja (Contributor, Author)

@winglian do you still have any concerns, or can you review this PR?

result = self.tokenizer(
prompt,
truncation=True,
max_length=self.sequence_len,
@seungduk-yanolja (Contributor, Author)

Please note that this call used sequence_len before, but the merged _tokenize method will use max_length; the two values were different.
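To illustrate why the choice of limit matters, here is a hedged sketch with hypothetical values; `tokenize_and_chunk` and the plain slice stand in for the tokenizer's `truncation=True, max_length=...` behavior, and are not the project's actual code:

```python
def tokenize_and_chunk(ids, max_length, sequence_len):
    """Truncate at max_length first, then split into sequence_len-sized chunks."""
    ids = ids[:max_length]  # stand-in for tokenizer(..., truncation=True, max_length=max_length)
    return [ids[i : i + sequence_len] for i in range(0, len(ids), sequence_len)]

# Truncating at a longer max_length keeps more tokens available for chunking;
# truncating at sequence_len up front would have discarded everything past it.
chunks = tokenize_and_chunk(list(range(10)), max_length=8, sequence_len=3)
# chunks == [[0, 1, 2], [3, 4, 5], [6, 7]]
```

If the truncation limit were silently switched from one value to the other, everything past the shorter limit would be dropped before chunking, which is the behavioral difference the comment flags.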

@winglian (Collaborator)

@seungduk-yanolja looks good! thank you! I'll work on resolving the merge conflicts tomorrow.

@winglian (Collaborator)

@seungduk-yanolja I rebased your PR over current main, so will get this merged once the tests pass. thanks again!

@seungduk-yanolja (Contributor, Author)

I had to sort these things out first, sorry. And thank you 🙏

@winglian winglian merged commit 3a99495 into axolotl-ai-cloud:main Oct 19, 2023
4 checks passed
mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023

3 participants