Llm generate upgrade #1034
Open
plaguss wants to merge 31 commits into develop from llm-generate-upgrade
Conversation

@plaguss (Contributor) commented Oct 11, 2024

Description

This PR updates the output of llm.generate to make it more feature-rich.

Previously we only returned the generated text:

GenerateOutput = List[Union[str, None]]

Now it is updated to also return statistics related to the generation:

from typing import Any, Dict, List, TypedDict, Union

LLMOutput = List[Union[str, None]]

class TokenCount(TypedDict):
    input_tokens: List[int]
    output_tokens: List[int]

LLMStatistics = Union[TokenCount, Dict[str, Any]]
"""Initially LLMStatistics will only contain the token counts, but more variables
can be added once they are defined for every LLM.
"""

class GenerateOutput(TypedDict):
    generations: LLMOutput
    statistics: LLMStatistics

This PR only includes input_tokens and output_tokens as statistics, but we can add as many as needed in the future.
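
For illustration, the output for a single input with num_generations=2 could look like the following (a hypothetical example; the exact values depend on the model and backend):

output: GenerateOutput = {
    "generations": ["Hello Magpie", "Hello there!"],
    "statistics": {
        "input_tokens": [12, 12],
        "output_tokens": [12, 4],
    },
}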

This information is moved to distilabel_metadata in the following way, to avoid collisions between statistics of different steps:

{
    "generations": ["Hello Magpie"],
    f"statistics_{step_name}": {
        "input_tokens": [12],
        "output_tokens": [12],
    },
}
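
As a hypothetical usage sketch (the step name text_generation_0 and the row layout are assumed for illustration), the statistics of a given step can then be read back from a row's metadata, e.g. to count the tokens it consumed:

# Hypothetical row produced by a step named "text_generation_0":
row = {
    "generation": "Hello Magpie",
    "distilabel_metadata": {
        "statistics_text_generation_0": {
            "input_tokens": [12],
            "output_tokens": [12],
        },
    },
}

stats = row["distilabel_metadata"]["statistics_text_generation_0"]
total_tokens = sum(stats["input_tokens"]) + sum(stats["output_tokens"])
print(total_tokens)  # 24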

NOTE:
Most Tasks reuse the same Task.process method to process the generations, so nothing else has to be done; but for tasks like Magpie, where the process method is overridden, this has to be updated (see the sketch below).
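
As a rough, hypothetical sketch (not the actual Magpie implementation; helper names and batch handling are simplified assumptions), such an override now has to unpack both keys of GenerateOutput instead of treating the result as a plain list of strings:

def process(self, inputs):
    # Simplified, hypothetical override of Task.process.
    formatted_inputs = [self.format_input(row) for row in inputs]
    outputs = self.llm.generate(inputs=formatted_inputs, num_generations=1)
    for row, output in zip(inputs, outputs):
        # `output` is now a GenerateOutput dict, not a bare list of strings.
        row["generation"] = output["generations"][0]
        metadata = row.get("distilabel_metadata", {})
        metadata[f"statistics_{self.name}"] = output["statistics"]
        row["distilabel_metadata"] = metadata
    yield inputs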

Closes #738

@plaguss plaguss added this to the 1.5.0 milestone Oct 11, 2024
@plaguss plaguss self-assigned this Oct 11, 2024

Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-1034/


codspeed-hq bot commented Oct 11, 2024

CodSpeed Performance Report

Merging #1034 will not alter performance

Comparing llm-generate-upgrade (1bc28ba) with develop (1f75593)

Summary

✅ 1 untouched benchmark

@plaguss plaguss added the enhancement New feature or request label Oct 14, 2024
@plaguss plaguss marked this pull request as ready for review October 25, 2024 07:17
Labels
enhancement New feature or request

Successfully merging this pull request may close these issues:
[FEATURE] Update LLM.generate interface to allow returning arbitrary/extra stuff related to the generation