Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add initial design for uniform processors + align model #31197

Merged
merged 49 commits into from
Jun 13, 2024

Conversation

molbap
Copy link
Contributor

@molbap molbap commented Jun 3, 2024

What does this PR do?

Adds a uniform signature for processors. This PR adds the initial design + one model for the larger #30511.

Usage

As before, kwargs that are passed to processors at __call__ time take priority. However, per-modality processors can be instantiated with their own kwargs, and if they are not overriden at call time, they will serve as defaults.

Type hinting of kwargs is preserved if they are passed as structured dictionary entries
image

It also works with kwargs passed without nesting:
image

Merging of kwargs and handling priority order is done in processing_utils through a dedicated method.
The order of operations is as follows:

  1. kwargs passed as before have highest priority to preserve BC.
high_priority_kwargs = {"crop_size" = (224, 224), "padding" = "max_length"}
processor(..., **high_priority_kwargs)
  1. kwargs passed as modality-specific kwargs have second priority. This is the recommended API.
processor(..., text_kwargs={"padding": "max_length"}, images_kwargs={"crop_size": (224, 224)}})
  1. kwargs passed during instantiation of a modality processor have fourth priority.
tokenizer = tokenizer_class(..., {"padding": "max_length"})
image_processor = image_processor_class(...)
processor(tokenizer, image_processor) # will pass max_length unless overriden by kwargs at call
  1. defaults kwargs specified at processor level have lowest priority.
class MyProcessingKwargs(ProcessingKwargs, CommonKwargs, TextKwargs, ImagesKwargs, total=False):
    _defaults = {
        "text_kwargs": {
            "padding": "max_length",
            "max_length": 64,
        },
    }

Missing:

  • Move tests of kwargs checking to a common util

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@molbap
Copy link
Contributor Author

molbap commented Jun 3, 2024

cc @amyeroberts for a review with a narrower scope than the parent PR 😁

@molbap
Copy link
Contributor Author

molbap commented Jun 4, 2024

cc @amyeroberts, sorry to divide my pings 😅 PR is big and I wanted to split it up, this one should be merge-able and be the basis for the rest, I'll rebase afterwards (and @qubvel welcome if you want to take a look!)

It includes the kwargs merging just mentioned in the other PR, moved them to processing common!

Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Same comment as for the other PR - it would be good to move out the kwarg prep logic to out of the config and tests to make sure we can properly control tokenizer kwargs with tokenizer.init_kwargs and the input kwargs.

cc @qubvel

src/transformers/tokenization_utils_base.py Outdated Show resolved Hide resolved
@amyeroberts
Copy link
Collaborator

@molbap OK, sounds good! Let's wait for @qubvel's eagle eye. Once tests are added, ping me again for a final review 🤗

Copy link
Member

@qubvel qubvel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to see _merge_kwargs as a separate method, this is exactly what came to my mind while reviewing the first PR 🙂

src/transformers/processing_utils.py Outdated Show resolved Hide resolved
src/transformers/processing_utils.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a lot cleaner!

}

```
"""

_defaults = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_defaults = {
padding: "max_length"
max_lenght: 64

should work no? Or does it not update the default for type-hints?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it works for sure, this was to have a structured dict for defaults. Can change :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, now I remember, it actually can't work like that since Typed Dicts don't support default values, they are made to hold the layout. They can have any attributes however, but it won't pass a value as default -like a dataclass would, but in this case we'd lose typing-, hence the manual operation

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok got it thanks! Let's maybe comment about this!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a comment for future code inspectors? I'm assuming here isn't the best place (we don't want it for all models) but didn't find a corresponding one elsewhere on a quick skim

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On that: there's doc in processing_utils.ProcessingKwargs, I added a comment nudging users to check there for documentation!

@molbap molbap added the run-slow label Jun 7, 2024
@molbap
Copy link
Contributor Author

molbap commented Jun 7, 2024

I have updated the PR description to be more self-contained: by using typing_extensions the imports for Unpack work, but type hinting doesn't for versions of python < 3.11. Functionality, though, is preserved for older versions I tested.

Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

Just a few small comments

}

```
"""

_defaults = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a comment for future code inspectors? I'm assuming here isn't the best place (we don't want it for all models) but didn't find a corresponding one elsewhere on a quick skim

tests/test_processing_common.py Show resolved Hide resolved
src/transformers/processing_utils.py Outdated Show resolved Hide resolved
src/transformers/processing_utils.py Outdated Show resolved Hide resolved
@molbap
Copy link
Contributor Author

molbap commented Jun 13, 2024

@amyeroberts FYI Kept digging for the kwargs merging logic, and found an edge case that was giving unreliable results in the tokenizer. Refactored and doubled number of tests to avoid further trickery (including an edge case found earlier by @qubvel ), logic should be easier to read now. Nothing else changed, and tests should pass reliably.

@molbap molbap merged commit c624d5b into huggingface:main Jun 13, 2024
24 checks passed
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Jun 14, 2024
…31197)

* add initial design for uniform processors + align model

* fix mutable default 👀

* add configuration test

* handle structured kwargs w defaults + add test

* protect torch-specific test

* fix style

* fix

* fix assertEqual

* move kwargs merging to processing common

* rework kwargs for type hinting

* just get Unpack from extensions

* run-slow[align]

* handle kwargs passed as nested dict

* add from_pretrained test for nested kwargs handling

* [run-slow]align

* update documentation + imports

* update audio inputs

* protect audio types, silly

* try removing imports

* make things simpler

* simplerer

* move out kwargs test to common mixin

* [run-slow]align

* skip tests for old processors

* [run-slow]align, clip

* !$#@!! protect imports, darn it

* [run-slow]align, clip

* [run-slow]align, clip

* update doc

* improve documentation for default values

* add model_max_length testing

This parameter depends on tokenizers received.

* Raise if kwargs are specified in two places

* fix

* expand VideoInput

* fix

* fix style

* remove defaults values

* add comment to indicate documentation on adding kwargs

* protect imports

* [run-slow]align

* fix

* remove set() that breaks ordering

* test more

* removed unused func

* [run-slow]align
@molbap molbap mentioned this pull request Jun 14, 2024
itazap pushed a commit that referenced this pull request Jun 17, 2024
* add initial design for uniform processors + align model

* fix mutable default 👀

* add configuration test

* handle structured kwargs w defaults + add test

* protect torch-specific test

* fix style

* fix

* fix assertEqual

* move kwargs merging to processing common

* rework kwargs for type hinting

* just get Unpack from extensions

* run-slow[align]

* handle kwargs passed as nested dict

* add from_pretrained test for nested kwargs handling

* [run-slow]align

* update documentation + imports

* update audio inputs

* protect audio types, silly

* try removing imports

* make things simpler

* simplerer

* move out kwargs test to common mixin

* [run-slow]align

* skip tests for old processors

* [run-slow]align, clip

* !$#@!! protect imports, darn it

* [run-slow]align, clip

* [run-slow]align, clip

* update doc

* improve documentation for default values

* add model_max_length testing

This parameter depends on tokenizers received.

* Raise if kwargs are specified in two places

* fix

* expand VideoInput

* fix

* fix style

* remove defaults values

* add comment to indicate documentation on adding kwargs

* protect imports

* [run-slow]align

* fix

* remove set() that breaks ordering

* test more

* removed unused func

* [run-slow]align
itazap pushed a commit that referenced this pull request Jun 17, 2024
* add initial design for uniform processors + align model

* fix mutable default 👀

* add configuration test

* handle structured kwargs w defaults + add test

* protect torch-specific test

* fix style

* fix

* fix assertEqual

* move kwargs merging to processing common

* rework kwargs for type hinting

* just get Unpack from extensions

* run-slow[align]

* handle kwargs passed as nested dict

* add from_pretrained test for nested kwargs handling

* [run-slow]align

* update documentation + imports

* update audio inputs

* protect audio types, silly

* try removing imports

* make things simpler

* simplerer

* move out kwargs test to common mixin

* [run-slow]align

* skip tests for old processors

* [run-slow]align, clip

* !$#@!! protect imports, darn it

* [run-slow]align, clip

* [run-slow]align, clip

* update doc

* improve documentation for default values

* add model_max_length testing

This parameter depends on tokenizers received.

* Raise if kwargs are specified in two places

* fix

* expand VideoInput

* fix

* fix style

* remove defaults values

* add comment to indicate documentation on adding kwargs

* protect imports

* [run-slow]align

* fix

* remove set() that breaks ordering

* test more

* removed unused func

* [run-slow]align
itazap pushed a commit that referenced this pull request Jun 17, 2024
* add initial design for uniform processors + align model

* fix mutable default 👀

* add configuration test

* handle structured kwargs w defaults + add test

* protect torch-specific test

* fix style

* fix

* fix assertEqual

* move kwargs merging to processing common

* rework kwargs for type hinting

* just get Unpack from extensions

* run-slow[align]

* handle kwargs passed as nested dict

* add from_pretrained test for nested kwargs handling

* [run-slow]align

* update documentation + imports

* update audio inputs

* protect audio types, silly

* try removing imports

* make things simpler

* simplerer

* move out kwargs test to common mixin

* [run-slow]align

* skip tests for old processors

* [run-slow]align, clip

* !$#@!! protect imports, darn it

* [run-slow]align, clip

* [run-slow]align, clip

* update doc

* improve documentation for default values

* add model_max_length testing

This parameter depends on tokenizers received.

* Raise if kwargs are specified in two places

* fix

* expand VideoInput

* fix

* fix style

* remove defaults values

* add comment to indicate documentation on adding kwargs

* protect imports

* [run-slow]align

* fix

* remove set() that breaks ordering

* test more

* removed unused func

* [run-slow]align
itazap pushed a commit that referenced this pull request Jun 18, 2024
* add initial design for uniform processors + align model

* fix mutable default 👀

* add configuration test

* handle structured kwargs w defaults + add test

* protect torch-specific test

* fix style

* fix

* fix assertEqual

* move kwargs merging to processing common

* rework kwargs for type hinting

* just get Unpack from extensions

* run-slow[align]

* handle kwargs passed as nested dict

* add from_pretrained test for nested kwargs handling

* [run-slow]align

* update documentation + imports

* update audio inputs

* protect audio types, silly

* try removing imports

* make things simpler

* simplerer

* move out kwargs test to common mixin

* [run-slow]align

* skip tests for old processors

* [run-slow]align, clip

* !$#@!! protect imports, darn it

* [run-slow]align, clip

* [run-slow]align, clip

* update doc

* improve documentation for default values

* add model_max_length testing

This parameter depends on tokenizers received.

* Raise if kwargs are specified in two places

* fix

* expand VideoInput

* fix

* fix style

* remove defaults values

* add comment to indicate documentation on adding kwargs

* protect imports

* [run-slow]align

* fix

* remove set() that breaks ordering

* test more

* removed unused func

* [run-slow]align
itazap pushed a commit that referenced this pull request Jun 20, 2024
* add initial design for uniform processors + align model

* fix mutable default 👀

* add configuration test

* handle structured kwargs w defaults + add test

* protect torch-specific test

* fix style

* fix

* fix assertEqual

* move kwargs merging to processing common

* rework kwargs for type hinting

* just get Unpack from extensions

* run-slow[align]

* handle kwargs passed as nested dict

* add from_pretrained test for nested kwargs handling

* [run-slow]align

* update documentation + imports

* update audio inputs

* protect audio types, silly

* try removing imports

* make things simpler

* simplerer

* move out kwargs test to common mixin

* [run-slow]align

* skip tests for old processors

* [run-slow]align, clip

* !$#@!! protect imports, darn it

* [run-slow]align, clip

* [run-slow]align, clip

* update doc

* improve documentation for default values

* add model_max_length testing

This parameter depends on tokenizers received.

* Raise if kwargs are specified in two places

* fix

* expand VideoInput

* fix

* fix style

* remove defaults values

* add comment to indicate documentation on adding kwargs

* protect imports

* [run-slow]align

* fix

* remove set() that breaks ordering

* test more

* removed unused func

* [run-slow]align
@molbap molbap mentioned this pull request Jul 17, 2024
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants