Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🚨 fix(SigLip): remove spurious exclusion of first vision output token #30952

Merged
merged 1 commit into from
Jul 11, 2024

Conversation

transmissions11
Copy link
Contributor

@transmissions11 transmissions11 commented May 22, 2024

Looks like the first token is spuriously excluded from the average pooling because this code was copy & pasted from modeling_clip, which does prepend a [CLS] token. However SigLip has no such special token, so it would seem we are needlessly excluding a token from being included?

p.s. why not use the pooled_output from the model's map head vs averaging the last hidden states?

@NielsRogge
Copy link
Contributor

There's another PR which suggested to use the pooled output rather than averaging the patch tokens: #30373.

The reason I went with the latter was because of this paper (follow-up paper by the ViT authors which showed that it works better than the CLS token). However averaging patch tokens might work better for ViTs pre-trained using (self-)supervised image classification. For contrastive models like CLIP and SigLIP, perhaps it makes more sense to use the pooled output. I'm happy to change it, although we would need to make it backwards compatible

@amyeroberts amyeroberts changed the title fix(SigLip): remove spurious exclusion of first vision output token 🚨 fix(SigLip): remove spurious exclusion of first vision output token Jun 13, 2024
Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this!

Only thing we need to do is run the model slow tests. Pushing the following should trigger the necessary CI run:
git commit --allow-empty -m "[run-slow] siglip"

Copy link

github-actions bot commented Jul 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@amyeroberts
Copy link
Collaborator

@transmissions11 Any update on this?

@NielsRogge
Copy link
Contributor

As this is a breaking change, wondering whether we should just go for the pooled output anyway? It would however break any existing fine-tunes so we might need to add a flag for backwards compatibility.

@amyeroberts
Copy link
Collaborator

amyeroberts commented Jul 11, 2024

@NielsRogge It might be breaking but the previous logic was wrong -- there's no reason to exclude the first token. We don't keep backwards compatibility for bugs, and so I'm not sure it makes sense to have backwards compatibility here for the token.

If we did add the pooled logic, then yes we should have a flag. I'd do this in a separate PR

@amyeroberts amyeroberts merged commit 5258501 into huggingface:main Jul 11, 2024
18 checks passed
amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Jul 19, 2024
…huggingface#30952)

fix(SigLip): remove spurious exclusion of first vision output token in classifier
MHRDYN7 pushed a commit to MHRDYN7/transformers that referenced this pull request Jul 23, 2024
…huggingface#30952)

fix(SigLip): remove spurious exclusion of first vision output token in classifier
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Jul 24, 2024
…huggingface#30952)

fix(SigLip): remove spurious exclusion of first vision output token in classifier
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants