The influence of t (temperature) in the E5 Model paper #1588

Open
daegonYu opened this issue Jun 27, 2024 · 4 comments

@daegonYu

Describe
Model I am using (UniLM, MiniLM, LayoutLM ...): E5

Hello. I am a student studying sentence similarity.

“Paper: Text Embeddings by Weakly-Supervised Contrastive Pre-training”
While reading this paper, a question arose about the temperature, which is set to t = 0.01. In the SimCSE paper the temperature is set to 0.05 for the sentence similarity task (STS), and other papers use 0.02, but this paper uses 0.01. Could you tell us what effects can be achieved by lowering the temperature?

@intfloat
Contributor

Hi @daegonYu ,

This is a hyperparameter for tuning. Empirically, we observe that a lower temperature leads to better performance but may cause training instability under float16 precision for large models. A lower temperature allows the logits to vary in a wider range and thus has more flexibility.
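
For readers who want to see where t enters, here is a minimal sketch of an InfoNCE-style contrastive loss with in-batch negatives (an illustration under assumed tensor shapes, not the actual E5 training code): the cosine similarities are divided by the temperature before the softmax, so a smaller t stretches the logits over a wider range.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor,
                  temperature: float = 0.01) -> torch.Tensor:
    """InfoNCE with in-batch negatives; a sketch, not the E5 training code.

    query_emb, passage_emb: (batch_size, dim) tensors where row i of
    passage_emb is the positive passage for row i of query_emb.
    """
    # L2-normalize so the dot products below are cosine similarities in [-1, 1].
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)

    # Cosine-similarity matrix scaled by 1 / temperature.
    # With t = 0.01 the logits span roughly [-100, 100];
    # with t = 0.05 (SimCSE's choice) roughly [-20, 20].
    logits = q @ p.T / temperature

    # Row i's positive sits on the diagonal; the rest of the row
    # are in-batch negatives.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```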

@daegonYu
Author

daegonYu commented Jul 1, 2024

“A lower temperature allows the logits to vary in a wider range and thus has more flexibility.” I interpret this as meaning that the embeddings can more easily learn diverse representations. But the model card at https://huggingface.co/intfloat/multilingual-e5-base says:

3. Why does the cosine similarity scores distribute around 0.7 to 1.0?

This is a known and expected behavior as we use a low temperature 0.01 for InfoNCE contrastive loss.

For text embedding tasks like text retrieval or semantic similarity, what matters is the relative order of the scores instead of the absolute values, so this should not be an issue.

If the embeddings can be expressed over a wider range, I would expect the cosine similarities to be distributed over a wide range as well, yet they are concentrated between 0.7 and 1.0. This seems contradictory, which makes it difficult for me to understand. Simply put, I wonder why lowering the temperature allows learning a wider range of logits.

@intfloat
Contributor

intfloat commented Jul 1, 2024

The logits are calculated as cosine_similarity / t. Therefore, the logits will fall in [-100, 100] with t = 0.01 and [-50, 50] with t = 0.02, etc.

However, this does not mean the learned cosine similarity will be in a wider range. On the contrary, the cosine similarity tends to concentrate as the temperature becomes lower.
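
A quick numeric check of that scaling (an illustration with made-up similarity values, not numbers from the paper): dividing the same similarities by t = 0.01 instead of t = 0.05 stretches the gaps five-fold, so the softmax already saturates even when the raw cosine similarities sit in a narrow band like 0.7 to 1.0.

```python
import torch
import torch.nn.functional as F

# Cosine similarities for one query: the positive first, then two in-batch
# negatives. The values are made up but sit inside the narrow 0.7-1.0 band
# mentioned in the model card.
cos_sims = torch.tensor([0.95, 0.80, 0.75])

for t in (0.05, 0.01):
    probs = F.softmax(cos_sims / t, dim=0)
    spread = (cos_sims.max() - cos_sims.min()).item() / t
    print(f"t = {t}: logit spread = {spread:.0f}, P(positive) = {probs[0]:.4f}")

# Approximate output:
# t = 0.05: logit spread = 4, P(positive) = 0.9362
# t = 0.01: logit spread = 20, P(positive) = 1.0000
```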

@daegonYu
Author

daegonYu commented Jul 4, 2024

All right, I understand what you said, but why does the cosine similarity tend to concentrate as the temperature becomes lower? Can you explain why this happens?
