Refactor/fix of sampled softmax to add logQ correction #1051
Goals ⚽
Sampled softmax is a popular technique to deal with multi-class classification with a very large number of classes.
It has been used to train retrieval models with contrastive learning by using a subset of candidate negative items during training, instead of using all other items as negatives.
This PR adds the important logQ sampling correction proposed for sampled softmax, so that it better approximates the full softmax.
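For context, the logQ correction subtracts the log of each candidate's sampling probability from its logit before the softmax, compensating for the fact that popular items are drawn as negatives more often. A minimal NumPy sketch of the idea (the function name and values here are illustrative, not the Merlin Models API):

```python
import numpy as np

def logq_corrected_logits(logits: np.ndarray, sampling_probs: np.ndarray) -> np.ndarray:
    """Subtract log(Q) from each sampled candidate's logit.

    Without this correction, frequently sampled (popular) items are
    over-penalized as negatives; subtracting log(q_i) compensates, so the
    sampled softmax better approximates the full softmax.
    """
    return logits - np.log(sampling_probs)

# Toy batch: 1 positive followed by 3 sampled negatives
logits = np.array([2.0, 1.5, 1.0, 0.5])
probs = np.array([0.01, 0.4, 0.2, 0.1])  # sampling probability of each item
corrected = logq_corrected_logits(logits, probs)
```

Note how the most popular negative (sampling prob 0.4) receives the smallest upward correction, so it is penalized less than under the uncorrected logits.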
Implementation Details 🚧
In general, sampled softmax was previously implemented by using `ContrastiveOutput` as the model output and `PopularityBasedSamplerV2` as a sampler (like the example below), which uses the log-uniform distribution to approximate the long-tail item frequency (assuming that categorical ids are sorted decreasingly by frequency). However, the logQ correction was not implemented as proposed for sampled softmax, which fixes the over-penalization of popular items, as they are sampled more often as negatives. The logQ correction can be applied with `PopularityLogitsCorrection`, but it requires providing the item frequency distribution. As our sampled softmax implementation (`PopularityBasedSamplerV2`) uses an approximated log-uniform distribution for sampling, I implemented the corresponding sampling probability of the positive and negative items to allow for the sampling correction (with or without replacement). You can use sampled softmax as in the following example.

A new `logq_sampling_correction=False` arg was added to `ContrastiveOutput`, which should be set to `True` when the sampler supports returning the items' sampling probs (like `PopularityBasedSamplerV2` does). If it is enabled, `PopularityLogitsCorrection` doesn't need to be used.

Summary of main API changes
- `Candidate` to optionally store the sampling prob. of each item
- `CandidateSampler` abstract class to have a `with_sampling_probs(items)` method that allows returning the probability of the provided items according to the sampler distribution
- `PopularityBasedSamplerV2` to support both unique and non-unique samples, and to compute its distribution in the constructor (`get_sampling_distribution()`), as it is based on a log-uniform approximation of the items' long-tail frequency distribution
- `ContrastiveOutput` to set the sampling probs for both positive and negative candidates. Created the arg `logq_sampling_correction`, which, if enabled, subtracts the logQ sampled probs based on `Candidate.sampling_prob`
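To illustrate the distribution such a sampler exposes, the sketch below computes the standard log-uniform (Zipf-like) sampling probabilities over ids sorted by decreasing frequency, plus a lookup helper in the spirit of `with_sampling_probs(items)`. The function names are hypothetical, mirroring but not reproducing the Merlin Models API:

```python
import numpy as np

def log_uniform_distribution(num_items: int) -> np.ndarray:
    """Standard log-uniform sampling distribution:
    P(item_id = i) = (log(i + 2) - log(i + 1)) / log(num_items + 1),
    a common approximation of a long-tail item frequency distribution
    when ids are sorted by decreasing popularity."""
    ids = np.arange(num_items)
    return (np.log(ids + 2) - np.log(ids + 1)) / np.log(num_items + 1)

dist = log_uniform_distribution(1000)

def sampling_probs(item_ids: np.ndarray) -> np.ndarray:
    """Look up the sampling probability of the given item ids
    (a toy analog of `with_sampling_probs(items)`)."""
    return dist[item_ids]
```

The log terms telescope, so the probabilities sum exactly to 1, and lower (more popular) ids get higher sampling probability, which is what the logQ correction later compensates for.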
Testing Details 🔍
Added `test_contrastive_output_with_sampled_softmax` to test and showcase how sampled softmax can be used in Merlin Models.
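A framework-free sketch of the kind of invariant such a test can check: with the correction flag enabled, each logit shifts by exactly -log(q), and with it disabled the scores pass through unchanged. The function below is a toy stand-in, not the actual `ContrastiveOutput` or its test:

```python
import numpy as np

def contrastive_logits(scores: np.ndarray,
                       sampling_probs: np.ndarray,
                       logq_sampling_correction: bool = False) -> np.ndarray:
    """Toy stand-in for an output layer with an optional logQ correction."""
    if logq_sampling_correction:
        return scores - np.log(sampling_probs)
    return scores

scores = np.array([3.0, 2.0, 1.0])
probs = np.array([0.5, 0.25, 0.05])

plain = contrastive_logits(scores, probs)
corrected = contrastive_logits(scores, probs, logq_sampling_correction=True)

# Disabled flag is a no-op; enabled flag shifts each logit by -log(q)
assert np.allclose(plain, scores)
assert np.allclose(corrected - plain, -np.log(probs))
```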