
Code does not match explanation #1228

Open
hoosierEE opened this issue Oct 29, 2023 · 4 comments

Comments

@hoosierEE
Contributor

The word2vec tutorial at first gives one definition of negative sampling:

A negative sample is defined as a (target_word, context_word) pair such that the context_word does not appear in the window_size neighborhood of the target_word. For the example sentence, these are a few potential negative samples (when window_size is 2)

However, the implementation uses a second definition:

To produce additional skip-gram pairs that would serve as negative samples for training, you need to sample random words from the vocabulary.

There are several places where this second definition is used. First, in the "small" example:

# Get target and context words for one positive skip-gram.
target_word, context_word = positive_skip_grams[0]

# Set the number of negative samples per positive context.
num_ns = 4

context_class = tf.reshape(tf.constant(context_word, dtype="int64"), (1, 1))
negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
    true_classes=context_class,  # class that should be sampled as 'positive'
    num_true=1,  # each positive skip-gram has 1 positive context class
    num_sampled=num_ns,  # number of negative context words to sample
    unique=True,  # all the negative samples should be unique
    range_max=vocab_size,  # pick index of the samples from [0, vocab_size]
    seed=SEED,  # seed for reproducibility
    name="negative_sampling"  # name of this operation
)
print(negative_sampling_candidates)
print([inverse_vocab[index.numpy()] for index in negative_sampling_candidates])

It's used again in the Summary diagram, and later in the definition of generate_training_data:

    # Iterate over each positive skip-gram pair to produce training examples
    # with a positive context word and negative samples.
    for target_word, context_word in positive_skip_grams:
      context_class = tf.expand_dims(
          tf.constant([context_word], dtype="int64"), 1)
      negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
          true_classes=context_class,
          num_true=1,
          num_sampled=num_ns,
          unique=True,
          range_max=vocab_size,
          seed=seed,
          name="negative_sampling")

With a large enough sequence, random sampling is unlikely to pick samples near target_word purely by chance, and as a result the model "works". However, if you test with a small example, you can see that this form of sampling excludes only the context_word.
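
Here is a minimal sketch of that small-example behaviour (the toy vocabulary size, class index, and seed below are made up for illustration, not taken from the tutorial):

import tensorflow as tf

# Toy setup: pretend the vocabulary has only 8 tokens (indices 0-7) and that
# index 3 is the positive context word for the current skip-gram.
context_class = tf.constant([[3]], dtype="int64")
candidates, _, _ = tf.random.log_uniform_candidate_sampler(
    true_classes=context_class,  # the positive context class
    num_true=1,
    num_sampled=4,
    unique=True,
    range_max=8,  # with so few classes, draws routinely fall inside the window
    seed=42,
    name="toy_negative_sampling")
print(candidates.numpy())

Because the sampler draws from the whole [0, range_max) range, nothing prevents it from returning indices that belong to the positive context of the target word.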

My understanding is that for a context window of [the wide road shimmered] with the target word road, the positive (+) and negative (-) examples should be like this:

[the wide road shimmered] in the hot sun
 +++ ++++      +++++++++  -- --- --- ---

Positive samples for road come from [the, wide, shimmered], and negative samples for the positive pair (road, shimmered) should come from [in, the, hot, sun].
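
As a sketch of that first definition (the helper name, seed, and positional boundary convention here are my own, purely for illustration): any token whose position is more than window_size away from the target is a valid negative context word.

import random

# Illustrative helper, not tutorial code: draw negative context words by
# position, i.e. tokens more than window_size away from the target index.
def negative_pairs(tokens, target_index, window_size, num_ns, seed=0):
  rng = random.Random(seed)
  outside = [tokens[i] for i in range(len(tokens))
             if abs(i - target_index) > window_size]
  picks = rng.sample(outside, k=min(num_ns, len(outside)))
  return [(tokens[target_index], w) for w in picks]

sentence = "the wide road shimmered in the hot sun".split()
# Target word "road" is at index 2; pairs drawn here never use window words.
print(negative_pairs(sentence, target_index=2, window_size=2, num_ns=4))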

Either the text's definition of negative sampling should be changed, or the code should be changed to discard positive samples from the negative_sampling_candidates.

@cantonios
Collaborator

Agreed, do you want to adjust the code and create a PR to exclude all context words for the target word?

@hoosierEE
Contributor Author

I'll give it a try and let you know with a PR.

@hoosierEE
Contributor Author

I don't usually work with notebooks, so please excuse the noisy diff. It looks like there was a bunch of HTML escaping in the original that wasn't present in the .ipynb downloaded from Colab.

I saw an improvement in accuracy for the same number of epochs (92% versus 89%), but generate_training_data runs more slowly (about 2m versus <1m on Colab). This is the important part of the diff:

+    # Generate positive context windows for each target word in the sequence.
+    window = defaultdict(list)
+    for i in range(window_size, len(sequence)-window_size):
+      window[sequence[i]].append(sequence[i-window_size:1+i+window_size])

    # Iterate over each positive skip-gram pair to produce training examples
    # with a positive context word and negative samples.
    for target_word, context_word in positive_skip_grams:
      context_class = tf.expand_dims(
          tf.constant([context_word], dtype="int64"), 1)
      negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
          true_classes=context_class,
          num_true=1,
          num_sampled=num_ns,
          unique=True,
          range_max=vocab_size,
          seed=seed,
          name="negative_sampling")

+      # Discard iteration if negative samples overlap with positive context.
+      for target in window[target_word]:
+        if not any(t in target for t in negative_sampling_candidates):
+          break  # All candidates are true negatives: use this skip_gram.
+      else:
+        continue # Discard this skip_gram
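
To spell out the for/else control flow in that new check, here is the same accept/reject logic as a standalone sketch (the function and variable names are mine, for illustration only):

def keep_skip_gram(windows, candidates):
  """Keep the pair only if some window shares no words with the candidates."""
  for window in windows:
    if not any(c in window for c in candidates):
      return True   # this window has no overlap: keep the skip-gram
  return False      # every window overlaps a candidate: discard the skip-gram

print(keep_skip_gram(windows=[[1, 2, 3, 4, 5]], candidates=[6, 7, 8, 9]))  # True
print(keep_skip_gram(windows=[[1, 2, 3, 4, 5]], candidates=[2, 7, 8, 9]))  # False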

No changes to the diagrams, and I left the prose unchanged except for a small correction to the Negative sampling for one skip-gram section:

-  You can call the function on one skip-grams's target word and pass the context word as true class to exclude it from being sampled.
+  You can pass words from the positive class, but this does not exclude them from the results. For large vocabularies this is not a problem, because the chance of drawing one of the positive classes is small. However, for small data you may see overlap between negative and positive samples. Later we will add code to exclude positive samples, for slightly improved accuracy at the cost of longer runtime.

@simonwardjones

Hi, thanks for reporting this. I just wanted to add that this still seems to be an issue in the TensorFlow docs.
