# Quick Thoughts

Based on work by Lajanugen Logeswaran & Honglak Lee in "An Efficient Framework for Learning Sentence Representations" published at ICLR 2018.

This implementation is based on the original, but swaps out the RNN encoder for a Transformer. I also use only a single encoder, rather than one for the query and one for the context. (In other words, encoders f and g share parameters in Figure 1 of the paper. See https://arxiv.org/pdf/1803.02893.pdf)
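
As a rough illustration of that sharing (not the repository's actual code), the sketch below encodes every sentence in a batch once and scores candidate contexts by dot product. The `encoder` here is a toy stand-in for the Transformer, and the layer sizes are made up:

```python
import tensorflow as tf

# Hypothetical stand-in for the Transformer encoder used in this repo.
# Anything mapping token ids -> a fixed-size sentence vector would do.
encoder = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=20000, output_dim=128, mask_zero=True),
    tf.keras.layers.GlobalAveragePooling1D(),  # toy pooling in place of attention
    tf.keras.layers.Dense(128),
])

def score_batch(token_ids):
    """Encode a batch of sentences once and score every pair by dot product.

    Because the same `encoder` produces both the "query" and "context"
    vectors, this corresponds to sharing parameters between f and g.
    """
    vectors = encoder(token_ids)                             # [batch, dim]
    scores = tf.matmul(vectors, vectors, transpose_b=True)   # [batch, batch]
    # A sentence should not be its own context: mask out the diagonal.
    scores -= tf.eye(tf.shape(scores)[0]) * 1e9
    return scores
```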

## How to use

  1. Raw text -> sentence per line: sentence_tokenisation.py converts raw text files into text files with one sentence per line. (Rough sketches of all three steps follow this list.)
  2. Sentences -> TFRecords: create_tfrecords.py first builds a vocabulary over all sentences, then serialises each sentence as a list of integer indices into TFRecords.
  3. Train: train.py runs the model over the training set. The TFRecords are not shuffled, and the loss is computed by having the model predict which sentences in a batch immediately precede/follow every other sentence.
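
For step 1, a minimal sketch of the raw-text-to-sentences conversion, assuming NLTK's punkt tokeniser; sentence_tokenisation.py may well split sentences differently:

```python
import sys
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)

def raw_to_sentences(in_path, out_path):
    # Write one sentence per line, dropping the original line breaks.
    with open(in_path, encoding="utf-8") as f_in, \
         open(out_path, "w", encoding="utf-8") as f_out:
        for line in f_in:
            for sentence in sent_tokenize(line.strip()):
                f_out.write(sentence + "\n")

if __name__ == "__main__":
    raw_to_sentences(sys.argv[1], sys.argv[2])
```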
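
For step 2, a sketch of building a vocabulary and writing sentences as integer-id lists to a TFRecord file. The feature name, whitespace tokenisation, and vocabulary size are assumptions rather than what create_tfrecords.py actually does:

```python
import collections
import tensorflow as tf

def write_tfrecords(sentences, out_path, vocab_size=20000):
    # Build a vocabulary over all sentences (index 0 reserved for unknown words).
    counts = collections.Counter(w for s in sentences for w in s.split())
    vocab = {w: i + 1 for i, (w, _) in enumerate(counts.most_common(vocab_size - 1))}

    # Serialise each sentence as a variable-length list of integer indices.
    with tf.io.TFRecordWriter(out_path) as writer:
        for s in sentences:
            ids = [vocab.get(w, 0) for w in s.split()]
            example = tf.train.Example(features=tf.train.Features(feature={
                "token_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=ids)),
            }))
            writer.write(example.SerializeToString())
    return vocab
```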
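
For step 3, my reading of the loss (not a copy of train.py): with an unshuffled batch of consecutive sentences, each row of the target distribution puts equal mass on the sentences immediately before and after it, and the logits are the dot-product scores from the shared encoder (as in the earlier sketch):

```python
import tensorflow as tf

def quick_thoughts_loss(scores):
    """Cross-entropy between batch-wise scores and "neighbouring sentence" targets.

    `scores` is a [batch, batch] matrix of dot products between sentence
    encodings with the diagonal already masked. Because the TFRecords are read
    in order, sentence i's true contexts are sentences i-1 and i+1.
    """
    n = tf.shape(scores)[0]
    eye = tf.eye(n)
    # Probability mass on the previous and next sentence for each row.
    # (The first and last row wrap around here, a simplification.)
    targets = tf.roll(eye, shift=1, axis=1) + tf.roll(eye, shift=-1, axis=1)
    targets /= tf.reduce_sum(targets, axis=1, keepdims=True)
    losses = tf.nn.softmax_cross_entropy_with_logits(labels=targets, logits=scores)
    return tf.reduce_mean(losses)
```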

## Analysis

analysis.py contains code to explore the patterns learnt by a trained model. A set of sentences is transformed into vectors by the model. By entering a custom sentence, you can then search the text: the script prints the sentences whose encodings have the largest dot product with the encoding of your query.
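
A minimal sketch of that search, assuming you already have the corpus sentences, their encodings as a matrix, and an encoding for the query sentence (all names here are hypothetical):

```python
import numpy as np

def most_similar(query_vector, sentence_vectors, sentences, top_k=5):
    """Print the corpus sentences whose encodings have the largest
    dot product with the query's encoding."""
    scores = sentence_vectors @ query_vector      # [num_sentences]
    for idx in np.argsort(-scores)[:top_k]:
        print(f"{scores[idx]:.3f}\t{sentences[idx]}")
```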