Java implementation of the Word2Vec algorithm supporting CBOW and Skip-gram models. Features include training on custom corpora, vector operations, and similarity calculations. Ideal for NLP tasks like finding similar words and word analogies.

spaceshark123/Word2Vec

Java Word2Vec

This is a Java implementation of the Word2Vec algorithm, which converts words into vectors in a multidimensional embedding space. It supports both CBOW (Continuous Bag of Words) and Skip-gram training, which predict a word from its context words and the context words from a word, respectively. It also provides direct vector operations such as cosine similarity and embedding vector arithmetic. The program includes a CLI (command-line interface) for experimenting with Word2Vec interactively without writing any code. The included Word2Vec class has several helper functions for finding similar words, cleaning and reading a text corpus, and training the neural network with customizable parameters. It depends on my Java NeuralNetwork project, which performs the actual prediction and vectorization, and on my Java ConsoleTool project, which provides the CLI wrapper for the application.

Overview

This project implements Word2Vec, a technique for learning word embeddings using a simple neural network architecture. Word2Vec learns vector representations of words from large text corpora, and these vectors capture semantic and syntactic similarities between words. This implementation includes two training models:

  • CBOW (Continuous Bag of Words): Predicts the target word given the surrounding context words.
  • Skip-gram: Predicts the context words given a single target word.

These embeddings can be used for various natural language processing tasks, such as finding similar words, word analogies, and more.
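To make the difference between the two models concrete, here is a minimal sketch of how training examples could be derived from a token list and a context window. It is an illustration only: the class `TrainingPairSketch` and its methods are hypothetical names, not part of `Word2Vec.java`, and the project may build its training data differently.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: these names are hypothetical and not part of Word2Vec.java.
final class TrainingPairSketch {

    /** One training example: the words fed to the network and the words it should predict. */
    static final class Example {
        final List<String> inputs;
        final List<String> targets;

        Example(List<String> inputs, List<String> targets) {
            this.inputs = inputs;
            this.targets = targets;
        }
    }

    /** Words within windowSize positions of the center index, excluding the center word itself. */
    static List<String> contextOf(List<String> tokens, int center, int windowSize) {
        List<String> context = new ArrayList<>();
        int start = Math.max(0, center - windowSize);
        int end = Math.min(tokens.size() - 1, center + windowSize);
        for (int i = start; i <= end; i++) {
            if (i != center) {
                context.add(tokens.get(i));
            }
        }
        return context;
    }

    /** CBOW: the context words are the input, the center word is the prediction target. */
    static List<Example> cbow(List<String> tokens, int windowSize) {
        List<Example> examples = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            examples.add(new Example(contextOf(tokens, i, windowSize), List.of(tokens.get(i))));
        }
        return examples;
    }

    /** Skip-gram: the center word is the input, its context words are the prediction targets. */
    static List<Example> skipGram(List<String> tokens, int windowSize) {
        List<Example> examples = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            examples.add(new Example(List.of(tokens.get(i)), contextOf(tokens, i, windowSize)));
        }
        return examples;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "quick", "brown", "fox");
        Example c = cbow(tokens, 1).get(1);
        Example s = skipGram(tokens, 1).get(1);
        System.out.println("CBOW:      " + c.inputs + " -> " + c.targets);
        System.out.println("Skip-gram: " + s.inputs + " -> " + s.targets);
    }
}
```

With a window of 1 around "quick", the CBOW example maps the context [the, brown] to [quick], while the Skip-gram example maps [quick] to [the, brown]. The same window produces both, only the direction of prediction changes.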

Key Features

  • CBOW and Skip-gram Support: Implements both Continuous Bag of Words and Skip-gram models to allow flexible training and testing.
  • User-friendly Console Interface: Everything is wrapped in an easy-to-use console interface using commands to perform tasks, similar to a shell terminal. This allows for no-code experimentation with the Word2Vec embedding model.
  • Customizable Parameters: Allows users to adjust parameters such as learning rate, embedding size, context window size, minimum word frequency, and number of training epochs.
  • Vector Operations: Supports vector arithmetic and cosine similarity calculations to find relationships between words.
  • Corpus Processing: Includes functionality to read, clean, and preprocess text corpora, handling tokenization and normalization.
  • Save and Load Models: Ability to save trained models to a file and load pre-trained models for reuse or evaluation.
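As a rough picture of what corpus cleaning and the minimum word frequency parameter do, the sketch below lowercases a corpus, strips punctuation, counts tokens, and drops words that occur too rarely. The class `VocabularySketch`, its methods, and the specific cleaning rule (keep letters, digits, and apostrophes) are assumptions made for illustration; the actual normalization in `Word2Vec.java` may differ.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: hypothetical names and cleaning rules, not the project's own code.
final class VocabularySketch {

    /** Lowercases the text and splits it on any run of characters that is not a-z, 0-9, or an apostrophe. */
    static String[] tokenize(String corpus) {
        String cleaned = corpus.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9']+", " ")
                .trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split(" ");
    }

    /** Counts each token and keeps only those that occur at least minFrequency times. */
    static Map<String, Integer> buildVocabulary(String corpus, int minFrequency) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokenize(corpus)) {
            counts.merge(token, 1, Integer::sum);
        }
        // Removing from the values view also removes the corresponding map entries.
        counts.values().removeIf(count -> count < minFrequency);
        return counts;
    }

    public static void main(String[] args) {
        String corpus = "The quick brown fox jumps over the lazy dog.";
        System.out.println(buildVocabulary(corpus, 1));
        System.out.println(buildVocabulary(corpus, 2));
    }
}
```

Under this sketch's rule, a threshold of 2 keeps only "the" from the sample sentence, and a threshold of 5 keeps nothing, so the minimum frequency is best chosen relative to the size of the corpus.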

Internal Usage

To use it in your own Java projects, add the Word2Vec.java class file to your project and it will be usable immediately. The following steps show the syntax for its main operations:

  1. Initializing the Word2Vec Model:
Word2Vec.ModelType modelType = Word2Vec.ModelType.CBOW; // or Word2Vec.ModelType.SKIPGRAM
int minFrequency = 5; // minimum number of times a word must occur to be added to the model's vocabulary
int windowSize = 5; // how far around the word to look for context
int dimensions = 100; // number of dimensions in each embedding vector
String corpusString = "The quick brown fox jumps over the lazy dog."; // corpus text is automatically cleaned up for tokenization and parsing

Word2Vec model = new Word2Vec(modelType, corpusString, minFrequency, windowSize, dimensions);
  2. Training the Model: Use the train method to train the Word2Vec model on the text corpus.
model.train(numberOfEpochs, learningRate);
  3. Finding Similar Words: Use the findSimilarWords method to find the words most similar to a given input word or embedding vector, or the getClosestWord method to get only the single closest word.
String[] similarWords = model.findSimilarWords("word", topN);

double[] vector = {...}; // any embedding vector
String[] similarWords2 = model.findSimilarWords(vector, topN);

String closestWord = model.getClosestWord("word");
String closestWord2 = model.getClosestWord(vector);
String closestWord3 = model.getClosestWord(vector, "excludedWord1", "excludedWord2", ...); // skips the listed words
  4. Vector Arithmetic: Perform operations like word analogies using vector arithmetic.
double[] kingVector = model.vector("king");
double[] manVector = model.vector("man");
double[] womanVector = model.vector("woman");
double[] queenVector = model.add(model.subtract(kingVector, manVector), womanVector); // king - man + woman ≈ queen
String queenWord = model.getClosestWord(queenVector); // gets the closest word that matches this new embedding vector
  5. Comparing Words: Compare two words by their cosine similarity using the similarity method.
double sim1 = model.similarity("king", "queen"); // high similarity (vectors point in similar directions)
double sim2 = model.similarity("king", "phone"); // near 0 similarity (unrelated words)
double sim3 = model.similarity("king", "peasant"); // low or negative similarity (vectors point in opposing directions)
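
For readers curious about what the vector arithmetic and similarity steps compute, the following is a minimal standalone sketch of cosine similarity, element-wise addition and subtraction, and a nearest-word lookup with excluded words. The class `VectorOpsSketch` and the hand-picked three-dimensional toy vectors are assumptions for illustration; they are not taken from `Word2Vec.java`, and trained embeddings will not be this tidy.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: hypothetical names and toy data, not the project's own code.
final class VectorOpsSketch {

    private static void requireSameLength(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same number of dimensions");
        }
    }

    /** Element-wise sum of two vectors. */
    static double[] add(double[] a, double[] b) {
        requireSameLength(a, b);
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] + b[i];
        }
        return out;
    }

    /** Element-wise difference a - b. */
    static double[] subtract(double[] a, double[] b) {
        requireSameLength(a, b);
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] - b[i];
        }
        return out;
    }

    /** Cosine of the angle between a and b, in [-1, 1]; returns 0 if either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        requireSameLength(a, b);
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the word whose vector has the highest cosine similarity to query, skipping excluded words. */
    static String closestWord(Map<String, double[]> embeddings, double[] query, String... excluded) {
        Set<String> skip = new HashSet<>(Arrays.asList(excluded));
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> entry : embeddings.entrySet()) {
            if (skip.contains(entry.getKey())) {
                continue;
            }
            double score = cosine(query, entry.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy vectors chosen by hand so that king - man + woman equals queen exactly.
        Map<String, double[]> embeddings = new LinkedHashMap<>();
        embeddings.put("man", new double[] {1, 0, 0});
        embeddings.put("woman", new double[] {0, 1, 0});
        embeddings.put("king", new double[] {1, 0, 1});
        embeddings.put("queen", new double[] {0, 1, 1});

        double[] query = add(subtract(embeddings.get("king"), embeddings.get("man")), embeddings.get("woman"));
        System.out.println(closestWord(embeddings, query, "king", "man", "woman")); // prints "queen"
    }
}
```

Excluding the input words from the lookup matters with real embeddings, because the result of an analogy is often still closest to one of the words that went into it.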

Program Usage

  1. Compile the Code: First, make sure you are working in the project directory. If you are running the full project with the console interface, run the following commands to compile and run the program:

    Unix (Mac/Linux) users:

    Compile:

    javac -cp ".:./libraries/jfreechart-1.5.3.jar" Main.java

    Run:

    java -cp ".:./libraries/jfreechart-1.5.3.jar" Main

    Windows users:

    Compile:

    javac -cp ".;./libraries/jfreechart-1.5.3.jar" Main.java

    Run:

    java -cp ".;./libraries/jfreechart-1.5.3.jar" Main

    Or, if you are just using the Word2Vec class, the jfreechart library can be excluded, simplifying the commands to:

    Compile:

    javac Main.java

    Run:

    java Main

Exiting the Program:

  • To exit the program, simply type exit, and the program will terminate.
