GitHub - HNOONa-0/WTOE: Word to emoji

What is this?

A word-to-emoji converter: given a word, which emoji (or emojis) describe it best?

For example, an appropriate emoji for the word clown is 🤡.

Problem

Manually mapping emojis to specific words is very tedious, and the mapping is many-to-many: one word can map to several emojis and vice versa. For example, the word sad could map to 😢 or 😔, yummy could map to 😋 or 🤤, and the emoji 😃 could map to either of the words smile or happy.

Approach: Word Embeddings

Word embedding is a powerful technique used to represent words as numerical vectors in a high-dimensional space. The main idea behind word embeddings is that words with similar meanings or that are used in similar contexts tend to have similar vector representations.

For instance, consider the words "cat," "dog," and "car." In a well-trained word embedding model, the vectors representing "cat" and "dog" would be closer to each other than either of them would be to the vector representing "car." This is because "cat" and "dog" share a common context as pets, while "car" is unrelated in terms of meaning and usage.
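"Closer" is usually measured with cosine similarity, the cosine of the angle between two vectors. The snippet below is a minimal sketch of that measurement. The three-dimensional vectors are made up for illustration and do not come from the trained model, whose vectors have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, hand-picked so that "cat" and "dog" point
# in a similar direction while "car" points elsewhere.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
print(f"cat~dog: {cat_dog:.3f}, cat~car: {cat_car:.3f}")
```

A value near 1 means the vectors point in almost the same direction; a value near 0 means they are nearly unrelated.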

In other words, word embeddings capture contextual relationships between words by analyzing large amounts of text. By training a model on vast amounts of textual data, such as Twitter posts, news articles, or books, the model learns to associate words that frequently appear together and assigns them similar vector representations.
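To make "appear together" concrete, here is a simplified sketch of how a Skip-gram style model can turn a tweet into (center, context) training pairs using a fixed window. The function name `skipgram_pairs` and the sample tweet are assumptions for illustration. Real Word2Vec implementations add further steps, such as subsampling frequent words, that are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for tokens at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical tweet, split on whitespace so the emoji is its own token.
tweet = "this pizza is so yummy 😋".split()
for center, context in skipgram_pairs(tweet, window=1):
    print(center, "->", context)
```

Because yummy and 😋 end up in the same window, a model trained on many such tweets has a reason to place their vectors close together.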

A popular word embedding model is Word2Vec, which learns word embeddings using one of two architectures: Skip-gram or Continuous Bag of Words (CBOW). This project uses the Gensim library to apply the Word2Vec model to a collection of 100 different emojis. The training dataset consists of approximately 600,000 tweets obtained from Twitter using the Snscrape library.
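Once words and emojis share one vector space, choosing emojis for a word comes down to ranking the emoji vectors by similarity to the word's vector. The sketch below shows that lookup in plain Python. It is not the repository's code: `rank_emojis` is a hypothetical helper, and the two-dimensional vectors are made up, whereas in the project they would come from the trained Word2Vec model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def rank_emojis(word, vectors, emojis, topn=2):
    """Return up to `topn` emojis most similar to `word`, best first.

    Returns an empty list when the word has no vector (out of vocabulary).
    """
    if word not in vectors:
        return []
    scored = [
        (emoji, cosine_similarity(vectors[word], vectors[emoji]))
        for emoji in emojis
        if emoji in vectors
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [emoji for emoji, _ in scored[:topn]]

# Hypothetical toy vectors, for illustration only.
toy_vectors = {
    "sad": [0.9, 0.1],
    "yummy": [0.1, 0.9],
    "😢": [0.8, 0.2],
    "😔": [0.7, 0.3],
    "😋": [0.2, 0.8],
    "🤤": [0.3, 0.7],
}
emojis = ["😢", "😔", "😋", "🤤"]

print(rank_emojis("sad", toy_vectors, emojis))
print(rank_emojis("clown", toy_vectors, emojis))  # "clown" has no vector here
```

Returning an empty list for an unknown word keeps the caller in control of what to show when the model has never seen that word.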

Below are some results of the program:

Issues

1) Not all words are present in this model, example:

2) Some outputs are not satisfactory, example:

This is due to many reasons, two of which are:

The dataset is small. Unfortunately, the Twitter API has changed and tweets are now a lot harder to scrape, so it is not easy to collect more.
The quality of the dataset itself. This depends on the query used to scrape tweets; improving that query could go a long way.

Of course, there are many models we could use; I wanted to showcase how a Word2Vec model could go about solving this problem.

