
Benchmarking on larger datasets (low accuracy) #2

Open
arjun-mani opened this issue Apr 26, 2021 · 4 comments

@arjun-mani

I've been doing some more work with this repo, and I think it'd be productive to do some benchmarking beyond the given example. For example, I've started working with the wiki8 text corpus (the first 10^8 bytes of Wikipedia) and running some tests; Gensim's implementation gives an accuracy of ~24% on analogies, while I'm only seeing ~5% with this model.

Ideally we wouldn't see this kind of gap, so maybe it'd be a good idea to do some testing on larger datasets? I can also share some code to this end.
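
For reference, the Gensim baseline and the analogy evaluation I'm running look roughly like this (just a sketch; the corpus path and hyperparameters are illustrative, using gensim 4.x argument names):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus
from gensim.test.utils import datapath

# Corpus of the first 10^8 bytes of Wikipedia; the path here is illustrative.
sentences = Text8Corpus("text8")

# Skip-gram with negative sampling; hyperparameters are illustrative.
model = Word2Vec(sentences, vector_size=100, window=5, sg=1, negative=5,
                 sample=1e-5, min_count=5, workers=4, epochs=5)

# Google analogy test set bundled with gensim; returns (overall accuracy, per-section results).
score, sections = model.wv.evaluate_word_analogies(datapath("questions-words.txt"))
print(f"analogy accuracy: {score:.3f}")
```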

@ddehueck
Owner

Great catch. And yes, this repo is very poorly benchmarked, so this type of work is very much appreciated. If you have a repo demonstrating this difference, I'd love to take a look!

I should have some free time in the coming weeks to make some improvements.

@arjun-mani
Author

Absolutely, I really appreciate your responsiveness. I'm a bit busy this week with a deadline (related to this work) but will try to share a repo soon after. A couple of suggestions: adding subsampling of frequent words, and using two weight matrices (a separate one each for the context and center lookups).
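
To make the two-matrix suggestion concrete, here's a minimal sketch of an SGNS model with separate center and context embedding tables (class and variable names are illustrative, not taken from this repo); only the center table would be kept as the final word vectors:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SGNSModel(nn.Module):
    """Skip-gram with negative sampling, using separate center/context tables."""

    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        # Center ("input") vectors: these become the final embeddings.
        self.center_embeds = nn.Embedding(vocab_size, embed_dim)
        # Context ("output") vectors: used only for the training objective.
        self.context_embeds = nn.Embedding(vocab_size, embed_dim)
        nn.init.uniform_(self.center_embeds.weight, -0.5 / embed_dim, 0.5 / embed_dim)
        nn.init.zeros_(self.context_embeds.weight)

    def forward(self, center_ids, context_ids, negative_ids):
        # center_ids, context_ids: (batch,); negative_ids: (batch, k)
        center = self.center_embeds(center_ids)          # (batch, dim)
        context = self.context_embeds(context_ids)       # (batch, dim)
        negatives = self.context_embeds(negative_ids)    # (batch, k, dim)

        pos_score = (center * context).sum(dim=-1)                           # (batch,)
        neg_score = torch.bmm(negatives, center.unsqueeze(-1)).squeeze(-1)   # (batch, k)

        # Maximize log sigma(pos) and log sigma(-neg) over the sampled negatives.
        loss = -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum(dim=-1))
        return loss.mean()
```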

@ddehueck
Owner

No problem, happy to work towards making this repo a good resource for people.

As for subsampling, it is done in sgns_loss.py with respect to a multinomial distribution defined in utils.py. I believe I found this method in another source, so it may be worth reconsidering the actual implementation.
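
For reference, the usual recipe draws negatives from the unigram distribution raised to the 3/4 power; a rough sketch of that (the names here are illustrative and may not match what utils.py actually does):

```python
import torch

def make_noise_distribution(word_counts):
    """Standard SGNS noise distribution: unigram counts raised to the 3/4 power."""
    counts = torch.tensor(word_counts, dtype=torch.float)
    noise_dist = counts.pow(0.75)
    return noise_dist / noise_dist.sum()

def sample_negatives(noise_dist, batch_size, k):
    """Draw k negative word ids per example from the noise distribution."""
    samples = torch.multinomial(noise_dist, batch_size * k, replacement=True)
    return samples.view(batch_size, k)
```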

I've seen the two-weight-matrix approach done before and I'm happy to give it a try. Looking forward to making some improvements.

@arjun-mani
Author

I may be mistaken, but I believe the code in sgns_loss.py is for negative sampling? What I meant by subsampling is discarding training examples based on the frequency of the center word in the dataset (Sec. 2.3 here: https://papers.nips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf)
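
Concretely, that rule discards each occurrence of a word w with probability 1 - sqrt(t / f(w)), where f(w) is w's relative frequency in the corpus and t is a small threshold (the paper uses around 1e-5). A rough sketch of how that could be applied during preprocessing (identifiers are illustrative, not from this repo):

```python
import math
import random
from collections import Counter

def subsample_frequent_words(tokens, t=1e-5, seed=0):
    """Drop frequent tokens per Mikolov et al. (2013), Sec. 2.3.

    Each occurrence of word w is discarded with probability
    1 - sqrt(t / f(w)), where f(w) is w's relative frequency.
    """
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for w in tokens:
        f = counts[w] / total
        p_discard = 1.0 - math.sqrt(t / f)
        if rng.random() > p_discard:
            kept.append(w)
    return kept
```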
