Skip to content

Icelandic Word Embeddings. Here you can find pre-trained corpora of word embeddings. Current methods: CBOW, Skip-Gram, Fast-Text (from Gensim library). The .model file are available for download.

License

Notifications You must be signed in to change notification settings

BlackKakapo/Icelandic-Word-Embedding

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 

Repository files navigation

Icelandic Word Embedding

These vectors was trained with 3 different methods (CBOW, Skip-Gram, FastText) from Gensim library. The dataset is a bunch of text that was taken from internet (news, comments, blogs etc.).

Please notice I do not claim that these vectors are the best for icelandic language!

About Dataset

The text is pre processed and cleaned.

  • 34.001.430 - Sentences (One of purpose is the sentence need to be bigger than 35 characters, including the spaces)
  • 509.359.137 - Words
  • 366.181 - Unique words
  • 14,98 - AVG number of words in sentence

Method Size Min_count Window Download Size
CBOW 300 20 5 Download 770 MB
CBOW 300 20 15 Download 770 MB
Skip-Gram 300 20 5 Download 770 MB
Skip-Gram 300 20 15 Download 770 MB
FastText (SG) 300 20 5 Download 2.75 GB
FastText (SG) 300 20 15 Download 2.75 GB

Some tests with ICE_SG_300_20_15.model

from gensim.models import Word2Vec

model = Word2Vec.load('ICE_SG_300_20_15.model')

resultQuery = model.wv.most_similar('**WORD**')

for result in resultQuery:
    print(result)
    
In: spánn
Out:
('ítalía', 0.811682403087616)
('frakkland', 0.7648149728775024)
('grikkland', 0.7166905403137207)
('argentína', 0.7072430849075317)
('brasilía', 0.7048637270927429)
('slóvenía', 0.6866981983184814)
('rúmenía', 0.6782016158103943)
('andalúsía', 0.6689508557319641)
('portúgal', 0.6633602380752563)
('madríd-samfélagið', 0.6570574045181274)

In: rúmenía
Out:
('búlgaría', 0.787040114402771)
('slóvenía', 0.7784465551376343)
('pólland', 0.7485482692718506)
('belgía', 0.7436263561248779)
('ungverjaland', 0.7411393523216248)
('slóvakía', 0.7309471964836121)
('serbía', 0.7282808423042297)
('lettland', 0.7272511720657349)
('króatía', 0.716016948223114)
('noregur', 0.7095803022384644)

In: ísland
Out: 
('noregur', 0.6620221734046936)
('danmörk', 0.60569828748703)
('finnland', 0.5955715775489807)
('noreg', 0.5871228575706482)
('norður-atlantshafslaxins', 0.581960141658783)
('færeyjar', 0.5819252133369446)
('króatía', 0.5767964720726013)
('belgía', 0.5625469088554382)
('eistland', 0.5602771639823914)
('norðurlöndin', 0.556350827217102)

In: æðislegt
Out:
('yndislegt', 0.8013175129890442)
('geggjað', 0.7791568040847778)
('dásamlegt', 0.762319803237915)
('æðislega', 0.7110927104949951)
('jiii', 0.6798439025878906)
('æðisleg', 0.6753286719322205)
('geðveikt', 0.6734864115715027)
('svooooooo', 0.6663821339607239)
('flott', 0.6644969582557678)
('fínt', 0.6557028889656067)

About

Icelandic Word Embeddings. Here you can find pre-trained corpora of word embeddings. Current methods: CBOW, Skip-Gram, Fast-Text (from Gensim library). The .model file are available for download.

Topics

Resources

License

Stars

Watchers

Forks