This application demonstrates how word embeddings can be used to find similar words. Here, similar words are words that occur in the same or similar contexts in the National Corpus of Irish.
Downloadable word embeddings

- word2vec: cng-word2vec.vec.zip
- fasttext: cng-fasttext.vec.zip
- These are space-separated text files in the standard word2vec format, compressed with ZIP.
- The first line contains the number of words and the number of dimensions (100).
- Each remaining line consists of the word in the first column followed by the vector values in the remaining 100 columns.
- The words are ordered by frequency, with the most frequent words at the top.
- Note: Although both sets of embeddings are offered here in the so-called “word2vec format”, they have been derived using two different machine learning algorithms: one using the word2vec algorithm and one using the fasttext algorithm, as implemented in the Gensim library.
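The file format described above is simple enough to read without any external library. The following is a minimal sketch of a loader for the space-separated text format, assuming a file laid out exactly as described (header line, then one word and 100 values per line); the function name is illustrative, not part of the download.

```python
def load_vectors(path, limit=None):
    """Read word2vec-format text vectors into a dict of word -> list of floats.

    A minimal illustrative parser, not part of the official tooling.
    """
    vectors = {}
    with open(path, encoding='utf-8') as f:
        # First line: "<number of words> <number of dimensions>"
        n_words, n_dims = map(int, f.readline().split())
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            parts = line.rstrip().split(' ')
            word = parts[0]
            values = [float(v) for v in parts[1:]]
            # each data line should carry exactly n_dims values
            assert len(values) == n_dims
            vectors[word] = values
    return vectors

# usage (file name as in the download above):
# vectors = load_vectors('cng-fasttext.vec', limit=100000)
```

Because the words are ordered by frequency, the `limit` argument (as in Gensim's loader below) simply keeps the most frequent words.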
How to use

This code sample shows how to load and use the word embeddings with the Python programming language and the Gensim library.
import gensim
# load the vectors:
wv = gensim.models.KeyedVectors.load_word2vec_format('cng-fasttext.vec', binary=False, limit=100000)
# find ten words most similar to 'teach':
similars = wv.most_similar('teach', topn=10)
for similar in similars:
    print(similar)
Output:
('tigh', 0.9031928181648254)
('seanteach', 0.773318350315094)
('mbaile', 0.7576225996017456)
('tigín', 0.753011167049408)
('séipéal', 0.7515964508056641)
('teachín', 0.7445628643035889)
('pub', 0.7366455793380737)
('scioból', 0.7314869165420532)
('siopa', 0.7245514988899231)
('bhothán', 0.7238678336143494)
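The second element of each pair is the cosine similarity between the two word vectors, which is what Gensim's `most_similar` ranks by. A minimal sketch of the computation, using made-up 3-dimensional toy vectors rather than the real 100-dimensional ones:

```python
import math

def cosine_similarity(u, v):
    # cosine similarity = dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy vectors for illustration only; the real embeddings have 100 dimensions
# and these values are not taken from the actual files
teach = [0.8, 0.1, 0.2]
tigh = [0.7, 0.2, 0.3]
print(cosine_similarity(teach, tigh))
```

Scores close to 1.0 (such as 0.90 for 'tigh' above) indicate words used in very similar contexts; lower scores indicate looser contextual relatedness.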