This application demonstrates how word embeddings can be used to find similar words. Here, similar words are words that occur in the same or similar contexts in the National Corpus of Irish.
Downloadable word embeddings

- word2vec: cng-word2vec.vec.zip
- fasttext: cng-fasttext.vec.zip
- These are space-separated text files in the standard word2vec format, compressed with ZIP.
- The first line contains the number of words and the number of dimensions (100).
- Each remaining line consists of the word in the first column followed by the vector values in the remaining 100 columns.
- The words are ordered by frequency, with the most frequent words at the top.
- Note: Although both sets of embeddings are offered here in the so-called “word2vec format”, they have been derived using two different machine learning algorithms: one using the word2vec algorithm and one using the fasttext algorithm, as implemented in the Gensim library.
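The file format described above is simple enough to read without any external library. The following is a minimal sketch of a loader for the space-separated text format, assuming a file laid out exactly as described (header line, then one word and 100 values per line); the function name is illustrative, not part of the download.

```python
def load_vectors(path, limit=None):
    """Read word2vec-format text vectors into a dict of word -> list of floats.

    A minimal illustrative parser, not part of the official tooling.
    """
    vectors = {}
    with open(path, encoding='utf-8') as f:
        # First line: "<number of words> <number of dimensions>"
        n_words, n_dims = map(int, f.readline().split())
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            parts = line.rstrip().split(' ')
            word = parts[0]
            values = [float(v) for v in parts[1:]]
            # each data line should carry exactly n_dims values
            assert len(values) == n_dims
            vectors[word] = values
    return vectors

# usage (file name as in the download above):
# vectors = load_vectors('cng-fasttext.vec', limit=100000)
```

Because the words are ordered by frequency, the `limit` argument (as in Gensim's loader below) simply keeps the most frequent words.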
How to use

This code sample shows how to load and use the word embeddings with the Python programming language and the Gensim library.
import gensim
# load the vectors:
wv = gensim.models.KeyedVectors.load_word2vec_format('cng-fasttext.vec', binary=False, limit=100000)
# find ten words most similar to 'teach':
similars = wv.most_similar('teach', topn=10)
for similar in similars:
    print(similar)
Output:
('tigh', 0.9031928181648254)
('seanteach', 0.773318350315094)
('mbaile', 0.7576225996017456)
('tigín', 0.753011167049408)
('séipéal', 0.7515964508056641)
('teachín', 0.7445628643035889)
('pub', 0.7366455793380737)
('scioból', 0.7314869165420532)
('siopa', 0.7245514988899231)
('bhothán', 0.7238678336143494)
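The second element of each pair is the cosine similarity between the two word vectors, which is what Gensim's `most_similar` ranks by. A minimal sketch of the computation, using made-up 3-dimensional toy vectors rather than the real 100-dimensional ones:

```python
import math

def cosine_similarity(u, v):
    # cosine similarity = dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy vectors for illustration only; the real embeddings have 100 dimensions
# and these values are not taken from the actual files
teach = [0.8, 0.1, 0.2]
tigh = [0.7, 0.2, 0.3]
print(cosine_similarity(teach, tigh))
```

Scores close to 1.0 (such as 0.90 for 'tigh' above) indicate words used in very similar contexts; lower scores indicate looser contextual relatedness.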