Models for download
These models were trained with fastText on corpora available in Sketch Engine, using the SkipGram model with dimension 100 unless a different dimension is noted for a model in the list below.
The models are provided in two different formats. Models with the bin extension are encoded in the native binary fastText format. Models with the vec extension use the textual Word2Vec format. We recommend the bin format, as it contains the subword n-gram information, is more compact and also faster to load.
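As an illustration of the textual format, the sketch below parses Word2Vec-style text with the Python standard library only. It assumes the usual layout of such files: a header line with the vocabulary size and the dimension, followed by one line per word with its vector components separated by spaces. The function names `read_vec` and `cosine` and the tiny 4-dimensional sample are made up for this example and are not taken from any of the models.

```python
import io
import math


def read_vec(handle):
    """Parse Word2Vec text data: a 'count dim' header, then one word per line."""
    header = handle.readline().split()
    _count, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.split()  # tolerates a trailing space before the newline
        if not parts:
            continue  # skip blank lines
        word = parts[0]
        values = [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    return dim, vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Invented sample data standing in for a real .vec file.
SAMPLE = (
    "3 4\n"
    "king 0.5 0.1 0.0 0.2\n"
    "queen 0.4 0.2 0.0 0.2\n"
    "apple 0.0 0.0 0.9 0.1\n"
)

dim, vectors = read_vec(io.StringIO(SAMPLE))
print(dim, sorted(vectors))
print(cosine(vectors["king"], vectors["queen"]) > cosine(vectors["king"], vectors["apple"]))
```

For a downloaded file, the same function could be called on `open(path, encoding="utf-8")`. Note that a plain dictionary like this only covers words listed in the file; the subword n-gram information is available only through the bin format.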
To use these models, load them in fastText for supervised classification, or use them with the venerable Gensim package for other NLP tasks.
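A minimal sketch of loading a downloaded model with Gensim follows. It assumes Gensim 4.x is installed and picks the loader from the file extension: `load_facebook_vectors` for the native fastText bin format and `KeyedVectors.load_word2vec_format` for the textual vec format. The helper name `load_vectors` and the file name in the usage comment are hypothetical.

```python
from pathlib import Path


def load_vectors(path):
    """Load word vectors with Gensim, choosing the loader by file extension."""
    suffix = Path(path).suffix
    if suffix == ".bin":
        # Native fastText binary: keeps subword n-grams, so vectors can
        # also be computed for words that were not seen during training.
        from gensim.models.fasttext import load_facebook_vectors
        return load_facebook_vectors(str(path))
    if suffix == ".vec":
        # Textual Word2Vec format: only the words listed in the file.
        from gensim.models import KeyedVectors
        return KeyedVectors.load_word2vec_format(str(path), binary=False)
    raise ValueError(f"unsupported model extension: {suffix!r}")


# Usage (hypothetical file name; the file must be downloaded first):
# vectors = load_vectors("model.bin")
# print(vectors.most_similar("house", topn=5))
```

The Gensim imports are placed inside the function so that only the loader that is actually needed gets imported. With the fastText Python bindings, the bin files can alternatively be opened with `fasttext.load_model(path)`.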
You can browse the models using our web frontend at https://embeddings.sketchengine.eu/
Citing the models
If the models contribute to a project which leads to a publication, please acknowledge this by citing the following article:
@article{herman2021precomputed,
title={Precomputed Word Embeddings for 15+ Languages},
author={Herman, Ond{\v{r}}ej},
journal={RASLAN 2021 Recent Advances in Slavonic Natural Language Processing},
pages={41},
year={2021}
}
License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
For commercial use of the dataset, please send an e-mail to inquiries@sketchengine.eu describing the intended usage and the languages that you would like to use.
More languages to come
We are continuing to build further models for languages for which we have enough data in Sketch Engine. If you would like us to prioritise a language for which there is a large enough corpus in Sketch Engine but for which we have not yet built embeddings, please send an e-mail to inquiries@sketchengine.eu.
English (Web, 2013, 20 billion tokens)
English (British National Corpus, 100 million tokens)
Early English (1 billion tokens)
Arabic (Web, 2012, 8 billion tokens)
- Word form [character ngrams] (.bin, .vec)
Chinese Simplified (Web, 2011, 2 billion tokens)
- Word form [character ngrams] (.bin, .vec)
Czech (Web, 2012, 5 billion tokens)
- Word form [?] (.bin, .vec)
- Lemma [character ngrams] (.bin, .vec)
- Lemma (lowercase) [character ngrams] (.bin, .vec)
- Word form [character ngrams] (.bin, .vec)
- Word form [character ngrams,d=200] (.bin, .vec)
- Word form [character ngrams,d=300] (.bin, .vec)
- Word form [character ngrams,d=400] (.bin, .vec)
- Word form [d=200] (.bin, .vec)
Danish (Web, 2014, 2 billion tokens)
- Word form [character ngrams] (.bin, .vec)
- Word form (lowercase) [character ngrams] (.bin, .vec)
- Lemma [character ngrams] (.bin, .vec)
- Lemma (lowercase) [character ngrams] (.bin, .vec)
- Lemma + Part of Speech [character ngrams] (.bin, .vec)
Estonian (Web, 2017, 1.3 billion tokens)
French (Web, 2012, 11 billion tokens)
German (Web, 2013, 20 billion tokens)
Italian (Web, 2010, 3 billion tokens)
Korean (Web, 2012, 500 million tokens)
- Word form [character ngrams] (.bin, .vec)
Portuguese (Web, 2011, 4 billion tokens)
Russian (Web, 2011, 18 billion tokens)
- Word form [character ngrams] (.bin, .vec)
- Word form (lowercase) [character ngrams] (.bin, .vec)
- Lemma [character ngrams] (.bin, .vec)
- Lemma (lowercase) [character ngrams] (.bin, .vec)
- Lemma + Part of Speech [character ngrams] (.bin, .vec)
Slovak (Web, 2021)
Slovenian (Web, 2015, 1 billion tokens)
- Word form [character ngrams] (.bin, .vec)
- Word form (lowercase) [character ngrams] (.bin, .vec)
- Lemma [character ngrams] (.bin, .vec)
Spanish (Web, 2011, 10 billion tokens)
Spanish, American (Web, 2011, 8 billion tokens)
- Word form (lowercase) [character ngrams] (.bin, .vec)
Spanish, European (Web, 2011, 2 billion tokens)
- Word form (lowercase) [character ngrams] (.bin, .vec)
Ukrainian (Web, 2020) dim 300
Copyright 2018-2022 Lexical Computing s. r. o.