Models for download
These models were trained with fastText on corpora available in Sketch Engine, using the SkipGram model with dimension 100 unless a different dimension is noted in the listing below.
The models are provided in two different formats. Models with the bin extension are encoded in the native binary fastText format. Models with the vec extension use the textual Word2Vec format. We recommend the bin format, as it contains the subword n-gram information, is more compact and also faster to load.
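To illustrate what "subword n-gram information" refers to, the following is a minimal sketch of how fastText-style character n-grams are derived from a word: the word is wrapped in the boundary markers < and >, and every substring within a length range is collected. The function name char_ngrams is hypothetical, and the range 3 to 6 is the fastText default, assumed here; this page does not state which range was used for these models.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word`, fastText-style.

    minn/maxn default to the fastText defaults (an assumption; the
    settings used for the published models are not documented here).
    """
    # fastText marks word boundaries so that prefixes and suffixes
    # are distinguishable from word-internal character sequences.
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# Trigrams only, for a compact illustration.
print(char_ngrams("where", minn=3, maxn=3))
```

Because a bin model stores vectors for these n-grams, it can compose a vector for a word that never occurred in the training corpus, whereas a vec file only covers the words listed in it.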
To make use of these models, load them in fastText for supervised classification, or use them with the venerable Gensim package for other NLP tasks.
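As a rough illustration of what a vec file contains, the sketch below parses the textual Word2Vec format using only the Python standard library and compares two words by cosine similarity. The helper names load_vec and cosine and the three-word sample are made up for illustration; in practice a library loader such as Gensim's KeyedVectors.load_word2vec_format is likely the more convenient choice.

```python
import io
import math


def load_vec(lines):
    """Parse textual Word2Vec data into a {word: [float, ...]} dict."""
    rows = iter(lines)
    # Header line: "<vocabulary size> <dimension>"
    _, dim = (int(x) for x in next(rows).split())
    vectors = {}
    for line in rows:
        # Split from the right so exactly `dim` numbers are taken.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical 2-dimensional sample standing in for a downloaded file.
SAMPLE = "3 2\nking 1.0 0.0 \nqueen 0.8 0.6 \napple 0.0 1.0 \n"
vectors = load_vec(io.StringIO(SAMPLE))
print(sorted(vectors))
print(cosine(vectors["king"], vectors["queen"]))
```

For a downloaded model, the same function could be given an open file handle instead of the sample, for example open(path, encoding="utf-8"), where path points to the vec file. Loading a full vocabulary into plain Python lists uses a lot of memory, so this is meant for inspecting the format rather than for production use.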
You can browse the models using our web frontend at https://embeddings.sketchengine.eu/
Citing the models
If the models contribute to a project which leads to a publication, please acknowledge this by citing the following article:
@article{herman2021precomputed,
title={Precomputed Word Embeddings for 15+ Languages},
author={Herman, Ond{\v{r}}ej},
journal={RASLAN 2021 Recent Advances in Slavonic Natural Language Processing},
pages={41},
year={2021}
}
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
For commercial use of the dataset, please write to inquiries@sketchengine.eu and describe the intended usage and the languages that you would like to use.
More languages to come
We continue to build models for further languages for which Sketch Engine has enough data. If you would like us to prioritise a language that has a large enough corpus in Sketch Engine but no embeddings yet, please send an e-mail to inquiries@sketchengine.eu.
English (Web, 2013, 20 billion tokens)
English (British National Corpus, 100 million tokens)
Early English (1 billion tokens)
Arabic (Web, 2012, 8 billion tokens)
- Word form [character ngrams] (.bin, .vec)
Chinese Simplified (Web, 2011, 2 billion tokens)
- Word form [character ngrams] (.bin, .vec)
Czech (Web, 2012, 5 billion tokens)
- Word form [?] (.bin, .vec)
- Lemma [character ngrams] (.bin, .vec)
- Lemma (lowercase) [character ngrams] (.bin, .vec)
- Word form [character ngrams] (.bin, .vec)
- Word form [character ngrams,d=200] (.bin, .vec)
- Word form [character ngrams,d=300] (.bin, .vec)
- Word form [character ngrams,d=400] (.bin, .vec)
- Word form [d=200] (.bin, .vec)
Danish (Web, 2014, 2 billion tokens)
- Word form [character ngrams] (.bin, .vec)
- Word form (lowercase) [character ngrams] (.bin, .vec)
- Lemma [character ngrams] (.bin, .vec)
- Lemma (lowercase) [character ngrams] (.bin, .vec)
- Lemma + Part of Speech [character ngrams] (.bin, .vec)
Estonian (Web, 2017, 1.3 billion tokens)
French (Web, 2012, 11 billion tokens)
German (Web, 2013, 20 billion tokens)
Italian (Web, 2010, 3 billion tokens)
Korean (Web, 2012, 500 million tokens)
- Word form [character ngrams] (.bin, .vec)
Portuguese (Web, 2011, 4 billion tokens)
Russian (Web, 2011, 18 billion tokens)
- Word form [character ngrams] (.bin, .vec)
- Word form (lowercase) [character ngrams] (.bin, .vec)
- Lemma [character ngrams] (.bin, .vec)
- Lemma (lowercase) [character ngrams] (.bin, .vec)
- Lemma + Part of Speech [character ngrams] (.bin, .vec)
Slovak (Web, 2021)
Slovenian (Web, 2015, 1 billion tokens)
- Word form [character ngrams] (.bin, .vec)
- Word form (lowercase) [character ngrams] (.bin, .vec)
- Lemma [character ngrams] (.bin, .vec)
Spanish (Web, 2011, 10 billion tokens)
Spanish, American (Web, 2011, 8 billion tokens)
- Word form (lowercase) [character ngrams] (.bin, .vec)
Spanish, European (Web, 2011, 2 billion tokens)
- Word form (lowercase) [character ngrams] (.bin, .vec)
Ukrainian (Web, 2020) [d=300]
Arabic Web 2018 (arTenTen18)
Czech Web (csTenTen 12+17+19)
Danish Web 2020 (daTenTen20)
German Web 2020 (deTenTen20)
Greek Web 2019 (elTenTen19)
Timestamped JSI web corpus English 2014-2022
English Web 2021 (enTenTen21)
Spanish Web 2018 (esTenTen18)
Estonian National Corpus 2021 (Estonian NC 2021)
Persian Web 2018 (faTenTen18)
French Web 2020 (frTenTen20)
Hebrew Web 2021 (heTenTen21)
Hungarian Web 2020 (huTenTen20)
Indonesian Web 2020 (idTenTen20)
Italian Web 2020 (itTenTen20)
Japanese Web 2011 (jaTenTen11)
Korean Web 2018 (koTenTen18)
Dutch Web 2020 (nlTenTen20)
Norwegian Web 2017 (noTenTen17, Bokmål)
Polish Web 2012 (plTenTen12)
Portuguese Web 2020 (ptTenTen20)
Romanian Web 2021 (roTenTen21)
Russian Web 2017 (ruTenTen17)
Swedish Web 2014 (svTenTen14)
Turkish Web 2020 (trTenTen20)
Ukrainian Web 2022
Vietnamese Web 2017 (viTenTen17)
Chinese Web 2017 (zhTenTen17) Traditional
Copyright 2018-2022 Lexical Computing s. r. o.