Models for download

These models were trained with fastText on corpora available in Sketch Engine, using the SkipGram model with dimension 100.

The models are provided in two formats. Models with the .bin extension use the native binary fastText format; models with the .vec extension use the textual Word2Vec format. We recommend the .bin format: it contains the subword n-gram information, is more compact, and loads faster.
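For illustration, the textual .vec format can be read without any external libraries: the first line gives the vocabulary size and the vector dimension, and each following line holds a word and its components. The snippet below is a minimal stdlib-only sketch; the parsed sample is a tiny in-memory stand-in for a real downloaded model file.

```python
# Minimal reader for the textual Word2Vec (.vec) format, stdlib only.
# The first line is "<vocab_size> <dimension>"; every further line is
# a word followed by <dimension> floats separated by spaces.
import io

def read_vec(stream):
    """Parse a .vec text stream into {word: [float, ...]}."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        assert len(values) == dim, f"bad row for {word!r}"
        vectors[word] = values
    assert len(vectors) == vocab_size, "header disagrees with row count"
    return vectors

# Tiny in-memory example standing in for a real .vec model file.
sample = "2 3\ncat 0.1 0.2 0.3\ndog 0.2 0.1 0.4\n"
vecs = read_vec(io.StringIO(sample))
print(len(vecs["cat"]))  # 3
```

For real models, pass an open file handle instead of the in-memory sample; the .bin format, by contrast, is binary and should be loaded through the fastText library itself.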

To use these models, load them in fastText for supervised classification, or use them with the Gensim package for any other NLP task.
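As a rough sketch of what such a task looks like once the vectors are in memory, the following stdlib-only example ranks words by cosine similarity, the usual way nearest neighbours are found in an embedding space. The three hand-made vectors are illustrative placeholders, not values from any of the models below.

```python
# Nearest-neighbour lookup by cosine similarity, stdlib only.
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest(word, vectors, topn=1):
    """Return the topn words most similar to `word`, excluding itself."""
    query = vectors[word]
    scored = [(other, cosine(query, v))
              for other, v in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Toy placeholder vectors; a real model would supply dimension-100 rows.
vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "car": [0.0, 0.1, 0.9],
}
print(nearest("cat", vectors))  # "dog" ranks first
```

Gensim's KeyedVectors offers the same kind of lookup (via most_similar) directly on a loaded .vec file, with an optimised implementation.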

You can browse the models using our web frontend at https://embeddings.sketchengine.eu/

Citing the models

If the models contribute to a project that leads to a publication, please acknowledge this by citing the following article:

@article{herman2021precomputed,
  title={Precomputed Word Embeddings for 15+ Languages},
  author={Herman, Ond{\v{r}}ej},
  journal={RASLAN 2021: Recent Advances in Slavonic Natural Language Processing},
  pages={41},
  year={2021}
}

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

For commercial use of the dataset, please send an e-mail to inquiries@sketchengine.eu describing the intended usage and the languages you would like to use.

More languages to come

We continue to build models for further languages for which there is enough data in Sketch Engine. If you would like us to prioritise a language for which Sketch Engine has a large enough corpus but no embeddings yet, please send an e-mail to inquiries@sketchengine.eu.

English (Web, 2013, 20 billion tokens)

English (British National Corpus, 100 million tokens)

Early English (1 billion tokens)

Arabic (Web, 2012, 8 billion tokens)

Chinese Simplified (Web, 2011, 2 billion tokens)

Czech (Web, 2012, 5 billion tokens)

Danish (Web, 2014, 2 billion tokens)

Estonian (Web, 2017, 1.3 billion tokens)

French (Web, 2012, 11 billion tokens)

German (Web, 2013, 20 billion tokens)

Italian (Web, 2010, 3 billion tokens)

Korean (Web, 2012, 500 million tokens)

Portuguese (Web, 2011, 4 billion tokens)

Russian (Web, 2011, 18 billion tokens)

Slovak (Web, 2021)

Slovenian (Web, 2015, 1 billion tokens)

Spanish (Web, 2011, 10 billion tokens)

Spanish, American (Web, 2011, 8 billion tokens)

Spanish, European (Web, 2011, 2 billion tokens)

Ukrainian (Web, 2020, dimension 300)

Arabic Web 2018 (arTenTen18)

Czech Web (csTenTen 12+17+19)

Danish Web 2020 (daTenTen20)

German Web 2020 (deTenTen20)

Greek Web 2019 (elTenTen19)

Timestamped JSI web corpus English 2014-2022

English Web 2021 (enTenTen21)

Spanish Web 2018 (esTenTen18)

Estonian National Corpus 2021 (Estonian NC 2021)

Persian Web 2018 (faTenTen18)

French Web 2020 (frTenTen20)

Hebrew Web 2021 (heTenTen21)

Hungarian Web 2020 (huTenTen20)

Indonesian Web 2020 (idTenTen20)

Italian Web 2020 (itTenTen20)

Japanese Web 2011 (jaTenTen11)

Korean Web 2018 (koTenTen18)

Dutch Web 2020 (nlTenTen20)

Norwegian Web 2017 (noTenTen17, Bokmål)

Polish Web 2012 (plTenTen12)

Portuguese Web 2020 (ptTenTen20)

Romanian Web 2021 (roTenTen21)

Russian Web 2017 (ruTenTen17)

Swedish Web 2014 (svTenTen14)

Turkish Web 2020 (trTenTen20)

Ukrainian Web 2022

Vietnamese Web 2017 (viTenTen17)

Chinese Web 2017 (zhTenTen17) Traditional

Copyright 2018-2022 Lexical Computing s. r. o.