Models for download

These models were trained with fastText on corpora available in Sketch Engine, using the SkipGram model with dimension 100 unless noted otherwise in the list below.

The models are provided in two different formats. Models with the bin extension are encoded in the native binary fastText format. Models with the vec extension use the textual Word2Vec format. We recommend the bin format, as it contains the subword n-gram information, is more compact and also faster to load.
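To make the textual format concrete, here is a minimal sketch of reading a vec file with nothing but the Python standard library. It assumes the usual textual Word2Vec layout: a header line holding the vocabulary size and the dimension, followed by one word and its space-separated vector components per line. The names `read_vec` and `cosine` and the three-word toy data are illustrative only and are not part of the published models.

```python
import io
import math
from typing import Dict, Iterable, List, Optional


def read_vec(lines: Iterable[str], limit: Optional[int] = None) -> Dict[str, List[float]]:
    """Parse textual Word2Vec data: a "<count> <dim>" header, then one word per line."""
    it = iter(lines)
    _count, dim = (int(x) for x in next(it).split())
    vectors: Dict[str, List[float]] = {}
    for line in it:
        if limit is not None and len(vectors) >= limit:
            break  # stop early; useful for peeking into a very large file
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equally long vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy data standing in for a real vec file (3 words, dimension 2).
sample = io.StringIO("3 2\ncat 1.0 0.0\ndog 0.8 0.6\ncar 0.0 1.0\n")
toy = read_vec(sample)
print(cosine(toy["cat"], toy["dog"]))
```

For a real model, pass an open file instead of the `StringIO` object, for example `read_vec(open(path, encoding="utf-8"), limit=50000)`. Plain Python lists are slow and memory-hungry for billions of tokens' worth of vocabulary, so this is meant for inspection rather than production use.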

To use these models, load them in fastText, for example as pretrained vectors for supervised classification, or load them with the venerable Gensim package for other NLP tasks.
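The loading routes mentioned above might look roughly like the following sketch. It assumes the `fasttext` and `gensim` Python packages are installed; the function names and the file names in the trailing comments are placeholders, not the actual names of the downloadable files. Imports are kept inside the functions so that only the package you actually use needs to be present.

```python
def load_bin_with_fasttext(path):
    """Load a native bin model with the official fastText Python bindings."""
    import fasttext
    return fasttext.load_model(path)


def load_bin_with_gensim(path):
    """Load a bin model in Gensim, keeping the subword n-gram information."""
    from gensim.models.fasttext import load_facebook_vectors
    return load_facebook_vectors(path)


def load_vec_with_gensim(path, limit=None):
    """Load a textual vec model in Gensim; limit caps the number of words read."""
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=False, limit=limit)


def train_classifier(train_path, vec_path, dim=100):
    """Train a fastText supervised classifier initialised from a vec file.

    dim has to match the dimension of the pretrained vectors.
    """
    import fasttext
    return fasttext.train_supervised(
        input=train_path, dim=dim, pretrainedVectors=vec_path
    )


# Example calls (file names are hypothetical):
#   model = load_bin_with_fasttext("model.bin")
#   model.get_nearest_neighbors("language", k=5)
#   vectors = load_vec_with_gensim("model.vec", limit=200000)
#   vectors.most_similar("language", topn=5)
```

Note that fastText reads pretrained vectors for supervised training from the vec format, while the bin format is the one that can produce vectors for out-of-vocabulary words through its subword n-grams.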

You can browse the models using our web frontend at https://embeddings.sketchengine.eu/

Citing the models

If the models contribute to a project which leads to a publication, please acknowledge this by citing the following article:

@article{herman2021precomputed,
  title={Precomputed Word Embeddings for 15+ Languages},
  author={Herman, Ond{\v{r}}ej},
  journal={RASLAN 2021 Recent Advances in Slavonic Natural Language Processing},
  pages={41},
  year={2021}
}

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

For commercial use of the dataset, please write to inquiries@sketchengine.eu, describing the intended usage and the languages that you would like to use.

More languages to come

We continue to build models for further languages for which we have enough data in Sketch Engine. If you would like us to prioritise a language for which Sketch Engine has a large enough corpus but no embeddings yet, please send an e-mail to inquiries@sketchengine.eu.

English (Web, 2013, 20 billion tokens)

English (British National Corpus, 100 million tokens)

Early English (1 billion tokens)

Arabic (Web, 2012, 8 billion tokens)

Chinese Simplified (Web, 2011, 2 billion tokens)

Czech (Web, 2012, 5 billion tokens)

Danish (Web, 2014, 2 billion tokens)

Estonian (Web, 2017, 1.3 billion tokens)

French (Web, 2012, 11 billion tokens)

German (Web, 2013, 20 billion tokens)

Italian (Web, 2010, 3 billion tokens)

Korean (Web, 2012, 500 million tokens)

Portuguese (Web, 2011, 4 billion tokens)

Russian (Web, 2011, 18 billion tokens)

Slovak (Web, 2021)

Slovenian (Web, 2015, 1 billion tokens)

Spanish (Web, 2011, 10 billion tokens)

Spanish, American (Web, 2011, 8 billion tokens)

Spanish, European (Web, 2011, 2 billion tokens)

Ukrainian (Web, 2020, dimension 300)

Copyright 2018-2022 Lexical Computing s. r. o.