Pre-trained Word Embeddings

By Ying Lin

Pre-trained word embeddings used in A Multi-lingual Multi-task Architecture for Low-resource Sequence Labeling [1].

Mono-lingual Word Embeddings

Mono-lingual word embeddings are trained using the word2vec package.

Update: I added case-sensitive English, Dutch, and Spanish word embeddings to the following table. Case-sensitive word embeddings may provide better performance. All experiments in the paper use lower-case word embeddings.
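For reference, here is a minimal sketch of how such embeddings can be trained with gensim's Word2Vec, a reimplementation of the original word2vec tool. The corpus path, output name, and all hyperparameters are assumptions; the post does not specify the exact settings used.

```python
# Minimal sketch: training word2vec embeddings with gensim (>= 4.0).
# The corpus path, output file, and hyperparameters below are assumptions;
# the released embeddings were trained with the original word2vec package.
from gensim.models import Word2Vec

class SentenceStream:
    """Stream a one-sentence-per-line corpus as token lists, optionally
    lower-casing (the paper's experiments use lower-case embeddings)."""
    def __init__(self, path, lowercase=True):
        self.path = path
        self.lowercase = lowercase

    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for line in f:
                yield line.lower().split() if self.lowercase else line.split()

sentences = SentenceStream('enwiki-20171220.txt')  # hypothetical extracted dump
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)
model.wv.save_word2vec_format('enwiki.100d.txt')   # hypothetical output name
```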

| Language | Dimension | Corpus | Link |
|----------|-----------|--------|------|
| English | 50 | English Wikipedia (2017-12-20) | Download |
| English | 100 | English Wikipedia (2017-12-20) | Download |
| English (Case-sensitive) | 100 | English Wikipedia (2017-12-20) | Download |
| Dutch | 50 | Dutch Wikipedia (2017-12-20) | Download |
| Dutch | 100 | Dutch Wikipedia (2017-12-20) | Download |
| Dutch (Case-sensitive) | 100 | Dutch Wikipedia (2017-12-20) | Download |
| Spanish | 50 | Spanish Wikipedia (2017-12-20) | Download |
| Spanish | 100 | Spanish Wikipedia (2017-12-20) | Download |
| Spanish (Case-sensitive) | 100 | Spanish Wikipedia (2017-12-20) | Download |
| Russian | 50 | LDC2016E95 | Download |
| Chechen | 50 | TAC KBP 2017 10-Language EDL Pilot Evaluation Source Corpus | Download |
| Chechen (Latin) | 50 | TAC KBP 2017 10-Language EDL Pilot Evaluation Source Corpus | Download |
| Russian & Chechen | 50 | LDC2016E95, TAC KBP 2017 | Download |
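Assuming the downloaded files are in the standard word2vec text format (first line: vocabulary size and dimension), they should load with gensim's KeyedVectors; the file name below is hypothetical.

```python
# Sketch: loading a downloaded embedding file. Assumes the standard
# word2vec text format; the file name is hypothetical.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format('enwiki.100d.txt', binary=False)
print(vectors['amsterdam'][:5])                   # first 5 dimensions (lower-case vocab)
print(vectors.most_similar('amsterdam', topn=3))  # nearest neighbors by cosine
```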

Cross-lingual Word Embeddings

We align English and Dutch (Pair I) and English and Spanish (Pair II) mono-lingual word embeddings with the MUSE package. Because we project the English embedding matrix into the Dutch/Spanish space, the English embeddings in Pair I and Pair II differ. In theory, the Dutch/Spanish mono-lingual and cross-lingual embeddings should be identical; in practice, the released files store values at slightly different precisions (e.g., mono-lingual de: [-0.301450 -0.659255 0.742733 ...] vs. cross-lingual de: [-0.30145 -0.65926 0.74273 ...]).

We also did not map Chechen embeddings to English or Russian because of the small Chechen vocabulary (7,780 words) and the lower quality of its embeddings.
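MUSE learns a linear mapping between the two embedding spaces; in its supervised mode this reduces to an orthogonal Procrustes problem over a bilingual dictionary. The numpy sketch below illustrates only that step, not the MUSE package's actual API, and the dictionary matrices are randomly generated stand-ins.

```python
# Sketch of the orthogonal Procrustes step behind supervised alignment:
# given row-aligned matrices X (English) and Y (Dutch) of vectors for
# dictionary word pairs, find the orthogonal W minimizing ||X @ W - Y||_F.
# Illustration only; not the MUSE package's API.
import numpy as np

def procrustes(X, Y):
    """Return the orthogonal matrix W such that X @ W best approximates Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)  # SVD of the cross-covariance matrix
    return U @ Vt                      # Schoenemann's closed-form solution

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 50))  # stand-in English vectors for dictionary pairs
Y = rng.normal(size=(5000, 50))  # stand-in Dutch vectors for the same pairs
W = procrustes(X, Y)
X_aligned = X @ W                # English embeddings projected into the Dutch space
```

Because W is orthogonal, the mapping preserves distances within the English space while moving it into the Dutch coordinate system.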

| Pair | Language | Dimension | Corpus | Link |
|------|----------|-----------|--------|------|
| I | English | 50 | English Wikipedia (2017-12-20) | Download |
| I | Dutch | 50 | Dutch Wikipedia (2017-12-20) | Download |
| II | English | 50 | English Wikipedia (2017-12-20) | Download |
| II | Spanish | 50 | Spanish Wikipedia (2017-12-20) | Download |
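After alignment, both languages share one vector space, so cross-lingual neighbors can be found directly by cosine similarity. A sketch, assuming both aligned files are in word2vec text format; the file names and query word are hypothetical.

```python
# Sketch: cross-lingual nearest-neighbor lookup with a pair of aligned
# embeddings. File names are hypothetical; assumes word2vec text format.
from gensim.models import KeyedVectors

en = KeyedVectors.load_word2vec_format('pair1.en.50d.txt', binary=False)
nl = KeyedVectors.load_word2vec_format('pair1.nl.50d.txt', binary=False)

def nearest_dutch(word, topn=3):
    """Dutch words closest (by cosine) to an English word's aligned vector."""
    return nl.similar_by_vector(en[word], topn=topn)

print(nearest_dutch('city'))
```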

Reference

[1] Lin, Y., Yang, S., Stoyanov, V., & Ji, H. (2018). A Multi-lingual Multi-task Architecture for Low-resource Sequence Labeling. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018). [pdf]

@inproceedings{ying2018multi,
    title     = {A Multi-lingual Multi-task Architecture for Low-resource Sequence Labeling},
    author    = {Ying Lin and Shengqi Yang and Veselin Stoyanov and Heng Ji},
    booktitle = {Proceedings of The 56th Annual Meeting of the Association for Computational Linguistics (ACL2018)},
    year      = {2018}
}
