Cross-Lingual Learning

Most languages do not have the training data needed to build state-of-the-art models, so our ability to create intelligent systems for these languages is limited as well. Cross-lingual learning (CLL) is one possible remedy for the lack of data in low-resource languages. In essence, it is an effort to utilize annotated data from other languages when building new NLP models. In a typical CLL setting, the target languages lack resources, while the source languages are resource-rich and can be used to improve results for the former....
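As a concrete illustration (my own sketch, not code from the post), a common CLL recipe is zero-shot transfer: fine-tune a multilingual encoder on labeled data in a resource-rich source language, then apply it unchanged to the target language. The sketch below assumes HuggingFace `transformers`, the `bert-base-multilingual-cased` checkpoint, and a binary classification task; the source-language fine-tuning loop is omitted.

```python
# Hedged sketch of zero-shot cross-lingual transfer (illustrative only).
# Assumptions: HuggingFace transformers, bert-base-multilingual-cased,
# and a two-class task; the fine-tuning step is left out.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Step 1 (omitted): fine-tune `model` on annotated source-language data,
# e.g. English sentiment labels, with any standard training loop.

# Step 2: run the fine-tuned model directly on a low-resource target language.
target_sentence = "To jest świetny film."  # Polish: "This is a great movie."
inputs = tokenizer(target_sentence, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)  # class probabilities; only meaningful after fine-tuning
```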

December 28, 2020 · 13 min

Multi-lingual: M-Bert, LASER, MultiFiT, XLM

Multilingual models are machine learning models that can understand several languages. In this post, I’m going to discuss four common multilingual language models: Multilingual BERT (M-BERT), Language-Agnostic SEntence Representations (LASER embeddings), Efficient multi-lingual language model fine-tuning (MultiFiT), and the Cross-lingual Language Model (XLM). Ways of tokenization: word-based tokenization works well for morphologically poor English, but results in very large and sparse vocabularies for morphologically rich languages, such as Polish and Turkish....
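A rough sketch of the vocabulary problem the excerpt alludes to (my own illustration, not code from the post): word-level tokenization treats every inflected Turkish form as its own vocabulary entry, whereas a subword tokenizer such as M-BERT’s WordPiece reuses a small set of pieces. The checkpoint name below is an assumption, chosen only because it ships a WordPiece vocabulary.

```python
# Hedged illustration of word-level vs. subword tokenization (not from the post).
# Assumes HuggingFace transformers and the bert-base-multilingual-cased vocabulary.
from transformers import AutoTokenizer

# One Turkish word built from several morphemes: "from our houses".
word = "evlerimizden"

# Word-based tokenization: every such inflected form needs its own vocabulary
# entry, which makes the vocabulary huge and sparse for Polish, Turkish, etc.
print(word.split())  # ['evlerimizden']

# Subword (WordPiece) tokenization, as used by M-BERT: the form is broken into
# reusable pieces, so the vocabulary stays compact across many word forms.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tokenizer.tokenize(word))  # e.g. pieces like ['ev', '##ler', ...]; exact split depends on the vocabulary
```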

August 8, 2020 · 15 min