Cross-lingual learning

Most languages do not have enough training data available to create state-of-the-art models, and our ability to build intelligent systems for these languages is therefore limited as well.

Cross-lingual learning (CLL) is one possible remedy for the lack of data in low-resource languages. In essence, it is an effort to utilize annotated data from other languages when building new NLP models. In the typical CLL setting, the target languages lack resources, while the source languages are resource-rich and can be used to improve the results for the former.

Cross-lingual resources

The domain shift, i.e. the difference between source and target languages, is often quite severe. The languages might have different vocabularies, syntax or even alphabets. Various cross-lingual resources are often employed to address the gap between languages.

Here is a short overview of different resources that might be used.

Multilingual distributional representations

With multilingual word embeddings (MWE), words from multiple languages share one semantic vector space. In this space, semantically similar words are close together regardless of the language they come from.

During training, MWE usually require additional cross-lingual resources, e.g. bilingual dictionaries or parallel corpora. Multilingual sentence embeddings work on a similar principle, but they operate on sentences instead of words. Ideally, corresponding sentences should have similar representations.

Evaluation of multilingual distributional representations

Evaluation of multilingual distributional representations can be done either intrinsically or extrinsically.

  • With intrinsic evaluation, authors usually measure how well semantic similarity is reflected in the vector space, i.e. how close semantically similar words or sentences are to each other (see the retrieval sketch after this list).

  • On the other hand, extrinsic evaluation measures how useful the representations are for downstream tasks, i.e. they are evaluated on how well they perform in CLL.
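
As a concrete illustration of intrinsic evaluation, here is a minimal word-translation retrieval sketch in the style of bilingual lexicon induction: for each source word we check whether its dictionary translation is the nearest target word in the shared space. The tiny embedding tables and the dictionary below are invented purely for illustration; a real evaluation would use pre-trained MWE and a much larger test dictionary.

```python
# Minimal intrinsic evaluation sketch: word-translation retrieval.
# The embedding tables and the bilingual dictionary are invented toy data.
import numpy as np

rng = np.random.default_rng(0)
src_emb = {w: rng.normal(size=4) for w in ["dog", "cat", "house"]}   # toy "MWE" vectors
tgt_emb = {w: rng.normal(size=4) for w in ["Hund", "Katze", "Haus"]}
dictionary = {"dog": "Hund", "cat": "Katze", "house": "Haus"}        # gold translations

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

correct = 0
for src_word, gold in dictionary.items():
    # Retrieve the nearest target-language word by cosine similarity.
    nearest = max(tgt_emb, key=lambda w: cosine(src_emb[src_word], tgt_emb[w]))
    correct += nearest == gold
print(f"Precision@1: {correct / len(dictionary):.2f}")
```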

Parallel corpus

Parallel corpora are one of the most basic linguistic resources. In most cases, a sentence-aligned parallel corpus of two languages is used. Wikipedia is sometimes used as a comparable corpus (articles about the same topic rather than exact translations), although due to its complex structure it can also be used as a multilingual knowledge base.

Parallel corpora are most often created for specific domains, such as politics, religion or movie subtitles. They are also used as training sets for machine translation systems and for creating multilingual distributional representations, which makes them even more important.

Word Alignments

In some cases, sentence-level alignment in parallel corpora might not be enough, e.g. token-level tasks also need to know which words correspond to each other.

With word alignment, one word from a sentence in language $\ell_A$ can be aligned with any number of words from the corresponding sentence in language $\ell_B$. In most cases, automatic tools are used to perform word alignment over existing parallel sentences. Machine translation systems can also often provide word alignment information for the sentences they generate.
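
Automatic aligners such as fast_align typically emit alignments in the simple "i-j" (Pharaoh) format, one line per sentence pair. The sketch below, with an invented sentence pair and alignment string, shows how such a line links source tokens to target tokens.

```python
# Parse a word alignment in the "i-j" (Pharaoh) format.
# The sentence pair and the alignment string are invented toy data.
src_tokens = "the black cat".split()
tgt_tokens = "die schwarze Katze".split()
alignment_line = "0-0 1-1 2-2"          # i-th source token aligned to j-th target token

pairs = [tuple(map(int, link.split("-"))) for link in alignment_line.split()]
for i, j in pairs:
    print(f"{src_tokens[i]} <-> {tgt_tokens[j]}")
```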

Machine Translation

Machine translation (MT) can be used instead of parallel corpora to generate parallel sentences. A parallel corpus generated by MT is called a pseudo-parallel corpus. Although MT has improved greatly in recent years thanks to neural encoder-decoder models, it is still far from providing perfect translations. MT models themselves are usually trained on parallel corpora.

By using samples generated by MT systems we inevitably inject noise into our models; the domain shift between a language $\ell_A$ and what MT systems generate as language $\ell_A$ needs to be addressed.

Universal features (out of fashion)

Universal features are language independent to some extent, e.g. emojis or punctuation. They can be used as features for any language. As such, a model trained with such universal features should be easily applicable to other languages.

The process of creating language independent features for words is called delexicalization. In delexicalized text, words are replaced with universal features, such as POS tags. The lexical information of the words is lost in this process, hence the name delexicalization.
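
A minimal delexicalization sketch, assuming spaCy and its small English model en_core_web_sm are installed: every word is replaced by its universal POS tag, discarding the lexical information.

```python
# Delexicalization sketch: replace each word with its universal POS tag.
import spacy

nlp = spacy.load("en_core_web_sm")          # assumes the model has been downloaded
doc = nlp("The quick brown fox jumps over the lazy dog.")
delexicalized = " ".join(token.pos_ for token in doc)
print(delexicalized)   # e.g. "DET ADJ ADJ NOUN VERB ADP DET ADJ NOUN PUNCT"
```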

Bilingual dictionary

Bilingual dictionaries are the most widely available cross-lingual resource in our list. They exist for many language pairs and provide a very easy and natural way of connecting words from different languages. However, they are often incomplete and context-insensitive.

Pre-trained multilingual language models

Pre-trained language models are a state-of-the-art NLP technique. A large amount of text data is used to train a high-capacity language model. The parameters of this language model are then used to initialize training for different NLP tasks, where they are fine-tuned with the additional target-task data. This is a form of transfer learning with language modeling as the source task. The most well-known pre-trained language model is BERT.

Multilingual language models (MLMs) are an extension of this concept. A single language model is trained with multiple languages at the same time.

  • This can be done without any cross-lingual supervision, i.e. we feed the model text from multiple languages and do not provide any additional information about the relations between them. Such is the case of the multilingual BERT model (mBERT). Interestingly enough, even with no information about how the languages are related, the representations this model creates are partially language independent. The model is able to pick up the connections between languages without being explicitly told to do so (a small sketch of extracting such representations follows after this list). To learn more about mBERT, check my previous post about common MLMs.

  • The other case is models that work with some sort of cross-lingual supervision, i.e. they use data that help them establish a connection between different languages. Such is the case of XLM, which makes use of parallel corpora and machine translation to directly teach the model about corresponding sentences. To learn more about XLM, check my previous post about common MLMs.
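
As a small illustration of the first case, the sketch below (assuming PyTorch and the Hugging Face transformers library are installed) extracts mean-pooled sentence representations from mBERT for an English sentence and its German translation and compares them with cosine similarity; despite the lack of cross-lingual supervision, the pair tends to end up close in the representation space.

```python
# Extract sentence representations from mBERT and compare a translation pair.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

sentences = ["The cat sleeps on the sofa.", "Die Katze schläft auf dem Sofa."]
with torch.no_grad():
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state          # (2, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)       # ignore padding tokens
    sent_vecs = (hidden * mask).sum(1) / mask.sum(1)   # mean-pool over tokens

similarity = torch.nn.functional.cosine_similarity(sent_vecs[0], sent_vecs[1], dim=0)
print(f"Cosine similarity of the translation pair: {similarity.item():.3f}")
```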

Transfer learning techniques for Cross-lingual Learning

Four main categories of transfer learning techniques for CLL:

  • Label transfer: Labels or annotations are transferred between corresponding $L_S$ and $L_T$ samples.
  • Feature transfer: Similar to label transfer, but sample features are transferred instead (transfer knowledge about the features of the sample).
  • Parameter transfer: Parameter values are transferred between parametric models. This effectively transfers the behaviour of the model.
  • Representation transfer: The expected values for hidden representation are transferred between models. The target model is taught to create desired representations.

Note: Representation transfer is similar to feature transfer. However, instead of simply transferring the features, it teaches the $L_T$ model to create them.

Label transfer

Label transfer means transferring labels between samples from different languages. First, a correspondence is established between $L_S$ and $L_T$ samples; correspondence means that the pair of samples should have the same label. Then the labels are transferred from one language to the other – this step is also called annotation projection. The projected labels can then be used for further training.

There are three distinct types of label transfer; they differ in the language the resulting model takes as input.

Transferring labels during training (To train an $L_T$ model)

To create an $L_T$ model with label transfer we need an existing annotated dataset or model in $L_S$. Then we establish a correspondence between annotated $L_S$ samples and unannotated $L_T$ samples. The annotations are projected to $L_T$, and the resulting distantly supervised $L_T$ dataset is used to train an $L_T$ model.
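
For a sequence-labeling task, annotation projection can be as simple as copying each source token's label to its aligned target tokens. The sketch below uses an invented sentence pair, NER tags and word alignment; unaligned target tokens fall back to the "O" tag.

```python
# Annotation projection sketch: copy token labels across a word alignment.
# The sentence pair, tags and alignment are invented toy data.
src_tokens = ["John", "lives", "in", "Berlin"]
src_tags   = ["B-PER", "O", "O", "B-LOC"]
tgt_tokens = ["John", "wohnt", "in", "Berlin"]
alignment  = [(0, 0), (1, 1), (2, 2), (3, 3)]    # (source index, target index) pairs

projected = ["O"] * len(tgt_tokens)              # unaligned tokens stay "O"
for src_idx, tgt_idx in alignment:
    projected[tgt_idx] = src_tags[src_idx]
print(list(zip(tgt_tokens, projected)))
```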

When machine translation is used to generate the corresponding samples, we can:

  • either take existing $L_S$ samples and translate them into $L_T$ along with their annotations,
  • or take unannotated $L_T$ data, translate them into $L_S$, annotate the translated $L_S$ samples, and finally project the annotations back to the original samples.

The former is the more frequent alternative.

Transferring labels during inference (Use an existing $L_S$ model for inference)

For an unannotated $L_T$ sample, a corresponding $L_S$ sample is generated (e.g. by MT), and a pre-existing $L_S$ model is used to infer its label. This label is then projected back to the original $L_T$ sample. This approach is quite slow and prone to errors.
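
A minimal sketch of this pipeline using Hugging Face pipelines; the checkpoint names are only examples of publicly available models and would need to be swapped for your language pair and task.

```python
# Label transfer during inference: translate the L_T sample into L_S,
# classify it with a pre-existing L_S model, and keep the label.
from transformers import pipeline

translate = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")
classify = pipeline("sentiment-analysis")           # pre-existing L_S (English) model

tgt_sample = "Der Film war wunderbar."              # unannotated L_T sample
src_sample = translate(tgt_sample)[0]["translation_text"]
label = classify(src_sample)[0]["label"]            # inferred on the L_S side
print(f"Projected label for the original German sample: {label}")
```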

Parallel Model

This is the least frequent type of label transfer, so I skip it here.

Common properties of label transfer techniques

  1. All label-based transfer techniques require a process to generate corresponding samples. The two main approaches are machine translation and parallel corpora.
  • Machine translation. The biggest disadvantage is that MT systems are still not perfect and only generate very specific dialects of the output languages. This shift between the natural language and the generated language is a source of noise in CLL.
  • Parallel corpora. Parallel corpora can be used to avoid the problem of noisy translation. In a parallel corpus, both sides are written in natural language. We can then use an existing model to annotate the $L_S$ side of the corpus and project the labels to the other half. However, the annotations from the existing model are a source of noise as well: NLP models usually have limited accuracy and some samples will be mislabeled. Manually labeled parallel corpora exist, but they are very rare.
  2. Cross-lingual projection of the labels is the step where the transfer of knowledge between languages happens for label-based approaches.

  3. The $L_S$ model can be trained with additional data translated from $L_T$. This can improve the results during inference, since the model has already been exposed to the translated data during training and does not suffer from the domain shift as much.

  4. Label-based transfer is notoriously noisy. It consists of several steps, and each step can be a source of noise, e.g. imperfect machine translations, imperfect word alignments, imperfect pre-existing models, domain shift between parallel corpora and testing data, etc.

Feature Transfer

This is not a frequent type of transfer, so I skip it.

Parameter transfer

In parameter transfer, the behaviour of the model is transferred. The transfer of knowledge happens only in the shared layers. The most important technologies for parameter transfer are language independent representations and pre-trained multilingual language models. There are three scenarios for parameter transfer.

Zero-shot transfer

No $L_T$ data are used during training. We train the model with $L_S$ data only and then apply it to other languages. This is sometimes called direct transfer or model transfer. It is most often used together with language independent representations (e.g. MWE or MLMs; contextualized word embeddings from mBERT are an example of MWE).
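
Below is a minimal zero-shot transfer sketch, assuming PyTorch and the Hugging Face transformers library; the tiny English training set and the label scheme are invented for illustration. A multilingual encoder is fine-tuned on English ($L_S$) sentiment data only and then applied directly to a German ($L_T$) sentence.

```python
# Zero-shot (direct) transfer sketch: fine-tune on L_S only, apply to L_T.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tiny invented English training set: (text, label), 1 = positive, 0 = negative.
train_data = [("The movie was wonderful.", 1), ("A complete waste of time.", 0)]

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):
    for text, label in train_data:
        batch = tokenizer(text, return_tensors="pt", truncation=True)
        loss = model(**batch, labels=torch.tensor([label])).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# The model never saw German during fine-tuning, but the multilingual
# representations let us apply it to a German sentence directly.
model.eval()
with torch.no_grad():
    batch = tokenizer("Der Film war wunderbar.", return_tensors="pt")
    pred = model(**batch).logits.argmax(dim=-1).item()
print("Predicted label for the German sentence:", pred)
```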

Joint learning

Both $L_S$ and $L_T$ data are used during the training, and they are used at the same time. The $L_T$ and $L_S$ models share a subset of their parameters, i.e. when one model updates its parameters, it affects the other model(s) as well. This technique of working with parameters is called parameter sharing.

There are two strategies of training:

  • Mixed dataset, which can be applied only when all the parameters are shared. With this strategy, the training samples from all languages are mixed into one training set, so a single batch can contain samples from multiple languages.

  • Alternate training, which can be applied even when only a subset of parameters is shared. With this strategy, each batch is sampled from one language only. The batch is then run through that language's model, and the parameter update propagates to the other models that share some of the parameters.

Parameter sharing strategies (for each layer):

  • Shared. The parameters are shared between multiple languages.
  • Partially-private. Part of the layer is shared, while the other part is private. The most common way to implement this strategy is to have two distinct parallel layers, one private and one shared, and to concatenate their outputs.
  • Private. The parameters are specific for one language.

The sharing of parameters can be realized in two ways:

  • Hard sharing. With hard sharing, the shared layers are exactly the same.
  • Soft sharing. Parameters can also be bound by a regularization term instead, e.g. by adding an additional term $\left\|\theta_{S}-\theta_{T}\right\|_{2}^{2}$ to the loss, where $\theta_{S}$ and $\theta_{T}$ are the shared parameters of the source model and the target model, respectively.
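
A minimal sketch of soft sharing, assuming PyTorch; the toy encoders and the placeholder task loss are invented. Instead of literally sharing layers, the source and target encoders are kept close by adding the L2 penalty to the task loss.

```python
# Soft parameter sharing sketch: regularize ||theta_S - theta_T||_2^2.
import torch
import torch.nn as nn

enc_src = nn.Linear(300, 128)   # toy source-language encoder
enc_tgt = nn.Linear(300, 128)   # toy target-language encoder

def soft_sharing_penalty(model_a, model_b):
    """Sum of squared differences between corresponding parameters."""
    return sum(((pa - pb) ** 2).sum()
               for pa, pb in zip(model_a.parameters(), model_b.parameters()))

x = torch.randn(4, 300)                  # invented target-language batch
task_loss = enc_tgt(x).pow(2).mean()     # placeholder for a real task loss
lam = 0.1                                # regularization strength
loss = task_loss + lam * soft_sharing_penalty(enc_src, enc_tgt)
loss.backward()
```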

Three different parameter sharing strategies:

a) A mixed batch with samples from both languages ($X_{ST}$) is processed by the model. All the parameters are shared between $L_S$ and $L_T$, and the loss function $J_{ST}$ is used for these batches.

b) Alternate training is used with all parameters shared; each batch is created in either $L_S$ or $L_T$. The loss function $J_{ST}$ is still the same in both cases.

c) Alternate training with only a subset of parameters shared. In this example, the second layer is language-specific (private), while the other layers are shared. The loss function is calculated differently for each language.
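
Case (c) can be sketched as follows, assuming PyTorch; the feature vectors, dimensions and datasets are invented. A shared encoder is combined with a private classification head per language, and batches alternate between the two languages.

```python
# Alternate training with partial parameter sharing: shared encoder, private heads.
import torch
import torch.nn as nn

shared_encoder = nn.Sequential(nn.Linear(300, 128), nn.ReLU())  # shared parameters
heads = {"src": nn.Linear(128, 2), "tgt": nn.Linear(128, 2)}    # private parameters

params = list(shared_encoder.parameters()) + [p for h in heads.values() for p in h.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Invented toy batches: 300-dim feature vectors with binary labels.
batches = {
    "src": (torch.randn(8, 300), torch.randint(0, 2, (8,))),
    "tgt": (torch.randn(8, 300), torch.randint(0, 2, (8,))),
}

for step in range(100):
    lang = "src" if step % 2 == 0 else "tgt"   # alternate languages between batches
    x, y = batches[lang]
    logits = heads[lang](shared_encoder(x))    # private head on top of the shared encoder
    loss = loss_fn(logits, y)
    optimizer.zero_grad()
    loss.backward()                            # updates the shared encoder and this language's head
    optimizer.step()
```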

Cascade learning

Both $L_S$ and $L_T$ data are used during training, but not at the same time. Instead, we pre-train a model with $L_S$ data (or simply take an existing $L_S$ model) and then fine-tune it with $L_T$ data.

Pre-trained multilingual language models

Pre-trained multilingual language models can be used for parameter-based transfer as well.

  1. We can use them as a source of multilingual distributional representations and then build a model on top of these representations.
  2. We can use the MLM as an initialization of a multilingual model. Then we fine-tune the parameters to perform a target task.

Multilingual language models were able to achieve state-of-the-art results recently and they might become the predominant cross-lingual learning paradigm in the near future.

Representation transfer

With representation transfer, the knowledge about what the hidden representations within a model should look like is transferred. It is a technique that extends other approaches and is typically used with deep models that compute hidden representations. It usually relies on corresponding samples or words from different languages, i.e. we usually need a parallel corpus or a bilingual dictionary.
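
A minimal sketch, assuming PyTorch and an invented "parallel" batch of sentence features: the $L_T$ student encoder is trained so that its representation of a target-language sentence matches the frozen $L_S$ teacher's representation of the corresponding source sentence.

```python
# Representation transfer sketch: teach the L_T encoder to mimic L_S representations.
import torch
import torch.nn as nn

teacher = nn.Linear(300, 128)   # stands in for a pre-trained L_S encoder (kept frozen)
student = nn.Linear(300, 128)   # L_T encoder being trained
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

# One invented "parallel" batch: corresponding source and target sentence features.
x_src = torch.randn(8, 300)
x_tgt = torch.randn(8, 300)

with torch.no_grad():
    target_repr = teacher(x_src)            # desired hidden representations
loss = nn.functional.mse_loss(student(x_tgt), target_repr)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```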

Future Directions

  • Multilingual datasets. We consider the lack of multilingual datasets to be the biggest current challenge for CLL. We believe it is important to provide standardized multilingual variants of datasets in the future, for better reproducibility in these cross-lingual settings as well.

  • Standardization of linguistic resources. It is hard to compare various resources when the corpora they are trained on differ. It is then hard to distinguish whether a performance improvement comes from a better method or from a better training corpus.

  • Pre-trained multilingual language models. Training, fine-tuning and inference are all costly for MLMs. We expect to see methods that optimize the training and inference of these behemoths. As of today, simple $L_S$ fine-tuning might lead to catastrophic forgetting, including forgetting of cross-lingual knowledge.

  • Truly low-resource languages and excluded languages. CLL methods often rely on the existence of various linguistic resources, such as parallel corpora or MT systems. However, truly low-resource languages might not have these resources in such quantity and/or quality. The fact that methods are currently evaluated on resource-rich languages might then create unrealistic expectations about how well the methods would work on truly low-resource languages.

  • Curse of multilinguality. Researchers report that adding more languages helps initially, but after a certain number the performance of the models actually starts to degrade. This is the curse of multilinguality, the apparent inability of models to learn too many languages at once. It is caused by the limited capacity of current parametric models. It can be addressed by increasing the capacity of the models, but the models are already costly to train and increasing their size further yields diminishing returns.

  • Combination with multitask learning. We believe that a combination of cross-lingual and cross-task supervision might lead to more universal models that are able to solve multiple tasks in multiple languages and also generalize their broad knowledge to new tasks and new languages more easily.

  • Machine translation. One open question is how to mitigate the domain shift between natural languages and languages generated by MT systems.

