The Polyglot Model: Natural Language Processing in a Multilingual World
Some questions are best not asked of a linguist, such as “How many languages do you speak?” or “How many languages are there in the world?” The latter question, seemingly simple, is the subject of heated debate among language scholars. Current estimates put the number of languages spoken in the world at around 6,500. Some regions, like Oceania, concentrate enormous linguistic variation: in Papua New Guinea alone, some 840 languages are spoken, an impressive display of diversity and richness.
If, instead of a linguist, we talk to someone who works in Natural Language Processing, things change. The development of techniques for automatically analyzing and processing human text is restricted to a handful of languages. In fact, the vast majority of techniques, and especially the impressive advances of recent years, focus on English. We have reached the point where algorithms can produce opinion pieces that differ little or not at all from what a journalist might have written, but always in English.
The reasons behind this reality fall into two groups. On the one hand, there is the general orientation of the scientific community: research is mostly published in English, and until recently the Natural Language Processing field concentrated on that language. On the other hand, there is a more technical reason: the availability of data. Modern techniques use huge amounts of text to train machine learning models, and such data is hard to come by in other languages.
Data, datos, dati, données.
Depending on the type of model we are training, and the kind of training it needs, the difficulty of obtaining data varies. One type of model, called “supervised,” requires annotated examples; that is, examples for which the correct prediction is already given. For instance, imagine an algorithm that identifies whether a text has a positive or negative connotation. Such a model is built by collecting many examples (such as “I loved the hamburger” → Positive / “They have the worst fries I have ever eaten in my life” → Negative) and showing them to the model so that it learns what makes a text positive or negative. These examples have to be annotated carefully for the model to learn well. They are generally produced by humans, and that is a slow and expensive process. That is why they are usually created in a single language, which is usually English.
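The supervised setup above can be sketched in a few lines. This is a deliberately tiny, toy classifier (simple word counts, invented training sentences), not a real sentiment model, but it shows the shape of the task: labeled pairs go in, a prediction function comes out.

```python
from collections import Counter

# Human-annotated examples: (text, label) pairs, as described above.
# These four sentences are invented for illustration.
train = [
    ("I loved the hamburger", "positive"),
    ("They have the worst fries I ever ate in my life", "negative"),
    ("great service and great food", "positive"),
    ("the soup was awful", "negative"),
]

def train_counts(examples):
    """Count how often each word appears under each label."""
    counts = {"positive": Counter(), "negative": Counter()}
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts

def predict(counts, text):
    """Label a text by which class its words appeared under more often."""
    words = text.lower().split()
    pos = sum(counts["positive"][w] for w in words)
    neg = sum(counts["negative"][w] for w in words)
    return "positive" if pos >= neg else "negative"

counts = train_counts(train)
print(predict(counts, "I loved the food"))  # → positive
```

A real system would use far more data and a proper learning algorithm, but the bottleneck is the same: every pair in `train` had to be labeled by a person.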
Another type of model learns from text without annotations. This is the kind of learning used in the famous language models that have revolutionized NLP in recent years. Simply put, these models see a lot of text and learn a general idea of how the language works, which can then be used to perform different tasks. A priori, this situation may seem more convenient for gathering data in various languages, since nothing needs to be annotated manually. However, this type of training can deepen the gap between languages. When we say that models see “a lot of text,” we mean quantities like all of Wikipedia. Clearly, these algorithms will not work as well for a language like Guaraní, whose Wikipedia has 3,850 articles, compared to 6,138,396 for English.
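To make “learning from raw text” concrete, here is a minimal sketch of the idea using a bigram model: no labels are needed, the model just counts which word tends to follow which. The corpus is a made-up sentence pair; real language models learn far richer statistics, but from the same kind of unannotated input.

```python
from collections import defaultdict, Counter

# Unannotated "corpus" (invented for illustration) — no labels, just text.
corpus = "the cat sat on the mat . the dog sat on the rug ."
tokens = corpus.split()

# Count which word follows which: the simplest possible language model.
bigrams = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    bigrams[prev][nxt] += 1

def most_likely_next(word):
    """Predict the most frequent continuation seen in the corpus."""
    return bigrams[word].most_common(1)[0][0]

print(most_likely_next("sat"))  # → on
```

The quality of such statistics depends entirely on how much text exists, which is exactly why a 3,850-article Wikipedia yields a much weaker model than a 6-million-article one.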
What can I do if I need algorithms in other languages?
If the development of natural language processing algorithms continues as it appears to be going, and its techniques become commonplace in different fields, the tools will necessarily have to be multilingual. In the case of IOMED, for example, it is crucial to be able to analyze text in the languages in which doctors write. So what can be done? We enter the magical world of cross-lingual processing: taking advantage of resources that exist in one language to build solutions in another. And although it seems like magic, it can be done, even though it is a world that is only beginning to be explored.
One of the best-known techniques for doing this is to start with models that understand both languages. To see how, it helps to understand how text-processing models work; more specifically, how a word is encoded so that a model can understand and process it. One of the most widespread techniques consists of modeling the vocabulary of a language as a matrix, where each word is assigned a vector that represents how it relates to the others. Suppose, then, that on the one hand we have a matrix for language A and a model already developed for that same language, and on the other hand we have a matrix for language B. To take advantage of the model for A, we can “align” the two matrices, placing their vectors in the same space, and use them as if they belonged to a single language. Clearly, this works better the closer the languages are and the simpler the model's task.
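A common way to do this alignment is orthogonal Procrustes: given a small seed dictionary of known translation pairs, find the rotation that best maps language A's vectors onto language B's. The sketch below uses tiny invented 2-D embeddings (real ones have hundreds of dimensions), and language B's space is simulated as a rotated copy of A's, so the method's assumptions hold exactly.

```python
import numpy as np

# Toy language-A embeddings (invented numbers, for illustration).
emb_a = {"dog": np.array([1.0, 0.0]),
         "cat": np.array([0.9, 0.1]),
         "car": np.array([0.0, 1.0])}

# Simulate language B's embedding space as a rotated copy of A's.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
emb_b = {w: R @ v for w, v in emb_a.items()}

# Seed dictionary of known translation pairs (identity pairs in this toy).
pairs = [("dog", "dog"), ("car", "car")]
X = np.stack([emb_a[a] for a, _ in pairs])
Y = np.stack([emb_b[b] for _, b in pairs])

# Orthogonal Procrustes: the rotation W minimizing ||XW - Y|| is U @ Vt,
# where U, S, Vt is the SVD of X.T @ Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

def translate(word):
    """Map a language-A vector into B's space; return the nearest B word."""
    v = emb_a[word] @ W
    return max(emb_b, key=lambda w: float(v @ emb_b[w]) /
               (np.linalg.norm(v) * np.linalg.norm(emb_b[w])))

print(translate("cat"))  # → cat
```

Note that “cat” was not in the seed dictionary: the rotation learned from “dog” and “car” transfers to it, which is precisely what makes this trick useful for whole vocabularies.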
Another option is to translate the training data. If we have a lot of annotated data in language A and want to train a model for language B, we can use a machine translation system to generate a new dataset. Again, the effectiveness of this method depends on the quality of the translation tools available, which are usually very good for English and similar languages but scarce for others.
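The translate-the-data recipe looks roughly like this. In place of a real machine translation system, the sketch uses a hypothetical word-for-word dictionary (the `EN_TO_ES` mapping is invented); the key point is that the labels are carried over unchanged while only the text is translated.

```python
# Hypothetical English→Spanish dictionary standing in for a real MT system.
EN_TO_ES = {"i": "yo", "loved": "amé", "the": "la", "hamburger": "hamburguesa"}

def translate_sentence(text):
    """Word-for-word translation; unknown words pass through unchanged."""
    return " ".join(EN_TO_ES.get(w, w) for w in text.lower().split())

# Annotated data in language A (English), as in the supervised setting.
annotated_en = [("I loved the hamburger", "positive")]

# Generate a language-B dataset: translate the text, keep the label.
annotated_es = [(translate_sentence(text), label)
                for text, label in annotated_en]
print(annotated_es)  # [('yo amé la hamburguesa', 'positive')]
```

With a real translation system the output is noisier, so errors in the translator become errors in the training data; this is why the method's ceiling is the quality of the available MT tools.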
What if we have neither translators nor matrices for our language? It is not a simple problem; for 85% of the languages spoken in the world, these resources are practically non-existent. There are, however, some very interesting experimental proposals. One direction consists of building models for specific tasks that learn by seeing only elements common to several languages. Another alternative is to develop systems that generate text in low-resource languages in order to build corpora automatically.
It is true that there is still a lot of work to be done, but perhaps in a few years we will see truly polyglot models.