With the growth of large language models, natural language processing has been revolutionized. Many LLMs, like GPT-3.5, LLaMA, and Mixtral, came up last year, which helped tackle diverse language tasks. Even though there are many such LLMs now, open-source models have no reliable models for translation tasks. Thorough research has been done to tackle this challenge.
Consequently, a collaboration between the researchers of Unbabel, the SARDINE Lab at Instituto Superior Técnico, and the researchers of the MICS lab at CentraleSupélec, University of Paris-Saclay, has created a new multilingual model Tower. This Llama 2-based multilingual LLM has 7B parameters specifically designed for translation-related tasks. The main highlight of this model is that, unlike other open-source models, which are predominantly built with English data, Tower supports 10 languages. These languages are English, German, French, Spanish, Chinese, Portuguese, Italian, Russian, Korean, and Dutch.
In addition to multilingual translation, it also has capabilities for pre-translation activities, like grammar improvement, to translation assessment jobs, like machine translation and automatic post-editing. The researchers of this collaboration found that this model performed better than the state-of-the-art counterparts in translation and better than alternative open-source solutions, including ALMA 13B and LLaMA-2 70B.
The researchers used two stages to formulate Tower: extended pre-training and instruction tuning. The researchers emphasized that they used continued pre-training as it enhances LLaMA2’s proficiency in non-English languages, while instruction tuning improves its performance in addressing particular problems without prior experience. To do continued pre-training, they used a dataset of 20 billion tokens evenly distributed among different languages. They sourced two-thirds of the tokens from monolingual data, and they sourced one-third of the data from publicly accessible bilingual datasets, such as OPUS.
The second step of instruction tuning enhanced the model’s ability to handle specific tasks at a higher level in a 0-shot fashion. They developed a dataset named TowerBlocks for supervised fine-tuning. The dataset comprises code instructions and conversational data and has task-specific records. This dataset helped the model to maintain competency across various translation-related tasks by providing prompts for all tasks, including zero and few-shot templates.
In conclusion, TowerInstruct can be a significant step in multilingual machine translation as it outperforms GPT-3.5 and Mixtral 8x7B models. Its features, including automatic post-edition, named-entity recognition, or source error correction, can be very helpful in this domain. As the researchers focus on enhancing the model’s efficiency, this model can be a revolutionary stride in multilingual translation. The researchers of this collaboration are also looking forward to the release of TowerEval, an evaluation repository focused on machine translation and related tasks. This will help users reproduce benchmarks and assess the performance of their language models against Tower’s standards.
Check out the Model and Reference Blog. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter. Join our 36k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and LinkedIn Group.
If you like our work, you will love our newsletter..
Don’t Forget to join our Telegram Channel