Large language models have shown notable achievements in executing instructions, multi-turn conversations, and image-based question-answering tasks. These models include Flamingo, GPT-4V, and Gemini. The fast development of open-source Large Language Models, such as LLaMA and Vicuna, has greatly accelerated the evolution of open-source vision language models. These advancements mainly center on improving visual understanding by utilizing language models with at least 7B parameters and integrating them with a vision encoder. Autonomous driving and robotics are two examples of time-sensitive or real-time interactive applications that could benefit from a faster inference speed and shorter test times.
Regarding mobile technology, Gemini has been a trailblazer for multimodal approaches. Gemini-Nano, a simplified version, contains 1.8/3.25 billion parameters and can be used on mobile devices. Yet, information such as the model’s design, training datasets, and training procedures is confidential and cannot be shared with anybody.
A new study by Midea Group and East China Normal University provides LLaVA-Phi, a little language model-powered vision-language assistant. The most effective open-sourced tiny language model, Phi-2.2, and the robust open-sourced multimodal model, LLaVA-1.5, are combined in this study. The researchers use LLaVA’s high-quality visual instruction tuning data in a two-stage training pipeline. They tested LLaVA-Phi using eight different metrics.
Its performance is on par with, or even better than, other three times larger multimodal models, and it only has three billion parameters.
The team used a wide variety of academic standards developed for multimodal models to thoroughly evaluate LLaVA-Phi. Examples of these tests include VQA-v2, VizWizQA, ScienceQA, and TextQA for general question-answering and more specialized assessments like POPE for object hallucination and MME, MMBench, and MMVet for a comprehensive evaluation of diverse multimodal abilities like visual understanding and visual commonsense reasoning. The proposed method outperformed other big multimodal models that were previously available by demonstrating that the model could answer questions based on visual cues. Amazingly, LLaVA-Phi achieved better results than models like IDEFICS, which rely on a 7B-parameter or greater LLMs.
The top score the model achieved on ScienceQA stands out. The success of their multimodal model in answering math-based questions can be attributed to the Phi-2 language model, which has been trained on mathematical corpora and code production in particular. In the extensive multimodal benchmark of MMBench, LLaVA-Phi outperformed numerous prior art vision-language models based on 7B-LLM.
Another parallel effort that constructs an effective vision-language model, MobileVLM, was also compared. LLaVA-Phi routinely beats all the approaches on all five measures.
The team highlights that since the model has not been fine-tuned to follow multilingual instructions, the LLaVA-Phi architecture cannot process instructions in various languages, including Chinese, because Phi-2 uses the codegenmono tokenizer. They intend to improve training procedures for small language models in the future and investigate the effect of visual encoder size, looking at methods like RLHF and direct preference optimization. These endeavors aim to further improve performance while decreasing model size.
Check out the Paper and Github. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter. Join our 35k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and LinkedIn Group.
Dhanshree Shenwai is a Computer Science Engineer and has a good experience in FinTech companies covering Financial, Cards & Payments and Banking domain with keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements in today’s evolving world making everyone’s life easy.