UC Berkeley and NYU AI Research Explores the Gap Between the Visual Embedding Space of CLIP and Vision-Only Self-Supervised Learning

Multimodal large language models (MLLMs) have advanced rapidly of late. By incorporating images into large language models (LLMs) and harnessing their capabilities, MLLMs demonstrate exceptional skill in tasks including visual question answering, instruction following, and image understanding. Despite these improvements, studies have identified a significant weakness: the models still make surprisingly simple and obvious visual mistakes.

According to recent research out of UC Berkeley and New York University, these MLLM deficiencies might be caused by visual representation issues. 

Pretrained vision and language models form the backbone of most MLLMs. These models are connected through adapters of various kinds to integrate the different modalities. A natural hypothesis is that any flaw in the pretrained vision model can carry over to the downstream MLLMs that use it.

For the visual encoder, most open-source MLLMs use the pretrained Contrastive Language-Image Pre-Training (CLIP) model. The researchers start by cataloging failure cases that CLIP struggles to encode accurately. To find them, they exploit erroneous agreements in the embedding space: if CLIP encodes two visually distinct images similarly, at least one of them is probably encoded ambiguously. Such a pair of images is called a CLIP-blind pair. To measure how visually similar two images actually are, the team uses a vision-only self-supervised encoder such as DINOv2. CLIP-blind pairs are therefore images with highly similar CLIP embeddings but clearly different DINOv2 embeddings. The researchers find that these CLIP-blind pairs lead to errors in downstream MLLMs.
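
The pair-mining idea can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' code: the function names are hypothetical, the embeddings are toy vectors rather than real CLIP or DINOv2 outputs, and the cutoffs `clip_min=0.95` and `dino_max=0.6` are illustrative assumptions, since the article gives no numeric thresholds.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def find_clip_blind_pairs(clip_emb, dino_emb, clip_min=0.95, dino_max=0.6):
    """Return index pairs that CLIP sees as similar but DINOv2 sees as different.

    clip_emb[i] and dino_emb[i] are the two embeddings of image i.
    The thresholds are assumed values chosen for illustration only.
    """
    pairs = []
    for i, j in combinations(range(len(clip_emb)), 2):
        similar_for_clip = cosine(clip_emb[i], clip_emb[j]) > clip_min
        different_for_dino = cosine(dino_emb[i], dino_emb[j]) < dino_max
        if similar_for_clip and different_for_dino:
            pairs.append((i, j))
    return pairs


# Toy 2-D embeddings (made up): images 0 and 1 nearly coincide for CLIP
# but are orthogonal for DINOv2, so they form a CLIP-blind pair.
clip_emb = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
dino_emb = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
blind_pairs = find_clip_blind_pairs(clip_emb, dino_emb)
```

The double loop is quadratic in the number of images, which is fine for a sketch; a real image corpus would likely need batched matrix operations or an approximate nearest-neighbor index.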

A new benchmark called MultiModal Visual Patterns (MMVP) is built from these pairs. The benchmark is designed to ask about the differences within CLIP-blind pairs, evaluating the visual capabilities of state-of-the-art MLLMs with basic questions. The researchers tested GPT-4V and other state-of-the-art MLLMs on the benchmark and found that all of them struggle to answer basic questions about visual details. Most of these models do worse than random guessing; GPT-4V is an exception. Even so, GPT-4V trails human performance by more than 50%.
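
To make the comparison with random guessing concrete, here is a small scoring sketch. It rests on two assumptions that the article does not state: that each question offers two answer options, and that a pair is credited only when the questions for both of its images are answered correctly. The function name and the toy results are hypothetical.

```python
def pair_accuracy(results):
    """Fraction of image pairs for which both questions were answered correctly.

    results: list of (first_correct, second_correct) booleans, one tuple per pair.
    """
    if not results:
        raise ValueError("results must contain at least one pair")
    both_right = sum(1 for first, second in results if first and second)
    return both_right / len(results)


# Assumption: two-option questions, so a random guesser gets each one right
# with probability 0.5 and both questions of a pair right with 0.5 * 0.5.
CHANCE_LEVEL = 0.5 * 0.5

# Made-up outcomes for four pairs, for illustration only.
toy_results = [(True, True), (True, False), (False, False), (True, True)]
score = pair_accuracy(toy_results)
```

Under these assumptions a model can answer half of the individual questions correctly and still score well below 50% at the pair level, which is why pair-level scoring is a stricter test than per-question accuracy.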

After finding numerous individual cases of MLLM failure, the researchers investigated the systematic visual patterns in MMVP that CLIP models struggle with. The CLIP-blind pairs in MMVP frequently exhibit nine such patterns, including "orientation," "counting," and "viewpoint," which present considerable difficulties for the CLIP vision encoder. Increasing the amount of training data and the size of CLIP models has been a sustained and substantial effort. To systematically evaluate whether scaling alone can alleviate these difficulties, the MMVP cases were grouped by visual pattern. According to the results, model and data scaling is insufficient: none of the large-scale CLIP-based models tested could resolve the nine visual patterns identified. In addition, the visual patterns that challenge CLIP models are strongly correlated with MLLM performance. If CLIP has problems with a specific visual pattern, such as "orientation," MLLMs built on it will probably also have trouble. CLIP vision encoders can therefore become a bottleneck in such systems.
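
One way to quantify a relationship like this is a Pearson correlation between per-pattern scores of the CLIP encoder and of the MLLM built on it. The sketch below assumes that choice of statistic, which the article does not specify, and the per-pattern numbers are made-up placeholders, not results from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Placeholder per-pattern accuracies (invented for illustration):
# one value per visual pattern, in the same order in both lists.
clip_scores = [0.1, 0.4, 0.2, 0.8]
mllm_scores = [0.2, 0.5, 0.3, 0.9]
r = pearson(clip_scores, mllm_scores)
```

A coefficient close to 1 would indicate that patterns that are hard for CLIP tend to be hard for the MLLM as well; with only nine patterns, any such estimate should be read cautiously.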

As a final step, the team works on strengthening the visual foundation of MLLMs. They focus on improving the models' visual grounding by integrating a vision-only self-supervised model such as DINOv2. They call these methods Mixture-of-Features (MoF). The first variant, Additive-MoF (A-MoF), linearly mixes CLIP and DINOv2 features in varying ratios. This method shows that DINOv2 features improve visual grounding, but the gain comes at the expense of a reduced ability to follow instructions. To address this, the team proposes Interleaved-MoF (I-MoF), which spatially interleaves visual tokens from the CLIP and DINOv2 models. This technique is found to greatly improve visual grounding while keeping the ability to follow instructions intact.
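
The two mixing strategies can be illustrated on plain lists of token vectors. This is a sketch under stated assumptions rather than the authors' implementation: both encoders are assumed to yield the same number of tokens, already projected to a common dimension by their adapters, and the function names and the parameter `alpha` (the weight given to DINOv2 features) are hypothetical.

```python
def additive_mof(clip_tokens, dino_tokens, alpha):
    """Additive mixing: alpha * DINOv2 + (1 - alpha) * CLIP, token by token."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(clip_tokens) != len(dino_tokens):
        raise ValueError("token sequences must have the same length")
    return [
        [alpha * d + (1.0 - alpha) * c for c, d in zip(clip_tok, dino_tok)]
        for clip_tok, dino_tok in zip(clip_tokens, dino_tokens)
    ]


def interleaved_mof(clip_tokens, dino_tokens):
    """Interleaved mixing: alternate CLIP and DINOv2 tokens, keeping spatial order."""
    if len(clip_tokens) != len(dino_tokens):
        raise ValueError("token sequences must have the same length")
    mixed = []
    for clip_tok, dino_tok in zip(clip_tokens, dino_tokens):
        mixed.append(clip_tok)  # CLIP token for this spatial position
        mixed.append(dino_tok)  # DINOv2 token for the same position
    return mixed


# Toy tokens: two spatial positions, two feature dimensions each.
clip_tokens = [[1.0, 1.0], [2.0, 2.0]]
dino_tokens = [[3.0, 3.0], [4.0, 4.0]]
blended = additive_mof(clip_tokens, dino_tokens, alpha=0.5)
interleaved = interleaved_mof(clip_tokens, dino_tokens)
```

The additive variant keeps the sequence length but blends the two feature sets into one, whereas the interleaved variant doubles the number of visual tokens and leaves each encoder's features untouched, which is one plausible reading of why it can retain what each encoder contributes.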

The pretrained CLIP vision encoders used by MLLMs overlook important visual patterns and miss critical visual details in images, which causes the MLLMs to fail on easy questions. Even so, CLIP-type models remain the gold standard among scalable vision models. The study's findings challenge the widespread assumption that simply scaling up data and models will solve all of the problems of CLIP models. The research shows that vision-and-language models and vision-only self-supervised models, two common families of visual representation learning models, each have their own strengths and weaknesses. These strengths go beyond what the usual comparison measures capture, such as linear probing and zero-shot accuracy on ImageNet. Although a well-designed Mixture-of-Features approach might overcome these visual limitations and combine the best of the two learning paradigms, new evaluation metrics are needed to guide the development of new visual representation learning algorithms. The team hopes that their work inspires further advances in vision models.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


Dhanshree Shenwai is a Computer Science Engineer with experience in FinTech companies covering the Financial, Cards & Payments, and Banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone's life easier in today's evolving world.
