Are CLIP Models ‘Parroting’ Text in Images? This Paper Explores the Text Spotting Bias in Vision-Language Systems

In recent research, a team of researchers has examined CLIP (Contrastive Language-Image Pretraining), which is a famous neural network that effectively acquires visual concepts using natural language supervision. CLIP, which predicts the most relevant text snippet given an image, has helped advance vision-language modeling tasks. Though CLIP’s effectiveness has established itself as a fundamental model for a number of different applications, CLIP models display biases pertaining to visual text, color, gender, etc.

A team of researchers from Shanghai AI Laboratory, Show Lab, National University of Singapore, and Sun Yat-Sen University have examined CLIP’s visual text bias, particularly with regard to its capacity to identify text in photos. The team has studied the LAION-2B dataset in detail and has found that estimating bias accurately is difficult given the enormous volume of image-text data.

Picture clustering has been used on the complete dataset to solve the problem, ranking each cluster according to CLIP scores. This analysis aims to determine which image-text pair kinds are most favored based on CLIP score measures. Many examples with the highest CLIP scores have been included, consisting of dense contemporaneous text that appears at the pixel level in both the captions and the images. 

The captions that coincide with the samples have been called the ‘Parrot Captions’ since they appear to give CLIP another way to accomplish its objectives by teaching it to recognize text without necessarily grasping the visual notions. The team has studied the significance of the parrot captions by examining the dataset from three angles, i.e., the dataset itself, popular models that have been released, and the model-training procedure.

The team has discovered a notable bias in how visual text material embedded in images is described in LAION-2B captions. They have found that over 50% of the photos have visual text content by thoroughly profiling the LAION-2B dataset utilizing commercial text detection methods. Their analysis of paired image-text data has shown that more than 90% of captions have at least one word that appears simultaneously, with the caption and spotted text from the images having a word overlap of about 30%. This suggests that when trained with LAION-style data, CLIP significantly deviates from the fundamental presumption of semantic congruence between picture and text.

The study has looked into biases in released CLIP models, namely a significant bias in favor of text spotting in different types of web photographs. The team has compared alignment scores before and after text removal to examine how OpenAI’s publicly available CLIP model behaves on the LAION-2B dataset. The findings have shown a strong association between visual text incorporated in images with corresponding parrot captions and CLIP model predictions. 

The team has also demonstrated the text spotting abilities of CLIP and OpenCLIP models, finding that OpenCLIP, which was trained on LAION-2B, shows a greater bias in favor of text spotting than CLIP, which was trained on WIT-400M. The research has focussed on how CLIP models can quickly pick up text recognition skills from parrot captions, but they have trouble making the connection between vision and language semantics. 

Based on text-oriented parameters, such as the embedded text ratio, contemporaneous word ratios, and relative CLIP scores from text removal, several LAION-2B subsets have been sampled. The findings have shown that CLIP models gain good text detection abilities when trained with parrot caption data, but they lose most of their zero-shot generalization ability on image-text downstream tasks. 

In conclusion, this study has focussed on the effects of parrot captions on CLIP model learning. It has shed light on biases associated with visual text in LAION-2B captions and has emphasized the text spotting bias in published CLIP models.

Check out the Paper and Project. All credit for this research goes to the researchers of this project. Also, don’t forget to join our 35k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, LinkedIn Group, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, you will love our newsletter..

Tanya Malhotra is a final year undergrad from the University of Petroleum & Energy Studies, Dehradun, pursuing BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with good analytical and critical thinking, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.

Leave a Reply

Your email address will not be published. Required fields are marked *