Commonsense reasoning is an essential facet of human cognition that enables intuitive interpretation and interaction with the world. In NLP, this translates into the ability of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) to interpret human language and visual cues realistically. Despite advancements, these models often struggle to mimic the nuanced commonsense reasoning innate to humans, encompassing basic knowledge, social interactions, moral reasoning, and visual interpretation.
The challenge in NLP research pivots around the models’ ability to employ commonsense knowledge. This critical aspect of intelligence involves not just language interpretation but also the integration of visual cues and contextual understanding. The core issue lies in the models’ limited capacity for human-like commonsense reasoning, essential for understanding basic concepts, social nuances, moral judgments, and visual information processing.
Recent developments have focused on evaluating various LLMs and MLLMs on their effectiveness in commonsense reasoning tasks. These models undergo rigorous testing across diverse datasets designed to probe different dimensions of commonsense reasoning. Despite their sophisticated capabilities, these models often fall short in tasks requiring deep contextual understanding or abstract thought.
Researchers from Stanford University and Meta evaluate models such as Gemini Pro and Gemini Pro Vision to address these challenges. These models are tailored for multimodal integration and mark significant progress, showing impressive results in commonsense reasoning tasks across multiple domains. However, they still grapple with understanding complex scenarios and abstract ideas, which remains a critical area for improvement.
The study involved comprehensive evaluations using 12 diverse commonsense reasoning datasets covering general, physical, social, and temporal reasoning. Models were assessed for their performance in both language-based and multimodal scenarios: Llama2-70b, Gemini Pro, GPT-3.5 Turbo, and GPT-4 Turbo on the language datasets, and Gemini Pro Vision and GPT-4V on the multimodal dataset. The key findings indicated that Gemini Pro’s performance was comparable to GPT-3.5 Turbo, but it lagged behind GPT-4 Turbo in accuracy, especially in temporal and social reasoning.
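To make the idea of comparing accuracy across reasoning categories concrete, here is a minimal sketch of how per-category accuracy could be computed from a list of model answers. It is an illustration only, not the study's evaluation code: the function name `accuracy_by_category`, the record layout, and the toy records are all assumptions made for this example, and the toy records are placeholders rather than results from the paper.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return accuracy per category.

    `records` is an iterable of (category, predicted, gold) tuples.
    This layout is a hypothetical one chosen for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, gold in records:
        total[category] += 1
        # Compare answers case-insensitively, ignoring surrounding whitespace.
        if predicted.strip().lower() == gold.strip().lower():
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Toy placeholder records (NOT results from the study).
toy_records = [
    ("temporal", "B", "B"),
    ("temporal", "A", "C"),
    ("social", "a", "A"),
    ("physical", "D", "D"),
]

scores = accuracy_by_category(toy_records)
for category, score in sorted(scores.items()):
    print(f"{category}: {score:.2f}")
```

Breaking accuracy down by category, rather than reporting a single overall number, is what makes it possible to spot weaknesses concentrated in areas such as temporal or social reasoning.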
In visual commonsense evaluations, Gemini Pro Vision demonstrated proficiency in analyzing graphic scenes and predicting potential consequences, a crucial aspect of visual commonsense reasoning. However, all models exhibited challenges in specific areas, particularly those involving temporal and social aspects of commonsense reasoning.
In conclusion, the key points can be summarized as follows:
- The study highlights the need for AI systems to mimic human-like commonsense reasoning better.
- Despite advancements, a gap remains in the models’ ability to fully grasp the complex, abstract concepts inherent in human cognition.
- Future research can focus on refining models’ capabilities in specialized domains and improving the nuanced recognition of mental states and emotions in multimodal contexts.
Check out the Paper. All credit for this research goes to the researchers of this project.
Hello, My name is Adnan Hassan. I am a consulting intern at Marktechpost and soon to be a management trainee at American Express. I am currently pursuing a dual degree at the Indian Institute of Technology, Kharagpur. I am passionate about technology and want to create new products that make a difference.