Causal reasoning is the process of identifying the relationship between a cause and its effect. People use it to infer outcomes, and it is one of the hallmarks of human intelligence. Because it underpins sound scientific reasoning and rational decision-making, researchers have tried to build AI models that can answer causal questions at scale, with the aim of improving problem-solving across diverse domains.
Many previous works have explored the causal reasoning capabilities of LLMs, but they have not established whether the models genuinely reason about causes. An LLM may answer causal questions by repeating verbal patterns from its training text rather than by understanding the relationship between the variables involved. Therefore, a team of researchers from MPI for Intelligent Systems, Tübingen, ETH Zürich, IIT Kharagpur, University of Hong Kong, and the University of Washington has introduced CLADDER, a dataset to test formal causal reasoning in LLMs through symbolic questions and ground-truth answers.
CLADDER consists of more than 10,000 causal questions covering diverse queries across the three rungs of the Ladder of Causation (a hierarchy of causal inference tasks): associational, interventional, and counterfactual. The researchers also considered various causal graphs requiring different causal inference abilities. They verbalized the questions and answers by turning them into stories, and alongside each question-answer pair they generated a ground-truth, step-by-step explanation whose intermediate reasoning steps support a finer-grained analysis of LLMs.
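To make the three rungs concrete, here is a minimal sketch on a toy causal model. It is not taken from CLADDER: the graph (a confounder Z affecting both X and Y, plus X affecting Y), the noise categories, and every probability are assumptions made up for illustration. The code enumerates all noise settings exactly, so each rung's answer is an exact fraction.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical exogenous noise distributions (all numbers are invented).
P_UZ = {0: F(1, 2), 1: F(1, 2)}
P_UX = {"always": F(1, 5), "follows_z": F(3, 5), "never": F(1, 5)}
P_UY = {"always": F(1, 10), "needs_x": F(3, 10),
        "needs_z": F(5, 10), "never": F(1, 10)}

def f_x(z, ux):
    """Structural equation for X: depends on confounder Z and noise ux."""
    if ux == "always":
        return 1
    if ux == "never":
        return 0
    return z  # "follows_z"

def f_y(x, z, uy):
    """Structural equation for Y: depends on X, Z, and noise uy."""
    return int(uy == "always"
               or (uy == "needs_x" and x == 1)
               or (uy == "needs_z" and z == 1))

def noise_settings():
    """Yield (probability, uz, ux, uy) for every joint noise setting."""
    for (uz, pz), (ux, px), (uy, py) in product(
            P_UZ.items(), P_UX.items(), P_UY.items()):
        yield pz * px * py, uz, ux, uy

def association(x_obs):
    """Rung 1: P(Y=1 | X=x_obs), by conditioning on observed X."""
    num = den = F(0)
    for p, uz, ux, uy in noise_settings():
        x = f_x(uz, ux)
        if x == x_obs:
            den += p
            num += p * f_y(x, uz, uy)
    return num / den

def intervention(x_do):
    """Rung 2: P(Y=1 | do(X=x_do)), by overriding X's equation."""
    return sum(p * f_y(x_do, uz, uy) for p, uz, ux, uy in noise_settings())

def counterfactual(x_obs, y_obs, x_cf):
    """Rung 3: P(Y would be 1 under X=x_cf | observed X=x_obs, Y=y_obs)."""
    num = den = F(0)
    for p, uz, ux, uy in noise_settings():
        x = f_x(uz, ux)
        if x == x_obs and f_y(x, uz, uy) == y_obs:  # keep consistent worlds
            den += p
            num += p * f_y(x_cf, uz, uy)            # replay with X changed
    return num / den

print(association(1))           # 4/5
print(intervention(1))          # 13/20
print(counterfactual(1, 1, 0))  # 5/8
```

Under these invented numbers, the associational answer (4/5) is higher than the interventional one (13/20) because Z raises both X and Y, which is the kind of gap a model cannot recover from surface patterns alone.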
The team kept the dataset size at 10K to balance the diversity of questions against the inference cost of running LLMs. The dataset itself is balanced across graph structures, query types, stories, and ground-truth answers. CLADDER also has zero human annotation cost and has been run through various checks to reduce grammatical errors.
The researchers also designed CausalCOT, a chain-of-thought prompting strategy that simplifies causal reasoning problems by breaking them into simpler steps. The strategy is built on GPT-4: it prompts the model to extract the causal graph, the causal query, and the available data from the question, and then to derive the answer from them.
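The general shape of such a step-by-step prompt can be sketched as follows. This is an illustration only: the step wording, the `STEPS` list, and the `build_prompt` helper are hypothetical and are not the prompt used in the paper. The sketch only builds the prompt string and does not call any model.

```python
# Hypothetical step wording, loosely following the description above:
# extract the graph, identify the query, collect the data, then infer.
STEPS = [
    "Extract the causal graph described in the question.",
    "Identify the causal query being asked.",
    "List the data (for example, probabilities) given in the question.",
    "Combine the graph, query, and data to infer the answer.",
]

def build_prompt(question: str) -> str:
    """Wrap a causal question in a numbered chain-of-thought scaffold."""
    numbered = "\n".join(
        f"Step {i}: {step}" for i, step in enumerate(STEPS, start=1)
    )
    return (
        f"Question: {question.strip()}\n\n"
        f"Work through these steps in order:\n{numbered}\n\n"
        "Final answer:"
    )

print(build_prompt("Does taking the drug increase the chance of recovery?"))
```

Keeping the steps in a plain list makes it easy to reorder or reword them when experimenting with different decompositions.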
For evaluation, the researchers compared the performance of models such as GPT, LLaMa, and Alpaca on causal reasoning. The results suggest that all of these models struggle with the questions in the CLADDER dataset: GPT-4 achieves an accuracy of 64.28%, and CausalCOT improves on that with 66.64%, a gain of 2.36 percentage points. CausalCOT also improves reasoning across all levels, with significant improvement on anti-commonsensical and nonsensical data, which indicates that the strategy is beneficial for unseen data.
The researchers also highlighted some of the limitations of their work in the paper. The dataset covers only a few of the commonly studied queries across all three rungs, and future work is needed to extend this to further causal queries. They also pointed out that it is important to test the abilities of LLMs in semi-realistic scenarios as well for better evaluation. Nonetheless, the research paper presents a challenging benchmark for the causal assessment of LLMs, and with its diverse set of questions and scenarios, it is a crucial step toward addressing the limitations of previous works and enhancing the causal reasoning capabilities of LLMs.