Impact of Noise on LLM-Models Performance in Abstraction and Reasoning Corpus (ARC) Tasks with Model Temperature Considerations
Nikhil Khandalkar1, Krishna Shinde1, Pavan Yadav1, Lokesh B. Ramegowda1, and Rajarshi Das2 1Enkefalos Technologies 2MQube Cognition 1{nikhil.khandalkar, krishna.shinde, pavan.yadav, lokeshbr}@enkefalos.com 2rajarshi.das@mqube.ai
Abstract
Recent advancements in Large Language Models (LLMs) have sparked interest in their structured reasoning capabilities, particularly in abstraction and pattern recognition tasks. The Abstraction and Reasoning Corpus (ARC) benchmark serves as a key evaluation tool for assessing AI models’ ability to generalize and solve novel reasoning tasks. While GPT-4o successfully solves all ARC tasks at zero noise, models such as DeepSeek R1 and LLaMA 3.2 fail to solve any, raising questions about their abstraction and generalization capabilities beyond pattern matching. To investigate this further, we evaluate these models under varying noise levels and temperature settings. Our findings indicate that introducing noise significantly degrades performance across all models, underscoring their fragility under uncertain conditions. This suggests that while some models demonstrate reasoning abilities, they remain highly sensitive to input perturbations, limiting their robustness. By analyzing how different architectures handle noise and uncertainty, we provide insights into the limitations of current AI systems in structured reasoning. Our study highlights the need for more resilient AI models that can adapt to real-world complexity, informing future research on improving generalization, robustness, and alignment with human cognitive flexibility.
1 Introduction
As AI systems advance in solving complex reasoning tasks, evaluating their ability to generalize and align with human-like problem-solving strategies becomes crucial. The Abstraction and Reasoning Corpus (ARC) Challenge, introduced by François Chollet,
serves as a benchmark for assessing an AI model’s capacity to perform abstract reasoning, a skill fundamental to human intelligence. Unlike traditional machine learning tasks that rely heavily on pattern recognition over large datasets, the ARC Challenge emphasizes few-shot learning, generalization, and abstraction, requiring models to infer underlying rules from minimal examples [1].

The ARC Challenge consists of diverse problem-solving tasks that demand conceptual reasoning, pattern recognition, and rule abstraction, skills that closely resemble human cognitive processing. Each task presents input-output examples demonstrating a transformation rule, and the AI model must deduce the correct rule to generalize to new, unseen test cases. Since human cognition excels at such tasks through structured representation and high-level abstractions, ARC provides an ideal testbed for evaluating the alignment between AI models and human-like reasoning mechanisms.

Early approaches to solving ARC primarily relied on symbolic AI and program synthesis, using manually defined heuristics to infer rules from example pairs. However, these methods faced scalability issues, as they required handcrafted representations for each task. With the rise of deep learning, researchers explored the application of convolutional neural networks (CNNs), transformers, and large language models (LLMs) to ARC [8]. While some studies demonstrated moderate success using pretrained language models, they often relied on pattern recognition rather than true abstraction, failing to solve tasks that required extrapolation beyond their training distribution [3]. More recent research evaluated GPT-4o on ARC, showing improved performance but still highlighting significant limitations, particularly in handling noise and reasoning under uncertainty. A major challenge identified in prior work is the lack of robust generalization.
Studies have shown that LLMs trained on large text corpora struggle to infer abstract rules in a structured reasoning setting. Additionally, research on model robustness suggests that introducing noise into input examples leads to significant performance drops, indicating that current models are highly sensitive to minor perturbations [4]. Building upon these findings, our study conducts a comprehensive evaluation of state-of-the-art models, including GPT-4o and LLaMA 3.2, on ARC tasks under different noise levels and temperature settings [7]. By analyzing model performance under these varied conditions, we aim to provide deeper insights into the role of structured reasoning in LLMs and their limitations in handling uncertainty. Our work extends prior research by systematically quantifying how noise affects reasoning performance and identifying the architectural differences that contribute to these limitations.
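To make the noise-injection setup concrete, the sketch below shows one plausible way to perturb an ARC grid at a given noise level. This is a minimal illustration only: it assumes noise is applied by replacing each grid cell, independently with probability equal to the noise level, with a different random color. The function name `add_noise` and the `noise_level` parameter are illustrative assumptions, not the paper's confirmed corruption scheme.

```python
import random

def add_noise(grid, noise_level, num_colors=10, seed=None):
    """Return a copy of an ARC grid where each cell is replaced,
    with probability `noise_level`, by a different random color.

    `grid` is a list of lists of ints in [0, num_colors).
    Illustrative corruption scheme; the paper's exact mechanism
    may differ.
    """
    rng = random.Random(seed)
    noisy = []
    for row in grid:
        noisy_row = []
        for cell in row:
            if rng.random() < noise_level:
                # Sample a replacement color distinct from the original,
                # so every "flip" is a genuine perturbation.
                choices = [c for c in range(num_colors) if c != cell]
                noisy_row.append(rng.choice(choices))
            else:
                noisy_row.append(cell)
        noisy.append(noisy_row)
    return noisy

# Example: corrupt roughly 20% of cells in a small grid
clean = [[0, 1, 2], [3, 4, 5]]
noisy = add_noise(clean, noise_level=0.2, seed=42)
```

Sweeping `noise_level` from 0.0 upward on the serialized task examples, at several temperature settings per model, would reproduce the kind of degradation curves the study describes.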
2 Motivation
The Abstraction and Reasoning Corpus (ARC) benchmark serves as a fundamental testbed for evaluating an AI model’s ability to infer abstract patterns and solve problems requiring human-like reasoning. Unlike traditional machine learning benchmarks that rely on large-scale data-driven pattern recognition, ARC challenges models to generalize from a limited number of examples using conceptual reasoning [1]. This ability to generalize is a hallmark of human intelligence but remains a significant challenge for modern AI systems [8]. In real-world applications, AI systems frequently encounter noisy, ambiguous, or incomplete data. However, current models, including large language models (LLMs) and deep learning-based approaches, often struggle to maintain robust performance under