DianJin-R1: Evaluating and Enhancing Financial Reasoning in Large Language Models
Jie Zhu1, Qian Chen1, Huaixia Dou1,2, Junhui Li2, Lifan Guo1, Feng Chen1, Chi Zhang1
1Qwen DianJin Team, Alibaba Cloud Computing  2Soochow University
https://huggingface.co/DianJin
https://modelscope.cn/organization/tongyi dianjin
https://github.com/aliyun/qwen-dianjin
https://tongyi.aliyun.com/dianjin
Abstract
Effective reasoning remains a core challenge for large language models (LLMs) in the financial domain, where tasks often require domain-specific knowledge, precise numerical calculations, and strict adherence to compliance rules. We propose DianJin-R1, a reasoning-enhanced framework designed to address these challenges through reasoning-augmented supervision and reinforcement learning. Central to our approach is DianJin-R1-Data, a high-quality dataset constructed from CFLUE, FinQA, and a proprietary compliance corpus (Chinese Compliance Check, CCC), combining diverse financial reasoning scenarios with verified annotations. Our models, DianJin-R1-7B and DianJin-R1-32B, are fine-tuned from Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct using a structured format that generates both reasoning steps and final answers. To further refine reasoning quality, we apply Group Relative Policy Optimization (GRPO), a reinforcement learning method that incorporates dual reward signals: one encouraging structured outputs and another rewarding answer correctness. We evaluate our models on five benchmarks: three financial datasets (CFLUE, FinQA, and CCC) and two general reasoning benchmarks (MATH-500 and GPQA-Diamond). Experimental results show that DianJin-R1 models consistently outperform their non-reasoning counterparts, especially on complex financial tasks. Moreover, on the real-world CCC dataset, our single-call reasoning models match or even surpass the performance of multi-agent systems that incur significantly higher computational cost. These findings demonstrate the effectiveness of DianJin-R1 in enhancing financial reasoning through structured supervision and reward-aligned learning, offering a scalable and practical solution for real-world applications.
1 Introduction
Recent advances in large language models (LLMs) have led to growing interest in enhancing their reasoning abilities. Models such as OpenAI o1 (OpenAI, 2024), DeepSeek R1 (Guo et al., 2025), and QwQ (Qwen, 2024) have shown that explicitly modeling reasoning processes can significantly boost performance on complex tasks (Zhong et al., 2024). Despite these improvements, recent evaluations on financial benchmarks (Xie et al., 2023; 2024; Zhu et al., 2024; Chen et al., 2024; Qian et al., 2025; Liu et al., 2025) reveal that reasoning in this domain remains particularly challenging, given the need for domain-specific knowledge, accurate numerical reasoning, and strict compliance with regulatory requirements. Effectively addressing these challenges calls for specialized reasoning strategies capable of handling both structured financial information and open-ended problem solving. In response, we introduce DianJin-R1, a family of LLMs that incorporate reasoning-augmented supervision and reinforcement learning to enhance performance on financial reasoning tasks.
We begin by constructing a high-quality reasoning dataset, DianJin-R1-Data, using three major sources: CFLUE (Zhu et al., 2024), FinQA (Chen et al., 2021), and our proprietary compliance dataset for the task of Chinese Compliance Check (CCC). CFLUE, which includes over 31,000 reasoning-annotated multiple-choice and open-ended questions from financial qualification mock exams, plays a central role in training due to its scale and diversity. FinQA provides numerical reasoning questions, while CCC focuses on complex compliance scenarios requiring multi-step logic. To ensure the quality of reasoning, we adopt a verification process using GPT-4o (OpenAI, 2024) to check for alignment between generated answers, reasoning steps, and reference explanations. This process results in a reliable set of reasoning-augmented and non-reasoning samples, supporting more robust model training.
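The verification step can be pictured as a judge-model call that compares each generated reasoning trace and answer against the reference explanation. The sketch below is illustrative only: the prompt wording, the verify_sample helper, and the accept/reject criterion are assumptions rather than the exact implementation, and it presumes an OpenAI client with access to GPT-4o.

```python
# Minimal sketch of the reasoning-verification step.
# Prompt wording, helper names, and the accept/reject rule are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are a strict grader for financial exam questions.
Question: {question}
Reference explanation: {reference}
Model reasoning: {reasoning}
Model answer: {answer}
Reply with exactly one word: CONSISTENT if the reasoning and answer agree with the
reference explanation, otherwise INCONSISTENT."""

def verify_sample(question: str, reference: str, reasoning: str, answer: str) -> bool:
    """Return True if GPT-4o judges the generated reasoning and answer to match the reference."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, reasoning=reasoning, answer=answer)}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("CONSISTENT")

# Samples that pass become reasoning-augmented training data; failures can be dropped
# or retained as non-reasoning samples, depending on the curation policy.
```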
For supervised fine-tuning (SFT), we train DianJin-R1-7B and DianJin-R1-32B, based on Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct (Yang et al., 2024), to generate both the reasoning process and final answers using a structured output format with <think> and <answer> tags. To further improve reasoning quality, we apply Group Relative Policy Optimization (GRPO) (Shao et al., 2024), a reinforcement learning algorithm that introduces two reward signals: a format reward to encourage structured outputs and an accuracy reward to promote answer correctness. These mechanisms guide the model to produce coherent, verifiable reasoning paths and reliable answers.
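As a concrete illustration of the two reward signals, the snippet below sketches a format reward that checks for well-formed <think>/<answer> tags and an accuracy reward that compares the extracted answer with the reference. The exact reward values, regular expressions, and answer normalization here are assumptions for illustration, not the released training code.

```python
# Hedged sketch of the two GRPO reward signals described above.
# Tag layout, reward magnitudes, and normalization are illustrative assumptions.
import re

FORMAT_RE = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the <think>...</think><answer>...</answer> format."""
    return 1.0 if FORMAT_RE.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """1.0 if the text inside <answer> matches the reference answer after light normalization."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == reference_answer.strip().lower() else 0.0

def total_reward(completion: str, reference_answer: str) -> float:
    # GRPO then normalizes these rewards within each sampled group of completions
    # to form relative advantages; that normalization step is omitted here.
    return format_reward(completion) + accuracy_reward(completion, reference_answer)
```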
We evaluate our DianJin-R1 models, along with other general reasoning and non-reasoning models, across a diverse set of benchmarks, including CFLUE, FinQA, CCC, MATH-500 (Hendrycks et al., 2021), and GPQA-Diamond (Rein et al., 2024). The results demonstrate that reasoning-augmented models consistently outperform their non-reasoning counterparts, especially in the financial domain. Notably, training on CFLUE alone yields substantial gains across all tasks, and combining all datasets further enhances performance. Our analysis also highlights the benefit of reinforcement learning, particularly when the reward signals align with the task domain.
Finally, we demonstrate a practical application of our approach on the CCC dataset, where a multi-agent system based on LLMs is employed to perform condition-based compliance checks. By assigning specialized agents to each decision node in the workflow, the system effectively integrates intermediate reasoning steps to arrive at the final compliance judgment.
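To make the workflow concrete, the sketch below shows one way such a condition-based pipeline could be wired, with one LLM call per decision node and a final aggregation step. The node names, prompts, and call_llm helper are hypothetical and only illustrate the structure, not the specific agents used on CCC.

```python
# Hypothetical sketch of a condition-based compliance-check workflow.
# Node names, prompts, and the call_llm helper are illustrative assumptions.
from typing import Callable

def check_compliance(dialogue: str, call_llm: Callable[[str], str]) -> str:
    """Run each decision node as a separate LLM call, then aggregate into a final judgment."""
    nodes = {
        "identity_verified": "Did the agent verify the customer's identity? Answer yes or no.\n",
        "risk_disclosed": "Were the product risks disclosed to the customer? Answer yes or no.\n",
        "no_guaranteed_returns": "Did the agent avoid promising guaranteed returns? Answer yes or no.\n",
    }
    findings = {}
    for name, prompt in nodes.items():
        findings[name] = call_llm(prompt + dialogue).strip().lower().startswith("yes")
    # A violation is flagged if any required condition fails.
    return "compliant" if all(findings.values()) else "violation"
```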
In summary, DianJin-R1 presents a scalable and effective strategy for enhancing financial reasoning in LLMs by combining high-quality supervision, structured reasoning generation, and reward-driven refinement through reinforcement learning.
2 DianJin-R1-Data Construction
2.1 Data Source
Our dataset originates from three sources: two open-source datasets and an in-house dataset.
CFLUE (Zhu et al., 2024). It is an open-source Chinese benchmark designed to assess the performance of LLMs on a variety of natural language processing (NLP) tasks within the financial domain. Its knowledge assessment component includes 38,638 multiple-choice financial exam questions, sourced from 15 types of financial qualification mock exams that cover various subjects and difficulty levels. To construct a high-quality subset for our study, we apply a three-step filtering process—focusing on question length, difficulty, and ambiguity. First, we apply a length filter to remove questions with fewer than 15 tokens, as these typically require minimal reasoning and offer limited value for assessing deeper understanding. Second, since simple QA pairs may not significantly enhance reasoning ability (Ye et al., 2025; Muennighoff et al., 2025), we apply a difficulty filter to discard questions that are correctly answered by all smaller language models, including