PREPRINT: This is a preprint of the paper accepted by the International Conference on Evaluation and
Assessment in Software Engineering (EASE25) – AI Models and Data Evaluation Track.
Benchmarking LLM for Code Smells Detection: OpenAI GPT-4.0 vs DeepSeek-V3
Ahmed R. Sadik∗ Siddhata Govind†
April 23, 2025
Figure 1: LLM-based collaboration in software engineering.
Abstract
Determining the most effective Large Language Model (LLM) for code smell detection presents a complex challenge. This study introduces a structured methodology and evaluation matrix to tackle this issue, leveraging a curated dataset of code samples consistently annotated with known smells. The dataset spans four prominent programming languages—Java, Python, JavaScript, and C++—allowing for cross-language comparison. We benchmark two state-of-the-art LLMs, OpenAI GPT-4.0 and DeepSeek-V3, using precision, recall, and F1-score as evaluation metrics. Our analysis covers three levels of detail: overall performance, category-level performance, and individual code smell type performance. Additionally, we explore cost-effectiveness by comparing the token-based detection approach of GPT-4.0 with the pattern-matching techniques employed by DeepSeek-V3. The study also includes a cost analysis relative to traditional static analysis tools such as SonarQube. The findings offer valuable guidance for practitioners in selecting an efficient, cost-effective solution for automated code smell detection.
Keywords: Code Smell Detection, Large Language Models, DeepSeek-V3, GPT-4.0, Multilingual Dataset, SonarQube, Cost-Effectiveness
1 Introduction
The integration of Large Language Models (LLMs) into software engineering is rapidly transforming traditional development workflows by fostering novel forms of collaboration between human developers and artificial intelligence systems Sadik et al. [2023a,b]. One compelling vision of this transformation is illustrated in Figure 1, which shows how LLM-based agents can be embedded across the entire software development lifecycle. These agents augment human roles such as product owners, developers, scrum masters, and stakeholders by providing real-time assistance in task management, code improvement, sprint planning, and requirements clarification He et al. [2024], Waseem et al. [2023].
Each artificial agent, powered by an LLM, mirrors its human counterpart by handling tasks such as auto-generating boilerplate code, analyzing sprint retrospectives, or synthesizing user stories from meeting notes. This human-AI collaboration introduces new efficiencies but also calls for robust, interpretable methods to ensure the quality of outputs—especially in critical areas such as code maintainability. One notable example is the detection and refactoring of code smells: recurring patterns in source code that signal deeper design problems Aranda et al. [2024], Lucas et al. [2024]. These smells can compromise long-term maintainability, and while they have traditionally been uncovered through manual code reviews or static analysis tools, LLMs now offer a scalable, language-agnostic alternative for automating this task Sadik et al. [2023b], Velasco et al. [2024]. Motivated by the role of LLMs in these emerging workflows, this paper evaluates the effectiveness of LLM-based agents in detecting code smells across a multilingual dataset. We benchmark the performance of two leading models—GPT-4.0 and DeepSeek-V3—on a shared smelly-code dataset and compare their results against traditional tools like SonarQube Lenarduzzi et al. [2020], Hong et al. [2024]. To offer a detailed assessment, we explore three granularity levels: overall model performance, performance by code smell category, and performance by individual smell type. We further analyze performance across programming languages and evaluate cost-effectiveness.
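To make the granularity levels concrete, the following is a minimal sketch, assuming a simple data layout in which each annotated sample maps to a set of smell labels, of how per-type precision, recall, and F1-score could be computed from ground-truth annotations and model detections. The function and sample names are illustrative and are not the study's actual evaluation harness.

```python
# Illustrative evaluation sketch (not the study's actual harness): per-smell-type
# precision, recall, and F1 from annotated ground truth versus LLM detections.
from collections import defaultdict

def score_per_smell(ground_truth, detections):
    """Both arguments map a sample id to the set of smell labels found in it."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for sample, truth in ground_truth.items():
        found = detections.get(sample, set())
        for smell in found & truth:
            tp[smell] += 1          # correctly detected
        for smell in found - truth:
            fp[smell] += 1          # reported but not annotated
        for smell in truth - found:
            fn[smell] += 1          # annotated but missed
    scores = {}
    for smell in set(tp) | set(fp) | set(fn):
        p = tp[smell] / (tp[smell] + fp[smell]) if tp[smell] + fp[smell] else 0.0
        r = tp[smell] / (tp[smell] + fn[smell]) if tp[smell] + fn[smell] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[smell] = {"precision": p, "recall": r, "f1": f1}
    return scores

# Hypothetical example: one annotated sample and one model response.
truth = {"Order.java": {"Long Method", "Feature Envy"}}
preds = {"Order.java": {"Long Method", "God Class"}}
print(score_per_smell(truth, preds))
```

Category-level and overall scores can then be obtained by aggregating the per-type counts before computing the same ratios.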
The remainder of this paper is structured as follows. Section 2 introduces the taxonomy of code smells, providing detailed definitions and categories that underpin the analysis. Section 3 presents the multilingual dataset developed for this study, highlighting its design, implementation across four programming languages, and associated software metrics. Section 4 outlines the detection methodology, including prompt design, evaluation metrics, and the experimental setup used to assess LLM performance. Section 5 provides a language-agnostic analysis, evaluating both models at the overall, category, and type levels. Section 6 examines language-specific variations in detection performance across Java, JavaScript, Python, and C++. Section 7 analyzes the cost implications of using GPT-4.0 and DeepSeek-V3 based on pricing models and code complexity. Section 8 compares the capabilities of LLM-based detection with the static analysis tool SonarQube, highlighting key differences in adaptability, explainability, and integration. Finally, Section 9 concludes with a discussion of key findings, limitations, and opportunities for future research.
2 Code Smells
Figure 2: Code Smell Taxonomy.
In software engineering, maintaining high code quality is essential for developing robust, maintainable, and scalable systems Sadik et al. [2023b]. As software projects evolve, they often accumulate design inefficiencies or anomalies, commonly referred to as code smells Wu et al. [2024]. These code smells serve as indicators of deeper structural issues within the codebase, potentially leading to increased technical debt
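As a concrete illustration, the hypothetical fragment below, which is not drawn from the study's dataset, shows one classic smell, Long Parameter List, and the kind of refactoring (Introduce Parameter Object) that removes it.

```python
# Hypothetical example of one classic smell: Long Parameter List.
# Every caller must pass nine loosely related values in the right order.
from dataclasses import dataclass

def create_invoice(name, street, city, zip_code, country, item, qty, price, tax):
    total = qty * price * (1 + tax)
    return f"{name}, {street} {city} {zip_code}, {country}: {item} x{qty} = {total:.2f}"

# A common refactoring (Introduce Parameter Object) groups related data instead.
@dataclass
class Address:
    street: str
    city: str
    zip_code: str
    country: str

@dataclass
class LineItem:
    item: str
    qty: int
    price: float
    tax: float

def create_invoice_refactored(customer: str, address: Address, line: LineItem) -> str:
    total = line.qty * line.price * (1 + line.tax)
    return f"{customer}, {address.street} {address.city}: {line.item} x{line.qty} = {total:.2f}"
```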