Publication Date
12-1-2025
Document Type
Article
Publication Title
Neurocomputing
Volume
656
DOI
10.1016/j.neucom.2025.131461
Abstract
As Large Language Models (LLMs) are increasingly integrated into software development workflows, understanding their reliability, error patterns, and interpretability in real-world development scenarios is crucial for establishing their practical utility. This study evaluates and interprets the performance of 15 open-source LLMs, including Code LLaMa, Granite Code, DeepSeek-Coder-V2, and Yi-Coder, on code translation and generation from requirements using the Rosetta Code dataset across diverse programming languages and tasks. Syntactic correctness and code quality are quantified using metrics such as CodeBLEU, chrF, and METEOR. Interpretability is explored through Feature Ablation and Shapley Value Sampling to elucidate prompt processing mechanisms. Results indicate high syntactic correctness and quality scores for models such as DeepSeek-Coder-V2 and Yi-Coder, alongside observed sensitivities to specific prompt components. This research provides quantitative and qualitative insights into the capabilities and limitations of open-source code-generating LLMs, informing model selection and the understanding of LLM-generated code.
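The sketch below is an illustrative assumption, not the authors' evaluation code: it shows how two of the surface-level metrics named in the abstract, chrF and METEOR, might be computed for a single generated solution using the sacrebleu and NLTK libraries. The example strings and variable names are hypothetical, and CodeBLEU would require a separate, code-aware implementation.

# Illustrative sketch (assumed packages: sacrebleu, nltk with 'wordnet' data).
import nltk
from nltk.translate.meteor_score import meteor_score
from sacrebleu.metrics import CHRF

nltk.download("wordnet", quiet=True)  # METEOR relies on WordNet synonym matching

# Hypothetical example: a reference Rosetta Code solution and a model's output.
reference = "def factorial(n):\n    return 1 if n <= 1 else n * factorial(n - 1)"
hypothesis = "def factorial(n):\n    if n <= 1:\n        return 1\n    return n * factorial(n - 1)"

# chrF: character n-gram F-score; corpus_score takes a list of hypotheses
# and a list of reference streams aligned with those hypotheses.
chrf_result = CHRF().corpus_score([hypothesis], [[reference]])

# METEOR: recent NLTK releases expect pre-tokenized references and hypothesis.
meteor = meteor_score([reference.split()], hypothesis.split())

print(f"chrF:   {chrf_result.score:.2f}")
print(f"METEOR: {meteor:.3f}")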
Funding Number
23-RSG-07-077
Keywords
Code conversion, Model explainability, Transcoding
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Department
Applied Data Science
Recommended Citation
Vishnu S. Pendyala and Neha B. Thakur. "Performance and Interpretability Analysis of Code Generation Large Language Models." Neurocomputing 656 (2025). https://doi.org/10.1016/j.neucom.2025.131461