Reason, Review, Repeat: Hybrid Chain of Thought to Mitigate Hallucinations in Large Language Models

Publication Date

1-1-2026

Document Type

Conference Proceeding

Publication Title

Proceedings of the 2026 20th International Conference on Ubiquitous Information Management and Communication, IMCOM 2026

DOI

10.1109/IMCOM69009.2026.11360964

Abstract

Hallucinations, outputs that are plausible yet incorrect, are prevalent across large language models and undermine confidence in their reliability. This study investigates approaches for mitigating hallucinations in large language models and evaluates their effectiveness using code generation tasks as a benchmark. Using 141 coding problems, the study compares zero-shot inference, Chain-of-Thought, and a hybrid Chain-of-Thought approach that incorporates review, optimization, and testing phases. Four large language models were evaluated under each approach: Llama 3.3 and Gemma 2 (general-purpose models), DeepSeek R1 (a model with internal Chain-of-Thought), and Qwen 2.5 Coder (a fine-tuned model). The evaluation considers accuracy, token utilization, and generation time. The results demonstrate that reasoning-enhanced approaches consistently improve accuracy by around 3% to 15%, with the hybrid Chain-of-Thought methodology showing the largest gains. The findings suggest that targeted prompting strategies that encourage reasoning, review, and testing can significantly enhance the reliability of LLM-generated code, with consistent improvements observed across different models.
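
The three prompting strategies compared in the abstract could be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the phase names (review, optimization, testing) come from the abstract, while the prompt wording, the function names, and the `generate` callable (any function that maps a prompt string to model output) are assumptions made for the example.

```python
from typing import Callable

# Hypothetical interface: any function that maps a prompt to model text.
Generate = Callable[[str], str]


def zero_shot(problem: str, generate: Generate) -> str:
    """Ask for a solution directly, with no reasoning instructions."""
    return generate(f"Solve the following coding problem.\n\n{problem}")


def chain_of_thought(problem: str, generate: Generate) -> str:
    """Ask the model to reason step by step before answering."""
    return generate(
        "Solve the following coding problem. "
        "Reason step by step before writing the final code.\n\n" + problem
    )


# Phase names are taken from the abstract; the instructions are assumed.
HYBRID_PHASES = [
    ("review", "Review the solution below for errors and fix any you find."),
    ("optimization", "Optimize the solution below without changing its behavior."),
    ("testing", "Write test cases for the solution below, check it against "
                "them, and return the corrected code."),
]


def hybrid_chain_of_thought(problem: str, generate: Generate) -> str:
    """Run a Chain-of-Thought draft through review, optimization, and testing."""
    solution = chain_of_thought(problem, generate)
    for _phase, instruction in HYBRID_PHASES:
        # Each phase sees the problem and the latest version of the solution.
        solution = generate(
            f"{instruction}\n\nProblem:\n{problem}\n\nSolution:\n{solution}"
        )
    return solution
```

Under these assumptions the hybrid strategy issues four model calls per problem (one draft plus three refinement phases) against one call for the other two strategies. That is one way the differences in token utilization and generation time mentioned in the abstract could arise.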

Keywords

Benchmark, Chain-of-Thought, Large Language Model, Natural Language Processing

Department

Computer Science
