Publication Date
1-15-2026
Document Type
Article
Publication Title
IEEE Access
Volume
14
DOI
10.1109/ACCESS.2026.3654948
First Page
18181
Last Page
18192
Abstract
Large Language Models (LLMs) that generate executable code are opening a new path to text-to-3D by converting natural language prompts into scripts for modeling software such as Blender. However, such methods currently lack systematic evaluation: existing benchmarks focus on direct mesh synthesis and overlook critical factors such as code execution reliability and human-aligned quality. To address this gap, we introduce CodeGen-3D, a comprehensive benchmark and protocol for assessing 3D modeling via code generation. The benchmark consists of 100 diverse prompts derived from Objaverse with rich captions (Cap3D, DiffuRank, and Llama-based), a standardized Blender execution and six-view rendering pipeline, and a three-part evaluation protocol that assesses: 1) error rate (compilation and runtime failures), 2) multi-view CLIP alignment (average and max) for semantic similarity, and 3) a VLM-as-a-Judge pairwise preference task (using GPT-4o) for perceptual quality and prompt adherence. We report results for eight general-purpose LLMs, a specialized BlenderLLM, and an iterative self-correcting agent. Our analysis shows that specialized and iterative systems are substantially more reliable (3–4% failure rate vs. 21–92% for general LLMs) and obtain the highest GPT-4o preference rates (e.g., a 33.3% win rate for the iterative agent), even when CLIP scores are similar. This gap between CLIP-based alignment and judge-based preferences underscores the need for feedback-driven, multi-signal evaluation and suggests that practical text-to-3D systems should combine strong general LLMs with domain specialization or inexpensive self-refinement loops rather than relying on single-pass code generation. CodeGen-3D establishes a practical, reproducible benchmark and offers strong baselines for future research at the intersection of LLMs, code generation, and 3D modeling.
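Illustrative Sketch
To make the protocol concrete, the sketch below shows one plausible way to compute the first two signals: headless execution of an LLM-generated bpy script (failures contribute to the error rate) and multi-view CLIP alignment (average and max) over the rendered views. The CLIP checkpoint, timeout value, and function names are illustrative assumptions, not the paper's exact implementation; only Blender's standard --background/--python command-line usage and the Hugging Face transformers CLIP API are taken as given.

# Minimal sketch of two parts of the CodeGen-3D protocol. Assumptions (not from
# the paper): the CLIP checkpoint, the 300 s timeout, and all function names.
import subprocess

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def run_blender_script(script_path: str, timeout_s: int = 300) -> tuple[bool, str]:
    """Run a generated bpy script headlessly; a nonzero exit code or a timeout
    counts as an execution failure for the error-rate metric."""
    try:
        proc = subprocess.run(
            ["blender", "--background", "--python", script_path],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return proc.returncode == 0, proc.stderr
    except subprocess.TimeoutExpired:
        return False, "timeout"

_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def multiview_clip_alignment(prompt: str, view_paths: list[str]) -> tuple[float, float]:
    """Score each rendered view against the prompt; return (average, max) similarity."""
    images = [Image.open(p).convert("RGB") for p in view_paths]
    inputs = _processor(text=[prompt], images=images,
                        return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = _model(**inputs)
    # CLIPModel returns L2-normalized embeddings, so the dot product below is
    # cosine similarity; shape (num_views, 1) -> (num_views,).
    sims = (out.image_embeds @ out.text_embeds.T).squeeze(-1)
    return sims.mean().item(), sims.max().item()

# Usage: execute a generated script, then score its six renders.
# ok, stderr = run_blender_script("generated_chair.py")
# if ok:
#     avg_sim, max_sim = multiview_clip_alignment(
#         "a wooden rocking chair", [f"view_{i}.png" for i in range(6)])

Aggregating per-view scores as both an average and a maximum, as the abstract describes, is useful because a single flattering viewpoint can inflate the max while the average reflects consistency across all six views.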
Keywords
benchmarking and evaluation, Blender (bpy), CLIP alignment, code generation, iterative agents, large language models (LLMs), LLM-as-a-judge, procedural modeling, text-to-3D
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Department
Mechanical Engineering
Recommended Citation
Hao Ji, Kotha Aditya, Sebastian Escalante, and Yunjian Qiu. "CodeGen-3D: A Benchmark for Evaluating LLMs in Zero-Shot and Iterative 3D Modeling in Blender." IEEE Access 14 (2026): 18181–18192. https://doi.org/10.1109/ACCESS.2026.3654948