Publication Date
1-15-2026
Document Type
Article
Publication Title
IEEE Access
Volume
14
DOI
10.1109/ACCESS.2026.3654948
First Page
18181
Last Page
18192
Abstract
Large Language Models (LLMs) that generate executable code are opening a new path to text-to-3D by converting natural language prompts into scripts for modeling software such as Blender. However, such methods currently lack systematic evaluation: existing benchmarks focus on direct mesh synthesis and overlook critical factors such as code execution reliability and human-aligned quality. To address this gap, we introduce CodeGen-3D, a comprehensive benchmark and protocol for assessing 3D modeling via code generation. The benchmark consists of 100 diverse prompts derived from Objaverse with rich captions (Cap3D, DiffuRank, and Llama-based), a standardized Blender execution and six-view rendering pipeline, and a three-part evaluation protocol that assesses: 1) error rate (compilation and runtime failures), 2) multi-view CLIP alignment (average and max) for semantic similarity, and 3) a VLM-as-a-Judge pairwise preference task (using GPT-4o) for perceptual quality and prompt adherence. We report results for eight general-purpose LLMs, a specialized BlenderLLM, and an iterative self-correcting agent. Our analysis shows that specialized and iterative systems are substantially more reliable (3–4% failure rate vs. 21–92% for general LLMs) and obtain the highest GPT-4o preference rates (e.g., a 33.3% win rate for the iterative agent), even when CLIP scores are similar. This gap between CLIP-based alignment and judge-based preferences underscores the need for feedback-driven, multi-signal evaluation and suggests that practical text-to-3D systems should combine strong general LLMs with domain specialization or inexpensive self-refinement loops rather than relying on single-pass code generation. CodeGen-3D establishes a practical, reproducible benchmark and offers strong baselines for future research at the intersection of LLMs, code generation, and 3D modeling.
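Illustrative Sketch
To make the protocol concrete, the sketch below shows one plausible way to compute the first two signals: headless execution of an LLM-generated bpy script (failures contribute to the error rate) and multi-view CLIP alignment (average and max) over the rendered views. The CLIP checkpoint, timeout value, and function names are illustrative assumptions, not the paper's exact implementation; only Blender's standard --background/--python command-line usage and the Hugging Face transformers CLIP API are taken as given.

# Minimal sketch of two parts of the CodeGen-3D protocol. Assumptions (not from
# the paper): the CLIP checkpoint, the 300 s timeout, and all function names.
import subprocess

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def run_blender_script(script_path: str, timeout_s: int = 300) -> tuple[bool, str]:
    """Run a generated bpy script headlessly; a nonzero exit code or a timeout
    counts as an execution failure for the error-rate metric."""
    try:
        proc = subprocess.run(
            ["blender", "--background", "--python", script_path],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return proc.returncode == 0, proc.stderr
    except subprocess.TimeoutExpired:
        return False, "timeout"

_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def multiview_clip_alignment(prompt: str, view_paths: list[str]) -> tuple[float, float]:
    """Score each rendered view against the prompt; return (average, max) similarity."""
    images = [Image.open(p).convert("RGB") for p in view_paths]
    inputs = _processor(text=[prompt], images=images,
                        return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = _model(**inputs)
    # CLIPModel returns L2-normalized embeddings, so the dot product below is
    # cosine similarity; shape (num_views, 1) -> (num_views,).
    sims = (out.image_embeds @ out.text_embeds.T).squeeze(-1)
    return sims.mean().item(), sims.max().item()

# Usage: execute a generated script, then score its six renders.
# ok, stderr = run_blender_script("generated_chair.py")
# if ok:
#     avg_sim, max_sim = multiview_clip_alignment(
#         "a wooden rocking chair", [f"view_{i}.png" for i in range(6)])

Aggregating per-view scores as both an average and a maximum, as the abstract describes, is useful because a single flattering viewpoint can inflate the max while the average reflects consistency across all six views.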
Keywords
benchmarking and evaluation, Blender (bpy), CLIP alignment, code generation, iterative agents, large language models (LLMs), LLM-as-a-judge, procedural modeling, text-to-3D
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Department
Mechanical Engineering
Recommended Citation
Hao Ji, Kotha Aditya, Sebastian Escalante, and Yunjian Qiu. "CodeGen-3D: A Benchmark for Evaluating LLMs in Zero-Shot and Iterative 3D Modeling in Blender." IEEE Access 14 (2026): 18181–18192. https://doi.org/10.1109/ACCESS.2026.3654948