GameCraft-Bench

Can Agents Build Playable Games End-to-End in a Real Game Engine?

Tongxu Luo1,2*, Rongsheng Wang1*, Jiaxi Bi2,4*†, Chenming Xu1,2*, Zhengyang Tang1,3*, Jianlong Chen1, Juhao Liang1, Ke Ji1, Shuqi Guo1,2, Yuhao Du1,2, Fan Bu1,2, Wenyu Du5, Xiaotong Zhang4, Kyle Li7, Shaobo Wang6, Linfeng Zhang6, Yuxuan Liu3, Xin Lai3, Chenxin Li3, Yiduo Guo3, Zhexin Zhang3, Xinyuan Wang3, Tianyi Bai3, Ziniu Li3, Benyou Wang1,2‡

1The Chinese University of Hong Kong, Shenzhen; 2Shenzhen Loop Area Institute; 3Hunyuan Team, Tencent; 4USTB; 5DualverseAI; 6SJTU; 7NUS

*Equal contributions. †Work done during an internship at SLAI. ‡Corresponding author.


GameCraft-Bench evaluates whether coding agents can transform natural-language game specifications into complete, playable Godot projects. Each submission includes a game artifact and replayable interaction traces; the verifier launches the game, replays the traces, records gameplay evidence, and scores observed play with a hidden task rubric.

140 Tasks
15 Game Families
4 Rubric Categories
Godot 4 Engine

Abstract

Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay.

A benchmark for this setting should satisfy three desiderata: Engine Grounding (Desideratum I), Artifact Completeness (Desideratum II), and Interactive Verification (Desideratum III). Existing game-generation benchmarks cover parts of this contract but leave open whether agents can carry user intent all the way to a complete engine-native game artifact whose behavior is judged through interaction.

GameCraft-Bench instantiates this framework with 140 Godot tasks across 15 game families. For each task, an agent receives a game specification and must produce a complete Godot project plus replayable demonstration traces. The verifier replays the traces and scores observed play across core mechanics, content depth, functional visuals, and art and presentation. Frontier coding agents remain far from reliable end-to-end game generation: the strongest configuration reaches only 41.46%, and most agents score below 40%.

Overview of GameCraft-Bench: agents turn natural-language game specifications into complete Godot projects, the verifier replays submitted interaction traces, and rubric-guided judging scores gameplay evidence.

Benchmark Design

GameCraft-Bench is organized around three desiderata for end-to-end game generation, compared against prior benchmarks, and instantiated as a 140-task Godot suite spanning 15 game families.

Desiderata

Engine Grounding

Games should be evaluated within a concrete game generation and runtime environment.

Artifact Completeness

Games should be evaluated as complete launchable artifacts rather than isolated components.

Interactive Verification

Games should be judged by observed behavior when they are played interactively.

Comparison

Benchmark  Engine Grounding  Artifact Completeness  Interactive Verification
GameDevBench Godot × ×
OpenGame-Bench Web ×
WebGameBench* Web
GameCraft-Bench Godot

*WebGameBench is concurrent with our work: it evaluates delivered Web games, while GameCraft-Bench targets complete projects in a dedicated game engine.

Task Suite

The benchmark spans diverse 2D game-generation demands, from continuous control and collision to rule systems, progression, exploration, and presentation-heavy interaction. Counts exclude the example task.

Platformer 19
Strategy 17
Tycoon 16
Open-world 15
Roguelike 14
Visual novel 11
Puzzle 8
Shooter 7
Simulation 6
Card game 5
Horror 5
Rhythm 5
Idle 4
Racing 4
Sports 4
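
As a quick consistency check, the per-family counts listed above can be tallied against the headline figures of 140 tasks and 15 game families. The snippet below is only an illustrative tally; the dictionary name `family_counts` is ours, and the numbers are copied from the list.

```python
# Per-family task counts, copied from the task-suite list (example task excluded).
family_counts = {
    "Platformer": 19,
    "Strategy": 17,
    "Tycoon": 16,
    "Open-world": 15,
    "Roguelike": 14,
    "Visual novel": 11,
    "Puzzle": 8,
    "Shooter": 7,
    "Simulation": 6,
    "Card game": 5,
    "Horror": 5,
    "Rhythm": 5,
    "Idle": 4,
    "Racing": 4,
    "Sports": 4,
}

total = sum(family_counts.values())               # total number of tasks
num_families = len(family_counts)                 # number of game families
largest = max(family_counts, key=family_counts.get)  # family with the most tasks

print(total, num_families, largest)  # prints: 140 15 Platformer
```

The counts sum to 140 across 15 families, matching the figures quoted for the benchmark.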

Evaluation Protocol

GameCraft-Bench operationalizes Interactive Verification by turning open-ended game submissions into replayed gameplay evidence and rubric-guided multimodal judging.

  1. Package task, environment, and hidden rubric.
  2. Collect project and replayable traces.
  3. Gate scoring on launchability.
  4. Replay traces into gameplay evidence.
  5. Score evidence with rubric item semantics.
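
The control flow of steps 3 to 5 can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the names `Submission`, `Evidence`, `evaluate`, `launches`, `replay`, and `judge` are hypothetical, and the choice to assign a score of zero when the game fails to launch is an assumption about what "gating" means here. Steps 1 and 2 (packaging the task and collecting the submission) are taken as already done.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Submission:
    project_dir: str        # hypothetical: path to the submitted Godot project
    trace_paths: List[str]  # hypothetical: replayable interaction traces


@dataclass
class Evidence:
    trace_path: str
    video_path: str         # hypothetical: recording produced by the replay


def evaluate(
    submission: Submission,
    launches: Callable[[str], bool],
    replay: Callable[[str, str], Evidence],
    judge: Callable[[List[Evidence]], float],
) -> float:
    """Return a score in [0, 100] for one submission."""
    # Step 3: gate scoring on launchability (assumed: a failed launch scores zero).
    if not launches(submission.project_dir):
        return 0.0
    # Step 4: replay every submitted trace into gameplay evidence.
    evidence = [replay(submission.project_dir, t) for t in submission.trace_paths]
    # Step 5: score the evidence against the hidden rubric.
    return judge(evidence)


# Stub callables so the sketch runs without Godot or a judge model.
def fake_replay(project_dir: str, trace_path: str) -> Evidence:
    return Evidence(trace_path=trace_path, video_path=trace_path + ".mp4")


def fake_judge(evidence: List[Evidence]) -> float:
    return 50.0 if evidence else 0.0


sub = Submission(project_dir="game/", trace_paths=["trace_0.json", "trace_1.json"])
broken_score = evaluate(sub, lambda _: False, fake_replay, fake_judge)
working_score = evaluate(sub, lambda _: True, fake_replay, fake_judge)
print(broken_score, working_score)  # prints: 0.0 50.0
```

Passing the launcher, replayer, and judge as callables keeps the gate-then-replay-then-score ordering visible while leaving the engine and judge details unspecified.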

Core Mechanics

Does the requested gameplay loop work under player input?

Content Depth

Is there enough progression, content, and state variation?

Functional Visuals

Can players read state, feedback, transitions, and outcomes?

Art and Presentation

Does the game feel visually coherent and appropriately styled?

End-to-end evaluation pipeline: the verifier checks launchability, replays submitted traces into gameplay videos and sampled frames, scores the resulting evidence with a hidden rubric, and aggregates category scores into the final score.
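
The page does not spell out how category scores are aggregated into the final score. Overall scores on the leaderboard are not the plain average of the four category columns (for the top row, the categories average 43.615 while Overall is 41.46), so the aggregation is presumably not a simple unweighted mean. The sketch below shows one hypothetical scheme, in which every rubric item carries points and the final score is the share of all points earned. The `RubricItem` type, the point values, and the weighting itself are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class RubricItem(NamedTuple):
    category: str    # one of the four rubric categories
    earned: float    # points awarded by the judge (hypothetical values)
    possible: float  # maximum points for this item (hypothetical values)


def category_scores(items: List[RubricItem]) -> Dict[str, float]:
    """Percentage of points earned within each category."""
    earned: Dict[str, float] = defaultdict(float)
    possible: Dict[str, float] = defaultdict(float)
    for item in items:
        earned[item.category] += item.earned
        possible[item.category] += item.possible
    return {c: 100.0 * earned[c] / possible[c] for c in possible}


def overall_score(items: List[RubricItem]) -> float:
    """Assumed aggregation: percentage of all rubric points earned."""
    total_possible = sum(item.possible for item in items)
    return 100.0 * sum(item.earned for item in items) / total_possible


# A made-up rubric for one task: categories have different numbers of items.
items = [
    RubricItem("Core Mechanics", 1, 1),
    RubricItem("Core Mechanics", 0, 1),
    RubricItem("Content Depth", 0, 1),
    RubricItem("Content Depth", 0, 1),
    RubricItem("Content Depth", 1, 1),
    RubricItem("Content Depth", 0, 1),
    RubricItem("Functional Visuals", 1, 1),
    RubricItem("Art and Presentation", 0, 1),
]

per_category = category_scores(items)
overall = overall_score(items)
plain_mean = sum(per_category.values()) / len(per_category)
print(per_category["Content Depth"], overall, plain_mean)  # prints: 25.0 37.5 43.75
```

Under this assumed scheme, categories with more rubric items pull the final score toward their own value, which is one way an overall score can differ from the unweighted mean of its category scores.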

Leaderboard

Overall and category scores on the full 140-task benchmark. Scores are percentages; Mechanics, Depth, Visuals, and Art correspond to the four rubric categories.

Rank  Agent        Model                      Overall  Mechanics  Depth  Visuals  Art
1     Claude Code  Claude Opus-4.7 (high)     41.46    55.34      39.48  42.78    36.86
2     Codex        OpenAI GPT-5.5 (high)      39.49    54.36      38.61  41.84    32.94
3     Kimi Code    Kimi Kimi-K2.6             30.65    39.76      28.07  33.66    27.99
4     Claude Code  Xiaomi MiMo MiMo-V2.5-Pro  24.10    32.33      22.59  27.45    20.65
5     Code Buddy   Zhipu GLM-5.1              18.29    25.23      17.80  21.14    14.59
6     Code Buddy   MiniMax MiniMax-M2.7       10.95    14.27      9.92   14.92    8.85
7     Codex        DeepSeek DeepSeek-V4-Pro   2.15     2.25       1.69   1.97     2.63

The strongest configuration reaches only 41.46%, indicating that even frontier coding agents remain far from reliable end-to-end game generation in a real engine.

Analysis Highlights

Recognizable mechanics are easier than complete games.

Agents produce recognizable local mechanics more often than they assemble them into complete, coherent interactive systems.

Judge results are stable but not bias-free.

The playability judge is stable on fixed gameplay evidence and broadly aligned with preliminary human calibration, but it shows a mild permissive bias on content and presentation.

Rendered gameplay feedback helps debugging.

Rendered gameplay feedback helps agents debug player-facing failures that are invisible from source code and terminal logs alone.

Execution effort alone does not predict quality.

Execution effort alone does not predict playable quality; agents must close the full build-replay-evaluation loop.

Game generation ability is not monolithic.

Mechanics, content, visual feedback, and presentation are only partially coupled across generated games.

Demos

Gameplay recordings show what agents built; playable builds open the corresponding Godot Web export only when requested.

Claude Opus-4.7 (Claude Code)
OpenAI GPT-5.5 (Codex)
Kimi Kimi-K2.6 (Kimi Code)
Xiaomi MiMo MiMo-V2.5-Pro (Claude Code)

Citation

@article{luo2026gamecraft,
  title={GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?},
  author={Luo, Tongxu and Wang, Rongsheng and Bi, Jiaxi and Xu, Chenming and Tang, Zhengyang and Chen, Jianlong and Liang, Juhao and Ji, Ke and Guo, Shuqi and Du, Yuhao and Bu, Fan and Du, Wenyu and Zhang, Xiaotong and Li, Kyle and Wang, Shaobo and Zhang, Linfeng and Liu, Yuxuan and Lai, Xin and Li, Chenxin and Guo, Yiduo and Zhang, Zhexin and Wang, Xinyuan and Bai, Tianyi and Li, Ziniu and Wang, Benyou},
  journal={arXiv preprint arXiv:2606.17861},
  year={2026}
}

Acknowledgment

GameCraft-Bench builds on Godot as the game engine runtime and Harbor as the benchmark and agent-execution harness. We thank the open-source communities behind these projects for making reproducible, end-to-end game-generation evaluation possible.

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.