GameCraft-Bench evaluates whether coding agents can transform natural-language game specifications into complete, playable Godot projects. Each submission includes a game artifact and replayable interaction traces; the verifier launches the game, replays the traces, records gameplay evidence, and scores observed play with a hidden task rubric.
Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay.
Evaluating this setting calls for three desiderata: Engine Grounding (Desideratum I), Artifact Completeness (Desideratum II), and Interactive Verification (Desideratum III). Existing game-generation benchmarks satisfy some of these desiderata, but leave open whether agents can carry user intent all the way to a complete engine-native game artifact whose behavior is judged through interaction.
GameCraft-Bench instantiates this framework with 140 Godot tasks across 15 game families. For each task, an agent receives a game specification and must produce a complete Godot project plus replayable demonstration traces. The verifier replays the traces and scores observed play across core mechanics, content depth, functional visuals, and art and presentation. Frontier coding agents remain far from reliable end-to-end game generation: the strongest configuration reaches only 41.46%, and most agents score below 40%.
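To make the idea of a replayable demonstration trace concrete, the following is a minimal sketch of what a trace and a deterministic replayer could look like. The JSON layout, the `InputEvent` fields, and the `load_trace` and `replay` helpers are all hypothetical illustrations; the actual GameCraft-Bench trace format is not described here.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class InputEvent:
    """One recorded input (hypothetical schema, not the benchmark's format)."""
    frame: int     # frame index at which the input takes effect
    action: str    # assumed input action name, e.g. "jump"
    pressed: bool  # True for press, False for release


def load_trace(text: str) -> list[InputEvent]:
    """Parse a JSON list of events and order them by frame (stable sort)."""
    events = [InputEvent(**raw) for raw in json.loads(text)]
    return sorted(events, key=lambda e: e.frame)


def replay(events: list[InputEvent], total_frames: int) -> Iterator[tuple[int, frozenset]]:
    """Yield (frame, held actions) for every frame of the replay."""
    held: set[str] = set()
    i = 0
    for frame in range(total_frames):
        # Apply every event scheduled at or before the current frame.
        while i < len(events) and events[i].frame <= frame:
            event = events[i]
            if event.pressed:
                held.add(event.action)
            else:
                held.discard(event.action)
            i += 1
        yield frame, frozenset(held)
```

Indexing events by frame rather than by wall-clock time is a design choice made for this sketch: it keeps the replay independent of how fast the verifier machine runs, so the same trace yields the same input sequence on every run.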
GameCraft-Bench is organized around three desiderata for end-to-end game generation, compared against prior benchmarks, and instantiated as a 140-task Godot suite spanning 15 game families.
Desideratum I: Engine Grounding. Games should be evaluated within a concrete game generation and runtime environment.
Desideratum II: Artifact Completeness. Games should be evaluated as complete launchable artifacts rather than isolated components.
Desideratum III: Interactive Verification. Games should be judged by observed behavior when they are played interactively.
| Benchmark | Engine Grounding | Artifact Completeness | Interactive Verification |
|---|---|---|---|
| GameDevBench | Godot | × | × |
| OpenGame-Bench | Web | ✓ | × |
| WebGameBench* | Web | ✓ | ✓ |
| GameCraft-Bench | Godot | ✓ | ✓ |
*WebGameBench is concurrent with our work: it evaluates delivered Web games, while GameCraft-Bench targets complete projects in a dedicated game engine.
The benchmark spans diverse 2D game-generation demands, from continuous control and collision to rule systems, progression, exploration, and presentation-heavy interaction. Counts exclude the example task.
GameCraft-Bench operationalizes Interactive Verification by turning open-ended game submissions into replayed gameplay evidence and rubric-guided multimodal judging.
Core mechanics: Does the requested gameplay loop work under player input?
Content depth: Is there enough progression, content, and state variation?
Functional visuals: Can players read state, feedback, transitions, and outcomes?
Art and presentation: Does the game feel visually coherent and appropriately styled?
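The four rubric questions above can be read as categories of a per-task checklist. The sketch below shows one way judged rubric items could be rolled up into category and overall percentages. The `RubricItem` type, the point values, and the pooled-points aggregation are assumptions made for illustration; the hidden task rubric and its actual weighting are not specified here.

```python
from collections import defaultdict
from dataclasses import dataclass

# Short names for the four rubric categories.
CATEGORIES = ("mechanics", "depth", "visuals", "art")


@dataclass(frozen=True)
class RubricItem:
    """One judged rubric entry (hypothetical structure)."""
    category: str
    max_points: float
    earned: float


def _validate(item: RubricItem) -> None:
    if item.category not in CATEGORIES:
        raise ValueError(f"unknown category: {item.category}")
    if not 0 <= item.earned <= item.max_points:
        raise ValueError("earned points must lie in [0, max_points]")


def category_scores(items: list[RubricItem]) -> dict[str, float]:
    """Percentage of available points earned in each category."""
    earned: dict[str, float] = defaultdict(float)
    possible: dict[str, float] = defaultdict(float)
    for item in items:
        _validate(item)
        earned[item.category] += item.earned
        possible[item.category] += item.max_points
    return {c: 100.0 * earned[c] / possible[c] for c in CATEGORIES if possible[c] > 0}


def overall_score(items: list[RubricItem]) -> float:
    """Percentage of all available points earned (assumed pooling rule)."""
    for item in items:
        _validate(item)
    total = sum(item.max_points for item in items)
    if total == 0:
        raise ValueError("rubric has no scorable points")
    return 100.0 * sum(item.earned for item in items) / total
```

Under this pooling assumption, a category with more available points pulls the overall score toward its own percentage, so the overall score need not equal the plain average of the four category scores.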
Overall and category scores on the full 140-task benchmark. Scores are percentages; Mechanics, Depth, Visuals, and Art correspond to the four rubric categories.
| Rank | Agent | Model | Overall↑ | Mechanics↑ | Depth↑ | Visuals↑ | Art↑ |
|---|---|---|---|---|---|---|---|
| 1 | Claude Code | | 41.46 | 55.34 | 39.48 | 42.78 | 36.86 |
| 2 | Codex | | 39.49 | 54.36 | 38.61 | 41.84 | 32.94 |
| 3 | Kimi Code | Kimi-K2.6 | 30.65 | 39.76 | 28.07 | 33.66 | 27.99 |
| 4 | Claude Code | | 24.10 | 32.33 | 22.59 | 27.45 | 20.65 |
| 5 | Code Buddy | | 18.29 | 25.23 | 17.80 | 21.14 | 14.59 |
| 6 | Code Buddy | | 10.95 | 14.27 | 9.92 | 14.92 | 8.85 |
| 7 | Codex | | 2.15 | 2.25 | 1.69 | 1.97 | 2.63 |
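The reported scores can be checked with a few lines of Python. The sketch below copies the numbers from the table and counts how many configurations fall below 40% overall and how many have Mechanics as their best category. Rows are keyed by rank because the same agent name appears in more than one row; the helper names are illustrative only.

```python
# (rank, agent, overall, mechanics, depth, visuals, art), copied from the table
ROWS = [
    (1, "Claude Code", 41.46, 55.34, 39.48, 42.78, 36.86),
    (2, "Codex", 39.49, 54.36, 38.61, 41.84, 32.94),
    (3, "Kimi Code", 30.65, 39.76, 28.07, 33.66, 27.99),
    (4, "Claude Code", 24.10, 32.33, 22.59, 27.45, 20.65),
    (5, "Code Buddy", 18.29, 25.23, 17.80, 21.14, 14.59),
    (6, "Code Buddy", 10.95, 14.27, 9.92, 14.92, 8.85),
    (7, "Codex", 2.15, 2.25, 1.69, 1.97, 2.63),
]
CATEGORIES = ("Mechanics", "Depth", "Visuals", "Art")


def top_category(row: tuple) -> str:
    """Name of the rubric category with the highest score in this row."""
    scores = dict(zip(CATEGORIES, row[3:]))
    return max(scores, key=scores.get)


def summarize(rows: list) -> tuple:
    """Return (configurations below 40% overall, rows where Mechanics is best)."""
    below_40 = sum(1 for row in rows if row[2] < 40.0)
    mechanics_top = sum(1 for row in rows if top_category(row) == "Mechanics")
    return below_40, mechanics_top


def category_mean(row: tuple) -> float:
    """Unweighted mean of the four category scores in this row."""
    return sum(row[3:]) / len(CATEGORIES)
```

On these numbers, `summarize(ROWS)` gives six configurations below 40% and Mechanics as the best category in five of the seven rows. For the top-ranked row, `category_mean` comes out near 43.6, above the reported 41.46 overall, so the overall score is evidently not a plain average of the four category scores.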
The strongest configuration reaches only 41.46%, indicating that even frontier coding agents remain far from reliable end-to-end game generation in a real engine.
Agents produce recognizable local mechanics more often than they assemble them into complete, coherent interactive systems.
The playability judge is stable on fixed gameplay evidence and broadly aligned with preliminary human calibration, but it shows a mild permissive bias on content and presentation.
Rendered gameplay feedback helps agents debug player-facing failures that are invisible from source code and terminal logs alone.
Execution effort alone does not predict playable quality; agents must close the full build-replay-evaluation loop.
Mechanics, content, visual feedback, and presentation are only partially coupled across generated games.
Gameplay recordings show what agents built; playable builds open the corresponding Godot Web export only when requested.
@article{luo2026gamecraft,
title={GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?},
author={Luo, Tongxu and Wang, Rongsheng and Bi, Jiaxi and Xu, Chenming and Tang, Zhengyang and Chen, Jianlong and Liang, Juhao and Ji, Ke and Guo, Shuqi and Du, Yuhao and Bu, Fan and Du, Wenyu and Zhang, Xiaotong and Li, Kyle and Wang, Shaobo and Zhang, Linfeng and Liu, Yuxuan and Lai, Xin and Li, Chenxin and Guo, Yiduo and Zhang, Zhexin and Wang, Xinyuan and Bai, Tianyi and Li, Ziniu and Wang, Benyou},
journal={arXiv preprint arXiv:2606.17861},
year={2026}
}
GameCraft-Bench builds on Godot as the game engine runtime and Harbor as the benchmark and agent-execution harness. We thank the open-source communities behind these projects for making reproducible, end-to-end game-generation evaluation possible.
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.