GameCraft-Bench

GameCraft-Bench evaluates whether coding agents can transform natural-language game specifications into complete, playable Godot projects. Each submission includes a game artifact and replayable interaction traces; the verifier launches the game, replays the traces, records gameplay evidence, and scores observed play with a hidden task rubric.

140 Tasks

15 Game Families

4 Rubric Categories

Godot 4 Engine

Abstract

Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay.

Evaluating this setting calls for three desiderata: (I) Engine Grounding, (II) Artifact Completeness, and (III) Interactive Verification. Existing game-generation benchmarks cover parts of this contract, but leave open whether agents can carry user intent all the way to a complete, engine-native game artifact whose behavior is judged through interaction.

GameCraft-Bench instantiates this framework with 140 Godot tasks across 15 game families. For each task, an agent receives a game specification and must produce a complete Godot project plus replayable demonstration traces. The verifier replays the traces and scores observed play across core mechanics, content depth, functional visuals, and art and presentation. Frontier coding agents remain far from reliable end-to-end game generation: the strongest configuration reaches only 41.46%, and all six other configurations score below 40%.

Overview of GameCraft-Bench: agents turn natural-language game specifications into complete Godot projects, the verifier replays submitted interaction traces, and rubric-guided judging scores gameplay evidence.

Benchmark Design

GameCraft-Bench is organized around three desiderata for end-to-end game generation, compared against prior benchmarks, and instantiated as a 140-task Godot suite spanning 15 game families.

Desiderata

Engine Grounding

Games should be evaluated within a concrete game generation and runtime environment.

Artifact Completeness

Games should be evaluated as complete launchable artifacts rather than isolated components.

Interactive Verification

Games should be judged by observed behavior when they are played interactively.

Comparison

Benchmark	Engine Grounding	Artifact Completeness	Interactive Verification
GameDevBench	Godot	×	×
OpenGame-Bench	Web	✓	×
WebGameBench*	Web	✓	✓
GameCraft-Bench	Godot	✓	✓

*WebGameBench is concurrent with our work: it evaluates delivered Web games, while GameCraft-Bench targets complete projects in a dedicated game engine.

Task Suite

The benchmark spans diverse 2D game-generation demands, from continuous control and collision to rule systems, progression, exploration, and presentation-heavy interaction. Counts exclude the example task.

Game Family	Tasks
Platformer	19
Strategy	17
Tycoon	16
Open-world	15
Roguelike	14
Visual novel	11
Puzzle	8
Shooter	7
Simulation	6
Card game	5
Horror	5
Rhythm	5
Idle	4
Racing	4
Sports	4
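
As a quick sanity check, the per-family counts above can be tallied against the stated totals of 140 tasks and 15 families. The dictionary below simply transcribes the list; the variable names are illustrative and not part of the benchmark.

```python
# Per-family task counts, transcribed from the task suite listing above.
FAMILY_COUNTS = {
    "Platformer": 19, "Strategy": 17, "Tycoon": 16, "Open-world": 15,
    "Roguelike": 14, "Visual novel": 11, "Puzzle": 8, "Shooter": 7,
    "Simulation": 6, "Card game": 5, "Horror": 5, "Rhythm": 5,
    "Idle": 4, "Racing": 4, "Sports": 4,
}

total_tasks = sum(FAMILY_COUNTS.values())
num_families = len(FAMILY_COUNTS)

# Number of tasks contributed by the three largest families.
top3_tasks = sum(sorted(FAMILY_COUNTS.values(), reverse=True)[:3])

print(total_tasks, num_families, top3_tasks)  # prints: 140 15 52
```

The three largest families (Platformer, Strategy, Tycoon) account for 52 of the 140 tasks, so the suite is weighted toward these genres while still covering the smaller families.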

Evaluation Protocol

GameCraft-Bench operationalizes Interactive Verification by turning open-ended game submissions into replayed gameplay evidence and rubric-guided multimodal judging.

Package task, environment, and hidden rubric.
Collect project and replayable traces.
Gate scoring on launchability.
Replay traces into gameplay evidence.
Score evidence with rubric item semantics.
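
The five steps above can be sketched as a small control flow. This is a minimal illustration under stated assumptions, not the benchmark's implementation: the `Submission` fields, the `launches`, `replay`, and `judge` callables, the zero score for a project that fails to launch, and the unweighted per-category mean are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RubricItem:
    category: str     # one of the four rubric categories
    description: str  # what the judge should look for in the evidence


@dataclass
class Submission:
    project_dir: str   # hypothetical: path to the submitted Godot project
    traces: List[str]  # hypothetical: paths to replayable interaction traces


def evaluate(
    submission: Submission,
    rubric: List[RubricItem],
    launches: Callable[[str], bool],
    replay: Callable[[str, str], str],
    judge: Callable[[RubricItem, List[str]], float],
) -> Dict[str, float]:
    """Return one score in [0, 100] per rubric category."""
    categories = sorted({item.category for item in rubric})

    # Launchability gate (assumed here to zero out every category on failure).
    if not launches(submission.project_dir):
        return {c: 0.0 for c in categories}

    # Replay each submitted trace into a piece of gameplay evidence.
    evidence = [replay(submission.project_dir, t) for t in submission.traces]

    # Judge every rubric item against the evidence; judge() returns a value in [0, 1].
    per_category: Dict[str, List[float]] = {c: [] for c in categories}
    for item in rubric:
        per_category[item.category].append(judge(item, evidence))

    # Assumed aggregation: unweighted mean of item scores within each category.
    return {c: 100.0 * sum(v) / len(v) for c, v in per_category.items()}


# Toy run with stub callables standing in for the engine and the multimodal judge.
rubric = [
    RubricItem("Core Mechanics", "player can jump"),
    RubricItem("Core Mechanics", "enemies damage the player"),
    RubricItem("Art and Presentation", "consistent art style"),
]
submission = Submission("game/", ["trace_a.json"])


def stub_judge(item: RubricItem, evidence: List[str]) -> float:
    return 1.0 if "jump" in item.description else 0.5


passed = evaluate(submission, rubric, lambda p: True, lambda p, t: f"video:{t}", stub_judge)
failed = evaluate(submission, rubric, lambda p: False, lambda p, t: f"video:{t}", stub_judge)
print(passed)
print(failed)
```

Passing the engine and judge in as callables keeps the sketch runnable without Godot or a model backend; a real harness would replace the stubs with its own launch, replay, and judging components.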

Core Mechanics

Does the requested gameplay loop work under player input?

Content Depth

Is there enough progression, content, and state variation?

Functional Visuals

Can players read state, feedback, transitions, and outcomes?

Art and Presentation

Does the game feel visually coherent and appropriately styled?

Playability verification pipeline: the verifier checks launchability, replays submitted traces into gameplay videos and sampled frames, scores the resulting evidence with a hidden rubric, and aggregates category scores into the final score.

Leaderboard

Overall and category scores on the full 140-task benchmark. Scores are percentages; Mechanics, Depth, Visuals, and Art correspond to the four rubric categories.

Rank	Agent	Model	Overall↑	Mechanics↑	Depth↑	Visuals↑	Art↑
1	Claude Code	Opus-4.7 high	41.46	55.34	39.48	42.78	36.86
2	Codex	GPT-5.5 high	39.49	54.36	38.61	41.84	32.94
3	Kimi Code	Kimi-K2.6	30.65	39.76	28.07	33.66	27.99
4	Claude Code	MiMo-V2.5-Pro	24.10	32.33	22.59	27.45	20.65
5	Code Buddy	GLM-5.1	18.29	25.23	17.80	21.14	14.59
6	Code Buddy	MiniMax-M2.7	10.95	14.27	9.92	14.92	8.85
7	Codex	DeepSeek-V4-Pro	2.15	2.25	1.69	1.97	2.63

The strongest configuration reaches only 41.46%, indicating that even frontier coding agents remain far from reliable end-to-end game generation in a real engine.
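
To make the table easier to interrogate, the short script below transcribes the leaderboard rows and compares each reported Overall score with the plain average of its four category columns, and also reports each configuration's strongest category. Only the numbers come from the table; the variable names and the comparison itself are illustrative.

```python
# Rows transcribed from the leaderboard above:
# (agent, model, overall, mechanics, depth, visuals, art)
ROWS = [
    ("Claude Code", "Opus-4.7 high", 41.46, 55.34, 39.48, 42.78, 36.86),
    ("Codex", "GPT-5.5 high", 39.49, 54.36, 38.61, 41.84, 32.94),
    ("Kimi Code", "Kimi-K2.6", 30.65, 39.76, 28.07, 33.66, 27.99),
    ("Claude Code", "MiMo-V2.5-Pro", 24.10, 32.33, 22.59, 27.45, 20.65),
    ("Code Buddy", "GLM-5.1", 18.29, 25.23, 17.80, 21.14, 14.59),
    ("Code Buddy", "MiniMax-M2.7", 10.95, 14.27, 9.92, 14.92, 8.85),
    ("Codex", "DeepSeek-V4-Pro", 2.15, 2.25, 1.69, 1.97, 2.63),
]
CATEGORIES = ("Mechanics", "Depth", "Visuals", "Art")

gaps = []       # plain category mean minus reported Overall, per row
strongest = []  # name of the highest-scoring category, per row
for agent, model, overall, *cats in ROWS:
    plain_mean = sum(cats) / len(cats)
    gaps.append(plain_mean - overall)
    strongest.append(CATEGORIES[cats.index(max(cats))])
    print(f"{agent} / {model}: reported {overall:.2f}, "
          f"plain category mean {plain_mean:.2f}, strongest {strongest[-1]}")
```

For the top six rows the plain category mean sits roughly 1 to 2.5 points above the reported Overall, so the final score does not appear to be an unweighted average of the four columns; the exact aggregation is not stated on this page. Mechanics is the strongest category for the top five configurations.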

Analysis Highlights

Recognizable mechanics are easier than complete games.

Agents often produce recognizable local mechanics, but still fail to assemble them into complete, coherent interactive systems.

Judge results are stable but not bias-free.

The playability judge is stable on fixed gameplay evidence and broadly aligned with preliminary human calibration, but it shows a mild permissive bias on content and presentation.
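
One way such a check could be set up is sketched below: stability as the spread of repeated judge scores on the same evidence, and bias as the mean signed difference between judge and human scores. Both metric definitions and all numbers are assumptions made for illustration; they are synthetic and are not results or procedures from the benchmark.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def judge_stability(repeated_scores: Sequence[Sequence[float]]) -> float:
    """Mean per-item standard deviation across repeated judge runs on fixed evidence."""
    return mean(pstdev(runs) for runs in repeated_scores)


def signed_bias(judge_scores: List[float], human_scores: List[float]) -> float:
    """Mean of (judge - human); positive values indicate a permissive judge."""
    return mean(j - h for j, h in zip(judge_scores, human_scores))


# Synthetic example data (made up for this sketch, not benchmark measurements).
repeated = [[70.0, 72.0, 71.0], [40.0, 40.0, 40.0]]  # two items, three judge runs each
judge_avg = [60.0, 50.0]
human_avg = [55.0, 50.0]

stability = judge_stability(repeated)
bias = signed_bias(judge_avg, human_avg)
print(f"mean per-item std: {stability:.3f}, signed bias: {bias:+.1f}")
```

A small standard deviation with a positive signed bias would correspond to the pattern described above: a judge that is repeatable on fixed evidence yet somewhat more generous than human raters.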

Rendered gameplay feedback helps debugging.

Rendered gameplay feedback helps agents debug player-facing failures that are invisible from source code and terminal logs alone.

Execution effort alone does not predict quality.

Execution effort alone does not predict playable quality; agents must close the full build-replay-evaluation loop.

Game generation ability is not monolithic.

Mechanics, content, visual feedback, and presentation are only partially coupled across generated games.
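
A simple way to see this partial coupling is to look at the spread between the best and worst category score within a single game. The sketch below uses four entries from the Demos section; the choice of entries and the spread measure are illustrative, and demo games are not a representative sample of the benchmark.

```python
CATEGORIES = ("Mechanics", "Depth", "Visuals", "Art")

# Category scores for a few demo games, transcribed from the Demos section.
DEMOS = {
    "Racing Drift Circuit (Opus-4.7)": (53.0, 62.0, 0.0, 58.0),
    "Tycoon Garden Ecosystem Keeper (Opus-4.7)": (100.0, 82.0, 58.0, 51.0),
    "Platformer Cozy Harbor Delivery (Opus-4.7)": (74.0, 78.0, 85.0, 72.0),
    "Visualnovel Courtroom Clue Trial (GPT-5.5)": (98.0, 66.0, 69.0, 28.0),
}


def spread(scores):
    """Gap between the strongest and weakest category in one game."""
    return max(scores) - min(scores)


spreads = {name: spread(scores) for name, scores in DEMOS.items()}
widest = max(spreads, key=spreads.get)

for name, scores in DEMOS.items():
    weakest = CATEGORIES[scores.index(min(scores))]
    print(f"{name}: spread {spreads[name]:.0f}, weakest {weakest}")
```

Within these four games the spread ranges from 13 points (Platformer Cozy Harbor Delivery) to 70 points (Visualnovel Courtroom Clue Trial), which shows how a game can score near the top on mechanics while scoring poorly on visuals or art.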

Demos

Gameplay recordings show what agents built; playable builds open the corresponding Godot Web export only when requested.

Opus-4.7 (Claude Code)

Game	Overall	Mechanics	Depth	Visuals	Art
Horror Dollhouse	64.20	72.00	74.00	60.00	53.00
Idle Spell Tower	59.70	78.00	61.00	28.00	64.00
Platformer Cozy Harbor Delivery	76.30	74.00	78.00	85.00	72.00
Platformer Dig Descent	52.90	70.00	46.00	74.00	43.00
Racing Drift Circuit	50.00	53.00	62.00	0.00	58.00
Roguelike Dice Throne	52.80	77.00	53.00	64.00	38.00
Shooter Void Patrol	52.60	78.00	40.00	60.00	51.00
Sports Skateboard Park	57.20	80.00	54.00	65.00	48.00
Strategy Pirate Fleet	51.90	58.00	44.00	73.00	48.00
Tycoon Garden Ecosystem Keeper	70.50	100.00	82.00	58.00	51.00

GPT-5.5 (Codex)

Game	Overall	Mechanics	Depth	Visuals	Art
Idle Factory Planet	71.80	93.00	74.00	43.00	73.00
Idle Spell Tower	70.90	100.00	89.00	28.00	59.00
Openworld Airship Trader	60.40	78.00	53.00	45.00	67.00
Openworld Submarine	61.60	75.00	63.00	40.00	64.00
Platformer Cozy Harbor Delivery	61.20	60.00	61.00	70.00	58.00
Puzzle Goo Architect	59.20	94.00	47.00	65.00	53.00
Racing Drift Circuit	52.60	62.00	59.00	20.00	57.00
Shooter Void Patrol	42.50	81.00	34.00	39.00	36.00
Sports Skateboard Park	59.40	68.00	57.00	60.00	57.00
Strategy Skirmish	58.20	69.00	66.00	59.00	45.00
Tycoon Submarine Pressure Rescue	68.70	90.00	85.00	75.00	41.00
Visualnovel Courtroom Clue Trial	58.10	98.00	66.00	69.00	28.00

Kimi-K2.6 (Kimi Code)

Game	Overall	Mechanics	Depth	Visuals	Art
Strategy Skirmish	39.33	52.50	36.25	46.39	33.75
Sports Boxing Gym	41.56	38.33	42.50	37.50	43.75
Platformer Thunder Valkyrie	40.48	70.00	36.00	31.00	36.38
Openworld Fishing	25.02	25.00	15.00	0.00	45.78

MiMo-V2.5-Pro (Claude Code)

Game	Overall	Mechanics	Depth	Visuals	Art
Platformer Dread Wings	61.99	100.00	25.00	70.47	79.06
Openworld Cartographer	60.35	51.25	56.25	27.50	63.33
Shooter Void Patrol	46.38	80.00	20.00	59.69	49.22
Sports Skateboard Park	44.97	36.25	47.50	20.00	43.12
Tycoon Railroad Baron	44.08	67.50	35.00	48.75	30.00

Citation

@article{luo2026gamecraft,
  title={GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?},
  author={Luo, Tongxu and Wang, Rongsheng and Bi, Jiaxi and Xu, Chenming and Tang, Zhengyang and Chen, Jianlong and Liang, Juhao and Ji, Ke and Guo, Shuqi and Du, Yuhao and Bu, Fan and Du, Wenyu and Zhang, Xiaotong and Li, Kyle and Wang, Shaobo and Zhang, Linfeng and Liu, Yuxuan and Lai, Xin and Li, Chenxin and Guo, Yiduo and Zhang, Zhexin and Wang, Xinyuan and Bai, Tianyi and Li, Ziniu and Wang, Benyou},
  journal={arXiv preprint arXiv:2606.17861},
  year={2026}
}

Acknowledgment

GameCraft-Bench builds on Godot as the game engine runtime and Harbor as the benchmark and agent-execution harness. We thank the open-source communities behind these projects for making reproducible, end-to-end game-generation evaluation possible.

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.