

Can LLMs Truly Build a Complete Project Repository from Scratch? (Chinese Talk)
Findings from Long-Horizon Generation Evaluation
Recent progress in code generation has demonstrated strong performance on short-horizon tasks such as function synthesis and local code completion. However, whether large language models can sustain coherent planning, architectural consistency, and execution reliability across the full lifecycle of building a real project repository remains an open question.
This talk presents findings from NL2Repo-Bench, a long-horizon evaluation benchmark that challenges models to construct a complete, runnable Python repository from scratch using only a natural language specification and an empty workspace. Results show that even with a perfectly designed prompt, current models frequently fail under long-horizon settings, exhibiting logical collapse, fragile cross-file dependencies, and insufficient global planning.
The study highlights long-horizon reasoning as a critical bottleneck for autonomous coding agents.
Paper
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
https://arxiv.org/pdf/2512.12730
Speaker
Shengda Long
Master’s Student, Peking University
Host
Ruiwen Zhou
PhD Student, National University of Singapore