Can LLMs Truly Build a Complete Project Repository from Scratch? (Chinese Talk)

Hosted by NICE AI Talk
YouTube
About Event

Can LLMs Truly Build a Complete Project Repository from Scratch?
Findings from Long-Horizon Generation Evaluation

Recent progress in code generation has demonstrated strong performance on short-horizon tasks such as function synthesis and local code completion. However, whether large language models can sustain coherent planning, architectural consistency, and execution reliability across the full lifecycle of building a real project repository remains an open question.

This talk presents findings from NL2Repo-Bench, a long-horizon evaluation benchmark that challenges models to construct a complete, runnable Python repository from scratch using only a natural language specification and an empty workspace. Results show that even with a perfectly designed prompt, current models frequently fail under long-horizon settings, exhibiting logical collapse, fragile cross-file dependencies, and insufficient global planning.

The study highlights long-horizon reasoning as a critical bottleneck for autonomous coding agents.


Paper

NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
https://arxiv.org/pdf/2512.12730


Speaker

Shengda Long
Master’s Student, Peking University

Host

Ruiwen Zhou
PhD Student, National University of Singapore
