QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies
QuantCode-Bench is a benchmark designed to evaluate large language models on generating executable algorithmic trading strategies from natural language descriptions using the Backtrader framework. The benchmark comprises 400 tasks sourced from Reddit, TradingView, StackExchange, GitHub, and synthetic data, categorized by difficulty. Evaluation follows a four-stage pipeline: syntactic correctness (Compilation), successful backtest execution (Backtest), presence of at least one trade (Trade), and semantic alignment with the task description via an LLM judge (Judge). This layered approach reveals that while frontier models achieve near-perfect compilation, they struggle with correct operationalization of trading logic, proper API usage, and adherence to task semantics, with the best single-turn Judge Pass around 75.8% (claude-opus-4.6). In an agentic multi-turn setting with iterative feedback, performance improves dramatically, reaching 95-98% Judge Pass for top models, indicating that many errors are locally repairable. Error analysis shows that the most common failures are strategies that compile and backtest but generate no trades (17.8%) and incorrect handling of Backtrader line objects (13.1%). The benchmark highlights that trading strategy generation is a distinct domain-specific code generation task requiring not only technical correctness but also deep financial logic and semantic understanding. QuantCode-Bench is publicly released to foster research in domain-specific code generation and agentic software repair in finance.
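The four-stage pipeline described above can be sketched as a chain of gates in which a strategy is only scored at a stage if it passed every earlier one. This is a minimal illustration, not the benchmark's actual harness: `evaluate`, `StageResult`, `run_backtest`, and `judge` are hypothetical names, the Compilation check is assumed to be a Python syntax check via the built-in `compile()`, and the backtest and judge are passed in as stub callables.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class StageResult:
    passed: Tuple[str, ...]      # stages passed, in pipeline order
    failed_at: Optional[str]     # first failed stage, or None if all passed
    detail: str = ""             # short message usable as feedback


def evaluate(
    source: str,
    run_backtest: Callable[[str], int],   # hypothetical: runs the code, returns trade count
    judge: Callable[[str], bool],         # hypothetical: LLM judge verdict
) -> StageResult:
    passed = []

    # Stage 1 (Compilation): assumed here to be a plain syntax check.
    try:
        compile(source, "<strategy>", "exec")
    except SyntaxError as exc:
        return StageResult(tuple(passed), "compilation", str(exc))
    passed.append("compilation")

    # Stage 2 (Backtest): any runtime exception counts as a failure.
    try:
        n_trades = run_backtest(source)
    except Exception as exc:
        return StageResult(tuple(passed), "backtest", repr(exc))
    passed.append("backtest")

    # Stage 3 (Trade): at least one trade must have been executed.
    if n_trades < 1:
        return StageResult(tuple(passed), "trade", "no trades executed")
    passed.append("trade")

    # Stage 4 (Judge): semantic alignment with the task description.
    if not judge(source):
        return StageResult(tuple(passed), "judge", "rejected by judge")
    passed.append("judge")

    return StageResult(tuple(passed), None)


# Usage with stubs: valid syntax, a backtest that runs but produces no trades.
result = evaluate("x = 1", run_backtest=lambda s: 0, judge=lambda s: True)
print(result.failed_at)  # trade
```

Returning the first failed stage, rather than a single pass/fail flag, is what lets per-stage failure rates such as "compiles and backtests but generates no trades" be reported separately.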
Highlights
- Introduces QuantCode-Bench, a benchmark with 400 tasks for generating executable algorithmic trading strategies in Backtrader from natural language descriptions.
- Proposes a multi-stage evaluation pipeline (Compilation, Backtest, Trade, Judge) that distinguishes syntactic, execution, and semantic correctness.
- Evaluates 17 LLMs in single-turn and agentic multi-turn settings, showing frontier models achieve ~70-76% Judge Pass in single-turn and up to 95-98% with iterative feedback.
- Identifies that main failure modes are not syntax but correct operationalization of trading logic, API usage, and semantic alignment with task descriptions.
- Releases benchmark publicly for reproducible research in domain-specific code generation for finance.
Methods
- Multi-stage evaluation pipeline: syntactic correctness, backtest execution, trade presence, and LLM-as-a-Judge semantic validation.
- Single-turn and agentic multi-turn settings with up to 10 iterative repair attempts using structured feedback.
- Dataset collection from Reddit, TradingView, StackExchange, GitHub, and synthetic sources, with difficulty categorization (easy, medium, hard).
- Error analysis taxonomy classifying failures into categories like signal conditions not activating, Line object errors, and semantic mismatches.
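The agentic multi-turn setting listed above can be pictured as a bounded generate-check-repair loop. The sketch below is only an assumption about its general shape: `repair_loop`, `generate`, and `check` are hypothetical names, the feedback string format is invented for illustration, and the only detail taken from the summary is the cap of 10 attempts.

```python
from typing import Callable, Optional, Tuple

# Hypothetical checker signature: returns (failed_stage or None, detail message).
Check = Callable[[str], Tuple[Optional[str], str]]


def repair_loop(
    task: str,
    generate: Callable[[str, Optional[str]], str],  # hypothetical model call
    check: Check,
    max_turns: int = 10,
) -> Tuple[Optional[int], str]:
    """Return (turn on which the code passed, or None if never; last source)."""
    feedback: Optional[str] = None
    source = ""
    for turn in range(1, max_turns + 1):
        # The first call sees only the task; later calls also see feedback.
        source = generate(task, feedback)
        failed_stage, detail = check(source)
        if failed_stage is None:
            return turn, source
        # Structured feedback: which stage failed and why (format is assumed).
        feedback = f"[{failed_stage}] {detail}"
    return None, source


# Usage with stubs: a fake "model" whose third attempt passes all stages.
attempts = []

def fake_generate(task, feedback):
    attempts.append(feedback)
    return f"attempt-{len(attempts)}"

def fake_check(source):
    if source == "attempt-3":
        return None, ""
    return "trade", "no trades executed"

print(repair_loop("SMA crossover", fake_generate, fake_check))  # (3, 'attempt-3')
```

Tasks that return `None` after the final turn correspond to the unresolved cases that the last-turn failure analysis looks at.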
Results
- Best single-turn Judge Pass: claude-opus-4.6 at 75.8%, followed by gpt-5.4 at 70.2%.
- Agentic setting boosts best models to 95-98% Judge Pass (e.g., claude-opus-4.6 at 97.5%).
- Compilation rates near 100% for strong models, but major drop at Backtest (26.8% failures) and Trade (17.8% no trades) stages.
- Most frequent errors: signal conditions not activating (17.8%) and __bool__/Line object errors (13.1%) in single-turn.
- In agentic setting, semantic failures (Judge rejection) become dominant among unresolved cases (23.7% of last-turn failures).
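One plausible way a `__bool__`/Line object error of the kind listed above can arise is when a whole series object is used where a single truth value is expected. The toy below is not Backtrader code: `LazySeries` is a hypothetical stand-in that only mimics the relevant behavior (comparisons yield another series, index `0` means the current bar), and the price and indicator values are made up.

```python
class LazySeries:
    """Toy stand-in for an indicator line (NOT Backtrader's actual class)."""

    def __init__(self, values):
        self.values = list(values)

    def __gt__(self, other):
        # Comparing two series yields another series, not a bool.
        return LazySeries(a > b for a, b in zip(self.values, other.values))

    def __getitem__(self, ago):
        # Index 0 is the current (latest) bar, -1 the previous one.
        return self.values[-1 + ago]

    def __bool__(self):
        raise TypeError("a series has no single truth value; index it first")


# Made-up data for illustration only.
close = LazySeries([10.0, 11.0, 12.0])
sma = LazySeries([10.5, 10.8, 11.0])
rsi = LazySeries([40.0, 55.0, 65.0])
level = LazySeries([50.0, 50.0, 50.0])


def broken_signal():
    # `and` calls bool() on the left-hand series, which raises TypeError.
    return (close > sma) and (rsi > level)


def fixed_signal():
    # Index the current bar first, then combine plain Python booleans.
    return close[0] > sma[0] and rsi[0] > level[0]


try:
    broken_signal()
except TypeError as exc:
    print("broken:", exc)

print("fixed:", fixed_signal())  # fixed: True
```

Code like `broken_signal` is syntactically valid, so it passes a compile-only check and fails only at run time, which is consistent with the gap the results show between Compilation and Backtest.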