QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies

Alexey Khoroshilov

Alexey Chernysh

Orkhan Ekhtibarov

Nini Kamkia

Dmitry Zmitrovich

Published on 4/16/2026

Equities

Futures

Options

Derivatives

Cross-asset

LLM

Machine learning

Backtesting

QuantCode-Bench is a benchmark designed to evaluate large language models on generating executable algorithmic trading strategies from natural language descriptions using the Backtrader framework. The benchmark comprises 400 tasks sourced from Reddit, TradingView, StackExchange, GitHub, and synthetic data, categorized by difficulty. Evaluation follows a four-stage pipeline: syntactic correctness (Compilation), successful backtest execution (Backtest), presence of at least one trade (Trade), and semantic alignment with the task description via an LLM judge (Judge). This layered approach reveals that while frontier models achieve near-perfect compilation, they struggle with correct operationalization of trading logic, proper API usage, and adherence to task semantics: the best single-turn Judge Pass rate is 75.8% (claude-opus-4.6). In an agentic multi-turn setting with iterative feedback, performance improves dramatically, reaching 95-98% Judge Pass for top models, which indicates that many errors are locally repairable. Error analysis shows that the most common failures are strategies that compile and backtest but generate no trades (17.8%) and incorrect handling of Backtrader line objects (13.1%). The benchmark highlights that trading strategy generation is a distinct domain-specific code generation task that requires not only technical correctness but also sound financial logic and semantic understanding. QuantCode-Bench is publicly released to foster research in domain-specific code generation and agentic software repair in finance.

Highlights

1. Introduces QuantCode-Bench, a benchmark with 400 tasks for generating executable algorithmic trading strategies in Backtrader from natural language descriptions.
2. Proposes a multi-stage evaluation pipeline (Compilation, Backtest, Trade, Judge) that distinguishes syntactic, execution, and semantic correctness.
3. Evaluates 17 LLMs in single-turn and agentic multi-turn settings, showing frontier models achieve ~70-76% Judge Pass in single-turn and up to 95-98% with iterative feedback.
4. Identifies that main failure modes are not syntax but correct operationalization of trading logic, API usage, and semantic alignment with task descriptions.
5. Releases benchmark publicly for reproducible research in domain-specific code generation for finance.
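
The four-stage pipeline can be read as a short-circuiting sequence of checks, where a submission is credited only up to the first stage it fails. The following is a minimal sketch of that idea, not the benchmark's actual harness: `BacktestResult`, `evaluate`, `run_backtest`, and `judge` are hypothetical names, the backtest runner and the LLM judge are passed in as plain callables, and Compilation is approximated here by Python's built-in `compile`.

```python
from dataclasses import dataclass
from typing import Callable, Optional

STAGES = ("Compilation", "Backtest", "Trade", "Judge")


@dataclass
class BacktestResult:
    """Hypothetical summary of one backtest run."""
    ok: bool         # True if the backtest finished without an exception
    num_trades: int  # number of trades the strategy executed


def evaluate(
    source: str,
    task: str,
    run_backtest: Callable[[str], BacktestResult],
    judge: Callable[[str, str], bool],
) -> Optional[str]:
    """Return the name of the first failed stage, or None if all four pass."""
    # Stage 1 (Compilation): the source must be syntactically valid Python.
    try:
        compile(source, "<strategy>", "exec")
    except (SyntaxError, ValueError):
        return "Compilation"

    # Stage 2 (Backtest): the strategy must run to completion.
    result = run_backtest(source)
    if not result.ok:
        return "Backtest"

    # Stage 3 (Trade): at least one trade must have been placed.
    if result.num_trades < 1:
        return "Trade"

    # Stage 4 (Judge): the code must match the task description.
    if not judge(task, source):
        return "Judge"
    return None


# Stubs standing in for a real backtest runner and a real LLM judge.
def fake_backtest(source: str) -> BacktestResult:
    return BacktestResult(ok=True, num_trades=0)


def fake_judge(task: str, source: str) -> bool:
    return True


first_failure = evaluate("x = 1", "buy on an SMA cross", fake_backtest, fake_judge)
print(first_failure)  # prints "Trade": the stub backtest runs but places no trades
```

Returning the first failed stage, rather than a single pass/fail flag, is what lets failures be attributed to syntax, execution, trading activity, or semantics separately.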

Methods

Multi-stage evaluation pipeline: syntactic correctness, backtest execution, trade presence, and LLM-as-a-Judge semantic validation.
Single-turn and agentic multi-turn settings with up to 10 iterative repair attempts using structured feedback.
Dataset collection from Reddit, TradingView, StackExchange, GitHub, and synthetic sources, with difficulty categorization (easy, medium, hard).
Error analysis taxonomy classifying failures into categories like signal conditions not activating, Line object errors, and semantic mismatches.
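
The agentic multi-turn setting described above amounts to a generate, evaluate, and retry loop. The sketch below illustrates that control flow only and is not the authors' implementation: `repair_loop`, `generate`, and `evaluate` are hypothetical names, the feedback is a plain string rather than the paper's structured feedback, and the limit of 10 is assumed to count total turns (the source says "up to 10 iterative repair attempts" without specifying whether the first attempt is included).

```python
from typing import Callable, Optional, Tuple

MAX_TURNS = 10  # assumption: 10 is treated as the total number of turns


def repair_loop(
    task: str,
    generate: Callable[[str, Optional[str]], str],
    evaluate: Callable[[str], Optional[str]],
    max_turns: int = MAX_TURNS,
) -> Tuple[Optional[str], int]:
    """Ask the model for code, feed failures back, stop on the first success.

    `generate(task, feedback)` returns candidate source code.
    `evaluate(code)` returns the name of the failed stage, or None on success.
    Returns (passing code or None, number of turns used).
    """
    feedback: Optional[str] = None
    for turn in range(1, max_turns + 1):
        code = generate(task, feedback)
        failure = evaluate(code)
        if failure is None:
            return code, turn
        # Hypothetical feedback format; a real harness would include
        # tracebacks or judge comments here.
        feedback = f"Previous attempt failed at stage: {failure}"
    return None, max_turns


# Stub model: produces a strategy with no orders first, then fixes it
# once it receives feedback.
def stub_generate(task: str, feedback: Optional[str]) -> str:
    return "pass" if feedback is None else "self.buy()"


# Stub evaluator: reports a Trade-stage failure unless the code places an order.
def stub_evaluate(code: str) -> Optional[str]:
    return None if "buy" in code else "Trade"


code, turns = repair_loop("demo task", stub_generate, stub_evaluate)
print(code, turns)  # prints "self.buy() 2"
```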

Results

Best single-turn Judge Pass: claude-opus-4.6 at 75.8%, followed by gpt-5.4 at 70.2%.
Agentic setting boosts best models to 95-98% Judge Pass (e.g., claude-opus-4.6 at 97.5%).
Compilation rates near 100% for strong models, but major drop at Backtest (26.8% failures) and Trade (17.8% no trades) stages.
Most frequent errors: signal conditions not activating (17.8%) and __bool__/Line object errors (13.1%) in single-turn.
In agentic setting, semantic failures (Judge rejection) become dominant among unresolved cases (23.7% of last-turn failures).
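
Stage-level figures like those above can be derived from per-task outcomes by counting how many submissions survive each stage. The sketch below shows one way to compute such a funnel, assuming each task is recorded by the first stage it failed (or `None` if it passed all four). The function name `stage_pass_rates` is hypothetical, and the sample outcomes are toy data, not results from the paper.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

STAGES = ("Compilation", "Backtest", "Trade", "Judge")


def stage_pass_rates(first_failures: Iterable[Optional[str]]) -> Dict[str, float]:
    """Cumulative pass rate per stage.

    Each input item is the first stage a task failed, or None if it
    passed every stage. A task counts as passing a stage only if it
    also passed all earlier stages.
    """
    outcomes = list(first_failures)
    total = len(outcomes)
    failed_at = Counter(o for o in outcomes if o is not None)

    rates: Dict[str, float] = {}
    surviving = total
    for stage in STAGES:
        surviving -= failed_at[stage]
        rates[stage] = surviving / total if total else 0.0
    return rates


# Toy data: two full passes, one no-trade failure, one backtest crash.
toy_outcomes = [None, None, "Trade", "Backtest"]
rates = stage_pass_rates(toy_outcomes)
print(rates)
# prints {'Compilation': 1.0, 'Backtest': 0.75, 'Trade': 0.5, 'Judge': 0.5}
```

Under this convention the value for the final stage is the Judge Pass rate, since it counts only tasks that cleared every earlier stage as well.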
