Automated Reproducibility Has a Problem Statement Problem

Thijs Snelleman
Peter Lundestad Lawrence
Holger H. Hoos
Odd Erik Gundersen
Published on 12/30/2025
Cross-asset
AI
Machine learning

This paper addresses the lack of a formal, generalizable problem statement for automating reproducibility in empirical AI research. The authors propose a graph-based representation grounded in the scientific method, in which a study is decomposed into hypotheses, experiments, outcomes, analyses, and interpretations. Because every study is described in the same structure, different automated reproducibility systems can be compared using consistent metrics. As a proof of concept, the authors use Google Gemini 2.5 Pro to extract this representation automatically from the PDFs of 20 AI papers, and the original authors of those papers evaluate and correct the outputs. The evaluation shows that the method captures most elements well: hypotheses and interpretations require only minor edits, but experiment details, especially numerical results and visual data, remain challenging. The authors provide a dataset of extracted representations and corrections, enabling future improvements. This work lays the foundation for comparable and scalable automated reproducibility by providing a unified problem formulation and a benchmark for extraction quality.
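
To make the decomposition concrete, the following is a minimal sketch of how such a graph could be held in memory. It is not the authors' schema: the class name `StudyGraph`, the node identifiers, the example texts, and the edge direction (hypothesis to experiment to outcome to analysis to interpretation) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

# The five element types named in the summary above.
NODE_KINDS = ("hypothesis", "experiment", "outcome", "analysis", "interpretation")


@dataclass
class StudyGraph:
    """Hypothetical container for one study; not the paper's actual format."""

    nodes: dict = field(default_factory=dict)  # node id -> (kind, text)
    edges: set = field(default_factory=set)    # directed (source id, target id)

    def add_node(self, node_id: str, kind: str, text: str) -> None:
        if kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[node_id] = (kind, text)

    def add_edge(self, src: str, dst: str) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before linking them")
        self.edges.add((src, dst))

    def successors(self, node_id: str) -> list:
        """Ids of nodes directly reachable from node_id, sorted."""
        return sorted(dst for src, dst in self.edges if src == node_id)

    def of_kind(self, kind: str) -> list:
        """Ids of all nodes of the given kind, sorted."""
        return sorted(nid for nid, (k, _) in self.nodes.items() if k == kind)


def build_example() -> StudyGraph:
    """Build a toy study with one element of each kind (invented content)."""
    g = StudyGraph()
    g.add_node("h1", "hypothesis", "Method A is more accurate than method B.")
    g.add_node("e1", "experiment", "Train A and B on the same benchmark.")
    g.add_node("o1", "outcome", "Accuracy scores for A and B.")
    g.add_node("a1", "analysis", "Compare the two accuracy scores.")
    g.add_node("i1", "interpretation", "The hypothesis is supported.")
    for src, dst in [("h1", "e1"), ("e1", "o1"), ("o1", "a1"), ("a1", "i1")]:
        g.add_edge(src, dst)
    return g
```

Keeping the node kinds in one fixed tuple means any extraction tool that emits this structure can be scored element by element, which is the kind of comparability the summary describes.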

Highlights

  • Formalizes a generalizable problem statement for reproducibility based on the scientific method, applicable to any empirical AI study.
  • Proposes a graph-based representation capturing hypotheses, experiments, outcomes, analyses, and interpretations.
  • Demonstrates automated extraction of this representation from PDFs using LLMs, evaluated by original authors on 20 studies.
  • Provides a dataset of extracted representations with author corrections and Likert-scale evaluations.
  • Identifies key challenges: capturing visual results, handling long papers, and extracting precise experiment details.

Methods

  • Graph-based problem representation derived from the scientific method (hypotheses, experiments, outcomes, analyses, interpretations).
  • LLM-based automated extraction using Google Gemini 2.5 Pro with few-shot prompting and author feedback for prompt refinement.
  • Evaluation via Likert-scale ratings and error analysis (Levenshtein distance, missing elements) by original authors.
  • Dataset construction from 20 AI papers across diverse subfields.
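
The Levenshtein-based error analysis can be illustrated with a short sketch. The summary does not say how the "character change" percentages are normalized, so dividing the edit distance by the length of the longer string is an assumption here, and the function names `levenshtein` and `percent_character_change` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the row short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def percent_character_change(extracted: str, corrected: str) -> float:
    """Edit distance as a percentage of the longer string's length.

    Assumed normalization; the paper may define it differently.
    """
    longest = max(len(extracted), len(corrected))
    if longest == 0:
        return 0.0
    return 100.0 * levenshtein(extracted, corrected) / longest
```

Under this assumed normalization, an extracted element that an author corrects by changing one character out of four would score 25% character change.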

Results

  • 75% of studies had all elements correctly captured; 65.52% of hypotheses required minor edits (avg 14.9% character change).
  • Interpretations were more accurate: only 24.32% were edited, with an avg 4.79% character change.
  • Experiment results had the highest error rate: 69.63% required correction or were missing.
  • The LLM struggled with visual results (graphs, histograms) and long papers (e.g., missing hypotheses in longer texts).
  • Author feedback was generally positive, indicating the representation captures the essence of studies.