Automated Reproducibility Has a Problem Statement Problem
This paper addresses the lack of a formal, generalizable problem statement for automating reproducibility in empirical AI research. The authors propose a graph-based representation grounded in the scientific method, where a study is decomposed into hypotheses, experiments, outcomes, analyses, and interpretations. This structure allows for consistent metrics across different automated reproducibility systems. As a proof of concept, they use Google Gemini 2.5 Pro to automatically extract this representation from PDFs of 20 AI papers, with original authors evaluating and correcting the outputs. The evaluation shows that the method captures most elements well, with hypotheses and interpretations requiring only minor edits, but experiment details, especially numerical results and visual data, remain challenging. The authors provide a dataset of extracted representations and corrections, enabling future improvements. This work lays the foundation for comparable and scalable automated reproducibility by providing a unified problem formulation and a benchmark for extraction quality.
Highlights
- Formalizes a generalizable problem statement for reproducibility based on the scientific method, applicable to any empirical AI study.
- Proposes a graph-based representation capturing hypotheses, experiments, outcomes, analyses, and interpretations.
- Demonstrates automated extraction of this representation from PDFs using LLMs, evaluated by original authors on 20 studies.
- Provides a dataset of extracted representations with author corrections and Likert-scale evaluations.
- Identifies key challenges: capturing visual results, handling long papers, and extracting precise experiment details.
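To make the graph-based representation more concrete, here is a minimal sketch of how a study might be stored as typed nodes and directed edges. The five element kinds come from the summary above; the class names (`Node`, `StudyGraph`), the field names, and the edge direction (hypothesis to experiment to outcome to analysis to interpretation) are illustrative assumptions, not the authors' schema.

```python
from dataclasses import dataclass, field

# The five element kinds named in the summary.
KINDS = ("hypothesis", "experiment", "outcome", "analysis", "interpretation")


@dataclass
class Node:
    """One element of a study (hypothetical structure)."""
    node_id: str
    kind: str
    text: str


@dataclass
class StudyGraph:
    """Hypothetical container: nodes keyed by id, plus directed edges."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add(self, node_id, kind, text, parent=None):
        """Add a node and, optionally, an edge from an existing parent."""
        if kind not in KINDS:
            raise ValueError(f"unknown element kind: {kind}")
        if parent is not None and parent not in self.nodes:
            raise KeyError(f"unknown parent: {parent}")
        self.nodes[node_id] = Node(node_id, kind, text)
        if parent is not None:
            self.edges.append((parent, node_id))

    def children(self, node_id):
        """Return the nodes directly reachable from node_id."""
        return [self.nodes[dst] for src, dst in self.edges if src == node_id]


def build_example():
    """Build a one-chain graph with placeholder text (assumed edge order)."""
    g = StudyGraph()
    g.add("h1", "hypothesis", "placeholder hypothesis")
    g.add("e1", "experiment", "placeholder experiment", parent="h1")
    g.add("o1", "outcome", "placeholder outcome", parent="e1")
    g.add("a1", "analysis", "placeholder analysis", parent="o1")
    g.add("i1", "interpretation", "placeholder interpretation", parent="a1")
    return g
```

Keeping element kinds explicit on each node is one way to let different reproducibility systems be scored per element type, which is the kind of consistent metric the summary describes.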
Methods
- Graph-based problem representation derived from the scientific method (hypotheses, experiments, outcomes, analyses, interpretations).
- LLM-based automated extraction using Google Gemini 2.5 Pro with few-shot prompting and author feedback for prompt refinement.
- Evaluation via Likert-scale ratings and error analysis (Levenshtein distance, missing elements) by original authors.
- Dataset construction from 20 AI papers across diverse subfields.
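The error analysis above relies on Levenshtein distance between the extracted text and the author-corrected text. The sketch below shows one way a percentage character change could be derived from it. The normalization (dividing by the length of the longer string) and the function names are assumptions, since this summary does not state how the authors normalized the distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def percent_changed(extracted: str, corrected: str) -> float:
    """Edit distance as a percentage of the longer string's length
    (assumed normalization, not confirmed by the source)."""
    longest = max(len(extracted), len(corrected))
    if longest == 0:
        return 0.0
    return 100.0 * levenshtein(extracted, corrected) / longest
```

Under this normalization the value always falls between 0 and 100, so edits to short hypotheses and long interpretations can be compared on the same scale.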
Results
- 75% of studies had all elements correctly captured; 65.52% of hypotheses required minor edits (average 14.9% character change).
- Interpretations were more accurate: only 24.32% were edited, with an average 4.79% character change.
- Experiment results had the highest error rate: 69.63% required correction or were missing.
- The LLM struggled with visual results (graphs, histograms) and long papers (e.g., missing hypotheses in longer texts).
- Author feedback was generally positive, indicating the representation captures the essence of studies.