Automating Systematic Reviews with Large Language Models

Introducing otto-SR: an end-to-end, LLM-based agentic workflow that automates evidence synthesis from search to analysis with superhuman speed and accuracy.

Abstract

Systematic reviews (SRs) inform evidence-based decision making. Yet they take over a year to complete, are prone to human error, and face challenges with reproducibility, limiting access to timely and reliable information. We developed otto-SR, an end-to-end agentic workflow using large language models (LLMs) to support and automate the SR workflow from initial search to analysis. We found that otto-SR outperformed traditional dual human workflows in SR screening (otto-SR: 96.7% sensitivity, 97.9% specificity; human: 81.7% sensitivity, 98.1% specificity) and data extraction (otto-SR: 93.1% accuracy; human: 79.7% accuracy). Using otto-SR, we reproduced and updated an entire issue of Cochrane reviews (n=12) in two days, representing approximately 12 work-years of traditional systematic review work. Across Cochrane reviews, otto-SR incorrectly excluded a median of 0 studies (IQR 0 to 0.25) and found a median of 2.0 (IQR 1 to 6.5) eligible studies likely missed by the original authors. Meta-analyses revealed that otto-SR generated newly statistically significant conclusions in 2 reviews and negated significance in 1 review. These findings demonstrate that LLMs can autonomously conduct and update systematic reviews with superhuman performance, laying the foundation for automated, scalable, and reliable evidence synthesis.

Key Breakthroughs with otto-SR

Superior Screening Performance

otto-SR achieved 96.7% sensitivity and 97.9% specificity, outperforming human reviewers (81.7% sensitivity, 98.1% specificity).

Enhanced Data Extraction

With 93.1% accuracy in data extraction, otto-SR surpassed human accuracy of 79.7%.

Unprecedented Speed

Reproduced and updated 12 Cochrane reviews in just 2 days, a task traditionally requiring ~12 work-years.

Improved Study Discovery

Identified a median of 2.0 eligible studies per review likely missed by the original authors.

Impactful Analytical Changes

Meta-analyses using otto-SR's findings led to new statistically significant conclusions in 2 reviews and negated significance in 1 review.

The otto-SR Agentic Workflow

otto-SR is an end-to-end, LLM-based workflow designed for both fully automated and human-in-the-loop systematic reviews. It leverages state-of-the-art AI models to streamline the review process:

  1. Citation Upload & PDF Processing

     Citations (RIS format) are uploaded. PDFs are processed by Gemini 2.0 Flash into structured Markdown.

  2. AI-Powered Screening

     The otto-SR Screening Agent (using GPT-4.1) screens abstracts and full texts.

  3. Automated Data Extraction

     The otto-SR Extraction Agent (using o3-mini-high) performs data extraction.

This automated process significantly reduces manual effort and time compared to traditional human-centric workflows.
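
The underlying implementation is not reproduced on this page, so the sketch below only illustrates the three-stage structure described above. All names (`Citation`, `ScreeningAgent`, `ExtractionAgent`, `convert_pdf_to_markdown`) are hypothetical stand-ins, and the calls to Gemini 2.0 Flash, GPT-4.1, and o3-mini-high are replaced with placeholder logic rather than real API requests.

```python
# Illustrative sketch of an otto-SR-style pipeline.
# Hypothetical names and placeholder logic; not the authors' code.
from dataclasses import dataclass, field


@dataclass
class Citation:
    """One record parsed from an uploaded RIS file."""
    title: str
    abstract: str
    pdf_path: str | None = None


@dataclass
class ReviewResult:
    included: list[Citation] = field(default_factory=list)
    extracted: list[dict] = field(default_factory=list)


def convert_pdf_to_markdown(pdf_path: str) -> str:
    """Stage 1: PDF processing into structured Markdown.

    otto-SR uses Gemini 2.0 Flash for this step; here a placeholder
    string stands in for the model call.
    """
    return f"# Full text of {pdf_path}\n\n(placeholder Markdown)"


class ScreeningAgent:
    """Stage 2: abstract and full-text screening (GPT-4.1 in otto-SR)."""

    def __init__(self, eligibility_criteria: str):
        self.eligibility_criteria = eligibility_criteria

    def screen(self, citation: Citation, full_text: str | None) -> bool:
        # Placeholder decision: a real agent would prompt an LLM with the
        # eligibility criteria, the abstract, and (if available) the full text.
        text = (citation.abstract + " " + (full_text or "")).lower()
        return "randomized" in text


class ExtractionAgent:
    """Stage 3: structured data extraction (o3-mini-high in otto-SR)."""

    def extract(self, full_text: str) -> dict:
        # Placeholder output: a real agent would return study characteristics
        # and outcome data as structured JSON.
        return {"participants": None, "outcomes": []}


def run_review(citations: list[Citation], eligibility_criteria: str) -> ReviewResult:
    """Run the three stages end to end over a batch of citations."""
    screener = ScreeningAgent(eligibility_criteria)
    extractor = ExtractionAgent()
    result = ReviewResult()
    for citation in citations:
        full_text = convert_pdf_to_markdown(citation.pdf_path) if citation.pdf_path else None
        if screener.screen(citation, full_text):
            result.included.append(citation)
            if full_text is not None:
                result.extracted.append(extractor.extract(full_text))
    return result


if __name__ == "__main__":
    demo = [Citation("Trial A", "A randomized controlled trial of drug X.", "trial_a.pdf")]
    print(run_review(demo, eligibility_criteria="Randomized trials of drug X in adults"))
```

In a real deployment, each placeholder would wrap a prompt to the corresponding model, and screening would run on abstracts first and then on retrieved full texts, mirroring the conventional SR workflow described above.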

Workflow Comparison: Human vs. otto-SR

Traditional Human Workflow

Search → Dual Screening → Conflict Resolution → Dual Extraction → Analysis
⏱️ 16+ months • 👥 Multiple reviewers • 💰 $100,000+

otto-SR Automated Workflow

Search → AI Screening → AI Extraction → Analysis
⚡ 2 days • 🤖 Automated • 💡 Superhuman accuracy

Performance Comparison

• Screening sensitivity: human 81.7% vs. otto-SR 96.7%
• Data extraction accuracy: human 79.7% vs. otto-SR 93.1%
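
These metrics follow their standard definitions: sensitivity is the share of truly eligible studies correctly included, specificity the share of ineligible studies correctly excluded, and extraction accuracy the share of extracted data elements matching the reference standard. A minimal sketch is below; the counts in the example are hypothetical and not taken from the paper.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of truly eligible studies that were correctly included."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of ineligible studies that were correctly excluded."""
    return true_neg / (true_neg + false_pos)


def extraction_accuracy(correct_items: int, total_items: int) -> float:
    """Share of extracted data elements matching the reference standard."""
    return correct_items / total_items


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(f"sensitivity = {sensitivity(true_pos=45, false_neg=5):.1%}")    # 90.0%
    print(f"specificity = {specificity(true_neg=190, false_pos=10):.1%}")  # 95.0%
    print(f"accuracy    = {extraction_accuracy(93, 100):.1%}")             # 93.0%
```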

The Future of Evidence Synthesis

Rapid Review Updates

Enables quick updates to existing systematic reviews, keeping evidence current.

Living Systematic Reviews

Paves the way for 'living' reviews with frequent (daily/weekly) updates.

Efficient De Novo Reviews

Facilitates faster generation of new reviews with well-defined protocols.

Enhanced Reproducibility

Addresses common reproducibility challenges in systematic reviews.

Machine-Readable Science

Promotes the value of machine-readable formats for scientific publications.

Meet the Research Team

This research is the result of a collaborative effort by a dedicated team. The lead authors include:

• Christian Cao, University of Toronto

• Rohit Arora, Harvard Medical School

• Paul Cento, Independent Researcher

The full list of authors and their affiliations can be found in the published paper. This work involves researchers from University of Toronto, Harvard Medical School, McGill University, University of British Columbia, MIT, University of Waterloo, Mount Sinai Hospital, University of Calgary, and many other esteemed institutions.

Publication Details

Paper Title:

Automation of Systematic Reviews with Large Language Models

Cite as:

Cao C, Arora R, Cento P, et al. Automation of Systematic Reviews with Large Language Models. medRxiv 2025.06.13.25329541; doi: https://doi.org/10.1101/2025.06.13.25329541

Important Notice

This is a preprint and has not been certified by peer review. It should not be used to guide clinical practice.

The full paper is available on medRxiv at https://doi.org/10.1101/2025.06.13.25329541.