Automating Systematic Reviews with Large Language Models

Introducing otto-SR: an end-to-end, LLM-based agentic workflow that automates evidence synthesis from search to analysis with superhuman speed and accuracy.

Abstract

Systematic reviews (SRs) inform evidence-based decision making. Yet they take over a year to complete, are prone to human error, and face challenges with reproducibility, limiting access to timely and reliable information. We developed otto-SR, an end-to-end agentic workflow using large language models (LLMs) to support and automate the SR workflow from initial search to analysis. We found that otto-SR outperformed traditional dual human workflows in SR screening (otto-SR: 96.7% sensitivity, 97.9% specificity; human: 81.7% sensitivity, 98.1% specificity) and data extraction (otto-SR: 93.1% accuracy; human: 79.7% accuracy). Using otto-SR, we reproduced and updated an entire issue of Cochrane reviews (n=12) in two days, representing approximately 12 work-years of traditional systematic review work. Across Cochrane reviews, otto-SR incorrectly excluded a median of 0 studies (IQR 0 to 0.25) and found a median of 2.0 (IQR 1 to 6.5) eligible studies likely missed by the original authors. Meta-analyses revealed that otto-SR generated newly statistically significant conclusions in 2 reviews and negated significance in 1 review. These findings demonstrate that LLMs can autonomously conduct and update systematic reviews with superhuman performance, laying the foundation for automated, scalable, and reliable evidence synthesis.

Key Breakthroughs with otto-SR

Superior Screening Performance

otto-SR achieved 96.7% sensitivity and 97.9% specificity, outperforming human reviewers (81.7% sensitivity, 98.1% specificity).

Enhanced Data Extraction

With 93.1% accuracy in data extraction, otto-SR surpassed human accuracy of 79.7%.

Unprecedented Speed

Reproduced and updated 12 Cochrane reviews in just 2 days, a task traditionally requiring ~12 work-years.

Improved Study Discovery

Identified a median of 2.0 eligible studies per review likely missed by the original authors.

Impactful Analytical Changes

Meta-analyses using otto-SR's findings led to new statistically significant conclusions in 2 reviews and negated significance in 1 review.

The otto-SR Agentic Workflow

otto-SR is an end-to-end, LLM-based workflow designed for both fully automated and human-in-the-loop systematic reviews. It leverages state-of-the-art AI models to streamline the review process:

  1. Citation Upload & PDF Processing

     Citations (RIS format) are uploaded. PDFs are processed by Gemini 2.0 Flash into structured Markdown.

  2. AI-Powered Screening

     The otto-SR Screening Agent (using GPT-4.1) screens abstracts and full texts.

  3. Automated Data Extraction

     The otto-SR Extraction Agent (using o3-mini-high) performs data extraction.

This automated process significantly reduces manual effort and time compared to traditional human-centric workflows.
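
The underlying implementation is not reproduced on this page, so the sketch below only illustrates the three-stage structure described above. All names (`Citation`, `ScreeningAgent`, `ExtractionAgent`, `convert_pdf_to_markdown`) are hypothetical stand-ins, and the calls to Gemini 2.0 Flash, GPT-4.1, and o3-mini-high are replaced with placeholder logic rather than real API requests.

```python
# Illustrative sketch of an otto-SR-style pipeline.
# Hypothetical names and placeholder logic; not the authors' code.
from dataclasses import dataclass, field


@dataclass
class Citation:
    """One record parsed from an uploaded RIS file."""
    title: str
    abstract: str
    pdf_path: str | None = None


@dataclass
class ReviewResult:
    included: list[Citation] = field(default_factory=list)
    extracted: list[dict] = field(default_factory=list)


def convert_pdf_to_markdown(pdf_path: str) -> str:
    """Stage 1: PDF processing into structured Markdown.

    otto-SR uses Gemini 2.0 Flash for this step; here a placeholder
    string stands in for the model call.
    """
    return f"# Full text of {pdf_path}\n\n(placeholder Markdown)"


class ScreeningAgent:
    """Stage 2: abstract and full-text screening (GPT-4.1 in otto-SR)."""

    def __init__(self, eligibility_criteria: str):
        self.eligibility_criteria = eligibility_criteria

    def screen(self, citation: Citation, full_text: str | None) -> bool:
        # Placeholder decision: a real agent would prompt an LLM with the
        # eligibility criteria, the abstract, and (if available) the full text.
        text = (citation.abstract + " " + (full_text or "")).lower()
        return "randomized" in text


class ExtractionAgent:
    """Stage 3: structured data extraction (o3-mini-high in otto-SR)."""

    def extract(self, full_text: str) -> dict:
        # Placeholder output: a real agent would return study characteristics
        # and outcome data as structured JSON.
        return {"participants": None, "outcomes": []}


def run_review(citations: list[Citation], eligibility_criteria: str) -> ReviewResult:
    """Run the three stages end to end over a batch of citations."""
    screener = ScreeningAgent(eligibility_criteria)
    extractor = ExtractionAgent()
    result = ReviewResult()
    for citation in citations:
        full_text = convert_pdf_to_markdown(citation.pdf_path) if citation.pdf_path else None
        if screener.screen(citation, full_text):
            result.included.append(citation)
            if full_text is not None:
                result.extracted.append(extractor.extract(full_text))
    return result


if __name__ == "__main__":
    demo = [Citation("Trial A", "A randomized controlled trial of drug X.", "trial_a.pdf")]
    print(run_review(demo, eligibility_criteria="Randomized trials of drug X in adults"))
```

In a real deployment, each placeholder would wrap a prompt to the corresponding model, and screening would run on abstracts first and then on retrieved full texts, mirroring the conventional SR workflow described above.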

Workflow Comparison: Human vs. otto-SR

Traditional Human Workflow

Search → Dual Screening → Conflict Resolution → Dual Extraction → Analysis
⏱️ 16+ months • 👥 Multiple reviewers • 💰 $100,000+

otto-SR Automated Workflow

Search → AI Screening → AI Extraction → Analysis
⚡ 2 days • 🤖 Automated • 💡 Superhuman accuracy

Performance Comparison

• Screening sensitivity: human 81.7% vs. otto-SR 96.7%
• Data extraction accuracy: human 79.7% vs. otto-SR 93.1%
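
These metrics follow their standard definitions: sensitivity is the share of truly eligible studies correctly included, specificity the share of ineligible studies correctly excluded, and extraction accuracy the share of extracted data elements matching the reference standard. A minimal sketch is below; the counts in the example are hypothetical and not taken from the paper.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of truly eligible studies that were correctly included."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of ineligible studies that were correctly excluded."""
    return true_neg / (true_neg + false_pos)


def extraction_accuracy(correct_items: int, total_items: int) -> float:
    """Share of extracted data elements matching the reference standard."""
    return correct_items / total_items


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(f"sensitivity = {sensitivity(true_pos=45, false_neg=5):.1%}")    # 90.0%
    print(f"specificity = {specificity(true_neg=190, false_pos=10):.1%}")  # 95.0%
    print(f"accuracy    = {extraction_accuracy(93, 100):.1%}")             # 93.0%
```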

The Future of Evidence Synthesis

Rapid Review Updates

Enables quick updates to existing systematic reviews, keeping evidence current.

Living Systematic Reviews

Paves the way for 'living' reviews with frequent (daily/weekly) updates.

Efficient De Novo Reviews

Facilitates faster generation of new reviews with well-defined protocols.

Enhanced Reproducibility

Addresses common reproducibility challenges in systematic reviews.

Machine-Readable Science

Promotes the value of machine-readable formats for scientific publications.

Meet the Research Team

This research is the result of a collaborative effort by a dedicated team. The lead authors include:

• Christian Cao, University of Toronto

• Rohit Arora, Harvard Medical School

• Paul Cento, Independent Researcher

The full list of authors and their affiliations can be found in the published paper. This work involves researchers from University of Toronto, Harvard Medical School, McGill University, University of British Columbia, MIT, University of Waterloo, Mount Sinai Hospital, University of Calgary, and many other esteemed institutions.

Publication Details

Paper Title:

Automation of Systematic Reviews with Large Language Models

Cite as:

Cao C, Arora R, Cento P, et al. Automation of Systematic Reviews with Large Language Models. medRxiv 2025.06.13.25329541; doi: https://doi.org/10.1101/2025.06.13.25329541

Important Notice

This is a preprint and has not been certified by peer review. It should not be used to guide clinical practice.

The full paper is available on medRxiv at https://doi.org/10.1101/2025.06.13.25329541.