Automating Systematic Reviews with Large Language Models
Introducing otto-SR: An AI-powered workflow revolutionizing evidence synthesis with superhuman speed and accuracy.
Systematic reviews (SRs) inform evidence-based decision making. Yet they take over a year to complete, are prone to human error, and face challenges with reproducibility, all of which limit access to timely and reliable information. We developed otto-SR, an end-to-end agentic workflow using large language models (LLMs) to support and automate the SR workflow from initial search to analysis. We found that otto-SR outperformed traditional dual human workflows in SR screening (otto-SR: 96.7% sensitivity, 97.9% specificity; human: 81.7% sensitivity, 98.1% specificity) and data extraction (otto-SR: 93.1% accuracy; human: 79.7% accuracy). Using otto-SR, we reproduced and updated an entire issue of Cochrane reviews (n=12) in two days, representing approximately 12 work-years of traditional systematic review work. Across Cochrane reviews, otto-SR incorrectly excluded a median of 0 studies (IQR 0 to 0.25), and found a median of 2.0 (IQR 1 to 6.5) eligible studies likely missed by the original authors. Meta-analyses revealed that otto-SR generated newly statistically significant conclusions in 2 reviews and negated significance in 1 review. These findings demonstrate that LLMs can autonomously conduct and update systematic reviews with superhuman performance, laying the foundation for automated, scalable, and reliable evidence synthesis.
Key Breakthroughs with otto-SR
In screening, otto-SR achieved 96.7% sensitivity and 97.9% specificity, compared with 81.7% sensitivity and 98.1% specificity for the traditional dual human workflow.
With 93.1% accuracy in data extraction, otto-SR surpassed human accuracy of 79.7%.
Reproduced and updated 12 Cochrane reviews in just 2 days, a task traditionally requiring ~12 work-years.
Identified a median of 2.0 eligible studies per review likely missed by original authors.
Meta-analyses using otto-SR's findings led to new statistically significant conclusions in 2 reviews and negated significance in 1 review.
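For readers less familiar with the screening metrics quoted above, the following is a minimal sketch of how sensitivity and specificity are computed from inclusion and exclusion decisions. The class name, function names, and counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not data from the study.

```python
from dataclasses import dataclass


@dataclass
class ScreeningCounts:
    """Confusion-matrix counts for a screening run (hypothetical container)."""
    true_pos: int   # eligible studies correctly included
    false_neg: int  # eligible studies incorrectly excluded
    true_neg: int   # ineligible studies correctly excluded
    false_pos: int  # ineligible studies incorrectly included


def sensitivity(c: ScreeningCounts) -> float:
    """Share of truly eligible studies that the screener included."""
    return c.true_pos / (c.true_pos + c.false_neg)


def specificity(c: ScreeningCounts) -> float:
    """Share of truly ineligible studies that the screener excluded."""
    return c.true_neg / (c.true_neg + c.false_pos)


# Made-up counts for illustration only (not the paper's data).
counts = ScreeningCounts(true_pos=90, false_neg=10, true_neg=980, false_pos=20)

print(f"sensitivity = {sensitivity(counts):.1%}")
print(f"specificity = {specificity(counts):.1%}")
```

In a systematic review, sensitivity is usually the metric to watch most closely, because an eligible study that is wrongly excluded never reaches the analysis.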
The otto-SR Agentic Workflow
otto-SR is an end-to-end, LLM-based workflow designed for both fully automated and human-in-the-loop systematic reviews. It leverages state-of-the-art AI models to streamline the review process:
1. Citation Upload & PDF Processing
Citations (RIS format) are uploaded. PDFs are processed by Gemini 2.0 Flash into structured Markdown.
2. AI-Powered Screening
The otto-SR Screening Agent (using GPT-4.1) screens abstracts and full texts.
3. Automated Data Extraction
The otto-SR Extraction Agent (using o3-mini-high) performs data extraction.
This automated process significantly reduces manual effort and time compared to traditional human-centric workflows.
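To make the citation upload step more concrete, here is a minimal sketch of reading RIS-formatted citations into Python dictionaries. It only illustrates the general shape of the RIS format (a two-character tag, two spaces, a hyphen, then the value, with `ER` closing each record). The function name `parse_ris` and the sample records are hypothetical, and this is not a description of how otto-SR itself ingests citations.

```python
import re
from collections import defaultdict

# A RIS line looks like "TY  - JOUR": tag, two spaces, hyphen, optional value.
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def parse_ris(text: str) -> list[dict[str, list[str]]]:
    """Parse RIS text into a list of records mapping each tag to its values."""
    records = []
    current = defaultdict(list)
    for raw in text.splitlines():
        match = RIS_LINE.match(raw.rstrip())
        if match is None:
            continue  # blank or non-tag line; ignored in this simplified sketch
        tag, value = match.group(1), match.group(2).strip()
        if tag == "ER":  # end-of-record marker
            if current:
                records.append(dict(current))
            current = defaultdict(list)
        else:
            current[tag].append(value)
    return records


# Invented sample citations, for illustration only.
SAMPLE_RIS = "\n".join([
    "TY  - JOUR",
    "TI  - Example trial of drug A",
    "AU  - Doe, J.",
    "AU  - Roe, R.",
    "PY  - 2021",
    "DO  - 10.1000/example.1",
    "ER  - ",
    "TY  - JOUR",
    "TI  - Another example study",
    "AU  - Poe, E.",
    "ER  - ",
])

records = parse_ris(SAMPLE_RIS)
for record in records:
    print(record["TI"][0], "|", "; ".join(record["AU"]))
```

Each tag maps to a list because tags such as `AU` (author) can repeat within one record. A production parser would also need to handle multi-line values and exporter-specific variations, which this sketch skips.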
[Figure: Traditional Human Workflow vs. otto-SR Automated Workflow]
Performance Comparison
The Future of Evidence Synthesis
Enables quick updates to existing systematic reviews, keeping evidence current.
Paves the way for 'living' reviews with frequent (daily/weekly) updates.
Facilitates faster generation of new reviews with well-defined protocols.
Addresses common reproducibility challenges in systematic reviews.
Promotes the value of machine-readable formats for scientific publications.
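One practical ingredient of a frequently updated ("living") review is screening only the citations that have not been seen before. The sketch below shows one possible way to do this by comparing normalized DOIs. It is an assumption made for illustration, not the otto-SR implementation, and the helper names `normalize_doi` and `select_new_citations`, as well as the sample DOIs, are hypothetical.

```python
def normalize_doi(doi: str) -> str:
    """Lower-case a DOI and strip common URL or 'doi:' prefixes."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def select_new_citations(already_screened: list[str],
                         incoming: list[dict]) -> list[dict]:
    """Return incoming citations whose DOI has not been screened yet.

    Duplicates within the incoming batch are also dropped.
    """
    seen = {normalize_doi(d) for d in already_screened}
    fresh = []
    for citation in incoming:
        key = normalize_doi(citation["doi"])
        if key not in seen:
            seen.add(key)
            fresh.append(citation)
    return fresh


# Invented DOIs, for illustration only.
screened = ["10.1000/a"]
batch = [
    {"doi": "https://doi.org/10.1000/A"},  # same as an already screened DOI
    {"doi": "10.1000/b"},                  # new
    {"doi": "DOI:10.1000/B"},              # duplicate of the previous entry
]

to_screen = select_new_citations(screened, batch)
print([c["doi"] for c in to_screen])
```

Matching on DOI alone is a simplification: records without a DOI would need a fallback key, for example a normalized title plus publication year.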
Meet the Research Team
This research is the result of a collaborative effort by a dedicated team.
The full list of authors and their affiliations can be found in the preprint. This work involves researchers from the University of Toronto, Harvard Medical School, McGill University, the University of British Columbia, MIT, the University of Waterloo, Mount Sinai Hospital, the University of Calgary, and other institutions.
Paper Title:
Automation of Systematic Reviews with Large Language Models
Cite as:
Cao C, Arora R, Cento P, et al. Automation of Systematic Reviews with Large Language Models. medRxiv 2025.06.13.25329541; doi: https://doi.org/10.1101/2025.06.13.25329541
Important Notice
This is a preprint and has not been certified by peer review. It should not be used to guide clinical practice.