Friday, September 26, 2025

Ferber et al.: Autonomous Agents Reach Clinical Decision-Making (Nature Cancer)

Thanks to Joseph Steward at LinkedIn for pointing out this paper.

See:   https://www.nature.com/articles/s43018-025-00991-6

Here's the authors' abstract. Equipping GPT-4 with an agentic toolset dramatically improved its decision-making accuracy, from 30.3% to 87.2%.

##

Clinical decision-making in oncology is complex, requiring the integration of multimodal data and multidomain expertise. We developed and evaluated an autonomous clinical artificial intelligence (AI) agent leveraging GPT-4 with multimodal precision oncology tools to support personalized clinical decision-making. 

The system incorporates vision transformers for detecting microsatellite instability and KRAS and BRAF mutations from histopathology slides, MedSAM for radiological image segmentation and web-based search tools such as OncoKB, PubMed and Google. Evaluated on 20 realistic multimodal patient cases, the AI agent autonomously used appropriate tools with 87.5% accuracy, reached correct clinical conclusions in 91.0% of cases and accurately cited relevant oncology guidelines 75.5% of the time.

Compared to GPT-4 alone, the integrated AI agent drastically improved decision-making accuracy from 30.3% to 87.2%. These findings demonstrate that integrating language models with precision oncology and search tools substantially enhances clinical accuracy, establishing a robust foundation for deploying AI-driven personalized oncology support systems.

###

They describe the system's operation this way:

We build and evaluate an AI agent tailored to interact with and draw conclusions from multimodal patient data through tools in oncology. Contrary to the philosophy of an all-encompassing multimodal generalist foundation model, we see the achievements that specialist unimodal deep learning models have brought to precision medicine[21] as a viable template even in the future by equipping an LLM, specifically GPT-4, with additional functions and resources. These could be precision oncology deep learning models or the ability to perform web search, all referred to herein as tools. Specifically, this study includes the vision model application programming interface (API) dedicated to generating radiology reports from magnetic resonance imaging (MRI) and computed tomography (CT) scans, MedSAM[22] for medical image segmentation and in-house developed vision transformer models trained to detect the presence of genetic alterations directly from routine histopathology slides[23], in particular, to distinguish between tumors with microsatellite instability (MSI) and microsatellite stability (MSS)[24] and to detect the presence or absence of mutations in KRAS and BRAF. Additionally, the system encompasses a basic calculator, capabilities for conducting web searches through Google and PubMed, as well as access to the precision oncology database OncoKB[25]. To ground the model's reasoning on medical evidence, we compile a repository of roughly 6,800 medical documents and clinical scores from a collection of six different official sources, specifically tailored to oncology.
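
The tool-equipping the authors describe maps naturally onto GPT-4's function-calling interface. Here's a minimal sketch of how such tools could be declared, assuming the OpenAI Python SDK; the tool names, schemas, and wrappers (`predict_msi_status`, `query_oncokb`) are hypothetical stand-ins for the components the authors name, not their actual code.

```python
# Sketch: exposing precision-oncology tools to GPT-4 via function calling.
# Only the OpenAI chat-completions interface itself is real; everything
# else is a hypothetical stand-in for the paper's components.
from openai import OpenAI

client = OpenAI()

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "predict_msi_status",
            "description": "Vision transformer: classify a histopathology "
                           "slide as MSI (microsatellite instable) or MSS.",
            "parameters": {
                "type": "object",
                "properties": {"slide_path": {"type": "string"}},
                "required": ["slide_path"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "query_oncokb",
            "description": "Look up the clinical actionability of a gene "
                           "alteration in OncoKB.",
            "parameters": {
                "type": "object",
                "properties": {
                    "gene": {"type": "string"},
                    "alteration": {"type": "string"},
                },
                "required": ["gene", "alteration"],
            },
        },
    },
    # MedSAM segmentation, the calculator, PubMed/Google search, and
    # guideline retrieval would be declared the same way.
]

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Work up this oncology case ..."}],
    tools=TOOLS,
)
```

The model then replies either with text or with a structured request to call one of these functions; orchestrating those requests is what makes the system an agent rather than a chatbot.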

###

AI CORNER: ChatGPT-5

###

Here’s a ~400-word expert-oriented summary of the Nature Cancer paper for your genomics-industry readers. It highlights what’s genuinely new and interesting beyond the abstract.


A Blueprint for Agentic AI in Oncology

Ferber et al. demonstrate that a modular AI agent built around GPT-4 and a suite of precision-oncology “tools” can outperform a stand-alone LLM for tumor-board–style decision-making. This is less about yet another LLM benchmark and more about how to orchestrate existing components—LLM, retrieval, domain-specific models, search, calculators—into an autonomous clinical-reasoning pipeline.

What the Agent Does

Rather than relying on a monolithic “generalist” multimodal model, the authors equip GPT-4 with specialized functions:

  • Vision transformers trained on TCGA slides to infer MSI vs MSS and KRAS/BRAF mutation status.

  • MedSAM for segmentation of CT/MRI lesions and a calculator to apply RECIST-style %-change.

  • OncoKB API, PubMed, Google search, and a 6,800-document oncology-guideline library accessed through a retrieval-augmented generation (RAG) pipeline.

  • Basic reasoning and planning—e.g., recognize a lesion in two scans, segment both, compute growth, then look up targeted-therapy options.

The agent autonomously chooses and sequences these tools (up to 10 per case), integrates results, and cites supporting evidence.
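
That choose-and-sequence behavior is, mechanically, a loop: the LLM either requests a tool, whose output is appended to the conversation, or stops with a final answer. A minimal sketch under the same SDK assumption as above, with `dispatch` mapping tool names to hypothetical local wrappers:

```python
import json

MAX_TOOL_CALLS = 10  # mirrors the up-to-10-tools-per-case budget

def run_agent(client, messages, tools, dispatch):
    """Loop until GPT-4 stops requesting tools or the budget runs out.

    `dispatch` maps tool names to local callables wrapping the vision
    models, MedSAM, OncoKB, search, etc. (all hypothetical here).
    """
    for _ in range(MAX_TOOL_CALLS):
        msg = client.chat.completions.create(
            model="gpt-4", messages=messages, tools=tools
        ).choices[0].message
        messages.append(msg)
        if not msg.tool_calls:           # no tool requested: final answer
            return msg.content
        for call in msg.tool_calls:      # run each requested tool locally
            result = dispatch[call.function.name](
                **json.loads(call.function.arguments)
            )
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })
    return "Tool budget exhausted without a final answer."
```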

Performance Highlights

  • On 20 realistic gastrointestinal-oncology cases with multimodal inputs, the agent achieved 87.2% completeness on expert-defined decision criteria versus 30.3% for GPT-4 alone.

  • Tool use was successful in 87.5% of required calls; 91% of individual statements were judged correct, with only 2.4% potentially harmful.

  • Citations were accurate in 75.5% of instances—an important step toward auditability and trust.

  • Sequential tool chaining (e.g., MedSAM → calculator → RAG) proved critical to solving multi-step tasks (see the sketch after this list).

  • Open-weights LLMs (Llama-3 70B, Mixtral 8×7B) failed badly at function calling—success in only 39% and 8% of required calls—reinforcing that reasoning + tool orchestration remains a differentiator for proprietary frontier models.
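
The calculator step in that MedSAM → calculator → RAG chain is trivial arithmetic, but it shows why chaining matters: the response category that drives the guideline lookup cannot be read off a single image. A sketch of the middle step, using simplified RECIST 1.1 thresholds and invented diameters:

```python
def recist_percent_change(baseline_mm: float, followup_mm: float) -> float:
    """Percent change in lesion diameter between two time points."""
    return 100.0 * (followup_mm - baseline_mm) / baseline_mm

def recist_category(pct: float) -> str:
    """Coarse RECIST 1.1 mapping; omits the 5 mm absolute-increase
    rule, complete response, and new-lesion criteria for brevity."""
    if pct <= -30.0:
        return "partial response"
    if pct >= 20.0:
        return "progressive disease"
    return "stable disease"

# e.g., MedSAM-derived diameters from baseline and follow-up scans
change = recist_percent_change(baseline_mm=24.0, followup_mm=31.0)
print(f"{change:+.1f}% -> {recist_category(change)}")
# +29.2% -> progressive disease, which is what the agent would then
# feed into the guideline-retrieval (RAG) step
```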

Why It Matters

  • Modular, update-friendly architecture: tools and guideline corpus can be swapped or refreshed without retraining the core LLM—key for keeping up with rapidly changing oncology standards (see the sketch after this list).

  • Explainability & regulatory alignment: each tool can be validated separately and its output inspected—more transparent than a single black-box model.

  • Blueprint for workflow integration: the proof-of-concept suggests a path to embedding such agents into tumor-board software or EHRs, contingent on data-privacy, interoperability, and device-approval hurdles.
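
To make the modularity point concrete: in a RAG design, "refreshing the guideline corpus" means re-embedding documents into a retrieval index; no LLM weights are touched. A runnable toy sketch, where the hashing embedder and the two guideline snippets are placeholders for a real embedding model and the authors' ~6,800-document corpus:

```python
import numpy as np

def embed(texts, dim=256):
    """Toy hashing embedder standing in for a real embedding model."""
    vecs = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for tok in text.lower().split():
            vecs[i, hash(tok) % dim] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-9)   # unit-norm rows

class GuidelineIndex:
    """Retrieval index over the guideline corpus; swapping in an
    updated corpus rebuilds only these vectors, never the LLM."""
    def __init__(self, documents):
        self.documents = documents
        self.vectors = embed(documents)

    def search(self, query, k=3):
        scores = self.vectors @ embed([query])[0]   # cosine similarity
        top = np.argsort(scores)[::-1][:k]
        return [self.documents[i] for i in top]

index = GuidelineIndex([
    "MSI-high metastatic colorectal cancer: consider pembrolizumab.",
    "KRAS G12C-mutant NSCLC: consider sotorasib after platinum therapy.",
])
print(index.search("immunotherapy for MSI-high colorectal cancer", k=1))
```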

Limitations and Next Steps

Small-scale evaluation (20 cases) and reliance on single-slice images limit clinical readiness. Future work will need:

  • Better-validated tools (e.g., clinical-grade MSI detectors, 3-D radiology models such as Merlin)

  • Local, secure open-weights models for HIPAA/GDPR compliance

  • Human-in-the-loop, multiturn interaction to capture real tumor-board dynamics


Bottom line: The paper moves the conversation from “Can a big LLM answer board-style questions?” to “Can we orchestrate specialist tools around an LLM to make safe, auditable, up-to-date recommendations?”—a shift with immediate relevance for developers of clinical-grade decision-support in genomics and oncology.