Tuesday, December 16, 2025

From ESMO: Requirements for AI-based Biomarkers in Oncology

At LinkedIn, Josie Hayes flagged an important new consensus article from the European Society for Medical Oncology. Her note is here.

Hayes writes:

What if you could screen 100,000 patients for a low prevalence biomarker with histology slides, then confirm only the positives with molecular tests?

That's the promise of Class B AI biomarkers—and ESMO's new framework just gave us the roadmap to get there.

Class B biomarkers use AI as an indirect measure of known biomarkers through alternative methods. 

Example: Analyzing H&E slides to predict MSI status using AI, then confirming positives with molecular testing.

Tempus is already doing this with their prostate MSI screen (p-MSI): AI on pathology images flags likely MSI-high cases before running expensive sequencing.

The economics are stunning: 

→ Screen broadly at low cost 

→ Confirm selectively with precision tests 

→ Deploy globally without breaking the budget

She points us to a new article, Aldea et al., in Annals of Oncology:

https://www.annalsofoncology.org/article/S0923-7534(25)06267-2/fulltext

###

AI CORNER

###

ChatGPT 5 discusses Aldea et al. at length. ChatGPT then compares this framework paper on AI biomarkers with the gigaTIME paper in Cell this month (Valanarasu et al.), which reports actual data from a particular technology.

###

What the paper is and why it matters

Aldea et al. present EBAI (ESMO Basic Requirements for AI-based Biomarkers in Oncology), a consensus framework intended to close the widening gap between AI biomarker development and routine clinical adoption. The authors’ starting premise is pragmatic: regulatory authorization alone does not resolve the issues that actually determine whether an AI biomarker is used in practice. These include generalisability across sites, calibration stability, integration into clinical workflows, accountability when models err, and downstream implications for reimbursement and clinical trust.

The paper arrives at a moment when oncology faces a paradox. AI biomarkers are proliferating rapidly, often with striking performance metrics, yet clinical adoption remains cautious. Aldea et al. argue that this hesitation reflects not conservatism but category error: AI biomarkers are being treated as a single class when, in fact, they differ fundamentally in risk profile, evidentiary burden, and intended role in care. EBAI is meant to provide a shared vocabulary that aligns technical development with clinical expectations.

The framework was developed using a modified Delphi process involving 37 experts across oncology, pathology, radiology, biostatistics, ethics, regulation, and patient advocacy. Consensus was reached over four voting rounds, emphasizing the paper’s role as a field-level alignment document rather than a single-author position.


The central contribution: a use-based classification of AI biomarkers

The paper’s most important innovation is its three-class taxonomy (A, B, C), which explicitly links what an AI system does to the level and type of evidence required for its use. This classification reframes validation as context-dependent rather than universal.

Class A systems automate or standardize measurement of an already accepted biomarker using the same data modality that a human would interpret. These tools are conceptually closest to traditional pathology automation and therefore carry the lowest incremental risk. Because the output is directly auditable and maps onto existing clinical practice, the central validation question is concordance rather than discovery. Examples include automated PD-L1 or HER2 scoring on immunohistochemistry slides, tumor-infiltrating lymphocyte quantification, or residual tumor burden estimation following neoadjuvant therapy.

For Class A systems, Aldea et al. emphasize analytical validation and agreement with expert readers. Replacement of human scoring is considered reasonable when AI error rates fall within known inter-observer variability, rather than demanding unattainable perfection.
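As a back-of-the-envelope illustration of that concordance logic (ours, not the paper's), the sketch below compares AI-versus-expert agreement against expert-versus-expert agreement using Cohen's kappa. The categorical scores are invented, and the comparison rule in the last lines is a simplification of the "within inter-observer variability" idea.

```python
# Hypothetical sketch: is AI-vs-expert agreement within expert-vs-expert variability?
# All scores below are made-up categorical bins (0, 1, 2), not real data.
from sklearn.metrics import cohen_kappa_score

pathologist_a = [0, 1, 2, 1, 0, 2, 1, 1, 0, 2]
pathologist_b = [0, 1, 2, 2, 0, 2, 1, 0, 0, 2]   # a second expert read
ai_model      = [0, 1, 2, 1, 0, 2, 1, 0, 0, 1]   # hypothetical AI output

kappa_experts = cohen_kappa_score(pathologist_a, pathologist_b)
kappa_ai      = cohen_kappa_score(pathologist_a, ai_model)

print(f"expert-vs-expert kappa: {kappa_experts:.2f}")
print(f"AI-vs-expert kappa:     {kappa_ai:.2f}")

# The Class A argument: replacement is defensible only if AI disagreement with
# experts is no worse than the disagreement already tolerated between experts.
if kappa_ai >= kappa_experts:
    print("AI agreement falls within observed inter-observer variability (in this toy example).")
```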

Class B systems represent a more disruptive—and economically powerful—category. These models predict a known biomarker from a different input modality, most commonly inferring molecular or transcriptomic features from H&E histology. Crucially, the intended use is usually pre-screening or triage, not full replacement. This is the category highlighted by Josie Hayes: AI can screen very large populations at low marginal cost, reserving expensive molecular testing for those most likely to benefit.
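The triage arithmetic behind that enthusiasm is easy to sketch. The example below uses entirely hypothetical prevalence, test characteristics, and per-test costs to compare "sequence everyone" with "AI screen, then confirm the positives"; none of the numbers come from Aldea et al. or Tempus.

```python
# Hypothetical screen-then-confirm arithmetic for a Class B workflow.
# Prevalence, test characteristics, and costs are illustrative assumptions only.
def screen_then_confirm(n_patients, prevalence, sensitivity, specificity,
                        cost_ai, cost_molecular):
    true_pos = n_patients * prevalence
    true_neg = n_patients - true_pos

    flagged = true_pos * sensitivity + true_neg * (1 - specificity)  # sent to confirmation
    found   = true_pos * sensitivity                                  # confirmed positives

    cost_two_step = n_patients * cost_ai + flagged * cost_molecular
    cost_test_all = n_patients * cost_molecular

    return {
        "confirmatory tests needed": round(flagged),
        "true positives captured":   round(found),
        "ppv of AI screen":          round(found / flagged, 3),
        "two-step cost":             round(cost_two_step),
        "test-everyone cost":        round(cost_test_all),
    }

# 100,000 patients, 3% prevalence, AI screen 95% sensitive / 90% specific,
# $10 per AI read vs $500 per molecular assay (all numbers hypothetical).
print(screen_then_confirm(100_000, 0.03, 0.95, 0.90, 10, 500))
```

Under these made-up assumptions, roughly 12,550 confirmatory assays replace 100,000, while about 95% of true positives are still captured; the point of the sketch is only that the trade-off is explicit and auditable.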

The paper draws a sharp conceptual line here. Using AI to enrich then confirm is treated as a fundamentally different—and lower-risk—proposition than replacing molecular testing outright. Validation expectations reflect this distinction. Analytical validation against a gold-standard reference test is mandatory, and high-quality real-world or retrospective trial data are acceptable. Many experts favor additional retrospective clinical validation, particularly if AI output could influence treatment decisions. Prospective “silent trials,” in which AI runs in the workflow without affecting care, are discussed as a trust-building step but are not universally required.

A key limitation is stated explicitly: when therapy selection depends on mutation subtype rather than gene-level status, current image-based predictors often lack sufficient granularity. In such cases, Class B systems should remain screening tools rather than aspirational replacements.

Class C systems are the most conceptually ambitious. These models derive novel biomarkers directly from clinical outcomes rather than predicting existing markers. The authors divide Class C into prognostic (C1) and predictive (C2) systems. Prognostic tools estimate outcomes such as recurrence or survival independent of treatment, while predictive tools aim to identify differential benefit from one therapy versus another.

For predictive Class C systems, the evidentiary bar is especially high. Demonstrating treatment interaction requires comparison across treatment arms or against an established predictive biomarker. The paper points to examples that have undergone randomized trial validation and have begun to enter clinical guidelines, underscoring that such adoption is possible—but demanding.
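For readers who want to see what "demonstrating treatment interaction" means statistically, here is a minimal sketch with simulated data and made-up variable names: the predictive claim lives in the treatment-by-biomarker interaction term, not in the biomarker's main effect.

```python
# Hypothetical sketch of the kind of analysis behind a predictive (C2) claim:
# a treatment-by-biomarker interaction term. Data and variable names are invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 2000
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, n),   # 1 = experimental arm
    "ai_marker": rng.integers(0, 2, n),   # 1 = AI-derived biomarker positive
})
# Simulate benefit only for marker-positive patients on the experimental arm.
logit = -1.0 + 1.5 * df["treatment"] * df["ai_marker"]
df["response"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

model = smf.logit("response ~ treatment * ai_marker", data=df).fit(disp=False)
print(model.summary().tables[1])   # the treatment:ai_marker row carries the predictive claim
```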


What ESMO says must be demonstrated

Across all classes, the framework converges on three essential requirements that cannot be waived. These are best understood not as technical formalities but as safeguards against misplaced confidence.

First, ground truth must be clearly defined. This includes how labels were generated, who performed them, whether readers were blinded, and how disagreements were adjudicated. Second, performance must be evaluated in a way that matches clinical intent, rather than relying on generic accuracy metrics. Third, generalisability must be demonstrated, with stability shown across institutions, scanners, laboratory protocols, and patient populations.
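A minimal way to operationalize the third requirement is to report performance stratified by site or scanner rather than as a single pooled number. The sketch below is illustrative only; the column names and the simulated data are assumptions, not anything specified by EBAI.

```python
# Hypothetical sketch: report performance per site instead of one pooled metric.
# The dataframe columns (site, y_true, y_score) and the data are illustrative.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "site":   rng.choice(["hospital_A", "hospital_B", "hospital_C"], size=600),
    "y_true": rng.integers(0, 2, size=600),
})
# fake scores that loosely track the label, just to make the example runnable
df["y_score"] = 0.3 * df["y_true"] + 0.7 * rng.random(600)

threshold = 0.5
for site, g in df.groupby("site"):
    auc  = roc_auc_score(g["y_true"], g["y_score"])
    sens = ((g["y_score"] >= threshold) & (g["y_true"] == 1)).sum() / (g["y_true"] == 1).sum()
    print(f"{site}: AUC={auc:.2f}  sensitivity@{threshold}={sens:.2f}  n={len(g)}")
```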

Beyond these core elements, the paper strongly encourages fairness auditing within validated populations and practical explainability checks. Importantly, explainability is framed not as philosophical transparency but as a diagnostic tool to detect shortcut learning or spurious correlations, using techniques such as occlusion testing or confounder stress-tests.
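Occlusion testing, in particular, is simple to sketch: mask tiles of an input image and record how much the prediction moves. The toy example below assumes a generic scalar-scoring predict_fn and random pixel data; it is not taken from the paper or from any specific vendor pipeline.

```python
# Hypothetical occlusion test: mask tiles of an image and record how much the
# model's score drops. A model that reacts mainly to clinically meaningless
# regions (e.g., slide-edge artifacts) is likely relying on a shortcut.
import numpy as np

def occlusion_map(image, predict_fn, patch=32):
    """image: HxW array; predict_fn: callable returning a scalar score (both hypothetical)."""
    baseline = predict_fn(image)
    h, w = image.shape
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = image.mean()  # grey out the tile
            heat[i // patch, j // patch] = baseline - predict_fn(occluded)
    return heat  # large values = regions the prediction actually depends on

# Toy usage with a made-up scoring function:
demo_image = np.random.default_rng(1).random((128, 128))
heat = occlusion_map(demo_image, predict_fn=lambda x: float(x[:64, :64].mean()))
print(np.round(heat, 2))
```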


Moving beyond headline metrics

Aldea et al. are explicit in discouraging the field’s fixation on single summary statistics such as AUC. Instead, they advocate multi-dimensional performance reporting aligned to clinical use. This includes discrimination metrics, calibration assessment, incremental value over existing standards, and explicit evaluation of clinical utility.

Calibration receives particular emphasis. A well-calibrated model that clinicians can trust at specific decision thresholds is treated as more valuable than a marginally higher AUC with unstable probabilities. Decision curve analysis is highlighted as a practical way to connect model performance to real clinical trade-offs.
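As a hedged illustration of both points, the sketch below checks calibration in probability bins and computes a minimal decision-curve net benefit at a few thresholds. The labels and probabilities are simulated; the net-benefit calculation follows the standard decision-curve definition rather than anything specific to EBAI.

```python
# Hypothetical sketch: calibration check plus a minimal decision-curve net benefit.
# y_true / y_prob are simulated stand-ins for a validated model's outputs.
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(42)
y_prob = rng.random(2000)
y_true = (rng.random(2000) < y_prob).astype(int)   # well calibrated by construction

# Calibration: observed event rate per predicted-probability bin should track the diagonal.
obs, pred = calibration_curve(y_true, y_prob, n_bins=5)
for p, o in zip(pred, obs):
    print(f"predicted {p:.2f} -> observed {o:.2f}")

# Decision curve analysis: net benefit at a clinically chosen threshold t.
def net_benefit(y_true, y_prob, t):
    n = len(y_true)
    tp = ((y_prob >= t) & (y_true == 1)).sum()
    fp = ((y_prob >= t) & (y_true == 0)).sum()
    return tp / n - fp / n * t / (1 - t)

for t in (0.1, 0.3, 0.5):
    print(f"threshold {t}: model net benefit {net_benefit(y_true, y_prob, t):.3f}, "
          f"treat-all {net_benefit(y_true, np.ones_like(y_prob), t):.3f}")
```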

The authors also stress the importance of a priori sample size justification and independent validation cohorts. Models should not be validated on data that overlap—directly or indirectly—with training sources, and performance claims should be scoped to the populations actually studied.


Generalisability as a prerequisite, not an aspiration

One of the paper’s strongest messages is that AI biomarkers should not be casually “ported” across cancer types, specimen preparations, scanners, or institutions. Each such shift represents a new operating environment that requires evidence. Generalisability is treated as a first-class requirement, not a post-marketing hope.


Replacement versus pre-screening: an explicit risk calculus

Throughout the paper, intended use remains the organizing principle. For pre-screening applications, the relevant benchmark is whether human plus AI outperforms human judgment alone. For replacement, error rates must match or exceed the reliability of the existing standard of care, including accepted variability in molecular testing.

Notably, even when Class B systems approach molecular-test-level performance, many experts still favor confirmatory testing in the near term. This reflects sociotechnical realities—trust, liability, and workflow—rather than purely statistical concerns.


After adoption: monitoring is expected, not optional

EBAI treats post-deployment monitoring as part of validation, not an afterthought. Continuous performance surveillance, drift detection, and bias monitoring are recommended, with predefined remediation pathways. When retrospective validation is insufficient, limited prospective evaluation within real workflows is encouraged.
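One common, lightweight way to watch for drift is the Population Stability Index on the model's score distribution, compared against a pre-agreed threshold that triggers remediation. The sketch below is a generic illustration with simulated scores and a rule-of-thumb cutoff; it is not a recipe prescribed by EBAI.

```python
# Hypothetical post-deployment drift check: Population Stability Index (PSI)
# between the validation-time score distribution and a recent production window.
# Thresholds and data are illustrative assumptions, not from the paper.
import numpy as np

def psi(expected, actual, bins=10):
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(3)
baseline_scores   = rng.beta(2, 5, 5000)   # scores seen during validation
production_scores = rng.beta(2, 3, 1000)   # a shifted recent window

value = psi(baseline_scores, production_scores)
print(f"PSI = {value:.3f}")
if value > 0.25:   # commonly cited rule of thumb
    print("Material drift: trigger the predefined remediation pathway (recalibrate / re-validate).")
```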


Ethics, regulation, and human oversight

While not a regulatory manual, the framework aligns closely with emerging AI governance principles. It emphasizes clinician oversight for high-impact decisions, transparency to patients when AI materially influences care, and careful alignment between validated populations and real-world use. Fairness is framed pragmatically as fitness for the intended population rather than abstract demographic parity.


Strategic interpretation in the context of gigaTIME and Hayes’ commentary

If gigaTIME represents the technological leap of foundation models applied to H&E pathology, Aldea et al. provide the governance and deployment logic. The paper formalizes Class B AI biomarkers as a legitimate and powerful category, particularly for population-scale screening and trial enrichment, while warning against prematurely claiming replacement.

The unifying message is not that AI biomarkers work, but that they must be classified, validated, deployed, and monitored in ways that match their intended role. This framing is what allows impressive AI performance to translate into clinically adoptable and economically scalable biomarker infrastructure.

###

ESMO PAPER (Aldea) vs CELL PAPER (gigaTIME, Valanarasu)

You’re dealing with two fundamentally different genres of paper, serving complementary but non-overlapping purposes:


Aldea (ESMO): policy, framing, and clinical governance

The Aldea / ESMO EBAI paper is not trying to advance the science of AI models. Instead, it is doing something rarer and arguably harder: stabilizing the conceptual ground so that AI biomarkers can move from impressive demonstrations into clinical systems without constant category errors.

Its unit of analysis is intended use, not architecture or performance ceilings. The paper assumes that powerful models already exist (and will continue to improve) and asks:

  • What kind of AI biomarker is this, really?
  • What evidence is proportionate to the clinical risk it introduces?
  • When is screening acceptable, and when is replacement a bridge too far?
  • What does “validation” actually mean once you leave the lab?

In that sense, Aldea is closer to clinical doctrine, health policy, and systems engineering than to computer science. It is explicitly normative: it tells the field how to behave if it wants trust, adoption, and scale.


Valanarasu et al. (Cell): scientific discovery and technical proof

By contrast, Valanarasu et al. (gigaTIME) is a pure science research paper, published in Cell for exactly that reason. Its goal is to show that something previously thought infeasible is, in fact, possible.

Its core scientific claims are:

  • H&E morphology contains enough latent signal to reconstruct spatial proteomic patterns.
  • A multimodal, foundation-style model can learn a cross-modal translation from H&E to multiplex immunofluorescence.
  • Once that translation exists, you can generate virtual populations at unprecedented scale, enabling discoveries that were previously blocked by data scarcity.

The unit of analysis here is capability:

  • Can we do this at all?
  • Does it generalize?
  • What new biological insights fall out once we can?

This is not a clinical deployment paper, and it is not trying to be. There is no pretense that gigaTIME is a “biomarker” in the regulatory or ESMO sense. It is an enabling scientific instrument.


Why they are different — and why they belong together

Seen clearly, the papers are not in tension at all. They sit at different layers of the same stack:

  • gigaTIME (Cell) lives at the capability layer:
    What can foundation models extract from routine pathology that humans cannot?
  • Aldea / ESMO (Annals of Oncology) lives at the deployment layer:
    Once such capabilities exist, how do we classify, validate, and safely use them in medicine?

A useful way to say it bluntly:

gigaTIME expands the possibility space.
EBAI constrains the permission space.

That is exactly why your instinct to review them together — but not conflate them — is correct.


The key connective insight (without collapsing the categories)

What gigaTIME enables, Aldea helps discipline.

  • gigaTIME makes Class B–style economics (image-first, molecular-scale inference) plausible at population scale.
  • Aldea explains why those same tools should initially be framed as screening, enrichment, or discovery instruments, not instant replacements for molecular assays.
  • gigaTIME shows that H&E can be a gateway to multiplex biology.
  • Aldea explains when and how such gateways can be allowed to influence care.

Put differently:
Cell papers create new worlds; ESMO papers decide which doors you’re allowed to open, and under what supervision.