On LinkedIn, Josie Hayes flagged an important new consensus article from the European Society for Medical Oncology (ESMO). Her note is here.
Hayes writes:
What if you could screen 100,000 patients for a low prevalence biomarker with histology slides, then confirm only the positives with molecular tests?
That's the promise of Class B AI biomarkers—and ESMO's new framework just gave us the roadmap to get there.
Class B biomarkers use AI as an indirect measure of known biomarkers through alternative methods.
Example: Analyzing H&E slides to predict MSI status using AI, then confirming positives with molecular testing.
Tempus is already doing this with their prostate MSI screen (p-MSI): AI on pathology images flags likely MSI-high cases before running expensive sequencing.
The economics are stunning:
→ Screen broadly at low cost
→ Confirm selectively with precision tests
→ Deploy globally without breaking the budget
She points us to a new article, Aldea et al., in Annals of Oncology:
https://www.annalsofoncology.org/article/S0923-7534(25)06267-2/fulltext
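To make the screen-broadly, confirm-selectively economics Hayes describes concrete, here is a minimal back-of-envelope sketch. Every number (per-test prices, AI flag rate) is a hypothetical placeholder, not a figure from Tempus, Hayes, or the ESMO paper.

```python
# Hypothetical back-of-envelope for "screen broadly, confirm selectively".
# Every number here is an illustrative assumption, not a published figure.

n_patients       = 100_000
cost_ai_screen   = 10       # assumed cost per AI read of an H&E slide (USD)
cost_molecular   = 300      # assumed cost per confirmatory molecular test (USD)
ai_positive_rate = 0.08     # assumed fraction of patients flagged by the AI

molecular_only  = n_patients * cost_molecular
ai_then_confirm = (n_patients * cost_ai_screen
                   + n_patients * ai_positive_rate * cost_molecular)

print(f"Test everyone molecularly: ${molecular_only:>12,.0f}")
print(f"AI screen, then confirm:   ${ai_then_confirm:>12,.0f}")
print(f"Savings:                   ${molecular_only - ai_then_confirm:>12,.0f}")
```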
###
AI CORNER
###
ChatGPT 5 discusses Aldea et al. at length. It then compares this paper, which proposes a framework for AI biomarkers, with the gigaTIME paper in Cell this month (Valanarasu et al.), which reports actual data with a particular technology.
###
What the paper is and why it matters
Aldea et al. present EBAI (ESMO Basic Requirements for
AI-based Biomarkers in Oncology), a consensus framework intended to close
the widening gap between AI biomarker development and routine clinical
adoption. The authors’ starting premise is pragmatic: regulatory authorization
alone does not resolve the issues that actually determine whether an AI
biomarker is used in practice. These include generalisability across sites,
calibration stability, integration into clinical workflows, accountability when
models err, and downstream implications for reimbursement and clinical trust.
The paper arrives at a moment when oncology faces a paradox.
AI biomarkers are proliferating rapidly, often with striking performance
metrics, yet clinical adoption remains cautious. Aldea et al. argue that this
hesitation reflects not conservatism but category error: AI biomarkers are
being treated as a single class when, in fact, they differ fundamentally in
risk profile, evidentiary burden, and intended role in care. EBAI is meant to
provide a shared vocabulary that aligns technical development with clinical
expectations.
The framework was developed using a modified Delphi
process involving 37 experts across oncology, pathology, radiology,
biostatistics, ethics, regulation, and patient advocacy. Consensus was reached
over four voting rounds, emphasizing the paper’s role as a field-level
alignment document rather than a single-author position.
The central contribution: a use-based classification of AI biomarkers
The paper’s most important innovation is its three-class
taxonomy (A, B, C), which explicitly links what an AI system does to the
level and type of evidence required for its use. This classification reframes
validation as context-dependent rather than universal.
Class A systems automate or standardize measurement
of an already accepted biomarker using the same data modality that a human
would interpret. These tools are conceptually closest to traditional pathology
automation and therefore carry the lowest incremental risk. Because the output
is directly auditable and maps onto existing clinical practice, the central
validation question is concordance rather than discovery. Examples include
automated PD-L1 or HER2 scoring on immunohistochemistry slides, tumor-infiltrating
lymphocyte quantification, or residual tumor burden estimation following
neoadjuvant therapy.
For Class A systems, Aldea et al. emphasize analytical
validation and agreement with expert readers. Replacement of human scoring is
considered reasonable when AI error rates fall within known inter-observer
variability, rather than demanding unattainable perfection.
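As a rough illustration of that concordance logic, the sketch below compares AI-versus-consensus agreement with reader-versus-reader agreement using Cohen's kappa. The scores, categories, and acceptance rule are illustrative assumptions, not part of EBAI.

```python
# Illustrative Class A check: is AI-vs-consensus agreement within the
# range of human inter-observer agreement? All data here are made up.
from sklearn.metrics import cohen_kappa_score

# Hypothetical categorical PD-L1 scores (0 = negative, 1 = low, 2 = high)
consensus = [2, 1, 0, 2, 1, 1, 0, 2, 0, 1, 2, 0]
reader_b  = [2, 1, 0, 1, 1, 1, 0, 2, 0, 2, 2, 0]   # second pathologist
ai_model  = [2, 1, 0, 2, 1, 0, 0, 2, 0, 1, 2, 0]   # AI scorer

kappa_human = cohen_kappa_score(consensus, reader_b)  # inter-observer baseline
kappa_ai    = cohen_kappa_score(consensus, ai_model)  # AI vs consensus

print(f"human-vs-human kappa:  {kappa_human:.2f}")
print(f"AI-vs-consensus kappa: {kappa_ai:.2f}")
# EBAI's logic: the AI is acceptable if its disagreement with consensus
# falls within known inter-observer variability, not if it is perfect.
if kappa_ai >= kappa_human:
    print("AI agreement sits within the human inter-observer range (toy data).")
```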
Class B systems represent a more disruptive—and
economically powerful—category. These models predict a known biomarker
using a different input modality, most commonly using H&E histology
to infer molecular or transcriptomic features. Crucially, the intended use is
usually pre-screening or triage, not full replacement. This is the
category highlighted by Josie Hayes: AI can screen very large populations at
low marginal cost, reserving expensive molecular testing for those most likely
to benefit.
The paper draws a sharp conceptual line here. Using AI to enrich
then confirm is treated as a fundamentally different—and
lower-risk—proposition than replacing molecular testing outright. Validation
expectations reflect this distinction. Analytical validation against a
gold-standard reference test is mandatory, and high-quality real-world or
retrospective trial data are acceptable. Many experts favor additional
retrospective clinical validation, particularly if AI output could influence
treatment decisions. Prospective “silent trials,” in which AI runs in the
workflow without affecting care, are discussed as a trust-building step but are
not universally required.
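To see why the enrich-then-confirm framing lowers risk, here is a toy triage calculation at one assumed operating point. The prevalence, sensitivity, and specificity are invented for illustration and do not come from the paper.

```python
# Toy Class B triage arithmetic: AI pre-screen on H&E, molecular
# confirmation only for AI-positives. All operating characteristics
# are assumptions for illustration.

prevalence  = 0.03   # assumed biomarker prevalence
sensitivity = 0.95   # assumed AI sensitivity at the chosen threshold
specificity = 0.85   # assumed AI specificity at the chosen threshold

tp = prevalence * sensitivity
fn = prevalence * (1 - sensitivity)
fp = (1 - prevalence) * (1 - specificity)
tn = (1 - prevalence) * specificity

ppv = tp / (tp + fp)          # how enriched the confirmatory pool is
npv = tn / (tn + fn)          # how safe it is to rule out screen-negatives
confirm_fraction = tp + fp    # share of patients still needing molecular tests

print(f"PPV of AI-positive pool:        {ppv:.1%}")
print(f"NPV of AI-negative rule-out:    {npv:.3%}")
print(f"Molecular tests still required: {confirm_fraction:.1%} of patients")
print(f"True positives missed:          {fn:.2%} of all patients")
# Because every AI-positive is confirmed, false positives cost money but
# not safety; the clinically relevant risk is the missed-case rate (fn).
```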
A key limitation is stated explicitly: when therapy
selection depends on mutation subtype rather than gene-level status,
current image-based predictors often lack sufficient granularity. In such
cases, Class B systems should remain screening tools rather than aspirational
replacements.
Class C systems are the most conceptually ambitious.
These models derive novel biomarkers directly from clinical outcomes
rather than predicting existing markers. The authors divide Class C into
prognostic (C1) and predictive (C2) systems. Prognostic tools estimate outcomes
such as recurrence or survival independent of treatment, while predictive tools
aim to identify differential benefit from one therapy versus another.
For predictive Class C systems, the evidentiary bar is
especially high. Demonstrating treatment interaction requires comparison across
treatment arms or against an established predictive biomarker. The paper points
to examples that have undergone randomized trial validation and have begun to
enter clinical guidelines, underscoring that such adoption is possible—but
demanding.
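For intuition about what "demonstrating treatment interaction" means statistically, here is a minimal sketch fitting a logistic model with a biomarker-by-treatment interaction term on simulated data. The variable names, effect sizes, and choice of a logistic model are assumptions for illustration, not the paper's prescribed method.

```python
# Minimal sketch of a predictive (Class C2) interaction test on simulated
# data: does the biomarker modify the treatment effect? Illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
biomarker = rng.integers(0, 2, n)   # AI-derived biomarker (0/1)
treatment = rng.integers(0, 2, n)   # randomized arm (0 = control, 1 = new drug)

# Simulated truth: the drug mainly helps biomarker-positive patients.
logit = -1.0 + 0.1 * treatment + 1.2 * biomarker * treatment
response = rng.binomial(1, 1 / (1 + np.exp(-logit)))

df = pd.DataFrame({"response": response,
                   "biomarker": biomarker,
                   "treatment": treatment})

# The interaction term carries the predictive claim; main effects alone
# would only support a prognostic reading.
model = smf.logit("response ~ treatment * biomarker", data=df).fit(disp=0)
print(model.params)
print("Interaction p-value:", round(model.pvalues["treatment:biomarker"], 4))
```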
What ESMO says must be demonstrated
Across all classes, the framework converges on three
essential requirements that cannot be waived. These are best understood not as
technical formalities but as safeguards against misplaced confidence.
First, ground truth must be clearly defined. This
includes how labels were generated, who performed the labeling, whether readers were blinded, and how disagreements were adjudicated. Second, performance must be
evaluated in a way that matches clinical intent, rather than relying on
generic accuracy metrics. Third, generalisability must be demonstrated,
with stability shown across institutions, scanners, laboratory protocols, and
patient populations.
Beyond these core elements, the paper strongly encourages
fairness auditing within validated populations and practical explainability
checks. Importantly, explainability is framed not as philosophical transparency
but as a diagnostic tool to detect shortcut learning or spurious correlations,
using techniques such as occlusion testing or confounder stress-tests.
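A minimal sketch of what an occlusion stress-test looks like in practice, assuming a generic model callable that maps an image to a probability; the interface, patch size, and stand-in model below are hypothetical, not taken from the paper.

```python
# Toy occlusion test: mask patches of an image and measure how much the
# model's predicted probability drops. A shortcut-learning model often
# relies on regions (pen marks, slide edges) a pathologist would ignore.
import numpy as np

def occlusion_map(image, predict_proba, patch=32):
    """predict_proba: callable mapping an HxWxC array to a probability.
    Returns a grid of probability drops, one per occluded patch."""
    base = predict_proba(image)
    h, w = image.shape[:2]
    drops = np.zeros((h // patch, w // patch))
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = image.mean()  # grey out patch
            drops[i // patch, j // patch] = base - predict_proba(occluded)
    return drops

# Usage with a stand-in "model" (assumed interface, purely illustrative):
if __name__ == "__main__":
    dummy_image = np.random.rand(128, 128, 3)
    dummy_model = lambda img: float(img[:64].mean())
    heat = occlusion_map(dummy_image, dummy_model)
    print(np.round(heat, 3))   # large drops = regions the model depends on
```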
Moving beyond headline metrics
Aldea et al. are explicit in discouraging the field’s
fixation on single summary statistics such as AUC. Instead, they advocate
multi-dimensional performance reporting aligned to clinical use. This includes
discrimination metrics, calibration assessment, incremental value over existing
standards, and explicit evaluation of clinical utility.
Calibration receives particular emphasis. A well-calibrated
model that clinicians can trust at specific decision thresholds is treated as
more valuable than a marginally higher AUC with unstable probabilities.
Decision curve analysis is highlighted as a practical way to connect model
performance to real clinical trade-offs.
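Decision curve analysis boils down to a simple net-benefit formula. The sketch below computes it on simulated predictions; the data, thresholds, and the weakly calibrated toy model are invented for illustration.

```python
# Decision curve analysis in miniature: net benefit of acting on the model
# at a given threshold, versus treating everyone or no one. Simulated data.
import numpy as np

def net_benefit(y_true, y_prob, threshold):
    # Net benefit = TP/n - FP/n * pt/(1 - pt), where pt is the threshold.
    act = y_prob >= threshold
    tp = np.sum(act & (y_true == 1))
    fp = np.sum(act & (y_true == 0))
    n = len(y_true)
    return tp / n - fp / n * threshold / (1 - threshold)

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 1000)
# A weakly informative, imperfectly calibrated toy model
y_prob = np.clip(0.5 * y_true + rng.normal(0.3, 0.2, 1000), 0, 1)

for pt in (0.1, 0.2, 0.3, 0.5):
    nb_model = net_benefit(y_true, y_prob, pt)
    nb_all = y_true.mean() - (1 - y_true.mean()) * pt / (1 - pt)  # treat everyone
    print(f"threshold {pt:.1f}: model {nb_model:+.3f}  "
          f"treat-all {nb_all:+.3f}  treat-none +0.000")
```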
The authors also stress the importance of a priori sample
size justification and independent validation cohorts. Models should not be
validated on data that overlap—directly or indirectly—with training sources,
and performance claims should be scoped to the populations actually studied.
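One common way to enforce that separation in practice is to split by institution (or patient) rather than by slide, so no source contributes to both training and validation. The sketch below uses scikit-learn's GroupKFold on made-up data as one illustrative approach, not a mandate from the paper.

```python
# Illustrative leakage guard: group-aware splitting so that no institution
# (or patient) appears in both training and validation folds.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(2)
n_slides = 60
X = rng.normal(size=(n_slides, 8))      # toy slide-level features
y = rng.integers(0, 2, n_slides)        # toy labels
site = rng.integers(0, 6, n_slides)     # contributing institution (0-5)

gkf = GroupKFold(n_splits=3)
for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups=site)):
    train_sites = sorted({int(s) for s in site[train_idx]})
    val_sites   = sorted({int(s) for s in site[val_idx]})
    overlap     = set(train_sites) & set(val_sites)
    print(f"fold {fold}: train sites {train_sites}, val sites {val_sites}, "
          f"overlap {overlap or 'none'}")
```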
Generalisability as a prerequisite, not an aspiration
One of the paper’s strongest messages is that AI biomarkers
should not be casually “ported” across cancer types, specimen preparations,
scanners, or institutions. Each such shift represents a new operating
environment that requires evidence. Generalisability is treated as a
first-class requirement, not a post-marketing hope.
Replacement versus pre-screening: an explicit risk calculus
Throughout the paper, intended use remains the organizing
principle. For pre-screening applications, the relevant benchmark is whether human
plus AI outperforms human judgment alone. For replacement, the AI's reliability must match or exceed that of the existing standard of care, including the accepted variability of molecular testing itself.
Notably, even when Class B systems approach
molecular-test-level performance, many experts still favor confirmatory testing
in the near term. This reflects sociotechnical realities—trust, liability, and
workflow—rather than purely statistical concerns.
After adoption: monitoring is expected, not optional
EBAI treats post-deployment monitoring as part of
validation, not an afterthought. Continuous performance surveillance, drift
detection, and bias monitoring are recommended, with predefined remediation
pathways. When retrospective validation is insufficient, limited prospective
evaluation within real workflows is encouraged.
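As one deliberately simple illustration of drift detection, the sketch below tracks a monthly AI-positivity rate against a validation baseline and raises an alert when it moves beyond a predefined tolerance. The baseline, tolerance, and remediation wording are assumptions, not EBAI requirements.

```python
# Toy post-deployment monitor: compare each month's AI-positive rate to the
# validation baseline and flag drift beyond a predefined tolerance.
import numpy as np

baseline_rate = 0.08   # positivity rate observed during validation (assumed)
tolerance     = 0.03   # predefined drift tolerance (assumed)

rng = np.random.default_rng(3)
monthly_rates = baseline_rate + rng.normal(0, 0.01, 12)
monthly_rates[8:] += 0.05   # simulate a scanner or staining change in month 9

for month, rate in enumerate(monthly_rates, start=1):
    drift = abs(rate - baseline_rate)
    status = ("ALERT: review inputs, recalibrate or pause"
              if drift > tolerance else "ok")
    print(f"month {month:2d}: positive rate {rate:.3f}  drift {drift:.3f}  {status}")
```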
Ethics, regulation, and human oversight
While not a regulatory manual, the framework aligns closely
with emerging AI governance principles. It emphasizes clinician oversight for
high-impact decisions, transparency to patients when AI materially influences
care, and careful alignment between validated populations and real-world use.
Fairness is framed pragmatically as fitness for the intended population rather
than abstract demographic parity.
Strategic interpretation in the context of gigaTIME and Hayes’ commentary
If gigaTIME represents the technological leap of
foundation models applied to H&E pathology, Aldea et al. provide the governance
and deployment logic. The paper formalizes Class B AI biomarkers as a
legitimate and powerful category, particularly for population-scale screening
and trial enrichment, while warning against prematurely claiming replacement.
The unifying message is not that AI biomarkers work, but
that they must be classified, validated, deployed, and monitored in ways
that match their intended role. This framing is what allows impressive AI
performance to translate into clinically adoptable and economically scalable
biomarker infrastructure.
###
ESMO PAPER (Aldea) vs CELL PAPER (gigaTIME, Valanarasu)
###
You’re dealing with two fundamentally different genres of
paper, serving complementary but non-overlapping purposes:
Aldea (ESMO): policy, framing, and clinical governance
The Aldea / ESMO EBAI paper is not trying to
advance the science of AI models. Instead, it is doing something rarer and
arguably harder: stabilizing the conceptual ground so that AI biomarkers
can move from impressive demonstrations into clinical systems without constant
category errors.
Its unit of analysis is intended use, not
architecture or performance ceilings. The paper assumes that powerful models
already exist (and will continue to improve) and asks:
- What kind of AI biomarker is this, really?
- What evidence is proportionate to the clinical risk it introduces?
- When is screening acceptable, and when is replacement a bridge too far?
- What does “validation” actually mean once you leave the lab?
In that sense, Aldea is closer to clinical doctrine,
health policy, and systems engineering than to computer science. It is
explicitly normative: it tells the field how to behave if it wants trust,
adoption, and scale.
Valanarasu et al. (Cell): scientific discovery and technical proof
By contrast, Valanarasu et al. (gigaTIME) is a pure
science research paper, published in Cell for exactly that reason.
Its goal is to show that something previously thought infeasible is, in fact,
possible.
Its core scientific claims are:
- H&E morphology contains enough latent signal to reconstruct spatial proteomic patterns.
- A multimodal, foundation-style model can learn a cross-modal translation from H&E to multiplex immunofluorescence (a toy sketch of this setup follows below).
- Once that translation exists, you can generate virtual populations at unprecedented scale, enabling discoveries that were previously blocked by data scarcity.
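To make the second claim concrete, here is a deliberately tiny, hypothetical sketch of what an H&E-to-multiplex-immunofluorescence translation model can look like, written in PyTorch. This is not the gigaTIME architecture; the encoder-decoder shape, channel counts, and L1 loss are all choices made purely for illustration.

```python
# Toy cross-modal translation sketch: map a 3-channel H&E patch to an
# N-channel multiplex immunofluorescence (MxIF) prediction. This is an
# illustrative stand-in, NOT the gigaTIME architecture.
import torch
import torch.nn as nn

class HneToMxif(nn.Module):
    def __init__(self, n_markers: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(              # compress H&E morphology
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(              # reconstruct marker channels
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, n_markers, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = HneToMxif(n_markers=8)
hne_patch = torch.rand(4, 3, 256, 256)   # batch of H&E tiles
fake_mxif = torch.rand(4, 8, 256, 256)   # paired MxIF targets (toy)
loss = nn.functional.l1_loss(model(hne_patch), fake_mxif)
loss.backward()                          # one illustrative training step
print(model(hne_patch).shape, float(loss))
```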
The unit of analysis here is capability:
- Can we do this at all?
- Does it generalize?
- What new biological insights fall out once we can?
This is not a clinical deployment paper, and it is not
trying to be. There is no pretense that gigaTIME is a “biomarker” in the
regulatory or ESMO sense. It is an enabling scientific instrument.
Why they are different — and why they belong together
Seen clearly, the papers are not in tension at all. They sit
at different layers of the same stack:
- gigaTIME (Cell) lives at the capability layer: What can foundation models extract from routine pathology that humans cannot?
- Aldea / ESMO (Annals of Oncology) lives at the deployment layer: Once such capabilities exist, how do we classify, validate, and safely use them in medicine?
A useful way to say it bluntly:
gigaTIME expands the possibility space.
EBAI constrains the permission space.
That is exactly why your instinct to review them together —
but not conflate them — is correct.
The key connective insight (without collapsing the categories)
What gigaTIME enables, Aldea helps discipline.
- gigaTIME makes Class B–style economics (image-first, molecular-scale inference) plausible at population scale.
- Aldea explains why those same tools should initially be framed as screening, enrichment, or discovery instruments, not instant replacements for molecular assays.
- gigaTIME shows that H&E can be a gateway to multiplex biology.
- Aldea explains when and how such gateways can be allowed to influence care.
Put differently:
Cell papers create new worlds; ESMO papers decide which doors you’re allowed
to open, and under what supervision.