Discoveries in Health Policy: Using AI: (1) Finding Half-Remembered Papers; (2) Finding Papers of Major Importance. For: Computational Pathology.

Summary of This Long Post: Creatively Using AI at Work

Case Study: Digital Pathology

This blog begins with an unexpectedly mundane task: finding two important digital pathology papers that I could only half remember.

You know the feeling: I recalled the ideas, but not the authors or titles. Using ChatGPT as a research assistant, I tracked down the Dawood paper on confounding and shortcut learning in H&E-to-genomics prediction and the Trost/SPARK paper on interpretable, multi-layer computational pathology.

The conversation with AI then expanded into something more interesting: How do we identify papers that genuinely reshape a field? Along the way, ChatGPT located related author networks, suggested additional field-defining papers, and helped formulate an operational definition of a high-impact publication: a paper that future authors must cite, answer, distinguish, or defend themselves against.

The result is both a practical demonstration of AI-assisted literature retrieval and a discussion of how scientific fields change.

#######

This weekend, I used ChatGPT to track down two papers from the last six months that I only half-remembered; I could not even recall the first authors. (Spoiler alert: "Dawood" and "Trost.")

Then, I asked ChatGPT to find recent papers that share several authors from the Trost author list.

Then, I asked ChatGPT to find five or so digital pathology papers (Z1, Z2, etc.) from the past year of equally high originality and potential impact. (Impact, operationally: if you write a paper on topic X in the next 12 months, it's guaranteed the editor or reviewers will require you to address [high-impact paper Z].)

I'm reproducing my dialog with ChatGPT more or less as it occurred.

###

I am trying to locate a major digital pathology paper that appeared within roughly the last six months. I think it may have been published in a high-profile journal such as Nature Medicine, but I am not certain.

The paper addressed the recent wave of studies claiming that AI can analyze H&E tumor slides and predict molecular features, such as whether a tumor has specific gene mutations. As I recall, this newer paper argued that many such claims were overestimated or based on flawed analytic methods. The models may have appeared locally accurate in limited settings, but the results did not hold up well across multiple genes, tumor types, or validation settings.

The paper received some attention at the time, including press coverage and discussion on LinkedIn, but I am having trouble finding it now. Can you help identify the paper?

###

I think this is the paper:

Dawood M, Branson K, Tejpar S, Rajpoot N, Minhas FUA. “Confounding factors and biases abound when predicting molecular biomarkers from histological images.” Nature Biomedical Engineering. Published March 2, 2026. (Nature)

Why it matches: it argues that H&E-based AI biomarker prediction often learns correlated shortcuts—for example, BRAF prediction in colorectal cancer may actually be riding on MSI-associated morphology—rather than isolating a true mutation-specific visual signal. The study analyzed 8,221 patients across breast, colorectal, lung, and endometrial cancers, and found that headline AUROCs often degraded when tested within biomarker-stratified subgroups. (Nature)

Regarding press attention at the time: the press hook was essentially “AI pathology models may be cheating.” Warwick’s release and ASCO Post both covered it in early March 2026. (University of Warwick) (ASCO Post)

###

Yes! Dawood.

###

You're welcome. That paper made enough of a splash that I suspected it might be the one from your description.

What made the Dawood et al. paper stand out was that it wasn't just another "external validation is hard" paper. Rather, it challenged a large portion of the intellectual foundation behind the H&E→genomics literature. For several years, the field accumulated papers claiming that an H&E slide could predict:

MSI status
BRAF mutations
EGFR mutations
KRAS mutations
HRD
TMB
Various expression signatures

Often with respectable AUCs.

Dawood and colleagues essentially asked: "How much of this signal is actually the mutation, and how much is morphology associated with the mutation?"

Their example of BRAF in colorectal cancer is particularly elegant. BRAF-mutant tumors often have MSI and characteristic histology. If an algorithm learns the morphology associated with MSI, it may appear to predict BRAF successfully even if it has little direct information about BRAF itself.

The phrase that stuck with many readers was "shortcut learning." That's the machine-learning term for models exploiting correlated features rather than learning the biologically intended relationship.

From a digital pathology strategy perspective—which is probably why it caught attention on LinkedIn among Roche, Philips, PathAI, Ibex, Paige, Nucleai, and computational pathology observers—the paper has two implications:

1. It raises the evidentiary bar. Future H&E biomarker papers will increasingly be expected to perform subgroup analyses controlling for confounders rather than reporting only a global AUC.
2. It does not invalidate computational pathology. Instead, it shifts attention toward endpoints that are clinically meaningful regardless of mechanism:
- prognosis,
- treatment response,
- recurrence risk,
- immunotherapy benefit,
- survival prediction.

If a model predicts response to pembrolizumab better than current biomarkers, clinicians may care less whether the model is using a direct genomic signal or a complex morphologic correlate.

In some ways, Dawood et al. may become remembered as the computational pathology equivalent of the classic epidemiology papers that taught investigators to think carefully about confounding rather than accepting an impressive association at face value.

For someone following the Roche–Philips–Leica–PathAI–Paige ecosystem, I'd rank it as one of the most important methodological papers in digital pathology published in the last year or two. It is the sort of paper that gets cited in introductions and discussion sections for years because it changes how people frame the problem, not just how they calculate an AUC.

###

[My memory was, you'd predict ER estrogen receptor status from H&E using AI, but you were just replicating that high grade tumors tended to lose differentiated gene expression which everybody already knew.]

###

Yes — that’s exactly the issue. The model may not be “seeing BRAF” or “seeing ERBB2” or “seeing PIK3CA.” It may be seeing a histologic phenotype that is correlated with a molecular phenotype.

So the apparent result:

“AI predicts mutation X from H&E”

may actually mean:

“AI detects tumor grade, mucinous morphology, lymphocytic infiltrate, necrosis, differentiation state, MSI-like morphology, or subtype — and mutation X is enriched in that group.”

Your breast cancer example is a good plain-English version: an H&E model predicting ER status may be partly rediscovering that high-grade, poorly differentiated, more aggressive-looking tumors are more often ER-negative, while better-differentiated tumors are more often ER-positive. That is not useless, but it is very different from saying the model has discovered an image-level molecular signature of estrogen receptor biology.

Dawood et al. make this broader point: many models capture aggregated morphologic correlates or confounded biomarker constellations, not independent molecular signals. They specifically argue that aggregate AUROC can look good while subgroup performance collapses once you control for correlated biomarkers or tumor contexts. (Nature)
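The gap between aggregate and subgroup AUROC can be illustrated with a small simulation. This is a minimal sketch and not the analysis from Dawood et al.: the prevalence rates, the score distribution, and the function names (`auroc`, `simulate`, `stratified_auroc`) are all assumptions chosen for illustration. The simulated model score depends only on MSI-like morphology and never on BRAF itself.

```python
import random

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def simulate(n=2000, seed=0):
    """Return (score, braf, msi) rows under assumed, illustrative rates."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        msi = 1 if rng.random() < 0.2 else 0
        # Assumption: BRAF mutation is enriched in MSI tumors.
        braf = 1 if rng.random() < (0.5 if msi else 0.05) else 0
        # Assumption: the score tracks MSI-like morphology only, not BRAF.
        score = rng.gauss(1.5 if msi else 0.0, 1.0)
        rows.append((score, braf, msi))
    return rows

def stratified_auroc(rows):
    """AUROC for BRAF over all cases, then separately within each MSI stratum."""
    overall = auroc([r[0] for r in rows], [r[1] for r in rows])
    by_stratum = {}
    for level in (0, 1):
        sub = [r for r in rows if r[2] == level]
        by_stratum[level] = auroc([r[0] for r in sub], [r[1] for r in sub])
    return overall, by_stratum

overall, by_msi = stratified_auroc(simulate())
print(f"pooled AUROC for BRAF: {overall:.2f}")
print(f"within MSI-negative: {by_msi[0]:.2f}, within MSI-positive: {by_msi[1]:.2f}")
```

Under these assumed rates, the pooled AUROC should sit clearly above 0.5 while both within-stratum values hover near 0.5, which is the pattern the shortcut-learning critique describes.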

The clean way to say it for a blog might be:

Some H&E-to-genomics AI papers may have been less like discovering invisible molecular information and more like repeatedly rediscovering familiar pathology: grade, differentiation, subtype, mucin, lymphocytes, necrosis, and tumor architecture. The molecular marker was real, but the model’s shortcut was often a broader histologic phenotype correlated with that marker.

That framing is fair and captures the Dawood critique nicely.

###

What is the PMID of DAWOOD?

###

The PMID is 41772176.

Full citation:

Dawood M, Branson K, Tejpar S, Rajpoot N, Minhas FUA.
Confounding factors and biases abound when predicting molecular biomarkers from histological images.
Nature Biomedical Engineering. Published March 2, 2026.
PMID: 41772176
DOI: 10.1038/s41551-026-01616-8

For papers that do predict molecular features from H&E, see e.g. Naboulsi discussing Qi 2025 (endometrial), PMID 41207998. Early efforts: Couture 2018. From 2026: Zhen, finding ecDNA (Mischel) via H&E.

###

Thank you. New topic. There is another major paper from the last six months that I need to locate again. I believe it appeared in a high-profile journal, perhaps Nature, Cell, or something similar.

As I recall, the paper argued that AI prediction from H&E slides could be improved by first performing several distinct, parallel analyses of interpretable histologic features. For example, the model might separately score features such as cellularity, differentiation, necrosis, vascularity, and other tumor or microenvironment characteristics. Then, after generating scores across perhaps 8, 10, or more domains, the algorithm would recombine those feature-level outputs into a new weighted model that was highly predictive of the final target.

The key point was that explicitly breaking the image into multiple intermediate histologic or biologic feature domains, and then reassembling those features into a final predictive model, produced better results than trying to predict the final granular outcome directly from the original H&E image.

The task is the same as before: what paper from the last six months did this in a high-profile publication? If there are a few similar papers that seem to be using this general approach, it would be fine to report several candidates rather than forcing a single answer.

###

The best match is almost certainly:

Trost F, Zhang B, et al. “An agentic framework for autonomous scientific discovery in cancer pathology.” Nature Medicine. Published April 29, 2026. DOI: 10.1038/s41591-026-04357-y.

This is the SPARK paper. It fits your memory closely: instead of going directly from H&E slide → endpoint, SPARK generates many biologically interpretable tissue concepts, converts them into computable parameters, verifies/decorrelates them, then recombines selected parameters into predictive/prognostic models.

A few specifics: SPARK processed H&E slides with quality control, tissue segmentation, tumor/stroma compartment mapping, and single-cell detection; then generated biologically grounded parameters across 18 cohorts, five cancer types, and more than 5,400 patients. It ultimately retained 1,115 nonredundant parameters for downstream analysis.
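How SPARK actually arrives at its nonredundant parameter set is not described in this post. As a rough illustration of what "decorrelating" a feature set can mean, here is a generic greedy correlation filter. The feature names, the values, and the 0.9 cutoff are hypothetical, and the function names are not from the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def drop_redundant(features, max_abs_r=0.9):
    """Keep a feature only if |r| with every already-kept feature is below the cutoff.

    `features` maps name -> per-slide values; insertion order sets priority.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < max_abs_r for k in kept):
            kept.append(name)
    return kept

# Hypothetical per-slide measurements for five slides.
features = {
    "lymphocyte_density": [0.1, 0.4, 0.35, 0.8, 0.6],
    "lymphocyte_count_per_mm2": [10, 40, 35, 80, 60],  # rescaled copy of the first
    "necrosis_fraction": [0.3, 0.1, 0.5, 0.2, 0.05],
}
print(drop_redundant(features))
```

The second feature is a rescaled copy of the first, so the filter drops it and keeps the other two. The surviving features would then feed a downstream model such as the XGBoost step described next.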

The “reassembly” piece appears in their use of XGBoost models for biomarker prediction and integrative prognostic scores using multiple top SPARK parameters. They report predictions for ER in breast cancer, PD-L1 in lung adenocarcinoma, MSI in colorectal cancer, HPV/p16 in head and neck cancer, and prognostic scoring across tumor types.

So in shorthand:

Dawood = warning paper about H&E AI shortcut learning.
Trost/Zhang/Tolkach SPARK = constructive paper arguing that interpretable, modular, concept-based pathology features can be recombined into stronger predictive models.

###

Besides Trost, see Liang, Cancer Cell, Spatial biomarker discovery via interpretable semantic learning.

And Trost stands by itself [multi layer approach]? I had a fragment of memory that there may have been a couple papers like convergent evolution. They were making about the same point at about the same time in different journals.

###

It does not completely stand by itself, but among recent high-profile 2026 papers, Trost/SPARK looks like the cleanest match.

There are at least three related strands:

1. Trost et al., Nature Medicine 2026 — SPARK.
This is the closest to your memory: generate interpretable pathology concepts/parameters first, then recombine them into predictive models. It explicitly contrasts this with opaque end-to-end prediction.

2. Diao et al., Nature Communications 2021 — HIF2GENE / human-interpretable image features.
Older, but philosophically very similar. PathAI-derived “human-interpretable image features” were extracted from densely mapped slides, grouped into cell/tissue architecture features, and then used to predict molecular phenotypes. This may be the intellectual ancestor of what you remember.

3. Li et al., Nature Medicine 2026 — HEX / virtual spatial proteomics.
Not quite the same, but adjacent. Instead of hand-interpretable morphologic bins, HEX predicts a panel of spatial protein-expression features from H&E, then uses those virtual molecular maps for biomarker discovery in lung cancer. So it is also “intermediate biological representation first, downstream prediction second,” but the intermediate layer is virtual proteomics, not cellularity/necrosis/differentiation.

So my read is: Trost is probably the one you meant, but your “convergent evolution” memory is fair. The broader movement is away from black-box H&E → endpoint and toward structured intermediate representations: interpretable morphologic features, concept scores, spatial protein maps, or tissue-grammar-like abstractions.

###

UPDATE. 6/12/2026. The original post had TROST and LI as above. Later in June I noticed LIANG. Here is a comparison of TROST and LIANG.

Yes — your “layering” intuition is exactly right. Both papers try to escape the black-box problem by building intermediate, biologically meaningful representations between raw WSI pixels and final clinical predictions. But they differ in what the layers are and where the “intelligence” sits.

Core comparison

Liang et al. / PathPrism is more of a semantic-spatial biomarker engine. It begins with CRC whole-slide images, decomposes them into predefined tissue classes, and then derives spatial features such as tissue fractions, entropy, and graph-based tissue interactions. Those features become an interpretable “spatial biomarker spectrum” used in transparent models for prognosis, mutation status, and treatment-response prediction. The central architectural metaphor is: pixel image → semantic tissue map → spatial biomarker spectrum → transparent clinical model.
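Two of the feature types named above, tissue fractions and entropy, are simple to compute once each tile has a tissue label. The sketch below assumes a flat list of per-tile labels and uses Shannon entropy in bits. PathPrism's actual feature definitions are not given in this post, so the function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def tissue_fractions(tile_labels):
    """Fraction of tiles assigned to each tissue class."""
    counts = Counter(tile_labels)
    total = sum(counts.values())
    return {tissue: c / total for tissue, c in counts.items()}

def shannon_entropy(fractions):
    """Entropy in bits: 0 for a single class, log2(k) for k equally common classes."""
    return -sum(p * math.log2(p) for p in fractions.values() if p > 0)

# Hypothetical slide: 100 tiles with three tissue classes.
tiles = ["tumor"] * 50 + ["stroma"] * 25 + ["lymphocytes"] * 25
fractions = tissue_fractions(tiles)
print(fractions)
print(shannon_entropy(fractions))
```

A slide dominated by one tissue class scores near zero, while a slide with many evenly mixed classes scores higher, so the value gives a single, inspectable number for tissue heterogeneity.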

Trost et al. / SPARK is more of an agentic scientific-discovery system. It also starts from segmented tissue and cell-level pathology objects, but the distinctive layer is not just spatial feature extraction; it is an LLM-agent workflow that generates biological hypotheses, converts them into code, verifies parameters, and then tests them across cohorts. The paper describes SPARK as a multi-agent “pathology brain” with idea generation, refinement, parameter coding, and verification.

Where they are similar

Both papers are reacting against end-to-end black-box foundation-model pathology. Each tries to make the model’s intermediate representations meaningful to pathologists. In Liang, interpretability comes from predefined tissue semantics and spatial biomarkers. In Trost, interpretability comes from human-readable biological ideas that are translated into measurable parameters. In both cases, the final classifier or predictor is not merely saying “high risk”; it can point to tissue architecture, cell proximity, stromal organization, lymphocyte patterns, tumor-front behavior, or related spatial concepts.

Both also use a layered representation of pathology. The raw WSI is first converted into structured objects: tissue compartments in Liang; tissue compartments plus single-cell detections in Trost. Those objects are then aggregated into higher-order features. Liang emphasizes spatial fractions, entropy, and graph features; Trost emphasizes LLM-generated parameters involving cell types, compartments, morphology, and spatial relationships.

Both papers also frame digital pathology as a discovery platform, not merely a prediction platform. PathPrism discovers candidate spatial biomarkers and uses LLMs secondarily for hypothesis generation; SPARK makes hypothesis generation itself the front end of the system.

Where they differ

The biggest difference is constraint versus openness. PathPrism is deliberately constrained: it defines a controlled semantic vocabulary of tissue types, then extracts a relatively systematic set of spatial features. This gives it a clean chain of reasoning and probably stronger auditability. SPARK is deliberately open-ended: it asks agents to invent biologically meaningful parameters, code them, and filter them. This makes it broader and more flexible, but also more dependent on the quality, stability, and biological discipline of the agentic workflow.

The second difference is disease scope. Liang is deeply developed in colorectal cancer, with extension to breast cancer. It reports application across 7,000 CRC patients and 11 cohorts, with endpoints including survival, MSI, BRAF, TP53, and adjuvant chemotherapy benefit. Trost is broader by design: 18 cohorts, five cancer types, and more than 5,400 patients, spanning lung adenocarcinoma, lung squamous carcinoma, colorectal cancer, breast cancer, and head-and-neck squamous cancer, with prognostic and predictive applications.

The third difference is granularity. PathPrism is primarily macroscopic and tissue-architectural: tumor, stroma, lymphocytes, mucus, muscle, adipose, debris, normal mucosa, and their spatial organization. SPARK goes further toward cellular microecology, explicitly incorporating single-cell detection and classification across tumor cells, fibroblasts, macrophages, lymphocytes, neutrophils, eosinophils, and plasma cells.

Bottom line

PathPrism is the more disciplined “semantic spatial biomarker” paper: it makes digital pathology interpretable by decomposing WSIs into tissue maps and then into quantified spatial layers. SPARK is the more ambitious “agentic discovery” paper: it makes digital pathology exploratory by letting language-model agents propose, implement, and test new biologically framed measurements.

A compact way to say it: Liang builds interpretable layers from the image upward; Trost builds agentic biological concepts from the question downward, then tests them against the image. Both are part of the same movement: turning digital pathology from black-box prediction into structured, inspectable, biologically grounded measurement.

###

END JUNE 12 UPDATE. RETURN TO MAY 31 ARTICLE

###

Two simultaneous trends that emerged in 2025–2026:

The Dawood critique — end-to-end H&E→gene prediction often reflects confounding and shortcut learning.
The SPARK-style response — explicitly decompose pathology into interpretable concepts, then recombine those concepts into predictive models.

From a pathology perspective, this feels almost like a return to classical pathology. Instead of saying:

"The neural network predicts MSI."

the newer philosophy is:

"The system measures lymphocyte architecture, tumor budding, gland formation, necrosis, stromal composition, spatial organization, and related features, then uses those measurements to predict MSI."

That approach is easier to validate biologically, easier to troubleshoot, and less vulnerable to the kind of criticism raised by Dawood. In fact, I would not be surprised if future FDA-cleared computational pathology systems increasingly look more like multi-feature morphometric scores than pure black-box image classifiers.

If you are trying to reconstruct a bibliography of the most influential digital pathology papers of the last year, I'd probably place Dawood (PMID 41772176) and Trost/SPARK (PMID 42056496) in the same conversation because they are almost philosophical opposites: one asks "What are these models really learning?" and the other asks "Can we force them to learn and expose meaningful pathology concepts?"

###

Can you identify publications from 2025 or 2026 that share multiple authors with the Trost paper?

A publication with only one or two overlapping authors is probably not significant enough for this purpose. I am looking for papers that have several or more authors in common with the Trost paper, suggesting that they may come from the same research group, collaboration network, or follow-on line of work.

Please limit the investigation to publications from 2025 and 2026.

###

Yes. The clearest 2025–2026 publications with several overlapping Trost/SPARK authors are these:

| Paper | Journal / year | Overlap with Trost paper |
| --- | --- | --- |
| Ahmad et al. “Development and clinical validation of a prognostic algorithm for stroma-tumor ratio quantification in non-small cell lung cancer.” | Lung Cancer, 2025 | 7 shared authors: Michels, Rasokat, Wolf, Heldwein, Quaas, Büttner, Tolkach. PMID 40466465. (PubMed) |
| Barroso et al. “AI-based single-cell analysis as a next-generation histological grading approach in colorectal cancer…” | Modern Pathology, 2025 | 6 shared authors: Glamann, Bauer, Wickenhauser, Büttner, Quaas, Tolkach. (tolklab.de) |
| Wang et al. “AI Algorithm for Lung Adenocarcinoma Pattern Quantification (PATQUANT)…” | MedComm, 2025 | 5 shared authors: Bychkov, Heldwein, Quaas, Büttner, Tolkach. PMID 40927554. (PubMed) |
| Mittmann et al. “Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer.” | Nature Communications, 2025 | At least 5 shared authors: Bauer, Wagner, Quaas, Büttner, Tolkach. PMID 41062516. (PubMed) |
| Grothey et al. “Comprehensive testing of large language models for extraction of structured data in pathology.” | Communications Medicine, 2025 | 4 shared authors: Schömig-Markiefka, Quaas, Büttner, Tolkach. (tolklab.de) |

My read: Trost/SPARK is not a one-off author group. It sits in a Cologne/Tolkach/Büttner/Quaas ecosystem that published several 2025 computational pathology papers on interpretable grading, tumor-stroma ratio, lung pattern quantification, single-cell analysis, and LLM pathology data extraction.
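The same "several shared authors" screen can be expressed as a set intersection with a threshold. This is a sketch under stated assumptions: the index list contains only the surnames named in the table above, not the full Trost byline; each candidate list is the first author plus the shared names reported there; and the last entry is a made-up paper added to show the filter rejecting weak overlap.

```python
def shared_authors(index_authors, candidates, min_shared=4):
    """Return (title, sorted shared surnames) for papers meeting the threshold,
    largest overlap first."""
    index_set = set(index_authors)
    hits = []
    for title, authors in candidates.items():
        overlap = sorted(index_set & set(authors))
        if len(overlap) >= min_shared:
            hits.append((title, overlap))
    return sorted(hits, key=lambda h: len(h[1]), reverse=True)

# Partial index list (assumption): only surnames named in the overlap table.
trost_partial = [
    "Michels", "Rasokat", "Wolf", "Heldwein", "Quaas", "Büttner", "Tolkach",
    "Glamann", "Bauer", "Wickenhauser", "Bychkov", "Wagner", "Schömig-Markiefka",
]

# Partial author lists (assumption): first author plus the reported shared names.
candidates = {
    "Ahmad 2025 (Lung Cancer)": [
        "Ahmad", "Michels", "Rasokat", "Wolf", "Heldwein", "Quaas", "Büttner", "Tolkach",
    ],
    "Grothey 2025 (Communications Medicine)": [
        "Grothey", "Schömig-Markiefka", "Quaas", "Büttner", "Tolkach",
    ],
    "Hypothetical unrelated paper": ["Smith", "Quaas"],  # made up for illustration
}

for title, overlap in shared_authors(trost_partial, candidates):
    print(len(overlap), title, overlap)
```

Matching on surname alone is fragile, since common names collide and spellings vary, so a real screen would more likely key on author identifiers such as ORCID iDs.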

###

New task. Use the Dawood and Trost papers as two index examples of recent, high-importance papers in digital pathology or computational pathology. By “high importance,” I mean papers that are likely to change how people in the field think, design studies, interpret claims, or evaluate future publications.

Please identify five additional papers from the past year that are likely to have a similar kind of field-shaping impact over the next several years. I am not simply looking for papers with high citation counts or impressive technical performance. I am looking for papers that may cause the field to “wake up” or realign its thinking, in the same way that the Dawood and Trost papers seem likely to do.

For each candidate paper, briefly explain why it appears to be potentially field-shaping: for example, whether it challenges current assumptions, introduces a new analytic framework, exposes a methodological problem, changes expectations for validation, or provides a new standard that future papers will need to cite, answer, distinguish, or defend against.

###

Here are five I’d put beside Dawood and Trost as likely “thinking realignment” papers:

| Paper | Why it matters |
| --- | --- |
| Ding et al., TITAN: “A multimodal whole-slide foundation model for pathology.” Nature Medicine, 2025. PMID: 41193692. | Moves from patch/ROI models toward whole-slide multimodal foundation models using WSI + reports + synthetic captions; a major “platform” paper. (Nature) |
| Li et al., HEX: “AI-enabled virtual spatial proteomics from histopathology…” Nature Medicine, 2026. PMID: 41491099. | Converts H&E into predicted spatial proteomics, then uses that virtual molecular layer for prognosis and immunotherapy-response biology. This is a conceptual leap beyond H&E→single biomarker. (Nature) |
| Campanella et al., “Real-world deployment of a fine-tuned pathology foundation model for lung cancer biomarker detection.” Nature Medicine, 2025. PMID: 40634781. | Important because it tests a computational biomarker in a prospective silent-trial / real-world workflow setting, not just retrospective AUC land. (Nature) |
| Neidlinger et al., “Benchmarking foundation models as feature extractors for weakly supervised computational pathology.” Nature Biomedical Engineering, 2025. PMID: 41034516. | A field-discipline paper: 19 foundation models, 31 tasks, external cohorts. It says: stop treating every new foundation model as self-proving; benchmark them properly. (Nature) |
| Chen et al., OmiCLIP/Loki: “A visual–omics foundation model to bridge histopathology with spatial transcriptomics.” Nature Methods, 2025. | Bridges H&E and spatial transcriptomics, framing pathology AI as a translator between tissue morphology and omics rather than merely a classifier. (Nature) |

My strongest shortlist would be: Dawood, Trost/SPARK, Ding/TITAN, Li/HEX, Campanella/EAGLE, Neidlinger benchmark, and Chen/OmiCLIP.

Together they capture the current pivot: black-box prediction → interpretable features, multimodal foundation models, external benchmarking, virtual omics, and real-world deployment.

###

Good! In this area, I would operationally define a high-impact paper, one that shapes the field's thinking, this way: if an author submits a paper to Nature Medicine in the next year claiming to predict gene PQR from an H&E slide, it's highly likely that the editor or peer reviewers will quickly ask for an acknowledgment and discussion of Dawood. To me, that's an operational rule for a field-shaping or high-impact paper.

###

Exactly. That’s a very good operational definition.

A field-shaping paper is not merely “highly cited” or “technically impressive.” It becomes part of the reviewer checklist.

For Dawood, the new reviewer question becomes:

“Have the authors ruled out confounding and shortcut learning? Are they predicting gene PQR, or are they predicting grade, MSI-like morphology, tumor subtype, necrosis, lymphocytes, or some correlated phenotype?”

For Trost/SPARK, the corresponding reviewer question might become:

“Why is this an opaque end-to-end model? Could the authors decompose the prediction into interpretable morphologic concepts or intermediate tissue features?”

For foundation-model papers like TITAN or benchmarking papers like Neidlinger, the reviewer questions become:

“Why did the authors build a bespoke feature extractor rather than compare against current pathology foundation models?”
“Was the model externally benchmarked across institutions, scanners, cohorts, and tumor types?”

For Campanella/EAGLE-type deployment papers:

“Is this only retrospective AUC performance, or is there evidence of real-world workflow behavior, silent-trial performance, failure modes, and clinical implementation?”

So your rule is excellent:

A high-impact computational pathology paper is one that future authors must either cite, answer, distinguish, or defend themselves against.

That separates “another good AI paper” from a paper that changes the grammar of the field.

###

Yes, your definition or summary sentence is very well designed. Future authors must cite, answer, distinguish, or defend: that's a great sentence.

###

Thank you. I like it because it works across many scientific fields and avoids getting trapped in citation counts.

A paper can accumulate 2,000 citations and still be largely a resource paper or method paper. Conversely, a paper can have only a few hundred citations yet fundamentally alter how reviewers think.

A refinement might be:

A field-shaping paper is one that future authors must cite, answer, distinguish, or defend themselves against, even when their own work is ostensibly about something else.

That's what Dawood seems likely to become.

A hypothetical 2027 paper titled:

"Deep Learning Prediction of FGFR3 Mutations from H&E Slides in Bladder Cancer"

may have nothing directly to do with Dawood's cancer types or biomarkers. Yet reviewers may still ask:

How were confounders addressed?
Did you control for grade and subtype?
Did you test within biomarker-stratified subgroups?
Could the model be exploiting correlated morphology?
How does your work relate to Dawood et al.?

At that point, Dawood has become part of the intellectual infrastructure of the field.

Likewise, Trost/SPARK may become the paper that prompts reviewers to ask:

What morphologic concepts is the model using?
Can intermediate features be exposed?
Is there a biologically interpretable representation?
Why is a black-box approach preferable?

In pathology there are a few classic examples of papers that achieved this status: the 1990s and early 2000s papers establishing HER2 testing standards, the early Oncotype DX validation papers, and, more recently, the first major pathology foundation-model papers. Once they appeared, later authors had to position themselves relative to them.

For a historian of a field, that is often a more interesting question than citation count:

What papers changed the questions reviewers ask?

That's a much smaller and more consequential list. Dawood strikes me as a strong candidate for that category. Trost may be as well, although it is newer and we'll have to see whether the community embraces interpretable concept-based pathology as strongly as its authors hope.

Sunday, May 31, 2026
