Friday, November 21, 2025

LANCET Editor Tries to Put Medical AI Into Perspective; So AI Responds

Richard Horton has been editor of LANCET since 1995, when he was in his mid-thirties.  This fall, he writes a pair of op-eds on AI as, well, "A reservoir of illusions."  Is he really that negative?  Let's take a closer look.

  • Offline: A Reservoir of Illusions (Part 1)
    • Lancet editor Richard Horton on medical AI, with a focus on Marc Andreessen
    • Here.
  • Offline: A Reservoir of Illusions (Part 2)
    • Horton on medical AI, focus on Emily Bender's book, "The AI Con: How to Fight Big Tech..."
    • Here.
Before we look at Horton's articles, if you like this topic, see two articles in JAMA Internal Medicine this week.  First, Bressman et al., Software as a Medical Practitioner - Time to License Artificial Intelligence?  Here.  Second, Steiner, Scientific Writing in the Age of Artificial Intelligence.  Here.

Steiner has a case study in which AI writes, or shortens, medical journal abstracts, and he's not too convinced the results meet his bar for quality or accuracy.  I'd just mention that AI can be trained to do this better (e.g., give it 500 well-written abstracts first, before asking it to write or edit), and that five human authors, writing or rewriting an abstract, would never come up with the same text, word choices, or edits.  Each human editor would pick or alter different things than his colleagues.
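
For readers who want to try that "give it good abstracts first" idea themselves, here is a minimal sketch using the OpenAI chat API.  The model name, file paths, and exemplar count are placeholders, and conditioning on anything like 500 abstracts would in practice mean a long-context model or fine-tuning rather than a single prompt.

```python
# Minimal sketch of few-shot conditioning for abstract editing.
# All paths and the model name are placeholders; adjust to your own setup.
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A folder of well-written abstracts used as in-context exemplars (hypothetical path).
exemplars = [p.read_text() for p in sorted(Path("exemplar_abstracts").glob("*.txt"))[:20]]

draft = Path("draft_abstract.txt").read_text()  # the abstract to be shortened

messages = [
    {"role": "system",
     "content": ("You edit medical journal abstracts. Match the tone, structure, and "
                 "word economy of the exemplar abstracts. Do not add claims that are "
                 "not in the draft.")},
    {"role": "user",
     "content": "Exemplar abstracts:\n\n" + "\n\n---\n\n".join(exemplars)},
    {"role": "user",
     "content": "Shorten this draft to about 250 words, keeping every quantitative result:\n\n" + draft},
]

response = client.chat.completions.create(model="gpt-5", messages=messages)  # model name is a placeholder
print(response.choices[0].message.content)
```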

Working from Horton's Op Ed 2, I gather that Bender and Hanna's book against AI (which I haven't read and know only from this op-ed) argues that AI is just regurgitating text strings.  I see more to it than that.

For example, FDA recently held a full-day workshop on digital medicine and psychotherapy apps.  (My blog here.)  FDA provided a 27-page pre-read and a several-page rapid post-meeting summary.  FDA also provided a link to a 7-hour archived video.  I recorded the video, got an auto-transcription from Otter.ai, and fed the three documents (the pre-read, the short summary, and the full transcript of 160 pages, about 62,000 words) to Chat GPT 5.  I asked, for example, what one would discover in the full transcript that one wouldn't glean from the pre-read or meeting summary - without having to watch the 7-hour meeting myself.  I thought the result was interesting; you can decide; blog link at the top of this paragraph.
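
If you'd like to replicate that kind of multi-document question, here's a rough sketch of the mechanics.  The file names, model identifier, and prompt wording are illustrative rather than a record of what I actually ran, and putting a 62,000-word transcript into a single prompt assumes a long-context model; otherwise you'd chunk the transcript and summarize in stages.

```python
# Rough sketch of the three-document question described above.
# File names and the model name are placeholders; the prompt wording is illustrative.
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

pre_read = Path("fda_preread.txt").read_text()              # the 27-page pre-read
summary = Path("fda_post_meeting_summary.txt").read_text()  # the rapid post-meeting summary
transcript = Path("otter_transcript.txt").read_text()       # the ~62,000-word auto-transcript

prompt = (
    "Below are three documents from an FDA workshop on digital mental health apps.\n\n"
    "1) PRE-READ:\n" + pre_read + "\n\n"
    "2) POST-MEETING SUMMARY:\n" + summary + "\n\n"
    "3) FULL TRANSCRIPT:\n" + transcript + "\n\n"
    "Question: What would a reader discover in the full transcript that is not "
    "apparent from the pre-read or the summary? List the most substantive points."
)

response = client.chat.completions.create(
    model="gpt-5",  # placeholder; any long-context model would do
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```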

###
Looking at Horton's articles, I asked Chat GPT 5 to review them for us.  The result, the "extruded text" [Bender], follows.


###
AI CORNER
###

In Part 1 (“A reservoir of illusions, part 1”), Horton is mostly setting the stage. He contrasts Marc Andreessen’s techno-optimist manifesto with the current investment boom and breathless media narrative around AI. Andreessen is quoted as saying that AI will save lives, end environmental crisis, create material abundance, and that any deceleration of AI is tantamount to “a form of murder”. 

Horton treats this as Exhibit A of “cyber-futurism”: AI as universal problem solver, with the declared enemies being the SDGs, the precautionary principle, risk management, and social responsibility. He acknowledges genuine successes such as AlphaFold and the economic centrality of big tech, but the column’s punchline is that this optimism has become a “bubble”, and that critics who puncture it deserve much closer attention.

In Part 2 (“A reservoir of illusions, part 2”), Horton essentially hands the microphone to Emily Bender and Alex Hanna and endorses much of their thesis. He highlights their claim that “AI” is largely a marketing term: large models don’t think, don’t understand, aren’t sentient, can’t empathize, and can’t care. He repeats Bender/Hanna’s line that ChatGPT is a “souped-up autocomplete” or “text-extruding machine”, with no internal mechanism for source reliability, factual curation, or genuine evaluation. He stresses their view that such systems cannot replace human creativity or relationships, “add nothing to the human condition”, and are “all but irrelevant” to the Majority World, where political will and resources matter more than text synthesis.

Turning to health and science, Part 2 becomes more concrete and pointed. Horton concedes narrow, low-stakes utility—transcription, scheduling, simple triage—but frames the clinical and scientific hype as dangerous: automating diagnosis and care remains “unproven,” and using AI in peer review is a “serious abrogation of scholarly duty.” He foregrounds worries about AI-generated fraudulent manuscripts, collusion between editors and authors, and the risk that if editorial systems cannot cope, “the whole edifice of science” could be undermined. He then moves from Bender/Hanna’s critique to empirical examples: a radiology study where AI increased false positives without improving outcomes; experimental work showing AI systems more willing than humans to execute unethical instructions, lowering practical barriers to misconduct; and evidence that algorithms amplify age and gender bias, deepening inequities. He closes with the IBM line: a computer can’t be held accountable, so it mustn’t make management decisions—implicitly, nor life-and-death clinical ones.

Is Horton “too negative” or just prudently cautious?

For an expert audience in medical AI, I’d characterise his stance as deliberately counter-balancing the hype, but not methodically balanced.

Where he’s strongest / most defensible:

  • He is right that health-AI marketing routinely overruns its evidence base. Many products are launched on retrospective AUCs, reader-study surrogates, or workflow anecdotes rather than hard patient outcomes. Asking that AI be treated as a medical intervention—with explicit questions about information sources, reliability, effects on decisions, efficiency, quality, outcomes, and unintended harms—is exactly the posture most regulators and serious developers now espouse.

  • His concern about scholarly integrity is not exaggerated. Journal workflows, especially in high-volume/open-access environments, are already stressed; the combination of LLMs and paper mills is a genuine systemic risk. Highlighting fabricated AI-driven submissions and collusive editorial behaviour is a useful warning shot to the publishing community.

  • He is right to stress accountability and opacity: deploying opaque models in high-stakes decisions, without clear responsibility and post-deployment monitoring, is a governance failure even if the model’s ROC curve looks good. The IBM quote at the end is rhetorically old-fashioned but still a useful ethical anchor.

Where he overreaches or flattens the landscape:

  • By leaning heavily on Bender/Hanna’s line that these systems “add nothing to the human condition” and are “all but irrelevant” to the Majority World, he slides from valid scepticism into overstatement. There is a sizeable space between “AI will save the world” and “AI adds nothing”, and a lot of serious work in global health, imaging, and language accessibility sits in that middle space.

  • He tends to conflate LLM limitations with “AI” writ large. The critique of chatbots as text-extruders with no internal fact model is accurate for vanilla LLMs, but he doesn’t distinguish:

    • generative chat systems from

    • narrow supervised models (e.g., risk scores, triage models, segmentation networks) that are trained and validated on labelled clinical data, sometimes with outcome endpoints and prospective evaluation.

    That conflation makes his critique feel, to experts, a bit like attacking a specific architecture while rhetorically indicting the entire field.

  • On the other side, he also uses extreme techno-optimism as “the AI side”—Andreessen’s claim that any slowing of AI is essentially murder, that AI will end environmental crisis and sustain 50 billion people. Those are real quotes, not invented straw men, but they are hardly representative of mainstream clinical AI research, where the discourse is about incremental gains, safety, and reimbursement. Framing the debate as Andreessen vs. Bender/Hanna obscures the much larger middle camp of “cautious, data-driven optimists.”

So I’d say: he isn’t fabricating straw men, but he is selectively choosing the loudest techno-optimist (Andreessen) and the loudest critics (Bender/Hanna) as his poles, and he lets their most sweeping lines stand without much nuance. That’s understandable for a short “Offline” column, but it means the pieces read more like rhetorical correction than a balanced policy analysis.

How should technical and policy readers read these pieces?

From a medical-AI/health-policy vantage point, Horton is useful as a mood indicator and a reminder of external perceptions, more than as a fine-grained map of the field.

  • For developers, these columns are a reminder that serious journals are deeply worried about fraud, bias, and hollow “AI-washed” claims. Any translational AI program that doesn’t bake in prospective evaluation, post-market monitoring, and clear lines of human accountability will simply reinforce the concerns Horton amplifies.

  • For policymakers, they crystallise a defensible regulatory attitude: treat AI as a medical intervention; require robust evidence tied to clinical endpoints; scrutinise claims about automation of judgment, empathy, or care; and do not let workflow efficiency gains substitute for patient-relevant benefit.

  • For both groups, the pieces probably under-represent the constructive middle ground—hybrid workflows, assistive decision support, carefully bounded automation, and AI used to widen access rather than replace clinicians—but that’s precisely the opening for your blog’s readers: to articulate what responsible, empirically grounded AI in medicine looks like, in contrast both to Andreessen-style manifestos and to “adds nothing to the human condition” nihilism.

In short: Horton is intentionally anti-hype, sometimes rhetorically sharp, occasionally over-generalizing, but not unserious. He’s not a technical critic of model architectures; he’s a cultural critic warning that we may be building an “edifice of illusions” on top of weak evidence and brittle institutions. For expert readers, the productive move is to accept the core cautions, note where he’s painting with too broad a brush, and then show—using real data and real deployments—how to do better than either pole of the Andreessen ↔ Bender/Hanna spectrum.  // Chat GPT 5