Two AIs Review the AMA's New Artificial Intelligence Coding Rules "Appendix S 2027"
As of June 27, I believe the 2027 version is here:
And the 2025-and-earlier version is still online here:
https://www.ama-assn.org/system/files/cpt-appendix-s.pdf
###
CHAT GPT 5.5 followed by CLAUDE OPUS 4.8
###
AMA’s 2027 Appendix S improves the AI taxonomy by focusing on software outputs, clinical meaningfulness, and the assistive/augmentative/autonomous distinction. But it still oddly refuses to define “AI,” and it lacks formal decision rules or a flow chart. Without logic-tree testing, the prose may hide ambiguities, contradictions, and edge-case classification problems for future CPT applicants and reviewers.
===========
AMA CPT Appendix S, 2027:
A Better AI Taxonomy, Still Missing Its Decision Rules
The AMA CPT Appendix S revision is a significant improvement over the earlier version. It is more operational, more specific, and more useful for CPT applicants trying to decide whether an AI-enabled medical service is assistive, augmentative, or autonomous. But it also preserves two conceptual problems.
First, the chapter is called a “Taxonomy for artificial intelligence in medical services,” yet it immediately says the term “AI” is not defined in the taxonomy. That is a strange opening move. It is as if a taxonomy of birds began by saying that the document does not define “bird.”
Second, and just as important, the revised Appendix S still does not provide a logical decision tree, a flow chart, or a set of formal classification rules. It offers prose definitions, examples, and a summary table. Those are helpful, but they are not the same as operational rules. As a result, it is difficult to know whether the taxonomy will classify services consistently, or whether difficult cases will expose contradictions, circularity, or paradoxes.
In my view, the AMA should have built the taxonomy first as a logic tree: if this, then that. Only after the decision rules were internally tested should the AMA have converted them into explanatory free text. Free text is useful for education, but it is a dangerous place to bury classification rules. Prose can smooth over problems that a logic tree would immediately expose.
Sidebar: In an interview, Quinn provided this example. An applicant feels his service is not AI, and doesn't refer to Appendix S. At review, several panelists feel the service is AI, and can't proceed without conformity to Appendix S. But now three or four panelists voice their agreement with the applicant, and say the service falls outside of Appendix S. More panelists join each side of the debate, offering increasingly diverse rationales for either position. How is this paradox to be resolved, with no AMA definition of AI?
See also: Is an ICD an AI?
1. What changed: from machine “work” to software “outputs”
The earlier Appendix S was framed around the “work performed by the machine” for the physician or other qualified health care professional. Under that version, assistive AI detected clinically relevant data, augmentative AI analyzed or quantified data to yield clinically meaningful output, and autonomous AI interpreted data and independently generated clinically meaningful conclusions.
The 2027 version changes the emphasis. It focuses less on the machine’s “work” and more on the software’s output and the role of that output in clinical care. That is a major improvement. CPT is not mainly a philosophy of technology. It is a coding system for medical services and procedures. Therefore, the better question is not, “What kind of AI is this?” but, “What clinical output is being produced, and how does that output function in the service?”
The new version asks whether the software merely supports the physician, whether it derives a clinically meaningful parameter, or whether it generates a diagnosis, interpretation, recommendation, or intervention. That is a much more useful framework than a general discussion of “AI.”
But even here, the revision stops short. It provides better distinctions, but not a classification algorithm. It does not say, for example, “Start here. Does the software produce a quantitative or categorical parameter? If no, go to assistive. If yes, does it generate a diagnosis or management recommendation? If yes, go to autonomous. If no, consider augmentative.” That type of structure would make the taxonomy testable. The current text remains interpretive.
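The hypothetical rule quoted above can be written down directly. The following is a minimal Python sketch, not anything the AMA has published: the function name `classify_output` and its two boolean parameters are assumptions introduced here for illustration, and the branch order simply follows the quoted sentence.

```python
def classify_output(produces_parameter: bool,
                    generates_diagnosis_or_recommendation: bool) -> str:
    """Hypothetical rule from the paragraph above; not an AMA-published algorithm."""
    # Step 1: no quantitative or categorical parameter -> assistive.
    if not produces_parameter:
        return "assistive"
    # Step 2: a parameter plus a diagnosis or management recommendation -> autonomous.
    if generates_diagnosis_or_recommendation:
        return "autonomous"
    # Step 3: a parameter alone -> consider augmentative.
    return "augmentative"


print(classify_output(False, False))  # assistive
print(classify_output(True, False))   # augmentative
print(classify_output(True, True))    # autonomous
```

Even this toy version makes one assumption visible: software that offers a recommendation without producing a parameter lands in the assistive branch, because the quoted rule never asks the second question on that path.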
2. Assistive AI: clinically supportive, but not transformative
The 2027 table describes assistive software as clinically supportive of physician or QHP performance by providing clinically relevant data. The examples include outputs such as “likelihood of,” “suggestive of,” or “risk for.” The output requires physician or QHP interpretation and report. Assistive software does not require documentation of clinical meaningfulness.
This is clearer than the old version, but not entirely clean. The words “likelihood of,” “suggestive of,” and “risk for” sound somewhat interpretive. They are not merely raw data. A software output that says something is “suggestive of” a disease is already doing more than highlighting pixels or pointing to an ECG segment. It is nudging the clinician toward a clinical inference.
The taxonomy appears to solve this by saying that assistive software does not derive a distinct quantitative or categorical parameter and does not generate an interpretation or conclusion. But the border remains fuzzy. Is a “risk for” output a clinical parameter? Is a “likelihood of” output a score? Does “suggestive of” become an interpretation if it is sufficiently specific?
A flow chart would force these questions into the open. For example, the first decision point might be whether the software creates a discrete output parameter. If the answer is no, the software may be assistive. If the answer is yes, it cannot be assistive and must move to the augmentative/autonomous branch. Without that rule, the category boundaries remain vulnerable to argument.
3. Augmentative AI: the most improved category
The augmentative category is where the revision does its best work. The new version says augmentative software produces a quantitative or categorical parameter that is qualitatively different from the input. It must be more than adding, averaging, measuring, or reporting descriptive statistics. The output may be a scale, index, classification, risk score, or other metric used in diagnosis, treatment, mitigation, prevention, or management.
That is a useful distinction. A software tool that merely measures a distance, counts cells, averages values, or reports descriptive statistics should not automatically become “AI” or even “augmentative.” A tool that derives a new clinically meaningful score, index, or classification is different. It has transformed the input into a clinically interpretable parameter.
The 2027 text also introduces an important evidence concept. If the output corresponds to a metric already used in clinical care, it should be validated by equivalence to that metric. If the output is novel, it should be validated for impact on patient management. This is a strong addition. It makes clear that the output must be clinically meaningful, not merely computationally interesting.
But again, the prose leaves unresolved edge cases. Suppose software produces a novel categorical classification but the classification is only loosely tied to management. Is that augmentative? Suppose it produces a risk score that a physician may or may not use. Is that still augmentative, or merely assistive? Suppose it produces a score and also displays a suggested next step. Does that cross into autonomous Level I?
These are not academic problems. They are exactly the kind of classification questions that CPT applicants, specialty societies, and advisors will face. A decision tree would make the assumptions visible. The current text asks readers to infer the rules from paragraphs.
4. Autonomous AI: clearer levels, but still dependent on interpretation
The autonomous category is also improved. The new version says autonomous software automatically derives parameters and independently generates clinically meaningful interpretations or conclusions. It may recommend a definitive diagnosis, recommend specific management, or initiate management actions.
The three autonomous levels are sensible. Level I requires physician or QHP judgment to implement or reject the recommendation. Level II allows a reasonable opportunity to negate an impending action before implementation. Level III automatically initiates management actions that continue unless a physician or QHP intervenes, with oversight over time.
That structure is better than the older version because it is more explicitly tied to management consequences. It recognizes that “autonomy” is not just a matter of whether software reaches a conclusion. It is a matter of what happens next. Does the physician act? Does the software propose? Does the software initiate? Does the physician have to stop it?
Still, formal rules would help. For example, the taxonomy could state:
If the software output includes a definitive diagnosis or specific management recommendation, classify as autonomous Level I unless the software also initiates action.
If the software initiates an action after an alert or opportunity to override, classify as autonomous Level II.
If the software initiates management that continues unless a physician intervenes, classify as autonomous Level III.
Rules like these may be implicit in the prose, but implicit rules are not enough. CPT classification should not depend on how artfully an applicant describes the service.
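As a rough illustration, the three example rules above can be encoded as an ordered check. This is a sketch under stated assumptions: the function `autonomous_level` and its parameter names are hypothetical, and the precedence among the rules (checking Level III before Level II) is a choice made here, since the example rules do not say which wins when more than one applies.

```python
from typing import Optional


def autonomous_level(recommends: bool,
                     initiates_action: bool,
                     override_window_before_action: bool,
                     continues_unless_physician_intervenes: bool) -> Optional[str]:
    """Encode the three example rules; returns None when no rule applies."""
    if not initiates_action:
        # Rule 1: a diagnosis or recommendation with no software-initiated action.
        return "Level I" if recommends else None
    # Assumed precedence: the example rules do not rank II against III.
    if continues_unless_physician_intervenes:
        return "Level III"
    if override_window_before_action:
        return "Level II"
    # Action with neither an override window nor ongoing management: no rule covers it.
    return None


print(autonomous_level(True, False, False, False))  # Level I
print(autonomous_level(True, True, True, False))    # Level II
print(autonomous_level(True, True, False, True))    # Level III
print(autonomous_level(True, True, False, False))   # None
```

The final call returns `None`: software that acts once, with no override window, matches none of the three rules as written. Writing the rules as code surfaces that gap, where prose would let it pass.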
5. The oddest feature: an AI taxonomy that does not define AI
The most peculiar feature remains the statement that “AI” is not defined in the taxonomy. I understand why the AMA did this. A formal definition of AI could become obsolete quickly. It could also create arbitrary boundaries. A rules engine, regression model, neural network, image classifier, large language model, or proprietary algorithm may all be called “AI” by different people in different settings.
So the AMA has chosen to avoid that fight. Instead, it classifies software outputs according to their role in clinical care. That is practical. It avoids the trap of defining AI by computational method.
But it creates a threshold problem. How does a service enter Appendix S in the first place? If two people vehemently disagree about whether a service is AI, under their own unwritten definitions, Appendix S does not resolve the disagreement. It can classify the type of AI something is, once someone has already decided that it is AI. But it does not decide whether the thing is AI at all.
That is conceptually awkward. It may also matter in practice. Applicants may invoke “AI” because it sounds innovative. Opponents may deny that something is AI because it sounds overmarketed. The taxonomy does not give a neutral test.
A more honest title might be “Taxonomy for Software-Generated Clinical Outputs in Medical Services and Procedures.” That is really what Appendix S does. It classifies clinical software outputs by function: supportive, parameter-generating, or decision-generating. That is a good and useful taxonomy. It is just not really a definition of artificial intelligence.
6. Why the absence of a logic tree matters
The lack of a formal decision tree is not just a formatting issue. It goes to reliability.
A taxonomy should be tested against hard cases. It should classify the easy examples, but also survive edge cases. What happens when software produces a score and a recommendation? What happens when the output says “highly suggestive of disease X”? What happens when the physician must sign the report but almost never changes the recommendation? What happens when the software output is used in an E/M service but also has independent clinical validity? What happens when the output is a novel index that predicts management response but does not itself recommend management?
The current Appendix S may have answers to these questions, but they are embedded in prose. A logic tree would reveal whether the answers are consistent.
This is why I would have built the revision in the reverse order. First, write the logical rules. Second, test them against known CPT codes and hypothetical edge cases. Third, look for paradoxes. Fourth, convert the stable rules into explanatory prose. The AMA appears to have done the last step without showing the first three.
The danger is that different applicants may use the same words in different ways. One applicant may characterize a score as augmentative. Another may characterize a similar score as assistive because the physician still interprets the result. A third may push toward autonomous Level I by emphasizing that the score implies a specific diagnosis or management action. Without explicit decision points, the taxonomy may become a language game.
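The three readings in the preceding paragraph can be made concrete. In the sketch below, each applicant's reading is a small Python function applied to the same hypothetical service. The field names, the three reading functions, and the example service are all assumptions for illustration, not language from Appendix S.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Hypothetical description of a software service (fields are illustrative)."""
    produces_score: bool
    physician_interprets_result: bool
    score_implies_specific_diagnosis: bool


def reading_one(s: Service) -> str:
    # Applicant 1: any derived score makes the service augmentative.
    return "augmentative" if s.produces_score else "assistive"


def reading_two(s: Service) -> str:
    # Applicant 2: if the physician still interprets the result, it is assistive.
    if s.physician_interprets_result:
        return "assistive"
    return "augmentative" if s.produces_score else "assistive"


def reading_three(s: Service) -> str:
    # Applicant 3: a score that implies a specific diagnosis is autonomous Level I.
    if s.produces_score and s.score_implies_specific_diagnosis:
        return "autonomous Level I"
    return reading_one(s)


service = Service(produces_score=True,
                  physician_interprets_result=True,
                  score_implies_specific_diagnosis=True)

labels = {fn.__name__: fn(service) for fn in (reading_one, reading_two, reading_three)}
for name, label in labels.items():
    print(f"{name}: {label}")
print("distinct labels:", len(set(labels.values())))
```

One service, three defensible labels. A published decision tree would have to pick one of these readings, or a fourth, and the disagreement would end there.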
7. What the 2027 revision gets right
Despite these concerns, the revision is a real improvement.
It correctly moves away from “AI” as a magic word. It focuses on clinical output. It distinguishes raw or supportive information from derived parameters. It requires clinical meaningfulness for augmentative and autonomous outputs. It recognizes that novel metrics need evidence of impact on patient management. It also recognizes that physician work related to augmentative software may already be captured by existing codes.
These are important policy signals. CPT applicants should not come forward merely saying, “We use AI.” They need to say what the software output is, how it differs from the input, why it is clinically meaningful, and how it affects patient care.
That is exactly the right direction.
8. Bottom line
The 2027 Appendix S is better than the prior version. It is more precise, more clinically grounded, and more useful for CPT code applications. It shifts attention from the vague idea of machine “work” to the more practical question of software output and clinical function.
But it still has two major weaknesses.
First, it is an AI taxonomy that does not define AI. That may be defensible, but it is undeniably strange.
Second, it provides prose descriptions rather than formal classification rules. There is no decision tree, no logical flow chart, and no visible test of whether the categories classify services consistently. In difficult cases, free text can hide ambiguity that a logic tree would expose.
The best reading is that Appendix S is not truly a taxonomy of artificial intelligence. It is a taxonomy of software-generated clinical outputs. As that, it is useful. But for a CPT system that must make reproducible decisions, the next version should probably include a formal decision tree: start with the output, ask whether it derives a new parameter, ask whether it generates a diagnosis or management recommendation, ask whether it initiates action, and only then assign the category.
That would make Appendix S not only clearer, but testable.
#
#
#
Claude Opus 4.8
#
#
The 2027 Appendix S reframes AI classification around the output and its clinical role, making "derived parameter" the pivot between assistive and augmentative, softening augmentative's professional component to "conditional," and recasting autonomous Levels I-III as in/on/over-the-loop. But it ships prose without logic rules, leaving boundaries undecidable, two splitting axes conflated, and "AI" pointedly undefined yet load-bearing at application entry.
What actually changed: from "what the machine does" to "what the output is"
The 2025 taxonomy sorted services by the machine's activity — it detects (assistive), analyzes/quantifies (augmentative), or interprets and independently generates conclusions (autonomous). The discriminating axis was a verb describing internal computation, and the table backed it with crude binaries: "Analyzes data: No/Yes/Yes" and "Provides independent diagnosis: No/No/Yes."
The 2027 version re-anchors the whole scheme on the output and "the role of its output(s) in clinical care." The pivotal new construct is the derived parameter — defined in the key-definitions section as a quantitative or categorical output (index, score, classification). This single concept now carries the assistive/augmentative boundary:
- Assistive = output gives clinically relevant data without deriving a parameter and without interpretation/conclusion. Vocabulary: "likelihood of," "suggestive of," "risk for."
- Augmentative = output is a derived parameter, qualitatively different than the input, and — the genuinely operational addition — "more than a summation": more than adding, averaging, measuring, or reporting descriptive statistics. Vocabulary: "predictive of," "prognostic of."
- Autonomous = derives a parameter and generates an interpretation/conclusion as a diagnosis or intervention.
That "more than summation/descriptive statistics" line is the most substantive sharpening in the document. The old "analyzes data: yes/no" was useless — everything "analyzes." The new test at least tries to exclude mere arithmetic from the augmentative tier, and it pairs with a real evidentiary bar: augmentative outputs must be validated by equivalence to a metric in current clinical use, or, if novel, validated for impact on patient management. The table's new row — "Requires documentation of clinical meaningfulness: No/Yes/Yes" — operationalizes that. So the 2027 augmentative category is meaningfully harder to claim than the 2025 one.
Two more substantive moves worth flagging because they're reimbursement-load-bearing, not cosmetic:
The augmentative professional component went from mandatory to conditional. 2025 table: augmentative "Requires physician interpretation and report — Yes." 2027 table: "Conditional (see Note)." The Note says physician work for augmentative software is "typically captured by existing codes" — an E/M data element, a pre-surgical-planning factor, or folded into a separate service. For you this is the line that matters: it formalizes the idea that an augmentative AI output may have no distinct professional payment pathway and instead gets absorbed into a host code. That's friendly to payers and a structural problem for developers who assumed the derived parameter would anchor its own RVUs.
The autonomous levels were rebuilt around a control-theory spine. Stripped of the prose, 2027 Levels I/II/III are the classic human-in-the-loop / on-the-loop / over-the-loop gradient:
- Level I — software recommends; physician must affirmatively implement or reject (pre-action human judgment). 92229 is now explicitly tagged Level I.
- Level II — software is about to act; physician gets an override window before implementation ("reasonable opportunity to negate the impending action").
- Level III — software acts; physician supervises and intervenes after, "unless a physician chooses to intervene," typically judged over multiple interventions.
And a quiet but important disambiguation in the definitions: "automatically" now refers only to the algorithm's input→output work (deriving the parameter), not to acting on the patient. That deliberately decouples "the model runs without a human" from "the system acts on the patient without a human" — i.e., it stops people from reading "autonomous" as "Level III." Plus a new transparency requirement: derived parameters must be reported at all autonomous levels for explainability — which slots neatly into the broader regulatory transparency push (and, not incidentally, the HHS clinical-AI RFI material).
One detail you'll enjoy: the entry gate — output "must be useful in the diagnosis, cure, mitigation, treatment, or prevention of disease" — is the §201(h) device definition lifted almost verbatim. So a taxonomy that pointedly refuses to define "AI" silently tethers its threshold question to FDA's device concept. Which is a good segue to your first objection.
"The term 'AI' is not defined" — paradox, or orthodoxy?
I'd argue your paradox is real at the application stage and dissolves at the classification stage, and the difference is the whole game.
The dissolution: nowhere in the classification criteria does "AI-ness" appear. The splitting variables are derived parameter / interpretation-conclusion / physician-role, all properties of the output and its clinical use, none a property of the method. A logistic regression and a transformer producing the identical risk index land in the same cell. So "is it AI?" isn't undefined-and-therefore-unanswerable for sorting purposes; it's irrelevant. The categories are method-agnostic, which is exactly CPT's decades-old posture: CPT codes the service rendered to the patient, not the instrument. By that logic the document is simply orthodox CPT, and the honest title would be "Taxonomy for Algorithmic and Software-Driven Clinical Outputs." The word "AI" is marketing that imports a definitional liability they then have to disclaim. Your "two people disagree whether it's AI, yet we classify the type" is therefore not a contradiction: it's two people arguing about a label that the classifier never reads.
The residue you're entitled to keep: the 2027 text says the taxonomy "should be consulted for code change applications which describe use of AI-enabled medical services." That sentence makes the undefined term the trigger for entry. If "AI" is undefined, an applicant cannot know whether they're obligated to consult Appendix S — even though, once inside, the AI question never recurs. So the incoherence is real but narrow: AI-ness does zero work in sorting and unlicensed gatekeeping work in triggering. The fix is trivial (drop "AI-enabled," say "software-output-driven services"), which is itself evidence that the "AI" framing is doing branding rather than logic.
The missing logic — No decision procedure
You're right that the revision ships definitions but no decision procedure, and free prose hides exactly the failures a tree would expose. Here is the implied logic made explicit:
- Gate — Is the output clinically useful (the §201(h) verbs)? No → not in scope.
- Derived-parameter test — Does the output derive a quantitative/categorical parameter, qualitatively distinct from input, beyond summation/descriptive statistics? No → Assistive. Yes → continue.
- Interpretation test — Does it also generate an independent interpretation/conclusion (a diagnosis or intervention recommendation)? No → Augmentative. Yes → Autonomous.
- Autonomy level — physician acts to implement (I) / override window before action (II) / acts and physician intervenes after (III).
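The four steps above can be sketched in Python. This is one reading of the implied logic, not a published AMA algorithm: the `SoftwareOutput` fields, the `classify` function, and the string labels are hypothetical names chosen for this sketch, and the Level II versus Level III split assumes the two are mutually exclusive, which the text does not guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoftwareOutput:
    """Hypothetical answers to the four implied questions."""
    clinically_useful: bool          # gate
    derives_parameter: bool          # beyond summation / descriptive statistics
    generates_interpretation: bool   # diagnosis or intervention recommendation
    initiates_action: bool = False   # only consulted on the autonomous branch
    override_before_action: bool = False


def classify(o: SoftwareOutput) -> str:
    if not o.clinically_useful:
        return "not in scope"
    if not o.derives_parameter:
        return "assistive"
    if not o.generates_interpretation:
        return "augmentative"
    # Autonomous branch: grade by what happens after the conclusion.
    if not o.initiates_action:
        return "autonomous Level I"
    if o.override_before_action:
        return "autonomous Level II"
    return "autonomous Level III"


print(classify(SoftwareOutput(True, False, False)))             # assistive
print(classify(SoftwareOutput(True, True, False)))              # augmentative
print(classify(SoftwareOutput(True, True, True)))               # autonomous Level I
print(classify(SoftwareOutput(True, True, True, True, True)))   # autonomous Level II
print(classify(SoftwareOutput(True, True, True, True, False)))  # autonomous Level III
```

Note that `initiates_action` is never examined on the assistive or augmentative branches, so a derived parameter that is acted on automatically still comes out as augmentative under this ordering.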
Drawn out, with the break-points marked, here is the diagram I wish the AMA had shipped: the implied decision spine, with the two boundaries that the prose leaves genuinely undecidable flagged in amber.
[Diagram: implied decision spine; the two undecidable boundaries are flagged in amber]
The two amber nodes are where the prose genuinely fails to determine an answer, and there are more behind them.
Break A — the assistive/augmentative collision. The assistive vocabulary explicitly includes "risk for," yet "risk scores" sit squarely in the augmentative output list. So a numeric risk score reported as a risk-for statement is on both sides of the line at once. The taxonomy's own anchor example makes this worse: 0764T/0765T (algorithmic ECG, low-EF-type risk output) stays assistive in both editions — but under the new "derives a parameter qualitatively distinct from the input" test, a computed probability of cardiac dysfunction derived from a raw ECG looks exactly like a derived parameter. The intended resolution is presumably that a flag/likelihood isn't a validated derived index — but the text gives you no operational threshold to locate that line, and the "more than adding, averaging, measuring" exclusion doesn't help, because a learned weighted sum (logistic regression) is arithmetically "adding" and clinically a derived index. The boundary ends up turning partly on how the report is worded, which makes it gameable — and gaming it matters, because augmentative carries a documentation-and-validation burden that assistive doesn't.
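To see why the "more than adding" exclusion is hard to apply, consider how a logistic-regression risk score is computed. The sketch below uses made-up feature values, weights, and a bias purely for illustration; it does not describe 0764T/0765T or any real product.

```python
import math


def logistic_risk(features, weights, bias):
    """Risk score from a logistic regression: a weighted sum passed through a sigmoid."""
    # The learned weights enter only through this weighted sum, which is addition.
    z = bias + sum(w * x for w, x in zip(weights, features))
    # The sigmoid maps the sum onto a 0-1 probability scale.
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical inputs: three signal-derived measurements with illustrative weights.
risk = logistic_risk(features=[0.8, 1.5, 0.2], weights=[1.2, -0.4, 2.0], bias=-0.5)
print(f"risk = {risk:.2f}")
```

Arithmetically this is adding followed by a fixed rescaling, yet the result would be reported as a risk score. Whether it counts as "more than a summation" depends on a reading the text does not supply.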
Break B — classification vs. conclusion. Augmentative outputs are explicitly allowed to be "categorical classifications." But a categorical classification is a conclusion in ordinary clinical language — "BI-RADS 4" is a classification and diagnostic-adjacent. 92229 (autonomous Level I) outputs essentially a binary refer/don't-refer for diabetic retinopathy, which by the letter of the augmentative definition is just a categorical classification — yet it's autonomous. What actually moves it across the line isn't the output type but the absence of a physician interpretation step. Which exposes the structural fault.
The structural fault: the taxonomy splits on two orthogonal axes and pretends it's one. Read the definitions carefully and you find two latent variables doing the work:
- Axis A — output ontology: raw data → derived parameter → parameter-plus-interpretation.
- Axis B — physician role: must interpret and report → conditional/absorbed → none (then sub-graded in-the-loop / on-the-loop / over-the-loop).
Assistive = (raw data, physician required). Augmentative = (derived parameter, physician conditional). Autonomous = (parameter + interpretation, no physician). These co-vary on the happy path, so the prose reads fine. But nothing forces them to co-vary in the wild, and the off-diagonal cells are real products with no home:
- A derived parameter that is auto-acted upon without physician interpretation — Axis A says augmentative, Axis B says autonomous. A closed-loop insulin algorithm computing a dose-adjustment index and delivering it is precisely this. The taxonomy has no cell for it.
- Raw detection (no derived parameter) wired to automatic action — a simple threshold alarm that auto-titrates on a raw sensor reading. Axis A says assistive; Axis B says Level III autonomous. There is no "assistive-but-acting" cell, so the document silently assumes detection-without-a-parameter can never drive autonomous action, which is empirically false.
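The two-axis point can be checked mechanically. The sketch below crosses the three Axis A values with the three Axis B values and looks up each pair among the only three combinations the taxonomy names. The axis labels are paraphrased from the lists above, and the names `AXIS_A`, `AXIS_B`, and `NAMED_CELLS` are assumptions of this sketch.

```python
from itertools import product

AXIS_A = ["raw data", "derived parameter", "parameter + interpretation"]
AXIS_B = ["physician interprets and reports", "physician conditional", "no physician"]

# The three combinations the taxonomy names (the "happy path" diagonal).
NAMED_CELLS = {
    ("raw data", "physician interprets and reports"): "assistive",
    ("derived parameter", "physician conditional"): "augmentative",
    ("parameter + interpretation", "no physician"): "autonomous",
}

unnamed = [cell for cell in product(AXIS_A, AXIS_B) if cell not in NAMED_CELLS]

for output_type, physician_role in product(AXIS_A, AXIS_B):
    label = NAMED_CELLS.get((output_type, physician_role), "(no category)")
    print(f"{output_type:28} | {physician_role:34} | {label}")

print(len(unnamed), "of", len(AXIS_A) * len(AXIS_B), "cells have no category")
```

The closed-loop insulin example sits in (`derived parameter`, `no physician`) and the auto-titrating alarm in (`raw data`, `no physician`); both are among the six unnamed cells.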
A logic tree forces you to pick one splitting variable per node; the moment you draw it (as above), the off-diagonal cells announce themselves. Prose lets you carry both axes implicitly and never notice they've come apart. That's the concrete vindication of your methodological point: the failure isn't that the prose is unclear sentence-by-sentence — each sentence is fine — it's that the document never commits to a single decision variable, and only a formal structure would have caught it.
A vagueness lurks at II/III. Level II is "opportunity to negate the impending action before implementation (e.g., an alert)"; Level III is "automatically initiates… and will continue unless a physician chooses to intervene." Both are "acts unless a human stops it." The only difference is the salience and timing of the stop-window — a continuum dressed as a categorical boundary. A 30-second pre-action alert that proceeds on timeout: Level II (there was a window) or Level III (it proceeds unless you act)? Undecidable, and it's a genuine issue.
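The timeout case can be written as two predicates, one per level, each paraphrasing the quoted definition. The `PlannedAction` fields and both predicate functions are hypothetical names for this sketch, and the paraphrases are one reading of the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedAction:
    """Hypothetical description of a software-initiated action."""
    alert_window_seconds: float     # time the physician has to negate before it starts
    proceeds_if_no_response: bool   # action goes ahead when nobody responds


def meets_level_ii(a: PlannedAction) -> bool:
    # "Reasonable opportunity to negate the impending action before implementation."
    # What counts as "reasonable" is not quantified; any positive window passes here.
    return a.alert_window_seconds > 0


def meets_level_iii(a: PlannedAction) -> bool:
    # "Automatically initiates ... and will continue unless a physician chooses to intervene."
    return a.proceeds_if_no_response


timeout_alert = PlannedAction(alert_window_seconds=30, proceeds_if_no_response=True)
print(meets_level_ii(timeout_alert), meets_level_iii(timeout_alert))  # True True
```

Both predicates hold for the same action, so the two levels are not mutually exclusive under this reading. A rule that separates them would need an explicit threshold, for example on window length, that the text does not give.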
Where I'd partly defend AMA CPT, and where I wouldn't
On "AI is not defined": I think the paradox is real but mislocated, and worth stating precisely because the precise version is sharper.
The classification criteria never read the word "AI" — they read output type and clinical role — so for sorting purposes, whether something is AI is not undefined-and-unanswerable, it's simply irrelevant. A regression and a transformer with the same output land in the same cell. That is orthodox CPT: the code set has always abstracted away from the instrument and coded the service. By that reading the honest title is "Taxonomy for Software-Driven Clinical Outputs," and "AI" is branding that imports a definitional liability they then have to disclaim. The residue you're entitled to keep is narrower and more damning: the application guidance says the taxonomy "should be consulted for code change applications which describe use of AI-enabled medical services." There, the undefined term is the trigger for entry — you can't know whether you're obligated to consult Appendix S without knowing what "AI-enabled" means. So AI-ness does zero work in classification and unlicensed gatekeeping work at the application threshold. The cure is one word: delete "AI-enabled."
The strongest argument for shipping prose rather than a flowchart: standards bodies often want the residual fuzziness, because a logic tree forecloses the negotiated ambiguity they rely on for consensus and for case-by-case Panel adjudication. A decision tree binds the Panel to outcomes it hasn't litigated yet; prose preserves discretion. So the missing flowchart may be a deliberate institutional choice, not an oversight. But the cost of that choice is exactly what you flagged — undecidability at the boundaries that carry the reimbursement consequences (the "conditional" professional component for augmentative is where the dollars actually move). I'd take your formalize-first discipline over their consensus-preserving haze, with the caveat that they may understand the politics of binding the Panel better than the logic of binding the applicant.
##
##
##
Chat GPT compares the two long essays.
##
I treated the uploaded text as your combined draft: ChatGPT first, then the section labeled Claude Opus 4.8. The key contrast is that ChatGPT gives a clear policy-reader critique, while Claude turns the critique into a more formal logic analysis with axes, edge cases, and paradoxes.
Comparing the Two AI Reviews of AMA’s 2027 Appendix S
The two AI reviews reach broadly similar conclusions: AMA’s 2027 Appendix S is a real improvement, but it remains conceptually unfinished.
Both reviews praise the shift from the older “work performed by the machine” framing to a newer focus on software outputs, clinical meaningfulness, and the role of the output in care. Both also focus on the same two oddities: Appendix S is an AI taxonomy that refuses to define “AI,” and it offers prose definitions rather than a true decision tree.
But the two discussions have different strengths.
The ChatGPT discussion is the more accessible policy essay. It explains the revision in a way that a CPT-aware but nontechnical reader can follow. It emphasizes that the new Appendix S is best understood not as a taxonomy of artificial intelligence, but as a taxonomy of software-generated clinical outputs. That is a useful and memorable reframing. In this view, the real categories are not about AI methods but about clinical function: does the software support the physician, derive a new clinical parameter, or generate a diagnosis or management recommendation?
ChatGPT also gives a clean critique of the missing logic tree. It argues that AMA should have developed the taxonomy first as formal decision rules, then tested those rules against difficult examples, and only afterward converted the stable rules into explanatory prose. That is a strong procedural criticism:
- Free text can sound reasonable while hiding contradictions.
- A flow chart forces the authors to decide which variable controls classification at each step.
Claude’s discussion is more technical and more aggressive. It agrees that the document has improved, but it pushes harder on the logical architecture (or lack of it).
Claude identifies “derived parameter” as the new pivot between assistive and augmentative services. That is a useful observation. In the revised Appendix S, the key question is not merely whether software “analyzes” data. The key question is whether it produces a quantitative or categorical output—an index, score, classification, or similar parameter—that is qualitatively distinct from the input.
Claude also makes an important reimbursement point: the 2027 table changes augmentative software from always requiring physician interpretation and report to requiring it only conditionally.
That matters because the note says physician work related to augmentative software may already be captured by existing codes, such as E/M, presurgical planning, or another interpretive service. In other words, even if a software output is clinically meaningful, that does not guarantee a separate physician work payment pathway. For developers, that may be the most economically important sentence in the revision.
Claude’s most original contribution is the “two axes” analysis. It argues that Appendix S is trying to sort services along two different dimensions at once.
- One axis is output type: raw or supportive data, derived parameter, or interpretation/conclusion.
- The other axis is physician role: physician must interpret, physician role is conditional, or software acts with little or no concurrent physician involvement.
On the easy examples, these axes line up neatly. Assistive software provides supportive data and requires physician interpretation. Augmentative software provides a derived parameter and may be integrated into physician work. Autonomous software generates a conclusion or action with reduced physician involvement.
But real products may not fall so neatly. A tool might produce a derived parameter and automatically act on it. A simple sensor threshold might trigger an action without deriving a sophisticated parameter. A categorical classification might look augmentative in one context and autonomous in another. Claude’s point is that the taxonomy works when the two axes co-vary, but it does not explain what to do when they separate.
That is the kind of problem a formal flow chart would expose immediately.
In prose, the ambiguity can stay hidden. In a logic tree, the author must decide: is the next split based on the kind of output, or on what the software does with that output? Appendix S never fully answers that question.
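The two-axes point can be made concrete with a small sketch. It is illustrative only: the enum names, the function names, and the three "aligned" cells are my own shorthand for the review's summary above, not AMA language or an AMA rule set. The sketch simply crosses the two axes and lists every combination that the easy examples leave unclassified.

```python
from enum import Enum
from itertools import product


class OutputType(Enum):
    # Hypothetical labels for the first axis described in the review.
    SUPPORTIVE_DATA = "raw or supportive data"
    DERIVED_PARAMETER = "derived parameter"
    CONCLUSION = "interpretation/conclusion"


class PhysicianRole(Enum):
    # Hypothetical labels for the second axis described in the review.
    MUST_INTERPRET = "physician must interpret"
    CONDITIONAL = "physician role is conditional"
    MINIMAL = "little or no concurrent physician involvement"


# The three cells where the axes line up, per the review's summary.
# Assumption: this mapping paraphrases the commentary, not Appendix S itself.
ALIGNED = {
    (OutputType.SUPPORTIVE_DATA, PhysicianRole.MUST_INTERPRET): "assistive",
    (OutputType.DERIVED_PARAMETER, PhysicianRole.CONDITIONAL): "augmentative",
    (OutputType.CONCLUSION, PhysicianRole.MINIMAL): "autonomous",
}


def classify_by_axes(output, role):
    """Return a category when the axes agree, or None when they separate."""
    return ALIGNED.get((output, role))


def unresolved_cells():
    """List every axis combination that the aligned examples do not cover."""
    return [
        (output, role)
        for output, role in product(OutputType, PhysicianRole)
        if classify_by_axes(output, role) is None
    ]


if __name__ == "__main__":
    for output, role in unresolved_cells():
        print(f"unresolved: {output.value} + {role.value}")
```

Of the nine possible combinations, only three are settled by the easy examples. The tool that derives a parameter and then acts on it automatically sits in one of the six open cells.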
The two reviews also treat the “AI is not defined” issue slightly differently. ChatGPT frames it as a conceptual oddity: the chapter is called an AI taxonomy, yet it declines to define AI. That makes the taxonomy seem like a bird taxonomy that refuses to define “bird.”
Claude sharpens the point by distinguishing classification from entry. Once a service is inside Appendix S, whether it is “really AI” does not matter very much. The taxonomy classifies outputs, not computational methods. A logistic regression and a transformer could land in the same category if they produce the same clinical output. In that sense, the refusal to define AI is not fatal to classification.
But the undefined term still matters at the front door. The AMA says Appendix S should be consulted for code applications involving AI-enabled services. If “AI” is undefined, then the trigger for using the taxonomy is itself unclear. AI does no real work inside the classification rules, but it still performs unlicensed gatekeeping work at entry.
That is a subtle but powerful point.
Overall, ChatGPT provides the better explanatory essay for a general policy audience. It is clean, readable, and organized around the practical implications for CPT applicants. Claude provides the better stress test. It finds the hidden structural issue: Appendix S may be using two classification axes while pretending to use one. It also gives concrete examples of why the boundaries may become unstable, especially between assistive and augmentative, and between augmentative and autonomous Level I.
Take-Home Lessons: Having Read Both...
The best synthesis is this: AMA’s 2027 Appendix S is meaningfully better than the prior version, but still not fully engineered. It has better vocabulary, better evidence expectations, and better attention to the clinical role of software outputs. But it is still written as a prose taxonomy rather than a decision procedure. For a coding system, that is a problem. CPT classification should not depend on how creatively an applicant describes the service.
The next version should probably begin with a logic tree. First ask whether the software output is clinically relevant and within scope. Then ask whether it derives a distinct quantitative or categorical parameter. Then ask whether it generates an interpretation, diagnosis, or management recommendation. Then ask whether the software merely recommends, acts after an override window, or acts until interrupted. Only after that should the AMA write the educational prose.
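As a rough illustration of what such a logic tree might look like, here is a minimal sketch. The question order follows the paragraph above; everything else is an assumption of mine, including the `Service` fields, the `ActionMode` names, the outcome on each "no" branch, and the choice to flag one combination as unresolved. It is not the AMA's decision procedure.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionMode(Enum):
    # Hypothetical names for the three modes listed in the final question.
    RECOMMENDS = "recommends only"
    ACTS_AFTER_OVERRIDE_WINDOW = "acts after an override window"
    ACTS_UNTIL_INTERRUPTED = "acts until interrupted"


@dataclass
class Service:
    # Hypothetical yes/no answers to the four questions, in order.
    in_scope: bool
    derives_parameter: bool
    generates_conclusion: bool
    action_mode: Optional[ActionMode] = None


def classify_service(service: Service) -> str:
    # Question 1: is the output clinically relevant and within scope?
    if not service.in_scope:
        return "out of scope"

    # Question 2: does it derive a distinct quantitative or categorical parameter?
    if not service.derives_parameter:
        # Assumption: the prose does not say what happens when a conclusion
        # is generated without a derived parameter, so flag it, do not guess.
        if service.generates_conclusion:
            return "unresolved"
        return "assistive"

    # Question 3: does it generate an interpretation, diagnosis, or
    # management recommendation?
    if not service.generates_conclusion:
        return "augmentative"

    # Question 4: what does the software do with its conclusion?
    if service.action_mode is None:
        raise ValueError("an autonomous service needs an action mode")
    return f"autonomous ({service.action_mode.value})"


if __name__ == "__main__":
    example = Service(
        in_scope=True,
        derives_parameter=True,
        generates_conclusion=True,
        action_mode=ActionMode.ACTS_AFTER_OVERRIDE_WINDOW,
    )
    print(classify_service(example))
```

Even a toy version like this forces a decision at every branch, which is the point: the "unresolved" return marks a case where the sketch has to admit it does not know the answer.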
In plain terms, Appendix S is improving. But it still reads like a negotiated text, not a tested classification machine.
