July issues of JAMA and Annals of Internal Medicine each brought out several articles with new insights for diagnostics.
Since Fall 2021, JAMA has published a surprisingly extensive series of articles on diagnostics and diagnostic excellence. The newest additions are:
Sarma et al. (2022) Achieving Diagnostic Excellence for Cancer: Symptom Detection as a Partner to Screening. JAMA July 18 online. Here.
Sarma et al. note that only a limited number of cancers have widely endorsed screening tests (cervical, breast, colon, and, more recently and less successfully, lung). Yet even for these cancers, only a minority of cases are detected by population screening. They describe programs in the UK and elsewhere that focus on early symptom work-up and intervention.
Reyna et al. (2022) Rethinking Algorithm Performance Metrics for Artificial Intelligence in Diagnostic Medicine. JAMA 328:329. Here.
Reyna et al. describe cases where custom-weighted metrics were used to capture clinical usefulness when comparing tests. For example, a sepsis detection test was scored with different penalties for wrong diagnoses that were false positives vs false negatives. This goes beyond the classic simple metrics (sensitivity, specificity, area under curve) and merges evaluation with specific use cases. Another example came from cardiology AI diagnoses, where "false" diagnoses that led to the same workup pathway as the correct ones received smaller penalties. (Related, see Persson on machine learning ICU algorithms for sepsis, here, and another "rich content" evaluation of algorithms by Cerrato et al. here.)
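To make the idea of use-case-weighted scoring concrete, here is a minimal sketch in Python. The function names, the eight hypothetical patients, and the weights (a missed case costing five times a false alarm) are all assumptions made up for illustration. They are not the weights or the scoring scheme from Reyna et al. or from any published challenge.

```python
# Illustrative only: the weights below are invented for this sketch and are
# not the weights used in Reyna et al. or in any published challenge.

def accuracy(y_true, y_pred):
    """Fraction of cases where the prediction matches the truth."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def weighted_utility(y_true, y_pred, tp_reward=1.0, fp_penalty=1.0, fn_penalty=5.0):
    """Add a reward for each true positive and subtract a penalty for each error.

    True negatives score zero. A missed case (false negative) is assumed
    to cost more than a false alarm (false positive).
    """
    score = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            score += tp_reward
        elif truth == 1 and pred == 0:
            score -= fn_penalty
        elif truth == 0 and pred == 1:
            score -= fp_penalty
    return score


# Eight hypothetical patients, three of whom truly have the condition.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
pred_a = [1, 1, 0, 0, 0, 0, 0, 0]  # algorithm A: one missed case
pred_b = [1, 1, 1, 1, 0, 0, 0, 0]  # algorithm B: one false alarm

acc_a, acc_b = accuracy(y_true, pred_a), accuracy(y_true, pred_b)
util_a = weighted_utility(y_true, pred_a)
util_b = weighted_utility(y_true, pred_b)

print(f"accuracy: A={acc_a:.3f}  B={acc_b:.3f}")
print(f"utility:  A={util_a:+.1f}  B={util_b:+.1f}")
```

Both hypothetical algorithms make exactly one error, so plain accuracy cannot tell them apart. The weighted score separates them because it prices the two kinds of error differently, which is the general point of merging evaluation with a specific use case.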
Turning to Annals of Internal Medicine:
Burke et al. (2022) The Challenge of Variants of Unknown Significance. AIM 175:994. Here.
The authors summarize, "the high prevalence of rare and novel variants in the human genome points to VUSs as an ongoing challenge. Additional strategies can help mitigate the potential harms of VUSs, including testing protocols that limit identification or reporting of VUSs, subclassification of VUSs according to the likelihood of pathogenicity, routine family-based evaluation of variants, and enhanced counseling efforts. All involve tradeoffs, and the appropriate balance of measures is likely to vary for different test uses and clinical settings. Cross-specialty deliberation and public input could contribute to systematic and broadly supported policies for managing VUSs."
Lehmann et al. (2022) Ethical Considerations in Precision Medicine and Genetic Testing in Internal Medicine Practice: A Position Paper From the American College of Physicians. AIM here.
The authors summarize, "Whether or not to use genetic tests or adopt technologies such as genome sequencing in clinical care and, if so, when and how to do so needs careful consideration. This paper focuses on the use of precision medicine and genetics in the practice of general internal medicine, where patients may be at increased risk for a heritable condition, may need a drug associated with a known pharmacogenomic variant, or may have a cancer diagnosis for which genetic information may guide treatment decisions. This position paper is intended to complement and provide more specificity to the guidance outlined in the American College of Physicians (ACP) Ethics Manual, which identifies a number of issues associated with precision medicine..."
Lee et al. (2022) QUAPAS: An Adaptation of the QUADAS-2 Tool to Assess Prognostic Accuracy Studies. AIM 175:1010.
One of the world's experts in diagnostic tests and test theory is Patrick Bossuyt of the University of Amsterdam (here). In this paper, the Dutch team adapts a major tool for assessing diagnostic accuracy studies, QUADAS (QUADAS-2), for prognostic test accuracy: QUAPAS with a "p".
(See also my blog on MolDx's newly released checklist for prognostic tests, here.)
There are a lot of issues with conventional low-voltage evaluation metrics for prognostic tests.
- First, the tests often have a continuous score or several classes of output (low, medium, high), so familiar binary sensitivity and specificity cannot be calculated without first imposing a cutoff.
- Plus, Sens/Spec and the related "Area Under Curve" (AUC) are blind to prevalence, so on their own they cannot give the positive predictive value and negative predictive value, which are crucial for clinical test performance.
- In addition, Sens and Spec are often reported to decimal points ("Our sensitivity is 93.6%"), but small shifts in either prevalence or patient spectrum (along a continuum) will shift Sens/Spec metrics by quite a bit. That renders population test summary metrics expressed with decimal points nonsensical.
- And Sens/Spec are expressed for whole tested populations, in which many patients may be either strong positives or strong negatives. Clinically, though, the most important patients, with the most need for accurate testing, are those in the "gray zone," where test performance may be basically a coin flip.
- For an entry point on spectrum effects, see the Usher-Smith 2016 review paper, BMJ 2016;353:i3139, here.
- Another difficulty is pilot data for a biomarker reported with excitement as a "p value" between groups (the High Risk group averages 8 units, the Low Risk group averages 7 units, p<.05). The test is still not useful because, despite the significant difference in averages, 80% of the patient test scores in the two groups actually overlap.
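Two of the points in the list above are easy to check with a few lines of arithmetic. The Python sketch below first shows how a fixed sensitivity and specificity give very different predictive values as prevalence changes, and then how two groups with means of 8 and 7 can differ "significantly" while their score distributions still overlap by about 80%. All of the numbers are assumptions chosen for illustration: 90% Sens/Spec, prevalences of 1% and 30%, normally distributed scores with a standard deviation of 2, and 100 patients per group. The significance test is a simple two-sided z-test with the standard deviations treated as known, which may not match what a real pilot study would use, and "overlap" is read here as the shared area under the two distribution curves.

```python
from math import sqrt
from statistics import NormalDist


def predictive_values(sens, spec, prevalence):
    """Return (PPV, NPV) from sensitivity, specificity and prevalence (Bayes' rule)."""
    tp = sens * prevalence
    fn = (1 - sens) * prevalence
    tn = spec * (1 - prevalence)
    fp = (1 - spec) * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


# Same hypothetical test (90% Sens, 90% Spec) in two settings.
ppv_low, npv_low = predictive_values(0.90, 0.90, prevalence=0.01)    # PPV about 0.08
ppv_high, npv_high = predictive_values(0.90, 0.90, prevalence=0.30)  # PPV about 0.79

# Hypothetical biomarker: group means of 8 and 7 units, assumed SD of 2.
high_risk = NormalDist(mu=8.0, sigma=2.0)
low_risk = NormalDist(mu=7.0, sigma=2.0)
n_per_group = 100  # assumed sample size per group

# Shared area under the two score distributions (0 = none, 1 = identical).
overlap = high_risk.overlap(low_risk)

# Two-sided z-test for the difference in means, SDs treated as known.
standard_error = sqrt(high_risk.variance / n_per_group + low_risk.variance / n_per_group)
z = (high_risk.mean - low_risk.mean) / standard_error
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"PPV at 1% prevalence:  {ppv_low:.3f}")
print(f"PPV at 30% prevalence: {ppv_high:.3f}")
print(f"overlap of score distributions: {overlap:.2f}")
print(f"p value for difference in means: {p_value:.4f}")
```

Under these assumptions the p value is well below .05 even though roughly four-fifths of the two score distributions coincide, so a single patient's score says little about which group that patient belongs to.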