This summer, I've been fortunate to have a hand in two publications. I had a major writing role in the Personalized Medicine Coalition's policy paper on diagnostics reimbursement. The final paper, I should add, is a PMC work product, and many shared in the final comments and editing. It's on the PMC website, HERE.
In August 2014, with co-author Felix Frueh, PhD, I was able to publish a "New View of Clinical Utility" - our attempt to help out an area that has been contentious for a long time. Available online HERE. More on clinical utility after the break.
What's the story behind "Molecular Diagnostics Clinical Utility Strategy: A Six Part Framework"?
Frueh and I were both frustrated that the triad of "analytical validity, clinical validity, clinical utility" is OK as far as it goes, but it can't seriously claim to be a full "framework" or "decision path" or set of "decision criteria" for communications between diagnostics developers and payers. Frueh suggested in 2013 that there are six key questions that provide about the right degree of granularity - 1 question is too few, 40 questions would be too many.
1. Who should be tested and under what circumstances?
2. What does the test tell us that we did not know without it?
3. Can we act on the information provided by the test?
4. Does the outcome change, in a way we find value in, relative to the outcomes obtained without the test?
5. Will we act on the information provided by the test?
6. If the test is employed, can we afford it? (Is it a reasonable value?)
The six questions can be represented on a flow diagram:
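As an illustration (my own sketch, not part of the paper), the six questions can also be read as sequential gates that a test dossier must pass - stalling at any one gate stalls the whole argument:

```python
# A minimal sketch (illustrative, not from the Frueh & Quinn paper) treating
# the six questions as sequential gates. The function names are my own.

SIX_QUESTIONS = [
    "1. Who should be tested, and under what circumstances?",
    "2. What does the test tell us that we did not know without it?",
    "3. Can we act on the information provided by the test?",
    "4. Does the outcome change, in a way we value, versus no test?",
    "5. Will we act on the information provided by the test?",
    "6. If the test is employed, can we afford it (is it a reasonable value)?",
]

def evaluate(answers):
    """answers: six booleans; return the first failed question, or None."""
    for question, passed in zip(SIX_QUESTIONS, answers):
        if not passed:
            return question   # the dossier stalls at this gate
    return None               # all six gates passed

# Example: a test that predicts outcomes (Q2) but drives no action (Q3).
print(evaluate([True, True, False, True, True, True]))
```

The point of the sequential structure is the same one the paper's flow diagram makes: a strong answer to an early question is worthless unless it drives an arrow to the next question.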
Three of the questions are explicitly comparative - (2) improvement in clinical validity; (4) improvement in outcomes; and (6) value. We argue that, much as is required in a pharmacoeconomics paper, the comparative answers to Questions 2, 4, and 6 should clearly discuss (I) the comparator, (II) the units of comparison, and (III) the uncertainty. Sounds obvious - but this approach quickly reveals a lot of pitfalls in discussing diagnostic tests.
For example, let's say we have a better prognostic test for Disease X. We clarify that we mean: (I) it's being compared to tumor grade, (II) the units of comparison are "p" values, and (III) the statistical uncertainty is small. OK - you've stated correctly that the old-fashioned tumor grade correlates with outcomes (p=.05) and your brand-new test correlates at (p=.01). But stop and look at the chart above. How will that answer to Question 2 drive the arrow which represents the answer to Question 3? Not clear. Let's try again. This time, your new test is still compared to tumor grade, as before, but we find that 20% of cases will be reclassified as very low grade (the units of comparison are now reclassification), and you have validated the test as more accurate in two or three studies. Now the reader can start to believe that the improved clinical validity will drive some actions (Question 3 and the following questions), since it's well known that very low grade tumors lead treatment down different management pathways.
Taking one more crack at Question 2: if the answer to Question 2 is "the test provides additional information" or "independent information" - if that's your final answer to Question 2 - such an answer doesn't clearly drive the reader's brain forward to patient impacts, via Question 3 and the later questions.
We also take some time to discuss "uncertainty" - a topic for FDA regulators and other authors as well. I discussed the FDA's approach to uncertainty in a prior blog (here). We suggest that with molecular tests, there are frequently three kinds of uncertainty.
(I) Purely statistical uncertainty - Survival is increased by 6 months... plus or minus 1.3 months, p = .03.
(II) Pragmatic uncertainty - For example, will a population outside the trial population behave the same? Or, I know the p value is .03, but is one trial really enough? Does a trial in academia apply to family practice? And so on.
(III) Conceptual uncertainty - The various uncertainties that often still remain, and that are neither statistical nor pragmatic. For example, is the test "really prognostic" or "really predictive"? Are we really sure that a mutation found in 20% of the tumor cells is the right mutation to treat with a targeted drug?
In therapeutic trials, everyone is very familiar with outcomes validated with p-values (statistical uncertainty) and concerns about external validity (pragmatic uncertainty). We think molecular tests often raise still other types of concerns, which is why we leave a third category of uncertainty in the framework. The developer should think through what these are likely to be for the evaluator or the payer, and could perhaps address them in a Q&A section. ("How is one sure our test is predictive and not just prognostic? We know this because of data X, Y, and Z.")
We are confident that "uncertainty" is the best way to characterize many of the disagreements between test developers and payers. When payers say "you don't have enough clinical utility," they do understand your claim (say, that 20% of unnecessary surgeries would be avoided) but they don't believe it, because they feel too much uncertainty.
Frueh and I also provided some sections that discuss "context" and the issue of evidence being "fit for purpose," which stakeholders often find contentious. For example, we familiarly class tests as "screening" or "diagnostic" or "monitoring" tests, and so on. But when we introduce a new test, it probably also has a distinct value proposition in addition to belonging to a class like "monitoring" tests. The new test might be "faster and cheaper," or "more accurate," or a "reflex test, resolving a prior ambiguous test," and so on. This second value proposition matters, too. Policymakers might require colossal public health trials before endorsing an entirely new population screening test; but later, if a new test version is exactly as accurate but faster and cheaper, it can be swapped in more easily, even though it is a "screening test," which normally requires a colossal trial.
We also discuss that "outcomes" for diagnostic tests can be diverse - they might be the results of a drug choice that causes progression free survival; or overall survival; or might lead to an unnecessary drug or surgery avoided, and so on. We felt most outcomes could be put in three categories, as shown below.
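To make that three-way grouping concrete, here's a minimal sketch; the column labels and example outcomes are my own paraphrase, inferred from the examples in the text rather than taken from the paper's figure:

```python
# Hypothetical grouping of diagnostic-test outcomes into three columns.
# Labels and members are illustrative guesses, not the paper's actual chart.

OUTCOME_COLUMNS = {
    "left: intermediate clinical outcomes": {
        "progression-free survival (on imaging)",
        "unnecessary surgery avoided",
    },
    "middle: definitive clinical outcomes": {
        "overall survival",
    },
    "right: personal utility / value of knowing": {
        "family planning",
        "advance knowledge of prognosis",
    },
}

def column_for(outcome):
    """Return the column a given outcome falls in, or None if unlisted."""
    for column, members in OUTCOME_COLUMNS.items():
        if outcome in members:
            return column
    return None

print(column_for("overall survival"))
```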
To play further with the three columns, consider a test type with a lot of data - circulating tumor cells. These tests may have data in all three of the columns: an oncology circulating tumor cell test may be correlated with progression-free survival on imaging (left column), correlated with overall survival (middle column), and allow better family planning (right column). So why is payer coverage lagging for CTC tests?
By itself, CTC test information that better predicts PFS or OS (Question 2) doesn't change the PFS or OS outcome, so that information isn't a delta in the outcome - and Question 4 is the comparative outcome. A correlation may tell you what OS will be, but it doesn't change OS, so it doesn't move the needle on Question 4.
(Now, a CTC test might not just predict PFS or OS, but also change them - but only by bringing in a second argument: that the new information leads to different and better therapeutic decisions. That adds a story that fits Question 3, which connects Question 2 to Question 4. But payers will want to know the type and scale of those decisions in real life. Payers won't respond to vague suggestions that, with the new information, "stuff could happen" at the Question 3 step, leading to clinical improvements at the Question 4 step.)
Prognostic tests can give information that jumps directly to the right column of the three-column outcome chart, such as "value of knowing." The test, in itself, provides advance information that is new for the patient herself, on her survival, which may allow family planning and fit the "value of knowing" category too. But our collective experience suggests that this type of improvement in right-column outcomes (a delta in the value-of-knowing outcome or the ability-to-plan outcome) needs a lot of follow-up to be convincing to payers.
The fact that there are so many types of diagnostic tests (from screening, to diagnostic, to monitoring, etc.), so many additional new-test value propositions (faster, cheaper, more accurate, etc.), and so many possible and wildly diverse outcomes (survival, surgeries avoided, or less pain) suggests that it would be very difficult to make any one-size-fits-all, simple and locked-down "Criteria for Evaluating a Diagnostic Test." It's a fantasy, a Shangri-La, a pot of gold at the end of the rainbow. Two different tests could have completely different pathways across the three columns below, making a short, one-size-fits-all evaluation solution a pipe dream.
Looking at the chart above, if you have 6 test types, 7 value propositions, and maybe 10-20 possible common outcome points, that's 400-plus pathways through the table - and that's even before considering test- and disease-specific circumstances and competitors for the management pathway.
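The back-of-envelope arithmetic behind that count is simple multiplication of the category counts stated in the text:

```python
# Back-of-envelope pathway count: test types x value propositions x outcomes.
test_types  = 6    # screening, diagnostic, monitoring, ...
value_props = 7    # faster, cheaper, more accurate, reflex, ...
outcomes_lo = 10   # low estimate of common outcome points
outcomes_hi = 20   # high estimate

pathways_lo = test_types * value_props * outcomes_lo   # 420
pathways_hi = test_types * value_props * outcomes_hi   # 840
print(f"{pathways_lo} to {pathways_hi} possible pathways")
```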
Coming full circle: in recognition of this complexity, we hope the Six Questions will be helpful for guiding communications in many circumstances. "We agree with your statements and positions on Questions 1, 2, 3, and 5, but we have the following clearly stated, specific concerns about aspects of Questions 4 and 6" -- for example. That's better than hearing, "You don't have enough clinical utility, so do more clinical utility."
Although it doesn't appear in the paper, in giving talks on this topic, I've begun suggesting that the developer should be prepared for a reasonable skeptic. This is keyed off the actual statutory language for FDA approvals - that the data provides a reasonable assurance, to an expert, that the benefit is really there. Not an absolute assurance - a reasonable assurance. So similarly the diagnostic test evaluator should be a reasonable skeptic, not a skeptic of the type who can idly generate limitless and distant hypothetical objections.
I'd summarize the paper this way:
The idea is that you can fire-test what you have before it ever gets to a technology assessor or a payer. Let's take Question 2, the delta in clinical validity. A company says the improvement with the new test is "additional information." The framework leads the author of the dossier to be asked: What's the comparator? He answers, "I guess, what the physician had before, whatever it was." What are the units? He answers, "More. More information." What's the uncertainty of that delta in "more" information? He answers, "Huh? We said it's 'more.' I guess it's also... new information. Does that help?" Small wonder the technology assessor or payer is drumming his fingers and looking at his watch. The Six Question Framework provides a way of describing and arguing for the test that should work whether the payer has heard of the Six Question Framework or not. This is because the population and indication have been clearly described, and the gain in clinical validity has been described clearly, in a way that drives forward to the subsequent questions. There is something we can do differently (Question 3), which has some benefit that can be well and clearly anticipated (Question 4). Will we do it? (Will patients really avoid surgery?) - Question 5. Is it a reasonable value? (Someone will always be thinking this.) - Question 6.
See also my earlier blog on clinical utility (here).
For PowerPoints and an online video course based on the Frueh & Quinn publication, see here.
For a comparison of the Six Question Figure to an FDA risk benefit analysis, see here.