At the outset, let me say that I’m skeptical whether we can hold the PACE investigators responsible for the outrageous headlines that have been slapped on their follow-up study and on the comments they have made in interviews.
The Telegraph screamed:
Chronic Fatigue Syndrome sufferers ‘can overcome symptoms of ME with positive thinking and exercise’
Oxford University has found ME is not actually a chronic illness
My own experience critiquing media interpretation of scientific studies suggests that neither researchers nor even journalists necessarily control shockingly inaccurate headlines placed on otherwise unexceptional media coverage. On the other hand, much distorted and exaggerated media coverage starts with statements made by researchers and by press releases from their institutions.
The one specific quote attributed to a PACE investigator is unfortunate because of its potential to be misinterpreted by professionals, by persons who suffer from chronic fatigue syndrome, and by the people around them who are affected by their functioning.
It’s wrong to say people don’t want to get better, but they get locked into a pattern and their life constricts around what they can do. If you live within your limits that becomes a self-fulfilling prophesy.
It suggests that willfulness causes CFS sufferers’ impaired functioning. This is as ridiculous as the application of the discredited concept of fighting spirit to cancer patients’ failure to triumph over their life-altering and life-threatening condition. Let’s practice the principle of charity and assume this is not the intention of the PACE investigator, particularly when there is so much more for which we should hold them responsible.
Go here for a fuller evaluation of the Telegraph coverage of the PACE follow-up study, one that I endorse.
Having read the PACE follow-up study carefully, my assessment is that the data presented are uninterpretable. Even if we temporarily suspend critical thinking and some basic rules for conducting randomized controlled trials (RCTs), follow-up studies, and analyses of the resulting data, we should still reject some of the interpretations offered by the PACE investigators as unfairly spun to fit what is already a distorted positive interpretation of the results.
It is important to note that the PACE follow-up study can only be as good as the original data it’s based on. And in the case of the PACE study itself, a recent longread critique by UC Berkeley journalism and public health lecturer David Tuller has arguably exposed such indefensible flaws that any follow-up is essentially meaningless. See it for yourself [1, 2, 3].
This week’s report of the PACE long term follow-up study and a commentary are available free at the Lancet Psychiatry website after a free registration. I encourage everyone to download a copy before reading further. Unfortunately, some crucial details of the article are highly technical and some details crucial to interpreting the results are not presented.
I will provide practical interpretations of the most crucial technical details so that they are more understandable to the nonspecialist. Let me know where I fail.
To encourage proceeding with this longread, but to satisfy those who are unwilling or unable to proceed, I’ll reveal my main points:
The PACE investigators sacrificed any possibility of meaningful long-term follow-up by breaking protocol and issuing patient testimonials about CBT before accrual was even completed.
This already fatal flaw was compounded with a loose recommendation for treatment after the intervention phase of the trial ended. The investigators provide poor documentation of which treatment was taken up by which patients and whether there was crossover in the treatment being received during follow up.
Investigators’ attempts to correct methodological issues with statistical strategies lapse into voodoo statistics.
The primary outcome self-report variables are susceptible to manipulation, investigator preferences for particular treatments, peer pressure, and confounding with mental health variables.
The PACE investigators exploited ambiguities in the design and execution of their trial with self-congratulatory, confirmatory bias.
The Lancet Psychiatry summary/abstract of the article:
Background: The PACE trial found that, when added to specialist medical care (SMC), cognitive behavioural therapy (CBT), or graded exercise therapy (GET) were superior to adaptive pacing therapy (APT) or SMC alone in improving fatigue and physical functioning in people with chronic fatigue syndrome 1 year after randomisation. In this pre-specified follow-up study, we aimed to assess additional treatments received after the trial and investigate long-term outcomes (at least 2 years after randomisation) within and between original treatment groups in those originally included in the PACE trial.
Findings: Between May 8, 2008, and April 26, 2011, 481 (75%) participants from the PACE trial returned questionnaires. Median time from randomisation to return of long-term follow-up assessment was 31 months (IQR 30–32; range 24–53). 210 (44%) participants received additional treatment (mostly CBT or GET) after the trial; with participants originally assigned to SMC alone (73 [63%] of 115) or APT (60 [50%] of 119) more likely to seek treatment than those originally assigned to GET (41 [32%] of 127) or CBT (36 [31%] of 118; p<0·0001). Improvements in fatigue and physical functioning reported by participants originally assigned to CBT and GET were maintained (within-group comparison of fatigue and physical functioning, respectively, at long-term follow-up as compared with 1 year: CBT –2·2 [95% CI –3·7 to –0·6], 3·3 [0·02 to 6·7]; GET –1·3 [–2·7 to 0·1], 0·5 [–2·7 to 3·6]). Participants allocated to APT and to SMC alone in the trial improved over the follow-up period compared with 1 year (fatigue and physical functioning, respectively: APT –3·0 [–4·4 to –1·6], 8·5 [4·5 to 12·5]; SMC –3·9 [–5·3 to –2·6], 7·1 [4·0 to 10·3]). There was little evidence of differences in outcomes between the randomised treatment groups at long-term follow-up.
Interpretation: The beneficial effects of CBT and GET seen at 1 year were maintained at long-term follow-up a median of 2·5 years after randomisation. Outcomes with SMC alone or APT improved from the 1 year outcome and were similar to CBT and GET at long-term follow-up, but these data should be interpreted in the context of additional therapies having being given according to physician choice and patient preference after the 1 year trial final assessment. Future research should identify predictors of response to CBT and GET and also develop better treatments for those who respond to neither.
Note the contradiction here, which will persist throughout the paper, the official Oxford University press release, quotes from the PACE investigators to the media, and media coverage. On the one hand we are told:
Improvements in fatigue and physical functioning reported by participants originally assigned to CBT and GET were maintained…
Yet we are also told:
There was little evidence of differences in outcomes between the randomised treatment groups at long-term follow-up.
Which statement is to be given precedence? To the extent that features of a randomized trial have been preserved in the follow-up (which, as we will see, is not actually the case), a lack of between-group differences at follow-up should be given precedence over any persistence of change within groups from baseline. That is not a controversial point for interpreting clinical trials.
A statement about group differences at follow-up should precede and qualify any statement about within-group change over follow-up. Otherwise why bother with an RCT in the first place?
The statement in the Interpretation section of the summary/abstract has an unsubstantiated spin in favor of the investigators’ preferred intervention.
Outcomes with SMC alone or APT improved from the 1 year outcome and were similar to CBT and GET at long-term follow-up, but these data should be interpreted in the context of additional therapies having being given according to physician choice and patient preference after the 1 year trial final assessment.
If we’re going to be cautious and qualified in our statements, there are lots of other explanations for similar outcomes in the intervention and control groups that are more plausible. Simply put and without unsubstantiated assumptions, any group differences observed earlier have dissipated. Any advantages of CBT and GET are not sustained.
How the PACE investigators destroyed the possibility of an interpretable follow-up study
Neither the Lancet Psychiatry article nor any recent statements by the PACE investigators acknowledged how these investigators destroyed any possibility of analyses of meaningful follow-up data.
Before the intervention phase of the trial was even completed, even before accrual of patients was complete, the investigators published a newsletter in December 2008 directed at trial participants. An article appropriately reminds participants of the upcoming two-and-a-half-year follow-up. It then acknowledges difficulty accruing patients, but notes that additional funding has been received from the MRC to extend recruiting. And then glowing testimonials about the effects of their intervention appear on p. 3 of the newsletter.
“Being included in this trial has helped me tremendously. (The treatment) is now a way of life for me, I can’t imagine functioning fully without it. I have nothing but praise and thanks for everyone involved in this trial.”
“I really enjoyed being a part of the PACE Trial. It helped me to learn more about myself, especially (treatment), and control factors in my life that were damaging. It is difficult for me to gauge just how effective the treatment was because 2007 was a particularly strained, strange and difficult year for me but I feel I survived and that the trial armed me with the necessary aids to get me through. It was also hugely beneficial being part of something where people understand the symptoms and illness and I really enjoyed this aspect.”
These testimonials are a horrible breach of protocol. Taken together with the acknowledgment of the difficulty accruing patients, the testimonials solicit expressions of gratitude and apply pressure on participants to endorse the trial by providing a positive account of their outcome. Some minimal effort is made to disguise the conditions from which the testimonials come. However, references to a therapist and, in the final quote above, to “control factors in my life that were damaging” leave no doubt that the CBT and GET favored by the investigators is having positive results.
Probably more than in most chronic illnesses, CFS sufferers turn to each other for support in the face of bewildering and often stigmatizing responses from the medical community. These testimonials represent a form of peer pressure for positive evaluations of the trial.
Any investigator group that would deliberately violate protocol in this manner deserves further scrutiny for other violations and threats to the validity of their results. I challenge defenders of the PACE study to cite other precedents for this kind of manipulation of clinical trials participants. What would they have thought if a drug company had done this for the evaluation of their medication?
The breakdown of randomization as further destruction of the interpretability of follow-up results
Returning to the Lancet Psychiatry article itself, note the following:
After completing their final trial outcome assessment, trial participants were offered an additional PACE therapy if they were still unwell, they wanted more treatment, and their PACE trial doctor agreed this was appropriate. The choice of treatment offered (APT, CBT, or GET) was made by the patient’s doctor, taking into account both the patient’s preference and their own opinion of which would be most beneficial. These choices were made with knowledge of the individual patient’s treatment allocation and outcome, but before the overall trial findings were known. Interventions were based on the trial manuals, but could be adapted to the patient’s needs.
Following completion of the treatment to which particular patients were randomly assigned, the PACE trial offered a complex negotiation between patient and trial physician about further treatment. This represents a thorough breakdown of the benefits of a controlled randomized trial for the evaluation of treatments. Any focus on the long-term effects of initial randomization is sacrificed by what could be substantial departures from that randomization. Any attempts at statistical corrections will fail.
Of course, investigators cannot ethically prevent research participants from seeking additional treatment. But in the case of PACE, the investigators encouraged departures from the randomized treatment yet did not adequately take into account the decisions that were made. An alternative would have been to continue with the randomized treatment, taking into account and quantifying any crossover into another treatment arm.
Voodoo statistics in dealing with incomplete follow-up data
Between May 8, 2008, and April 26, 2011, 481 (75%) participants from the PACE trial returned questionnaires.
This is a very good rate of retention of participants for follow-up. The serious problem is that none of the following can be assumed to be random:
loss to follow-up,
whether there was further treatment, or
whether there was crossover in the treatment received in follow-up versus the actual trial.
Furthermore, any follow-up data is biased by the exhortation of the newsletter.
No statistical controls can restore the quality of the follow-up data to what would have been obtained with preservation of the initial randomization. Nothing can correct for the exhortation.
Nonetheless, the investigators tried to correct for loss of participants to follow-up and subsequent treatment. They described their effort in a technically complex passage, which I will subsequently interpret:
We assessed the differences in the measured outcomes between the original randomised treatment groups with linear mixed-effects regression models with the 12, 24, and 52 week, and long-term follow-up measures of outcomes as dependent variables and random intercepts and slopes over time to account for repeated measures.
We included the following covariates in the models: treatment group, trial stratification variables (trial centre and whether participants met the international chronic fatigue syndrome criteria,3 London myalgic encephalomyelitis criteria,4 and DSM IV depressive disorder criteria),18,19 time from original trial randomisation, time by treatment group interaction term, long-term follow-up data by treatment group interaction term, baseline values of the outcome, and missing data predictors (sex, education level, body-mass index, and patient self-help organisation membership), so the differences between groups obtained were adjusted for these variables.
Nearly half (44%; 210 of 479) of all the follow-up study participants reported receiving additional trial treatments after their final 1 year outcome assessment (table 2; appendix p 2). The number of participants who received additional therapy differed between the original treatment groups, with more participants who were originally assigned to SMC alone (73 [63%] of 115) or to APT (60 [50%] of 119) receiving additional therapy than those assigned to GET (41 [32%] of 127) or CBT (36 [31%] of 118; p<0·0001).
In the trial analysis plan we defined an adequate number of therapy sessions as ten of a maximum possible of 15. Although many participants in the follow-up study had received additional treatment, few reported receiving this amount (table 2). Most of the additional treatment that was delivered to this level was either CBT or GET.
The “linear mixed-effects regression models” are rather standard techniques for compensating for missing data by using all of the available data to estimate what is missing. The problem is that this approach assumes that any missing data are random, which is an untested assumption that is unlikely to be true in this study.
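To see why this assumption matters, here is a minimal simulation sketch in Python (hypothetical numbers, not PACE data): when patients with worse outcomes are less likely to return their questionnaires, an analysis restricted to the returned data understates how poorly the group is actually doing.

```python
import random
import statistics

# Hypothetical illustration: data missing NOT at random.
random.seed(42)

# Simulated fatigue scores for 500 patients (higher = worse).
true_scores = [random.gauss(20, 5) for _ in range(500)]

def returns_questionnaire(score):
    # Patients with worse fatigue are less likely to return the
    # follow-up questionnaire.
    return random.random() < 1.0 / (1.0 + 0.2 * max(0.0, score - 20.0))

observed = [s for s in true_scores if returns_questionnaire(s)]

full_mean = statistics.mean(true_scores)
observed_mean = statistics.mean(observed)

# The observed mean understates fatigue because dropout tracked severity.
print(f"true mean fatigue:     {full_mean:.1f}")
print(f"observed mean fatigue: {observed_mean:.1f}")
```

No statistical model can recover the full-sample mean from the observed data alone unless it correctly models why the data are missing, which is precisely what cannot be verified here.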
The inclusion of “covariates” is an effort to control for possible threats to the validity of the overall analyses by taking into account what is known about participants. There are numerous problems here. We can’t be assured that the results are any more robust and reliable than what would be obtained without these efforts at statistical control. The best publishing practice is to make the unadjusted outcome variables available and let readers decide. Greatest confidence in results is obtained when there is no difference between the results in the adjusted and unadjusted analyses.
The effectiveness of statistical controls depends on certain assumptions being met about patterns of variation within the control variables. There is no indication that any diagnostic analyses were done to determine whether possible candidate control variables should be eliminated in order to avoid a violation of assumptions about the multivariate distribution of covariates. With so many control variables, spurious results are likely. Apparent results could change radically with the arbitrary addition or subtraction of control variables. See here for a further explanation of this problem.
We don’t even know how this set of covariate/control variables, rather than some other set, was established. Notoriously, investigators often try out various combinations of control variables and present only those that make their trial look best. Readers are protected from this questionable research practice only with pre-specification of analyses before investigators know their results—and in an unblinded trial, researchers often know the result trends long before they see the actual numbers.
See JP Simmons’ hilarious demonstration that briefly listening to the Beatles’ “When I’m Sixty-Four” can leave research participants a year and a half younger than listening to “Kalimba” – at least when investigators have free rein to obtain the results they want in a study without pre-registration of analytic plans.
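The same point can be made with a minimal simulation sketch (a deliberate simplification: each covariate specification is modeled as an independent test of a true null hypothesis). With 15 analytic specifications to choose from, the chance of at least one nominally “significant” result when there is no effect at all climbs above 50%.

```python
import math
import random

# Hypothetical illustration of "try many analyses, report the best".
random.seed(7)

def null_test_p(n=50):
    # Two arms drawn from the SAME distribution: any "effect" is noise.
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    ma, mb = sum(a) / n, sum(b) / n
    va = sum((x - ma) ** 2 for x in a) / (n - 1)
    vb = sum((x - mb) ** 2 for x in b) / (n - 1)
    z = (ma - mb) / math.sqrt(va / n + vb / n)
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approx.

n_specifications = 15  # analytic choices tried per simulated "trial"
n_trials = 1000

false_positives = sum(
    1 for _ in range(n_trials)
    if min(null_test_p() for _ in range(n_specifications)) < 0.05
)
rate = false_positives / n_trials  # expected near 1 - 0.95**15, i.e. > 0.5
print(f"false-positive rate with {n_specifications} specifications: {rate:.2f}")
```

Real covariate specifications are correlated rather than independent, so the inflation is smaller in practice, but the direction of the problem is the same: flexibility without pre-registration manufactures significance.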
Finally, the efficacy of complex statistical controls is widely overestimated and depends on unrealistic assumptions. First, it is assumed that all relevant variables that need to be controlled have been identified. Second, even when this unrealistic assumption has been met, it is assumed that all statistical control variables have been measured without error. When that is not the case, results can appear significant when they actually are not. See a classic paper by Andrew Phillips and George Davey Smith for further explanation of the problem of measurement error producing spurious findings.
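A minimal simulation sketch of the Phillips and Davey Smith problem (hypothetical data, not PACE): a confounder drives both treatment and outcome, the true treatment effect is exactly zero, and the confounder is measured with error. Adjusting for the error-prone measurement leaves a substantial spurious “adjusted” effect; adjusting for the perfectly measured confounder removes it.

```python
import random

# Hypothetical illustration: residual confounding from measurement error.
random.seed(1)

def slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den

def residuals(y, x):
    """Part of y not explained by a simple regression on x."""
    b = slope(x, y)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return [yi - my - b * (xi - mx) for yi, xi in zip(y, x)]

n = 5000
confounder = [random.gauss(0, 1) for _ in range(n)]
treatment = [c + random.gauss(0, 1) for c in confounder]  # driven by confounder
outcome = [c + random.gauss(0, 1) for c in confounder]    # also driven by it
measured = [c + random.gauss(0, 1) for c in confounder]   # error-prone measure

# "Adjusted" effect of treatment on outcome after controlling for the covariate.
adj_noisy = slope(residuals(treatment, measured), residuals(outcome, measured))
adj_perfect = slope(residuals(treatment, confounder), residuals(outcome, confounder))

print(f"adjusted effect, error-prone covariate: {adj_noisy:.2f}")   # clearly nonzero
print(f"adjusted effect, true confounder:       {adj_perfect:.2f}")  # near zero
```

The adjusted model confidently reports an effect that does not exist, purely because the control variable was measured imperfectly—exactly the situation with self-reported covariates in an unblinded trial.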
What the investigators claim the study shows
In an intact clinical trial, investigators can analyze outcome data with and without adjustments and readers can decide which to emphasize. However, this is far from an intact clinical trial and these results are not interpretable.
The investigators nonetheless make the following claims in addition to what was said in the summary/abstract.
In the results the investigators state
The improvements in fatigue and physical functioning reported by participants allocated to CBT or GET at their 1 year trial outcome assessment were sustained.
This was followed by
The improvements in impairment in daily activities and in perceived change in overall health seen at 1 year with these treatments were also sustained for those who received GET and CBT (appendix p 4). Participants originally allocated to APT reported further improvements in fatigue, physical functioning, and impairment in daily activities from the 1 year trial outcome assessment to long-term follow-up, as did those allocated to SMC alone (who also reported further improvements in perceived change in overall health; figure 2; table 3; appendix p 4).
If the investigators are taking their RCT design seriously, they should give precedence to the null findings for group differences at follow-up. They should not be emphasizing the sustaining of benefits within the GET and CBT groups.
The investigators increase their positive spin on the trial in the opening sentence of the Discussion:
The main finding of this long-term follow-up study of the PACE trial participants is that the beneficial effects of the rehabilitative CBT and GET therapies on fatigue and physical functioning observed at the final 1 year outcome of the trial were maintained at long-term follow-up 2·5 years from randomisation.
This is incorrect. The main, a priori finding is that any reported advantages of CBT and GET at the end of the trial were lost by long-term follow up. Because an RCT is designed to focus on between group differences, the statement about sustaining of benefits is post-hoc.
The Discussion further states
In so far as the need to seek additional treatment is a marker of continuing illness, these findings support the superiority of CBT and GET as treatments for chronic fatigue syndrome.
This makes unwarranted and self-serving assumptions that treatment choice was mainly driven by the need for further treatment, when decision-making was contaminated by investigator preference, as expressed in the newsletter. Note also that CBT is a novel treatment for research participants and more likely to be chosen on the basis of novelty alone, in the face of overall modest improvement rates for the trial and a lack of improvement in objective measures. Whether or not the investigators designate a limited range of self-report measures as primary, participant decision-making may be driven by other, more objective measures.
Regardless, investigators have yet to present any data concerning how decisions for further treatment were made, if such data exist.
The investigators further congratulate themselves with:
There was some evidence from an exploratory analysis that improvement after the 1 year trial final outcome was not associated with receipt of additional treatment with CBT or GET, given according to need. However this finding must be interpreted with caution because it was a post-hoc subgroup analysis that does not allow the separation of patient and treatment factors that random allocation provides.
However, why is this analysis singled out as exploratory and to be interpreted with caution because it is a post-hoc subgroup analysis, when similarly post-hoc subgroup analyses are presented elsewhere without such caution?
The investigators finally get around to depicting what should be their primary finding, but do so in a dismissive fashion.
Between the original groups, few differences in outcomes were seen at long-term follow-up. This convergence in outcomes reflects the observed improvement in those originally allocated to SMC and APT, the possible reasons for which are listed above.
The discussion then discloses a limitation of the study that should have informed earlier presentation and discussion of results:
First, participant response was incomplete; some outcome data were missing. If these data were not missing at random it could have led to either overestimates or underestimates of the actual differences between the groups.
This minimizes the implausibility of the assumption that data are missing at random, as well as the problems introduced by the complex attempts to control confounds statistically.
And then there is an unsubstantiated statement that is sure to upset persons who suffer from CFS and those who care for them.
The outcomes were all self-rated, although these are arguably the most pertinent measures in a condition that is defined by symptoms.
I could double the length of this already lengthy blog post if I fully discussed this. But let me raise a few issues.
The self-report measures do not necessarily capture subjective experience, only forced choice responses to a limited set of statements.
One of the two outcome measures, the physical health scale of the SF-36, requires forced-choice responses to a limited set of statements selected for general utility across all mental and physical conditions. Despite its wide use, the SF-36 suffers from problems of internal consistency and confounding with mental health variables. Anyone inclined to get excited about it should examine its items and response options closely. Ask yourself: do differences in scores reliably capture clinically and personally significant changes in the experience and functioning associated with the full range of symptoms of CFS?
The validity of the other primary outcome measure, the Chalder Fatigue Scale, depends heavily on research conducted by this investigator group, and the scale has inadequate validation of its sensitivity to change in objective measures of functioning.
Such self-report measures are inexorably confounded with morale and nonspecific mental health symptoms, carrying a large, unwanted correlation with the tendency to endorse negative self-statements that is not necessarily related to objective measures.
Although it was a long time ago, I recall well my first meeting with Professor Simon Wessely. It was at a closed retreat sponsored by NIH to develop a consensus about the assessment of fatigue by self-report questionnaire. I listened to a lot of nonsense that was not well thought out. Then, I presented slides demonstrating a history of failed attempts to distinguish somatic complaints from mental health symptoms by self-report. Much later, this would become my “Stalking bears, finding bear scat in the woods” slide show.
But then Professor Wessely arrived at the meeting late, claiming to be grumbly because of jet lag and flight delays. Without slides and with devastating humor, he upstaged me in completing the demolition of any illusions that we could create more refined self-report measures of fatigue.
I wonder what he would say now.
But alas, people who suffer from CFS have to contend with a lot more than fatigue. Just ask them.
[To be continued later if there is interest in my doing so. If there is, I will discuss the disappearance of objective measures of functioning from the PACE study and you will find out why you should find some 3-D glasses if you are going to search for reports of these outcomes.]