A study in a news feed recently piqued my curiosity. Yoga could reduce depression symptoms, researchers said – but only if you expected it to (Uebelacker 2018).
Yoga for depression, it turns out, is pretty typical of lots of health claims: an important topic, lots of hype. Here’s a classic example. Last year, Time magazine declared, “It’s Official: Yoga Helps Depression”, after a very small trial of yoga for people with major depression was published (Streeter 2017).
It’s easy to think of reasons both for why yoga might work for depression, and why it mightn’t. It’s a good example, I think, for looking at some basics for critical reading of randomized clinical trials. So let’s roll up our sleeves and get going!
1. Start by getting a good picture in your mind of what happened, from the participants’ point of view.
Gaining and then keeping perspective on the basics of the trial is a critical starting point. Otherwise, it’s really easy to get lost in technical details, or to focus on what the researchers want to draw your attention to instead of what’s important.
There are two major aspects of validity with a clinical trial:
internal validity – how scientifically sound the experiment was, and
external validity – whether the results for these participants could be expected by others.
Picturing who could get into the trial, and what would happen to participants in each comparison group – as if you were one of them – could give you some quick insight into both kinds of validity.
That can help you understand biases like selection bias – where the results are shaped from the start by who gets into the trial. And potential confounding variables can stick out – other things, besides the intervention being studied, that could account for the results people experienced.
Inherent limitations of a trial are sometimes obvious. For example, Streeter 2017, the trial that Time reported on, only randomized 32 people, and had no non-yoga comparison group.
The recent Uebelacker paper wasn’t the actual trial report. That was published in 2017. It found yoga made no difference to depression symptoms. The 2018 paper uses the trial’s data to analyze a question that wasn’t originally planned. That’s called a post-hoc or secondary analysis, and when it wasn’t planned, you can’t put as much weight on it. It’s safer to think of these analyses as finding new questions, not answers. I’ve written an explainer on this at Statistically Funny. The bottom line? Be careful of researchers on fishing expeditions.
2. Check if the trial seems to be properly randomized and you can see what happened to everyone by the end.
It’s not enough to say that a trial is randomized: some people misuse the term, or they randomize, but not securely.
If randomization has been done well, then the comparison groups should have about the same number of people in them. (An exception here is if a trial was designed to have a comparison group that’s double the size of the others.) And the groups of people should be similar before the intervention in important ways. There should be a table that gives you this baseline “before” data.
In the Streeter trial, for example, the two comparison groups weren’t the same on all counts. The “low dose” group had higher depression scores before the trial started (27.7 versus 24.6).
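To see why well-done randomization tends to produce groups of about the same size, here’s a minimal sketch in Python (group names and numbers are hypothetical, not from either yoga trial) contrasting simple coin-toss randomization with permuted-block randomization, a common scheme trials use to keep arm sizes balanced:

```python
import random

def simple_randomization(n, seed=1):
    # Toss a coin for each participant: group sizes can drift apart.
    rng = random.Random(seed)
    return [rng.choice(["yoga", "control"]) for _ in range(n)]

def block_randomization(n, block_size=4, seed=1):
    # Permuted blocks: within each block of 4, exactly 2 go to each arm,
    # so group sizes stay nearly equal throughout the trial.
    rng = random.Random(seed)
    assignments = []
    while len(assignments) < n:
        block = ["yoga"] * (block_size // 2) + ["control"] * (block_size // 2)
        rng.shuffle(block)
        assignments.extend(block)
    return assignments[:n]

groups = block_randomization(32)
print(groups.count("yoga"), groups.count("control"))  # 16 16
```

Note that balanced numbers don’t guarantee balanced characteristics: in a small trial, chance alone can still leave one group with higher baseline depression scores, which is exactly why the baseline table matters.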
Check the details on how they did the randomization. Allocation concealment is a critical issue. This means that the comparison group a person is destined to join can’t be known by the person entering them into the trial, so that who gets into which group can’t be manipulated. More on this critical issue at Statistically Funny.
Poor allocation concealment is one of the biggest known threats to the validity of randomized trials. It means that the trial is likely to have exaggerated the benefits of health treatments (meta-research on this here and here).
This is a weakness of the Streeter trial of yoga for depression. Here’s what they report:
The randomization numbers and group assignments were kept in sealed envelopes, numbered sequentially, and opened in sequence when a subject was randomized.
That’s a very easy system to fiddle, if you don’t think a person will do well with the intervention, for example. So this trial is at risk of bias here, which seriously limits the reliability of its results. This doesn’t mean that the allocation to groups wasn’t clean: we just can’t know for sure.
On the other hand, the Uebelacker trial authors address this problem, although we don’t know exactly how. They wrote:
Study staff had no way of knowing to which arm the next participant would be randomized.
What is secure? A process where randomization is done independently – for example, ringing a central place to enter the person as a participant first, and only then getting assigned to a group by a number that a computer has generated.
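As a rough illustration (a sketch of the general idea, not any trial’s actual system), independent randomization can be pictured as a central service that generates the whole allocation sequence by computer up front, and reveals each assignment only after a participant has been registered:

```python
import random

class CentralRandomizer:
    """Sketch of central, concealed allocation: the sequence is
    computer-generated up front and revealed one assignment at a
    time, only at the moment of enrollment."""

    def __init__(self, n, seed=None):
        rng = random.Random(seed)
        # Equal allocation to two arms, shuffled by computer.
        sequence = ["yoga"] * (n // 2) + ["control"] * (n - n // 2)
        rng.shuffle(sequence)
        self._sequence = iter(sequence)  # staff can't peek at what's next
        self.log = []

    def enroll(self, participant_id):
        # The arm is revealed only after the participant is registered.
        arm = next(self._sequence)
        self.log.append((participant_id, arm))
        return arm

center = CentralRandomizer(10, seed=42)
first_arm = center.enroll("P001")  # staff learn the assignment only now
```

Because staff only ever learn an assignment after the person is irreversibly enrolled, there is no opportunity to steer particular people into particular groups – which is exactly what a stack of sealed envelopes can’t guarantee.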
Ideally, a randomized trial will include the CONSORT flow diagram, which shows you what happened to everybody. Both the Streeter and Uebelacker trials did this.
Understanding what happened to everyone is important. You need to know if there are big imbalances between the groups – for example, from missing data. If lots of people drop out of a group (or the whole trial) – say 15% or more – then that puts a big question mark over the trial’s results.
It’s possible that the people who dropped out had the same results as those who stayed the course. But people could have disappeared because of adverse effects, say, without reporting them to the researchers. Or it could be a signal that the intervention is too hard to stick with.
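The 15% rule of thumb is simple arithmetic you can do from a trial’s flow diagram. A quick sketch, using made-up numbers rather than data from either yoga trial:

```python
def dropout_rate(randomized, completed):
    # Fraction of randomized participants missing from the final analysis.
    return (randomized - completed) / randomized

# Hypothetical arm sizes: (randomized, completed)
arms = {"yoga": (61, 49), "control": (61, 57)}

for arm, (n_start, n_end) in arms.items():
    rate = dropout_rate(n_start, n_end)
    flag = " <-- question mark" if rate >= 0.15 else ""
    print(f"{arm}: {rate:.0%} dropped out{flag}")
# yoga: 20% dropped out <-- question mark
# control: 7% dropped out
```

Lopsided dropout like this – twelve people missing from one arm, four from the other – is itself a warning sign, even before you ask why they left.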
3. Exactly what is being compared can be a deal breaker, so check if it’s fair and useful.
Everything can pretty much depend on this, can’t it? It’s easier to make an intervention look good if you test it against another intervention that you know is lousy – or avoid ever going head-to-head with the best. That’s called comparator bias.
An example of comparator bias was documented in the 1990s: out of 56 manufacturer-funded trials of NSAIDs for pain relief, nearly half (48%) compared the manufacturer’s own drug to a lower dose of a competitor’s drug.
I’ve already mentioned comparison groups as a study weakness a couple of times, with trials that didn’t have a “no yoga” comparison. It’s not that trials like that aren’t important. Once you know for sure something is effective, there could be other sorts of questions that matter, like how high a dose should be, or what’s the shortest time you can take a drug and still get a benefit.
But if the effectiveness of an intervention hasn’t been decisively established, then it’s important to keep that in mind. Comparing different versions of an intervention, without a group of people having no intervention, a placebo, or a completely different intervention can’t generally prove effectiveness. After all, one version of an ineffective treatment could just be slightly less ineffective than another, couldn’t it?
There is another pretty big genre here: non-inferiority or equivalence trials. This is when the bar in a trial is set only to show that an intervention has similar effects to another one. These trials have some differences from standard trials, which are superiority trials – trials trying to find out which intervention is better. You can find out more about this at Statistically Funny.
4. When it comes to the outcomes, the devil really is in the details of what is being measured, so check carefully.
Don’t jump to conclusions quickly about what an outcome means in a clinical trial. You really have to check the fine print. Just because adverse effects are described as “trivial” for example, doesn’t mean you will think they are! It can be right up there with a doctor saying, “You might feel a little discomfort” just before the searing pain begins!
The second big thing to remember is that just because a trial is “a good trial”, it doesn’t mean the data on every outcome in it is equally solid. Always think of any study as a collection of elements of uneven quality. One measure could be more objective than another; there could be comprehensive data for one, but a large amount of missing data for another. (More on the highs and lows of the “good” study here at Statistically Funny.)
The third big issue to remember is that the outcomes that matter most in a trial are the primary outcomes the researchers pre-specified as the test for whether that intervention worked. While there can be fully legitimate reasons to make a change once a study is underway, most of the time it’s a worry. It’s basically shifting the goal posts after the game has started, to make scoring a goal easier.
Also key: whenever possible, outcomes should be assessed by people who don’t know what comparison group the person was in (blinded outcome assessment). (More on that in the post about allocation concealment and blinding at Statistically Funny.)
Clinical trial outcomes aren’t necessarily chosen to measure what you care about, or what makes sense to you. All sorts of issues come into play in finding ways to see if there is a difference between one intervention and another, in as short and small a trial as possible. That includes things like substitute measures (surrogate outcomes) and ones that combine different outcomes together (composite endpoints). This can end up leaving large uncertainties around the results – or just be hard to interpret back into real life decision making.
5. Find a good systematic review to put the single trial in context.
It’s risky to put a lot of weight on a single trial, without taking the current state of evidence into account. It’s even risky to rely on what the authors of a trial say previous research has shown – they may not have done enough research to know, or they may be quoting studies selectively. Even when they quote a systematic review of the literature, it could be out of date, or not be thorough itself.
Even a systematic review that’s out of date can be helpful for context. It should still cover issues that you need to keep in mind. For example, with yoga, reading a good systematic review would remind you to keep in mind that there isn’t a single, completely standard “yoga”. It would unpack issues like what kind of depression do the participants have, and how that’s measured.
Importantly, when previous trials haven’t already arrived at a pretty strong conclusion, a systematic review should give you an idea of what a trial would need to be like, to shift the balance of evidence.
For example, there’s a systematic review on yoga and depression by Cramer and colleagues, 2013. They found 12 trials, with only 619 people in them altogether – and they judged 9 of them to be at high risk of bias. Of the 3 that were more likely to be reliable, one didn’t have a “no yoga” comparison group. Given the variety of types of yoga, there really was no good answer yet on the effectiveness of yoga for depression.
This means that another small trial isn’t going to make a difference to what we know. A review in 2014 set 100 participants as the minimum number for a worthwhile trial of yoga and depression. The Uebelacker trial that found no effect on depression had 122 participants.
There have been at least a few more trials – but no recent solid systematic review. So the evidence is still at the “mixed results” stage. A powerful trial could tip this in any direction.
Tips to help here: I’ve written a quick guide to judging the quality of a systematic review – and a series of posts on understanding them (start here). You can find systematic reviews by searching the Epistemonikos database and by adding your specific search terms here in PubMed. You can also try Google Scholar – including to try to find a free full text of a review, if it’s behind a paywall at the journal. (There’s a “pirate” website for this, too, called Sci-Hub.)
Want to keep going? If you want to learn more about critically assessing clinical trials, check out these free sources: