### Part 1: Introduction To Statistical Methodology

Introduction

Since the early 1960’s, statistical studies in medicine have moved from being a newly introduced innovation to the most widely accepted way to verify medical theories and practices. However, the critical limitations of this methodology are often ignored, and this, along with poor handling of appropriate statistical methods, has resulted in many false positive and non-replicable results in clinical research.

History and development

Statistical studies have played an increasingly important role within medicine since the 1960’s, through the pioneering work of Hill, which was based on methodology developed by Fisher in the 1930’s. These types of studies underpinned advances in diagnosis and treatment during the 1970’s. Unfortunately, the derailing influence of vested interests soon became increasingly apparent – beginning in the 1980’s and extending to the present day.

In the 1990’s the medical profession adopted a system for ranking clinical studies and other sources of medical knowledge, including the development and implementation of methodologies to reduce bias in medical research. This initiative was spearheaded by a group of Canadian epidemiologists headed by Sackett. Thus was born what is now termed Evidence Based Medicine (EBM) and Evidence Based Practice. Concurrently, since the mid 1990’s the CONSORT group (focused on clinical trials) and the PRISMA group (focused on systematic reviews and meta-analyses of trials) have periodically issued statements in an effort to develop and disseminate international standards for the transparent reporting of medical research.

Guidelines for assessing clinical trials have been developed and refined since the 1960’s. Currently accepted models include the Jadad scale, the Physiotherapy Evidence Database (PEDro) scale and the Cochrane Collaboration’s risk of bias tool. (1) These, together with the CONSORT and PRISMA statements mentioned above, have been widely published, with the aim of improving and standardising both the design and reporting of clinical trials. They have been incorporated into university curricula and are generally accepted within the medical profession. However, at the time of writing there are still many clinical trials and reports of trials that either fall short of these standards or introduce new and unacknowledged sources of error.

How to read an academic paper

When you read an academic paper in which other works are cited, these other works are either research studies, review papers, official clinical guidelines or pages from an authoritative textbook. The purpose behind including citations is twofold. Firstly, to provide a source from which to verify the facts that are being used in the discussion. Secondly, some references may direct readers to an article or a chapter in a book, in which there is a more comprehensive and detailed discussion of issues that have only been mentioned quite briefly. Therefore, an important part of reading an academic paper is to include at least a cursory glance at the references and then, if necessary, to access a particular source, either to check the facts, or to gain a deeper understanding of the issues involved by reading what others have to say on the subject. You should not always accept a fact or a viewpoint simply because a reference has been cited. The following discussion aims to illustrate why this process of additional scrutiny is absolutely necessary.

Peer review failings: abstracts are often unreliable

You may have presumed that the process of scrutinising source material has already been done for us, through peer review. You may even argue that the whole purpose of reading a review paper on a certain topic, which may be based on a hundred or more original studies and related material, is to get an idea of the most recent advances in a particular area without having to go through all of these other papers.

The inconvenient truth is that this process, in which qualified experts read and critically evaluate a paper before publication, is subject to a considerably greater degree of inconsistency and error than normal human fallibility would lead us to expect. A survey conducted in 1999 found that up to 68% of abstracts in papers from medical journals were either false or misleading, and the situation remained quite poor over the following ten years. (2, 3) With the subsequent publication of the CONSORT statements in 2001 and then in 2010, which included guidelines for the proper reporting of abstracts, it is hoped that the situation may have improved somewhat. (4) However, this appears to be unlikely. (5)

The ‘abstract’ is a short summary of findings at the beginning of a paper that is freely accessible when you do an online search for scientific papers on a particular subject. While many academic papers are available free of charge, most are not, and journals may charge USD $30-50 for access to a complete paper. This can add up quite quickly when you are searching for information on a particular topic! It is not surprising, therefore, that the abstract is the most frequently read portion of scientific papers available on PubMed. (6) The purpose of the abstract is to provide a concise and accurate summary of the paper, highlighting the main content, the purpose of the research, the relevance or importance of the work, as well as the main outcomes together with sufficient supporting data. (7) Unfortunately, quite a large number of abstracts are inaccurate or misleading, in spite of peer review. There are several reasons for this: (2, 3, 5)

- ‘Publication bias’: This leads to pressure on researchers to come up with something positive, as negative trials tend not to be published.
- Vested interests: Sometimes those conducting a study have a vested interest in the drug or treatment protocol appearing to be more effective and safer than it really is. At present most clinical trials on drug treatments are funded by pharmaceutical companies.
- Poor understanding of statistics by researchers and reviewers.
- Sloppy reporting, which may have been ignored because of a reported positive outcome.

Please take a moment to let this fact sink in… The peer review process is deeply flawed, and you cannot fully place your trust in the information gained from the abstract of an academic paper. And this is just the beginning. The situation only gets worse when we examine the other parts of an academic paper, specifically papers within the clinical trial literature. (8)

Poor handling of statistical methodology in medical research

In a recent review, which discusses the widespread poor handling of statistical methodology in medical research literature, we find the bold, but comforting, statement: ‘the most important thing clinicians should know about statistics, are not formulas but basic concepts.’ Additionally, the author proposes that the best way for medical researchers to avoid the common statistical pitfalls is to design and analyse their studies in consultation with a qualified statistician. (9) The implication here is that medical researchers, who may have only studied a single unit of medical statistics as undergraduates, should not attempt to apply statistical science in their research but leave it to the experts. In the same way that a physician will refer patients to a surgeon when there is a surgical problem, a specialist statistician should be given charge of this component of a medical study. Otherwise problems can, and frequently do, arise. The following section explores some of the main problem areas that are raised in this review. (9) However, we must first clearly understand the nature and scope of statistical studies: what they can do and what they can’t do. (10, 11)

Statistical studies do not provide proof

Statistical studies are not designed to provide proof; this is the domain of mathematics and logic. A statistical study on a medical treatment provides information about a specific relationship, i.e. the strength of association between an intervention and the outcomes that have been observed. Such studies are only able to demonstrate association, but not causation. Acceptable proof of causation in medicine is obtained through a complex series of steps that include and go beyond the statistical demonstration of a strong association. (12, 13) Moreover, in any statistical study, because of the limited sample size (i.e., the number of subjects within a clinical trial) compared with the total population with a particular condition, the results will always carry some degree of uncertainty. This is the reason why a clinical trial is usually repeated several times, in different locations and with different members of the population of interest before the results are accepted into mainstream practice.

The critical point is this: statistical studies aim to minimise uncertainty and to minimise inaccuracy; however, they do not, and cannot, entirely remove them. Thus, the results of a clinical trial are never entirely true nor, for that matter, entirely false; even though experts in the field may imply the contrary. (8) The best that can be provided by a statistical study is a likely range of values (or measurements) together with a measure of the degree of uncertainty associated with this range. At best, if the studies are carried out properly, we can be reasonably confident that the outcomes we will see in our clinic are likely to be somewhere within this range. Thus, while inaccuracies and uncertainties are always part of a statistical study, the aim is to minimise them. And this is very much a work in progress.

Figure 1. NORMAL DISTRIBUTION – the BELL-SHAPED CURVE

Note that:

µ = the average or mean value, generally referred to simply as ‘the mean’.

σ = the standard deviation (SD)

Approximately 68% of values in the distribution are within one SD of the mean, i.e., they lie between one SD below and one SD above the mean value

Approximately 95% of values in the distribution are within 2 SD of the mean.

Approximately 99.7% of values in the distribution are within 3 SD of the mean.

Normal distribution

The most fundamental concept in statistical studies of populations is the ‘normal distribution’ curve. If we take a measure of some characteristic (e.g. height) in every member of a particular population (e.g. adult males), and if we plot the resulting measurements on a graph with number of people on the vertical (left hand) axis and height measurement on the horizontal (bottom) axis, the results will conform to the pattern of distribution in Figure 1: the normal distribution. In general, we will find that the measured values (height) will cluster around a ‘mean’ (the average height) and be distributed in such a way that approximately 95% of the population have a height that is within two ‘standard deviations’ (SD) of the mean (i.e. within the range of two SD below the mean and two SD above the mean). The value of the standard deviation is calculated mathematically from the data that have been collected.
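
The 68% and 95% figures quoted above follow directly from the mathematics of the normal curve. As a minimal illustrative sketch (not drawn from any cited study), the proportion of values lying within k standard deviations of the mean can be computed with the error function:

```python
import math

def fraction_within(k_sd: float) -> float:
    """Proportion of a normal distribution lying within k_sd
    standard deviations of the mean: erf(k / sqrt(2))."""
    return math.erf(k_sd / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {fraction_within(k):.1%}")
```

Running this prints approximately 68.3%, 95.4% and 99.7%, which is why the figures attached to Figure 1 are given as approximations.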

In this example our measurements and analysis are highly accurate because we have, theoretically, measured every member of the population, so there is no sampling uncertainty. Our degree of accuracy is limited only by the accuracy of our measuring instruments.

Critical assumptions

What if we don’t have the resources or the time to measure everyone and are only able to measure a small portion of this population? How accurate would our figures be when applied back to the entire population of interest? Here we have the basic ‘leap of faith’. At this point statistical science rests on the assumption that we can quantify both the range of measurements and the degree of uncertainty, based on data obtained from a small portion of the total population. However, this is only an assumption, which may or may not be true. Imagine a situation where we only measured the heights of the adult males in a nursing home, or alternatively, in an elite basketball team. The average height and the range of normal heights would likely be quite different in each situation; and neither would be representative of the entire population. This is an example of what are known as ‘confounding factors’, which have a measurable effect on the results that are being collected.

This is the nature of statistics: given a certain assumption, which generally cannot be proven to be true, the application of a specific statistical test will provide a measure of the strength of the relationship between two or more variables (e.g., the outcome of a particular treatment used on a certain population with a particular disorder).

The other important assumption in clinical trials is that the participants taking the active treatment are evenly matched with those who are taking the placebo. Factors such as age, severity of the illness and psychological state may profoundly influence a patient’s response to an intervention. Sometimes important differences between the two groups may go unrecognised.

Therefore, in a clinical study, no matter how well designed or carefully analysed, there is always a chance, albeit a small one, that the conclusions are not applicable to the population of interest. One or more of our assumptions, conscious or not, may have been erroneous. So, while every effort may have been made to ensure accuracy and reliability in a clinical trial, there always exists the finite possibility that there are hidden errors and that the results will not apply in a real-life clinical setting. Never-the-less, we must also say that a carefully designed clinical trial, analysed correctly and reported transparently will provide valuable and useful information most of the time, and that statistics do have a rightful place in our clinical decision making. Our goal is to discover the best ways to evaluate this kind of information.

Mean and standard deviation

Let us return to our example of height measurements in adult males. If you were to examine only one member of this population, his height could be anywhere between the lowest measure and the highest measure, so your ability to predict his height is at its lowest point. In addition, we are unable to make a judgement about whether his height is normal or abnormal. These types of judgement are the basis for some of the essential statistical concepts: our intuitive idea of ‘normal’ in this case means, in statistical terms, that a person’s height is close to the mean (average) height, and ‘abnormal’ signifies that the person is an outlier, outside of the majority. Statistical science quantifies this: ‘normal’ height = the range of height measurements clustered around the mean (i.e., within the range of plus or minus two standard deviations), and ‘abnormal’ height would be any measurement that is outside of this 95% range. The more people you examine, the more likely you are to find that most of the heights fall into a range that gets closer and closer to that of the whole population. That is to say, the range of values within which 95% of the heights fall becomes closer and closer to the two SD range on either side of the mean of the entire population. Conversely, the fewer people that are examined, the further from the true population values (i.e. the less accurate) your results are likely to be. Therefore, we need to find a practical compromise, i.e. a minimum number of subjects that will provide meaningful results.
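
This convergence of sample statistics toward the population values can be demonstrated with a small simulation. The figures used below (a population mean of 175 cm and SD of 7 cm) are invented purely for illustration:

```python
import random
import statistics

random.seed(42)

# Hypothetical population of adult male heights (values assumed for illustration).
population = [random.gauss(175, 7) for _ in range(100_000)]

# Larger samples tend to give estimates closer to the population mean and SD.
for n in (10, 100, 10_000):
    sample = random.sample(population, n)
    print(f"n={n:>6}: mean={statistics.mean(sample):6.1f}, SD={statistics.stdev(sample):4.1f}")
```

Any single small sample may happen to land close to the truth, but only the larger samples do so reliably, which is the practical reason trials need a minimum number of subjects.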

Confidence intervals: two additional bell-shaped curves

The discussion in the above paragraph illustrates the fact that any study dealing with a limited number of subjects will generate a mean that is likely to differ from the true mean of the whole population of interest – remembering that the mean is the average, around which the actual measured values will be clustered, both above and below. This degree of uncertainty can also be quantified and is provided in clinical studies by the ‘confidence interval’ (CI), which represents the range within which the true mean for the entire population is likely to lie. Thus, a more realistic interpretation of the results of a clinical study would be visually represented by adding two additional bell-shaped curves, one to the left and one to the right of the one in Figure 1, above. One of our new bell-shaped curves would be centred on the lowest value of the CI and the other on the highest value. This idea has important consequences for the real-life clinical application of clinical trial results.
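
A minimal sketch of how such an interval is computed follows. It uses the normal approximation (mean plus or minus 1.96 standard errors); a real trial with a sample this small would use the t-distribution instead, and the height values here are invented for illustration:

```python
import math
import statistics

def ci95(sample):
    """Approximate 95% confidence interval for the population mean:
    sample mean +/- 1.96 standard errors (normal approximation)."""
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return m - 1.96 * se, m + 1.96 * se

heights = [172, 168, 181, 175, 169, 178, 174, 171, 177, 173]  # hypothetical data, cm
low, high = ci95(heights)
print(f"sample mean {statistics.mean(heights):.1f} cm, 95% CI ({low:.1f}, {high:.1f})")
```

The two extra bell-shaped curves described above would be centred on `low` and `high` respectively.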

Probability and statistical significance

The other important axiom in statistical studies has to do with the application of probability theory. The logic behind this process may be illustrated with a simple example: how to calculate the probability (denoted by the symbol, ‘p’) that a coin tossed five times will come down with the same side up each time. The first toss determines which side you are looking for; and in the four subsequent tosses the probability of getting this particular side (say, heads) is one in two (i.e. 0.5 or 50%) for each toss. To calculate the probability of four more tosses showing heads we multiply these probabilities together: 0.5 x 0.5 x 0.5 x 0.5 (= 0.0625). This gives us a probability of 6.25% (p = 0.0625).

The precision of the maths belies the fact that we are not dealing with anything concrete here. This calculation seems to imply that if we were to repeat the exercise 200 times, we would find about 12 or 13 sequences of five identical faces (200 x 0.0625 = 12.5). Unfortunately, a single experiment is unlikely to produce exactly this number. What this p value really means is that if we repeated the experiment (200 lots of 5 throws) many times, the average number of times we get 5 identical faces will get closer and closer to 12.5, the more times we repeat the experiment. Additionally, the results will match a normal distribution pattern more and more closely the more times the experiment is repeated. Therefore, the best we can say is that the count in any one experiment is likely to fall within a particular range, e.g. roughly 9 to 16 (the standard deviation of this count works out to be about 3.4). Please note that this gives a range of likely values rather than a fixed definite value.
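
This long-run behaviour is easy to verify by simulation. The sketch below (illustrative only) repeats the 200-lots-of-five experiment many times and averages the counts:

```python
import random

random.seed(1)

def count_all_same(n_runs=200, tosses=5):
    """Count how many of n_runs sequences of fair-coin tosses
    come up with the same face every time."""
    count = 0
    for _ in range(n_runs):
        flips = [random.random() < 0.5 for _ in range(tosses)]
        if all(flips) or not any(flips):
            count += 1
    return count

# One experiment rarely gives exactly 12.5; the average over many approaches it.
counts = [count_all_same() for _ in range(2_000)]
print(f"average count: {sum(counts) / len(counts):.2f}  (theory: 200 x 0.0625 = 12.5)")
```

Individual entries in `counts` scatter widely (roughly 9 to 16 most of the time), while their average settles near 12.5, exactly as the paragraph above describes.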

To return to our original example, after the five identical tosses we might be suspicious that this coin is biased, but as p is not less than 0.05 (the generally accepted cut-off for significance) we may accept that the coin could indeed be normal. However, if we get another head on the sixth throw (p = 0.03125) then we can start to become suspicious – that is, if we are strictly following the standard statistical paradigm. Personally, I would be examining the coin very closely after the fourth head, especially if I had been betting on tails!

This illustrates several important points about the application of statistics in research. In clinical studies the level of statistical significance is generally set at 5% (p < 0.05). We, or rather the statisticians, can calculate the probability that the difference between the results in the two study groups is due to random chance. When this probability, ‘p’, is less than 5% (expressed as p < 0.05) we say that the results are statistically significant. This cut-off point for significance is much like an ‘industry standard’. The widespread acceptance of this standard makes it easier to apply for funding, get your paper published in a journal, and get approval from your peers. But we need to remember that this is not an expression of objective truth, nor a demonstration of proof (and these are common misinterpretations); a study could just as well use 10% or 1% (p < 0.1 and p < 0.01 respectively); it depends on whether or not the benefits or risks are major or trivial. If there is a major risk involved in a treatment (i.e. severe and disabling side effects) we may be willing to accept a greater degree of uncertainty when assessing this risk.

Let us now extend our example of five coin-tosses into a clinical trial. We are suspicious that our original coin is biased in some way so that it favours heads. We must find another coin that is normal in every way and compare the results when we do five tosses in a row, repeated, say, 200 times. The new coin represents the placebo group and the original coin, the treatment group. We want to see if the results we get with the first coin are different from those with the standardised new coin. If at the end of the trial we had close to the expected 12 or 13 lots of same-face throws with the normal coin, that is to be expected and we can be confident that this coin is normal. If we also ended up with a similar result for the original coin, then we can also be quite certain that this coin is not biased. Here is where things get interesting. If we get a different result with our original coin, what amount of difference should be deemed significant? Statisticians have methods to calculate the probability that the difference in results between the two groups has occurred by chance only, and this is given as a number between zero and one. As the number gets closer and closer to zero the difference is less and less likely to have occurred by chance. The cut-off point is generally accepted as 0.05, which signifies that the results would only be expected to occur by chance five times in a hundred.
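
The calculation the statisticians would perform in this coin-trial example can be sketched exactly with the binomial distribution. The observed count of 22 below is an invented result for the suspect coin, chosen purely to illustrate the method:

```python
import math

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance that pure luck
    would produce k or more all-same sequences in n lots of five tosses."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# A fair coin yields an all-same sequence of five with probability 0.0625,
# so about 12.5 such sequences are expected in 200 lots.
p_value = binom_tail(200, 22, 0.0625)   # suppose the suspect coin gave 22
print(f"p = {p_value:.4f}")
```

Since this p value falls below the conventional 0.05 cut-off, the difference between the two coins would be declared statistically significant.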

All too often this process is used inappropriately. In a clinical trial where the placebo group has very few positive responders and the treatment group has a majority of positive responders, the results are obvious, and we do not need the statistical analysis to tell us whether or not the treatment works. However, when the placebo group has around 40% of subjects responding (measured as remission or significant improvement), and the active treatment group does marginally better, as in most published trials on antidepressants (14, 15) then the finding of a ‘statistically significant’ difference between the two groups may be misleading. Apart from avoiding the issue of clinical significance (i.e. whether or not we can expect to see real benefits to patients in the clinic), attention is diverted away from issues that need to be examined more closely, such as the validity of the diagnosis and the normal course of the illness.

Application of statistics within clinical trials

Clinical studies deal with the likelihood that a particular health outcome will occur within a given population when a particular therapeutic intervention is applied. There will always be some degree of uncertainty, but by applying statistical analysis we endeavour to minimise this uncertainty to acceptable levels. There are three important variables in this process, each having a critical influence on the others: the size of the effect (that we are hoping to achieve or avoid), the size of our sample (i.e., the number of participants in a trial) and the cut-off point that we choose for deciding whether or not the relationship is significant. Therefore, statistical studies are not objective; preconceived ideas (a.k.a. biases) are incorporated into them in the form of assumptions that are used to set the statistical parameters. These assumptions derive from decisions made at the critical steps in designing the study, based on the researchers’ values, their ideas about an illness, and what level of risk or benefit is deemed to be acceptable or desirable.

Now, what exactly are we testing in a clinical trial? The common misconception is that we are attempting to quantify a treatment’s effects – therapeutic and/or adverse. Unfortunately, this is not what clinical trial results are telling us – even though, according to the principles of evidence-based medicine (EBM), clinicians are entitled to make such an inference, using their knowledge and experience, along with advice from peers and mentors, when considering how to incorporate the results of a trial into their practice. (16, 17) The term ‘statistical significance’, as used in clinical trials, refers to the process used to establish that the results obtained by the treatment (a.k.a. the ‘active intervention’) were not likely to be due to the effects of random chance.

Essentially, when trial results are characterised by a low p value (i.e., below the 5% or other nominated cut-off point for significance), this is telling us that we have sufficient evidence to reject the ‘null hypothesis’, the proposition that the intervention is doing nothing and that the observed differences between the two study groups are due to random chance. When we are able to reject the null hypothesis, it means that the real effects of the intervention (i.e. the difference between the effects seen in the active treatment group and the placebo groups) do not lie within the same range as those that would occur due to random factors. We are more than 95% sure of it; but, of course, we could also be wrong – and there is less than a 5% chance of that. This is what the ‘evidence’ gained from a clinical trial is telling us: it is highly likely that the active treatment is not doing nothing.

Null Hypothesis Significance Testing (NHST)

Why are clinical trials conducted in this way? Historically, clinical trial methodology was developed from the methods used in epidemiology, where the most practical way to test for a significant factor in a disease outbreak is to first analyse the data to see whether or not the factor under consideration has effects that are not simply due to chance. Obviously, there is an almost unlimited number of factors at play within any given scenario, so an irrelevant one may well be chosen at first. Hence the need for an efficient, low-cost method that doesn’t require large resources while you do repeated tests to find something that may actually be having an influence (or rather ‘not having no influence’) on the outbreak, spread and severity of the disease being studied. Moreover, this methodology is most suitable for assessing scenarios in which there are a number of different factors (a.k.a. ‘variables’) at play, some of which may only be having a small, but significant, effect. Thus, epidemiological methods are designed to detect variables with different degrees of influence, ranging from quite small to quite large.

In this way, an epidemiological study begins with the proposal, or ‘hypothesis’ that a particular factor is having no influence on a disease – the null hypothesis. Then data is collected and analysed in order to accept or reject the null hypothesis. This is where the p value comes in, and this process is referred to as ‘null hypothesis significance testing’ (NHST). When applying this methodology to assess the effects of a single medicinal intervention on the course of a disease, there are many changes that need to be made to the original epidemiological methods. Thus, clinical trial methodology in particular, as well as EBM as a whole, are still very much works in progress – and there are many critical areas where improvement is warranted.

In clinical trials, when the p value is given, it is generally implied (unless stated otherwise) that the standard p < 0.05 is taken to be statistically significant. When p is less than 0.05, this is equivalent to saying that if there were no real difference between the outcomes in the two groups being studied (i.e., the active treatment group and the placebo group), the outcomes obtained in this trial would only occur very rarely in the population of interest. Therefore this particular trial result most likely represents the sort of result you would see in the majority of this population. In practice, there are ways to calculate the minimum number of subjects in a trial so that the results will, in fact, be meaningful.
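The sample-size calculation mentioned above can be sketched with the standard normal-approximation formula for comparing two proportions. The response rates used here (40% on placebo versus a hoped-for 55% on treatment) are invented for the example; 1.96 and 0.8416 are the standard normal quantiles corresponding to a 5% two-sided significance level and 80% power:

```python
import math

def n_per_group(p1, p2, z_alpha=1.96, z_beta=0.8416):
    """Minimum subjects per group needed to detect a difference between
    response rates p1 and p2 (normal approximation, two proportions)."""
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# e.g. 40% placebo response vs a hoped-for 55% treatment response:
print(n_per_group(0.40, 0.55), "subjects per group")
```

Note how sensitive the result is to the expected effect: because the difference term is squared in the denominator, halving the expected difference roughly quadruples the required sample size.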

Thus, when the p value is less than 0.05, and we can reject the null hypothesis, we are confident that there must be some sort of association between the administration of the treatment and the measured outcomes and that this finding applies to the whole of the population of interest. Put in another way, the results allow us to be reasonably sure that the intervention is not doing nothing within the study group and that it will also not do nothing when applied within the population of interest.

Ground zero: the placebo group

In a trial that compares an active treatment with a placebo, when the p value is less than 0.05 the difference between the effects of the two interventions being compared (e.g., between active and placebo, or between two different active treatments) is deemed not to be due to random chance; in other words, the treatment is deemed not to be doing nothing. The placebo arm of the study provides the reference level that defines what ‘doing nothing’ means in measurable terms. It is important to bear this in mind, as one study comparing herbal and pharmaceutical treatments for depression makes the extraordinary statement, when reporting on the failure of the active treatment to surpass the placebo: ‘These findings were clearly due to the consistently high placebo response rate on all outcome measures.’ (18)

This line of reasoning contains two major flaws. One is that, by definition, the placebo response rate is ground zero: the response rate in the placebo group represents the zero setting that is to be used in order to accurately measure the results (if any) of the active treatment. This means that whatever response is found in the placebo group, the only way it can be legitimately used is to subtract it from the response of the active treatment group, in order to get a measurement of the true response to the active treatment.

The second important error (which the above researchers also commit) is to ignore the demand of clinical trial protocol that the response within a particular group – i.e. the difference between the measures taken at the beginning of the trial and those taken at the end within that group – be disregarded. The outcome measures only become meaningful when a comparison is made between the active treatment group and the placebo group at the end of the trial, subtracting the latter results from the former. Otherwise, when the active treatment group and the placebo group both improve (e.g., in a self-limiting disease, in which patients tend to get better without any treatment), you can get a false impression of efficacy if you only look at the results in the treatment group.
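
The arithmetic of this ‘ground zero’ principle is simple and worth making explicit. The response rates below are invented for illustration, and the number-needed-to-treat figure at the end is a standard way of expressing the same subtraction in clinically meaningful terms:

```python
# Hypothetical trial of a self-limiting illness: both groups improve.
placebo_response = 0.40     # 40% of the placebo group improved
treatment_response = 0.48   # 48% of the treatment group improved

# Looking at the treatment group alone gives a false impression of efficacy;
# the treatment-attributable effect is the between-group difference.
true_effect = treatment_response - placebo_response
print(f"effect beyond placebo: {true_effect:.0%}")

# Number needed to treat: how many patients must be treated
# for one of them to benefit beyond what placebo alone would achieve.
nnt = 1 / true_effect
print(f"number needed to treat: about {nnt:.0f}")
```

A headline ‘48% of treated patients improved’ thus conceals the fact that only 8 patients in 100 improved because of the treatment itself.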

Confounding factors

A critical assumption in null hypothesis significance testing (NHST), which is virtually impossible to prove, is that we have accounted for all of the possible confounding factors, so that the study groups are equally matched, or ‘controlled’. Important confounding factors include age, gender, severity of illness, duration of illness, previous treatments (and how long since stopping them before entering the trial), current medications, socio-economic factors, education level, patient expectations, attitudes to the illness (e.g. perceived benefits from being ill), and the validity of the diagnosis, i.e. do the subjects all have the same disease? In addition, the potential biases of the researchers are removed through randomization and blinding, so that assignment of participants to the study groups is done by a computer-generated random system, and none of the people who administer the treatments and measure the outcomes know to which group a participant belongs. (19, 20)

As a previous US Secretary of Defense once explained: ‘There are known knowns. There are things we know we know. We also know there are known unknowns. That is to say, we know there are some things we do not know. But there are also unknown unknowns, the ones we don’t know we don’t know.’ And there’s the rub: in spite of careful assessment of potential confounding factors in a clinical trial, in which every effort has been made to ensure that these factors are evenly matched between groups, it is always possible that some other yet to be discovered factors have played a decisive role in the trial outcomes. (19, 20) Another reason to be cautious about accepting the results of a single clinical trial.

Thus, another critical aspect of NHST in clinical trials, and one that is often neglected, is that when the p value is below the nominated cut-off point for statistical significance (usually 0.05), it means either that the active treatment is not doing nothing, or that unknown or unacknowledged confounding factors are at work, influencing the trial results. A well-reported clinical trial should discuss how potential confounding factors were prevented from influencing the trial outcomes, together with a brief discussion of other possible confounders. Unfortunately, researchers sometimes fail to do this, and sometimes there are unknown factors at play.

Another important issue with confounding factors is that clinical trials, randomized and controlled as they aim to be, are a step removed from real life clinical scenarios, in which individual patients, who may, indeed, come to us at random, are in no way ‘controlled’. Each one comes complete with a unique set of confounding factors. This fact represents a considerable barrier to the application of a generalized ‘significant’ result from a clinical trial to an individualized clinical encounter.

### Part 2: Common Sources Of Error In Contemporary Clinical Research

Introduction

The application of statistical methodology in medical studies was first championed by Hill in the early 1960’s. (21, 22) Better known by his full name, Sir Austin Bradford Hill introduced randomised controlled trials into clinical medicine, and also repeatedly warned of their limitations and their potential for misuse: problems that are still current some 50 years down the track. (23, 24) Hill cautioned against overemphasising statistical significance as well as against neglecting the possibility of undetected errors. Unfortunately, this advice remains pertinent today, when there are still too many unnecessary ‘false positive and non-replicable results’ in clinical research. (24)

A renowned professor of statistics published a seminal paper in 2005, entitled ‘Why Most Published Research Findings are False’. (8) Ten years later, statistical errors are still all too common. (23, 24, 5) Mills’ famous statement from 1993 still rings true: ‘If you torture the data long enough, they will tell you what you want to hear’. (25) You do not need to be well trained in the minutiae of statistics to spot the major problem areas, if you know where to look. The following is a summary of the important issues for assessing the statistical accuracy, or otherwise, of a medical paper. (9, 23, 26)

Lack of homogeneity between study groups

In a clinical trial the effects seen in a group of subjects receiving a treatment are compared against those seen in a similar group who are given a placebo, and often also against a group receiving no treatment at all (i.e. on a waiting list). You should not simply trust the ‘random allocation’ of subjects, nor p values that appear to confirm that there are no significant differences in confounding factors between the study groups (e.g. age, severity of illness, previous treatments, etc.). The logic behind NHST dictates that this test is completely inappropriate for assessing whether or not there are significant differences between groups in a clinical trial. (5) Therefore, if the study authors provide p values when comparing the characteristics of the two (or more) study groups, this is a meaningless measure – well intentioned or otherwise.

Good reporting requires that all relevant characteristics of the people within each group be comparable, and the best way to show this is to provide the information in the form of a table. You should check that these groups are, in fact, equal and homogeneous, and that no important factors are omitted. As previously mentioned, factors that could potentially affect a trial’s outcomes include age, gender, severity of illness, duration of illness, previous treatments (and how long since stopping them before entering the trial), current medications, socio-economic factors, education level, patient expectations, attitudes to the illness (e.g. perceived benefits from being ill), and the validity of the diagnosis, i.e. do the subjects all have the same disease? The known potential confounding factors should be acknowledged, and other possible (i.e. ‘unknown’) confounders briefly explored.

A critical issue when assessing the homogeneity between groups is to look at how widely the pre-trial measurements of the condition being treated (e.g. severity of depression) are spread out. Even when the mean values for each group at the beginning of the trial are exactly the same (or very close), if the spread of values on either side of the mean is markedly different for each group, this can have a profound influence on the trial outcomes. A group with more subjects who are severely depressed and a few who are mildly depressed (i.e. when the measurements are very widely spread around the mean) may show markedly different responses to a treatment than a more homogeneous group with scores that cluster closely around the mean. Therefore, it is important to look at the standard deviations (SD) as well as the mean values of the main outcome variable for all groups at the commencement of the trial. Both the mean and the SD values should be similar in each of the study groups.
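This kind of baseline check can be sketched in a few lines. The group names and scores below are hypothetical illustrations, not data from any actual trial:

```python
# Sketch: assess baseline homogeneity by comparing means and SDs directly,
# rather than trusting p values. All scores are hypothetical examples.
from statistics import mean, stdev

baseline = {
    "treatment": [22, 25, 24, 21, 26, 23, 24, 25],  # e.g. baseline severity scores
    "placebo":   [23, 24, 25, 22, 24, 23, 26, 21],
}

for group, scores in baseline.items():
    print(f"{group}: mean = {mean(scores):.2f}, SD = {stdev(scores):.2f}")
```

Here the means (23.75 vs 23.50) and SDs (≈1.7 vs ≈1.6) are similar, so the two groups look comparable on this variable; a similar mean combined with a very different SD would be a warning sign.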

Inappropriate statistical tests to analyse the data.

While this is not an easy one for a non-statistician to spot, a study should at least report which particular statistical test was used to analyse the data and also provide reasons for the choice. You should be cautious in accepting the conclusions if this is not the case, particularly when the size of the treatment effect is very small and the statistical analyses appear to be very complex. A basic rule of thumb: if a treatment really works in clinical practice (i.e. provides clinically significant results), or if one treatment is really better than another, it should be obvious; the statistical analysis may then provide a likely range of values for the results and perhaps also a comparison of results between different subgroups (e.g. older vs. younger patients, males vs. females).

Placing too much importance on the p value.

Following on from the discussion above, the p value lets us know whether or not we are on the right track, but on its own it doesn’t tell us what we really want to know. It only tells us the probability of obtaining results as (or more) extreme than the observed data if the null hypothesis were true. Thus, it begins with the assumption that there is no significant difference in a particular factor (i.e. the expected outcome of an intervention) between the two groups being compared. When we examine the data at the end of the trial and calculate p, if it is below the accepted level (generally less than 0.05, i.e. below 5%), then we reject the null hypothesis and infer that the therapy is actually producing effects that are different to those occurring in the placebo group.

It is important to be aware that the p value is a function of three different factors: the size of the treatment effect, the sample size and the variability within the sample. Thus, the p value alone does not provide information about any one of these factors; in particular, it does not give an indication of the size of the treatment effect. As discussed above, the lower the p value, the more confident we can be that the treatment is not doing nothing. We may then infer that it is very likely that our treatment is largely responsible for the observed clinical effects (all else being equal), but we still need some way to quantify these effects. This information is provided by the confidence intervals.

At this stage we need to bear in mind the critical distinction between statistical significance (we strongly suspect that our treatment is having some sort of effect) and clinical significance (the effect is not trivial and will make a real difference to the health and well-being of patients and caregivers). Here again, the p value does not provide the necessary information. The effects of the treatment need to be quantified, so that we can know that they are not trivial. This information should always accompany the p values in a trial report and is given in the form of confidence intervals.

Misinterpreting the confidence interval.

The confidence interval (CI) relating to the effect size should always be given along with the p values in a trial report. We need to remember that the ‘effect size’ we are talking about is actually the mean (or average) effect size. The individual effect sizes of the participants in the trial are generally clustered around this mean in a normal distribution pattern (see Figure 1). The CI, generally expressed as the ‘95% CI’, is the range of values within which, with 95% confidence, the true mean of the treatment effect lies when applied to the whole population of interest.

The 95% CI is generally a fairly narrow range. However, its significance is often misunderstood, as it is commonly described as the range in which the ‘true value’ is most likely to be. Unfortunately, this ‘true value’ does not refer to the actual size of the treatment effect that we are most likely to see in this population; it refers to the mean treatment effect size. This is a critical distinction: the CI does not signify that if we gave this treatment to 20 people, at least 19 of them would have a clinical response that lies within the CI range. The CI only tells you the range within which the population mean (or ‘true mean’) is most likely to lie – with most of the real values (i.e. the actual sizes of the treatment effect) falling on either side of this mean value.
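The distinction can be illustrated with a small simulation (hypothetical numbers, normal approximation): the 95% CI for the mean is far narrower than the spread of individual responses, so most individual outcomes fall outside it.

```python
# Sketch: the 95% CI bounds the *mean* treatment effect, not the range of
# individual responses. Effect sizes here are simulated, hypothetical values.
import math
import random
from statistics import mean, stdev

random.seed(0)
effects = [random.gauss(5.0, 4.0) for _ in range(100)]  # individual effect sizes

m, sd, n = mean(effects), stdev(effects), len(effects)
half_width = 1.96 * sd / math.sqrt(n)  # normal-approximation 95% CI for the mean
ci = (m - half_width, m + half_width)

inside = sum(ci[0] <= e <= ci[1] for e in effects) / n
print(f"95% CI for the mean: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"fraction of individual responses inside that CI: {inside:.0%}")
```

Only a small minority of the individual responses lie within the CI: it is a statement about where the population mean is likely to sit, not about what any given patient will experience.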

This concept raises some important issues when applied to clinical practice. Say, for example, that we are reviewing a clinical trial in which the outcomes of a treatment (measured as ‘effect size’) at or above the mean were clinically significant, while outcomes below the mean were measurable but not clinically significant. If we based our clinical expectations on the mean effect size of the treatment group in the trial, we would be confident of achieving clinically significant outcomes in more than 50% of patients. However, when we take into consideration the 95% CI range, within which any value could be the true population mean, and look at the worst-case scenario (i.e. when the lowest value of the CI represents the true mean in the population), things don’t look quite so good. In this case, considerably more than 50% of patients would fail to gain clinically meaningful results, and considerably less than 50% would have favourable outcomes.

These considerations may have a major influence on whether or not we choose to give this treatment to our patients. This information would have been concealed from us if we only relied on the mean treatment effect that was found in the clinical trial and misinterpreted the CI as the range of outcome measures that would be seen in 19 out of 20 patients in the general population. It may not be easy to make the necessary calculations, as studies in which the 95% CI shows the active treatment in a less favourable light, may not provide you with the relevant data, especially not in the abstract. Another rule of thumb: if the authors of a study do not clearly delineate the response level, above which you have clinically meaningful results, do not provide standard deviation values, or omit the confidence intervals – they are probably trying to conceal something.

This is an area in which researchers and those who report research findings are often able to ‘creatively’ present the data. Obviously, if you choose to report the trial results as if the upper CI limit were the true mean for the population being studied, it looks much better than results based on the lower limit. And we need to remember that any value within the CI range could be the true mean. Therefore, in a clinical trial where only the treatment outcomes (effect size) at or above the mean were clinically significant, as in the example two paragraphs above, the best that can be said is that further research is warranted.

Poor handling of dropouts and outliers

Some subjects will inevitably fail to continue up until the end of the trial, for a variety of reasons (e.g. intolerable side effects of the active treatment or impatience for clinical results) – just as some patients that we see in clinic fail to continue with a course of treatment or never come back after the first consultation. Additionally, some subjects in a trial, as in our clinics, do not follow instructions and fail to take their medicine regularly. Subjects like this are referred to as ‘dropouts’ and researchers are often tempted to exclude them from the analysis of the trial results.

Another critical sub-group of subjects are those who experience effects that are considerably outside the usual responses, ranging from no effect at all to dramatic and rapid effectiveness – both in the treatment and placebo arms of the trial. How are these people to be treated in the trial results? Do they represent random ‘freak’ events that crop up from time to time within the general population? Do they belong to the 5% of outliers that we would expect to find in any normally distributed variable? If these subjects are part of that outlying 5%, we can expect that within the entire population there will be an equal balance of these results on either side of the mean, even if within our small-scale trial the results may be skewed one way or the other. Alternatively, extreme treatment effects may occur in considerably more than 5% of the general population for reasons related either to the intervention itself or the person receiving the treatment.
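The ‘5% of outliers’ intuition mentioned above follows from the normal distribution itself, and can be checked with a short calculation (a sketch using the standard normal CDF, not data from any trial):

```python
# Sketch: under a normal distribution, about 5% of values fall outside
# mean ± 1.96 SD, split evenly between the two tails.
import math

def outside_fraction(k):
    """Fraction of a normal distribution lying beyond ±k standard deviations."""
    phi = 0.5 * (1 + math.erf(k / math.sqrt(2)))  # standard normal CDF at k
    return 2 * (1 - phi)

print(f"beyond ±1.96 SD: {outside_fraction(1.96):.1%}")  # ≈ 5.0%
print(f"beyond ±1.00 SD: {outside_fraction(1.00):.1%}")  # ≈ 31.7%
```

If markedly more than about 5% of subjects show extreme responses, the ‘random freak event’ explanation becomes less plausible, and a cause related to the intervention or the patients should be suspected.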

In light of these considerations, some researchers may be inclined to completely exclude the outliers from the final analysis of the trial results. However, in real world clinical scenarios, where practitioners see only a small proportion of the total population, the anomalous outcomes seen in a trial reflect the possible outcomes that may be seen in an individual clinical practice.

A good trial should include all dropouts in the final analysis, as this reflects real life and helps provide an assessment of the overall clinical strategy related to the intervention being studied. This is referred to as intention-to-treat (ITT) analysis. Always check that the numbers of subjects analysed at the end are the same as the numbers enrolled at the beginning of the trial. Generally, dropouts should be counted as ‘treatment failed’. Additionally, large numbers of dropouts can give you some useful information: if they occur in the treatment arm, the treatment may be causing unpleasant side effects and/or is ineffective.
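The difference that ITT accounting makes can be sketched with hypothetical counts (none of these numbers come from a real trial):

```python
# Sketch: intention-to-treat (ITT) analysis counts dropouts as treatment
# failures; a per-protocol analysis that excludes them flatters the treatment.
# All counts are hypothetical.
enrolled   = {"treatment": 50, "placebo": 50}
dropouts   = {"treatment": 12, "placebo": 4}
responders = {"treatment": 28, "placebo": 15}  # completers who improved

for arm in enrolled:
    completers = enrolled[arm] - dropouts[arm]
    per_protocol = responders[arm] / completers   # dropouts silently excluded
    itt = responders[arm] / enrolled[arm]         # dropouts counted as failed
    print(f"{arm}: per-protocol {per_protocol:.0%}, ITT {itt:.0%}")
```

In the treatment arm the per-protocol response rate (≈74%) looks considerably better than the ITT rate (56%); always check which denominator a trial report is using.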

For the same reasons, subjects with effects that are extreme and unusual should also be included in the end of trial analysis, or the researchers should provide valid reasons why they were excluded.

Within group comparisons

This refers to researchers using comparisons between the baseline (at the beginning of the trial) measurements and end of trial measurements within the one group. This is called a ‘within-group paired test’. In spite of the fancy name, it is not valid. Even when this comparison shows a clinically significant improvement, it is completely irrelevant. There could have been other factors, both known and unknown, that caused a similar improvement within the placebo group, thus neutralising the apparent effects of the active treatment. The only valid comparison is between the placebo and the treatment groups, and only this comparison can provide a measure of the actual effect of the treatment. This is an essential principle of clinical trial methodology, which is often ignored when the researchers or sponsors want to hide the true facts and give a positive spin to the trial results.

P value hacking

This is the main way in which the data are ‘tortured until they tell you what you want to hear’ – using the same set of data to test one or more new hypotheses, especially when the one that was tested originally has failed to reach statistical significance. Let us suppose that we have a clinical trial on treatments for depression, in which the average response of the active treatment group is only marginally better than that of the placebo group. However, there are quite a few subjects in the active treatment group with very good clinical outcomes, far exceeding the best ones in the placebo group; unfortunately, there are also a number of subjects in the active group with minimal or no improvement – hence the low mean response within the active group. The logical next step would be to look for common characteristics in the subgroup with a good response and compare them with similar patients in the placebo group. This is a subgroup analysis and, strictly speaking, does not form part of the legitimate results of the original trial. In addition, every additional subgroup analysis increases the risk of a false positive finding, such that the ‘statistical significance’ of successive analyses becomes completely meaningless. The only valid use for this observation is to develop a new hypothesis and conduct a second trial to test it, e.g. that certain characteristics in patients with depression lead to consistently good outcomes when using the same treatment protocol as in the original trial.

Therefore, any new hypotheses that are formulated after the trial data have been gathered and analysed should not be given much weight. The main reasons for this are as follows:

- There are usually too few subjects with these specific characteristics in the treatment and placebo groups for a meaningful comparison
- The placebo and active subjects may not be matched in terms of other important characteristics
- Mathematically, the more hypotheses you try to prove with a single set of data, the more likely you are to have erroneous findings.
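The last point in the list above can be quantified. Assuming m independent subgroup tests at α = 0.05 on data containing no real effect anywhere, the chance of at least one spurious ‘significant’ result is 1 − 0.95^m:

```python
# Sketch: the family-wise false positive rate grows rapidly with the number
# of subgroup analyses, even when there is no real effect anywhere.
alpha = 0.05
for m in (1, 5, 10, 20):
    familywise = 1 - (1 - alpha) ** m
    print(f"{m:2d} subgroup tests -> P(at least one false positive) = {familywise:.0%}")
```

With 20 subgroup analyses, the chance of at least one bogus ‘significant’ finding is roughly 64% – which is why post hoc subgroup results can only generate hypotheses for future trials, not conclusions for the current one.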

Often used as a means to get a research paper published, ‘p value hacking’ is an attempt by researchers to find something that is statistically significant in the face of a non-significant finding in the main trial outcome. The data collected are analysed in different ways, looking at various subsets and (inevitably) finding one or more that provide a statistically significant result, often without any real clinical significance. Sadly, the bogus statistically significant result is then reported as if it were the main finding of the study, possibly appearing in the title or at least in the abstract. The converse may also occur, where a subgroup that was part of the initial trial protocol is conveniently omitted when the results do not suit the interests of the researchers. These practices (or rather malpractices) of post hoc hypothesizing are also known as ‘HARKing’ (Hypothesizing After Results are Known).

Of course, an important part of analysing the data at the end of a trial is to look for patterns, both in terms of the desired effects of a treatment as well as the unwanted effects. If this leads to a new hypothesis being developed (e.g. side effects are more common in subjects who are over 60), that is a good thing. However, this new hypothesis cannot be applied back to the original trial – it can only be used as the basis for future trials.

Mistakenly inferring effect size from the p value.

As noted above, the p value is a function of the sample size; as you increase your sample size, the p value automatically decreases. This means that if the p value is too high to give statistical significance to your test results, you just have to continue the trial, adding more and more subjects until you get to the point where the p value is less than 0.05. In this way you can produce a ‘significant’ result, even when this result is exactly the same as that of the original smaller scale trial – with the same mean value, same spread of outcome measures around the mean and the same difference between the two groups. (27) Therefore, a very low p value should never be interpreted as an indication of a favourable effect size. Moreover, ‘statistical significance’ should not be taken to mean ‘clinical significance’. While different methods of measuring clinical outcomes may show a small difference in favour of the active treatment group, we always need to be sure that the net effect of the active treatment does, in fact, make an appreciable positive difference in the life of patients and carers.
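This effect of sample size can be demonstrated with a simple two-sample z approximation (all numbers hypothetical): the mean difference and the spread are held constant while only n grows.

```python
# Sketch: with the mean difference and SD held fixed, the p value falls as
# the sample size grows. Two-sample z approximation; numbers are hypothetical.
import math

def two_sample_p(diff, sd, n):
    """Two-sided p value for a mean difference 'diff', per-group SD and group size n."""
    z = diff / (sd * math.sqrt(2.0 / n))           # difference / its standard error
    phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))   # standard normal CDF
    return 2 * (1 - phi)

for n in (20, 80, 320):
    print(f"n = {n:3d} per group -> p = {two_sample_p(2.0, 6.0, n):.4f}")
```

An identical difference of 2 points (SD 6) is ‘non-significant’ at n = 20 per group but comfortably ‘significant’ at n = 80, without the treatment effect changing at all.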

HIDDEN SOURCES OF BIAS AND SCIENTIFIC ERRORS

In addition to mishandling of statistical methodology, there are a number of other common sources of error in clinical trials. Although CONSORT and PRISMA guidelines have been widely promulgated, inadequate or improper reporting of clinical trials along with failures to adhere to best practice in methodology are still common. Additionally, there are several weaknesses within the accepted clinical trial methodology that critically impact the quality of the results. The following list outlines some of the more readily detectable ones. (9, 23)

Use of casual and imprecise language.

The use of casual, imprecise or highly emotive language, especially in the abstract, should be a red flag for a ‘spin alert’. Authors should use precise language and clearly summarise the results of a trial, giving the key numerical findings.

Conclusions drawn from insufficient data.

Authors should provide sufficient data and the right kind of data to justify their conclusions. The actual values of p as well as SD and CI should be given. We should bear in mind that other comparative data, such as the odds ratio and relative risk have the potential to be misleading.

Poor description of methods and results

When you read through the description of the clinical trial, you should ensure that the methods and results are described accurately and in sufficient detail for a clear understanding of how the researchers conducted the trial and why they chose to adopt this methodology. Similarly, the results should be presented in a realistic way and include a discussion of the significance of the results and any possible limitations, cautions and caveats.

Specification of main and secondary outcomes

The main outcomes should be described in sufficient detail to be unambiguous. There should also be a description of secondary outcomes (if any), which should be proposed at the beginning of the trial and included in the trial design, rather than being added in at the end after the results have been collected and analysed.

Description of adverse events

There will always be some adverse events – in the treatment arm as well as the placebo arm of any trial. These should all be accurately described and recorded for comparison. Trials should always report both the benefits and the risks; the results of a trial should always include the frequency of all adverse events. Clinicians need to know the benefits as well as the risks of any treatment, so that these can be weighed against each other.

Publication bias and submission bias

It is well known (though difficult to prove with hard evidence) that studies with negative findings rarely get published; in fact, researchers generally do not even bother to submit such studies for publication, and the studies are often stopped before completion. These statements have appeared in several peer-reviewed papers but, as far as I have seen, they are never supported by any hard data. Even so, we can make use of this observation in the following ways. Firstly, when we can only find a single study (with a positive result) on a particular treatment, published several years ago with no other studies reported since, this should raise suspicion. Generally, we would expect other studies, perhaps larger or better designed, to be conducted subsequently, in the hope of gaining another positive result. If we cannot find any, we may be justified in presuming that ‘no news = no good news’, i.e. the treatment has not been found to work. Therefore, we should beware of treatments that are supported by only one study.

Generally, we should expect to find several studies that support a particular treatment, conducted a short time after publication of the initial positive result. Of course, this may not always be the case: if a treatment being tested is a non-pharmaceutical intervention that has the potential to supplant a widely used drug treatment, it may be difficult, if not impossible, for researchers to obtain the necessary funding for larger trials, e.g. lifestyle modification to manage gastric reflux. (28)

Validity of the diagnosis

The elephant in the room, particularly in regard to trials on treatments for depression, is the validity of the diagnosis. Are we looking at a single disease with the same cause in each case? Or, as with so many conditions, are we taking several different disorders with different etiologies and lumping them together because they share one particular symptom, which all too often is only vaguely defined? In the case of ‘depression’, the definition of ‘major depressive disorder’ has become so elastic that it now includes many people who are experiencing sadness due to a loss of some kind, and who tend to get over it within a few months. This would explain the relatively high rates of ‘remission’ or improvement seen in the placebo groups in trials on treatments for depression.

This is an area that may readily be exploited by vested interests. To continue the above example, industry-sponsored trials, in which the raw data have been sequestered (i.e. not published along with the trial report), may have a severe mismatch between subjects in the placebo and active treatment groups. It is possible that subjects whose depressed mood is long term may predominate in the placebo group, while those with more recent onset of depressive symptoms may predominate in the active treatment group. This type of placement would very likely give the advantage to drug treatment. Moreover, given the current definition of major depression, this arrangement is completely undetectable; indeed, even in the absence of any deliberate ‘stacking’ of the two groups, such a mismatch could also occur by chance. The same kind of thing may happen in crossover trials when patients on active treatment are changed to placebo (inevitably suffering withdrawal symptoms, which are classed as ‘depression relapse’) and the placebo patients who remitted are excluded from this part of the trial. (29, 30, 31)

Measurement of the effect size: accuracy and validity

In many trials, positive therapeutic outcomes include both complete remission and significant improvement. These are sometimes bundled together and reported as a single ‘positive result’. There are several issues here that require additional scrutiny. How is ‘clinical remission’ defined? What follow-up procedures are in place to establish whether or not the remission is maintained for a certain period of time, and whether or not ‘remission’ needs to be maintained by continuing with the therapeutic intervention, possibly indefinitely? Does the trial provide data about the numbers of subjects experiencing complete remission as well as the numbers with significant improvement? Is the level of ‘significant improvement’, as defined in the trial, the same as ‘clinical improvement’? How are these measured, and what is the margin for error in these measurements?

Let us take clinical trials on treatments for depression. Many of these trials use the Hamilton Depression Rating Scale (HDRS). However, serious issues regarding its validity have been raised, and it has been described by critics as psychometrically and conceptually flawed. (32) Moreover, when it is used, it should be administered and interpreted by a qualified and experienced psychiatrist; otherwise the margin for error is much greater. Notwithstanding these limitations, the appropriate definition of clinically significant improvement using the HDRS should be a decrease of 50% or more from the baseline score, equivalent to a 7–11 point reduction. (33) Unfortunately, the commonly accepted criterion in American and European trials is a reduction of 3 points from the baseline reading; and even then, most trials on SSRIs fail to achieve this. (15)
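The gap between the two criteria can be made concrete with hypothetical HDRS scores (the patients and numbers below are illustrative only):

```python
# Sketch: the strict 50%-reduction criterion vs the lax 3-point criterion
# for 'clinically significant improvement' on the HDRS. Hypothetical scores.
def improved(baseline, end, criterion):
    drop = baseline - end
    if criterion == "50%":
        return drop >= baseline * 0.5  # strict: at least halve the baseline score
    return drop >= 3                   # lax: any 3-point reduction

patients = [(22, 10), (20, 16), (24, 13)]  # (baseline, end-of-trial) HDRS scores
for b, e in patients:
    print(f"{b} -> {e}: 50% rule {improved(b, e, '50%')}, 3-point rule {improved(b, e, '3pt')}")
```

Two of the three hypothetical patients count as ‘improved’ only under the 3-point rule; a trial reported against the lax criterion can look far more successful than one held to the 50% standard.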

Table 1 Checklist for assessing a randomised controlled trial (RCT)

The Abstract | The main content is clearly described Purpose of the research outlined The relevance or importance of the work clearly stated Main outcomes given with sufficient data to support the conclusions |

Methodology | Methods described in sufficient detail. Rationale behind the choice of methodology is given |

Homogeneity between study groups | In terms of age, gender, severity of illness, duration of illness, previous treatments (and how long since stopping them before entering the trial), current medications, socio-economic factors and education level; if P values are given, you should ignore them. How widely are the pre-trial measurements (especially severity of the illness) are spread out – reflected in the SD for each group. Both the mean and the SD values should be similar between the study groups. |

Assessing the unknown confounding factors | Are there any potential confounding factors that have not been acknowledged by the researchers? |

Statistical tests – appropriate or not | Is the reason for applying a particular test clearly given? Is the raw data self-explanatory (i.e., the treatment is obviously more effective than placebo or other treatment) or do the data need to be put through a complex series of statistical tests to reveal the ‘true’ results of the trial? This should raise suspicion. |

Null hypothesis significance testing: P values and confidence intervals | Confidence intervals should always be given along with P values. Is the minimum effect size for clinical significance clearly stated? If a study does not clearly delineate the response level, above which you have clinically meaningful results, does not provide standard deviation values, or omits the confidence intervals – they are probably trying to conceal something. |

Dropouts and outliers | Does the study provide an intention-to-treat (ITT) analysis? Always check that the numbers of subjects analysed at the end are the same as the numbers enrolled at the beginning of the trial. Generally, dropouts should be counted as ‘treatment failed’. Subjects with effects that are extreme and unusual should also be included in the end of trial analysis, or the researchers should provide valid reasons why they were excluded. Large numbers of dropouts in the treatment arm may mean that the treatment causes unpleasant side effects and/or is ineffective. |

Within group comparisons | If the study gives ‘within-group paired test’, this is invalid and generally reflects a bias or vested interest. |

Specification of main and secondary outcomes | Are the main and secondary outcomes specified at the beginning of the trial. Are there any secondary outcomes that have been added after the trial results have been collected? These should not be taken as definite results; they are a new hypothesis that is yet to be verified. |

P value hacking | Post hoc hypothesizing is only useful as a rationale for having future trials involving the subgroup in question, not for generating additional results. |

Mistakenly inferring effect size from the P value. | P value is a function of the sample size; as you increase your sample size, the P value automatically decreases. ‘Statistical significance’ should not be taken to mean ‘clinical significance’. If the trial was deliberately continued with additional subjects so that the results could reach statistical significance, the treatment is most likely ineffective. |

Use of casual and imprecise language | Methods and results should be reported in precise and scientific language. The methods should be described clearly and in sufficient detail to be critically evaluated. |

Conclusions drawn from insufficient data | Does the data collected during the trial actually support the conclusions given in the report? Sometimes additional, unjustifiable conclusions are presented alongside the correct ones. |

Description of adverse events | Are the adverse events in all of the groups described clearly? |

Publication bias and submission bias | Are there any other studies that confirm the results of a particular trial? If you can’t find any, what are the possible reasons for this? |

Validity of the diagnosis and of severity measures | Is there a possibility that the diagnosis itself is not valid (e.g. depression, irritable bowel syndrome)? How accurate and how valid is the measurement or grading system for the disease being studied (e.g. the Hamilton rating scale for depression, especially when not administered by a psychiatrist)? |
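The point about P values and sample size can be illustrated numerically. The following is a minimal sketch (not from the source, assuming Python with NumPy and SciPy): it simulates a fixed, clinically trivial group difference of 0.1 standard deviations and shows that, as enrolment grows, the P value shrinks towards ‘significance’ while the confidence interval keeps honestly reporting the same small effect.

```python
# Sketch: a fixed, clinically trivial effect becomes "statistically
# significant" once the sample is large enough. The effect size of 0.1 SD
# and the sample sizes are illustrative assumptions, not from the source.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
effect = 0.1  # standardised mean difference, far below clinical relevance

for n in (20, 200, 2000, 20000):
    treated = rng.normal(effect, 1.0, n)
    control = rng.normal(0.0, 1.0, n)
    t, p = stats.ttest_ind(treated, control)
    # 95% CI for the difference in means (normal approximation)
    diff = treated.mean() - control.mean()
    se = np.sqrt(treated.var(ddof=1) / n + control.var(ddof=1) / n)
    lo, hi = diff - 1.96 * se, diff + 1.96 * se
    print(f"n={n:>6}  diff={diff:+.3f}  95% CI=({lo:+.3f}, {hi:+.3f})  p={p:.4f}")
```

At small n the result is ‘non-significant’; at very large n the same 0.1 SD difference yields a tiny P value, yet the confidence interval makes plain that the effect has not grown at all. This is why a confidence interval, read against a pre-stated minimum clinically important difference, is more informative than the P value alone.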

Epilogue

The opening paragraph of Leo Tolstoy’s novel, Anna Karenina, is a bold statement that has spawned several iterations of the ‘Anna Karenina principle’: ‘Happy families are all alike. Every unhappy family is unhappy in its own way.’ This glorious generalisation, laying claim to universal truth and placing Murphy’s law in its proper context, speaks to the notion that there are only a few ways to get something right and a seemingly unlimited number of ways to get it wrong. On reflection, the number of ways to ‘get it right’ or ‘achieve the desired outcome’ is strictly limited, while the number of different ways to err is several orders of magnitude greater. The comforting fact is that, since we live in a finite world, the number of possible mistakes should also be finite.

This review and summary of the ‘popular’ errors in contemporary medical research is current at the time of writing. Optimistically, the scientific community will correct them where possible, or learn to make allowances for them where unavoidable. Realistically, however, we should expect new ones to crop up on a regular basis; hopefully, we will come to the end of our finite supply of mistakes in the not-too-distant future.

### References

- Duke University website, Medical Center Library & Archives: Systematic Reviews: the process: Appraisal & Analysis. Retrieved 11 May 2020 from: https://guides.mclibrary.duke.edu/sysreview/analysis
- Berwanger, O., Ribeiro, R., Finkelsztejn, A., Watanabe, M., Suzumura, E., Duncan, B., Devereaux, P., Cook, D. (2009). The quality of reporting of trial abstracts is suboptimal: survey of major general medical journals. J Clin Epidemiol. 62(4):387-92.
- Pitkin, R., Branagan, M., Burmeister, L. (1999). Accuracy of data in abstracts of published research articles. JAMA. 281(12):1110-1.
- Consort 2010 Statement, Extensions – Abstracts. Retrieved 19 April 2020 from: http://www.consort-statement.org/extensions?ContentWidgetId=562
- Choi, S., Cheung, C. (2016). Don’t judge a book by its cover, don’t judge a study by its abstract. Common statistical errors seen in medical papers. Anaesthesia. 71. 10.1111/anae.13506.
- Islamaj Dogan, R., Murray, G., Névéol, A., Lu, Z. (2009). Understanding PubMed user search behaviour through log analysis. Database (Oxford): 2009, bap018.
- University of Melbourne. (2020). Academic Skills, Writing an Abstract. Retrieved 17 April 2020 from: http://www.services.unimelb.edu.au/academicskills
- Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Med. 2(8):e124.
- Evans, S. (2010). Common Statistical Concerns in Clinical Trials. J Exp Stroke Transl Med. 3(1)1-7.
- Krousel-Wood, M., Chambers, R., Muntner, P. (2006). Clinicians’ guide to statistics for medical practice and research: part I. The Ochsner J, 6(2), 68–83.
- Krousel-Wood, M., Chambers, R., Muntner, P. (2006). Clinicians’ guide to statistics for medical practice and research: part II. The Ochsner J, 7(1), 3–7.
- Hill, A. B. (1965). “The Environment and Disease: Association or Causation?”, Proceedings of the Royal Society of Medicine, 58, 295–300
- Rothman K, (2002). Epidemiology: An Introduction. Oxford University Press.
- Hengartner, M. (2017). Methodological Flaws, Conflicts of Interest, and Scientific Fallacies: Implications for the Evaluation of Antidepressants’ Efficacy and Harm. Front Psychiatry. 8:275.
- Jakobsen, J., Katakam, K., Schou, A., Hellmuth, S., Stallknecht, S., Leth-Møller, K., Iversen, M., Banke, M., Petersen, I., Klingenberg, S., Krogh, J., Ebert, S., Timm, A., Lindschou, J., Gluud, C. (2017). Selective serotonin reuptake inhibitors versus placebo in patients with major depressive disorder. A systematic review with meta-analysis and Trial Sequential Analysis. BMC psychiatry, 17(1), 58.
- Sackett, D., Rosenberg, W., Gray, J., Haynes, R., Richardson, W. (1996). “Evidence Based Medicine: What It Is and What It Isn’t”, BMJ, 312(7023), 71-72
- Straus, S. & Sackett, D. (1998). “Using Research Findings in Clinical Practice”, BMJ, 317, 339–42
- Rapaport, M., Nierenberg, A., Howland, R., Dording, C., Schettler, P., Mischoulon, D. (2011). The treatment of minor depression with St. John’s Wort or citalopram: failure to show benefit over placebo. J Psychiatr Res, 45(7), 931–941.
- Skelly, A., Dettori, J., Brodt, E. (2012). Assessing bias: the importance of considering confounding. Evid Based Spine Care J. 3(1):9-12.
- Lambert, J. (2011). Statistics in brief: how to assess bias in clinical studies? Clin Orthop Relat Res. 469(6):1794-6.
- Armitage, P. (1991). “Obituary: Sir Austin Bradford Hill, 1897-1991”, Journal of the Royal Statistical Society,154(3), 482–484
- Hill, A. B. (1966). “Reflections on the Controlled Trial”, Ann. Rheum. Dis., 25, 107-113
- George, B., Beasley, T., Brown, A., Dawson, J., Dimova, R., Divers, J., Goldsby, T., Heo, M., Kaiser, K., Keith, S., Kim, M., Li, P., Mehta, T., Oakes, J., Skinner, A., Stuart, E., Allison, D. (2016). Common scientific and statistical errors in obesity research. Obesity (Silver Spring, Md.), 24(4), 781–790.
- Szucs, D., Ioannidis, J. (2017). When Null Hypothesis Significance Testing Is Unsuitable for Research: A Reassessment. Front Hum Neurosci, 11: 390.
- Mills, J. (1993). Data Torturing. NEJM, 329:1196-99
- Gliner, J., Leech, N., Morgan, G. (2002) Problems With Null Hypothesis Significance Testing (NHST): What Do the Textbooks Say?, J Exp Educ, 71:1, 83-92
- Motulsky, H. (2014). Common Misconceptions about Data Analysis and Statistics. J Pharmacol Exp Ther. 351:200-205
- Randhawa, M., Mahfouz, S., Selim, N., Yar, T., Gillessen, A. (2015). An old dietary regimen as a new lifestyle change for Gastro esophageal reflux disease: A pilot study. Pak J Pharm Sci. 2015;28(5):1583-1586.
- Healy, D., Le Noury, J., Wood, J. (2020). Children of the Cure. Missing Data, Lost Lives and Antidepressants. Samizdat Health Writer’s Co-operative Inc. http://www.samizdathealth.org.
- Healy, D., Healy, D. (2012). Pharmageddon. Berkeley: University of California Press.
- Gøtzsche, P. (2013). Deadly Medicines and Organised Crime. How Big Pharma Has Corrupted Healthcare. London, UK: Radcliffe Publishing.
- Bagby, R., Ryder, A., Schuller, D., Marshall, M. (2004). The Hamilton Depression Rating Scale: has the gold standard become a lead weight? Am J Psychiatry.161(12):2163-77.
- Bobo, W., Angleró, G., Jenkins, G., Hall-Flavin, D., Weinshilboum, R., Biernacka, J. (2016). Validation of the 17-item Hamilton Depression Rating Scale definition of response for adults with major depressive disorder using equipercentile linking to Clinical Global Impression scale ratings: analysis of Pharmacogenomic Research Network Antidepressant Medication Pharmacogenomic Study (PGRN-AMPS) data. Hum Psychopharmacol. 31(3):185-192.