Power analysis and effect size in mixed effects models: A tutorial

We’ve published the outcome of 4 years of study and computer simulations on the power of designs that include more than one observation per condition per participant. A problem with the current studies on the replication crisis is that power is always calculated on the assumption that each participant provides only one observation per condition. This is not what happens in experimental psychology, where participants respond to multiple stimuli per condition and where the data are averaged per condition or (preferably) analyzed with mixed effects models.

Main findings

In a nutshell, these are our findings:

  1. In experimental psychology we can do replicable research with 20 participants or fewer if we have multiple observations per participant per condition, because we can turn rather small differences between conditions into effect sizes of d > .8 by averaging across observations (as has indeed been known to psychophysicists for almost a century). This is the positive outcome of the analyses.

  2. The more sobering finding is that the required number of observations is higher than the numbers currently used (which is why we run underpowered studies). The ballpark figure we propose for RT experiments with repeated measures is 1600 observations per condition (e.g., 40 participants and 40 stimuli per condition).

  3. The figure of 1600 observations applies when you start a new line of research and don’t know what to expect. The article gives you the tools to optimize your design once you’ve run the first study.

  4. Standardized effect sizes in analyses over participants (e.g., Cohen’s d) depend on the number of stimuli that were presented. Hence, you must include the same number of observations per condition if you want to replicate the results. The fact that the effect size depends on the number of stimuli also has implications for meta-analyses.
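To make points 1 and 4 concrete, here is a minimal sketch of how the standardized effect size in an analysis over participants grows with the number of trials per condition. It is not the simulation code from the article. It assumes a simplified model in which each participant's difference score has variance equal to the variance of the true effect across participants (the random slope) plus twice the residual variance divided by the number of trials, and it ignores variability across stimuli. The function name and all numbers (a 16 ms effect, a slope SD of 10 ms, a residual SD of 150 ms) are illustrative assumptions, not values from the article.

```python
import math


def dz_from_trials(effect_ms, slope_sd_ms, residual_sd_ms, n_trials):
    """Standardized effect size (mean difference / SD of difference scores)
    when each participant contributes n_trials per condition.

    Assumed model: Var(difference score) = slope_sd^2 + 2 * residual_sd^2 / n_trials.
    Item variability is ignored in this simplified sketch.
    """
    diff_var = slope_sd_ms ** 2 + 2 * residual_sd_ms ** 2 / n_trials
    return effect_ms / math.sqrt(diff_var)


# Hypothetical values chosen only for illustration.
for n in (1, 10, 40, 100, 400):
    d = dz_from_trials(effect_ms=16.0, slope_sd_ms=10.0,
                       residual_sd_ms=150.0, n_trials=n)
    print(f"{n:4d} trials per condition: d = {d:.2f}")
```

Under these assumptions the same 16 ms effect yields a different d for every number of trials, which is why a replication needs the same number of observations per condition. The effect size cannot grow without limit: as the number of trials increases it approaches the raw effect divided by the slope SD (16/10 = 1.6 here).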

If you use the article please refer to it as follows:

  • Brysbaert, M. and Stevens, M. (2018). Power Analysis and Effect Size in Mixed Effects Models: A Tutorial. Journal of Cognition, 1: 9, 1–20, DOI: https://doi.org/10.5334/joc.10.

Power for other models

Because we got many questions on power after writing the manuscript (and people rarely appreciated the answers we gave), we decided to write a prequel dealing with power requirements for simple designs. You can find the text here (Brysbaert, 2019).

Missed studies in the article

After the publication of the article, it became clear that other researchers had already noticed the relationship between number of stimuli and standardized effect size. Usually this was framed in a negative way (i.e., the effect sizes are overestimated when based on the average of multiple observations), without paying attention to the more positive side for power. Here are some pointers:

  • Brand et al. (2010) already noticed the relationship between number of stimuli per condition and standardized effect sizes. They additionally point to the importance of the correlation between the observations: The higher the correlation, the less multiple observations will increase the standardized effect size (and arguably the less they will help to make the study more powerful).

  • Richard Morey (2016) also noticed that the standardized effect sizes in F1 analyses depend on the number of observations per condition. Maybe the effect size proposed by Westfall et al. is the preferred measure for future use? Alternatively, in reaction time experiments nothing may be more informative than the raw effect in milliseconds.

  • There was an interesting observation by Jeff Rouder pointing to the increased power of experiments with multiple observations. His rule of thumb (if you run within-subject designs in cognition and perception, you can often get high powered experiments with 20 to 30 people so long as they run about 100 trials per condition) agrees quite well with the norm we put forward (a properly powered reaction time experiment with repeated measures has at least 1,600 word observations per condition). With 2000-3000 observations per condition you have a high powered experiment; with 1600 you have a properly powered experiment. Within limits (say a lower limit of 20), in most experiments the numbers of trials and participants can be exchanged, depending on how difficult it is to create items or to find participants.

More recent publications of interest

Kolossa & Kopp (2018) report that for model testing in cognitive neuroscience it is more important to obtain extra data per participant than testing more participants.

Rouder & Haaf (2018) published an article that nicely complements ours. They make a theoretical analysis of when extra trials improve power. The basic message is that extra participants are always better than extra trials. However, the degree to which this is the case depends on the phenomenon you are investigating. If there is great interindividual variation in the effect and if the variation is theoretically expected, you need many participants rather than many trials (of course). This is true for many experiments in social psychology. In contrast, when the effect is expected to be present in each participant and when trial variability is larger than the variability across participants, you can trade people for trials. These conditions were met for the priming studies we discussed. No participant was expected to show a negative orthographic priming effect (faster lexical decision times after unrelated primes than after related primes), and the variability in the priming effect across participants (and stimuli) was much smaller than the residual error. These conditions are true for many robust effects investigated in cognitive psychology, in particular for those investigated with reaction times. Indeed, many studies in cognitive psychology address the borderline conditions of well-established effects (to make a distinction between alternative explanations).
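The trade-off between people and trials can be illustrated with a rough calculation. The sketch below is not taken from Rouder & Haaf or from our article. It assumes the same simplified participant-level model as before (variance of a difference score = slope variance + 2 × residual variance / number of trials, no item variability) and uses a normal approximation to the power of a two-sided paired test, which is slightly optimistic for small samples. The function name and all numbers are hypothetical.

```python
import math
from statistics import NormalDist


def approx_power(n_participants, n_trials, effect_ms,
                 slope_sd_ms, residual_sd_ms, alpha=0.05):
    """Normal-approximation power of a two-sided paired test on
    participant means (assumed model, item variability ignored)."""
    diff_sd = math.sqrt(slope_sd_ms ** 2 + 2 * residual_sd_ms ** 2 / n_trials)
    ncp = effect_ms / diff_sd * math.sqrt(n_participants)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    std_normal = NormalDist()
    return std_normal.cdf(ncp - z_crit) + std_normal.cdf(-ncp - z_crit)


# Three designs with the same 1600 observations per condition.
designs = [(80, 20), (40, 40), (20, 80)]
# Hypothetical scenarios: little vs. much variation in the effect across people.
scenarios = {"slope SD 10 ms": 10.0, "slope SD 40 ms": 40.0}

for label, slope_sd in scenarios.items():
    for n_participants, n_trials in designs:
        power = approx_power(n_participants, n_trials, effect_ms=16.0,
                             slope_sd_ms=slope_sd, residual_sd_ms=150.0)
        print(f"{label}: {n_participants} participants x {n_trials} trials "
              f"-> power ~ {power:.2f}")
```

With these assumed values, more participants are better in both scenarios, but when the effect varies little across people the three designs differ only slightly, whereas with large variation across people the design with few participants loses a lot of power.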

Another article warning against being too cheap on the number of trials per condition was published by Boudewyn et al. (2018). If you look at their small effect sizes (remember these are the ones we are after most of the time!), the recommendation of 40 participants and 40 trials per condition seems to hold for EEG research as well.

Nee (2019) nicely describes how extra runs improve the replicability of fMRI data, even with rather small sample sizes (n = 16). This is the good old psychophysics approach.

Inconsistencies in underpowered fMRI studies are nicely described by Munson & Hernandez (2019), who started from a large sample (as we did) and looked at what would have been found in smaller samples. Well worth a read! Another article worth reading is Ramus et al. (2018), who document the many inconsistencies in fMRI research on dyslexia and convincingly relate this to the problem of underpowered studies.

Our article does not deal with interactions. A nice blog by Roger Giner-Sorolla (based on work by Uri Simonsohn) indicates that for an extra variable with 2 levels, it is advised to multiply the number of observations by at least 4 if you want to draw meaningful conclusions about the interaction (see also Brysbaert, 2019). So, beware of including multiple variables in your study. Is the interaction really needed to test your hypothesis?

Power of interactions also features in a review paper on power issues by Perugini et al. (2018).

Goulet & Cousineau (2019) discuss how you can use the reliability of your dependent variable to determine the best ratio of number of trials vs. number of participants (a message also in Brysbaert, 2019).

