P threshold FAQ

Frequently Asked Questions - P-thresholds

1. What is the multiple-comparison problem? What is familywise error correction (FWE)?

To start, Nichols and Hayasaka (PthresholdPapers) provide an excellent introduction to the issue of FWE in neuroimaging in very readable fashion. You're encouraged to check it out.

Many scientific fields have had to confront the problem of assessing statistical significance in the context of multiple tests. With a single statistical test, the standard conventionally dictates a statistic is significant if it is less than 5% likely to occur by chance - a p-threshold of 0.05. But in fields like DNA microassays or neuroimaging, many thousands of tests are done at once. Each voxel in the brain constitutes a separate test, which usually means tens of thousands of tests for a given subject. If the conventional p-threshold of 0.05 is applied on a voxelwise basis, then, just by chance you're almost guaranteed to have many hundreds of false-positive voxels. In order to avoid any false positives, then, researchers generally correct their p-threshold to account for how many tests they're performing. This type of correction prevents Type I error across the whole family of tests you're doing - a familwise error correction, or FWE correction.

The standard approach to FWE correction has been the Bonferroni correction - simply divide the desired p-threshold by the number of tests, and you'll maintain correct control over the FWE rate. In general, the Bonferroni correction is a pretty conservative correction, and it suffers from a fatal flaw with neuroimaging data. The Bonferroni correction demands that all the tests be independent from each other, and that demand is manifestly not fulfilled in neuroimaging data, where there is a complex, substantial and generally unknown structure of spatial correlations in the data. Essentially, the Bonferroni correction assumes there are more spatial 'degrees of freedom' than there really are; one voxel is not independent from the next, and so one only needs to correct for the 'true' number of independent tests you're doing. This effort, though, is tricky, and so a good deal of theory has been developed on ways around Bonferroni-type corrections that still control the FWE at a reasonable level.

2. What is Gaussian random-field theory and how does it apply to FWE?

Worsley et. al (PthresholdPapers) is one of the first papers to link random-field theory with neuroimaging data, and that link has been tremendously productive in the years since. Random-field theory (RFT) corrections attempt to control the FWE rate by assuming that the data follow certain specified patterns of spatial variance - that the distributions of statistics mimic a smoothly varying random field. RFT corrections work by calculating the smoothness of the data in a given statistic image and estimating how unlikely it is that voxels (or clusters or patterns) with particular statistic levels would appear by chance in data of that local smoothness. The big advantages of RFT corrections are that they adapt to the smoothness in the data - with highly correlated data, Bonferroni corrections are far too severe, but RFT corrections are much more liberal. RFT methods are also computationally extremely efficient.

However, RFT corrections make many assumptions about the data which render the methods somewhat less palatable. Chief among these is the assumption that the data must have a minimum level of smoothness in order to fit the theory - at least 2-3 times the voxel size is recommended at minimum, and more is better. For those researchers unwilling to pay the cost in resolution that smoothing imposes, RFT methods are problematic. As well, RFT corrections are only available for statistics whose distributions in a random field have been laboriously calculated and derived - the common statistics fall in this category (F, t, minimum t, etc.), but ad hoc statistics can't be corrected in this manner. Finally, it's become clear (and Nichols and Hayasaka show in PthresholdPapers), that even with the assumptions minimally satisfied, RFT corrections tend to be too conservative.

Random-field theory corrections are available by default in SPM; in SPM99 or earlier, choosing a "corrected" p-threshold means using an RFT correction, while in SPM2, choosing the "FWE" correction to your p-threshold uses these methods. I don't believe corrections of this sort are available in AFNI or BrainVoyager.

3. What is false discovery rate (FDR)? How is it different from other types of multiple-comparison correction?

RFT methods may have their flaws, but some researchers have pointed out a different problem with the whole concept of FWE correction. FWE correction in general controls the error rate for the whole family; it guarantees that there's only a 5% chance (for example) of any false positives appearing in the data. This type of correction simply doesn't fit the intuition of many neuroimaging researchers, because it suggests that every voxel activated is a true active voxel, and most researchers correctly assume there's enough noise in every stage of the process to make a few voxels here and there look active just by chance. Indeed, it's rarely of crucial interest in a particular study whether one particular voxel is necessarily truly or falsely positive - most researchers are willing to accept that some of their signal is actually noise - but that level of inference is precisely what FWE corrections attempt to license.

Benjamini & Hochberg, faced with this conundrum, developed a new idea. Rather than controlling the FWE rate, what if you could control the amount of false-positive data you had? They developed a method to control the false discovery rate, or FDR. Genovese et. al (PthresholdPapers) recently imported this method specifically into neuroimaging. The idea in controlling the FDR is not to guarantee you have no false positives - it's to guarantee you only have a few. Setting the FDR control level to 0.05 will guarantee that no more than 5% of your active voxels are false positives. You don't know which ones they might be, and you don't even know if fully 5% are false positive. But no more than 5% are falsely active.

The big advantage of FDR is that is adapts to the level of signal present in the data. With small signal, the correction is very liberal. With huge signal, it's relatively more severe. This adaptation renders it more sensitive than an RFT correction if there's any signal present in the data. It allows a much more liberal threshold to be set than RFT, at a cost that most researchers have already mentally paid - a few false positive voxels. It requires almost no computational effort, and doesn't require laborious derivations to be used with new statistics.

FDR is not a perfect cure-all - it does require some assumptions about the level of spatial correlation in the data. At the outer bound, allowing any arbitrary correlation structure, it is only slightly more liberal than the equivalent RFT correction. But with looser assumptions, it's a great deal more liberal. Genovese et. al have argued that fMRI data in many situations fits a very loose set of assumptions, enabling a pretty liberal correction.

The latest edition of every major neuroimaging program provides some methods for FDR control - SPM2 and BrainVoyager QX have it built-in, and AFNI's 3dFDR program does the same work. Tom Nichols has predicted FDR methods will essentially replace most FWE correction methods within a few years, and they are beginning to be widely used throughout neuroimaging literature.

4. What is permutation testing? How is it different from other types of multiple-comparison correction?

Permutation testing is a form of non-parametric testing, and Nichols and Holmes give an excellent introduction to the field in their paper (PthresholdPapers), a much better treatment than I can give it here. But here's the extreme nutshell version. Permutation tests are a sensitive way of controlling FWE that make almost no assumptions about the data, and are related to the stats/CS concept of 'bootstrapping.'

The idea is this. You hope your experimental manipulation has had some effect on the data, and to the extent that it has, your design matrix is a model that explains the data pretty well, with large beta weights for the conditions of interest. But what if your design matrix had been different? What if you randomly re-labeled your trials, so that a trial that was actually an A trial in the real experiment was re-labeled as a B, and put into the design matrix as a B, and a B trial was re-labeled and modeled as a C trial, and a C as an A, and so forth. If your experiment had a big effect, the new, randomly mixed-up design matrix won't explain it well at all - if you re-ran your model using that matrix, you'd get much smaller beta weights. Of course, on the null hypothesis, there wasn't any effect at all due to your manipulation, which means the random design matrix should explain it just as well.

And now that you've re-labeled your design matrix and re-run your stats, you mix up the design matrix again, differently and do the same thing. And then do it again. And again, until you've run through all the possible permutations of the design matrix (or at least a lot of them). You'll end up with a distribution of beta weights for that condition from possible design matrices. And now you go back and look at the beta weight from your real experiment. If it's at the extreme end of that distribution you've created - congrats! You've got a significant effect for that condition. The idea in permutation testing is you don't make any assumptions about what the statistic distribution could be - you go out and empirically determine it, from your own real data.

But how does that help you with the multiple-comparison problem? One nice thing about permuation testing is that aren't restricted to testing significance for stats with known distributions, like t or F. We can use these on any ad hoc statistic we like. So let's do it across the design matrices, using as our statistic the maximal T: the value of the maximum T-statistic in the whole image for that design matrix. We come up with a distribution, just like before, and we can find the t-statistic that corresponds to the 5% most extreme parts of the maximal T distribution. And now, the clever bit: we go back to our real experiment's statistical map, and threshold it at that 5% level from the maximal T. Hopefully the t-statistics from our real experiment are generally so much higher than those from the random design matrices as to mean a lot of voxels in our real experiment will have t-statistics above that level - and we don't need to correct their significance at all, because anything in that extreme part of the maximal T distribution is guaranteed to be among the most extreme possible t-statistics for any voxel for any design matrix.

Permuation tests have the big advantages of making almost no (but not totally none - see Nichols and Holmes for details) assumptions about your data, which means they work particularly well with low degrees of freedom, where other methods' assumptions about the shape of their statistic's distribution can be violated. They also are extremely flexible - any true or ad hoc statistic can be tested, such as maximal T, or size of structure, or voxel's favorite color - anything. But they have a big disadvantage: computational cost. Running a permutation test involves re-estimating at least 20 models to be able to guarantee a 0.05 significance level, and so in SPM for individual data, that cost can be prohibitive. For other programs, the situation's not as bad, but it can still be pretty difficult to wait. Permuation tests are available at least in SPM99 with the SnPM toolbox, and in AFNI with the 3dMonteCarlo program. Not sure about BrainVoyager.

5. When should I use different types of multiple-comparison correction?

Nichols and Hayasaka's paper (PthresholdPapers) does an explicit review of various FWE correction methods (as well as FDR) on simulated and real data of a variety of smoothness levels and degrees of freedom, to judge how conservative or liberal different methods were. Their main findings are:

  • Random-field corrections are extremely conservative for all smoothnesses except the highest. This bias becomes stronger as the degrees of freedom go down, such that low-degree-of-freedom, low-smoothness images corrected with RFT methods show the worst underactivation. At the highest smoothness (8-12mm FWHM), they perform reasonably well for all df.
  • Permutation methods are almost exact for all degrees of freedom and for all smoothnesses. They become slightly better with data of high smoothness, but basically perform tremendously well under all conditions.
  • FDR is not strictly speaking intended to control FWE, but it does an excellent job doing so for low-smoothness data at all degrees of freedom. At high smoothnesses (6mm FWHM and greater), the correction becomes too conservative.

Accordingly, the nutshell recommendations are as follows:

  • Random-field methods are good for highly-smoothed data only and are best for single-subject data. For researchers who need a good deal of smoothing to collect significant signal, or who aren't particularly interested in very fine resolution, RFT corrections are quite exact and easily implemented for single subjects. At low degrees of freedom for any smoothness (say, less than 20 df), the RF corrections are generally too conservative for any smoothness.
  • For unsmoothed (or low-smoothed), single-subject data, FDR corrections are the best. They have very high sensitivity while still providing good control of false positives, even with low degrees of freedom. Group data tend naturally to be smoother than single-subject data, due to the blurring imposed by anatomical variability, and so may not be ideal for FDR corrections.
  • Permutation tests are optimized for group data - they perform perfectly at very low degrees of freedom, where other methods' assumptions are invalidated, and they improve slightly with high-smoothness data, although they still do fine with unsmoothed. In group testing, the permutation is whether each subject's t-statistic signs are true or flipped - presumably, if the mean is zero, flipping the sign of the statistic won't make a difference, but if the mean is nonzero, that flipping will matter. As well, the relative speed of estimating group models in most programs helps counter the increased computational cost of permutation testing in general.

6. What is small-volume correction?

All the FWE correction methods here adapt to the number of tests performed. The fewer tests, the less severe the correction, and in neuroimaging, the number of tests performed corresponds to the number of voxels or the volume corrected. So it's to your advantage when doing FWE correction to minimize the volume you're testing. If you have an a priori hypothesis about where you might see activation, like a particular anatomical structure or a particular area found to be active in another study, you might restrict your correction to only that area and be perfectly valid in only performing FWER correction there. In practice, this is often done when a particular activation is above the uncorrected threshold, but you'd like to report corrected statistics. You might also try it when you're using a corrected threshold to start, but not seeing any activation where you might expect some - you could restrict your correction to a smaller volume than the whole brain and suddenly get activation popping up above the new, small-volume-corrected threshold.

SPM has a shortcut to this sort of volume restriction - the small volume correction (or S.V.C.) button in the results interface. It'll let you re-calculate corrected p-statistics for a specified region only - an ROI mask image, or a sphere around a point, etc. This change won't change the uncorrected p-statistics for any activations, but it will make the corrected p-statistic for any activations in that region significantly better, depending on how big your specified region is.

Note that if you're using an uncorrected threshold to start, using S.V.C. won't show you anything new. This correction only re-jiggers the corrected p-statistic for a given region.

7. What do all the different reported values in my SPM table mean (p-corrected, p-uncorrected, cluster, set, etc.)? How are they calculated?

SPM reports a pair of p-statistics for each voxel, a p-statistic for each cluster, and a p-statistic for each set. At the voxel level, these are relatively self-explanatory. The p-uncorrected statistic is the probability that, by itself, a voxel with that t- (or F-)statistic would occur just by chance. This is the statistic that's used to threshold the brain using the uncorrected threshold, or "None" correction in SPM2. The p-corrected statistic is the probability of that same t-statistic, but corrected for FWE using Gaussian RFT methods. This statistic reflects the volume that's being corrected (and hence changes in small-volume-corrected regions).

The cluster and set values are more obscure and less useful - they're explained in detail in the Friston et. al (PthresholdPapers). Briefly, the cluster-level p-statistic is the probability that a cluster of that size would occur just by chance in data of the given smoothness. The key difference is that the activation of a cluster doesn't imply that any particular voxel in the cluster is active - you can't use that statistic to license inference that any one voxel in the cluster is above some threshold. The set-level p-statistic is similar, at the level of the whole brain; it's the probability that a pattern of activation of that size (number of clusters) would occur in data of the given smoothness. But it doesn't mean that any given cluster is active - it only tells you that there's some particular pattern of activation happening, in a regionally unspecific manner. Because both of these statistics are derived from Gaussian RFT theory, they're both, by definition, corrected p-statistics. But because neither of them license inference to any particular voxel, they're not widely used or cited.

8. What should my p-threshold be for some analysis X?

p < 0.05, corrected, remains the gold standard for any neuroimaging analysis. Because RFT corrections are so severe, though (and because other methods aren't widespread enough to challenge them), a de facto standard of p < 0.001 seems to be in operation these days a lot of the time. Depending on the type of analysis, you may be able to go even looser - group-level regressions are sometimes seen more loosely, such as p < 0.005, although there's not a particularly good reason for this.

Using FDR control instead of FWE correction is relatively new, so by default an FDR of 0.05 seems to be the current standard, but Benjamini & Hochberg, among others, have argued that a more liberal threshold in some situations may be reasonable - as high as 0.1 or even a bit higher.

For any type of non-voxel-based analysis, such as correlations of beta weights, etc., p < 0.05 is still the magic number for most reviewers.

9. What should my p-threshold be for conjunction analyses?

A good question. Check out the conjunction papers in ContrastsPapers for more detail, but the basic argument is simple. If a voxel that's active in a conjunction analysis simply has to be active in all of the component analyses, and you're thresholding the conjunction (not any component analyses), then the component analyses should have lower thresholds than the conjunction. Specifically, if you wanted to threshold the conjunction at p < 0.001, and you had two components to the conjunction, then you should threshold each of the components at sqrt(0.001). Any voxel active in both of those at that level will be less likely in the conjunction, so you can threshold each component at a very liberal level and be sure the conjunction's threshold will be quite stringent. In short - the conjunction threshold is the product of the component thresholds.

Many researchers, however, disagree with this line of reasoning. First, obviously, this argument depends on all of the components being independent - if they're dependent at all, then the product of the individual thresholds will be more stringent than the true conjunction threshold. Even if they're all independent, though, it's clear that using this line of argument means that any active voxel in the conjunction is a voxel that may well not be active at a "reasonable" threshold in any of the components. This problem is exacerbated with more than two components - with three, say, each component could be thresholded at p < 0.1 uncorrected, and the conjunction could have a threshold of p < 0.001. This flies in the face of what many people try to argue about their conjunctions, which is that they represent areas that are activated in all of their components. So many researchers use the strategy of simply thresholding their individual components at some liberal but reasonable threshold - p < 0.001, or p < 0.005 - and then simply assess the intersection of the active areas as the conjunction. This clearly results in extremely significant p-statistics in the conjunction, but it at least gets closer to the idea of "conjunction" that most researchers seem to have.

10. What should my p-threshold be for masked analyses?

If you're masking your one analysis with the results of an another analysis, you're basically doing a conjunction (see above), so you can liberalize your threshold at least a bit. If you're masking your analysis with a region of interest mask, anatomical or otherwise, you might also consider using a small volume correction and using p < 0.05 corrected as a threshold. If you're doing some other crazy kind of mask... well, you're kind of in uncharted waters. Start with something reasonable and go from there, and good luck to you.