Why, When and How to Adjust Your P Values?

Abstract

Currently, numerous papers are published reporting analyses of biological data at different omics levels by making statistical inferences. Of note, many studies, such as those published in this Journal, report associations of gene(s) at the genomic and transcriptomic levels by undertaking appropriate statistical tests. For instance, genotype, allele or haplotype frequencies at the genomic level, or normalized expression levels at the transcriptomic level, are compared between the case and control groups using the Chi-square/Fisher's exact test or the independent (i.e. two-sample) t test, respectively, with this culminating in a single number, namely the P value (or the degree of the false positive rate), which is used to make or break the outcome of the association test. This approach has flaws but nevertheless remains a standard and convenient approach in association studies. However, a critical issue arises when 'multiple' tests are undertaken on the same case-control (or any pairwise) comparison while the same cut-off is retained. Here, in brief, we present what the P value represents, and why and when it should be adjusted. We also show, with worked examples, how to adjust P values for multiple testing in the R environment for statistical computing (http://www.R-project.org).

Keywords: Bias, Gene Expression Profiling, Genetic Variation, Research Design, Statistical Data Analyses

Cell Journal (Yakhteh), Vol 20, No 4, Jan-Mar (Winter) 2019, Pages: 604-607

Biological data are currently being generated on a massive scale, which has resulted not only in an avalanche of raw data, but has also led to the testing of multiple hypotheses. To test these hypotheses, inferential statistics is applied to relevant sample datasets, leading to further biological insights and possible discoveries. Essentially, hypothesis testing is a statistical method which computes the probability of the strength of evidence, based on the sampled data, for or against the null (i.e. no difference or no change) hypothesis, culminating in a single number, namely the P value. Here, we discuss P values and, more importantly, with a focus on association studies, why, when and how they should be adjusted. We hope that this short guide results in more accurate reporting of P values and the respective inferences.

What is a P value?

When you want to statistically infer whether a result is significant, you quantify the probability of that result occurring by pure random chance given the null hypothesis. A historical and intuitive cut-off to reject the null hypothesis (and thus declare a meaningful, non-random event) is 0.05 (1). Accordingly, if the P value from testing the null hypothesis of equality of the mean normalized expression levels of gene X in the case and control groups (µ1, µ2) is <0.05, one would say (entirely arbitrarily) that it is a eureka moment, shrugging off (rejecting) the null hypothesis (µ1=µ2) and embracing (accepting) the alternative hypothesis (µ1≠µ2). However, what we are actually quantifying is the probability of observing data as or more extreme than what we have observed, given that the null hypothesis is true (2-4). Meanwhile, it should be noted that in statistical hypothesis testing, we should not only report the P value, but also include the power of the test, confidence intervals and effect size (5-8).
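To see where such a raw P value comes from in practice, below is a minimal sketch in R of the two-group comparison described above. The data, sample sizes, group means and seed are our assumptions for illustration, not from any real study.

set.seed(42)                                       # assumed seed, for reproducibility
cases    <- rnorm(30, mean = 1.2)                  # simulated normalized expression, case group
controls <- rnorm(30, mean = 1.0)                  # simulated normalized expression, control group
t.test(cases, controls, var.equal = TRUE)$p.value  # raw P value of the independent t test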

P value issues

There is considerable controversy surrounding the position of the P value in scientific inference, and this has become even more heightened by the emergence of big data analysis, revolving mainly around its misunderstanding and misuse (9, 10). The first flaw is that the 0.05 cut-off is completely arbitrary and merely a convention. This, therefore, indicates that this value is not necessarily appropriate for all variables and for all research settings. For instance, in disease association studies, a more stringent cut-off of 0.01 is recommended. Moreover, two common biases further affect the integrity of research findings, namely selective reporting and P-hacking (7). In brief, selective reporting refers to the bias whereby negative results (i.e. non-significant P values) are substantially under-reported. This bias is apparent in the skewed distribution of reported results toward positive findings (11). In contrast, P-hacking describes the biased selection of data so that non-significant results appear significant when this is desirable; although each individual step may be technically defensible, it is a largely under-recognized form of direct data manipulation (12).

The multiple testing issue

Assuming that all the flaws mentioned are addressed, the last but most important issue that remains in P value quantification arises when multiple testing occurs. But what constitutes multiplicity? Imagine a scenario where the expression of twenty genes at the transcript level has been compared between a fixed set of cases and controls or, at the genomic level, genotype/allele frequencies of twenty single nucleotide polymorphisms (SNPs) have been compared. By pure chance, assuming independence of tests, one would expect, on average, one in twenty of the transcripts or SNPs to appear significant at the 5% level. This is because the probability of a false positive in this scenario is now inflated, and this clearly requires adjusting the original single-test significance level of 0.05. In other words, the probability of observing a false positive (i.e. type I error) across all tests undertaken should not exceed the 5% level (2). This issue has become ever more apparent since the emergence of omics science, in which large numbers of independent variables are tested simultaneously and computing the fraction of true positives is crucial (5). As a simple calculation, suppose the probability of a type I error in a single test is α_single = 5×10⁻². The probability of not observing a type I error in a single test is then p_single = 1−α = 1−5×10⁻² = 0.95. Accordingly, the probability of not observing a type I error in multiple (e.g. m=20) tests is p_multiple = (1−α)^m = (1−5×10⁻²)²⁰ ≈ 0.36, and thus α_multiple = 1−(1−α)^m ≈ 0.64, showing the substantial increase in type I error after multiple testing. As the number of tests grows, the inflated type I error rate (α_multiple) approaches 1. For instance, α_multiple = 0.9941 if α = 0.05 and m = 100.
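These quantities are straightforward to reproduce; a minimal sketch in R:

alpha <- 0.05
m <- 20
1 - (1 - alpha)^m    # inflated type I error rate for 20 independent tests
## [1] 0.6415141
1 - (1 - alpha)^100  # for 100 tests, approaching 1
## [1] 0.9940795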

So how ought one to correct this inflation of the false positive rate? The first solution is to control the type I error by lowering the significance threshold (i.e. calculating α′). Say the probability of a type I error in a single test is the standard α_single = α′. The probability of not observing a type I error in a single test is then p_single = 1−α′. For m independent tests, this probability would be p_multiple = (1−α′)^m. Next, the probability of a type I error for multiple tests is α_multiple = 1−(1−α′)^m. Rearrangement of the equation leads to the approximated Bonferroni correction for multiple testing, α′ ≈ α/m. Following the same scenario, the α′ for each of the twenty tests would be 0.05/20 = 2.5×10⁻³. By applying the same rule, when 1,000,000 SNPs are tested in a genome-wide association study (GWAS), α′ would be 5×10⁻⁸, and when expression dysregulation is examined for 20,000 genes on a whole-transcriptome microarray, α′ would be 2.5×10⁻⁶.
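The same α′ thresholds in R, for the three study scales mentioned above:

alpha <- 0.05
alpha / 20     # twenty transcripts or SNPs
## [1] 0.0025
alpha / 1e6    # GWAS-scale threshold
## [1] 5e-08
alpha / 20000  # whole-transcriptome microarray
## [1] 2.5e-06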

How to adjust P values?

Here we provide worked examples for the two most commonly used methods, without in-depth mathematical detail and formulae. This approach, adjusting the P values themselves, is analytically more convenient than the first method, in which, after setting an adjusted threshold, raw P values have to be checked against α′ one at a time. The function used here is p.adjust from the stats package in R. Imagine you have tested the level of gene dysregulation between two groups (e.g. cases and controls) for ten genes at the transcript level, and below is the vector of raw P values obtained by implementing the independent t test (assuming normality of the expression data).

P_values <- c(0.0001, 0.001, 0.006, 0.03, 0.095, 0.117, 0.234, 0.552, 0.751, 0.985)

Bonferroni

The simplest way to adjust your P values is to use the conservative Bonferroni correction method, which multiplies the raw P values by the number of tests m (i.e. the length of the vector P_values), capping the result at 1. Using the p.adjust function with the 'method' argument set to "bonferroni", we get a vector of the same length but with adjusted P values. This adjustment controls the family-wise error rate, i.e. the probability of at least one false positive (FWER = P(false positives ≥ 1)).

p.adjust(P_values, method = "bonferroni")
## [1] 0.001 0.010 0.060 0.300 0.950 1.000 1.000 1.000 1.000 1.000

The results show that only two out of ten genes remain significantly dysregulated. Had we not undertaken this multiple testing correction, we would have reported significant dysregulation for another two genes. This correction method is the most conservative of all and, due to its strict filtering, potentially increases the false negative rate (5), which simply means discarding true positives along with the false ones.
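For transparency, the same adjustment can be reproduced by hand; a minimal sketch:

pmin(1, P_values * length(P_values))  # multiply each raw P value by m, cap at 1
## [1] 0.001 0.010 0.060 0.300 0.950 1.000 1.000 1.000 1.000 1.000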

Benjamini and Hochberg

A philosophically different and more powerful adjustment method is that proposed by Benjamini and Hochberg (13). This method, rather than controlling the family-wise error rate (FWER) as the Bonferroni method does, controls the false discovery rate (FDR = E[false positives/(false positives + true positives)]). In other words, the FDR is the expected proportion of false positives among all tests that rejected the null hypothesis, not among all the tests undertaken. In the FDR method, P values are ranked in ascending order and multiplied by m/k, where k is the position of a P value in the sorted vector and m is the number of independent tests.

p.adjust(P_values, method = "fdr")
## [1] 0.001 0.005 0.020 0.075 0.190 0.195 0.334 0.690 0.834 0.985

A quick comparison of the results shows that FDR identifies one more dysregulated gene than the Bonferroni method. This third gene (corrected P=0.02) is what would be called a false negative under the conservative Bonferroni method, as it loses significance there but remains significant under FDR.
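To make the ranking-and-scaling step concrete, below is a minimal manual sketch of the Benjamini-Hochberg computation, including the monotonicity step that p.adjust also applies but the simplified m/k description above omits. Since our P_values vector is already sorted in ascending order, the output order matches that of p.adjust.

m  <- length(P_values)
bh <- sort(P_values) * m / seq_len(m)  # multiply the k-th smallest P value by m/k
bh <- pmin(1, rev(cummin(rev(bh))))    # enforce non-decreasing adjusted values, cap at 1
round(bh, 3)
## [1] 0.001 0.005 0.020 0.075 0.190 0.195 0.334 0.690 0.834 0.985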

To better compare these two multiple testing correction methods, a large array of random P values (n=500) was adjusted (Fig. 1). The frequency distributions show that the Bonferroni method dramatically reduces the number of significant P values and substantially increases the number of large (close or equal to 1) P values. The FDR method, by contrast, retains more significant P values while increasing non-significant P values, with a peak at around P=0.8. This is consistent with the correlation between raw and FDR-adjusted P values being higher than for any other pairwise combination. Although a number of different multiple testing correction methods exist (for instance, see the p.adjust documentation in R or permutation-based correction methods), the most preferable approach is controlling the FDR, as it not only reduces false positives but also minimises false negatives.


Fig.1

Comparison of the two multiple testing adjustment methods in a matrix plot. The distribution of 500 random P values before and after adjustment is represented on the diagonal. The upper and lower triangles show the pairwise correlation coefficients and scatter plots between raw and adjusted P values, respectively.
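The comparison in Fig.1 is easy to reproduce in spirit; a minimal sketch follows, in which the seed, the use of uniform random numbers and the plotting layout are our assumptions rather than the authors' exact code:

set.seed(1)                                   # assumed seed, for reproducibility
raw  <- runif(500)                            # 500 random P values under the null
bonf <- p.adjust(raw, method = "bonferroni")
fdr  <- p.adjust(raw, method = "fdr")
par(mfrow = c(1, 3))                          # three histograms side by side
hist(raw,  main = "Raw",        xlab = "P value")
hist(bonf, main = "Bonferroni", xlab = "Adjusted P value")
hist(fdr,  main = "FDR",        xlab = "Adjusted P value")
cor(raw, fdr)                                 # raw vs FDR-adjusted correlation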

The take-home message is that it does not matter whether you are interested in identifying a significant association with SNPs, differentially expressed genes (DEGs) or enriched gene ontology (GO) terms: the moment you conduct multiple tests on the same samples or gene sets, it is essential to address the multiple testing issue, either by adjusting the overall false positive rate through calculating α′ or by adjusting your raw P values (as shown here with Bonferroni or FDR), so that true positives can be teased out. This will no doubt enhance the reliability and reproducibility of research findings.

Acknowledgments

The authors have no financial support to disclose with respect to this manuscript. The authors declare no conflict of interest.

Author’s Contributions

N.A-P.; Conceived and planned the overall structure of the paper. M.J.; Carried out the computational analysis. Both authors discussed the main conceptual ideas to be presented, contributed to the writing of the manuscript and approved the final draft. All authors read and approved the final manuscript.

References

1. Fisher RA. Tests of significance in harmonic analysis. Proc R Soc A Math Phys Eng Sci. 1929;125(796):594–599.

2. Balding DJ. A tutorial on statistical methods for population association studies. Nat Rev Genet. 2006;7(10):781–791.

3. Cohen J. The earth is round (p<.05): rejoinder. Am Psychol. 1995;50(12):1103.

4. Sterne JA, Davey Smith G. Sifting the evidence-what's wrong with significance tests? BMJ. 2001;322(7280):226–231.

5. Lee JK. Statistical bioinformatics. 1st ed. New Jersey: John Wiley & Sons Inc; 2010.

6. Ranstam J. Why the P-value culture is bad and confidence intervals a better alternative. Osteoarthritis Cartilage. 2012;20(8):805–808.

7. Vidgen B, Yasseri T. P-values: misunderstood and misused. Front Phys. 2016.

8. Lee DK. Alternatives to P value: confidence interval and effect size. Korean J Anesthesiol. 2016;69(6):555–562.

9. Colquhoun D. An investigation of the false discovery rate and the misinterpretation of P values. R Soc Open Sci. 2014;1(3):140216.

10. Wasserstein RL, Lazar NA. The ASA's statement on p-values: context, process, and purpose. Am Stat. 2016;70(2):129–133.

11. Franco A, Malhotra N, Simonovits G. Publication bias in the social sciences: unlocking the file drawer. Science. 2014;345(6203):1502–1505.

12. Jager LR, Leek JT. An estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics. 2014;15(1):1–12.

13. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B. 1995;57(1):289–300.
