Statistics: Sex difference analyses under scrutiny

A survey reveals that many researchers do not use appropriate statistical analyses to evaluate sex differences in biomedical research.
Colby J Vorland (corresponding author)
Department of Applied Health Science, Indiana University School of Public Health, United States

Scientific research requires the use of appropriate methods and statistical analyses; otherwise, results and interpretations can be flawed. How research outcomes differ by sex, for example, has historically been understudied, and only recently have policies been implemented to require such consideration in the design of a study (e.g., NIH, 2015).

Over two decades ago, the renowned biomedical statistician Doug Altman labeled methodological weaknesses a “scandal”, raising awareness of shortcomings related to the representativeness of research as well as inappropriate research designs and statistical analysis (Altman, 1994). These methodological weaknesses extend to research on sex differences: simply adding female cells, animals, or participants to experiments does not guarantee an improved understanding of this field of research. Rather, the experiments must also be designed and analyzed appropriately to examine such differences. While guidance exists for proper analysis of sex differences (e.g., Beltz et al., 2019), the frequency of errors in published research articles related to this topic has not been well understood.

Now, in eLife, Yesenia Garcia-Sifuentes and Donna Maney of Emory University fill this gap by surveying the literature to examine whether the statistical analyses used in research articles are appropriate to support conclusions of sex differences (Garcia-Sifuentes and Maney, 2021). Drawing from a previous study that surveyed articles on mammals from nine biological disciplines (Woitowich et al., 2020), Garcia-Sifuentes and Maney sampled 147 articles that included both males and females and performed an analysis by sex.

Over half of the articles surveyed (83, or 56%) reported a sex difference. Garcia-Sifuentes and Maney examined the statistical methods used to analyze sex differences and found that over a quarter (24 out of 83) of these articles did not perform or report a statistical analysis supporting the claim of a sex difference. A factorial design with sex as a factor, in which each sex receives each treatment option (such as a treatment or control diet; see Figure 1A), is an appropriate way to examine sex differences in response to treatment. A slight majority of all articles (92, or 63%) used a factorial design. Within the articles using a factorial design, however, fewer than one third (27) applied and reported a method appropriate to test for sex differences (e.g., testing for an interaction between sex and the exposure, such as different diets; Figure 1B). Similarly, within articles that used a factorial design and concluded a sex-specific effect, fewer than one third (16 out of 53) used an appropriate analysis.
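The interaction test mentioned above can be sketched numerically. The following minimal Python example, which uses only the standard library, computes the sex-by-treatment interaction contrast (the treatment effect in females minus the treatment effect in males) and its t statistic for a 2 x 2 factorial design. The data values, the variable names, and the helper `interaction_t` are all hypothetical and are not taken from Garcia-Sifuentes and Maney; the sketch also assumes equal variances across the four groups, as a standard two-way ANOVA does.

```python
import math
from statistics import fmean

# Hypothetical outcome values (e.g. change in body weight), invented purely
# for illustration. Keys are (sex, diet) cells of the 2 x 2 factorial design.
data = {
    ("female", "control"):   [1.0, 1.4, 0.8, 1.2],
    ("female", "treatment"): [2.9, 3.3, 3.1, 2.7],
    ("male", "control"):     [1.1, 0.9, 1.3, 1.1],
    ("male", "treatment"):   [1.6, 1.2, 1.4, 1.8],
}


def interaction_t(cells):
    """Return (contrast, t statistic, degrees of freedom) for the
    sex-by-treatment interaction in a 2 x 2 design."""
    means = {cell: fmean(values) for cell, values in cells.items()}

    # Pooled within-cell variance (the ANOVA mean squared error).
    ss_within = sum(
        (x - means[cell]) ** 2 for cell, values in cells.items() for x in values
    )
    df = sum(len(values) for values in cells.values()) - len(cells)
    mse = ss_within / df

    # Treatment effect within each sex, then the difference between them.
    effect_female = means[("female", "treatment")] - means[("female", "control")]
    effect_male = means[("male", "treatment")] - means[("male", "control")]
    contrast = effect_female - effect_male

    # Standard error of the difference of differences.
    se = math.sqrt(mse * sum(1.0 / len(values) for values in cells.values()))
    return contrast, contrast / se, df


contrast, t_stat, df = interaction_t(data)
print(f"interaction contrast = {contrast:.2f}, t = {t_stat:.2f}, df = {df}")
```

In a 2 x 2 design the square of this t statistic equals the F statistic for the interaction term in a two-way ANOVA, so a p-value can be obtained from a t distribution with the returned degrees of freedom using any statistics package.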

Figure 1. Considering sex differences in experimental design.

(A) A so-called factorial design permits testing of sex differences. For example, both female (yellow boxes) and male mice (blue boxes) are fed either a treatment diet (green pellets) or control diet (orange pellets). Garcia-Sifuentes and Maney found that 63% of articles employed a factorial design in at least one experiment with sex as a factor. (B) An appropriate way to statistically test for sex differences is with a two-way analysis of variance (ANOVA). If a statistically significant interaction is observed between sex and treatment, as shown in the figure, evidence for a sex difference is supported. Garcia-Sifuentes and Maney found that in studies using a factorial design, fewer than one third tested for an interaction between sex and treatment. (C) Performing a statistical test between the treatment and control groups within each sex, and comparing the nominal statistical significance, is not a valid method to look for sex differences. Yet, this method was used in nearly half of articles that used a factorial design and concluded a sex-specific effect.

Notably, nearly half of the articles (24 out of 53) that concluded a sex-specific effect statistically tested the effect of treatment within each sex and compared the resulting statistical significance. In other words, when one sex had a statistically significant change and the other did not, the authors of the original studies concluded that a sex difference existed. This approach, sometimes called the ‘differences in nominal significance’ (DINS) error (George et al., 2016; Figure 1C), is invalid and has occurred for decades across several disciplines, including neuroscience (Nieuwenhuis et al., 2011), obesity and nutrition (Bland and Altman, 2015; George et al., 2016; Vorland et al., 2021), and more general areas (Gelman and Stern, 2006; Makin, 2019; Matthews and Altman, 1996; Sainani, 2010).

This approach is invalid because testing within each sex separately inflates the probability of falsely concluding that a sex-specific effect is present, compared with testing the difference between the sexes directly. Other inappropriate analyses identified in the survey included testing sex within treatment and ignoring control animals; not reporting results after claiming to do an appropriate analysis; or claiming an effect when the appropriate analysis was not statistically significant, despite subscribing to null hypothesis significance testing. Finally, when articles pooled the data of males and females together in their analysis, about half of them did not first test for a sex difference, potentially masking important differences.
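A small simulation can illustrate why comparing nominal significance is misleading. The sketch below, in Python with only the standard library, generates data in which the treatment has exactly the same true effect in both sexes, so any claimed sex difference is a false positive. It then compares how often the DINS approach (one sex significant, the other not) and a direct test of the interaction make that claim. The sample size, effect size, known standard deviation, and the function name `simulate` are assumptions chosen for illustration, and z-tests with a known standard deviation stand in for the t-tests or ANOVA that a real analysis would use.

```python
import math
import random
from statistics import NormalDist, fmean


def two_sided_p(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def simulate(n_sims=2000, n=20, effect=0.62, sigma=1.0, alpha=0.05, seed=1):
    """Return (DINS false-positive rate, interaction-test false-positive rate)
    when the true treatment effect is identical in females and males."""
    rng = random.Random(seed)
    se_within = sigma * math.sqrt(2.0 / n)       # SE of treated - control
    se_interaction = sigma * math.sqrt(4.0 / n)  # SE of difference of differences
    dins_hits = 0
    interaction_hits = 0

    for _ in range(n_sims):
        # Observed treatment effect within each sex (same true effect in both).
        diffs = []
        for _sex in ("female", "male"):
            control = [rng.gauss(0.0, sigma) for _ in range(n)]
            treated = [rng.gauss(effect, sigma) for _ in range(n)]
            diffs.append(fmean(treated) - fmean(control))

        # DINS: claim a sex difference if exactly one sex is "significant".
        sig_female = two_sided_p(diffs[0] / se_within) < alpha
        sig_male = two_sided_p(diffs[1] / se_within) < alpha
        if sig_female != sig_male:
            dins_hits += 1

        # Direct test: is the difference between the two effects significant?
        if two_sided_p((diffs[0] - diffs[1]) / se_interaction) < alpha:
            interaction_hits += 1

    return dins_hits / n_sims, interaction_hits / n_sims


if __name__ == "__main__":
    dins_rate, interaction_rate = simulate()
    print(f"DINS false-positive rate:        {dins_rate:.3f}")
    print(f"Interaction false-positive rate: {interaction_rate:.3f}")
```

With these assumed settings each within-sex test has roughly 50% power, so the DINS approach declares a sex difference in about half of the simulated experiments, whereas the direct interaction test does so at close to the nominal 5% rate.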

The results of Garcia-Sifuentes and Maney highlight the need for thoughtful planning of study design, analysis, and communication to maximize our understanding and use of biological sex differences in practice. The survey does not quantify what proportion of this research reaches incorrect conclusions because of inappropriate statistical methods; doing so would require estimation procedures or reanalysis of the data. Nevertheless, many of these studies’ conclusions might change if the data were analyzed correctly. Misleading results divert our attention and resources, contributing to the larger problem of ‘waste’ in biomedical research, that is, the avoidable costs of research that does not contribute to our understanding of what is true because it is flawed, methodologically weak, or not clearly communicated (Glasziou and Chalmers, 2018).

What can the scientific enterprise do about this problem? The survey suggests that there may be large variability between disciplines in how studies of sex differences are designed, reported, and analyzed. Although larger surveys are needed to assess these practices more comprehensively, the results imply that education and support efforts could be targeted where they are most needed. Compelling scientists to publicly share their data can facilitate reanalysis when statistical errors are discovered, though the burden on researchers performing the reanalysis is not trivial. Partnering with statisticians in the design, analysis, and interpretation of research is perhaps the most effective means of prevention.

Scientific research often does not reflect the diversity of those who benefit from it. Even when it does, using methods that are inappropriate fails to support the progress toward equity. Surely this is nothing less than a scandal.


Article and author information

Author details

  1. Colby J Vorland

    Colby J Vorland is at the Department of Applied Health Science, Indiana University School of Public Health, Bloomington, United States

    Competing interests: No competing interests declared
    ORCID: 0000-0003-4225-372X

Publication history

  1. Version of Record published: November 2, 2021 (version 1)


© 2021, Vorland

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.


Colby J Vorland (2021) Statistics: Sex difference analyses under scrutiny. eLife 10:e74135.
