Unpicking Cherry Picking

When you think about it, the possibilities for cherry-picking when you discuss research these days are growing in rather spectacular leaps and bounds, aren’t they? Questions, studies, specifics in the studies, interpretations, rationales, and theories: the internet and the explosion of research means there’s pretty much a bottomless supply of cherries.

It makes discipline and rigor in minimizing research bias ever more critical — along with the ability to tell a good cherry from one that should on no account be swallowed by anyone. And it makes finding efficient ways to tackle influential cherry-picking urgent.

We are so much better at identifying the problem and showing how to prevent it, than we are at minimizing the damage cherry-picking causes. Especially when it’s large-scale. It can become an academic equivalent of gish galloping: if you don’t maintain rigor, you can produce such torrents, it’s impossible to counter the vast detail with rigor in any reasonable amount of time.

In his article, The Unbearable Asymmetry of Bullshit, philosopher Brian Earp describes the extreme version of academic gish gallop:

[T]he trick is to unleash so many fallacies, misrepresentations of evidence, and other misleading or erroneous statements — at such a pace, and with such little regard for the norms of careful scholarship and/or charitable academic discourse — that your opponents, who do, perhaps, feel bound by such norms, and who have better things to do with their time than to write rebuttals to each of your papers, face a dilemma. Either they can ignore you, or they can put their own research priorities on hold to try to combat the worst of your offenses.

Finding strategies to overcome this, he writes, “should be a top priority for publication ethics”. Cherry-picking, in primary research or in reviews, doesn’t even have to be a real gish gallop to do damage. High-order cherry-picking is harmful enough — and perhaps even more effectively so.

Who has the time to get to grips with large bodies of evidence, especially outside areas in which they have specialized? A large and plausible-seeming overview on a massive complicated topic in a high-prestige journal often becomes influential. High-Order Cherry-Picking Supply: meet Desperate Need!

In the face of that, rebuttal can only go so far. Jeannette Banobi and colleagues found rebuttals can’t be counted on to slow down the trajectory of a publication’s influence (2011). (Via a report by Annalee Newitz this week in Ars Technica.)

Once it’s published, only retraction can slowly limit the ongoing damage from something that shouldn’t have been given high status in the first place. Science and science communication are too widely distributed for anything but the closest equivalent to a “defective product recall” to be effective.

I spent a lot of time this summer working through a single case, and thinking about how irredeemably flawed work can be rebutted efficiently. I’ve summarized the issues related to the subject of the article in my previous post: This is How Research Gender-Bias Bias Works.

The case is a highly-cited review on gender bias that I’d been asked for an opinion on (Ceci and Williams, 2011). It’s in the Proceedings of the National Academy of Sciences of the USA (PNAS). And it doesn’t take long to realize it’s a very shaky house of cards.

But it took a long time to work through thoroughly. I had several false starts, going down roads that would need far too much time to finish. I hadn’t found a recent rigorous systematic review. So I tried to weigh up how much cherry-picking there was from within the study itself: by looking at all the studies addressing the questions referenced by the studies that were picked. That’s a basic, right? If you cite a paper, that means you should know what’s in it.

In this case, that was completely unfeasible. There were vastly more papers that had not been picked by these reviewers than had been.

So I searched for studies that could give me an indication of the magnitude of literature on the questions relevant to the paper. What I found suggests that hundreds of studies may be eligible: thousands would need to be screened.

Then I streamlined the questions I used to assess the robustness of each included study to just a few. I thought that might be manageable. And it would have been, had the paper adhered to basic academic standards. But, for example, it was sometimes time-consuming even to locate the source of data attributed to a study, and sometimes the study didn’t even address the question.

Sometimes, the source for the attributed claim turned out to be second- or even third-hand — inside references cited inside references, sometimes of very large reports. Occasionally, even a few hours of scouring through a document couldn’t locate anything that even vaguely resembled what these authors had attributed to it.

Here’s a summary of my overall conclusions — detailed notes are at the end of the post. Ceci and Williams wrote that their conclusions are “based on a review of the past 20 y of data”. The evidence, they conclude, overwhelmingly shows that gender differences in science are because of women’s choices (free or socially constrained), but gender bias encountered as working scientists is a thing of the distant past. The authors wrote that only 4 studies suggest the possibility of a little gender bias, but those are an aberration.

I identified 35 unique studies cited as support for the authors’ conclusions that there is no appreciable gender bias in science. I don’t count a study that does not exist as it’s referred to in the summary here (although it is in the notes), or studies that were cited for historical reference only. And I didn’t address one section of the paper. That relates to the hypothesis that there is a gender difference in a specific ability. That is not included in the conclusions in the paper’s abstract.

In fact, 11 of the 35 studies they cite, not 4, conclude gender bias remains a problem. And several more found some signs of bias, too, but I think which side of the ledger those studies belong on is more open to interpretation. That makes finding gender bias common, even in this cherry-picked group.

Altogether, in 19 of the studies — just over half — Ceci and Williams reported inaccurately and/or with serious academic spin: selective reporting and descriptions that spun others’ research results into support for these authors’ claims. I have only designated it selective reporting when in my opinion, the study was not fairly represented. Inaccuracies had to affect the weight that might be placed by a knowledgeable reader on the report: inaccurate referencing, for example, did not count.

That doesn’t mean that the other 16 studies constitute support for Ceci and Williams’ conclusions: I don’t believe they do.

Even before I started this exercise, I agreed with Earp that strategies for overcoming the damage these kinds of publications cause are a priority. Now that I’ve tried it, I think it’s urgent in our post-internet, science boom age.

Methodically unpicking this study took far more of my time than I had to spare. Reverse cherry-picking would have been a lot less work: selecting a few points that illustrate the paper’s weaknesses — much as I did in the previous post.

But sometimes, I think a paper probably has to be unpicked. To do so, though, could even be to invest more time than the original authors did in their paper. I think the way to go is for a group of people to carve up the fact-checking and assessment workload. Even if it turns out to be a powerful deconstruction, though, it might need to be accompanied by pressure for action.

It also needs to happen as soon as the paper is published, if it’s going to make enough difference. It’s a tedious and thankless task. And yes, we’ve all got too much on our plates as it is. But allowing high-order cherry-picking free rein is worse in the long term. When a journal that can have an impact lets a paper like this land, some of us have to put down what we’re doing and deal.

~~~~

My deep appreciation to Melissa Vaught and Bamini Jayabalasingham for thought-provoking discussions about gender bias in science, and the research on it.

Disclaimer/disclosure: I work at the National Institutes of Health (NIH), but not in the granting or women in science policy spheres. The views I express are personal, and do not necessarily reflect those of the NIH. I am an academic editor at PLOS Medicine and on the human ethics advisory group for PLOS One. I am undertaking research in various aspects of publication ethics, but am not aware of any conflict of interest in assessing this paper.

The cartoons are my own (CC-NC-ND-SA license). (More cartoons at Statistically Funny and on Tumblr.)

The Mary Poppins gif is from Popkey.

Supporting details for this post:

Overview

Section 1 (a): Journal manuscript acceptance

Section 1 (b): Journal article productivity

Section 2: Grants

Section 3: Interviewing and hiring

Section 4: Hypothesis about women’s choices

Section 5: Additional studies entering only in the SI

Overview

The paper, “Understanding current causes of women’s underrepresentation in science”, was written by Stephen Ceci and Wendy Williams, and published in the Proceedings of the National Academy of Sciences of the United States of America (PNAS). It was received by the journal on 6 October 2010, approved for publication on 6 December, and published on 22 February 2011.

The paper aimed to directly influence policy agendas and strongly advocated redirection of resources and efforts in education, policy, and research. It continues to have strong impact. PNAS metrics show that interest in the article has not waned since 2012, with an Altmetric score placing it in the top 1% of articles of its age that are tracked by that system.

It is classified in the Web of Science as a highly cited paper, and has been folded into subsequent work in 2014 which in turn is also classified as highly cited.

The abstract described the paper as concerned with “recent and robust empiricism”, “claims of discrimination and their evidentiary bases”, and to be “based on a review of the past 20 y of data”.

This positions the paper in the realm of a comprehensive review of evidence, deriving its authority from a thorough examination of the extent and methodological quality of research on influences on women’s underrepresentation in science. This posture is reiterated throughout, particularly in these statements:

“six large-scale analyses have been published, on net providing compelling counterevidence to sex discrimination claims”;
“the weight of evidence overwhelmingly points to a gender-fair grant review process”;
“There are occasional small aberrations, sometimes favoring men and sometimes favoring women”;
“overwhelming counterevidence”;
“The pattern of null sex effects reviewed above is based on funding decisions since the mid-1980s”;
“this review indicates a level playing field over the last two decades” (at funding agencies);
“the evidence shows women fare as well as men in hiring, funding, and publishing (given comparable resources)”; and
“we describe empirical evidence for claims of discrimination in the domains of publishing, grant reviewing, and hiring. We find the evidence for recent sex discrimination — when it exists — is aberrant, of small magnitude, and is superseded by larger, more sophisticated analyses showing no bias, or occasionally, bias in favor of women”.

The authors posited a limited set of areas for review: “the domains of publishing, grant reviewing, and hiring”. Although their conclusions are globally about science, their consideration of funding and hiring is restricted to academia.

The authors also considered a limited set of hypotheses as potentially contributing to women’s underrepresentation: (a) discrimination, (b) women’s life choices, expectations, and career preferences, and (c) “math-ability differences, potentially influenced by both socialization and biology”.

By a combination of a process of elimination of gender bias as a cause, and studies supporting women’s choices and preferences as playing a role, the authors arrived at a decisive conclusion: (b) “The primary factors in women’s underrepresentation are preferences and choices – both freely made and constrained”. Those constraints do not include gender bias in their studies or work in science.

“Past strategies to remediate women’s underrepresentation can be viewed as a success story”, the authors wrote, but they are now “misplaced effort”. No data was provided to demonstrate that removal of efforts to minimize the potential for gender discrimination would not result in an increase of gender bias. The differential experiences for groups of women, disciplines, and specific institutions and regions are not systematically addressed.

What is arguably the dominant hypothesis in the field is not addressed: that men are overrepresented in science because of cumulative advantage. Advantages do not have to be large individually, to contribute to the end result of underrepresentation in elite institutions and positions.

This analysis considers whether the strength of the evidence in this review supports the strength of the authors’ conclusions on gender bias and women’s choices (hypotheses (a) and (b) in this paper). It considers the scientific merit of the evidence review, on 3 broad areas of risk of bias in reviews of evidence:

Assessment of the risk of publication bias;
Selection bias;
Method of evaluation of the methodological and data quality of included studies.

Notes on individual studies that contributed to my conclusions follow.

The Ceci and Williams paper does not include a description of its methods for finding, selecting, or appraising evidence. There is no definition, for example, of “large-scale”, and how 6 particular studies (and only those studies) were determined to meet this criterion.

Assessment of the risk of publication bias

Although some of the studies cited in this paper point to the critical issue of publication (and language) bias in this area, the issue is never raised in the paper.

Legislation against gender discrimination has been in place since before the period under review (the 1980s), in most if not all of the countries studied here. For example, Title IX became law in the US in 1972.

Legal and reputational barriers may inhibit a journal, funding agency, or higher education institution from publicizing evidence of bias. The decision to gather and analyze data may itself be an indicator of awareness and concern for fairness, and/or obligation to have ensured fairness by specified measures.

For example: there were at least 16,000 active science and technology journals in 2011. This paper cites studies in 8 to 13 journals in 2 disciplinary areas. In 1 of those studies (Ref 23), the author had approached 24 journals: only 5 agreed to participate in the study. The risk of publication bias here is high, and such a small proportion of journals and scientific disciplines cannot be regarded as representative.

Mandated external review of public agencies offsets the risk of publication bias there. But that data is included only for very few agencies.

Selection bias

I counted 35 unique studies used to address hypotheses (a) and (b) in this review (excluding those included with historical data only, and 1 study whose existence, as it is described and referenced, I could not establish; details are included below). The studies included in this review themselves cite more studies than the review includes. For example, a review of 21 studies is given considerable weight: 6 of the studies within it are additionally individually addressed, but the reason for this selection is not stated.

The range of studies eligible for consideration is likely to be far larger than the number encompassed by the selected studies here, and those they in turn selected. A bibliometric analysis of published studies on women in science and higher education identified 1,415 articles published between 1991 and 2012 (Dehdarirad, 2015). The gender and science database of a European Union project published in 2010 had 4,299 entries (Meulders, 2010 [PDF]).

In addition to the study selection bias evident in the Ceci and Williams paper, there is selective reporting of data and conclusions from the included studies that favors the authors’ conclusions.

A total of 4 studies are reported by Ceci and Williams as suggesting the possibility of a little gender bias. However, in my opinion an additional 7 studies concluded gender bias remained a problem. Nor are these the only studies where some bias is evident. Altogether, in 19 studies, there is either selective reporting and descriptions that spin study results in the direction of this review’s conclusions, or inaccurate reporting that could affect the weight placed on the evidence by a knowledgeable reader. I identified no instance of spin that did not favor the authors’ conclusions. (Notes on the individual studies are below.)
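The tallies in this section can be checked with a few lines of arithmetic. What follows is a minimal sketch in Python that uses only the counts reported above; the variable names and the `share` helper are illustrative labels, not terms from the paper.

```python
# Counts taken from the text above; names are illustrative only.
total_studies = 35        # unique studies addressing hypotheses (a) and (b)
reported_as_bias = 4      # studies Ceci and Williams describe as suggesting some bias
additional_bias = 7       # further studies judged here to conclude bias remained a problem
spun_or_inaccurate = 19   # studies with selective or inaccurate reporting

bias_studies = reported_as_bias + additional_bias
not_flagged = total_studies - spun_or_inaccurate


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(f"Concluded bias remains a problem: {bias_studies} of {total_studies} "
      f"({share(bias_studies, total_studies)}%)")
print(f"Selectively or inaccurately reported: {spun_or_inaccurate} of {total_studies} "
      f"({share(spun_or_inaccurate, total_studies)}%)")
print(f"Not flagged for reporting problems: {not_flagged}")
```

On these counts, roughly a third of the included studies concluded that bias remained a problem, and just over half were reported selectively or inaccurately.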

Method of evaluation of the methodological and data quality of included studies

The paper uses the terms gender bias/fairness and discrimination without precision and clear definition, with one exception. In the Supporting Information (SI), this level of evidence is set for satisfactorily demonstrating discrimination:

[F]or discrimination to be shown to exist, the women and men must possess similar credentials and must apply equally vigorously and be equally willing to relocate. If such aspects of the situation are true and can be verified, and if women are passed over for jobs relative to men, then discrimination is taking place. However, if women choose not to apply, won’t relocate, and/or lack the scholarly records possessed by their male counterparts, we cannot infer discrimination, because the characteristics of the applicants and their situations are not equivalent…When women PhD recipients choose not to apply for tenure-track posts, their refusal represents a choice, one that most of their male and many of their female colleagues do not make.

Although there are conclusions about some studies being better, there is no report of the criteria for this assessment or individual assessments for each study. There is no consistent consideration of data quality, such as the problems of large proportions of missing data in many studies, the risks of multiple testing, and other critical issues of robustness that affect the strength of conclusions that can be drawn from this data set.

There is no consistent approach to what is considered evidence of a potentially causal link. Statistical significance seems to be required for some statements, but not others.

Finally, PNAS editorial policy states that the SI may contain “additional substantive material, but the paper must stand on its merits”. This is not the case for this paper.

In my opinion, this paper suffers not only from considerable error. It also has irredeemable methodological flaws, particularly publication bias, selection bias, and selective reporting. The selectively chosen evidence within this paper supports neither the authors’ conclusion that there is a universally level playing field for women in science, nor that women’s choices are the primary cause of their underrepresentation.

Section 1 (a): Journal manuscript acceptance

Ceci and Williams’ conclusion:

The question of whether sex discrimination exists in getting work published is ideally answerable by examining manuscript acceptance rates of men vs. women, holding constant quality of work…Comparing women and men with comparable resources, we find no sex discrimination in publishing.

To support this, 6 studies are cited, covering between 8 and 13 journals (5 are not identified). All but 2 journals come from 1 area of science publishing (ecology and evolution): the other 2 come from neuroscience.

There would have been at least 16,000 active science and technology journals in the Scopus database in 2011. The experience of a small proportion of the journals in 2 areas is not enough to sustain a conclusion about the presence or absence of gender bias in science journals. Publication bias is a major concern as well. Journals whose editors have high levels of concern about gender fairness may be more likely to analyze their performance. Journals whose editors identify recent poor gender performance are presumably unlikely to want to broadcast this failing. There is an indication of the seriousness of this problem in 1 of the studies cited: Tregenza approached 24 journals to participate in his study and editors from only 5 agreed [PDF].
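To put those proportions in concrete terms, here is a rough sketch of the arithmetic in Python. It uses only the figures quoted above (8 to 13 journals, at least 16,000 active journals, and 5 of 24 journals agreeing to take part); the names are illustrative. Because 16,000 is a lower bound, the coverage percentages it prints are upper bounds.

```python
# Figures quoted in the text above; names are illustrative only.
active_journals = 16_000             # "at least" this many in Scopus in 2011
journals_min, journals_max = 8, 13   # journals covered by the 6 cited studies
approached, agreed = 24, 5           # journals Tregenza approached / that took part


def pct(part, whole):
    """Return part/whole as a percentage."""
    return 100 * part / whole


coverage_low = pct(journals_min, active_journals)
coverage_high = pct(journals_max, active_journals)
participation = pct(agreed, approached)

print(f"Journals covered: {coverage_low:.2f}% to {coverage_high:.2f}% of active journals")
print(f"Journals that agreed to join Tregenza's study: {participation:.1f}%")
```

Under these figures, the cited studies cover well under 0.1% of active journals, and only about a fifth of the journals Tregenza approached took part.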

The scope of the section is restricted to editorial peer review and productivity. Other issues may be more likely than the presence of one or more women among a list of manuscript authors to indicate inequality and disadvantage for women. These issues include authorship of editorials and invited perspectives, international collaboration, and being editors and peer reviewers. For some areas, book and monograph authorship will be as critical for career advantage as journal publication, if not more so.

There are 2 instances claiming there are additional studies supporting the authors’ conclusions, where these studies are not cited:

Other journals (e.g., Cortex) have also reported equal acceptance rates (25).

Finally, many others also report no sex differences in productivity, controlling for structural variables confounded with sex (e.g., refs. 7 and 8).

Individual study notes

Budden (2008) [PDF] (Ref 18)

This is the largest study, including 7,263 articles from 6 journals. This study addresses whether or not there is an apparent correlation between the introduction of blinded peer review, and an increase in articles authored by women. The authors concluded that blinded peer review benefited women, implying the presence of gender bias. However, a study comparing outcomes after attempts to blind peer review is not designed to identify whether or not gender bias exists. (For example, blinding is not completely successful: an absence of effect after introducing blinded peer review is not proof of absence of bias.)

Ceci and Williams refer to this as the only study finding possible signs of gender bias, without noting that it is the largest study they include. They provide no assessment of methodological or data quality.

Whittaker (2008) (Ref 22)

This is a letter to the editor on the study above (Budden, 2008). It considers acceptance rates for 1,140 articles from 1 of the 6 journals studied by Budden et al. (Budden et al included 1,040 studies for this journal from an earlier time period.) 46% of the papers had mixed gender authorship. The number with women first or corresponding authors is not reported. (The number of women authors at this journal was relatively small, according to Budden (2008).) Acceptance rates are not reported. A chi-squared test result showing no statistically significant difference for gender of corresponding author is reported, with no underlying data to support interpretation.

Ceci and Williams refer to this as a study finding no signs of gender bias. They provide no assessment of methodological or data quality.

Tregenza (2002) [PDF] (Ref 23)

This single-author study by Tregenza is erroneously referred to as “Budden et al”. This study looks prospectively at acceptance rates of 2,680 articles in 5 unnamed journals from the same field as study 1 in this section. (Tregenza was a co-author for that study.)

The author of this study approached 24 journals in the field to participate: 7 editors from 5 journals agreed. The denominator of potential editors is not provided so the response rate for editors within those 5 journals is not known. Nor is there data provided to judge performance on gender between participating and non-participating journals. The author notes the high risk of self-selection bias in editors agreeing to participate in this study, who may be “unusually aware of the potential for bias”.

Gender was assigned according to the first author by Tregenza: “Gender was designated as unknown where it was not immediately apparent from the author’s first name (predominantly when only initials were provided) even if the editor actually knew the person’s gender”. The rate of “unknown” gender ranged from 20% to 50% across the 7 editors.

On this problematic basis of assignment of knowledge of gender, the rate of acceptance among the 7 editors for manuscripts by men was 35-46%, while for women and “unknown” it was 23-53% and 20-50% respectively. The number of manuscripts per editor ranged from 70 to 928. All differences greater than 10% disfavored women: 12% for an editor with 699 manuscripts, 19% for one with 178, and 20% for one with 77.

A model failed to reject the null hypothesis of no gender interaction, which is chosen as the final conclusion of this study reported by Ceci and Williams. But given the uncertainty around the designation of “unknown” for gender, the small numbers for some editors, and the high risk of self-selection bias, this cannot be regarded as strong evidence for absence of bias. Indeed, the author concludes: “However, differences among journals in the acceptance rate of papers relative to gender gives grounds for caution because this pattern is difficult to explain without invoking bias”. The abstract notes this conclusion as well: “…differences in acceptance rates across journals according to gender of the first author give grounds for caution”.

Ceci and Williams refer to this as a study finding no signs of gender bias, despite that not being the study author’s conclusion. They provide no assessment of methodological or data quality.

The results of this study are selectively reported.

Borsuk (2009) (Ref 17)

This study is not relevant to the consideration of experience at journals, as it is a simulation of peer review. It is a study with 989 biologists, mostly under- or post-graduate students (>96%), mostly female, doing a manuscript review. Rating of papers is also not a measure of whether or not the paper is regarded as publishable, which is the key outcome.

The authors came to 2 conclusions: gender of author of a paper made no difference to ratings, and female postdoctoral researchers were more critical reviewers.

Ceci and Williams refer to this as a study finding no signs of gender bias. They provide no assessment of methodological or data quality and do not point out that the study population is almost entirely students.

The results of this study are selectively reported.

Note about referencing error: Although the first author of this study is Borsuk, it is attributed to Budden even though she is neither first nor last author. Earlier, Ceci and Williams also wrote: “Budden and her colleagues published several analyses of gender bias in manuscript reviewing by undergraduates, graduate students, postdocs, and journal reviewers (6, 17–19).” Reference 6 is a methodological paper about publication bias, not an analysis of gender bias in manuscript reviewing. Reference 19 is a response to a letter about Reference 18. Budden is the first author only for Reference 18 and the letter, and is never the last author. She is not an author of Reference 17.

Nature Neuroscience (2006) (Ref 24)

This is an analysis of acceptance rates for 469 papers in the first quarter of 2005 at this journal, looking at gender of first and last authors. Gender was unknown for 20. Acceptance rates were around 11% regardless of gender. No data are provided on the representativeness of the first quarter of 2005.

Ceci and Williams refer to this as a study finding no signs of gender bias. They provide no assessment of methodological or data quality.

Brooks (2009) (Ref 25)

This is an analysis of publication rates by gender of first authors at the neuroscience journal, Cortex in 2 periods: 1997/98 (103 papers where gender of authors could be ascertained) and 2007/08 (204 papers). There were no data on acceptance rates for submitted manuscripts. The publication rates for women first authors in the later period were similar to the rate for men (48% vs 53%).

Ceci and Williams refer to this as a study supporting their conclusions. They provide no assessment of methodological or data quality. They refer to it as an example of “equal acceptance rates”, even though acceptance rates are not provided.

This study is inaccurately reported.

Section 1 (b): Journal article productivity

Ceci and Williams’ conclusion on this question:

In sum, when publication data are controlled for structural position, ensuring that sex differences in manuscript acceptance rates are not conflated with sex differences in resources, there is no difference between the sexes.

See above 1 (a): The studies on journal publishing do not support Ceci and Williams’ conclusions.

Individual study notes

Xie (1998) [PDF] (Ref 26)

The analyses in this study use data from 4 unrelated, cross-sectional surveys, with very different methodologies and response rates (from 49% to 87%). Ceci and Williams, however, describe it as “a longitudinal analysis”, which would have been a higher level of evidence. 4 different models are used to analyze the data. It is not clear to me where Ceci and Williams derive the numbers they report.

Xie alternatively described gender as a “small”, “reduced”, “nil or negligible” contributor to women publishing fewer papers, taking data from 1969, 1973, 1988, and 1993 into account. The paper also pointed to a range of issues to take into consideration in interpreting the results of the models used, including (a) “the covariates we use may be the effects, rather than the causes, of research productivity” and (b) “we still do not know why men and women scientists differ systematically in these important dimensions, and in this sense the puzzle remains unsolved”.

The Xie study concluded that by 1988 and 1993, “there is a notable decrease in the power of the resource variables introduced in Model 3 to explain sex differences in productivity”. As the difference in access to resources narrowed, they were therefore less able to account for sex differences.

Ceci and Williams refer to this study as supporting their conclusions. They provide no assessment of methodological or data quality, and inaccurately describe the study’s methodology. Their statement that after factoring in variables, “the productivity gap completely disappeared” does not accurately reflect this study’s finding. I could not locate with certainty the ratios they quote in the study.

This reference is also used to support this statement: “any difference in productivity is due to structural variables that, although correlated with sex, are causally unrelated to it”. The study does not address the causes of reduced access to resources.

The results of this study are selectively reported and the study itself is inaccurately described.

Unnumbered: this study is excluded from the final count for this section. Allison (1990) (Ref 27)

The article as referenced in the Ceci and Williams paper does not exist. This is the issue of that journal. There is a paper with almost the same reference, but different page numbers, which I describe here.

That study analyzed a sample of 179 job changes between 1961 and 1975, which falls well within the period Ceci and Williams argue is no longer relevant for consideration. It includes no analysis of gender.

Allison pointed out that while productivity increases occurred after moving to a job at a more prestigious institution, the study did not rule out other potential explanations, such as that “prestigious departments search out and recruit scientists who are on an upward productivity trend”.

According to Ceci and Williams, the article to which they refer includes this data:

In this analysis, males produced 30% more publications than women, but when men tenured at R1 universities were compared with women tenured at R1 universities, the gap fell to 8%, and the difference between men and women full professors at R1s was <5%.

It seems likely that the data come from one of the other studies already cited. The paper referenced does not include this data.

Ceci and Williams refer to this study as supporting their conclusions. They provide no assessment of methodological or data quality. The data and claim cited by Ceci and Williams are not included in the paper that appears to be Reference 27. Reference 27 is an incorrect reference to a study too old to be relevant to their thesis, that does not address gender.

The study as reported here does not exist and is excluded from this analysis.

Committee on Gender Differences (2009) (Ref 28)

This report looks at productivity from several angles, including the possibility of dissatisfaction at work reducing productivity. Its analyses on publications are based on a survey, not a bibliometric analysis. Consequently, it has a risk of responder bias. In addition, “only 934 (of the 1,404 faculty) had complete information on all covariates in the model and had reported a number of journal articles”.

Ceci and Williams report: “Similarly, a National Research Council task force concluded that productivity of women science and engineering faculty increased over the last 30 y and is now comparable to men’s, the critical factor affecting publications being access to institutional resources (28)”.

This is from the Committee’s conclusion:

Overall, male faculty had published marginally more refereed articles and papers in the past 3 years than female faculty, except in electrical engineering, where the reverse was true. Men had published significantly more papers than women in chemistry (men, 15.8; women, 9.4) and mathematics (men, 12.4; women, 10.4). In electrical engineering, women had published marginally more papers than men (women, 7.5; men, 5.8). The differences in the numbers of publications between men and women were not significant in biology, civil engineering, and physics. All of the other variables related to the number of published articles and papers (discipline, rank, prestige of institution, access to mentors, and time on research) show the same effects for male and female faculty.

The Committee included these caveats:

Without all relevant controls accounted for in the analysis, the results need to be taken as preliminary and as an impetus for further, more sophisticated research, rather than a definitive statement on the existence of disparities between male and female faculty. Finally, it should be noted that the analyses presented here provide an aggregated, often average, view. That view is not inconsistent with some women having very few resources and some women having quite a lot, nor does it negate the possibility that individual women (or men) are discriminated against in their access to resources.

It is important to note that these statistics and those that follow related to publications could be misleading, given the significant interactions discovered in our multivariate analysis of gender, discipline, publications, and other variables.

Some of those results did include gender differences, such as: “Regarding the interaction of rank by gender, men increase the number of journal publications between the ranks of assistant and associated more than women do”.

Ceci and Williams refer to this study as supporting their conclusions. They provide no assessment of methodological or data quality.

The results of this study are selectively and inaccurately reported.

Hill (2010) [PDF] (Ref 7)

Ceci and Williams:

Finally, many others also report no sex differences in productivity, controlling for structural variables confounded with sex (e.g., refs. 7 and 8).

This report draws on various sources to reach its conclusions, which generally contradict Ceci and Williams’ statement rather than support it. I could not identify an analysis or conclusion specifically related to publishing productivity in the report that supports the statement. Stack (2004 [PDF]) is cited specifically in the Hill report on the issue of productivity. Stack concludes:

Gender remained a significant predictor of productivity after children and numerous other covariates of research productivity were included in the analysis.

Ceci and Williams refer to this study as supporting their conclusions. They provide no assessment of methodological or data quality. This reference contradicts, more than supports, the statement it references.

The results of this study are selectively reported.

National Academy of Sciences (NAS) (2006) (Ref 8)

Ceci and Williams: “Finally, many others also report no sex differences in productivity, controlling for structural variables confounded with sex (e.g., refs. 7 and 8)”.

This report’s conclusions also do not support the statement. The report does conclude that, given the same access to resources, women scientists are just as productive. However, the NAS report does not conclude that access to resources is only a confounding variable:

Incidents of bias against individuals not in the majority group tend to have accumulated effects. Small preferences for the majority group can accumulate and create large differences in prestige, power, and position. In academic science and engineering, the advantages have accrued to white men and have translated into larger salaries, faster promotions, and more publications and honors relative to women.

The results of this study are selectively reported.

Section 2: Grants

Ceci and Williams’ conclusions:

[T]he weight of evidence overwhelmingly points to a gender-fair grant review process. There are occasional small aberrations, sometimes favoring men and sometimes favoring women; all of the smaller-scale studies failed to replicate Wennerås and Wold’s provocative findings, and all but one of the large-scale studies did as well — however, this one study was reversed after a more ambitious joint reanalysis. Despite this overwhelming counterevidence, numerous organizations continue to suggest grant review is discriminatory, thus diverting attention from legitimate factors limiting women’s participation in math-based careers. The pattern of null sex effects reviewed above is based on funding decisions since the mid-1980s.

This is another area with a high risk of publication bias. Discrimination against women applicants for public agency grants has been against the law across this later period in many if not all of the countries included here. Nevertheless, findings that precluded reassurance, or caused concern that gender bias may not be adequately prevented, were common in the studies cited in this section.

The numbers are generally too low to be able to identify an absence of difference. Where there is statistical significance, it suggests at least gender bias against women. Adjusting for the risks of multiple testing in analyses with many variables was rarely reported.

Key differences such as size of grants were not routinely analyzed in this group of studies.

Individual study notes

Wennerås (1997) [PDF] (Ref 29)

This study was based on data on 114 fellowship applications to the Swedish Research Council, gained via a freedom of information application. The authors conclude there was gender bias. Ceci and Williams include extensive critiques of the study, including half a page in the supplementary information.

Ceci and Williams refer to this study as not supporting their conclusions. They include extensive methodological critique of the study.

Grant (1997a) (Ref 31)

This is a letter in response to Wennerås (1997), with very little detail. It reports on Wellcome Trust project grants as well as fellowship and career development awards (2,867 applications/proposals). There was a gender difference in the rate of application for career development awards. Sex of some applicants was not recorded, although how many was not reported. There were differences in success rates for career development awards, but the differences were not statistically significant.

Although not referred to here, this analysis was published (Grant, 1997b [PDF]).

Ceci and Williams refer to this data as supporting their conclusions. They provide no assessment of methodological or data quality.

Dickson (1997) (Ref 32)

This is not a study. It is a news report of a talk given by a UK parliamentarian quoting data.

Ceci and Williams refer to this news report as supporting their conclusions, without making clear it is not a reported study. They provide no assessment of methodological or data quality.

Friesen (1998) (Ref 33)

This is another letter in response to the Wennerås study, this time from the MRC in Canada, on success rates for grants and personal awards (8,283 applications/proposals). Rates were similar for women and men, except for fellowship programs (12.9% for women, 16.3% for men; a statistically significant difference).

Ceci and Williams refer to this data as supporting their conclusions; however, they report the exception (which is included as an instance of the authors reporting contradictory findings). They provide no assessment of methodological or data quality.

Sandstrom (2008) [PDF] (Ref 34)

This study analyzes outcomes for 280 applications at the Swedish MRC in 2004, the same funder studied by Wennerås (1997) (above). Women were under-represented among principal investigators (PIs), but the reasons for this were not studied. As with the Wennerås study, they identify an influence from affiliation with reviewers, but not the large gender difference in the Wennerås study.

Ceci and Williams refer to this study as supporting their conclusions. However, the authors did not study reasons for gender differences. They provide no assessment of methodological or data quality.

Demicheli (2007) (Ref 35)

This is a systematic review of intervention studies aiming to improve the quality of peer review. The included studies did not address the reduction of gender bias or sex discrimination, and the study’s conclusions include no mention of gender bias or sex discrimination.

Ceci and Williams’ statement implies, however, that studies of sex discrimination were reviewed: “the Cochrane Methodology Review Group concluded that other than Wennerås and Wold’s study, ‘a number of other studies carried out in similar contexts found no evidence of (sex discrimination)’.” However, studies on whether or not sex discrimination (or gender bias) occurred were not reviewed. The partial quote (which relates to gender bias) is drawn from the review’s background, not its conclusions. (Note: the Cochrane Methodology Review Group did not author this review.)

Demicheli (2007) cites 3 relevant sources in the review’s background: Wennerås (1997) and Grant (1997a), which were already considered by Ceci and Williams, and a book by Cole (1992) that does not appear to include recent studies. (I have not read the book.) It is not clear what studies are being referred to, or if they support Ceci and Williams’ conclusions. There is no indication that Ceci and Williams sought to identify those studies.

Ceci and Williams refer to a statement in the background text of this study as support for their conclusions, although it points to no relevant studies beyond those already reported in their paper. The actual study includes no studies addressing gender.

RAND (2005) (Summary [PDF]) (Ref 36)

This is a study by RAND, which was commissioned to review federal agency compliance with the anti-discrimination provisions of the 1972 legislation, Title IX. Ceci and Williams refer only to the 2-page summary, not the full report.

Ceci and Williams report that RAND studied 3 federal agencies and concluded “there was no gender bias in awarding of grants”. In the SI, but not the main text, they refer to only one of the gender concerns identified by RAND: a gender difference in re-application rates.

RAND did not conclude there was “no gender bias”. RAND looked at 5 agencies, and concluded that 2 of the agencies did not have data that was adequate to assess outcomes by gender. Of the 3 agencies with analyzable data, they concluded there was no gender difference, “with two important exceptions” and “several caveats”. In addition to the gender gap in subsequent applications at the NIH, these included a gender gap in the amount of funding awarded by the NIH to women: women received “on average only 63 percent of the funding that male applicants received”. Men received, on average, around half a million dollars extra per grant. RAND also raised concern about some lack of availability of data at the NIH on co-investigators and amounts requested.

RAND concluded: “Our understanding of gender differences in federal research funding is incomplete”. In the key findings of the full report, the importance of the absence of application data at the NIH and USDA was stressed, as gender differences among applicants may have affected findings.

RAND analyzed data from 2001-2003:

NSF: 115,537 person-year observations for the investigators with both gender and experience known;
NIH: 84,313 proposals;
USDA: 11,213 person-years (same method as for NSF).

Two federal surveys were also considered.

Ceci and Williams refer to this study as supporting their conclusions. They provide no assessment of methodological or data quality. They do not refer to the full report. They identify the analysis as large-scale.

The results of this study are selectively and inaccurately reported, with key findings of gender differences and concerns omitted. One of the 2 gender differences is referred to, but only in the SI, not the article. The results of this study contradict more than support Ceci and Williams’ conclusions.

Leboy (2008) (Ref 37)

Leboy refers to the RAND study via a report in Nature Medicine, referring to the lower grant awards women applying to the NIH received (Schubert, 2005), as well as studies on other questions. She includes 2 sets of data on NIH funding:

The first is a graph of percentages of NIH grants received from 1993 to 2004. Leboy concludes: “While women get over 40% of K awards series, women receive less than 20% of NIH research center grants”.

The second is a graph of success rates for RO1 awards from 1998 to 2004. Leboy concludes: “In the first round of RO1 awards, women do as well as men. But women trail men on renewal awards”. Data are only graphed, with no exact numbers provided.

This is a larger data set than in the RAND survey (which covered only 2001-2003), although details are sparse. Findings are similar: areas of both parity and lag.

Ceci and Williams report this following on from the RAND survey:

Leboy’s much smaller report of success rates for first-time RO1 grants at the National Institutes of Health for men and women revealed identical success rates for new submissions between 1998-2004 and very similar continuation grants by 2004.

Ceci and Williams refer to this study as supporting their conclusions. They provide no assessment of methodological or data quality. They classify the analysis as large-scale.

The results of this paper are selectively reported, with key findings of gender differences and conclusions omitted. The paper contradicts more than supports Ceci and Williams’ conclusions.

Jayasinghe (2003 [PDF]), Marsh (2008a [PDF]), Marsh (2008b)* (Refs 38-40)

Note *: I could not locate Marsh (2008b) on the internet. It is an unpublished report that is cited, but not discussed in the Ceci and Williams article.

Jayasinghe (2003) reports that a precursor to this study was a report of some gender imbalance in the previous year (1995), citing Bazeley (1998). Bazeley concluded that the imbalance, both in rates of application for large grants and in success rates, arose because men were more likely to be senior:

These results lend some support to the idea that the “Matthew effect”, or theory of accumulative advantage, has some impact on peer and panel review of applications within the highly competitive arena of Australian Research Council large grants.

This is not reported or cited by Ceci and Williams.

The study that is reported relates to peer review of 2,331 research grant applications in 1996 at the Australian Research Council (ARC). There had been an additional 653 proposals rejected that year without peer review (22%): no further data is reported on them. The 2,331 sent out to assessors covered science, social science, and the humanities. The proportion in these fields is not reported, except for the group where the gender of the assessors is also known: there was a high rate of missing data on gender of the assessors (23%).

For the 2,109 proposals where the field is reported, 68% were in science. Female researchers received similar scores to male researchers in the social sciences and humanities, but lower in science. There was some gender difference in assessor rankings.

Models involved considerable numbers of analyses, with no pre-specified hypotheses or method reported for adjusting for multiple comparisons in the articles. Numbers for some groups would be very small and under-powered.
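To illustrate why the absence of any adjustment for multiple comparisons matters, here is a minimal Python sketch. It assumes the tests are independent and uses a conventional 0.05 threshold. Both are simplifications, and the numbers of tests are hypothetical, not taken from the ARC articles. The function names are mine.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / num_tests


# Hypothetical numbers of analyses, for illustration only.
for num_tests in (1, 10, 20, 50):
    rate = familywise_error_rate(num_tests)
    threshold = bonferroni_threshold(num_tests)
    print(f"{num_tests:>2} tests: family-wise error {rate:.2f}, "
          f"Bonferroni threshold {threshold:.4f}")
```

Under these assumptions, with 20 unadjusted tests the chance of at least one spurious "significant" result is already well over half.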

Ceci and Williams only report the results of modeled data from reference 38, and do not report that data on 22% of the initial applications was not included. In addition, they report that “Marsh et al. (39) extended these results”. I could see no additional analyses reported in reference 39. They do not report that the study includes non-science research, or that the un-modeled data included a lower success rate for women in science (but not non-science). They do not report the study of the previous year’s ARC grants.

Ceci and Williams refer to this study as supporting their conclusions. They provide no assessment of methodological or data quality. They classify the analysis as large-scale.

The results of ARC findings are selectively reported.

Bornmann (2005) (Ref 41)

This is another study of personal awards (doctoral and postdoctoral fellowships), not project grants to groups. The study involves 2,697 applications (of which 743 were for postdoctoral fellowships) to a German foundation between 1985 and 2000. For postdoctoral applications, female applicants were so under-represented, analyses would be under-powered (50/743 applicants). The rate was therefore not reported. Postdoc fellowships had been discontinued in 1995.

For pre-doctoral fellowships, application rates were more similar. The success rates were 7% for women, 16% for men, which was statistically significant. They calculated that being female reduced chance of success from 50% to 33%.
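The 50% to 33% figure is consistent with the odds of success being halved. A minimal sketch, assuming an odds ratio of 0.5 (an illustrative value chosen because it reproduces the quoted percentages, not a figure I have taken from Bornmann's paper):

```python
def apply_odds_ratio(baseline_probability: float, odds_ratio: float) -> float:
    """Convert a probability to odds, scale the odds, and convert back."""
    baseline_odds = baseline_probability / (1 - baseline_probability)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1 + new_odds)


# Assumed odds ratio of 0.5 applied to a 50% baseline chance of success.
print(round(apply_odds_ratio(0.50, 0.5), 2))  # 0.33
```

A halving of the odds is a larger effect than "50% to 33%" may sound at first reading, which is one reason the size of the difference belongs in the main text.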

Ceci and Williams report that the study involved 1,022 applications, which is inaccurate. They report that there was no gender bias in postdoc awards, without pointing out that this related to sample size. They report that there was a gender difference for doctoral fellowships, but do not report the size of the difference in the main text.

In the SI, they report only the 50% to 33% data. In addition, they state: “Thus, this study provides some support for approval biases due to sex existing between 25 and 20 y ago”. I could see no time trend analysis in Bornmann (2005), and the conclusion for doctoral fellowships is not restricted to 25 and 20 years prior to 2011: the year 2000 was only a decade prior to Ceci and Williams’ writing.

Ceci and Williams refer to this study as not supporting their conclusions. They provide no assessment of methodological or data quality, but include the authors’ own description of study limitations in the SI. They classify the analysis as large-scale.

The results of this study are both selectively and inaccurately reported.

Bornmann (2007a) (Ref 42, 43)

Reference 42 is a report of reference 43 (2007b; arXiv version [PDF]).

This is a meta-analysis of studies of gender in grant peer review. It had a systematic search, at the end of 2005, and includes primary studies authored by the meta-analysis authors themselves. However, there was no quality assessment of the studies it included. The authors wrote:

Due to the fact that most of the studies used were published as peer reviewed publications or as reports of renowned institutions their quality is sufficiently guarantied (sic).

This lack of methodological critique is a serious methodological flaw in a review. In addition, there were no pre-specified inclusion criteria, and no report of what studies were excluded and why. The risks of both publication and language bias are high. (As this study is the subject of re-analysis below, which Ceci and Williams judge as superseding this analysis, I address methodological issues in more depth there.)

Data from 21 studies of predominantly science grant programs, published between 1987 and 2005, were split into 66 “peer review procedures”. For example, Grant (1997) (number 2 in this section) is split into 4: that number corresponds with the sections in the letter’s table (Wellcome Trust (WT) project grants, WT senior research fellowships, Medical Research Council (MRC) project grants, and MRC career development awards).

This study includes all but 2 of the grant review processes cited in this section of Ceci and Williams’ paper that were published in time to be included. The missing 2 are the RAND report of NSF/NIH/USDA and Broder (1993 – Ref 44), an analysis of economics proposals to the National Science Foundation (NSF) between 1987 and 1990 (below).

Although Ceci and Williams in effect repeat individual consideration of the 5 grant programs/studies included in the Bornmann meta-analysis, they do not consider individually the other 16 reports included in the review (including Bazeley (1998) already referred to above). There is no explanation for this selection.

As well as the Bazeley ARC report, the 16 reports include the NSF, the UK Engineering and Physical Sciences Research Council (EPSRC), Germany’s major science funder (DFG, but for sociology only), the European Molecular Biology Organization (EMBO), the Netherlands Organization for Scientific Research (NWO), Australia’s National Health & Medical Research Council (NHMRC), and Marie Curie Fellowships.

Like Bazeley (1998), the report of the EPSRC concluded that small accumulative advantages appear to accrue for men across science, culminating in greater grant success (Viner, 2004). It was a study of 33,750 grant proposals from 1995 to 2001. The total number of applications/proposals in the Bornmann meta-analysis was 353,725.

In Bornmann’s meta-analysis (using a Bayesian method to compute log odds ratios for each “procedure”), men had a 15:14 chance of approval compared to women (or 52% to 48%). They point out that out of 50,000 grants submitted, women could expect to lose 2,000 approvals because of their gender. The log odds ratios are only presented graphically. It appears that 47 have their average fall on the side of favoring men, and 6 on the side of favoring women, with the rest of the 66 touching 0. In only 1 do the credible intervals fall completely below 1.
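The conversion from 15:14 to rounded percentages can be checked with a short Python sketch. It covers only that arithmetic. It does not attempt to reproduce the Bayesian model or the 2,000-approvals estimate, and the function name is mine.

```python
from fractions import Fraction


def odds_to_shares(for_men: int, for_women: int) -> tuple[Fraction, Fraction]:
    """Turn an odds statement such as 15:14 into two shares that sum to 1."""
    total = for_men + for_women
    return Fraction(for_men, total), Fraction(for_women, total)


men_share, women_share = odds_to_shares(15, 14)
print(f"{float(men_share):.1%} vs {float(women_share):.1%}")  # 51.7% vs 48.3%
print(round(15 / 14, 3))  # 1.071, the same odds written as a single ratio
```

Rounded to whole numbers, that gives the 52% to 48% quoted above.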

Ceci and Williams report very little on the results, stating that the gender advantage for men was “extremely small” and that “no bias was found for postdocs”. However, data on postdocs was not presented separately and the authors make no statement about postdocs. They do not report that there are 21 studies in this meta-analysis covering more than 350,000 applications/proposals, nor that it covers major granting institutions otherwise not mentioned in their paper.

Ceci and Williams refer to this study as not supporting their conclusions. They include methodological critique of this study in their discussion of reference (45) below. They classify the analysis as large-scale.

The results of this study are reported inaccurately. There is selective reporting of included studies in the paper.

Broder (1993) (Ref 44)

This is a study of economics proposals to NSF from 1987 to 1990. It includes “1,479 usable proposals”: no details are provided on the excluded proposals, or success rates. There were 6,764 reviews on those proposals. Scoring was from 1 (excellent) to 5 (poor): therefore a higher number represents a lower score.

Proposals by female PIs received a lower average score, regardless of the gender of the reviewer, and the differences were statistically significant. Male PIs received an average score of 2.59 from both male and female reviewers; female PIs received an average score of 2.69 from males and 2.91 from females. Female PIs were also more likely to get an extremely poor rating: also statistically significant. Further analyses indicated that the bias against female PIs was greater from female reviewers than male reviewers.

Ceci and Williams wrote:

Broder found female PIs fared well when rated by male reviewers at NSF, but less well when rated by female reviewers, a finding she suggested may have worked against increased representation of women.

This does not reflect Broder’s findings. Broder’s conclusions in the abstract, as elsewhere in the paper, are:

Based on reviews of grant proposals to the National Science Foundation (NSF), this study presents evidence of significant differences in the reviewing of female and male authors by male and female referees.

Ceci and Williams refer to this study as supporting their conclusions. They provide no assessment of methodological or data quality. They classify the analysis as large-scale. They do not report the magnitude of the differences.

The results of this study are both selectively and inaccurately reported. This study’s findings contradict Ceci and Williams’ conclusions.

Marsh (2009) (Ref 45)

This is a re-analysis of the same studies in 13 (Bornmann, 2007) above, initiated by the authors of the ARC study (9-11 above). It was initiated by the ARC authors because the Bornmann finding of gender bias “seem to contradict those based on the comprehensive study” of the ARC. The authors include both author groups (the ARC and Bornmann groups).

Sources of bias and serious methodological flaws in the original meta-analysis continue in this re-analysis:

The authors are analyzing their own primary studies;
There is no quality assessment of the included studies;
There were no pre-specified inclusion criteria, and no report of what studies were excluded and why;
The risks of both publication and language bias remain high, and no method of weighing the risk of publication bias was reported.

To these underlying risks of bias, new analyses are undertaken, using study characteristics as “potential explanatory variables to explain heterogeneity in the study outcomes”, beyond pre-specified hypotheses. The only pre-specified hypotheses reported were:

Instead of looking for evidence of gender difference, the hypothesis is reversed: that the results are consistent with gender similarity;
That men may have a larger advantage in applications for fellowships than grant applications;
Gender differences in favor of men would be more likely in older publications than more recent.

As well as the variables related to the pre-specified hypotheses (gender, application type, and year of publication), additional variables used in the meta-analytic modeling included country (including multiple country) and discipline. Studies were weighted for size.

As with the original review, studies were pooled regardless of their methods of assigning gender (e.g. PIs, all individuals), or whether they studied awards or reviews. There was no assessment of award size or prestige, and no method of accounting for missing data (which varied widely between included studies, in terms of both proposals and gender). There was no method of adjusting for multiple testing.

Models used were not reported as pre-specified: how many models and analyses were run and how many reported is not stated. The results of 8 models are reported. They report that there was unreported analysis within these models.

They conclude that their findings are consistent with no gender bias for grant applications overall, but with a difference in favor of men for fellowship applications. The hypothesis of a difference in publication year was not supported, although they note that there was no correspondence necessarily between years studied and year of publication. They conclude that their results for discipline (which includes science versus social science and humanities, which differed significantly in some of the modeling), as well as country, were “somewhat ambiguous”.

The authors conclude that their results provide “very strong support” for the gender similarity hypothesis and “no support at all for any gender differences in peer reviews of grant proposals”. However, success rates in awards were often the measure, not gender differences in reviews (and how often awards were consistent with the recommendations of those reviews). In addition, the results may also be consistent with publication bias. This was not addressed.

Ceci and Williams conclude that this study has “the most powerful analytic approach to date”, without reporting the criteria on which this judgment was made. They report “there was no evidence of sex differences favoring men in any category”. This does not reflect the data or the authors’ conclusions on issues such as discipline. For example, “[Effect sizes] for the biomedical and social sciences were significantly negative in Table 1 (i.e., in favor of men)”.

Although the authors state some of their findings other than for fellowships were ambiguous, Ceci and Williams report that:

The team interpreted this single finding in favor of men [fellowships] as an aberration from an otherwise unambiguous pattern of no sex advantages or even slight female advantage, and the lack of sex differences generalized over country and discipline.

Ceci and Williams refer to this study as supporting their conclusions. They provide no assessment of methodological or data quality. They classify the analysis as large-scale.

The results of this study are selectively reported.

Ley (2008) (Ref 46)

This is a study of gender in NIH applications and success rates, from 2003 to 2007. Gender of 3% of PIs was missing. It includes 121,734 applications in the following categories:

Loan repayment program (LRPs): 8,401 applications (54% women);
KO1s: 2,691 applications (52% women);
K23s: 2,871 applications (51% women);
KO8s: 2,795 applications (31% women);
RO1s (first time): 36,351 applications (32% women);
RO1s (experienced): 68,625 applications (25% women).

LRPs are not research grants; the rest are.
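As a rough check on these figures, the Python sketch below adds up the listed categories and estimates the share of women among applicants, with and without the LRPs. It assumes every percentage in the list is the share of women applicants, and the estimates inherit the rounding of those percentages, so they are approximate. Category labels follow the list above.

```python
# (applications, share of women) per category, as listed above.
categories = {
    "LRP": (8_401, 0.54),
    "KO1": (2_691, 0.52),
    "K23": (2_871, 0.51),
    "KO8": (2_795, 0.31),
    "RO1 (first time)": (36_351, 0.32),
    "RO1 (experienced)": (68_625, 0.25),
}


def women_share(names) -> float:
    """Approximate share of women applicants across the named categories."""
    applications = sum(categories[name][0] for name in names)
    women = sum(categories[name][0] * categories[name][1] for name in names)
    return women / applications


total_applications = sum(count for count, _ in categories.values())
research_only = [name for name in categories if name != "LRP"]

print(total_applications)  # 121734, matching the total reported above
print(f"All categories: {women_share(categories):.1%} women")
print(f"Research grants only: {women_share(research_only):.1%} women")
```

Because women made up more than half of LRP applicants, pooling LRPs with the research grants raises the apparent share of women in the sample.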

The authors conclude that funding success rates for women were “near-equivalent”. Size of grants, however, was not assessed.

The following differences were statistically significant:

Female MDs had a 20% first-time success rate for RO1s compared with 24% for men;
Experienced female MD applicants had a 32% success rate for RO1s compared with 36% for men;
Experienced female MD/PhD applicants had a 31% success rate for RO1s compared with 34% for men;
Experienced female PhD applicants had a 33% success rate for RO1s, compared with a 35% success rate for men;
Female MD/PhD applicants had a 19% success rate for KO1s compared with 37% for men (a small category).

When LRP and research grants are all pooled, the success rates were “virtually equivalent” (31% for women, 32% for men).

To examine the cumulative effect of differences across time, Ley analyzed the NIH PI pool in 2007 for gender:

Women were significantly underrepresented (P < 0.001 for all categories). This trend is more severe for women over the age of 50, and is most severe for physician-scientists in both age groups.

The authors note that if the disproportionate career attrition for women continues, “this country will probably experience a shortage of biomedical scientists in the near future”.

Ceci and Williams reported this study as:

[O]n >100,000 NIH submissions in six biomedical categories between 1996 and 2007. The percentages of submissions funded were largely equivalent, with men favored slightly in some categories and women favored in others.

This treats the loan repayment applications as equivalent to the others, and gives the years as 1996 to 2007 instead of 2003 to 2007. There were no categories where women had a statistically significant advantage over men.

Ceci and Williams refer to this study as supporting their conclusions. They provide no assessment of methodological or data quality. They classify the analysis as large-scale.

The results of this study are selectively and inaccurately reported.

Section 3: Interviewing and hiring

Ceci and Williams’ conclusion:

[T]he evidence shows women fare as well as men in hiring, funding, and publishing (given comparable resources). That women tend to occupy positions offering fewer resources is not due to women being bypassed in interviewing and hiring or being denied grants and journal publications because of their sex…[W]omen in math-intensive fields are interviewed and hired slightly in excess of their representation among PhDs applying for tenure-track positions… These results are inconsistent with initiatives promoting gender sensitivity training for search committees and grant panels.

This section does not incorporate data that can adequately address those conclusions. There is data contradictory to the conclusions in the cited references.

This section does not provide any data to support the effects of ending gender sensitivity training and similar initiatives.

Individual study notes

Government Accounting Office (GAO) (2004) (Ref 49)

This GAO report examines 4 government science agencies’ compliance with legislative requirements to protect female students and employees from sex discrimination: Department of Education, Department of Energy, NASA, and the NSF.

Their main findings were that all had complied with requirements to investigate sex discrimination complaints, but only Education “conducted all required monitoring activities”. The GAO required the other 3 to “take actions to ensure that compliance reviews of grantees are conducted as required by Title IX regulations… Energy, NASA and NSF officials reported that they have not conducted any Title IX compliance reviews of their grantees”, although NASA had developed a program to do so.

The GAO also concluded that given the lack of awareness of, and “disincentives for filing complaints against superiors”, investigations of complaints alone by federal agencies are not enough to judge if discrimination exists. They cite a report from the National Center for Education Statistics that concluded: “workplace discrimination is a consistent barrier to women in the sciences”.

That report was based on a longitudinal study of female college students in mathematics, science, and engineering, and concluded that there was a shortage of female students, but that they were just as likely as men to complete their degrees.

Although there had been progress for women scientists, they concluded that women “less often were given the opportunity to focus on their scientific research as their primary work activity”. They concluded discrimination still occurred, based on studies and their site visits to campuses. The GAO also found that: “the proportion of faculty in the sciences who are women has also increased, but they still lag behind men faculty in terms of salary and rank. However, studies indicate that experience, work patterns, and education levels can largely explain these differences”. They cite a figure of 91% for the proportion of difference that can be put down to these factors.

They recommended policies and practices for universities in: inclusive hiring, measuring status of women faculty, addressing climate issues, funding additional education, flexible work schedules, reduced teaching duties, and on-site child care.

Other than the review of compliance and agency policies, this is not a primary study. There are mentions of some feedback during site visits, but there is no full report or analysis of the site visits. There is no reporting of sex discrimination complaint data, and there is no analysis specifically of hiring discrimination.

Ceci and Williams reported on the GAO findings 3 times: some of the anecdotal reporting from site visits is cited in the main body of the text and in the SI (S11), and once “exit interviews” are cited (S2). However, the GAO did not report conducting exit interviews in this report.

In addition, they wrote:

[T]he GAO report mentions studies of pay differentials, demonstrating that nearly all current salary differences can be accounted for by factors other than discrimination, such as women being disproportionately employed at teaching-intensive institutions paying less and providing less time for research. Historically, however, this was not true; women, particularly senior women, lagged behind men in pay and promotion.

The GAO in fact concluded women still lagged (direct quote above), and only 91% of the difference was due to these factors (per NCES, 2002 [PDF]). In addition, it concluded that reduced time for research was not due solely to women’s choices.

Ceci and Williams refer to this report as supporting their conclusions. They provide no assessment of methodological quality. This report does not report data on discrimination (or lack of it) in interviewing or hiring.

The results of this report are selectively and inaccurately reported. This report contradicts rather than supports Ceci and Williams’ conclusions.

Mason (2009) (2009a: Ref 50 in main text (Ref 8 in SI); 2009b: Ref 9 in SI)

The text discusses results of a survey that is incorrectly referenced to Reference 50 (Ref 8 in the SI), which I refer to here as Mason (2009a). Mason (2009a) is the online survey itself, not results of the survey. According to the Internet Archive, this was so in 2010 as well.

In the SI, these survey results are referred to again, with an additional reference (Mason, 2009b). The authors report a 43% response rate out of 19,678 doctoral students at the University of California in 2006 (8,373 respondents). The survey is explicitly about family as well as career and life plans. The material used to solicit participation is not provided. The combination of the subject matter of the survey and the low response rate suggests the survey may not be broadly representative. Other than gender, no data is provided to compare the sample with the overall population.
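To give a sense of why a response rate like this matters, here is a minimal sketch. It recomputes the rate from the two reported counts, then shows the widest range a result could cover if nothing at all is known about the people who did not respond. The 50% answer is purely hypothetical (it is not a figure from the survey), and the function names are mine.

```python
def response_rate(respondents, invited):
    """Fraction of invited people who responded."""
    return respondents / invited


def worst_case_bounds(observed_share, rate):
    """Range for the whole-population share when nothing is known about
    non-respondents: they could all be 'no' (lower) or all 'yes' (upper)."""
    lower = observed_share * rate
    upper = observed_share * rate + (1 - rate)
    return lower, upper


invited = 19678      # doctoral students, as reported
respondents = 8373   # respondents, as reported
rate = response_rate(respondents, invited)

# HYPOTHETICAL: suppose half of the respondents gave a particular answer.
hypothetical_share = 0.5
low, high = worst_case_bounds(hypothetical_share, rate)

if __name__ == "__main__":
    print(f"Response rate: {rate:.1%}")
    print(f"Non-respondents: {invited - respondents}")
    print(f"Worst-case range for a 50% answer: {low:.1%} to {high:.1%}")
```

With these counts, a 50% answer among respondents is compatible with anything from roughly 21% to 79% in the full group. Real non-respondents are unlikely to be that extreme, which is why data comparing the sample with the overall population is needed to narrow the range.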

None of the questions or data provided relate to the issues of interviewing or hiring.

Ceci and Williams refer to this study as providing support for their conclusion about women’s choices. They provide no assessment of methodological or data quality. The response rate of this study is very low, and there is no data provided to indicate its results are representative.

Lubinski (2001) [PDF] (Ref 51)

This study tracked U.S. students identified before they were 13 on the basis of high achievement in mathematics tests. They were identified in the 1980s and followed up in 2003-2004. The response rate was reported as “>80%” here (and “>75%” elsewhere (Ferriman, 2009 [PDF])): 94 female, 286 male (ca 80% response rate), mean age 34. Data on the representativeness of the responders is not reported.

They were compared with graduate students enrolled at top-ranked U.S. universities in mathematics, engineering, and physical science programs in 1992, followed up 10 years later, with a response rate reported as “>80%” (287 female, 299 male), mean age 35. Women were under-represented in these programs, so all women were invited, along with a random sample of the men. Data on the representativeness of the responders at both stages is not reported.

About half the “gifted” group gained doctorates, and closer to 80% of the graduate students. The majority in both groups had not had children.

Response rates to individual questions relevant to the Ceci and Williams paper were patchy. For example, on the question of how many hours were worked, the response rates were 57% for “gifted” women and 76% for “gifted” men; for the graduate students’ group, it was 92% for both. As reported in Ferriman (2009), the full report of this study, there was an extensive number of questions, variables, and analyses, with no method for adjusting for multiple testing reported.
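Why the lack of adjustment matters can be sketched with some simple arithmetic. The example below assumes independent tests at a 0.05 significance level and uses made-up test counts, since the number of analyses in the study is not given here. Bonferroni is shown only as one common correction, not as the method the authors should have used.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / num_tests


if __name__ == "__main__":
    # HYPOTHETICAL test counts, chosen only for illustration.
    for m in (1, 10, 50):
        fwer = familywise_error_rate(m)
        threshold = bonferroni_threshold(m)
        print(f"{m} tests: chance of a false positive {fwer:.3f}, "
              f"Bonferroni threshold {threshold:.4f}")
```

Under these assumptions, the chance of at least one spurious “significant” finding rises quickly with the number of tests, which is why unadjusted results from many analyses need cautious reading.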

None of the data reported related directly to experiences of interviewing or hiring.

Ceci and Williams refer to this study as supporting their conclusion about women’s choices. They provide no assessment of methodological or data quality. This study does not report data on discrimination (or lack of it) in interviewing or hiring.

Brooks (2009) (Ref 25)

The data on women academics and part-time work referred to here is not from the study referenced, but from a press release mentioned in its discussion: Higher Education Statistics Agency (2008). The data quoted are not of science academics, but all academics. They indicate significant lag in rank: women were 17.5% of professors, 36.8% of senior lecturers and researchers, around half of lower ranks, and 62.5% of non-academic staff. This does not report an analysis of interviewing and hiring.

Ceci and Williams refer to this study as supporting their conclusions about women’s choices. They provide no assessment of methodological or data quality. This study does not report data on discrimination (or lack of it) in interviewing or hiring. It is not specifically about women in science. The data reported by Ceci and Williams do not arise from the study cited.

Ginther (2006) (Ref 54)

This study is also discussed in the SI (Ref 19).

It is a book chapter, analyzing waves of data from 1973 to 2001 from the Survey of Doctoral Recipients (SDR), a biennial longitudinal survey by the National Research Council. The sample was PhD recipients 1972 to 1991 who remained in the survey for 10 years. The authors analyze likelihood of tenure and promotion, based on a large number of data variables. No method of adjusting for multiple testing is reported. Response rates are not reported, nor how missing data was handled.

According to the NSF, the response rate for the SDR declined from 97.3% in 1972 to 91.5% in 1998. Within the data, the rate of missing data for marital status and dependents was just over 10% (National Science Foundation, 2000 [PDF]). Non-response is not evenly distributed, among institutions for example. They report that the profile of non-respondents is significantly different to respondents in 5 out of the 8 critical variables they can measure (Hoffer (NSF), 2002 [PDF]).

Conclusions from these models around the effect of children/family in particular have a high risk of bias. Data on age of children was not requested from respondents until 2001, and the rates of missing data were 35 – 43% across those items (Hoffer, 2002[PDF]).

Few of the calculations in the Ginther study are disaggregated for the later period. Differences are found between disciplines.

In the SI, Ceci and Williams wrote (citing ref 19): “when the appropriate statistical controls are exerted, such as controls for having young children, the evidence of past discrimination disappears, with the sex gap entirely explained by fertility decisions”.

Ceci and Williams refer to this study as supporting their conclusions. They provide no assessment of methodological or data quality. The study does not provide enough analyses for later cohorts to judge recent experiences.

National Center for Education Statistics (NCES) (2000) [PDF] (Unreferenced)

A statement about women being employed in lower ranking institutions is unreferenced. The data they cite are included, however, in GAO (2004). The NCES reported on a study of faculty in 1992-93. They used multivariate regression analysis to study the factor of rank of institution Ceci and Williams cite, as well as other factors. They concluded:

[E]ven when comparing male and female faculty with similar characteristics, however, female full-time faculty had lower average base salaries than their male counterparts….This difference may be due to gender differences in still other variables that were not considered in this study, such as taking time out of the labor force for parenting responsibilities. Alternatively, this difference may be due to discrimination: differential returns in terms of salary for similar inputs across gender.

Ceci and Williams quoted the data on women being less likely to work in research universities, and concluded:

Whether this is a consequence of choices freely made, or constrained by gendered expectations related to work-family balance coupled with inflexibility in tenure-track timetables and employment options, is worthy of study.

Ceci and Williams refer to this study as supporting their conclusions. They include no citation or assessment of methodological quality of this report.

The results of this study are selectively reported.

Faculty Committee on Women in Science, Engineering, and Medicine (FCWSEM) (2010) (Ref 55)

This is an analysis on gender differences in career transitions (e.g. tenure). The report is based on 2 surveys in 2004-2005: of faculty and department chairs at major U.S. research universities in 6 fields – biology, chemistry, civil engineering, electrical engineering, mathematics, and physics. The response rate for the departmental survey was 85% (417 of 492 departments).

The researchers identified about 16,400 faculty from their websites and sampled 50 faculty per gender, rank, and field. They were able to identify 1,743 to contact: 1,278 filled out the survey (73%). There was also a review of literature.

Their survey found:

If women applied for positions at Research I institutions, they had a better chance of being interviewed and receiving offers than male job candidates had. Many departments at Research I institutions, both public and private, have made an effort to increase the numbers and percentages of female faculty in science, engineering, and mathematics.

However, they also point to great variation: some departments had no female faculty, and female faculty were less likely at large research institutions.

The report pointed to another study that used logit analysis on another national data set (Perna, 2000). That study included 9,636 faculty, a weighted sample drawn to represent 329,220 faculty in the 1993 National Study of Postsecondary Faculty. Perna found that women and men participating in tenure in general appeared equally likely to be tenured, but that women were less likely to hold the highest rank of full professor, even after controlling for differences in human capital, research productivity, and structural characteristics.

From Perna (2000):

[E]ven after controlling for other variables, average salaries are 8% lower for women than for men assistant professors with three to six years of experience, 9% lower for associate professors with 13 to 20 years of experience, and 6% lower for full professors with more than 20 years of experience.

Perna points to several possible explanations for the difference that could not be resolved from this data. One is that it relates to inequities from earlier years: because salary increases took the form of percentages, early career salary inequities persist. Another is that “differences creep into the process in the years following hiring and promotion”. It seemed that there was greater success in increasing the number of women hired than in ensuring the same chances of progress thereafter.

Ceci and Williams quote only one sentence from FCWSEM (2010) in this section: that if women apply, they had a better chance of being employed.

In the following section on the support for their hypothesis that gender differences are because of choices (free or constrained) by women, this study is referred to again as a reference to this statement:

Not only is it more common for male academic scientists to have children than for female scientists, but males with children are more likely to be tenured than females with children. Compared with males, new female PhDs are less likely to apply for tenure-track posts; and among those who do apply, females are more likely to terminate for family reasons.

The source of the data for this statement is not specified, and I could not ascertain its provenance in a reasonable amount of time looking. Its source would not appear to be primary data from the report. This is a quote from the report about its scope:

Many of the “whys” of the findings included here are buried in factors that the committee was unable to explore. We do not know, for example, what happens to the significant percentage of female Ph.D.s in science and engineering who do not apply for regular faculty positions at RI institutions, or what happens to women faculty members who are hired and subsequently leave the university. And we know little about female full professors and what gender differences might exist at this stage of their careers. We do know that there are many unexplored factors that play a significant role in women’s academic careers, including the constraints of dual careers; access to quality child care; individuals’ perceptions regarding professional recognition and career satisfaction; and other quality-of-life issues. In particular, the report does not explore the impact of children and family obligations (including elder care) or the duration of postdoctoral positions on women’s willingness to pursue faculty positions in RI institutions.

Ceci and Williams refer to this study as supporting their conclusions. They provide no assessment of methodological or data quality.

The results of this study are selectively reported. The report contradicts, more than supports, Ceci and Williams’ conclusions.

Section 4: Hypothesis about women’s choices

Ceci and Williams’ conclusion on reasons for women scientists’ career attrition and gender differences:

It is due primarily to factors surrounding family formation and childrearing, gendered expectations, lifestyle choices, and career preferences — some originating before or during adolescence — and secondarily to sex differences at the extreme right tail of mathematics performance on tests used as gateways to graduate school admission.

This section provides little additional direct or strong evidence in support of this hypothesis.

I have not addressed the issue of sex differences in mathematics performance, as it is neither a prominent conclusion (e.g. in the abstract), nor is there an assessment of evidence of possible relevant ability differences to success in science and gender differences.

Individual study notes

Ceci (2010) (Ref 3)

This refers to a paper by the same authors making the same points as this 2011 paper, based mostly on a subset of the same studies cited.

Some studies were not relied on in the same way in this paper, though. For example, on the subject of disadvantage in hiring, they cite 2 papers to which they do not refer in section 3 above, Trix (2003 [PDF]) and Steinpreis (1999 [PDF]).

Trix (2003) is a study of gender in 300 letters of recommendation at a U.S. medical school in the 1990s, finding that they reinforced gender issues, with more doubt raisers for women than men.

Steinpreis (1999) (Ref 15) is referred to earlier in this Ceci and Williams paper as an example from outside the mathematics field, but not in this section on hiring. It is an experimental study with 238 psychologists, seeking responses to the same (real) CVs with the gender of the person either kept or swapped. It showed bias in favor of males.

This study is a self-citation of a paper making the same arguments as the present paper.

In addition, Ferriman (2009) (Ref 58) is referred to.

This is not an additional study. It is a further publication of Ref (51): Lubinski (2001) (study 3 in section 3 above).

Methodological quality of the data and the analyses are discussed in brief above. It reported mixed results on gender differences in the various analyses. The data are too limited to address the reasons for gender differences.

Section 5: Additional studies entering only in the SI

This group does not strengthen the paper’s case.

There are also some studies cited on differences in career preferences in childhood and adolescence: these have not been addressed, as so many fields now have equal or greater participation of women at undergraduate level.

Individual study notes

Mason (2004) (Ref 5 in SI)

This uses the Survey of Doctorate Recipients (SDR) (NSF 2004) and the University of California Faculty Work and Family Survey (study 2 in interviewing and hiring above, but at an earlier stage), with 4,459 respondents (51% response rate).

They analyzed data from the SDR for 3 groups of people who had received their PhDs from 1978 to 1983 (28 to 33 years before 2011): “ladder-rank faculty men” (reported as weighted n = 27,30), “ladder-rank faculty women” (reported as weighted n = 10,112), and “second tier women” (reported as weighted n = 7,056) (non-tenure track positions, working part-time, or not working). No comparative data on “second tier men” is provided. No method of accounting for multiple testing is reported. Missing data are not reported.

There was a second SDR analysis of these 3 groups, but who received degrees awarded between 1978 and 1994 (from 17 to 33 years before 2011), with different analyses run. The period between 1983 and 1994 is not reported separately, and the “weighted n”s are not reported. Identifying to what extent these data relate to the period Ceci and Williams regard as too long ago to reflect current gender issues was not possible.

The authors draw a number of conclusions about women and family and their choices: in particular that some women with PhDs were delaying childbirth. They do not report that this was a society-wide phenomenon for women with degrees in the 1980s and 1990s. They also attribute the gender difference in tenured positions to women’s choices.

Ceci and Williams refer to this study as supportive of their hypothesis about women’s choices. They provide no assessment of methodological or data quality.

Irvine (1996) [PDF] (Ref 10 in SI)

This study analyzes the percentage of women in full-time university teaching positions in Canada, in periods from the 1960s to 1991. Among the science disciplines, the rates of women in 1990-91 were: 3.6% in engineering and applied sciences, 7.0% in mathematics and physical sciences, 18.2% in agriculture and biological sciences, and 26.6% in health sciences. (Rates in all other areas were 20.0% to 30.1%.) The percentages of women among full professors in the sciences were lower than in other positions: 0.7% in engineering, 1.9% in mathematics/physical sciences, 7.7% in agriculture/biology, and 9.1% in the health sciences.

Irvine estimates the number of women applicants for all disciplines “on the basis of earned doctorates” in the years he estimated they would have graduated in order to be applying for a job of the 1990-1991 rank. He then compares these estimated applicant rates at different periods with the rate of appointments. He provides no data to support the assumption that the rate of doctorate awards is equivalent to application rates for faculty positions, for either gender. Rather, he points to a lower likelihood for women appointed to less-than full professor positions to have doctorates. There is relatively little data on science separately, although what there is shows science faculty to differ markedly.

Irvine concludes that there may have been a “modest degree of discrimination” against women in the 1960s, that contributed to some bias in the gender of current full professors. However, he concludes that the data “are consistent with there being significant discrimination in favour of women and against men” in the 1990s. He further concludes increasing the pool of women faculty had positive social consequences for young women in terms of role models, but

that, in all likelihood, more less-qualified applicants have been appointed over the past 20 years than otherwise would have been the case. If this is true, the consequences (although, again, not easy to judge) will be much less positive. They will include lower standards in teaching and research, poor academic (as opposed to social) role models and generally poor morale among those within the university community (both students and faculty alike) who believe that merit and excellence should not be sacrificed for social expediency. It should also be emphasized that such consequences will typically mitigate any positive social benefits brought about by preferential policies.

Either Ceci or Williams cites this paper here (it is not clear which is “I”):

One might argue (and I certainly would) that due to pronounced preferences for females in academic hiring (10), junior men in high quality institutions have to be slightly more gifted than their female counterparts.

Ceci or Williams cites this study as supporting their conclusions. S/he provides no assessment of the methodological or data quality of this paper. Its conclusions rest on estimates and assumptions not supported with data.

Puuska (2010) (Ref 32 in SI)

This study is of 1,367 scholars at the University of Helsinki from 2002 to 2004. Engineering and technology, sports sciences, and arts are excluded. It examines publications located from the university’s publication register and personnel register. The data for a random sample of the scholars were checked against bibliographic databases and Google Scholar to assess validity of the university sources. Personnel data including gender and funding sources were obtained from the university. In working years, the gender breakdown was 38% female and 62% male.

Ceci and Williams cite this study to support this statement:

Puuska showed that it is the presence of young children (as opposed to older children) that is associated with lower productivity of female faculty (32).

There are no data on children in Puuska’s study or her conclusions. (In the background, she cites studies on the subject.)

Ceci and Williams refer to this study as supporting their claims about young children and female faculty’s productivity. However, there are no data on children in this study. They provide no assessment of the methodological or data quality.

The reporting of this study is inaccurate. The data to which they refer is not the subject of this study.

* The thoughts Hilda Bastian expresses here at Absolutely Maybe are personal, and do not necessarily reflect the views of the National Institutes of Health or the U.S. Department of Health and Human Services.