Most scientific studies report “statistically significant” results that support whatever effect the authors are examining, typically citing p-values below .05 as sufficient evidence. This is true in adverse impact analyses as well. A previous blog in the statistical significance series outlines what the term means, and mentions an issue that we will focus on in this part of the series: identifying when chance alone may be at play.
In the most commonly used statistical models, the p-value tells us how likely the observed data (or data even more extreme) would be if the effect specified in a given hypothesis did not exist (for example, if there were truly no mean differences between protected classes). This is called null hypothesis testing. So, a p-value of .05 indicates that, if the null hypothesis were true (i.e., no mean differences between protected classes), there would be only a 5% chance of observing data that depart from it as much as ours do. Importantly, the p-value is not the probability that the null hypothesis is true; it describes how surprising the data would be if it were.
When the data would be that surprising under the null hypothesis, we typically act on it: when p < .05, we conclude the observed pattern of data is “significant” evidence against the null hypothesis and, therefore, infer support for the hypothesized effect. In other words, this “significant” evidence does not tell us why there are differences, only that we are fairly confident, based on the data, that differences exist.
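To make this concrete, here is a minimal sketch, assuming Python with numpy and scipy and using made-up scores (not data from any actual analysis), of how a two-sample test of mean differences produces a p-value.

```python
# Minimal sketch with hypothetical data: testing whether mean scores differ
# between two groups, as in a mean-difference adverse impact analysis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical selection-procedure scores for two protected classes.
group_a = rng.normal(loc=50, scale=10, size=120)
group_b = rng.normal(loc=47, scale=10, size=100)

# Null hypothesis: the two groups have equal population means.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# If p < .05, the mean difference is called "statistically significant":
# data this far apart would be unlikely if there were truly no difference.
```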
Things get trickier when running multiple statistical tests on your data. The 5% Type I error rate on each test (i.e., testing at a significance level of .05) means that if you were to run 100 analyses on data with no real underlying differences, you’d expect to incorrectly conclude that the hypothesized effect exists about five times. In other words, some results may be false alarms (i.e., the pattern in the data may seem to reflect the hypothesized effect but is in fact due to random variation). If you’re only running a few adverse impact analyses, this possibility is often ignored without much consequence. It’s a different story, however, when running proactive AAP analyses, where the number of individual analyses can easily run into the thousands!
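A small simulation sketch (again in Python with simulated data, not real AAP results) shows the false-alarm rate directly: every test below is run on two groups drawn from the same distribution, so any “significant” result is a false positive.

```python
# Run many tests where the null hypothesis is TRUE (both groups share the
# same mean) and count how many come out "significant" purely by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests, alpha = 100, 0.05
false_alarms = 0

for _ in range(n_tests):
    group_a = rng.normal(loc=50, scale=10, size=100)
    group_b = rng.normal(loc=50, scale=10, size=100)  # same mean: no real effect
    _, p = stats.ttest_ind(group_a, group_b)
    if p < alpha:
        false_alarms += 1

# Expect roughly alpha * n_tests = 5 false alarms out of 100 tests.
print(f"False alarms: {false_alarms} out of {n_tests}")
```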
For example, assume a company has 100 AAPs and 30 job groups. Running male/female adverse impact analyses on applicants alone could mean up to 3,000 analyses (100 × 30). By the time analyses are also run on promotions and terminations, we are up to 15,000 analyses. Statistically speaking, even if no real disparities existed, we’d expect to see around 150 applicant analyses (3,000 × 5%) and 750 overall analyses (15,000 × 5%) with potentially false positive results at a p-value threshold of .05. A recent statement from the American Statistical Association (ASA) cautions that when multiple analyses are run, reporting only the p-values that fall below a significance threshold, without reporting the number, types, and hypotheses behind all of the statistical analyses run, makes those “reported p-values essentially uninterpretable.” So, how do we know whether such results reflect real disparities or just the luck of the draw?
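Using the counts from this example, the arithmetic can be sketched as follows (the 3,000 and 15,000 figures come from the scenario above; the “at least one false positive” calculation is an added illustration that assumes independent tests, which real AAP analyses are not exactly):

```python
# Expected false positives when every test uses alpha = .05 and no real
# disparities exist; independence is assumed for the "at least one" figure.
alpha = 0.05

for n_tests in (3_000, 15_000):
    expected_false_positives = alpha * n_tests          # 150 and 750
    prob_at_least_one = 1 - (1 - alpha) ** n_tests      # essentially 1.0
    print(f"{n_tests:>6} tests: ~{expected_false_positives:.0f} expected false "
          f"positives; P(at least one) = {prob_at_least_one:.6f}")
```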
One option, and the approach OFCCP previously recommended in a 1979 Technical Advisory Committee (TAC) manual, is to apply a statistical correction that accounts for the Type I error rate (i.e., false alarms) when running repeated analyses. The Bonferroni correction is one of the most widely used, though there are others as well. These corrections essentially work by adjusting the required significance level based on the total number of analyses run. For example, if 20 tests are run at a significance level of 0.05, applying the Bonferroni correction (0.05 ÷ 20) means significance would only be asserted when a p-value is less than or equal to 0.0025.
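A minimal sketch of that Bonferroni arithmetic, using hypothetical p-values rather than real analysis results (the statsmodels library offers an equivalent adjustment via its stats.multitest.multipletests function, if preferred):

```python
# Bonferroni sketch: divide the significance level by the number of tests
# and apply the stricter threshold to every p-value.
import numpy as np

alpha = 0.05
# Hypothetical p-values from 20 separate analyses.
p_values = np.array([0.001, 0.004, 0.012, 0.030, 0.049] + [0.20] * 15)

bonferroni_threshold = alpha / len(p_values)        # 0.05 / 20 = 0.0025
significant_uncorrected = int(np.sum(p_values < alpha))
significant_corrected = int(np.sum(p_values <= bonferroni_threshold))

print(f"Corrected threshold:          {bonferroni_threshold}")
print(f"Significant at .05:           {significant_uncorrected}")
print(f"Significant after correction: {significant_corrected}")
```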
The field has yet to converge on clear guidance as to which correction is best or when one must be used. However, knowing what “statistically significant” actually means, coupled with an awareness of the practical realities of running hundreds or thousands of tests, will put you in a good position to ask the right questions when working through proactive AAPs.
By Kristen Pryor, Consultant, and Sam Holland, Consultant at DCI Consulting Group