If you use deep learning for unsupervised part-of-speech tagging of Sanskrit, or for knowledge discovery in physics, you probably don’t need to worry about model fairness. If you’re a data scientist working somewhere decisions are made about people, however, or an academic researching models that will be used to such ends, chances are that you’ve already been thinking about this topic – or feeling that you should. And thinking about this is hard.
It is hard for several reasons. In this text, I will go into just one.
The forest for the trees
Nowadays, it is hard to find a modeling framework that does not include functionality to assess fairness (or is at least planning to). And the terminology sounds so familiar, too: “calibration”, “predictive parity”, “equal true [false] positive rate”… It almost seems as though we could just take the metrics we make use of anyway (recall or precision, say), test for equality across groups, and be done. Let’s assume, for a second, it really were that simple. Then the question still is: which metrics, exactly, do we choose?
In reality, things are not simple. And it gets worse. For very good reasons, the ML fairness literature is closely connected to concepts primarily treated in other disciplines, such as the legal sciences: discrimination and disparate impact (neither being far from yet another statistical concept, statistical parity). Statistical parity means that if we have a classifier, say to decide whom to hire, it should result in the same proportion of applicants from the disadvantaged group (e.g., Black people) being hired as from the advantaged one(s). But that is quite a different requirement from, say, equal true/false positive rates!
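To make statistical parity concrete, here is a minimal sketch in Python (the function names and toy data are mine, purely illustrative):

```python
from collections import Counter

def selection_rates(decisions):
    """Per-group selection rate: fraction of applicants with decision == 1."""
    hired, total = Counter(), Counter()
    for group, decision in decisions:
        total[group] += 1
        hired[group] += decision
    return {g: hired[g] / total[g] for g in total}

def statistical_parity_gap(decisions):
    """Largest difference in selection rates between any two groups."""
    rates = selection_rates(decisions)
    return max(rates.values()) - min(rates.values())

# Toy data: (group, hired?) pairs.
applicants = [("A", 1), ("A", 1), ("A", 0), ("A", 1),
              ("B", 1), ("B", 0), ("B", 0), ("B", 0)]
print(selection_rates(applicants))         # {'A': 0.75, 'B': 0.25}
print(statistical_parity_gap(applicants))  # 0.5
```

Note that this check never looks at the true labels – only at who was selected. That is exactly what distinguishes statistical parity from the error-rate criteria discussed further down.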
So despite all that abundance of software, guides, and even decision trees: this is not a simple, technical decision. It is, in fact, a technical decision only to a small degree.
Common sense, not math
Let me start this section with a disclaimer: Most of the sources referenced in this text appear, or are implied, on the “Guidance” page of IBM’s framework AI Fairness 360. If you read that page, and everything that’s said and not said there appears clear from the outset, then you may not need this more verbose exposition. If not, I invite you to read on.
Papers on fairness in machine learning, as is common in fields like computer science, abound with formulae. Even the papers referenced here, though selected not for their theorems and proofs but for the ideas they harbor, are no exception. But to start thinking about fairness as it might apply to an ML process at hand, common language – and common sense – will do just fine. If, after analyzing your use case, you judge that the more technical results are relevant to the process in question, you will find that their verbal characterizations will often suffice. It is only when you doubt their correctness that you will need to work through the proofs.
At this point, you may be wondering what it is I am contrasting those “more technical results” with. That is the topic of the next section, where I’ll try to give a bird’s-eye characterization of fairness criteria and what they imply.
Situating fairness criteria
Think back to the example of a hiring algorithm. What does it mean for this algorithm to be fair? We approach this question under two – mostly incompatible – assumptions:
The algorithm is fair if it behaves the same way independent of which demographic group it is applied to. Here demographic group could be defined by ethnicity, gender, abledness, or in fact any categorization suggested by the context.
The algorithm is fair if it does not discriminate against any demographic group.
I’ll call these the technical and societal views, respectively.
Fairness, viewed the technical way
What does it mean for an algorithm to “behave the same way” regardless of which group it is applied to?
In a classification setting, we can view the relationship between prediction (\(\hat{Y}\)) and target (\(Y\)) as a doubly directed path. In one direction: Given true target \(Y\), how accurate is prediction \(\hat{Y}\)? In the other: Given \(\hat{Y}\), how well does it predict the true class \(Y\)?
Based on the direction they operate in, metrics popular in machine learning overall can be split into two categories. In the first, starting from the true target, we have recall, together with “the rates”: true positive, true negative, false positive, false negative. In the second, we have precision, together with positive (negative, resp.) predictive value.
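The two directions can be read off as two ways of normalizing the same confusion matrix. As a sketch (function names mine, purely illustrative):

```python
def confusion(y_true, y_pred):
    """Counts of true/false positives and negatives for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def rates_given_truth(y_true, y_pred):
    """Direction 1: condition on the true class Y (recall/TPR, FPR, ...)."""
    tp, fp, fn, tn = confusion(y_true, y_pred)
    return {"tpr": tp / (tp + fn), "fpr": fp / (fp + tn)}

def values_given_prediction(y_true, y_pred):
    """Direction 2: condition on the prediction (precision/PPV, NPV)."""
    tp, fp, fn, tn = confusion(y_true, y_pred)
    return {"ppv": tp / (tp + fp), "npv": tn / (tn + fn)}

y     = [1, 1, 1, 0, 0, 0, 0, 1]
y_hat = [1, 1, 0, 0, 0, 1, 0, 1]
print(rates_given_truth(y, y_hat))        # {'tpr': 0.75, 'fpr': 0.25}
print(values_given_prediction(y, y_hat))  # {'ppv': 0.75, 'npv': 0.75}
```

The same four counts underlie both families; only the conditioning differs.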
If now we demand that these metrics be the same across groups, we arrive at the corresponding fairness criteria: equal false positive rate, equal positive predictive value, etc. In the inter-group setting, the two types of metrics may be arranged under the headings “equality of opportunity” and “predictive parity”. You’ll encounter these as actual headers in the summary table at the end of this text. (Said table organizes concepts from different areas into a three-category format. The overall narrative builds up towards that “map” in a bottom-up way – meaning, most entries will not make sense at this point.)
While the terminology around metrics can be confusing overall (it certainly is to me), these headings have some mnemonic value. Equality of opportunity suggests that people similar in real life (\(Y\)) get classified similarly (\(\hat{Y}\)). Predictive parity suggests that people classified similarly (\(\hat{Y}\)) are, in fact, similar (\(Y\)).
The two criteria can concisely be characterized using the language of statistical independence. Following Barocas, Hardt, and Narayanan (2019), these are:
Separation: Given true target \(Y\), prediction \(\hat{Y}\) is independent of group membership (\(\hat{Y} \perp A | Y\)).
Sufficiency: Given prediction \(\hat{Y}\), target \(Y\) is independent of group membership (\(Y \perp A | \hat{Y}\)).
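For a hard binary classifier, both criteria can be probed empirically by comparing group-conditional metrics: separation amounts to equal TPR and FPR across groups, sufficiency (in the binary case) to equal PPV and NPV. The sketch below (names and toy data mine) constructs data where separation holds but sufficiency fails, because prevalence differs between the groups:

```python
def group_metrics(y, y_hat, a):
    """Per-group TPR/FPR (separation side) and PPV/NPV (sufficiency side)."""
    out = {}
    for g in set(a):
        idx = [i for i, gi in enumerate(a) if gi == g]
        tp = sum(y[i] == 1 and y_hat[i] == 1 for i in idx)
        fp = sum(y[i] == 0 and y_hat[i] == 1 for i in idx)
        fn = sum(y[i] == 1 and y_hat[i] == 0 for i in idx)
        tn = sum(y[i] == 0 and y_hat[i] == 0 for i in idx)
        out[g] = {"tpr": tp / (tp + fn), "fpr": fp / (fp + tn),
                  "ppv": tp / (tp + fp), "npv": tn / (tn + fn)}
    return out

def equal_across_groups(metrics, keys, tol=0.05):
    """True if the named metrics agree across all groups, up to tol."""
    groups = list(metrics)
    return all(abs(metrics[g][k] - metrics[groups[0]][k]) <= tol
               for g in groups for k in keys)

# Group A has prevalence 1/2, group B has 3/4; TPR and FPR are 0.5 in both.
y     = [1, 1, 0, 0,  1, 1, 1, 1, 1, 1, 0, 0]
y_hat = [1, 0, 1, 0,  1, 1, 1, 0, 0, 0, 1, 0]
a     = ["A"] * 4 + ["B"] * 8

m = group_metrics(y, y_hat, a)
print(equal_across_groups(m, ["tpr", "fpr"]))  # True  -> separation holds
print(equal_across_groups(m, ["ppv", "npv"]))  # False -> sufficiency fails
```

With probabilistic scores rather than hard labels, one would instead check calibration by group on the sufficiency side.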
Given those two fairness criteria – and two sets of corresponding metrics – the natural question arises: can we satisfy both? Above, I mentioned precision and recall on purpose: to maybe “prime” you to think in the direction of the precision-recall trade-off. And really, these two categories reflect different preferences; usually, it is impossible to optimize for both. Probably the most famous result is due to Chouldechova (2016): it says that predictive parity (testing for sufficiency) is incompatible with error rate balance (separation) when prevalence differs across groups. This is a theorem (yes, we’re in the realm of theorems and proofs here) that may not be surprising, in light of Bayes’ theorem, but is of great practical importance nonetheless: unequal prevalence usually is the norm, not the exception.
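The arithmetic behind this result follows directly from Bayes’ theorem: for a binary classifier, \(PPV = \frac{TPR \cdot p}{TPR \cdot p + FPR \cdot (1 - p)}\), with \(p\) the prevalence. A minimal sketch (function name mine) shows that with error rates held fixed, PPV must move with prevalence:

```python
def ppv_from_rates(tpr, fpr, p):
    """Positive predictive value implied by Bayes' theorem,
    given true/false positive rates and prevalence p."""
    return (tpr * p) / (tpr * p + fpr * (1 - p))

# Identical error rates in both groups, different prevalence:
print(ppv_from_rates(0.8, 0.2, 0.5))  # 0.8
print(ppv_from_rates(0.8, 0.2, 0.2))  # 0.5
```

So a classifier with equal TPR and FPR across groups cannot also have equal PPV unless prevalence happens to coincide – which is Chouldechova’s point.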
This necessarily means we have to make a choice. And this is where the theorems and proofs do matter. For example, Yeom and Tschantz (2018) show that in this framework – the strictly technical approach to fairness – separation should be preferred over sufficiency, because the latter allows for arbitrary disparity amplification. Thus, in this framework, we may have to work through the theorems.
What is the alternative?
A quick glance at neighboring fields: law and political philosophy
In jurisprudence, fairness and discrimination constitute an important subject. A recent paper that caught my attention is Wachter, Mittelstadt, and Russell (2020a). From a machine learning perspective, the interesting point is its classification of metrics into bias-preserving and bias-transforming. The terms speak for themselves: metrics in the first group reflect biases in the dataset used for training; those in the second do not. In that way, the distinction parallels the confrontation of two “worldviews” in Friedler, Scheidegger, and Venkatasubramanian (2016). But the exact words used also hint at how guidance by metrics feeds back into society: seen as strategies, one preserves existing biases; the other changes the world – to consequences unknown a priori.
To the ML practitioner, this framing is of great help in evaluating which criteria to apply in a project. Helpful, too, is the systematic mapping of metrics to the two groups; it is here that, as alluded to above, we encounter conditional demographic parity among the bias-transforming ones. I agree that in spirit, this metric can be seen as bias-transforming: if we take two sets of people who, per all available criteria, are equally qualified for a job, and then find the white applicants favored over the Black ones, fairness is clearly violated. But the problem lies in “available”: per all available criteria. What if we have reason to assume that, in a dataset, all predictors are biased? Then it will be very hard to prove that discrimination has occurred.
A similar problem, I think, surfaces when we look at the field of political philosophy and consult theories of distributive justice for guidance. Heidari et al. (2018) have written a paper comparing the three criteria – demographic parity, equality of opportunity, and predictive parity – to egalitarianism, equality of opportunity (EOP) in the Rawlsian sense, and EOP seen through the lens of luck egalitarianism, respectively. While the analogy is fascinating, it too assumes that we may take what is in the data at face value. In likening predictive parity to luck egalitarianism, they have to go to especially great lengths, assuming that the predicted class reflects effort exerted. In the table below, I therefore take the liberty to disagree, and map a libertarian view of distributive justice to both equality of opportunity and predictive parity metrics.
In summary, we end up with two highly controversial categories of fairness criteria, one bias-preserving, “what you see is what you get”-assuming, and libertarian, the other bias-transforming, “we’re all equal”-thinking, and egalitarian. Here, then, is that often-announced table.
| | Demographic parity | Equality of opportunity | Predictive parity |
|---|---|---|---|
| A.K.A. / subsumes / related concepts | statistical parity, group fairness, disparate impact, conditional demographic parity | equalized odds, equal false positive / negative rates | equal positive / negative predictive values, calibration by group |
| Statistical independence criterion | independence (\(\hat{Y} \perp A\)) | separation (\(\hat{Y} \perp A \mid Y\)) | sufficiency (\(Y \perp A \mid \hat{Y}\)) |
| Individual / group | group | group (most) or individual (fairness through awareness) | group |
| Distributive justice | egalitarian | libertarian (contra Heidari et al., see above) | libertarian (contra Heidari et al., see above) |
| Effect on bias | transforming | preserving | preserving |
| Policy / “worldview” | We’re all equal (WAE) | What you see is what you get (WYSIWYG) | What you see is what you get (WYSIWYG) |
(A) Conclusion
In line with its original goal – to provide some help in starting to think about AI fairness metrics – this article does not end with recommendations. It does, however, end with an observation. As the last section has shown, amidst all theorems and theories, all proofs and memes, it makes sense not to lose sight of the concrete: the data the model is trained on, and the ML process as a whole. Fairness is not something to be evaluated post hoc; the feasibility of fairness is to be reflected on right from the beginning.
In that regard, assessing impact on fairness is not that different from that essential, but often toilsome and unloved, stage of modeling that precedes the modeling itself: exploratory data analysis.
Thanks for reading!