The Fairness Dilemma: Why Bias Mitigation in AI Deserves a Rethink
In recent years, bias and fairness in AI models have become critical topics of discussion, especially as algorithmic decision-making has led to unintended and sometimes harmful consequences for specific groups.
In recent years, bias1 and fairness in AI models have become critical topics of discussion, especially as algorithmic decision-making has led to unintended and sometimes harmful consequences for specific groups. Notable examples include the “Toeslagenaffaire “ in The Netherlands (Toeslagenaffaire) and the COMPAS case in the United States (COMPAS case). Both of these examples highlight how models that include algorithmic decision making can display unwanted bias without proper governance.
In the Financial Crime Prevention (FCP) domain, AI models are often used to detect criminal behaviour such as fraudulent transactions, money laundering, and tax evasion. These models are not just operational tools – they are subject to regulation. For instance, National Central Banks may mandate that banks demonstrate sufficient effort in identifying financial crime (The DNB on Money Laundering and combating criminal money), with penalties for underperformance. Performance is typically measured using (a derivative of) recall2, which measures the model’s ability to identify as many true cases of criminal behaviour as possible.
Recently, several Dutch banks faced fines from regulators for failing to meet these requirements, underscoring the pressure to maximize recall.
However, this focus on maximizing recall must be balanced with fairness. Regulators also require banks to detect and mitigate unwanted bias in their models (EBA report, AI Act), which relates to the metric precision2 – the proportion of flagged cases that are actually criminal. A low precision rate can result in clients being wrongly flagged, which raises ethical and legal concerns.
The Pitfalls of Conventional Bias Fixes
Assessing the fairness of a model typically includes comparing precision-like metrics across different groups with similar sensitive information (e.g., groups based on gender or ethnicity). If significant disparities are observed between groups, steps are typically taken to align the precision values more closely. While this approach is popular and intuitive, it comes with several challenges:
- Conflicting objectives: Precision and recall are inherently at odds. Optimizing for one generally compromises the other. Banks must navigate regulatory demands for high recall while also ensuring fair treatment across groups, with no established best practices to guide these trade-offs.
- Levelling down: Achieving equal precision across groups can involve either improving the disadvantaged group's performance or reducing the advantaged group's performance. In practice, improving precision for disadvantaged groups is often infeasible due to limited data or inconsistent behavioural patterns. This leads to "levelling down" – artificially lowering the precision of the advantaged group to achieve parity (The Unfairness of Fair Machine Learning: Levelling down and strict egalitarianism by default). While this may equalize precision metrics, it does not improve outcomes for the disadvantaged group and often degrades overall model performance. Therefore, whether this approach is truly fair remains a subject of debate.
In the context of FCP, one could challenge the assumption that the goal should be to achieve strict statistical parity between groups at the cost of lower recall. For a model that e.g. detects fraudulent transactions, achieving strict parity by decreasing the recall for advantaged groups could be considered inappropriate, with potentially harmful societal consequences. Levelling down is a classic example of Goodhart’s Law that “when a measure becomes a target, it ceases to be a good measure”. - Use of sensitive information: Testing for unwanted model bias usually involves customer segmentation based on sensitive attributes. However, this process can lead to the sensitive data being used in the model development process, either directly (e.g., as input features) or indirectly (e.g., through group-specific decision rules or parameters). Under the General Data Protection Regulation (GDPR), this is generally prohibited.
Potential Next Steps
Bias and fairness considerations, especially within Financial Crime Prevention, require moving beyond the simple pursuit of strict statistical parity. In addition to established practices, more nuanced and responsible approaches should be considered:
- Establish a bank-wide “bias and fairness” committee: Create a bank-wide diverse committee consisting of experts across different departments such as modeling & data, compliance, and risk. This committee should define unified fairness principles, oversee their consistent application across departments, and act as a governance hub for addressing emerging fairness concerns in AI-driven decision-making.
- Develop an impact-based fairness framework: Move beyond solely equalizing metrics by building a framework that measures the tangible impact of bias on different customer groups. This enables institutions to focus resources where potential harm is greatest, ensuring fairness interventions deliver meaningful outcomes.
- Leverage explainability tools to detect hidden biases: Tools like SHapley Additive exPlanations (SHAP) are commonly used to meet regulatory requirements regarding explainability. Beyond compliance, SHAP can also help uncover which (combinations of) features act as proxies for sensitive attributes like gender or ethnicity. This could help to proactively detect hidden biases and strengthen fairness in FCP models.
This article is part of a larger series highlighting crucial topics for the future of financial crime prevention. See an overview of the whole series here.
Navigating the complex ecosystem of bias and fairness metrics in the context of FCP demands deep domain expertise and a clear understanding of the regulatory and ethical landscape. If you need guidance in translating complex fairness metrics into actionable, compliant, and effective practices, Zanders can help.
Citations
- Note that in this blog, bias refers to ethical bias. Specifically, cases where a model produces systematically more favorable or unfavorable outcomes for certain groups of individuals based on sensitive attributes such as ethnicity, gender, or similar characteristics. ↩︎
- In the bias and fairness literature, there is a large amount of metrics used, typically categorized in Independence, Sufficiency and Separation metrics (see e.g. A clarification of the nuances in the fairness metrics landscape | Scientific Reports ). For the purpose of this blog, the metrics Precision and Recall are used, as these are most commonly used within the context of FCP. ↩︎