Fairness Week 2: Predicting recidivism¶

How to use this notebook

Simply read the text and follow the instructions.
This notebook contains code cells, which can be modified and must be executed to see the result of their content.
To execute a cell, select it and click the play button (▶) in the toolbar, or press Shift + Enter or Ctrl + Enter.

As the variables contained in a cell are stored in memory, the order of execution of the cells is important!

Notebook by Eugène Bergeron, Cécile Hardebolle and the Responsible software team (2025).

Exercises adapted from the "Human Contexts & Ethics" (HCE) notebook "Algorithmic Fairness: Considering Different Definitions" from UC Berkeley, by Alyssa Sugarman, Eva Newsom, and Sammy Raucher. You can find more information about their work here.
Source: https://github.com/ds-modules/HCE-Materials

Except where otherwise noted, the content of this notebook is licensed under a Creative Commons Attribution International License (CC BY 4.0 International).


Introduction¶

Hello and welcome to this notebook! This week we will study fairness in Machine Learning. This notebook covers topics very similar to those of the previous one, but applied specifically to Machine Learning.

COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is a commercial recidivism risk assessment system produced by the for-profit company Northpointe (now part of equivant). Tools like COMPAS are used to predict the risk of future crimes for an individual who has entered the US criminal justice system by outputting a risk score from 1 to 10. Many believe that these algorithms can make the court system more just by removing or correcting for the bias of criminal justice officials.

While COMPAS was initially intended to aid decisions made by probation officers on treatment and supervision of those who are incarcerated, Northpointe has since emphasized the scalability of the tool to “fit the needs of many different decision points” including pre-screening assessments, pretrial release decisions (whether or not to hold an arrested individual in jail until their trial), and post-trial next steps for the defendant.

Today we are going to explore the fairness of this tool, and try to make one of our own that is the fairest that we can achieve!

Warning

This notebook is long, therefore we have flagged sections that can be skipped (with the [Optional] tag).
Of course doing the whole notebook will give you a more concrete understanding of the concepts that you will see in the videos, but you could come back to it later if you want.
Enjoy!

Learning goals

What will be covered:

  • Part 1: How to measure fairness, the point of view of different stakeholders
  • Part 2: Designing a Machine Learning model that is fairer
  • Part 3: Reducing bias in data

By the end of the session you will be able to:

  • ✅ Use different metrics to evaluate the fairness of a model and analyze the results you get from those measures
  • ✅ Explain when and why fairness metrics can give incompatible results
  • ✅ Use a Logistic Regression model in order to build a classifier
  • ✅ Apply different techniques to improve the fairness of a model
  • ✅ Apply different techniques to reduce bias in data

Setup and dataset exploration¶

Let's begin by importing the packages we need.

Instructions

Run the cell below to import the necessary libraries.


The cell below may print warnings. You can ignore them: they won't affect the rest of the notebook, and you don't need the "tensorflow" package to run it.

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import datetime
from rsrc.tests import *
from rsrc.src import fit_and_display, load_compas_data, Helper, LRModel
from aif360.sklearn.preprocessing import Reweighing
from aif360.sklearn.preprocessing import LearnedFairRepresentations
from aif360.sklearn.inprocessing import ExponentiatedGradientReduction

from warnings import filterwarnings, simplefilter
simplefilter(action='ignore', category=FutureWarning)
filterwarnings("ignore", "lbfgs")
WARNING:root:No module named 'tensorflow': AdversarialDebiasing will be unavailable. To install, run:
pip install 'aif360[AdversarialDebiasing]'
WARNING:root:No module named 'inFairness': SenSeI and SenSR will be unavailable. To install, run:
pip install 'aif360[inFairness]'
WARNING:root:No module named 'rpy2': FairAdapt will be unavailable. To install, run:
pip install 'aif360[FairAdapt]'

The dataset¶

We will be using the data that was obtained and used by ProPublica in their own analysis of the COMPAS tool from Broward County public records of people who were scored by COMPAS between 2013 and 2014. We will therefore work with data concerning real people. This represents 6172 cases.

In [2]:
data = load_compas_data()

# displaying
display(data.head())
print(f"Columns available: {list(data.columns)}")
print(f"Total number of cases: {data.shape[0]}")
id name first last sex dob age age_cat race juv_fel_count ... priors_count c_jail_in c_jail_out c_offense_date c_arrest_date c_charge_degree c_charge_desc is_recid in_custody out_custody
0 1 miguel hernandez miguel hernandez Male 1947-04-18 69 Greater than 45 Other 0 ... 0 2013-08-13 06:03:42 2013-08-14 05:41:20 2013-08-13 NaN F Aggravated Assault w/Firearm 0 2014-07-07 2014-07-14
1 3 kevon dixon kevon dixon Male 1982-01-22 34 25 - 45 African-American 0 ... 0 2013-01-26 03:45:27 2013-02-05 05:36:53 2013-01-26 NaN F Felony Battery w/Prior Convict 1 2013-01-26 2013-02-05
2 4 ed philo ed philo Male 1991-05-14 24 Less than 25 African-American 0 ... 4 2013-04-13 04:58:34 2013-04-14 07:02:04 2013-04-13 NaN F Possession of Cocaine 1 2013-06-16 2013-06-16
5 7 marsha miles marsha miles Male 1971-08-22 44 25 - 45 Other 0 ... 0 2013-11-30 04:50:18 2013-12-01 12:28:56 2013-11-30 NaN M Battery 0 2013-11-30 2013-12-01
6 8 edward riddle edward riddle Male 1974-07-23 41 25 - 45 Caucasian 0 ... 14 2014-02-18 05:08:24 2014-02-24 12:18:30 2014-02-18 NaN F Possession Burglary Tools 1 2014-03-31 2014-04-18

5 rows × 23 columns

Columns available: ['id', 'name', 'first', 'last', 'sex', 'dob', 'age', 'age_cat', 'race', 'juv_fel_count', 'decile_score', 'juv_misd_count', 'juv_other_count', 'priors_count', 'c_jail_in', 'c_jail_out', 'c_offense_date', 'c_arrest_date', 'c_charge_degree', 'c_charge_desc', 'is_recid', 'in_custody', 'out_custody']
Total number of cases: 6172

Explanation of all features

  • id: Case identifier, unique
  • name: Full name of the defendant
  • first: First name of the defendant
  • last: Last name of the defendant
  • sex: Sex of the defendant, can be one of: ["Male", "Female"]
  • dob: Date of birth of the defendant
  • age: Age of the defendant
  • age_cat: Age category of the defendant, can be one of: ["Greater than 45", "25 - 45", "Less than 25"]
  • race: Race category of the defendant, can be one of: ["Other", "African-American", "Caucasian", "Hispanic", "Asian", "Native American"]
  • juv_fel_count: Number of juvenile felony offenses committed by the defendant
  • juv_misd_count: Number of juvenile misdemeanor offenses committed by the defendant
  • juv_other_count: Number of other juvenile offenses committed by the defendant
  • priors_count: Number of offenses committed by the defendant prior to this case
  • c_jail_in: Date when the defendant was jailed for this case
  • c_jail_out: Date of release of the defendant
  • c_offense_date: Date of the offense (if NaN, see c_arrest_date)
  • c_arrest_date: Date of arrest of the defendant (if NaN, see c_offense_date)
  • c_charge_degree: Degree of the offense committed, can be one of ["F", "M"], resp. Felony or Misdemeanor
  • c_charge_desc: Charge description
  • is_recid: Indicator whether the defendant has recidivated after this case
  • in_custody: Date when the defendant was put in custody
  • out_custody: Date when the defendant was released from custody
  • decile_score: Score given by the COMPAS model, 1 means low risk, 10 means high risk

Accuracy¶

As a first exercise, let's figure out whether COMPAS is good at what it is supposed to do. As you can see in the cell above, the column is_recid contains the ground truth (i.e., whether the defendant did recidivate or not) and decile_score contains the COMPAS prediction, on a scale from 1 to 10. For the sake of simplicity, we will consider that a score >= 5 means that COMPAS predicted that the defendant will recidivate, which is a standard threshold in analyses of COMPAS.

Instructions

Complete the cell below so that:

  • It computes the accuracy of the COMPAS tool ($\frac{\text{number of correct predictions}}{\text{total number of defendants}}$) over all the data

Recall that:

  • to get the rows of a DataFrame where two columns are equal, you can do df[df['col1'] == df['col2']]
  • you can get the number of rows of a DataFrame using df.shape[0]

Don't hesitate to look again at the tutorial on booleans in pandas we gave you last week!
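
As a quick refresher on the two pandas idioms recalled above, here is a minimal sketch on a made-up DataFrame. The column names `truth`, `score` and `pred`, and all the values, are invented for illustration and are not taken from the COMPAS data.

```python
import pandas as pd

# Toy data, invented for illustration only (not the COMPAS dataset)
toy = pd.DataFrame({
    "truth": [1, 0, 1, 0, 0],
    "score": [7, 2, 3, 6, 1],
})

# Boolean column: True when the score reaches the threshold
toy["pred"] = toy["score"] >= 5

# Keep only the rows where the two columns agree, then count them
n_correct = toy[toy["truth"] == toy["pred"]].shape[0]
toy_accuracy = n_correct / toy.shape[0]

print(f"{n_correct} correct predictions out of {toy.shape[0]} rows")
```

Comparing an integer column (0/1) with a boolean column works here because pandas treats 1 as equal to True and 0 as equal to False.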

In [3]:
### YOUR CODE HERE
# We add a column that contains whether the defendant has been predicted as potential recidivist or not
data['predicted_recid'] = data['decile_score'] >= 5 # SOLUTION

# We now compute the accuracy of the compas tool, using the column 'is_recid' (see above if you don't understand what the column 'is_recid' stands for)  
accuracy_compas = data[data['is_recid'] == data['predicted_recid']].shape[0] / data.shape[0] # SOLUTION
### END OF YOUR CODE

# simple test to check if you have the correct value :)
test_values(round(accuracy_compas*100), "accuracy_compas")

print(f"Accuracy of the COMPAS tool: {accuracy_compas*100:.1f}%")
🆗 Tests passed ! =)
Accuracy of the COMPAS tool: 65.9%

Reflection time!

Q1) What do you think of the accuracy of COMPAS?

Q2) Is it good enough for the context it is used in?

Feedback - Click on the "..." below only once you have really tried to answer the question!

COMPAS is only about 16 percentage points more accurate than a coin flip (65.9% vs. 50%). Given the stakes for people, who risk prison time, and the influence the software can have on judges, such an accuracy cannot be considered great...

Let's now try to investigate the fairness of this tool!

Groups in the dataset¶

During this exercise, we will focus on fairness with respect to race, as this is the most common use case of the ProPublica dataset. But we could have focused on other sensitive attributes, like sex.

ProPublica indicates they have used the race classifications from the Sheriff’s Office of Broward County, which identifies defendants as Black, White, Hispanic, Asian and Native American. In this notebook, we will focus on two groups and compare Black and White defendants.

Let's have a look at the number of cases in the dataset for each of these two groups.

Instructions

Complete the cell below such that:

  • black_defendants is a DataFrame containing only the cases where the defendant's race is African-American
  • white_defendants is a DataFrame containing only the cases where the defendant's race is Caucasian
  • You compute the number of rows in each

Hint

If you really don't know: take a look at this.

In [4]:
print(f"All races: {list(data['race'].unique())}")

### YOUR CODE HERE
black_defendants = data[data["race"] == "African-American"] # SOLUTION
black_defendants_number = black_defendants.shape[0] # SOLUTION
print(f"Number of cases for Black defendants:{black_defendants_number}")

white_defendants = data[data["race"] == "Caucasian"] # SOLUTION
white_defendants_number = white_defendants.shape[0] # SOLUTION
print(f"Number of cases for White defendants:{white_defendants_number}")
### END OF YOUR CODE

test_values((white_defendants.shape[0], black_defendants.shape[0]), "defendants_counts")
All races: ['Other', 'African-American', 'Caucasian', 'Hispanic', 'Asian', 'Native American']
Number of cases for Black defendants:3175
Number of cases for White defendants:2103
🆗 Tests passed ! =)

Reflection time!

Q1) What could cause this difference in the number of defendants between the two races?

Q2) Could a bias be one of these causes?

Feedback - Click on the "..." below only once you have really tried to answer the question!

We can see 3 possible reasons for this difference:

  • The number of Black people in America is 1.5 times the number of White people
  • Black people statistically commit more crimes than White people (at the time of the data gathering)
  • Black people are arrested and tried more often than White people (at the time of the data gathering)

The interesting option here is the 3rd one: the 1st is easily shown to be false, and the 2nd is almost unprovable, for the same reasons that underlie the 3rd.
The 3rd option is an example of the measurement bias that you have seen in the lectures from last week! In our case, the issue is that the COMPAS algorithm is not trained on data relating to crime and recidivism, because such data is impossible to obtain, but on data relating to arrests and rearrests. Do you see the difference? And how bias can creep in?
If you have trouble understanding this part, watch video 3.2 "Sources of unfairness" in the Fairness 1 chapter again, or call a TA! They are here for that :)


Part 1. Evaluating fairness, not trivial.¶

1.1 - ProPublica's perspective¶

ProPublica is a nonprofit organization that “produces investigative journalism with moral force”. ProPublica was founded as a nonpartisan newsroom aiming to expose and question abuses of power, justice, and public trust, often by systems and institutions deeply ingrained in the US.

In 2016, ProPublica investigated the COMPAS algorithm to assess the accuracy of and potential racial bias within the tool, as it became more popular within the United States court system nationwide. In their analysis, ProPublica tested for statistical differences in outcomes for Black and White defendants that we will now reproduce in order to get their point of view. The main metrics used by ProPublica in their analysis are metrics that you have seen in the Safety 1 notebook: the False Positive Rate (FPR) and the False Negative Rate (FNR).
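
As a reminder, and assuming the usual confusion-matrix notation (TP, FP, TN, FN, with recidivism as the positive class), these two rates can be written as:

$\text{FPR}=\frac{\text{FP}}{\text{FP}+\text{TN}}=\frac{\text{FP}}{\text{Actual N}}$

$\text{FNR}=\frac{\text{FN}}{\text{FN}+\text{TP}}=\frac{\text{FN}}{\text{Actual P}}$

Both denominators count people according to what actually happened (did not recidivate, or did recidivate), not according to what was predicted.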

Instructions

Complete the functions below so that:

  • fpr returns the False Positive Rate
  • fnr returns the False Negative Rate
In [5]:
def fpr(data):
    """
    Return the false positive rate for the data, using is_recid and the decile_score (or predicted_recid)
    """
    ### YOUR CODE HERE
    # dataframe with defendants who did not recidivate
    did_not_recidivate = data[data["is_recid"] == 0] # SOLUTION

    # number of defendants who did not recidivate (actual negatives)
    n = did_not_recidivate.shape[0] # SOLUTION
    
    # number of defendants predicted to recidivate among those who actually did not recidivate (false positives)
    fp = did_not_recidivate[did_not_recidivate["decile_score"] >= 5].shape[0] # SOLUTION

    # false positive rate
    result = fp / n # SOLUTION
    ### END OF YOUR CODE

    return result


def fnr(data):
    """
    Return the false negative rate for the data, using is_recid and the decile_score (or predicted_recid)
    """
    ### YOUR CODE HERE
    # dataframe with defendants who did recidivate
    recidivated = data[data["is_recid"] == 1] # SOLUTION
    
    # number of defendants who did recidivate (actual positives)
    p = recidivated.shape[0] # SOLUTION

    # number of defendants predicted NOT to recidivate among those who actually did recidivate (false negatives)
    fn = recidivated[recidivated["decile_score"] < 5].shape[0] # SOLUTION

    # false negative rate
    result = fn / p # SOLUTION
    ### END OF YOUR CODE

    return result

Test your code with the cell below.

In [6]:
resume_exec = True

print("Testing FPR...")
resume_exec = test(fpr)
print()
print("Testing FNR...")
resume_exec = test(fnr) and resume_exec

# This part won't execute if the tests don't pass
if resume_exec:
    propublica_point = pd.DataFrame(data={'White': [f"{int(fpr(white_defendants)*100)}%", f"{int(fnr(white_defendants)*100)}%"], 'Black': [f"{int(fpr(black_defendants)*100)}%", f"{int(fnr(black_defendants)*100)}%"]}, index=["Among non-recidivists: % of high scored (FPR)", "Among recidivists: % of low scored (FNR)"])
    display(propublica_point.style.set_caption("ProPublica metrics"))
Testing FPR...
🆗 Tests passed ! =)

Testing FNR...
🆗 Tests passed ! =)
ProPublica metrics
  White Black
Among non-recidivists: % of high scored (FPR) 21% 41%
Among recidivists: % of low scored (FNR) 50% 29%

Reflection time!

Q1) What is the problem here?

Q2) What do you think ProPublica concluded from these results?

Feedback - Click on the "..." below only once you have really tried to answer the question!

There are many more false positives among Black defendants, and many more false negatives among White defendants, i.e. the model overpredicts recidivism for Black defendants and underpredicts it for White defendants.

ProPublica concluded that COMPAS has a racial bias because it has been trained on data from a criminal justice system that has a history of racial injustices.


Note:

ProPublica also pointed out that using COMPAS in court could result in perpetuating these racial injustices, leaving no chance of breaking out of this vicious circle. This phenomenon is called a feedback loop, a concept you have already seen in Safety 2.

Northpointe disputed this claim with another fairness measurement, and we will see that right now!


1.2 - Northpointe's perspective¶

Northpointe (merged with two other companies to create equivant) is a for-profit computer software company that aims to advance justice by informing and instilling confidence in decision makers at every stage of the criminal justice system.

In the wake of criticism from ProPublica and other researchers, Northpointe produced a detailed response to ProPublica’s allegations, claiming that these critiques of their tool utilized the wrong type of classification statistics in their analysis and portrayed the tool incorrectly.

To support their claim, they computed the Positive Predictive Value (PPV) and Negative Predictive Value (NPV), which are computed as follows:

$\text{PPV}=\frac{\text{TP}}{\text{TP}+\text{FP}}=\frac{\text{TP}}{\text{Predicted P}}$

$\text{NPV}=\frac{\text{TN}}{\text{TN}+\text{FN}}=\frac{\text{TN}}{\text{Predicted N}}$

One way to see these metrics is the following: instead of looking at the predictions relative to the truth (as FPR and FNR do), they look at the truth relative to the predictions. This way, you can see the proportion of correct predictions among all the predictions of a given type (positive or negative).

But we cannot use these metrics directly, because they are not comparable with the ones used by ProPublica, which focus on error rates rather than on rates of correct predictions. Therefore, we need to compute 1 - PPV and 1 - NPV. The intuition behind these metrics will become clearer once they are displayed.

If you have trouble understanding those metrics, watch video 4.2 "Group fairness" and don't hesitate to ask a TA!
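
To make the difference between the two families of metrics concrete, here is a small sketch computed from a single confusion matrix. The four counts are hypothetical numbers invented for illustration, and the variable names carry a `_toy` suffix so that they do not clash with the functions of this notebook.

```python
# Hypothetical confusion-matrix counts for one group (invented numbers)
tp, fp, tn, fn = 30, 20, 40, 10

# Error rates conditioned on the TRUTH (the view used by ProPublica)
fpr_toy = fp / (fp + tn)  # among actual negatives, share predicted positive
fnr_toy = fn / (fn + tp)  # among actual positives, share predicted negative

# Error rates conditioned on the PREDICTION (complements of PPV and NPV)
ppv_complement_toy = fp / (tp + fp)  # 1 - PPV: among predicted positives, share of actual negatives
npv_complement_toy = fn / (tn + fn)  # 1 - NPV: among predicted negatives, share of actual positives

print(f"FPR   = {fpr_toy:.2f}   FNR   = {fnr_toy:.2f}")
print(f"1-PPV = {ppv_complement_toy:.2f}   1-NPV = {npv_complement_toy:.2f}")
```

The same 20 false positives are divided by 60 actual negatives for the FPR, but by 50 predicted positives for 1 - PPV: the numerators are identical, only the reference population changes.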

Instructions

Complete the cell below such that:

  • ppv_complement returns the complement of the Positive Predictive Value
  • npv_complement returns the complement of the Negative Predictive Value
    (the formulas are described above)
In [7]:
def ppv_complement(data):
    """
    Return the complement of the Positive Predictive Value for the data, using is_recid and the decile_score (or predicted_recid)
    """
    ### YOUR CODE HERE
    # dataframe with defendants who were predicted to recidivate
    predicted_recidivate = data[data["decile_score"] >= 5] # SOLUTION

    # number of defendants who were predicted to recidivate (predicted positives)
    pp = predicted_recidivate.shape[0] # SOLUTION

    # number of defendants who actually recidivated among those predicted to recidivate (true positives)
    tp = predicted_recidivate[predicted_recidivate["is_recid"] == 1].shape[0] # SOLUTION

    # positive predictive value
    ppv = tp / pp # SOLUTION
    ### END OF YOUR CODE

    return 1-ppv


def npv_complement(data):
    """
    Return the complement of the Negative Predictive Value for the data, using is_recid and the decile_score (or predicted_recid)
    """
    ### YOUR CODE HERE
    # dataframe with defendants who were predicted NOT to recidivate
    predicted_not_recidivate = data[data["decile_score"] < 5] # SOLUTION
    
    # number of defendants who were predicted not to recidivate (predicted negatives)
    pn = predicted_not_recidivate.shape[0] # SOLUTION
    
    # number of defendants who actually did NOT recidivate among those predicted NOT to recidivate (true negatives)
    tn = predicted_not_recidivate[predicted_not_recidivate["is_recid"] == 0].shape[0] # SOLUTION

    # negative predictive value
    npv = tn / pn # SOLUTION
    ### END OF YOUR CODE

    return 1-npv
In [8]:
resume_exec = True

print("Testing PPV complement...")
resume_exec = test(ppv_complement)
print()
print("Testing NPV complement..")
resume_exec = test(npv_complement) and resume_exec

# This part won't execute if the tests don't pass
if resume_exec:
    northpointe_point = pd.DataFrame(data={'White': [f"{int(ppv_complement(white_defendants)*100)}%", f"{int(npv_complement(white_defendants)*100)}%"], 'Black': [f"{int(ppv_complement(black_defendants)*100)}%", f"{int(npv_complement(black_defendants)*100)}%"]}, index=["Among the high scored: % of non-recidivist (1-PPV)", "Among the low scored: % of recidivists (1-NPV)"])
    display(northpointe_point.style.set_caption("NorthPointe metrics"))
Testing PPV complement...
🆗 Tests passed ! =)

Testing NPV complement..
🆗 Tests passed ! =)
NorthPointe metrics
  White Black
Among the high scored: % of non-recidivist (1-PPV) 38% 31%
Among the low scored: % of recidivists (1-NPV) 31% 39%

Reflection time!

Q1) What are these metrics revealing about COMPAS? (use the labels of the rows to get a better intuition of the metrics)

Q2) What do you think Northpointe concluded from these results?

Feedback - Click on the "..." below only once you have really tried to answer the question!

These metrics show that when a defendant has a high score, they are more likely to be a non-recidivist if they are White (see the 1st row).
This means that among high-scored defendants, White defendants receive the largest share of incorrect predictions, which, from Northpointe's perspective, invalidates the claim that COMPAS is racially biased against Black people.

Also, the fact that 39% of the low-scored Black defendants actually recidivated tends to show that the model underpredicts recidivism for Black defendants, rather than the contrary.

1.3 - So, who is right?¶

In [9]:
display(propublica_point.style.set_caption("ProPublica metrics"))
print()
display(northpointe_point.style.set_caption("NorthPointe metrics"))
ProPublica metrics
  White Black
Among non-recidivists: % of high scored (FPR) 21% 41%
Among recidivists: % of low scored (FNR) 50% 29%

NorthPointe metrics
  White Black
Among the high scored: % of non-recidivist (1-PPV) 38% 31%
Among the low scored: % of recidivists (1-NPV) 31% 39%

Reflection time!

So, now we have the results of the metrics used by both organizations.

Q1: Who do you think is right?

  • ProPublica
  • Northpointe
  • None

Q2: Is the COMPAS tool biased or not?

Feedback - Click on the "..." below only once you have really tried to answer the question!

Q1) The answer is: none! Each metric leads to a different conclusion.

The reason for this discrepancy between metrics has been the subject of many studies, and researchers have shown that the metrics we have used above are inherently incompatible when the groups in the underlying dataset do not have an equal prevalence of the outcome of interest. Sahlgren (2024) states that "an error-prone predictive model cannot simultaneously satisfy two plausible conditions for group fairness apart from exceptional circumstances where groups exhibit equal base rates". This phenomenon is called the "impossibility result".
For a mathematical explanation check Castelnovo et al. (2022).
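
A small numerical sketch can illustrate this result. The two groups below are hypothetical (all counts are invented): they share the same FNR and the same PPV, but have different base rates, and as a consequence their FPR cannot be equal. The helper names `error_rates` and `implied_fpr` are introduced here for illustration only.

```python
def error_rates(tp, fp, tn, fn):
    """Return the four error rates used in Part 1 from confusion-matrix counts."""
    return {
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
        "1-PPV": fp / (tp + fp),
        "1-NPV": fn / (tn + fn),
    }


def implied_fpr(base_rate, ppv, fnr):
    """FPR forced by the base rate once PPV and FNR are fixed.

    Follows from the definitions: FP = TP * (1 - PPV) / PPV and TP = (1 - FNR) * P.
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)


# Two hypothetical groups of 1000 people each (all counts invented)
group_a = error_rates(tp=300, fp=200, tn=300, fn=200)  # 500 recidivists: base rate 50%
group_b = error_rates(tp=180, fp=120, tn=580, fn=120)  # 300 recidivists: base rate 30%

for name, rates in [("A", group_a), ("B", group_b)]:
    print(name, {metric: round(value, 3) for metric, value in rates.items()})

# Same PPV (0.6) and same FNR (0.4), but different base rates give different FPR
print(round(implied_fpr(0.5, 0.6, 0.4), 3), round(implied_fpr(0.3, 0.6, 0.4), 3))
```

In this toy setting, the two groups are treated identically according to FNR and 1 - PPV, yet group A has a much higher FPR than group B, purely because a larger share of group A actually recidivates.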

There are two implications of this result here:

  1. We have already seen in the dataset exploration that the data contains more cases for Black defendants than for White defendants. In addition, the incompatible results we obtain on the fairness metrics suggest that there is also an imbalance in the actual recidivism rates across the Black and White groups in the dataset.
    👉 We will explore this issue later in the notebook.
  2. If we want to work on the fairness of the model, we need to determine which fairness metric to prioritize (and therefore which notion of fairness is more important), since we will never be able to optimize them all.

Q2) Because the results on the two categories of fairness metrics are incompatible, we cannot conclude whether COMPAS is biased or not.
The only conclusions that we can draw for now are that:

  • the accuracy of COMPAS raises questions about its safety for use in the real world;
  • the dataset probably reflects historical biases which are very likely to lead to unfairness in any algorithm working with such data.


References:
Sahlgren, O. (2024). What’s Impossible about Algorithmic Fairness? Philosophy & Technology, 37(4), 124. https://doi.org/10.1007/s13347-024-00814-z
Castelnovo, A., Crupi, R., Greco, G., Regoli, D., Penco, I. G., & Cosentini, A. C. (2022). A clarification of the nuances in the fairness metrics landscape. Scientific Reports, 12(1), Article 1. https://doi.org/10.1038/s41598-022-07939-1


Part 2. A design matter?¶

In this section, we will try to build our own model! Our goal is to find out whether we can do better than Northpointe in terms of fairness, by examining whether and how demographic attributes can predict recidivism.

2.1 - Creating a Logistic Regression model¶

Short introduction to logistic regression

In the rest of this notebook, we will be using a Machine Learning model called Logistic Regression.
If you are not familiar with this technique, we provide a short tutorial in the file Intro_To_LR.ipynb. Take a look at it if you feel it is necessary!

Let's try to train a Logistic Regression model on the ProPublica dataset!

Here are the design choices that we make:

  • We will try to predict whether a defendant will recidivate within two years, therefore the label that we will use is is_recid
  • For this first try, we will train the model on these features: sex, age_cat, race, juv_fel_count, priors_count, c_jail_in, c_jail_out and c_charge_degree. Some of them aren't usable without some preprocessing, and that is what we will do now!

Data preprocessing¶

Logistic regression can only take numerical values, and it works even better if those values are normalized: when all features are on a comparable scale, no single feature dominates the model's learning process simply because of its scale.

Normalization

First, let's look at the normalization of the numerical attributes we already have. The cell below normalizes the values from the columns priors_count and juv_fel_count using a StandardScaler. It also computes the time spent in jail from c_jail_in and c_jail_out, and normalizes it as well.

Instructions

Run the cell below, which will create a new features dataframe containing the normalized features.

In [10]:
scaler = StandardScaler()
features = pd.DataFrame(scaler.fit_transform(pd.concat(
    [
        data["priors_count"],
        data["juv_fel_count"],
        pd.DataFrame(data["c_jail_out"].map(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d %H:%M:%S").timestamp()) - data["c_jail_in"].map(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d %H:%M:%S").timestamp()), columns=['jail_time'], index=data.index)
    ], axis='columns')), columns=['priors_count', 'juv_fel_count', 'jail_time'], index=data.index)
features.head()
Out[10]:
priors_count juv_fel_count jail_time
0 -0.684413 -0.127923 -0.302617
1 -0.684413 -0.127923 -0.107825
2 0.158866 -0.127923 -0.300447
5 -0.684413 -0.127923 -0.295461
6 2.267065 -0.127923 -0.188773
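
To make the standardization above concrete, here is a minimal sketch of what a StandardScaler does to a single column: it subtracts the mean and divides by the population standard deviation. The input values are invented for illustration, and the sketch uses only the Python standard library rather than scikit-learn.

```python
from statistics import mean, pstdev

# Toy values, invented for illustration (e.g. numbers of prior offenses)
raw = [0, 0, 4, 0, 14, 2]

mu = mean(raw)
sigma = pstdev(raw)  # population standard deviation (divides by n)

# z-score: how many standard deviations each value lies from the mean
standardized = [(x - mu) / sigma for x in raw]

print([round(z, 3) for z in standardized])
```

After this transformation the column has mean 0 and standard deviation 1, which is why the most frequent small values in the table above end up as the same slightly negative number.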

Creating indicator variables

We can't pass values such as "Male" or "Female" to a Logistic Regression model: we need to create an indicator variable for them, which is True if the original variable has a certain value and False otherwise.
This is done using pd.get_dummies:

In [11]:
pd.get_dummies(data['sex']).sample(5, random_state=123)
Out[11]:
Female Male
3983 True False
3406 True False
6001 False True
4423 True False
49 False True

Now since the resulting feature is binary, we only need one of the two columns to make an indicator:

In [12]:
pd.DataFrame(
    {"is_female": pd.get_dummies(data['sex'])['Female']}
).sample(5, random_state=123)
Out[12]:
is_female
3983 True
3406 True
6001 False
4423 True
49 False

Now it is your turn!

  • Add one indicator is_female,
  • And one indicator is_charge_felony to the features!

Instructions

Complete the cell below so that:

  • features['is_female'] is an indicator for the sex of the defendant, which can take values ["Male", "Female"]
  • features['is_charge_felony'] is an indicator for the c_charge_degree of the defendant, which can take values ["F", "M"], resp. Felony or Misdemeanor.

Both columns should have boolean values.

In [13]:
### YOUR CODE HERE
features["is_female"] = pd.get_dummies(data["sex"])["Female"] # SOLUTION
features["is_charge_felony"] = pd.get_dummies(data["c_charge_degree"])["F"] # SOLUTION
### END OF YOUR CODE

# Simply testing your values...
test_values(tuple(features["is_female"].sample(5, random_state=123)), "sex_indicator")
test_values(tuple(features["is_charge_felony"].sample(5, random_state=123)), "is_charge_felony_indicator")

features.head()
🆗 Tests passed ! =)
🆗 Tests passed ! =)
Out[13]:
priors_count juv_fel_count jail_time is_female is_charge_felony
0 -0.684413 -0.127923 -0.302617 False True
1 -0.684413 -0.127923 -0.107825 False True
2 0.158866 -0.127923 -0.300447 False True
5 -0.684413 -0.127923 -0.295461 False False
6 2.267065 -0.127923 -0.188773 False True

Awesome! Now let's tackle age_cat and race. This is a bit trickier, as a single indicator is not enough for them.

age_cat can take 3 values: "Less than 25", "25 - 45" and "Greater than 45". There are several ways to encode this column numerically, but we will choose the option of making 2 indicators:

  • One "age < 25"
  • And the other "age > 45".
  • If both indicators are 0, it means that the defendant is between 25 and 45, so there is no loss of information!

For race, we will not use this trick: there are 6 values, so it is not worth it, and it might add some unwanted bias for technical reasons.
We will simply create 6 indicators: one for each race!
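
The claim that two indicators are enough for a three-valued column can be checked on a toy example. This is only a sketch: the Series below is invented for illustration (it reuses the category names of age_cat), and the cast to bool is there so that the code behaves the same across pandas versions.

```python
import pandas as pd

# Toy column, invented for illustration (same category names as age_cat)
toy_age_cat = pd.Series(["Less than 25", "25 - 45", "Greater than 45", "25 - 45"])

# One indicator column per category, cast to bool for consistency across pandas versions
toy_dummies = pd.get_dummies(toy_age_cat).astype(bool)

# Keep only two of the three indicators
toy_age = toy_dummies[["Less than 25", "Greater than 45"]]

# The dropped category is recoverable: it is where both kept indicators are False
is_middle = ~toy_age.any(axis=1)

print((is_middle == toy_dummies["25 - 45"]).all())
```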

Instructions

Complete the cell below such that:

  • It adds 2 indicators for the age of the defendant, "age < 25" and "age > 45"
  • It adds 6 indicators for the race of the defendant, one for each race
In [14]:
### YOUR CODE HERE
# BEGIN SOLUTION
features[["age < 25", "age > 45"]] = pd.get_dummies(data["age_cat"])[["Less than 25", "Greater than 45"]]
features[["African-American", "Caucasian", "Asian", "Hispanic", "Native American", "Other"]] = pd.get_dummies(data["race"])[["African-American", "Caucasian", "Asian", "Hispanic", "Native American", "Other"]]
# END SOLUTION
### END OF YOUR CODE

test_values(sorted(features.columns), "naive features columns")
test_values(features[["age < 25", "age > 45"]].sample(5, random_state=246).to_numpy().__repr__(), "age_indicator")
test_values(features[["African-American", "Caucasian", "Asian", "Hispanic", "Native American", "Other"]].sample(5, random_state=5).to_numpy().__repr__(), "race_indicator")

features.head()
🆗 Tests passed ! =)
🆗 Tests passed ! =)
🆗 Tests passed ! =)
Out[14]:
priors_count juv_fel_count jail_time is_female is_charge_felony age < 25 age > 45 African-American Caucasian Asian Hispanic Native American Other
0 -0.684413 -0.127923 -0.302617 False True False True False False False False False True
1 -0.684413 -0.127923 -0.107825 False True False False True False False False False False
2 0.158866 -0.127923 -0.300447 False True True False True False False False False False
5 -0.684413 -0.127923 -0.295461 False False False False False False False False False True
6 2.267065 -0.127923 -0.188773 False True False False False True False False False False

Fitting a LR model¶

Perfect! All the features are now ready! The only thing left to do is defining the labels and fitting the model!

Instructions

Complete the cell below so that:

  • it fits the model on the features and the labels, as seen in the tutorial on Logistic Regression Intro_To_LR.ipynb.
In [15]:
# Get the labels
labels = data["is_recid"].copy()
# The features are available in the "features" variable

# Create the model 
model = LRModel(max_iter=1000) # Custom implementation that uses LogisticRegression from sklearn package

# Fit the model!
### YOUR CODE HERE
# BEGIN SOLUTION
model.fit(features, labels)
# END SOLUTION
### END OF YOUR CODE

print(f"Model trained! Prediction accuracy: {model.score(features, labels)*100}%")
model.print_coefs(features.columns)
Model trained! Prediction accuracy: 65.76673866090714%
Out[15]:
  Coefficients of the model
priors_count 0.849837
juv_fel_count 0.058701
jail_time 0.106816
is_female -0.371412
is_charge_felony 0.117323
age < 25 0.770163
age > 45 -0.738011
African-American 0.172767
Caucasian 0.055662
Asian -0.008043
Hispanic -0.177392
Native American -0.012142
Other -0.150954

Quiz time!

Q1) What are the 2 most predictive features? Recall that a negative coefficient pushes the prediction towards class 0 (no recidivism).

  • Previous crimes
  • Current charge information
  • Sex
  • Age
  • Race

Q2) Based on the decision coefficients, do you think your model is racially biased?

  • Yes
  • No

Feedback - Click on the "..." below only once you have really tried to answer the question!

Q1:

  • Age, which sounds quite reasonable: the younger you are when you commit a crime, the more likely you are to do it again.
  • Previous crimes, which also sounds reasonable.

Q2:
One could argue that the coefficients show that the model is racially biased: the African-American indicator has a coefficient of 0.17, which means that this feature pushes the prediction towards recidivism. The Hispanic indicator has a coefficient of the opposite sign (-0.18), meaning that the model penalizes African-American defendants relative to Hispanic defendants.
But one could also argue that the model doesn't fully rely on those features, as the age and priors_count features weigh much more in the decision than the race indicators.

In the end, if we take two otherwise identical young men with 0 prior crimes, and one of them is Black, the model will assign him a higher risk of recidivism than the other. Therefore we can say that the model is racially biased.
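To put a number on the "two identical young men" comparison, here is a small sketch. It assumes that the coefficients printed above are standard logistic regression coefficients acting on the log-odds, and that the race indicators are plain 0/1 columns (both are assumptions about the custom LRModel, which we have not inspected). Under these assumptions, all shared terms cancel and only the difference between the two race coefficients remains:

```python
import math

# Coefficients copied from the model output above.
coef = {"African-American": 0.172767, "Caucasian": 0.055662, "Hispanic": -0.177392}

def odds_ratio(group_a, group_b):
    """Ratio of predicted recidivism odds for two defendants identical except for race."""
    # Intercept, age, priors, etc. are shared, so they cancel in the log-odds difference.
    return math.exp(coef[group_a] - coef[group_b])

print(round(odds_ratio("African-American", "Caucasian"), 3))  # about 1.124
print(round(odds_ratio("African-American", "Hispanic"), 3))   # about 1.419
```

In other words, with everything else equal, the predicted odds of recidivism would be roughly 12% higher for a Black defendant than for a White one, and roughly 42% higher than for a Hispanic one.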

2.2 - Evaluating fairness with the Disparate Impact Ratio¶

Let's see if our model has a racial bias or not by using a notion of fairness we have already seen in the notebook for Fairness 1: demographic parity.

In Machine Learning, demographic parity is generally measured using a metric called the disparate impact ratio.

Instructions

Try to understand what the disparate impact ratio measures by running the cell and analysing the function!

In [16]:
# /!\ mysterious_calculation only works if predictions are 0 or 1
def mysterious_calculation(predictions):
    """
    :param predictions: Indicator of the predicted value for each sample (is 0 or 1 for each sample)
    """
    return predictions.sum() / predictions.shape[0]

black_disparate = mysterious_calculation(model.predict(features[features['African-American'] == 1], "test"))
white_disparate = mysterious_calculation(model.predict(features[features['Caucasian'] == 1], "test"))

disparate_impact_ratio = white_disparate / black_disparate

display(pd.DataFrame({"White defendants": [white_disparate], "Black defendants": [black_disparate]}, index=["Mysterious calculation"]))
print(f"Disparate impact ratio: {disparate_impact_ratio}")
White defendants Black defendants
Mysterious calculation 0.303125 0.599567
Disparate impact ratio: 0.5055731046931408

Reflection time!

  • Try to recall from Fairness 1: what is meant by demographic parity in terms of fairness for an algorithm?
  • What does the disparate impact ratio measure?

Feedback - Click on the "..." below only once you have really tried to answer the question!

Demographic parity holds when an algorithm gives a given outcome (in general we look at the favorable outcome) at the same rate to people regardless of their demographic group.

In the code above:

  • We look at the predictions from the model which have the value 1, which means the model predicts the defendant to recidivate.
  • We compute the proportion of positive predictions (here predicted recidivism) over all predictions for each group, this is what the "mysterious calculation" is about: getting the proportion of defendants predicted recidivist for each group.
  • Then we take the ratio between the values obtained for each group.

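To summarize, the disparate impact ratio is the ratio between the proportion of a given outcome for one group and the proportion of the same outcome for the other group.
For a favorable outcome, it is computed as:

$$DIR = \frac{P_{unprivileged}}{P_{privileged}}$$

In the code above, the outcome we count is the unfavorable one (predicted recidivism), so the ratio is taken the other way around (White over Black): a value below 1 still means that the unprivileged group is disadvantaged.

Therefore it gives us an idea of whether our model gives the same rate of outcome for the groups we compare, i.e. the disparate impact ratio is one way to measure demographic parity.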
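The three steps can be sketched in plain Python as follows. The function names and the toy predictions are hypothetical and only serve as an illustration; the division follows the same White over Black order as the cell above:

```python
def positive_rate(predictions):
    """Proportion of samples predicted as class 1 (predictions must be 0 or 1)."""
    return sum(predictions) / len(predictions)

def disparate_impact_ratio(preds_privileged, preds_unprivileged):
    """Ratio of predicted-recidivist rates, privileged group over unprivileged group."""
    return positive_rate(preds_privileged) / positive_rate(preds_unprivileged)

# Hypothetical toy predictions, not taken from the COMPAS data.
white_preds = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # 3 of 10 predicted recidivist
black_preds = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]  # 6 of 10 predicted recidivist

print(disparate_impact_ratio(white_preds, black_preds))
```

With these toy values, the second group is predicted recidivist twice as often as the first one, so the ratio is one half.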

Another reflection time!

Q1) What is the value of a "good" disparate impact ratio in the context of fairness between two groups?

Q2) What is your conclusion concerning your model: is it biased?

Feedback - Click on the "..." below only once you have really tried to answer the question!

Q1) If a model gives the same outcome rate for the two groups we compare, then the disparate impact ratio will have a value of 1.
In practice this is rarely the case, and a widely used rule of thumb is to accept values above 0.8 as indicating fairness.
Q2) For our model, the ratio is 0.5, meaning that the model predicts a Black defendant as recidivist about twice as often as a White defendant... That is not very good for what we want to achieve...
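As a small sketch, the rule of thumb can be written as a check on the two rates. The function name is hypothetical, and dividing the smaller rate by the larger one is our own choice (not something the notebook does): it makes the check independent of which group is in the numerator, so a ratio above 1 is handled too.

```python
def passes_rule_of_thumb(rate_a, rate_b, threshold=0.8):
    """Check the 0.8 rule of thumb on two outcome rates.

    The smaller rate is divided by the larger one, so the result does not
    depend on which group is placed in the numerator.
    """
    ratio = min(rate_a, rate_b) / max(rate_a, rate_b)
    return ratio, ratio >= threshold

# Rates copied from the output of the cell above (White, Black).
ratio, ok = passes_rule_of_thumb(0.303125, 0.599567)
print(f"ratio={ratio:.3f}, passes={ok}")
```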

From here we will try to build a model with a better disparate impact ratio.

2.3 - Trying to improve fairness¶

In the following sections we are going to try out 3 different ways to improve the fairness of our model by modifying its design.

Removing the race in the input¶

First idea: if our model discriminates based on race, one obvious way to fix that is to make the model unaware of race!
This is what we will try to implement here: a model that can predict recidivism without taking the race of the defendant into account.

Instructions

Complete the cell below so that:

  • It creates features_no_race, which is the same as features but without the race indicators

Hint

Use dataframe.drop and be careful about the axis! It can be "index" (which drops rows) or "columns" (which drops columns).
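If the axis argument is unclear, here is a tiny illustration on a hypothetical toy dataframe (unrelated to the COMPAS features) showing what each value does:

```python
import pandas as pd

# Hypothetical toy dataframe, unrelated to the COMPAS features.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

without_cols = df.drop(["b", "c"], axis="columns")  # removes columns "b" and "c"
without_rows = df.drop([0], axis="index")           # removes the row labelled 0

print(list(without_cols.columns))
print(list(without_rows.index))
```

Note that drop returns a new dataframe and leaves the original one untouched.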

In [17]:
### YOUR CODE HERE
features_no_race = features.drop(["Other", "African-American", "Hispanic", "Asian", "Native American", "Caucasian"], axis='columns') # SOLUTION
# END OF YOUR CODE

test_values(sorted(list(features_no_race.columns)), "features_no_race")

features_no_race.head()
🆗 Tests passed ! =)
Out[17]:
priors_count juv_fel_count jail_time is_female is_charge_felony age < 25 age > 45
0 -0.684413 -0.127923 -0.302617 False True False True
1 -0.684413 -0.127923 -0.107825 False True False False
2 0.158866 -0.127923 -0.300447 False True True False
5 -0.684413 -0.127923 -0.295461 False False False False
6 2.267065 -0.127923 -0.188773 False True False False

We have written the function fit_and_display for you in the file src.py. It handles the training of the model and the calculations regarding the disparate impact ratio (you don't need to understand it to complete the exercise).
We can just run the function and get the results!

In [18]:
fit_and_display(features_no_race, labels, white_defendants, black_defendants)
Accuracy of the model: 67.170626349892%.
White Black
Proportion predicted recidivist 0.303125 0.573593
Disparate impact ratio: 0.5284669811320755

Reflection time!

Q1. The disparate impact ratio has improved only slightly (from 0.506 to 0.528), meaning that not using race is not the solution. How can the model be racially biased without knowing the race of the defendants?

Q2. Is the argument "COMPAS cannot be racially biased as it doesn't know the race of the defendant" admissible to prove that COMPAS is not racially biased?

  • Yes
  • No

Feedback - Click on the "..." below only once you have really tried to answer the question!

Q1. The issue is that race is implicitly related to the features that we feed to the model. For instance, Black defendants in the USA often live in poorer neighborhoods, where they are more likely to be arrested for minor offenses, which count towards priors_count (we will get back to that later on). Because of this, there is a correlation between the output of the model and the race of the defendant.
You have seen in video 3.1 that attributes which are related to sensitive attributes such as race are called proxies: they are features that make it possible to recover information that is not in the dataset (in this case, race).

Q2. Based on what we can observe, this argument is not admissible. The fact that a model is not given data about an attribute does not mean that it cannot be biased with respect to this attribute.
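The proxy effect can be reproduced on a deliberately simplistic, made-up example. In the sketch below, the "model" is a hypothetical rule that only looks at the number of priors and never sees the group, yet its predictions differ strongly between the two groups because the priors are distributed differently:

```python
# Hypothetical toy data: number of priors per defendant in two groups.
# Group membership is never shown to the "model" below.
priors_group_a = [0, 0, 0, 1, 1, 2, 3, 5]
priors_group_b = [0, 1, 2, 2, 3, 4, 6, 8]

def predict(priors):
    """A group-blind rule: predict recidivism (1) when there are 2 or more priors."""
    return 1 if priors >= 2 else 0

def predicted_rate(priors_list):
    """Proportion of defendants in the list who are predicted recidivist."""
    return sum(predict(p) for p in priors_list) / len(priors_list)

rate_a = predicted_rate(priors_group_a)
rate_b = predicted_rate(priors_group_b)
print(rate_a, rate_b, rate_a / rate_b)
```

Here the rule predicts recidivism for 3 of 8 defendants in the first group and 6 of 8 in the second, so the disparate impact ratio is 0.5 even though the group was never an input.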

A more brute-force attempt to remove the bias¶

Let's look again at the features we have at our disposal:

In [19]:
data.columns
Out[19]:
Index(['id', 'name', 'first', 'last', 'sex', 'dob', 'age', 'age_cat', 'race',
       'juv_fel_count', 'decile_score', 'juv_misd_count', 'juv_other_count',
       'priors_count', 'c_jail_in', 'c_jail_out', 'c_offense_date',
       'c_arrest_date', 'c_charge_degree', 'c_charge_desc', 'is_recid',
       'in_custody', 'out_custody', 'predicted_recid'],
      dtype='object')

There are features in there we have not yet preprocessed, so let's do it and look at the resulting preprocessed data.

Instructions

Execute the cell below to get all features from the dataset preprocessed.

In [20]:
pre_processed_features = features.copy()

not_preprocessed_features = ['juv_misd_count', 'juv_other_count']
pre_processed_features[not_preprocessed_features] = pd.DataFrame(
    scaler.fit_transform(data[not_preprocessed_features]), 
    columns=not_preprocessed_features, 
    index=features.index
)

display(pre_processed_features.head())
print(f"Features at disposal: {list(pre_processed_features.columns)}")
priors_count juv_fel_count jail_time is_female is_charge_felony age < 25 age > 45 African-American Caucasian Asian Hispanic Native American Other juv_misd_count juv_other_count
0 -0.684413 -0.127923 -0.302617 False True False True False False False False False True -0.183232 -0.235102
1 -0.684413 -0.127923 -0.107825 False True False False True False False False False False -0.183232 -0.235102
2 0.158866 -0.127923 -0.300447 False True True False True False False False False False -0.183232 1.889425
5 -0.684413 -0.127923 -0.295461 False False False False False False False False False True -0.183232 -0.235102
6 2.267065 -0.127923 -0.188773 False True False False False True False False False False -0.183232 -0.235102
Features at disposal: ['priors_count', 'juv_fel_count', 'jail_time', 'is_female', 'is_charge_felony', 'age < 25', 'age > 45', 'African-American', 'Caucasian', 'Asian', 'Hispanic', 'Native American', 'Other', 'juv_misd_count', 'juv_other_count']

Let's try now to use only the features that could be meaningfully related to recidivism without being related to the race of the defendant.

Instructions

In the cell below, try to select features such that the model reaches 0.6 or more of disparate impact ratio!

We provided you with data that is already correctly pre-processed. If you want to use columns that are not in pre_processed_features (which should not be necessary), don't forget to pre-process your data correctly!

In [21]:
### YOUR CODE HERE
selected_features = ['jail_time', 'is_charge_felony', 'juv_fel_count', 'juv_misd_count', 'juv_other_count'] # SOLUTION
### END OF YOUR CODE

fit_and_display(pre_processed_features[selected_features], labels, white_defendants, black_defendants)
Accuracy of the model: 60.259179265658744%.
White Black
Proportion predicted recidivist 0.190625 0.290043
Disparate impact ratio: 0.6572294776119403

Warning

Do not spend more than 15 minutes on this exercise!
The goal of this exercise is to show that it is really difficult to find the right features by blind trial and error :)

Using the correlations in order to select the best features¶

Instead of proceeding by trial and error, let's do things properly and see how the different features are correlated with race and with the label we are trying to predict.
If you have some knowledge of statistics, remember that correlation doesn't mean causation! But a Machine Learning model relies on those correlations to make its predictions, so it is always good practice to check for correlations between the features and the label in the data.
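As a reminder of what a correlation coefficient measures, here is a minimal sketch of the Pearson correlation written in plain Python (pandas' corrwith computes this same coefficient by default). The data is a hypothetical toy example with a count feature and a 0/1 group indicator:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: a count feature and a 0/1 group indicator.
priors = [0, 1, 0, 4, 2, 6, 1, 5]
group = [0, 0, 0, 1, 0, 1, 0, 1]

print(round(pearson(priors, group), 3))
```

A value close to 1 or -1 means that the feature carries a lot of information about the group, which is exactly what makes it a potential proxy.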

Instructions

Execute the cell below to see the correlation coefficients between the features and is_recid, African-American and Caucasian.

In [22]:
display(
    pd.DataFrame({"is_recid": features.corrwith(labels), "African-American": features.corrwith(features["African-American"]), "Caucasian": features.corrwith(features['Caucasian'])})
    .drop(['African-American', 'Caucasian', 'Asian', 'Hispanic', 'Native American', 'Other']) # we drop the races as it is irrelevant
    .style.map(lambda e: f"color: rgb(255, {255 - abs(e)*255/0.3}, {255 - abs(e)*255/0.3})")
)
  is_recid African-American Caucasian
priors_count 0.294522 0.215184 -0.145093
juv_fel_count 0.085108 0.057850 -0.052881
jail_time 0.108439 0.059794 -0.043677
is_female -0.110840 -0.045781 0.071087
is_charge_felony 0.115557 0.104047 -0.077574
age < 25 0.112593 0.091101 -0.092665
age > 45 -0.143709 -0.157048 0.157450

Reflection time!

Which feature is the most correlated with race?

  • priors_count
  • jail_time
  • is_charge_felony
  • age

Does it explain earlier results?

Feedback - Click on the "..." below only once you have really tried to answer the question!

priors_count has the strongest correlation with the African-American indicator (0.22). It is interesting to notice that it also has the strongest correlation with the label (0.29).
This helps explain the earlier results: priors_count was acting as a proxy for race!

Another reflection time!

Among these features, which one should we use and why?

  • juv_fel_count
  • priors_count
  • jail_time
  • is_female
  • is_charge_felony
  • age

Feedback - Click on the "..." below only once you have really tried to answer the question!

  • juv_fel_count is usable as it has a higher correlation with the label than with race, and the correlation with race is very low.
  • We can use jail_time as it is more strongly correlated with the label than with race, and its correlation with race is low.
  • Even though is_charge_felony has a fairly high correlation with race, it could be considered more meaningful for recidivism and less biased than priors_count, as the latter is related to the defendant's childhood, which is highly related to race (Black people live in poorer neighborhoods than White people; note: we chose this report over more recent ones because the data from the ProPublica dataset are from 2013-2014). Therefore we will use is_charge_felony, as it has a high correlation with the label. You can try not using it if you want!
  • is_female has a fairly high correlation with the label and a lower one with race, so it should be pretty safe to use!
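To read the correlation table a bit more mechanically, the sketch below ranks the features by the absolute value of their correlation with the African-American indicator. This ranking rule is our own simplification for illustration; it does not replace the judgment calls made above (for instance about is_charge_felony):

```python
# Correlations copied from the table above: (with is_recid, with African-American).
corr = {
    "priors_count": (0.294522, 0.215184),
    "juv_fel_count": (0.085108, 0.057850),
    "jail_time": (0.108439, 0.059794),
    "is_female": (-0.110840, -0.045781),
    "is_charge_felony": (0.115557, 0.104047),
    "age < 25": (0.112593, 0.091101),
    "age > 45": (-0.143709, -0.157048),
}

# Hypothetical screening rule: weakest link with race first.
ranked = sorted(corr, key=lambda name: abs(corr[name][1]))
for name in ranked:
    label_corr, race_corr = corr[name]
    print(f"{name:18s} label={abs(label_corr):.3f} race={abs(race_corr):.3f}")
```

The three features with the weakest link to race come out first, and priors_count comes out last, in line with the discussion above.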
In [23]:
### YOUR CODE HERE
most_uncorr_features = features[['juv_fel_count', 'jail_time', 'is_charge_felony', 'is_female']] # SOLUTION

fit_and_display(most_uncorr_features, labels, white_defendants, black_defendants)
Accuracy of the model: 60.69114470842333%.
White Black
Proportion predicted recidivist 0.4625 0.601732
Disparate impact ratio: 0.768615107913669

This is way better! We succeeded in designing a model that almost has a good disparate impact ratio!

Reflection time!

Q1) What do you think of feature selection as a way to improve fairness?

  • It is useless
  • It is a first step
  • It is enough

Q2) What is the main tradeoff of having a fairer model? Would you use the model in a real life scenario now?

Feedback - Click on the "..." below only once you have really tried to answer the question!

Q1) Feature selection is a good first step: we started at 0.51 and ended up at 0.77 for the disparate impact ratio, which is way better and almost reaches the threshold generally used to consider a model fair (in general, a good disparate impact ratio should be 0.8 or better).
However, if we look at the proportions of predicted recidivism, the ratio improved mainly because many more White defendants are now predicted recidivist (46%, up from 30% with our first model), while the proportion for the Black group remains at about 60%. This is an issue in itself, and it means that our model probably has a higher rate of false positives now, which is reflected in the lower accuracy we obtain.
Q2) The main tradeoff we are identifying here is that getting a fairer model often comes at the cost of reduced accuracy! To gain about 0.26 of disparate impact ratio, we lost about 5 percentage points of overall accuracy (from 65.8% to 60.7%). Our model is almost useless now, barely doing better than a coin flip.
The dilemma between accuracy and fairness is well known and very frequent in machine learning. We will discuss this issue again in the videos.
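To see the tradeoff at a glance, the sketch below simply collects the accuracies and disparate impact ratios printed by the cells above (rounded) and compares each model with the first one. Keep in mind that the first accuracy was printed by a different cell than the others, so the comparison is only indicative:

```python
# (accuracy in %, disparate impact ratio), rounded from the outputs above.
results = {
    "all features": (65.77, 0.506),
    "without race": (67.17, 0.528),
    "manual subset": (60.26, 0.657),
    "correlation-based subset": (60.69, 0.769),
}

base_acc, base_dir = results["all features"]
deltas = {
    name: (round(acc - base_acc, 2), round(dir_ - base_dir, 3))
    for name, (acc, dir_) in results.items()
}
for name, (d_acc, d_dir) in deltas.items():
    print(f"{name:26s} accuracy {d_acc:+.2f} points, DIR {d_dir:+.3f}")
```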


[Optional] 2.4 - Constraining the model¶

Another way to train a fairer algorithm is to make it target not only the best accuracy, but also a fairness metric. This is not done easily and requires some complicated maths, but fortunately the package AI Fairness 360 (aif360) does it for us!
In the cell below we train our model so that it follows certain constraints (more details in this paper from EPFL researchers!). In our case, the constraint is to have a good disparate impact ratio by protecting the 'African-American' race.

Instructions

Execute the cell below to train a new Logistic Regression model with a constraint on demographic parity.
Note that we give all features to the model because we train the model to protect the race, so it must explicitly know the race of the defendant (which raises privacy issues of course!).


There might be warnings from the cell below; they come from the package and do not affect the results, so you can ignore them.

In [24]:
model = LogisticRegression(max_iter=1000)

model_constrained = ExponentiatedGradientReduction(estimator=model, prot_attr='African-American', constraints="DemographicParity", drop_prot_attr=False)

fit_and_display(features, labels, white_defendants, black_defendants, model=model_constrained)
Accuracy of the model: 65.48930654569021%.
White Black
Proportion predicted recidivist 0.453125 0.45671
Disparate impact ratio: 0.9921504739336492

Reflection time!

Compare these results with those obtained when manually choosing the features we gave to the model. Why do you think it works better? (Think about how a Logistic Regression model is trained.)

Feedback - Click on the "..." below only once you have really tried to answer the question!

Clearly, those results are way better than before, both in terms of accuracy and fairness. We even achieved an accuracy almost as good as the COMPAS tool (which is not really something to brag about...)!
This can be explained by the fact that we have trained a model such that the problem it tries to solve contains the fairness constraints. Therefore, it has to find a line that respects the constraints given, namely being fair. When we manually selected the features, the model wasn't trained to achieve good fairness, so it could not reach it: we could only approach it.
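As a rough sketch of what "the problem contains the fairness constraints" means, demographic parity can be written as a constraint added to the usual training objective. The notation below ($h$ for the classifier, $A$ for the protected attribute, $\varepsilon$ for the tolerated gap) is introduced here for illustration only; it is our reading of the general idea and not a description of the exact problem solved by aif360:

$$\min_{h} \; P\big(h(X) \neq Y\big) \quad \text{subject to} \quad \big|\, P\big(h(X) = 1 \mid A = a\big) - P\big(h(X) = 1\big) \,\big| \le \varepsilon \quad \text{for every group } a$$

With a small $\varepsilon$, every group must be predicted recidivist at almost the same rate, which pushes the disparate impact ratio towards 1. Feature selection, in contrast, only changes the inputs $X$ and leaves the objective without any constraint.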


Part 3. Troubles from the data¶

As we will see in more detail in the videos, there are two main ways to improve the fairness of a machine learning model:

  • Focus on the design of the model, which is what we have done so far.
  • Focus on the data, as bias from the training data has a direct impact on the fairness of the model.

For the last part of this notebook, we will focus on the data!

3.1 - Prevalence of recidivism across groups¶

Remember our result from part 1.3: the incompatible results we obtained on the fairness metrics seem to indicate that we have an imbalance in the distribution of recidivism across race groups in the data. Let's check it out!

Instructions

Complete the cell below such that:

  • It computes the number of recidivists and non-recidivists among White and Black defendants from the dataframe data

You have seen in the Fairness 1 notebook how to use the crosstab function of pandas.
Use this function to compute the recid_stats table that contains the raw numbers of defendants as follows:

          Caucasian            African-American
is_recid
0         # White non recid    # Black non recid
1         # White recid        # Black recid

Pro tips!

  • Pay attention to the order in which you pass the dataframe columns of the dataset to crosstab;
  • The crosstab function returns a dataframe, and it is possible to select some of its columns using this syntax: dataframe[['column1', 'column2']]

Note: if you struggle too much with crosstab, you can also compute the values of each of the cells using pandas filtering, then build the dataframe manually.

In [25]:
### YOUR CODE HERE
recid_stats = pd.crosstab(data['is_recid'], data['race'])[["Caucasian", "African-American"]] # SOLUTION
display(recid_stats)
### END OF YOUR CODE

# checking you have the correct values
test_values(list((recid_stats.to_numpy().flatten())), "recid_stats")

# Compute the percentages
print("\nPercentages:")
for col in recid_stats.columns:
    recid_stats[col] = recid_stats[col].map(lambda e: f"{e} ({round(e*100 / recid_stats[col].sum())}%)")
recid_stats
race Caucasian African-American
is_recid
0 1229 1402
1 874 1773
🆗 Tests passed ! =)

Percentages:
Out[25]:
race Caucasian African-American
is_recid
0 1229 (58%) 1402 (44%)
1 874 (42%) 1773 (56%)

Reflection time!

Q1) What does this table reveal?

Q2) How does this table explain ProPublica's conclusions about COMPAS?

Feedback - Click on the "..." below only once you have really tried to answer the question!

Q1. This table reveals that in the data we feed to the model, almost 60% of the White defendants are not recidivists. This can be a source of unfairness in the model, because the model is trained to reproduce the statistics we feed to it.
Having a 56/44 split for Black defendants is more acceptable, but close to the limit of being too imbalanced. On the other hand, the dataset contains twice as many samples of Black recidivist defendants as of White recidivist defendants, which remains an issue.

Q2. This table explains ProPublica's conclusions about COMPAS, as the model is encouraged to statistically conclude that White defendants tend to recidivate less than Black defendants.
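The percentages can be recomputed by hand from the counts in the table. The sketch below also computes the ratio of the two recidivism rates, which is the disparate impact ratio that a hypothetical model reproducing the labels exactly would obtain:

```python
# Counts copied from the crosstab output above: (non-recidivists, recidivists).
counts = {"Caucasian": (1229, 874), "African-American": (1402, 1773)}

base_rate = {
    race: recid / (non_recid + recid)
    for race, (non_recid, recid) in counts.items()
}
for race, rate in base_rate.items():
    print(f"{race}: {rate:.1%} recidivists in the data")

# Disparate impact ratio of a hypothetical model that reproduces the labels exactly.
label_dir = base_rate["Caucasian"] / base_rate["African-American"]
print(f"Ratio of base rates: {label_dir:.3f}")
```

The ratio is about 0.74, below the 0.8 rule of thumb: even a perfectly accurate model trained on this data would not satisfy demographic parity.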

[Optional] 3.2 - Solutions against biased datasets¶

This section gives you a first look at what you can do against biased datasets. It is optional in the notebook: you can jump to the conclusion if needed.

Okay so our dataset is biased, now what? We will see 2 methods to improve our model fairness by working on the dataset itself.

It is important to state here that those 2 methods are really different in essence:

  • The first one is about modifying the data in order to hide the bias it contains, and therefore obtain a model that is trained on a fairer representation of the world.
  • The second one is about telling the model which samples (which individuals) it should focus on in order to output fairer results.

Both methods are state-of-the-art techniques that are increasingly used in the Machine Learning field.

Transformation¶

One way to address the bias in the data is to create a modified version of the dataset that represents the original data well while simultaneously obfuscating any information about membership in the protected group, in our case "African-American". This way, the model won't be able to use proxies to distinguish between White and Black defendants.
If you want to know more about this technique, here is the paper describing it.

This technique is available under the name "LearnedFairRepresentations" in aif360.

Instructions

Execute the cell below to obtain a dataset where the bias is "hidden" by transformation.

In [26]:
features_fair = features.set_index('African-American', append=True)
lfr = LearnedFairRepresentations('African-American', n_prototypes=10, max_iter=4, random_state=9876)
features_fair = lfr.fit_transform(features_fair.astype(np.float64), labels)

Instructions

Display the first few lines of the features dataframe before and after the manipulation done by aif360.LearnedFairRepresentations.

In [27]:
### YOUR CODE HERE
# BEGIN SOLUTION NO PROMPT
display(features.head().style.set_caption('Before the manipulation'))
print()
display(features_fair.head().style.set_caption('After the manipulation'))
# END SOLUTION
""" # BEGIN PROMPT
display(...)
display(...)
"""; # END PROMPT
Before the manipulation
  priors_count juv_fel_count jail_time is_female is_charge_felony age < 25 age > 45 African-American Caucasian Asian Hispanic Native American Other
0 -0.684413 -0.127923 -0.302617 False True False True False False False False False True
1 -0.684413 -0.127923 -0.107825 False True False False True False False False False False
2 0.158866 -0.127923 -0.300447 False True True False True False False False False False
5 -0.684413 -0.127923 -0.295461 False False False False False False False False False True
6 2.267065 -0.127923 -0.188773 False True False False False True False False False False

After the manipulation
    priors_count juv_fel_count jail_time is_female is_charge_felony age < 25 age > 45 Caucasian Asian Hispanic Native American Other
  African-American                        
0 False 0.507218 0.475127 0.466003 0.422267 0.424015 0.681114 0.621424 0.549810 0.734242 0.543436 0.555198 0.508463
1 True 0.516072 0.452611 0.467314 0.433455 0.407253 0.677141 0.603751 0.559123 0.746399 0.524001 0.575254 0.478592
2 True 0.541727 0.435203 0.436117 0.413283 0.408951 0.706159 0.612333 0.573176 0.737660 0.519425 0.587610 0.457087
5 False 0.506106 0.459035 0.482171 0.430548 0.386951 0.670566 0.590508 0.552366 0.739659 0.513164 0.580058 0.508807
6 False 0.590830 0.425352 0.438339 0.413150 0.420533 0.685877 0.586998 0.605859 0.730453 0.468725 0.604520 0.449076

Reflection time!

Q1) What do you observe about the new values of the different features? Look at those which were previously indicators.

Q2) Can you identify the ethical challenge we have here?

Feedback - Click on the "..." below only once you have really tried to answer the question!

Q1) All defendants now have very similar values for each feature. In addition, two defendants who had the same value in a feature now have different values...! It is no longer possible for a human to understand what the values in the dataset mean.
Q2) As the data is completely different from the original one, the dataset no longer represents the real world, but a world inspired by real data, which now fully carries the notion of fairness towards Black defendants.
The ethical challenge is the following: should we train models on the real world, or on idealized worlds? If we decide that it is better to train an ideal model, who decides the moral values that define the idealized world?

Instructions

Execute the cell below to see the results of fitting a model on this transformed data.

In [28]:
fit_and_display(features_fair, labels, white_defendants, black_defendants)
Accuracy of the model: 63.498920086393085%.
White Black
Proportion predicted recidivist 0.41875 0.404762
Disparate impact ratio: 1.0345588235294119

Reflection time!

Why do we obtain a disparate impact ratio that is greater than 1?
Look at the proportion of defendants predicted recidivists for our two groups, do you notice anything?

Feedback - Click on the "..." below only once you have really tried to answer the question!

A value greater than 1 for the disparate impact ratio indicates that the order of the proportions of defendants predicted recidivist is now inverted.
As indicated by the proportions in the table, a larger proportion of defendants from the White group is now predicted recidivist compared to the Black group.
In this case the values are still very close, but this indicates again that improving fairness for one group often comes at a cost for the other group.

Reweighing¶

In this technique, instead of modifying the dataset, we add information to it by assigning weights to the samples that can be taken into account at training time.

The Reweighing object of aif360 automatically determines the weights, which roughly indicate which samples should count more than others during the training of the model for it to be less biased. We define the attribute that Reweighing will try to protect from discrimination. Then, we pass the computed weights to the model, which takes them into account during its training.

Instructions

Execute the cell below to see the results of fitting a model with the weights provided by aif360.

In [29]:
rw = Reweighing("African-American")
_, weights = rw.fit_transform(Helper.get_train_samples(features.set_index('African-American', append=True)), Helper.get_train_samples(labels))

model = fit_and_display(features.set_index('African-American', append=True), labels, white_defendants, black_defendants, weights=weights, return_model=True)
Accuracy of the model: 65.65874730021598%.
White Black
Proportion predicted recidivist 0.43125 0.495671
Disparate impact ratio: 0.8700327510917031

We obtain an accuracy that is almost the same as that of COMPAS, with a disparate impact ratio above the threshold of 0.8.

Instructions

Execute the cell below to see the weights computed by aif360.

In [30]:
weights_summary = pd.DataFrame(weights).value_counts().reset_index().rename(columns={0: 'Weight', 'count': 'Number of samples'})
display(weights_summary)
Weight Number of samples
0 0.869000 1511
1 0.868478 1505
2 1.164676 1202
3 1.192550 1028

A different weight is calculated for four categories of samples:

  • privileged group + positive outcome
  • privileged group + negative outcome
  • unprivileged group + positive outcome
  • unprivileged group + negative outcome

The table above shows you these weights, and the number of samples affected by each.

If you are interested in more details, see this blog post.
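To give an intuition of where such weights can come from, here is a minimal sketch on a hypothetical toy dataset. It assumes the classic reweighing formula (expected frequency of a category if group and label were independent, divided by its observed frequency), which, as far as we know, is the idea behind aif360's Reweighing; we have not checked the package's exact implementation:

```python
from collections import Counter

# Hypothetical toy dataset of (group, label) pairs; label 1 = recidivist.
samples = ([("unprivileged", 1)] * 60 + [("unprivileged", 0)] * 40
           + [("privileged", 1)] * 40 + [("privileged", 0)] * 60)

n = len(samples)
group_count = Counter(g for g, _ in samples)
label_count = Counter(y for _, y in samples)
pair_count = Counter(samples)

# weight = frequency expected under independence / observed frequency
weights = {
    (g, y): (group_count[g] / n) * (label_count[y] / n) / (pair_count[(g, y)] / n)
    for (g, y) in pair_count
}

def weighted_positive_rate(group):
    """Proportion of label 1 in a group once every sample counts with its weight."""
    pos = pair_count[(group, 1)] * weights[(group, 1)]
    neg = pair_count[(group, 0)] * weights[(group, 0)]
    return pos / (pos + neg)

for key, w in sorted(weights.items()):
    print(key, round(w, 3))
```

As in the table above, two categories get a weight below 1 and two get a weight above 1. Once the weights are applied, both groups have the same weighted proportion of recidivists, which is what removes the imbalance seen by the model.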

Reflection time!

Q1) What are the advantages of the reweighting technique over the transformation technique?
Q2) What is a common drawback of the transformation and reweighting techniques for improving model fairness?

Feedback - Click on the "..." below only once you have really tried to answer the question!

Q1) With the reweighting technique we do not modify the original data.
In addition, since we can know the weights applied to the four categories of samples, this technique is also more transparent (we will further discuss questions around transparency in the Empowerment chapter).

Q2) For both techniques, we use additional algorithms to work on the data, with the risk of introducing new patterns that will be picked up by the model. In both cases, we cannot really know what side effects could arise on aspects other than fairness and accuracy.

Overall, choosing an appropriate data debiasing technique is difficult and multiple criteria need to be considered, which might lead to dilemmas.
Again, there is no magic bullet!

Notes

For the sake of simplicity we use aif360.sklearn. More complex and better versions of model constraining, data transformation, reweighting, etc. are available under aif360.algorithms. If you ever need to address biases in a real dataset for a real model, we advise you to check it out, as it offers more techniques and tools to facilitate the process. See aif360's GitHub page if you are curious.

Synthesis

It is now time to step back and reflect on what you have discovered in this notebook!

Final reflection time!

Summarize what you have learned from this notebook:

  • Name all the different ways we have measured the fairness of algorithms in this notebook (called "fairness metrics")
  • Explain why in the COMPAS case the metrics used by ProPublica and Northpointe give incompatible results
  • Describe the 3 different ways we have tried to improve fairness by working on the design of the model
  • Describe the 2 different ways we have tried to improve fairness by working on the dataset

Feedback - Click on the "..." below only once you have really tried to answer the question!

We have seen three categories of metrics that can be used to compare the performance of a model on different groups:

  • FNR and FPR, the metrics used by ProPublica
  • 1-PPV and 1-NPV, the metrics used by Northpointe
  • The Disparate Impact Ratio (DIR), that we have used on our own logistic regression model

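In the COMPAS case, the metrics used by ProPublica and Northpointe give incompatible results because the two groups do not exhibit the same recidivism rate in the dataset. When the base rates differ, an imperfect classifier cannot have equal FPR and FNR across groups and, at the same time, equal PPV and NPV across groups. This is called the "impossibility result".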
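A small worked example can help to see why the base rate matters. The sketch below uses made-up numbers (not the COMPAS data) and a hypothetical helper `ppv_npv` that derives PPV and NPV from a group's base rate, FPR and FNR using Bayes' rule. Both groups are given exactly the same FPR and FNR, and only the base rate differs.

```python
def ppv_npv(base_rate, fpr, fnr):
    """Derive PPV and NPV from the base rate, FPR and FNR (Bayes' rule)."""
    tpr = 1 - fnr  # true positive rate
    tnr = 1 - fpr  # true negative rate
    ppv = tpr * base_rate / (tpr * base_rate + fpr * (1 - base_rate))
    npv = tnr * (1 - base_rate) / (tnr * (1 - base_rate) + fnr * base_rate)
    return ppv, npv

# Hypothetical groups: identical error rates, different base rates.
ppv_a, npv_a = ppv_npv(base_rate=0.5, fpr=0.3, fnr=0.3)
ppv_b, npv_b = ppv_npv(base_rate=0.3, fpr=0.3, fnr=0.3)

print(f"Group A: PPV = {ppv_a:.3f}, NPV = {npv_a:.3f}")
print(f"Group B: PPV = {ppv_b:.3f}, NPV = {npv_b:.3f}")
```

Even though the two groups are treated identically in terms of FPR and FNR, their PPV and NPV differ: a model that looks fair according to one pair of metrics looks unfair according to the other.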

We have tried out 3 different ways to improve fairness by modifying the design of our model:

  • Removing the sensitive attribute from the data
  • Selecting a subset of the attributes for fitting the model (first by trial and error, then by looking at the correlations)
  • Constraining the model by providing fairness as an objective for the model to optimize

We have tried out 2 different ways to improve fairness by working on the dataset:

  • Applying a transformation that hides the bias in the data
  • Computing weights that should be attributed to samples by the model

Conclusion

When using a Machine Learning model in real-life scenarios involving decisions that will affect people's lives, we should be extremely careful about the ethical issues it can generate.
As engineers, we should also try to think about solutions other than ML. If we decide to use it because it is the best tool for the particular task at hand, then assessing fairness is a crucial part of the creation of the tool!

To go a bit further

A paper (Dressel & Farid) compares the results of COMPAS with those of ordinary people on predicting the risk of recidivism. More broadly, the authors examine the problem of trying to predict recidivism at all. The study found that ordinary people have the same biases as the ML model! We encourage you to read the article to discover why, and to get a deeper understanding of the subject :)

Congratulations! You have reached the end of this long notebook!