Safety Week 1: Content Moderation

How to use this notebook

Simply read the text and follow the instructions.
This notebook contains code cells, which can be modified and must be executed to see the result of their content.
To execute a cell, select it and click on the play button (▶) in the tool bar, or type Shift + Enter or Ctrl + Enter.

As the variables contained in a cell are stored in memory, the order of execution of the cells is important!

Notebook by Maxime Lelièvre, Cécile Hardebolle and the Responsible software team (2024).

Exercises adapted from "Content Moderator Assignment" by Julie Jarzemsky and Casey Fiesler, under license CC BY 4.0.
Source: https://www.internetruleslab.com/ethicsbased-computer-science-assignments#content-mod
Reference: Jarzemsky, J., Paup, J., & Fiesler, C. (2023). "This Applies to the Real World" : Student Perspectives on Integrating Ethics into a Computer Science Assignment. SIGCSE 2023, Toronto, ON, Canada. https://doi.org/10.1145/3545945.3569846

Except where otherwise noted, the content of this notebook is licensed under a Creative Commons Attribution International License (CC BY 4.0 International).


Welcome to the Week 1 exercise session of Responsible Software!

The internet has revolutionized the way we access and share information, but it has also led to the proliferation of harmful content such as hate speech, fake news, cyberbullying, and violent extremism. As a result, detecting and removing harmful content has become a pressing challenge for online platforms and governments alike.

In this exercise, we will look at several content moderation algorithms that aim to detect offensive content and will discuss the challenges associated with such methods.

Learning Goals

What will be covered:

  • Part 1: Basic filtering systems for content moderation (keyword matching, sentiment analysis) on a toy example
  • Part 2: More advanced filtering systems for content moderation (natural language processing and machine learning) on realistic data from Twitter (now named X)

By the end of the session you will be able to:

  • ✅ Implement different filtering systems for content moderation and evaluate their quality
  • ✅ Compute and interpret the accuracy score
  • ✅ Compute and interpret a confusion matrix
  • ✅ Compute and interpret the False Positive Rate (FPR) and the False Negative Rate (FNR)
  • ✅ Explain the safety issues involved in content moderation


Part 1: Cats and dogs

The scenario

Catter is a social media platform built by and for cats. The platform has recently been spammed by dogs, so the cats have decided to remove all mentions of dogs from it entirely. However, cats are not great at programming. They need your help to remove all of the dog content from their platform, and they have gathered some text examples for you in CSV files.

Part 1.1 Keyword Matching

Task 1: Remove posts with the word "dogs"

The first method cats use to moderate content is keyword matching. They ask you to create a program that analyzes each sentence and labels it.

You receive a sample file with a list of sentences to filter.

Instructions

Execute this cell to see the content of the file:

In [1]:
import pandas as pd
import os
# Pandas reads the contents of the file and stores it into a "DataFrame", which is a table
path_catDogText = os.path.join('res', 'catDogText.csv')
sentences = pd.read_csv(path_catDogText, sep = ';')
sentences
Out[1]:
text
0 Cats rule, dogs drule.
1 Cats are the worst.
2 Dogs are the best pet you could ask for.
3 Dogs are a man's best friend.
4 Cats are smarter than dogs.
5 Cats are the best pet.
6 Cats can jump higher than dogs.
7 Cats are the bee's knees.
8 All dogs go to heaven.
9 Dogs are not cool.

To implement your filter, you will proceed in two steps:

  1. Create a function that indicates with a boolean if a given sentence contains a banned word
  2. Loop over the sentences in the sample file and apply your function so the resulting boolean is stored in a second column of the dataframe

Instructions

Complete the function has_banned_word below so that:

  • It returns whether or not a sentence contains the banned word: if the sentence contains the banned word, the function should return True, otherwise False.
  • Your code should catch the banned word, regardless of whether it contains capital letters or not ("dogs" / "Dogs").

Notes

  • In Python, text is handled with strings, which are objects containing a read-only sequence of characters. Strings share some behaviors with arrays or lists (such as indexing, slicing, and iteration). So you can see strings as a list of characters (a read-only one), and you can access individual characters and iterate on it but not modify it. Check this page for more information on strings.
  • Python has an operator in that lets you check whether a value is present in a sequence, e.g. 1 in [2, 5, 9, 1] will return True. This operator also works on strings, e.g. "t" in "test" will also return True. Check this page for more details on this membership operator.
  • Check the function str.lower()
In [2]:
def has_banned_word(sentence_to_scan, banned_word):
    """ has_banned_word checks whether sentence_to_scan contains the banned_word
    Arguments:
    - sentence_to_scan: a string of words to scan
    - banned_word: a string to scan for
    Returns:
    - True if the text contains any instances of the banned_word
    - False otherwise
    """
    ### YOUR CODE HERE
    result = banned_word.lower() in sentence_to_scan.lower() # SOLUTION
    ### END OF YOUR CODE
    
    return result

Run the following tests to check your work.

In [3]:
from res.tests import *
test(has_banned_word)
🆗 Tests passed ! =)

Instructions

Complete the cell below so that:

  • Each sentence is analyzed by the filter
In [4]:
# banned word to look for
banned = "dogs"

# iterating over the rows in the dataframe
for i, row in sentences.iterrows():
    # getting the sentence in that row
    sentence = row['text']
          
    ### YOUR CODE HERE
    # checking whether the sentence contains the banned word
    contains_banned_word = has_banned_word(sentence, banned) # SOLUTION
    ### END OF YOUR CODE
    
    # Storing the result in the table
    sentences.at[i, 'contains_banned_word'] = contains_banned_word 

# show the result, highlighting in red the sentences that contain the banned word
sentences.style.apply(lambda r: ['color: red'] * len(r) if r['contains_banned_word'] else [''] * len(r), axis=1)
Out[4]:
  text contains_banned_word
0 Cats rule, dogs drule. True
1 Cats are the worst. False
2 Dogs are the best pet you could ask for. True
3 Dogs are a man's best friend. True
4 Cats are smarter than dogs. True
5 Cats are the best pet. False
6 Cats can jump higher than dogs. True
7 Cats are the bee's knees. False
8 All dogs go to heaven. True
9 Dogs are not cool. True
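
The row-by-row loop can also be expressed with the pandas method Series.apply, which calls a function on every value of a column and returns the results as a new column. The following is a minimal sketch on a hand-built three-row table (sentences copied from the sample above), not the reference solution of this notebook; has_banned_word is re-declared and the table name demo is made up so that the snippet runs on its own.

```python
import pandas as pd

def has_banned_word(sentence_to_scan, banned_word):
    # Case-insensitive substring check, as in Task 1
    return banned_word.lower() in sentence_to_scan.lower()

# Small illustrative table (three sentences copied from the sample file)
demo = pd.DataFrame({"text": [
    "Cats rule, dogs drule.",
    "Cats are the worst.",
    "Dogs are not cool.",
]})

# apply() calls the function once per sentence and returns a boolean column
demo["contains_banned_word"] = demo["text"].apply(
    lambda sentence: has_banned_word(sentence, "dogs")
)
print(demo)
```

Both versions give the same labels; the explicit loop is easier to follow step by step, while apply keeps the labelling in a single expression.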

Reflection time !

Are you satisfied with the results of your filter?

Feedback - Click on the "..." below only once you have really tried to answer the question!

We can't be very happy about this filter because:

  • some negative statements are not caught by this solution, e.g. "Cats are the worst".
  • some positive statements are caught by the filter where they should not be, e.g. "Cats are smarter than dogs".

Task 2: Matching a list of banned words

The dogs have started to get creative while putting content onto Catter, using slang like “doggo”, “dawg”, etc. To fix this, the developers of Catter are maintaining a list of words to remove.

Instructions

Complete the function has_word_from_list below so that:

  • It returns whether or not a sentence contains any of the banned words in a provided list of words.

💡Tip: functions are made to be reused, your previous function has_banned_word can be useful here...

In [5]:
def has_word_from_list(sentence_to_scan, list_of_words):
    """ has_word_from_list checks whether the sentence_to_scan contains any words from the list_of_words

      Arguments:
      - sentence_to_scan: a string of words to scan
      - list_of_words: The list of banned words to scan for (lowercase)

      Returns:
      - True if sentence_to_scan contains any words from the list_of_words
      - False otherwise
    """
    for banned_word in list_of_words:
        ### YOUR CODE HERE
        # BEGIN SOLUTION
        if has_banned_word(sentence_to_scan, banned_word):
            return True
    return False
    # END SOLUTION

Run the following tests to check your work.

In [6]:
test(has_word_from_list)
🆗 Tests passed ! =)

Here is a list of the slang words that dogs have started to use:

In [7]:
# retrieving the list of words that are banned
path_bannedWords = os.path.join('res', 'bannedWords.csv')
list_of_banned_words = pd.read_csv(path_bannedWords, sep = ';')['text']
list_of_banned_words
Out[7]:
0      dawgs
1        dog
2       dogs
3      doggo
4    doggies
Name: text, dtype: object

Instructions

Complete the cell below so that:

  • Each sentence is analyzed by the filter
In [8]:
# reading the test file with slang
path_dog_variants = os.path.join('res', 'dogVariants.csv')
sentences = pd.read_csv(path_dog_variants, sep = ';')
  
# iterating over the rows in the dataframe, `i` is the index of the row, and `row` contains the row itself 
for i, row in sentences.iterrows():
    # getting the sentence in that row
    sentence = row['text']
          
    ### YOUR CODE HERE
    # checking whether the sentence contains the banned word
    contains_banned_word = has_word_from_list(sentence, list_of_banned_words) # SOLUTION
    ### END OF YOUR CODE
    
    # Storing the result in the table
    sentences.at[i, 'contains_banned_word'] = contains_banned_word 

# show the result
sentences.style.apply(lambda r: ['color: red'] * len(r) if r['contains_banned_word'] else [''] * len(r), axis=1)
Out[8]:
  text contains_banned_word
0 dawgs are the best! True
1 He's a cool cat. False
2 Cats > doggos True
3 Did you know that cats can live 1000 lives ? False
4 Doggies forever! True
5 I heard that this kibble is the best one of all. False
6 Fido had to get his tooth pulled. False
7 Isn't her doggo the cutest thing? True
8 Cats are horrible False

Reflection time !

Are you satisfied with the results of your filter?

Feedback - Click on the "..." below only once you have really tried to answer the question!

While we have some of the same problems as before, the filter seems to be a bit more robust.
However, it still misses dog-related posts that use none of the listed words (e.g. "Fido had to get his tooth pulled."). It is a step forward, but we can't be really happy with this filter yet.
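
One detail worth noticing is that the filter matches substrings: "Cats > doggos" is flagged only because "dog" and "doggo" appear inside "doggos", and an unrelated word such as "dogma" would be flagged too. The sketch below shows one possible alternative that compares whole words instead. The function name has_whole_word_from_list and the simple letters-only tokenization are assumptions made for illustration, not part of the exercise.

```python
import re

def has_whole_word_from_list(sentence_to_scan, list_of_words):
    # Split the lower-cased sentence into alphabetic tokens,
    # e.g. "Cats > doggos" -> {"cats", "doggos"}
    tokens = set(re.findall(r"[a-z]+", sentence_to_scan.lower()))
    # True if at least one banned word appears as a complete token
    return any(word.lower() in tokens for word in list_of_words)

# Banned words copied from the list shown above
banned_words = ["dawgs", "dog", "dogs", "doggo", "doggies"]

# Flagged: "doggo" is a complete word of the sentence
print(has_whole_word_from_list("Isn't her doggo the cutest thing?", banned_words))
# Not flagged: "doggos" itself is not in the list
print(has_whole_word_from_list("Cats > doggos", banned_words))
# Not flagged, whereas the substring filter would flag it because of "dog"
print(has_whole_word_from_list("Dogma is not a pet", banned_words))
```

Whole-word matching avoids accidental hits inside longer words, but it also misses variants that are not listed explicitly, so neither approach is clearly better.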


Part 1.2 Sentiment Analysis

The dogs are now working to spam Catter with negative statements about cats.

To address this issue, we will use the NLTK toolkit, a widely used library for processing text. It provides a number of practical tools for language processing and includes several corpora and lexical resources (documentation here).

We will use a sentiment analyzer developed for social media text called VADER, which stands for Valence Aware Dictionary and sEntiment Reasoner. This tool is rule-based and relies on a lexicon of sentiment-related words, i.e. a list of words with associated sentiment ratings (positive or negative).

In the following cell, we first initialize the sentiment analyzer SentimentIntensityAnalyzer and show an example of the function polarity_scores.

In [9]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

# We instantiate the sentiment analyzer
sia = SentimentIntensityAnalyzer()

# Example of its usage
ps = sia.polarity_scores("This person is good")
ps
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/mac/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
Out[9]:
{'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404}

The result of polarity_scores consists of:

  • 3 values, neg (negativity), neu (neutrality) and pos (positivity), which add up to 1
  • and the compound, which is an overall normalized score that ranges from -1 (very negative) to 1 (very positive).

From the result of the analyzer, you can extract individual scores in the following way:

In [10]:
ps['neu']
Out[10]:
0.508
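
The structure of the scores can be checked by hand. The sketch below reuses the values printed in Out[9] (copied as a plain dictionary, so NLTK is not needed) and shows one possible way to turn the compound score into a coarse label. The helper label_from_compound and its 0.05 cut-off are assumptions: the cut-off is a commonly used convention for VADER, not something this notebook prescribes.

```python
# Scores copied from Out[9] for "This person is good"
example_scores = {'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404}

# The three proportions add up to 1 (up to rounding)
total = example_scores['neg'] + example_scores['neu'] + example_scores['pos']
print(round(total, 3))

def label_from_compound(compound, threshold=0.05):
    # Hypothetical helper: map the compound score to a coarse label.
    # Scores close to zero are treated as neutral.
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

print(label_from_compound(example_scores['compound']))
```

A stricter or looser threshold changes which statements count as negative, which is a design decision with direct consequences for what gets filtered.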

Task: Detect negative statements

Instructions

Write the function is_negative_statement so that:

  • The statement is evaluated using sia.
  • The function returns True if the statement is negative, False otherwise.
In [11]:
def is_negative_statement(statement, sia=sia):
    """ is_negative_statement checks whether the statement has a negative compound polarity score.

    Arguments:
    - statement: the string to analyze
    - sia: the sentiment analyzer (defaults to the one initialized in the previous cell)

    Returns:
    - True if the statement is negative
    - False if the statement is neutral or positive
    """
  
    ### YOUR CODE HERE
    result = sia.polarity_scores(statement)['compound'] < 0 # SOLUTION
    ### END OF YOUR CODE

    return result

Run the following tests to check your work.

In [12]:
test(is_negative_statement)
🆗 Tests passed ! =)

Let's look at the new sentences we have and load them into a pandas dataframe.

In [13]:
path_catNegativity = os.path.join('res', 'catNegativity.csv')
sentences = pd.read_csv(path_catNegativity, sep = ";")
print(f'Number of sentences: {len(sentences)}')
sentences.head() #shows the first 5 rows of the dataframe
Number of sentences: 6
Out[13]:
text
0 cats are the worst!
1 I hate cats
2 Ewwww cats
3 Cats are so cute and amazing
4 Dogs are the worst

Instructions

Complete the cell below so that:

  • Each sentence is analyzed by the sentiment analyzer
In [14]:
for i, row in sentences.iterrows():
    sentence = row['text']
    
    ### YOUR CODE HERE
    is_negative = is_negative_statement(sentence, sia) # SOLUTION
    ### END OF YOUR CODE

    sentences.at[i, 'negativity'] = is_negative

# let's apply some coloring to better visualize the result
sentences.style.apply(lambda r: ['color: red'] * len(r) if r['negativity'] else [''] * len(r), axis=1)
Out[14]:
  text negativity
0 cats are the worst! True
1 I hate cats True
2 Ewwww cats False
3 Cats are so cute and amazing False
4 Dogs are the worst True
5 Dogs are superior to cats False

Reflection time !

Did the sentiment analyzer filter out all sentences correctly ?

  • Yes
  • No

If we were considering a social media platform for humans instead of cats, what could be the potential implications of using this system to filter out toxic content ?

Feedback - Click on the "..." below only once you have really tried to answer the question!

Quiz answer:
No, some negative sentences have not been filtered, e.g. "Ewwww cats". Conversely, "Dogs are the worst" is flagged even though it does not target cats.


The system would not really prevent the exposure of users to some toxic content on the platform.
However we don't have a clear idea of the proportion of toxic content that would escape, we need to have better measures to quantify these effects.

Conclusion from Part 1

You have now practiced a few basic filtering techniques (keyword matching, sentiment analysis) on the example of Catter, and you have seen their limits.
In the next part of the exercise, you will practice with more realistic data that comes from Twitter, and apply more advanced techniques of Natural Language Processing (NLP) that are now widely used in the latest moderation systems.


Part 2: Twitter

This part uses real-world data from Twitter. You will practice using Natural Language Processing (NLP) techniques in two tasks:

  • Task 1: filter out offensive content based on an opinion lexicon, and learn how to use the accuracy score and the confusion matrix
  • Task 2: use a Machine Learning model to detect offensive content, and evaluate its performance

The dataset

Let's first have a look at the tweet_eval dataset which is a collection of tweets from Twitter (now named X) available on the Hugging Face platform.
We load it using the load_dataset method of the datasets library, specifying that we only want to load the data about offensive content. After loading it, we print it to see the structure of the data.

Note: this cell may take a few seconds to execute.

In [15]:
from datasets import load_dataset

# dataset that we will use to evaluate our model
dataset = load_dataset("tweet_eval", "offensive")
dataset
Out[15]:
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 11916
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 860
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1324
    })
})

As can be seen from the structure, the data is separated into three subsets, which are called train, validation and test data. This is a common approach in machine learning where the data is split into different subsets for different tasks. Basically, the train data is used to develop a machine learning model, the validation data is used to refine and optimize the model, and the test data is used to measure the performance of the resulting model on previously unseen data.
We will come back to these concepts later.
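
To get a feeling for the relative sizes of the three subsets, the row counts printed above can be turned into proportions. This is a small sketch that only uses the numbers from the output of the previous cell; the variable names are made up for the example.

```python
# Row counts copied from the DatasetDict printed above
split_sizes = {"train": 11916, "validation": 1324, "test": 860}

total_rows = sum(split_sizes.values())
for name, size in split_sizes.items():
    # Share of the whole dataset held by each subset
    print(f"{name:<10} {size:>6} rows  ({size / total_rows:.1%})")
print(f"{'total':<10} {total_rows:>6} rows")
```

Most of the data goes to training, while the test set, which is the one used for evaluation in the rest of this notebook, is comparatively small.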

You can access a datapoint from one of the subsets using array indexing. The cell below displays the first tweet among the list of tweets in the training dataset as an example.

In [16]:
dataset["train"][0]
Out[16]:
{'text': '@user Bono... who cares. Soon people will understand that they gain nothing from following a phony celebrity. Become a Leader of your people instead or help and support your fellow countrymen.',
 'label': 0}

As you see above, a datapoint is a dictionary with 2 keys:

  • The key text refers to the text of the tweet itself;
  • The key label refers to a value which can be 0 or 1, meaning "non-offensive" and "offensive" respectively.
    These labels have been determined by human experts during data labelling (more information about it here). Such labels are often called "ground truth", however this term is questionable as it suggests the labels are perfectly factual and accurate, when in reality they are often subjective, noisy, or biased depending on how they were collected or who annotated them. For instance, the tweet above has been labelled as non-offensive when in fact many people would consider it offensive.

In the following, we will compare:

  • the labels attributed by the human experts
  • with the labels determined by the filtering system we will build.

This will allow us to check if our filter works or not.

Part 2.1: Using an opinion lexicon

In this exercise, we will first implement a simple method to identify harmful content by checking whether the text contains negative words. Instead of building and using our own list of negative words, we will use a lexicon provided by the NLTK library that we started to use in the previous section. A lexicon is a kind of dictionary that contains a list of language elements associated with other information. The lexicon we will use was built at the Computer Science department of the University of Illinois at Chicago and contains around 6800 words along with the opinion (negative or positive) they represent.

Implementing the filtering system

First we need to load the list of negative words from the lexicon.

In [17]:
import nltk
from nltk.corpus import opinion_lexicon

# Retrieve the list of negative words
nltk.download('opinion_lexicon')
negative_words = opinion_lexicon.negative()

# Show a few examples
negative_words
[nltk_data] Downloading package opinion_lexicon to
[nltk_data]     /Users/mac/nltk_data...
[nltk_data]   Package opinion_lexicon is already up-to-date!
Out[17]:
['2-faced', '2-faces', 'abnormal', 'abolish', ...]

Next let's implement our filtering system using this list of negative words.

Instructions

What is the function has_negative_word below doing? To figure it out, comment very briefly what each line is doing!

In [18]:
# BEGIN SOLUTION NO PROMPT
# Function that tests whether a text contains one of the negative words
def has_negative_word(negative_words, text):
    # Iterate over all negative words
    for negative_word in negative_words:
        # Test whether the text contains a negative word - we put the text in lower case because the lexicon is in lower case
        if negative_word in text.lower():
            # Interrupt the loop and return 1 as soon as we find one negative word
            return 1
    # Return 0 if we didn't find any negative word
    return 0

# END SOLUTION
""" # BEGIN PROMPT
def has_negative_word(negative_words, text):
    for negative_word in negative_words:
        if negative_word in text.lower():
            return 1
    return 0
"""; # END PROMPT

Feedback - Click on the "..." below only once you have really tried to answer the question!

The function determines whether a text (e.g. a tweet) contains one of the negative words from the provided list (e.g. the lexicon).
It iterates over all the negative words in the list and interrupts as soon as one negative word is found to return 1, or return 0 if no negative word is found (which will take longer to check).
The text is put in lower case because the lexicon is in lower case.
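
The same logic can be written more compactly with the built-in any(), which also stops at the first match. The sketch below is one possible variant, not the version used in the rest of this notebook: the name has_negative_word_any and the three-word list are made up for illustration. It lower-cases the text once instead of once per lexicon entry.

```python
def has_negative_word_any(negative_words, text):
    # Lower-case the text once instead of once per lexicon entry
    lowered = text.lower()
    # any() stops at the first match, like the early return in the loop version
    return int(any(negative_word in lowered for negative_word in negative_words))

# Tiny made-up word list for illustration only (not the NLTK lexicon)
toy_negative_words = ["awful", "hate", "worst"]

print(has_negative_word_any(toy_negative_words, "I HATE Mondays"))
print(has_negative_word_any(toy_negative_words, "What a lovely day"))
```

Like the original function, this variant checks substrings, so a negative word hidden inside a longer, harmless word still counts as a match.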

⚠️⚠️⚠️ Do not forget to execute the cell above to make sure the has_negative_word function is defined. ⚠️⚠️⚠️

Now we are going to apply this function to the test dataset with the offensive tweets and see which tweets it flags as offensive, then compare with the labels from the experts.

In the following, we will use the term predictions to talk about the outputs of our filtering system.
Like the term label that we introduced earlier, the term prediction comes from the field of Machine Learning (ML), where algorithms are used to generate probable values for variables.

In [19]:
# Initialize two lists for storing the predictions and the labels
predictions = []
labels = []

# Iterate over all texts in the test dataset
for sample in dataset["test"]:
    # Store the result of the function indicating whether the sample contains a negative word
    predictions.append(has_negative_word(negative_words, sample["text"]))
    # Store the actual label 
    labels.append(sample["label"])

# Build a two-column dataframe with the results
comparisons = pd.DataFrame({'prediction': predictions, 'label': labels})

# Let's have a look at a few rows from the result:
comparisons.sample(5, random_state=30) # Draws 5 random rows among the dataframe (the random_state parameter ensures the reproducibility of the code)
Out[19]:
prediction label
290 1 1
661 1 0
200 1 0
80 1 1
148 1 1

Reflection time !

How much does our filter agree with the experts in the above sample?

Feedback - Click on the "..." below only once you have really tried to answer the question!

The filter agrees with the experts on samples 290, 80 and 148, which both rate as offensive.
On the other hand, our system rates samples 661 and 200 as offensive while the experts don't.
Overall, our system rates all five sampled tweets as offensive, so it looks like it may be overfiltering.

Task 1: Compute the accuracy

We can now evaluate our basic method by comparing our predictions with the actual labels. The accuracy score is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. Formally, accuracy has the following definition:

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$
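
As a worked example of this formula, the five rows sampled in Out[19] can be scored by hand with plain Python lists. This is only an illustration on five tweets, so the value says nothing reliable about the full test set; the list names are made up for the example.

```python
# Five (prediction, label) pairs copied from the sample shown in Out[19]
sample_predictions = [1, 1, 1, 1, 1]
sample_labels = [1, 0, 0, 1, 1]

# A prediction is correct when it equals the label
correct = sum(1 for p, y in zip(sample_predictions, sample_labels) if p == y)
sample_accuracy = correct / len(sample_labels)
print(f"{correct} correct out of {len(sample_labels)} -> accuracy = {sample_accuracy}")
```

Here 3 of the 5 predictions match the label, which gives an accuracy of 0.6 on this small sample.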

Instructions

Using the comparisons dataframe, compute the accuracy of our filtering method.

Note:

  • you can identify which rows have equal values in two columns of a dataframe with: dataframe[dataframe["columnA"] == dataframe["columnB"]]
  • you can obtain the length of a dataframe (number of rows) using: len(dataframe)
In [20]:
def compute_accuracy_score(comparisons):
    """ Computes the accuracy score for a dataset with two columns

    Arguments:
    - comparisons: a dataframe with two columns labelled 'prediction' and 'label'

    Returns:
    - accuracy: proportion of correct predictions among all rows in the dataset
    """
    
    ### YOUR CODE HERE
    correct_pred = len(comparisons[comparisons["prediction"] == comparisons["label"]]) # SOLUTION
    total_pred = len(comparisons) # SOLUTION

    accuracy = correct_pred / total_pred # SOLUTION
    ### END OF YOUR CODE
    
    return accuracy

print(f'Accuracy: {compute_accuracy_score(comparisons)}')
Accuracy: 0.4232558139534884

To know whether your computation is correct, you can compare with the result given by scikit-learn, a popular machine learning library, which provides an easy way to compute the accuracy.
The function accuracy_score takes two arguments:

  • first the ground truth
  • second the predictions

Execute the code cell below to see the result:

In [21]:
from sklearn.metrics import accuracy_score
print(f'Accuracy: {accuracy_score(comparisons["label"], comparisons["prediction"])}')
Accuracy: 0.4232558139534884

As we can see, our basic method is only correct for about 42% of the samples.

Task 2: Generate the confusion matrix

To get a better idea of the errors of our approach, we can plot the confusion matrix of our results.
It provides us with a way to evaluate how well the model is doing at this task by comparing the result provided by the system (predicted value) and the ground truth (actual value, called label). It breaks down performance into four categories: true positives, true negatives, false positives, and false negatives.
In our context, "positive" i.e. the value 1 means that the content of the tweet is considered offensive, therefore:

  • A true positive (TP) is when the model correctly identifies offensive content as offensive.
  • A false positive (FP) is when the model incorrectly identifies non-offensive content as offensive. This is sometimes referred to as a "false alarm" because the model is sounding the alarm for content that is not actually offensive.
  • A true negative (TN) is when the model correctly identifies non-offensive content as non-offensive.
  • A false negative (FN) is when the model incorrectly identifies offensive content as non-offensive. This is sometimes referred to as a "miss" because the model is missing the offensive content that it was supposed to identify.

The confusion matrix looks like this:

                              Predicted value
                              Positive (1)      Negative (0)
Actual value   Positive (1)   True Positives    False Negatives
(i.e. label)   Negative (0)   False Positives   True Negatives

Based on the definitions above, for binary classification, accuracy can also be calculated as follows:

$$\text{Accuracy} = \frac{\text{TP + TN}}{\text{TP + TN + FP + FN}}$$

By looking at the numbers in each category of the confusion matrix, we can get a better sense of how well the model is performing overall.
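
To make the four categories concrete, here is a minimal sketch that counts them for the five rows sampled in Out[19], using only the standard library. It is a different technique from the pandas filtering used in the exercise below, and the list names are made up for the example.

```python
from collections import Counter

# Five (prediction, label) pairs copied from the sample shown in Out[19]
sample_predictions = [1, 1, 1, 1, 1]
sample_labels = [1, 0, 0, 1, 1]

# Count how often each (prediction, label) combination occurs
pair_counts = Counter(zip(sample_predictions, sample_labels))
tp = pair_counts[(1, 1)]  # predicted offensive, labelled offensive
fp = pair_counts[(1, 0)]  # predicted offensive, labelled non-offensive
tn = pair_counts[(0, 0)]  # predicted non-offensive, labelled non-offensive
fn = pair_counts[(0, 1)]  # predicted non-offensive, labelled offensive

# Accuracy recomputed from the four categories
sample_accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, fp, tn, fn, sample_accuracy)
```

On this sample there are 3 true positives and 2 false positives, and no negative predictions at all, which already hints at a filter that flags too much.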

Instructions

Complete the function compute_confusion_matrix below: read the example provided for the true positives, then use the same syntax for the other parts of the confusion matrix.

As we have seen in the Introduction notebook, remember:

  • The boolean operators to use with the pandas library are & and |, and parentheses are mandatory around each condition when combining more than one.
  • .shape[0] will give you the number of rows of a dataframe (but you could also use len(...)).
In [22]:
def compute_confusion_matrix(comparisons):
    """ Computes the four elements of the confusion matrix: TP, FP, TN and FN

    Arguments:
    - comparisons: a dataframe with two columns labelled 'prediction' and 'label'

    Returns:
    - TP, FP, TN and FN (in this order)
    """

    ### YOUR CODE HERE
    # reminder: offensive = 1 = positive
    TP = comparisons[(comparisons['prediction'] == 1) & (comparisons['label'] == 1)].shape[0]
    TN = comparisons[(comparisons['prediction'] == 0) & (comparisons['label'] == 0)].shape[0] # SOLUTION
    FP = comparisons[(comparisons['prediction'] == 1) & (comparisons['label'] == 0)].shape[0] # SOLUTION
    FN = comparisons[(comparisons['prediction'] == 0) & (comparisons['label'] == 1)].shape[0] # SOLUTION
    ### END OF YOUR CODE
    
    return TP,FP,TN,FN

Run the following tests to check your work.

In [23]:
test(compute_confusion_matrix)
🆗 Tests passed ! =)

Run the following cell, which plots the confusion matrix computed by your implementation, to visualize the results!

In [24]:
TP, FP, TN, FN = compute_confusion_matrix(comparisons)
draw_confusion_matrix(TP, FP, TN, FN)
[Figure: confusion matrix of the lexicon-based filter]

Reflection time !

Q1) What conclusion can you make from the confusion matrix ?

Q2) Where do you think the high number of false positives comes from? Hint: Try to think about tweets that could be false positives.

Q3) Do you think the model successfully differentiates offensive from non-offensive tweets ?

Feedback - Click on the "..." below only once you have really tried to answer the question!

Q1) From the confusion matrix, we see that the number of false positives is very high, which means that the model over-labels content as offensive, i.e. the model censors a lot of content.

Q2) The high number of false positives might come from tweets that contain negative words without any intent to offend, for instance when a tweet is sarcastic. Innocent tweets are thus flagged as offensive.

Q3) The model doesn't successfully differentiate offensive from non-offensive tweets, as it considers most non-offensive tweets to be offensive.

Quiz time!

In general, a toxic content filtering algorithm should always minimize the number of:

  • True Positive and True Negative
  • True Positive and False Positive
  • True Positive and False Negative
  • True Negative and False Positive
  • True Negative and False Negative
  • False Positive and False Negative

Feedback - Click on the "..." below only once you have really tried to answer the question!

False Positive and False Negative, as True Positives & Negatives are correctly labeled samples.

We have seen with the accuracy and confusion matrix that this filter is not effective at detecting offensive content: while it may protect users from most harmful content (high number of True Positives), it actually harms users by censoring them (very high number of False Positives), therefore infringing on their freedom of speech.

We have to mention here that lexicons are usually not used in that way. They are important elements of content moderation approaches that use Machine Learning techniques, which we are going to explore in Part 2.2. However, this first simple filter has allowed you to practice two essential tools for evaluating the quality of such software: the accuracy score and the confusion matrix.

Part 2.2: Using a deep learning model

In this exercise, we will use a deep learning model to detect offensive content.
It would be highly inefficient to have you train your own deep learning model, which is why we have created one for you by "fine-tuning" an existing language model called DistilBERT base model (uncased) on the tweet_eval dataset. (More information on fine-tuning here; this approach of further training an existing model has shown significant improvement over training a new model from scratch on the new data.) We have made the fine-tuned model available on the Hugging Face platform: https://huggingface.co/RS-course/model-safety-W1

In the following you will:

  • evaluate the performance of this model
  • use this model on example tweets

Task 1: Compute the accuracy and confusion matrix

After training our model with the train dataset, we have tested the model on the test dataset and saved the results into the tweetsPredictions.csv file.
Your task is to use the accuracy score and the confusion matrix that we have seen above to evaluate the quality of this deep learning model.

Execute the cell below to load the results of our test:

In [25]:
# Retrieve the results from the CSV file and load them into a pandas dataframe
path_tweetsPredictions = os.path.join('res', 'tweetsPredictions.csv')
model_test_results = pd.read_csv(path_tweetsPredictions)
model_test_results.head()
Out[25]:
prediction label
0 1 1
1 1 0
2 1 0
3 0 0
4 1 0
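
As a reminder of what the confusion matrix counts, here is a minimal sketch applied to the five rows displayed above. It is not the compute_confusion_matrix function you wrote in Part 2.1. The name count_outcomes is made up for this illustration, and the sketch assumes that 1 means offensive and 0 means non-offensive, as in the rest of this notebook.

```python
import pandas as pd

def count_outcomes(df):
    """Count TP, FP, TN, FN in a dataframe with 'prediction' and 'label' columns (1 = offensive)."""
    predicted_offensive = df["prediction"] == 1
    actually_offensive = df["label"] == 1

    TP = int((predicted_offensive & actually_offensive).sum())    # flagged and offensive
    FP = int((predicted_offensive & ~actually_offensive).sum())   # flagged but not offensive
    TN = int((~predicted_offensive & ~actually_offensive).sum())  # not flagged, not offensive
    FN = int((~predicted_offensive & actually_offensive).sum())   # not flagged but offensive
    return TP, FP, TN, FN

# The five rows displayed by model_test_results.head() above
head_rows = pd.DataFrame({"prediction": [1, 1, 1, 0, 1], "label": [1, 0, 0, 0, 0]})
head_counts = count_outcomes(head_rows)
print(head_counts)  # (1, 3, 1, 0)
```

On these five rows alone, three of the four tweets flagged as offensive are labelled non-offensive, which is why a full evaluation on the whole test dataset is needed before drawing conclusions.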

Instructions

Complete the cell below to compute the accuracy of the deep learning model (you can reuse the code from previous exercises).

NB: to retrieve one column from a dataframe, you can use: dataframe["columnname"]

In [26]:
### YOUR CODE HERE
accuracy_score(model_test_results["label"], model_test_results["prediction"]) # SOLUTION
Out[26]:
0.8151162790697675

Instructions

Complete the cell below to create the confusion matrix of the deep learning model by using the function you have already implemented.

In [27]:
### YOUR CODE HERE
TP,FP,TN,FN = compute_confusion_matrix(model_test_results) # SOLUTION
print(f'TP: {TP}\nFP: {FP}\nTN: {TN}\nFN: {FN}\n') # SOLUTION
draw_confusion_matrix(TP, FP, TN, FN) # SOLUTION
TP: 162
FP: 81
TN: 539
FN: 78

[Figure: confusion matrix of the deep learning model]

Check your work:

  • You should obtain an accuracy of: 0.8151162790697675
  • Your confusion matrix should look like the following:
TP = 162 FN = 78
FP = 81 TN = 539
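
The two expected results above are consistent with each other: the accuracy is the share of correct predictions, that is (TP + TN) divided by the total number of tweets. A quick sketch that uses only the four counts listed above (the variable name accuracy_from_counts is illustrative):

```python
# Counts taken from the confusion matrix above
TP, FP, TN, FN = 162, 81, 539, 78

total = TP + FP + TN + FN                  # number of tweets in the test results
accuracy_from_counts = (TP + TN) / total   # correct predictions / all predictions

print(total)                           # 860
print(round(accuracy_from_counts, 4))  # 0.8151
```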

Reflection time!

Compare with the accuracy and confusion matrix you obtained with our basic system based on the lexicon in Part 2.1.
What do you think about the performance of the model?

Feedback - Click on the "..." below only once you have really tried to answer the question!

The accuracy has roughly doubled compared to the previous technique, which is a great improvement!
In the confusion matrix, the number of true negatives has increased a lot, and the number of false positives has decreased a lot, which is good!
However, the number of true positives has decreased and the number of false negatives has increased, which is less good.

Task 2: Relative proportions of false positives and false negatives¶

One issue with the confusion matrix as presented above is that it shows absolute numbers, but we have not yet checked how many offensive and non-offensive tweets the test dataset contains. As a result, we don't know how well the model performs relative to the contents of the dataset.
Let's have a look:

In [28]:
counts = model_test_results["label"].value_counts()
print("Number of non-offensive tweets (labelled 0):", counts[0])
print("Number of offensive tweets (labelled 1):", counts[1])
Number of non-offensive tweets (labelled 0): 620
Number of offensive tweets (labelled 1): 240

From the cell above we see there are way more non-offensive tweets (labelled 0) than offensive tweets (labelled 1).
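
This imbalance matters when reading the accuracy. As a point of comparison, a hypothetical filter that never flags anything (it labels every tweet as non-offensive) would already be right on all 620 non-offensive tweets. The sketch below computes the accuracy of such a filter from the counts printed above; this "never flag" filter is only an illustration and is not used anywhere in this notebook.

```python
# Counts printed by the cell above
n_non_offensive = 620
n_offensive = 240

# A filter that never flags anything is correct exactly on the non-offensive tweets
baseline_accuracy = n_non_offensive / (n_non_offensive + n_offensive)
print(f"Accuracy of a filter that never flags anything: {baseline_accuracy:.1%}")
# prints: Accuracy of a filter that never flags anything: 72.1%
```

Such a filter would let every single offensive tweet through, which the accuracy score alone does not reveal.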

To have a clearer idea about the performance of the model, we will now take into account the number of offensive and non-offensive tweets which are in the test dataset, and compare the errors made by the model with these. For this, we are going to compute two additional metrics:

  • the False Positive Rate (FPR): it represents the proportion of tweets that have been wrongly predicted as offensive compared to all the tweets which are actually non-offensive, therefore it gives us an idea of how many have been wrongly censored among non-offensive tweets.
$$FPR = \frac{FP}{FP + TN}$$
  • the False Negative Rate (FNR): it represents the proportion of tweets that have been deemed non-offensive compared to all the tweets which are actually offensive, therefore it gives us an idea of how many have wrongly escaped moderation among offensive tweets.
$$FNR = \frac{FN}{FN + TP}$$

Both metrics should be as low as possible.

Instructions

Complete the cell below to compute the FPR and FNR for this model.

In [29]:
def false_positive_rate(TP, FP, TN, FN):
    ### YOUR CODE HERE
    FPR = FP / (FP + TN) # SOLUTION
    
    return FPR

def false_negative_rate(TP, FP, TN, FN):
    ### YOUR CODE HERE
    FNR = FN / (FN + TP) # SOLUTION

    return FNR

Run the following tests to check your work.

In [30]:
test(false_positive_rate)
🆗 Tests passed ! =)
In [31]:
test(false_negative_rate)
🆗 Tests passed ! =)

Now let's have a look at the FPR and FNR for our deep learning model:

In [32]:
# Load the predictions from the model on the test dataset
comparisons = pd.DataFrame({'prediction': model_test_results["prediction"], 'label': model_test_results["label"]})
TP,FP,TN,FN = compute_confusion_matrix(comparisons)

# Compute the FPR and FNR
FPR = false_positive_rate(TP, FP, TN, FN)
FNR = false_negative_rate(TP, FP, TN, FN)

# Display the results
print(f'Proportion of tweets wrongly censored (FPR): {round(FPR*100)}%')
print(f'Proportion of tweets that have wrongly escaped moderation (FNR): {round(FNR*100)}%')
Proportion of tweets wrongly censored (FPR): 13%
Proportion of tweets that have wrongly escaped moderation (FNR): 32%

Reflection time !

What do you think about the performance of the model based on these two metrics?

Feedback - Click on the "..." below only once you have really tried to answer the question!

The FNR is 32%, which means that approximately one third of offensive tweets go unmoderated, which is quite a high proportion.
In addition, the FNR is more than twice as high as the FPR (13%), meaning that the model is more than twice as likely to let an offensive tweet go unmoderated as it is to wrongly censor a non-offensive tweet. In a sense, this model is relatively liberal, which may result in more people being exposed to harmful content.
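
The figures quoted in this feedback can be re-derived from the confusion matrix counts obtained in Task 1. The sketch below does the arithmetic; the variable name ratio is illustrative.

```python
# Counts from the confusion matrix of the deep learning model (Task 1)
TP, FP, TN, FN = 162, 81, 539, 78

FPR = FP / (FP + TN)   # 81 wrongly flagged out of 620 non-offensive tweets
FNR = FN / (FN + TP)   # 78 missed out of 240 offensive tweets
ratio = FNR / FPR      # how many times higher the FNR is than the FPR

print(f"FPR = {FPR:.3f}")          # FPR = 0.131
print(f"FNR = {FNR:.3f}")          # FNR = 0.325
print(f"FNR / FPR = {ratio:.1f}")  # FNR / FPR = 2.5
```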

Task 3: Use the model to evaluate sample tweets¶

In this part of the exercise, you will do a "reality check" and get to test the model on some example tweets.

We will first load the model, then create a function that runs the model on a dataset to get the results.
Since this exercise is not about deep learning, you don't need to understand the code below.
Execute the cell then move to the activity just after.

In [33]:
# Import the libraries we need
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from threadpoolctl import ThreadpoolController, threadpool_limits 
controller = ThreadpoolController()

# Optimization: use a GPU if one is available, and limit the number of threads to avoid losing performance
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.set_num_threads(2)

# Loading our pre-trained model
model = AutoModelForSequenceClassification.from_pretrained("RS-course/model-safety-W1").to(device)

# Loading a tokenizer to preprocess the text
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

# Building a pipeline that combines the tokenizer and the model
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Create a function that generates predictions with the model for an array of content texts
@controller.wrap(limits=2, user_api='blas') # limits the number of threads to 2 to avoid performance loss
def predict(data, classifier): 
    """ Applies the provided classifier to the data and returns a dataframe with the resulting predictions.

    Arguments:
    - data: the data to classify (simple list/array, iterable)
    - classifier: classifying pipeline to use

    Returns:
    - Pandas dataframe with two columns: the texts and the obtained prediction from the model
    """
    items = []
    predictions = []
    
    # for all items in the data
    for item in data:
        # save the item
        items.append(item)
        
        # run the model and save the prediction result
        pred = classifier(item)[0]['label']
        predictions.append('Not offensive' if pred=="LABEL_0" else 'Offensive')

    return pd.DataFrame({'items': items, 'predictions': predictions}) 
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.

Instructions

Add some tweets to the list in the cell below and evaluate whether the model is able to correctly label them or not.

In [34]:
tweets = [
    "Just spilled coffee over my new shirt. Great start to the day!",
    "Apparently, the secret to success is waking up at 4 am. Guess I'm doomed.",
    "Spent two hours in traffic today. Absolutely loved it.",
    # Add more examples here...
]

predict(tweets, classifier)
Out[34]:
items predictions
0 Just spilled coffee over my new shirt. Great start to the day! Not offensive
1 Apparently, the secret to success is waking up at 4 am. Guess I'm doomed. Not offensive
2 Spent two hours in traffic today. Absolutely loved it. Not offensive

Reflection time !

What is your conclusion from this "reality check"? In which cases have you found the model to fail?

Feedback - Click on the "..." below only once you have really tried to answer the question!

You should be able to observe that the model is quite sensitive to specific terms, which means that tweets containing these terms will be flagged as offensive even if the terms are used in a positive way. The model also tends to fail on subtle formulations.
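
To make this reality check more systematic, you can write down the label you expect for each tweet and list the cases where the model disagrees with you. The helper below is a minimal sketch: the name find_disagreements is made up for this illustration, and it assumes a dataframe with the 'items' and 'predictions' columns returned by predict. To keep the example self-contained, the dataframe is rebuilt by hand from the output shown in Out[34] instead of calling the model.

```python
import pandas as pd

def find_disagreements(results, expected):
    """Return the rows where the model's prediction differs from your own label.

    Arguments:
    - results: dataframe with 'items' and 'predictions' columns (same format as the output of predict)
    - expected: list of your own labels ('Offensive' or 'Not offensive'), one per row of results
    """
    if len(expected) != len(results):
        raise ValueError("expected must contain exactly one label per row of results")
    checked = results.copy()
    checked["expected"] = list(expected)
    return checked[checked["predictions"] != checked["expected"]]

# Predictions copied from Out[34] above
results = pd.DataFrame({
    "items": [
        "Just spilled coffee over my new shirt. Great start to the day!",
        "Apparently, the secret to success is waking up at 4 am. Guess I'm doomed.",
        "Spent two hours in traffic today. Absolutely loved it.",
    ],
    "predictions": ["Not offensive", "Not offensive", "Not offensive"],
})

# Your own judgement for each tweet, in the same order
my_labels = ["Not offensive", "Not offensive", "Not offensive"]

disagreements = find_disagreements(results, my_labels)
print(len(disagreements))  # 0: the model agrees with these labels on all three tweets
```

In your own notebook, you could pass the output of predict(tweets, classifier) as results, and the rows returned are the tweets worth a closer look.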

Synthesis¶

It is now time to step back and reflect on the implications of what you have discovered in this notebook!

Instructions

To summarize what you have done, we suggest that you review this notebook and:

  • list the different methods you have used to filter content
  • list the different methods you have used to measure the quality of your filters

Feedback - Click on the "..." below only once you have really tried to answer the question!

Methods you have used to filter content:

  • keyword matching
  • sentiment analysis
  • lexicon
  • deep learning model

Methods you have used to measure the quality of your filters:

  • test on an example dataset
  • compute the accuracy
  • generate the confusion matrix and look at the False Positives and False Negatives
  • compute the False Positive Rate and the False Negative Rate

Then answer these final reflection questions and check your reasoning.

Final reflection time !

Q1) What are the implications of a toxic content filter predicting a lot of false positives?

Q2) What are the implications of a toxic content filter predicting a lot of false negatives?

Q3) What should be prioritized when implementing a content moderation algorithm?

  • Reduce the toxic content
  • Respect freedom of speech
  • None

Bonus) To train our model that detects offensive content, we have used a dataset that includes labels that represent the "ground truth". What type of issues could there be with these labels ?

Feedback - Click on the "..." below only once you have really tried to answer the question!

Q1) By predicting a lot of false positives, the filter flags innocent content. The platform will probably not respect the freedom of speech of its users, and this is a safety issue.

Q2) By predicting a lot of false negatives, the filter misses toxic content. The platform can potentially be flooded with toxic content, and this is a safety issue.

Q3) None, the balance between the two is key !

Bonus) Obtaining the "ground truth" requires data labelling, which is usually done by human annotators. In the context of offensive content, it means that some humans had to read thousands of tweets ranging from very neutral to very offensive (racist or homophobic content, for instance), with possibly disastrous psychological consequences for them. This is also a safety issue, which concerns an indirect type of stakeholder of the content moderation system.

Conclusion¶

Congratulations! You have finished this notebook!
Now it is time to watch the videos from the MOOC to further your understanding of safety issues with software and with content moderation in particular.