Safety Week 2: Content recommendation¶

How to use this notebook

Simply read the text and follow the instructions.
This notebook contains code cells, which can be modified and must be executed to see the result of their content.
To execute a cell, select it and click on the play button (▶) in the toolbar, or press Shift + Enter or Ctrl + Enter.

As the variables contained in a cell are stored in memory, the order of execution of the cells is important!

Notebook by Eugène Bergeron, Cécile Hardebolle and the Responsible software team (2025).

Except where otherwise noted, the content of this notebook is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).


Introduction¶

You are a software engineer working on Memeverse, an application where people can share and explore memes. Memes are cultural ideas, jokes, or trends that rely on humor, relatability, or commentary on current events, which make them spread rapidly online. They can take various forms such as images, videos, or short texts. Memeverse is specialized in images.

Memeverse includes a smart recommendation system that personalizes the app content for each user.
A recommendation system is an algorithm that selects items (in our case images) from a large dataset to present to each user. Such systems are now everywhere in our lives, especially on social media. They are usually designed to maximize user satisfaction (or the time users spend on the app).

It is your job to implement the recommendation algorithm for Memeverse.
You work closely with the ethics department in order to avoid potential risks with your implementation.

Learning goals

What will be covered

  • How to simulate the effects of content recommendation algorithms at large scale
  • 3 different recommendation algorithms:
    • Random pick
    • Advertisement aware
    • Preference based
  • Vulnerability of recommendation algorithms to actions from malicious actors

By the end of the session you will be able to:

  • ✅ Implement basic recommendation systems for meme recommendation and see them in action
  • ✅ Analyze the ethical issues involved in different types of recommendation systems
  • ✅ Identify the impact of malicious users on content recommendation

Simulating how users react to content recommendation¶

In this notebook, we will use a simulation to explore the large-scale impacts of the different recommendation systems you will develop. As individuals, we all have personal interests and preferences regarding the content we like, which might evolve over time for countless reasons. The content we are exposed to can be one of these reasons, affecting our preferences positively or negatively. The simulation aims to make visible how the preferences of users evolve globally when they are exposed to recommended items, in our case memes.

The overall principles of the simulation and the modelling choices are explained here. If this is not enough, you can find a more comprehensive explanation at the end of this notebook, or look directly at the code in the file src/Simulation.py.

Instructions

Execute this cell to import the simulation library.

In [1]:
from src.simulation import *
from src.tests import *

Modelling choices¶

In our simulation, both items to recommend (in our case, memes) and users are characterized by 5 attributes which take values in $[-1, 1]$: humour, sarcasm, offensive, motivational and overall sentiment.

For the items, these attributes represent the sentiment they convey.
The values have been derived from the Memotion 7k dataset by converting the original text-based attributes into numerical values (e.g. "hilarious" for humour has been converted to 1, "not_funny" to -1, etc.).

For the users, these values correspond to their preferences, i.e. what type of memes they prefer. If they prefer sarcastic memes, the value "sarcasm" in their preferences will be close to 1. A value of 0 means they are indifferent to it.

The memes dataset¶

Memes are available in the memes pandas DataFrame (imported together with the simulation library previously).

Instructions

Execute this cell to have a look at the content of the memes dataframe.

In [2]:
print(f"Number of memes in the dataset: {memes.shape[0]}")
memes.head()
Number of memes in the dataset: 6993
Out[2]:
              humour  sarcasm  offensive  motivational  overall_sentiment        author
image_name
image_0.png      0.0     -1.0       -1.0             1                1.0  Celebrations
image_1.jpg      1.0      1.0       -1.0            -1                1.0           Bob
image_2.jpeg    -1.0      1.0       -1.0             1                1.0      GigaChad
image_3.JPG      0.5     -1.0       -1.0            -1                0.5        Walter
image_4.png      0.5     -0.5        0.5             1                0.5          Kurt

Each meme is indexed by its image file name (image_name) and has 5 attributes describing the sentiment it conveys, each ranging from -1 (very negative) to 1 (very positive). For example, an overall_sentiment of 1 means that people might feel good when looking at the meme, while a humour score of -1 means that the meme is really not funny. A small disclaimer: we are not responsible for how funny the memes were rated, and we disagree with some of the scores.
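
To make the indexing concrete, here is a small sketch of how a row can be reached either by its integer position or by its index label. It uses a tiny hand-made dataframe (toy_memes, a name chosen for this example) that copies a few values from the output above, not the real memes dataframe.

```python
import pandas as pd

# Toy stand-in for the memes dataframe (values copied from the first rows shown above)
toy_memes = pd.DataFrame(
    {
        "humour": [0.0, 1.0, -1.0],
        "sarcasm": [-1.0, 1.0, 1.0],
        "overall_sentiment": [1.0, 1.0, 1.0],
        "author": ["Celebrations", "Bob", "GigaChad"],
    },
    index=pd.Index(["image_0.png", "image_1.jpg", "image_2.jpeg"], name="image_name"),
)

# Access by integer position: .name gives back the index label of the row
second_row = toy_memes.iloc[1]
print(second_row.name)  # image_1.jpg

# Access by index label (row label, then column name)
print(toy_memes.loc["image_1.jpg", "author"])  # Bob
```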

You can visualize a meme using the show_meme function, which takes the file path of the image as an argument:

In [3]:
# this cell may take some time to run the first time
show_meme(memes.iloc[5000].name)
[Output: image of the selected meme]

Disclaimer

The memes of the dataset do not reflect the views, opinion or humour of the course team. Viewer discretion is advised as some memes may be shocking, disturbing or contain mature themes.
If at some point you want to stop seeing memes, feel free to remove all occurrences of the show_meme function.

The user dataset¶

Users are available in the dataframe init_users. This frame must not be modified! If you need to modify a user, first make a copy of the frame with init_users.copy(). If you made a mistake, just re-run the cell below to regenerate the user dataframe.
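
As an illustration of why the copy matters, the sketch below modifies a copy of a small dataframe and checks that the original is left untouched. The dataframe toy_users and its values are made up for this example; the same pattern applies to init_users.

```python
import pandas as pd

# Toy stand-in for init_users (names and values made up for this example)
toy_users = pd.DataFrame({"name": ["Ana", "Ben"], "humour": [0.2, -0.7]})

# Work on a copy so that the original frame stays untouched
modified_users = toy_users.copy()
modified_users.loc[0, "humour"] = 1.0

print(toy_users.loc[0, "humour"])       # 0.2: the original is unchanged
print(modified_users.loc[0, "humour"])  # 1.0
```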

Instructions

Execute this cell to have a look at the content of the init_users dataframe.

In [4]:
nb_users = 5
# Generate users
init_users = generate_users(nb_users)
# Display 5 lines
init_users.head()
Out[4]:
              name     humour    sarcasm  offensive  motivational  overall_sentiment
0          Quentin   0.273923  -0.460427  -0.918053     -0.966945           0.626540
1   Xx_D4rkL0rd_xX   0.213272   0.458993   0.087250      0.870145           0.631707
2          Charlie  -0.994523   0.714809  -0.932829      0.459311          -0.648689
3           Walter   0.082922  -0.400576  -0.154626     -0.943361          -0.751433
4  An Unnamed cell   0.341249   0.294379   0.230770     -0.232645           0.994420

Now let's implement some recommendation algorithms and use the simulation to look at their effects!

1. A trivial recommendation algorithm: random pick¶

As you just have been hired, the first recommendation system that you code simply returns a random selection of memes.

Random sampling in a dataframe¶

Let's first see how you can randomly select a meme among the dataset with the function sample.
Note that you can press Shift+Tab on a method while coding to see its documentation, or type the function name followed by ?? (e.g. function??) and run the cell.

In [5]:
# select a random item in the memes dataframe
meme = memes.sample(1) # returns a dataframe even if there is only 1 row in it!

# display the meme
show_meme(meme.iloc[0].name)
[Output: image of the selected meme]
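
Note that sample draws different rows on every call. When you want a reproducible draw, for instance while debugging, one option is its random_state parameter. The sketch below uses a toy dataframe (toy_items, made up for this example) rather than memes, and the seed value is arbitrary.

```python
import pandas as pd

toy_items = pd.DataFrame(
    {"humour": [0.0, 1.0, -1.0, 0.5, -0.5]},
    index=["a", "b", "c", "d", "e"],
)

# Same seed, same draw: both samples contain the same rows in the same order
first_draw = toy_items.sample(2, random_state=42)
second_draw = toy_items.sample(2, random_state=42)
print(first_draw.equals(second_draw))  # True

# sample draws without replacement by default, so no row appears twice
print(first_draw.index.is_unique)  # True
```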

Recommending random items¶

And now it is time for you to implement the recommendation algorithm!

The central piece of the algorithm is the selector function. It takes 3 arguments:

  • the items from which it must select a subset of items to recommend,
  • one user, for whom the selection is made,
  • k, the number of items that should be returned.

The set of items returned by the function is called a slate: these are the items presented to the user when they start the app. They must be returned as a pandas DataFrame.
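
To make this contract concrete, here is a minimal sketch of a selector with the expected signature, together with a small check on the slate it returns. Both selector_first_k and check_slate are hypothetical names introduced for this illustration (they are not part of the simulation library), and the selector uses a deliberately trivial strategy.

```python
import pandas as pd

def selector_first_k(items, user, k):
    """Trivial selector: always recommends the first k items, whoever the user is."""
    return items.head(k)

def check_slate(slate, k):
    """Return True if the slate is a DataFrame with exactly k rows."""
    return isinstance(slate, pd.DataFrame) and len(slate) == k

toy_items = pd.DataFrame(
    {"humour": [0.0, 1.0, -1.0, 0.5]},
    index=["a", "b", "c", "d"],
)
toy_user = pd.Series({"name": "Ana", "humour": 0.3})

slate = selector_first_k(toy_items, toy_user, 2)
print(check_slate(slate, 2))              # True
print(check_slate(toy_items.iloc[0], 1))  # False: a single row taken with iloc is a Series
```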

Instructions

Complete the selector_random function so that:

  • It returns a dataframe of k random items

Note: for this implementation we do not personalize the content to the user and so the user argument is not really needed, but we will use it in more complex versions.

In [6]:
def selector_random(items, user, k):
    """
    :param items: dataframe with all items from which to select
    :param user: user for whom to select the items
    :param k: number of items to select
    :return: (DataFrame) a slate of k items drawn randomly from the dataset of items, regardless of the user.
    """
    ### YOUR CODE HERE
    slate = items.sample(k) # SOLUTION
    ###
    
    return slate

# tests to check if your function is correct
test(selector_random)
🆗 Tests passed ! =)

Let's have a look at the result of your selection for one user (the first one in our list of users) with 3 recommended items:

In [7]:
selector_random(memes, init_users.iloc[0], 3)
Out[7]:
                humour  sarcasm  offensive  motivational  overall_sentiment   author
image_name
image_3962.jpg     0.5     -1.0       -1.0             1                0.5  Quentin
image_5798.jpg    -0.5      1.0       -1.0            -1                0.5     Judy
image_5947.png    -1.0      1.0        0.5             1                0.5     Ivan

Now let's use the simulation to see how your recommendation system affects all the users of the app more globally.

The function simulate_and_render, implemented in Simulation.py, simulates a recommendation system and displays its effects. You do not need to understand how it works in order to complete the exercises.

Instructions

Execute the following cell to see the simulation results.

In [8]:
init_users = generate_users(30)
cols = ["humour", "sarcasm"]  # you can modify the columns to plot in other axis (2 columns necessary)
simulate_and_render(init_users, memes, selector_random, col=cols, nb_steps=30, custom=custom_items_plotting(memes, cols, nb_items = 100))
[Output: simulation plot of the users' preferences, humour vs. sarcasm]

How to interpret the results?

Below is the output of a simulation with 5 users:
[Image: simulation output explanation]

The grey points are the items (memes); their positions are determined by their attributes. One grey point can represent several items (note that some points are darker than others). For example, there are more memes with (humour, sarcasm) = (0.5, 1.0) than with (humour, sarcasm) = (1.0, 1.0). The paths represent the evolution of the users' preferences, with the cross indicating a user's preferences at the end of the simulation. One can see, for example, that the user with the green path enjoys humorous content more and more.


You can run the cell several times to see that no user is bound to converge to a particular point in the graph and that their walk can be very chaotic, which is expected when the items are picked fully at random.
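
To get an intuition for why random recommendations produce such unpredictable paths, here is a toy model of a single preference value evolving over time. It is only a sketch under strong assumptions: the update rule (the user moves a small step towards the average of the slate), the step size of 0.1 and the uniformly random items are all choices made for this example, and the actual simulation in the library may work differently.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so that the run is reproducible

def toy_random_walk(initial_pref, nb_steps=30, k=3, step_size=0.1):
    """Track one attribute of one user exposed to k random items per step (toy model)."""
    pref = initial_pref
    path = [pref]
    for _ in range(nb_steps):
        slate = rng.uniform(-1, 1, size=k)               # k random item values in [-1, 1]
        pref = pref + step_size * (slate.mean() - pref)  # assumed update rule
        pref = float(np.clip(pref, -1, 1))               # preferences stay in [-1, 1]
        path.append(pref)
    return path

path = toy_random_walk(initial_pref=0.8)
print(len(path))                           # 31 values: the start plus one per step
print(min(path) >= -1 and max(path) <= 1)  # True
```

Each step depends on a fresh random slate, so runs with different seeds give different paths.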

Reflection time!

Could there be negative effects of recommending random content to users?

Feedback - Click on the "..." below only once you have really tried to answer the question!

One risk when recommending content randomly is that harmful content (e.g. violent, shocking or misleading) may be presented to users if such content is present in the dataset (e.g. when the platform accepts content from users).
This is why content moderation systems need to be put in place in combination with recommendation algorithms.

Congratulations! You just made your first recommendation algorithm!
The marketing department, however, is not really happy about it, as it attracts neither users nor brands...

2. Recommendation with advertisement¶

The brand "Celebrations" has contacted Memeverse to make some custom advertisement.
You come to an agreement on how to do it: you will systematically put the item corresponding to their product in 2nd position of the slate (does this remind you of some real-life recommendation systems?).
The rest of the recommendation stays random.

Concatenating dataframes¶

Before we start implementing the recommendation algorithm, we need to introduce you to pd.concat.
It is a function that lets you stack several DataFrames row-wise.

Here is an example of how concat works:

In [9]:
# Let's generate some fake dataframes
first_df = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']], index=[0, 1, 2])
second_df = pd.DataFrame([['x', 'y', 'z']], index=[3])
third_df = pd.DataFrame([[10, 10, 10]], index=[4])

# And look at the result of the concatenation
concatenation = pd.concat([first_df, second_df, third_df]) # WATCH OUT, concat takes a **list** of dataframes i.e. the [ and ] are needed
display(concatenation)
    0   1   2
0   a   b   c
1   d   e   f
2   g   h   i
3   x   y   z
4  10  10  10

Selecting rows in dataframes¶

You will also need to select rows in your dataframe. For this you can use the iloc indexer, which allows you to access rows (and columns) of a dataframe using their integer position.

Here are two different ways to use iloc that can be useful for this exercise (there are many more ways to use it, see the documentation):

  • df.iloc[x] will select one row, at integer position x
  • df.iloc[start:stop] will select the range of rows which starts at integer position start included and stops at integer position stop excluded (i.e. the last row selected has position stop-1).
    The start and stop parameters are optional, and so [:stop] will give you the first rows until position stop excluded, and [start:] will give you the last rows starting at start position included.
    This syntax is called a slice, and it is a very powerful tool to select elements in a dataframe.

Execute the cell below to see the results (feel free to modify the code):

In [10]:
print("first row:")
print(concatenation.iloc[0])

print("\nfirst two rows:")
print(concatenation.iloc[:2])

print("\nrows with positions [1;3[:")
print(concatenation.iloc[1:3])
first row:
0    a
1    b
2    c
Name: 0, dtype: object

first two rows:
   0  1  2
0  a  b  c
1  d  e  f

rows with positions [1;3[:
   0  1  2
1  d  e  f
2  g  h  i

⚠️ WATCH OUT: When you select only one row using iloc, the result is a Series, not a dataframe!⚠️
To create a dataframe from a Series, use pd.DataFrame(...) as illustrated below:

In [11]:
# select one row, we get a Series
first_row = concatenation.iloc[0]
print(type(first_row))
print(first_row, "\n")

# create a Dataframe with the Series
# /!\ If we don't pass the Series in a list (i.e. between []), it will be a column instead of a row
df_first_row = pd.DataFrame([first_row])
print(type(df_first_row))
print(df_first_row)
<class 'pandas.core.series.Series'>
0    a
1    b
2    c
Name: 0, dtype: object 

<class 'pandas.core.frame.DataFrame'>
   0  1  2
0  a  b  c

For more details, check the documentation of iloc and this good tutorial on dataframe slicing.

Including an ad in the recommended items¶

Now that we have seen how to stack and select rows in DataFrames, here is the item to promote:

[Image: the item to promote]

It is available in the item_to_promote variable:

In [12]:
item_to_promote = memes.iloc[0]
print(type(item_to_promote))
print(item_to_promote)
<class 'pandas.core.series.Series'>
humour                        0.0
sarcasm                      -1.0
offensive                    -1.0
motivational                    1
overall_sentiment             1.0
author               Celebrations
Name: image_0.png, dtype: object

Instructions

Complete the function selector_advertisement so that:

  • It returns a data frame that consists of:
    • k-1 random items
    • the item_to_promote, placed at the 2nd position of the slate

Note that item_to_promote does not need to be passed as an argument to the function, as its value has already been saved in memory by the previous cell!

In [13]:
def selector_advertisement(items, user, k):
    """
    :param items: dataframe with all items from which to select
    :param user: user for whom to select the items
    :param k: number of items to select
    :return: a slate consisting of the item to promote (in 2nd position) together with k-1 random items. Order matters!
    """
    ### YOUR CODE HERE
    # first let's draw k-1 random items
    random_items = items.sample(k-1) #SOLUTION
    
    # first dataframe: has to be one of the random items, will be the first row in the result (has to be in a dataframe!)
    first_df = pd.DataFrame([random_items.iloc[0]]) # SOLUTION
    
    # second dataframe: include the item to promote (has to be in a dataframe!)
    second_df = pd.DataFrame([item_to_promote]) # SOLUTION
    
    # third dataframe: remaining random items
    third_df = random_items.iloc[1:] # SOLUTION

    # concatenate all dataframes
    slate = pd.concat([first_df, second_df, third_df]) # SOLUTION
    ###
    
    return slate

# tests to check if your function is correct
test(selector_advertisement)
🆗 Tests passed ! =)
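
The solution above assumes that k is at least 2, and it may occasionally draw the promoted item a second time among the random ones. As a sketch only, here is one possible variant that handles these two cases. The function name selector_advertisement_safe, the extra promoted parameter and the toy data are assumptions introduced for this example; the notebook's tests expect the three-argument version.

```python
import pandas as pd

def selector_advertisement_safe(items, user, k, promoted):
    """Slate of k items with `promoted` (a Series) in 2nd position when k >= 2."""
    promoted_df = pd.DataFrame([promoted])
    if k <= 1:
        # Only one slot available: assumption made here, the ad takes it
        return promoted_df
    # Draw the random items among everything except the promoted item
    candidates = items.drop(index=promoted.name, errors="ignore")
    random_items = candidates.sample(k - 1)
    # Slices keep DataFrames, so the three parts can be concatenated directly
    return pd.concat([random_items.iloc[:1], promoted_df, random_items.iloc[1:]])

toy_items = pd.DataFrame(
    {"humour": [0.0, 1.0, -1.0, 0.5], "author": ["Celebrations", "Bob", "Kurt", "Judy"]},
    index=["image_0.png", "image_1.jpg", "image_2.jpeg", "image_3.JPG"],
)
toy_ad = toy_items.iloc[0]

slate = selector_advertisement_safe(toy_items, None, 3, toy_ad)
print(slate.index[1])         # image_0.png
print(slate.index.is_unique)  # True
```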

Let's have a look at the result of your selection for one user with 3 recommended items:

In [14]:
selector_advertisement(memes, init_users.iloc[0], 3)
Out[14]:
                 humour  sarcasm  offensive  motivational  overall_sentiment        author
image_4347.jpeg     0.5     -0.5       -1.0            -1                0.5       Charlie
image_0.png         0.0     -1.0       -1.0             1                1.0  Celebrations
image_4951.jpg     -0.5      1.0        0.5            -1                1.0           Ted

Now let's see what the global effects of this recommendation algorithm are on all users of Memeverse.

Instructions

Execute the following cell to see the simulation results.

In [15]:
cols = ["offensive", "overall_sentiment"]
init_users = generate_users(30)
simulate_and_render(init_users, memes, selector=selector_advertisement, col=cols, nb_steps=30, custom=plot_item_to_promote(item_to_promote, cols))
[Output: simulation plot of the users' preferences, offensive vs. overall_sentiment]

Reflection time!

Analysis of the result:
Before answering, feel free to re-run the previous simulation by changing the parameters (features, number of steps, number of users,...)

  • Can you guess what the blue star represents?
  • Do you think the results are satisfying for the brand?
  • (optional) Try plotting according to other columns. Do you get the same results? What could be an explanation for very different results?

Reflection questions:

  • This time the request came from a candy brand. What can be the negative effects of advertising for candy?
  • What if other types of brands ask for the same solution?
  • How would users of Memeverse react to this implementation?

Feedback - Click on the "..." below only once you have really tried to answer the question!

Analysis:

  • The blue star represents the item_to_promote.
  • The preferences of users converge quite clearly to align with the properties of the Celebrations meme (overall sentiment of +1 and offensive attribute of -1). So Celebrations might be super happy with the results of the deal.
  • However, the results seem to vary depending on which features one visualizes in the simulation. For example, if we plot with respect to humour and motivational, we see much more random walks. One explanation is that the dataset of memes is unbalanced: users get to see more memes that are opposed to the item_to_promote on these two attributes, so the effect of the advertisement is counterbalanced by these opposite memes.

Reflection questions:

  • The negative effect could be the overconsumption of candy, which could have harmful impacts for people dealing with health issues for instance.
  • If the item being promoted contained false information or was political in nature, it could have considerable potential for manipulation.
  • Currently our implementation does not make transparent to the user that one of the memes is actually an ad, and users could lose trust in the platform.

Note

In reality, these effects are much weaker: the simulation amplifies them compared to their real-life impact.
But can you name one or more brands that are omnipresent in the advertising world?
For example, we see countless advertisements for Coca-Cola in every medium: the brand tries to be everywhere and to associate its image with good moments.

3. Preference-based recommendation¶

As more and more people use the application and more and more items are added, you are asked to refine your algorithm so that it takes the preferences of the user into account.

Distance to preferences¶

One way to do this is to recommend the items that are closest to a user's preferences, which requires computing the Euclidean distance between the items' characteristics and the user's preferences.
We provide you with a function to do it.

The higher the distance between the items' characteristics and a user's preferences (i.e. the more different they are), the greater the value returned by the function.
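
Written out for a user $u$ and a meme $m$ described by the five attributes (the symbols $u_i$ and $m_i$ are notation introduced here for the $i$-th attribute of each), the distance is:

$$d(u, m) = \sqrt{\sum_{i=1}^{5} (u_i - m_i)^2}$$

For example, for a user who is indifferent to everything, $u = (0, 0, 0, 0, 0)$, and the meme image_0.png shown earlier, $m = (0, -1, -1, 1, 1)$, we get $d(u, m) = \sqrt{0 + 1 + 1 + 1 + 1} = 2$. Since every attribute lies in $[-1, 1]$, the distance can never exceed $\sqrt{5 \cdot 2^2} = 2\sqrt{5} \approx 4.47$.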

Instructions

Execute the following cell to define the function.

In [16]:
# We provide this function that computes the euclidean distance between each row of a table (that can contain the items) and a vector (that can be a user).
def dist(table, vec):
    """
    Compute the euclidean distance between all rows of a DataFrame and a Series (or a vector).
    The provided DataFrame and Series must contain only numerical values! No string!
    """
    # converting whatever is given to numpy in order to be able to apply linalg.norm
    rows_ = table.to_numpy(dtype=np.float64)
    vec_ = vec.to_numpy(dtype=np.float64)
    return np.linalg.norm(rows_ - vec_, axis=1)

Since this function cannot work with columns that contain non-numerical values, we provide you with an array-like variable called categories that you can use to select only numerical columns in the memes and user dataframes:

In [17]:
print(categories)
memes[categories].head()
Index(['humour', 'sarcasm', 'offensive', 'motivational', 'overall_sentiment'], dtype='object')
Out[17]:
              humour  sarcasm  offensive  motivational  overall_sentiment
image_name
image_0.png      0.0     -1.0       -1.0             1                1.0
image_1.jpg      1.0      1.0       -1.0            -1                1.0
image_2.jpeg    -1.0      1.0       -1.0             1                1.0
image_3.JPG      0.5     -1.0       -1.0            -1                0.5
image_4.png      0.5     -0.5        0.5             1                0.5

Sorting dataframes¶

In order to get the closest item to a user's preferences, you will need to be able to sort the dataframe of items by the distance to the user's preferences.
The example below demonstrates how to sort the dataframe memes with respect to the humour feature using the function sort_values and its parameter ascending.
Let's see what's the funniest meme in the list!

In [18]:
# sort the memes dataframe on the "humour" feature by decreasing value
sorted_by_humour = memes[categories].sort_values("humour", ascending = False) # WATCH OUT! sort_values does not modify the dataframe but returns a sorted version instead

# display the top lines of the result (the humour values should be 1.0 for all of the rows)
display(sorted_by_humour.head())

# display the first meme in the list, supposedly the funniest
print("One of the supposedly funniest memes in the list:")
show_meme(sorted_by_humour.iloc[0].name)
                humour  sarcasm  offensive  motivational  overall_sentiment
image_name
image_1.jpg        1.0      1.0       -1.0            -1                1.0
image_5.png        1.0      0.5        0.5            -1                0.0
image_6970.jpg     1.0      0.5       -0.5             1                0.5
image_6.jpg        1.0      1.0       -0.5             1               -0.5
image_2915.jpg     1.0      1.0       -1.0            -1                0.5
One of the supposedly funniest memes in the list:
[Output: image of the selected meme]

Recommending items based on preferences¶

Now let's apply what we have learned to build a recommendation algorithm which takes user preferences into account.

Instructions

Complete the function selector_preferences so that:

  • It returns the k items closest to the user's preferences.

Note: remember that the dist function works only with columns that contain numerical values, you can use the list of numerical columns called categories to select only the appropriate columns.

In [19]:
def selector_preferences(items, user, k):
    """
    :param items: dataframe with all items from which to select
    :param user: user for whom to select the items
    :param k: number of items to select
    :return: a slate which takes the items that best suit the user's preferences
    """
    ### YOUR CODE HERE
    # put the distance between the user and the item in the "dist_with_user" column of the dataframe
    items["dist_with_user"] = dist(items[categories], user[categories]) # SOLUTION

    # sort the items by increasing distance
    items = items.sort_values("dist_with_user") # SOLUTION

    # return the k items closest to the user
    slate = items.head(k) # SOLUTION
    ###
    
    return slate

# tests to check if your function is correct
test(selector_preferences)
🆗 Tests passed ! =)
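
Note that assigning a new column to items inside the function also adds that column to the dataframe passed in by the caller. If you prefer to leave the caller's dataframe untouched, one possible alternative is sketched below. The name selector_preferences_no_side_effect, the local copy of dist and the two-attribute toy data are assumptions made to keep this example self-contained. It relies on DataFrame.assign, which returns a new dataframe, and on nsmallest, which combines sorting and taking the first k rows.

```python
import numpy as np
import pandas as pd

categories = ["humour", "sarcasm"]  # toy version with two attributes only

def dist(table, vec):
    """Euclidean distance between each row of a DataFrame and a Series."""
    rows_ = table.to_numpy(dtype=np.float64)
    vec_ = vec.to_numpy(dtype=np.float64)
    return np.linalg.norm(rows_ - vec_, axis=1)

def selector_preferences_no_side_effect(items, user, k):
    """Return the k items closest to the user without modifying `items`."""
    # assign returns a new dataframe, so the caller's dataframe keeps its columns
    with_dist = items.assign(dist_with_user=dist(items[categories], user[categories]))
    return with_dist.nsmallest(k, "dist_with_user")

toy_items = pd.DataFrame(
    {"humour": [1.0, 0.5, -1.0], "sarcasm": [1.0, 0.5, -1.0], "author": ["Bob", "Kurt", "Judy"]},
    index=["a", "b", "c"],
)
toy_user = pd.Series({"name": "Ana", "humour": 0.6, "sarcasm": 0.4})

slate = selector_preferences_no_side_effect(toy_items, toy_user, 2)
print(list(slate.index))                      # ['b', 'a']
print("dist_with_user" in toy_items.columns)  # False
```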

Let's have a look at the result of your selection for one user with 3 recommended items:

In [20]:
selector_preferences(memes, init_users.iloc[0], 3)
Out[20]:
                humour  sarcasm  offensive  motivational  overall_sentiment   author  dist_with_user
image_name
image_1979.png     0.5     -0.5       -0.5             1                1.0  Charlie         0.92945
image_6629.jpg     0.5     -0.5       -0.5             1                1.0     Niaj         0.92945
image_3246.jpg     0.5     -0.5       -0.5             1                1.0    Sybil         0.92945

Now let's see what the global effects of this recommendation algorithm are on all users of Memeverse.

Instructions

Execute the following cell to see the simulation results.

In [21]:
init_users = generate_users(30)
cols=["offensive", "motivational"]
simulate_and_render(init_users, memes, selector=selector_preferences, col=cols, custom=custom_items_plotting(memes, cols, nb_items = 100), nb_steps=30)
[Output: simulation plot of the users' preferences, offensive vs. motivational]

Reflection time!

Analysis:

  • Comment on the results.

Reflection:

  • Let's step back. Your recommendation system only gives users content that corresponds to their preferences. What could be some negative consequences of this?

Feedback - Click on the "..." below only once you have really tried to answer the question!

Analysis feedback

  • One can see that the users' preferences are being amplified, positively as well as negatively. Over a larger number of steps we could imagine having two different groups of users with completely different preferences.
  • The algorithm tends to separate people that were close in preferences and create poles of preferences around the items.

Reflection feedback

  • The amplification of user preferences can lead to problematic effects when the nature of the recommended content has potential for harm, for instance depressing content, addictive content, content related to harmful health advice or diet/sport tendencies, and of course misinformation (e.g. conspiracies) or politically related content (e.g. radical views).
  • As users are progressively separated into groups of similar preferences, such an algorithm can amplify partisan divides and reduce trust across groups.

A common concern in the context of recommendation systems is the idea of 'filter bubbles', whereby users are mainly exposed to content that matches their existing views. This has the potential to both reinforce people's views and separate them into groups of opposed views. Research has shown that such bubbles can form on certain platforms. However, studies also highlight the active role of users, who often seek out content that matches their opinions themselves, beyond the recommendation algorithm.

The users are very satisfied and you get great reviews!
While this is awesome for the marketing department, the ethics department doesn't like it very much...
You reach a compromise to solve this problem by placing two completely random items at the top of the list in 50% of cases. This forces users to be more exposed to diversity.

Increasing the variety of recommendations¶

Instructions

Complete the function selector_preferences_with_50p_random such that:

  • It returns the k items closest to the user's preferences
  • In 50% of the cases, 2 random items replace the top ones.

Note:

  • You can reuse the code of selector_preferences to compute the distance and sort the values.
  • To implement a 50% probability, you can make a random draw with the method np.random.randint.
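
As a quick sanity check of this idea, the sketch below draws 0 or 1 with np.random.randint (whose upper bound is exclusive) and treats 0 as "inject random items". Over many draws, the branch should be taken about half of the time. The helper name use_random_items is made up for this example.

```python
import numpy as np

np.random.seed(0)  # fixed seed, only to make this check reproducible

def use_random_items():
    """Return True in roughly 50% of the calls."""
    # randint(0, 2) returns 0 or 1 with equal probability (2 is excluded)
    return np.random.randint(0, 2) == 0

nb_draws = 10_000
frequency = sum(use_random_items() for _ in range(nb_draws)) / nb_draws
print(0.45 < frequency < 0.55)  # True
```
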
In [22]:
def selector_preferences_with_50p_random(items, user, k):
    """
    :param items: dataframe with all items from which to select
    :param user: user for whom to select the items
    :param k: number of items to select
    :return: a slate that takes the items that best suit the user's preferences
    and, 50% of the time, two random items in the first two places
    """
    ### YOUR CODE HERE
    # compute the distance between the user and each item
    items["dist_with_user"] = dist(items[categories], user[categories]) # SOLUTION
    
    # sort the items by distance
    items = items.sort_values("dist_with_user") # SOLUTION

    # in 50% of the cases, replace the first two items at the top with a random selection
    if np.random.randint(0, 100) < 50:
        # keep the first k ordered items except the first two (this branch assumes k >= 2)
        remaining_ordered_items = items.iloc[2:k]

        # select two random items from the rest of the dataset (i.e. after the k first ones)
        random_items = items.iloc[k:].sample(2)

        # build the slate
        slate = pd.concat([random_items, remaining_ordered_items])
    else:
        # select the first k items from the ordered list
        slate = items.iloc[:k]
    ###
    
    return slate

Let's have a look at the result of your selection for one user with 3 recommended items:

In [23]:
selector_preferences_with_50p_random(memes, init_users.iloc[0], 3)
Out[23]:
                humour  sarcasm  offensive  motivational  overall_sentiment   author  dist_with_user
image_name
image_6080.png     1.0     -0.5        1.0            -1                0.5    Yohan        0.913798
image_3823.jpg    -0.5      1.0       -1.0            -1               -0.5     Judy        2.536853
image_6585.jpg     1.0      0.5        0.5            -1                0.5  Ludovic        0.732143

Now let's see what the global effects of this recommendation algorithm are on all users of Memeverse.

Instructions

Execute the following cell to see the simulation results.

In [24]:
init_users = generate_users(30)
cols=["offensive", "motivational"]
simulate_and_render(init_users, memes, selector=selector_preferences_with_50p_random, col=cols, custom=custom_items_plotting(memes, cols, nb_items = 100), nb_steps=30)
[Output: simulation plot of the users' preferences, offensive vs. motivational]

Reflection time!

Analysis:

Is it better than before? Why?

Reflection:

Do you see other strategies that could be used to diversify the content recommended to users? Could there be negative effects of strategies that increase the variety of recommendations?

Feedback - Click on the "..." below only once you have really tried to answer the question!

Analysis feedback

While it does not fix the problem, we can see that some users have converged to different points than before and are more spread around the items. This means that someone who previously would only see motivational content will now get to see some hilarious memes!

Reflection feedback

Another strategy for increasing diversity can be to systematically include an item that is very different from what the user is used to. You can try this (or other ideas) by yourself!
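
As a starting point for such an experiment, here is a sketch of this strategy: the slate contains the k-1 items closest to the user, preceded by the single item that is farthest away. Everything in it is an assumption made for illustration (the function name selector_preferences_with_opposite, the position of the distant item, the two-attribute toy data), and it is not a tested solution for the notebook.

```python
import numpy as np
import pandas as pd

categories = ["humour", "sarcasm"]  # toy version with two attributes only

def selector_preferences_with_opposite(items, user, k):
    """k-1 closest items, preceded by the item farthest from the user's preferences.

    Assumes 1 <= k < len(items), so that the farthest item is not also among the closest.
    """
    distances = np.linalg.norm(
        items[categories].to_numpy(dtype=np.float64)
        - user[categories].to_numpy(dtype=np.float64),
        axis=1,
    )
    ordered = items.assign(dist_with_user=distances).sort_values("dist_with_user")
    farthest = ordered.iloc[-1:]    # a slice, so the result stays a DataFrame
    closest = ordered.iloc[:k - 1]
    return pd.concat([farthest, closest])

toy_items = pd.DataFrame(
    {"humour": [1.0, 0.5, -1.0, 0.0], "sarcasm": [1.0, 0.5, -1.0, 0.5]},
    index=["a", "b", "c", "d"],
)
toy_user = pd.Series({"humour": 0.6, "sarcasm": 0.4})

slate = selector_preferences_with_opposite(toy_items, toy_user, 3)
print(list(slate.index))  # ['c', 'b', 'd']
```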

Many platforms do deliberately inject variety (so-called diversity heuristics) to avoid exposing users to overly narrow content. For example, YouTube and TikTok sometimes surface content "outside" a user’s usual cluster to increase engagement or novelty (and keep the user on the platform).
However, research has also shown that exposing users to content that is opposed to their views can sometimes backfire. For instance, it can lead to an exacerbation of social conflict online when the content is violent, negative and inflammatory.
So we have to be cautious with this strategy!

With this experiment, you realize the responsibility that comes with designing recommendation systems: you have the power to influence what users see without them realizing it, which may affect their interests, but also have a larger impact on their health, their opinions, etc.

Overall, predicting the results at large scale of changes in recommendation systems is extremely challenging due to the socio-technical nature of social media platforms.
When the content being amplified has political implications, these dynamics can influence public debate and even the health of democracies. We will explore this issue in more depth in the videos.

Note

This exercise deviates from reality in two ways:

  • We know the users' preferences perfectly and in real time, whereas in reality they must be approximated, and they evolve with the users' choices.
  • In our simulation, recommending content that satisfies a user's preferences has a very high probability of having a positive impact; we do not take into account the possibility that a user might become bored of seeing a certain type of content.

4. Nowhere to be safe?

In this exercise we will now explore some unexpected behaviors that might arise with recommendation algorithms.

Instructions

Execute the following cell to set up the simulation for this exercise.

In [25]:
# Run this cell before you start the next exercise. It won't work otherwise
memes, init_users = obfuscation(memes, init_users, lambda x: (x**3) * 27, chr, 115)
init_users = generate_users(20)

Back to random pick

As you have taken the wonderful course Responsible Software, you decide that it is better to avoid certain effects that you will see in the coming videos. You tell this to your boss and the ethics department and you all agree that an acceptable compromise is to use only the random recommendation system for the moment.
This works well, the adoption of Memeverse increases and users can now add their memes to the database.

Unexpected effects

Unfortunately, after some time you start to get complaints from users who very often see the same type of content. They accuse Memeverse of lying about its neutrality.

You decide to investigate and your first reaction is to go check the measured preferences of the users.

Instructions

Execute the following cell to see the results.

In [26]:
simulate_and_render(init_users, memes, selector=selector_random, col=["humour", "offensive"], nb_steps=100)

That was unexpected! It has nothing in common with the first time you used that recommendation algorithm...
Now it is your turn to discover what is creating this mess.

Investigating the issue

You suspect that a bad actor might be the source of the issue (check Safety 1 if the term does not ring a bell).
After a thorough audit of the system, your security experts all confirm that there has not been any security breach and that the code has not been tampered with.
However, you realize that users can actually manipulate your recommendation algorithm without even touching the code...

Instructions

Try to discover who is responsible for this behavior. You can use the cell below to print, display, simulate... Do whatever you think is necessary!

💡Here are some ideas to try out:

  • Look at the memes that are now hosted on Memeverse, do you see any pattern?
  • How many memes have been authored by each of the different authors in the dataset? The method value_counts could be a great help for this analysis.

You don't have to read or modify the Simulation.py code in order to find the culprit! Look at the datasets available to you.

In [27]:
# BEGIN SOLUTION NO PROMPT
# Let's display a sample of our list of memes and look at the authors
print(memes.sample(15)) # A first hint that something is wrong: the same image appears many times

# Let's count the number of memes by each author
print(memes["author"].value_counts()) # Here it is clear that Mallory doesn't play by the same rules.

# Another case solved!
# END SOLUTION
""" # BEGIN PROMPT
# Do whatever is necessary here detective (⌐■_■)


"""; # END PROMPT
                humour  sarcasm  offensive  motivational  overall_sentiment           author  dist_with_user
"troll.jpeg"       0.0      0.0        0.0           0.0                0.0          Mallory             NaN
"troll.jpeg"       0.0      0.0        0.0           0.0                0.0          Mallory             NaN
"troll.jpeg"       0.0      0.0        0.0           0.0                0.0          Mallory             NaN
"troll.jpeg"       0.0      0.0        0.0           0.0                0.0          Mallory             NaN
image_6530.jpg     1.0      1.0       -0.5           1.0                1.0  An Unnamed cell        2.382942
"troll.jpeg"       0.0      0.0        0.0           0.0                0.0          Mallory             NaN
"troll.jpeg"       0.0      0.0        0.0           0.0                0.0          Mallory             NaN
"troll.jpeg"       0.0      0.0        0.0           0.0                0.0          Mallory             NaN
"troll.jpeg"       0.0      0.0        0.0           0.0                0.0          Mallory             NaN
"troll.jpeg"       0.0      0.0        0.0           0.0                0.0          Mallory             NaN
"troll.jpeg"       0.0      0.0        0.0           0.0                0.0          Mallory             NaN
"troll.jpeg"       0.0      0.0        0.0           0.0                0.0          Mallory             NaN
"troll.jpeg"       0.0      0.0        0.0           0.0                0.0          Mallory             NaN
"troll.jpeg"       0.0      0.0        0.0           0.0                0.0          Mallory             NaN
"troll.jpeg"       0.0      0.0        0.0           0.0                0.0          Mallory             NaN
author
Mallory            27000
Yohan                276
Walter               271
Xx_D4rkL0rd_xX       269
Mike                 268
Olivia               264
Franck               261
Quentin              256
Ted                  255
Charlie              254
Niaj                 253
Ludovic              253
Bob                  250
Sybil                249
Leander              249
Zakarias             248
Ivan                 246
Chad                 245
Judy                 244
Vanna                243
Alice                242
Roman                242
Heidi                242
Eve                  239
David                239
Kurt                 237
GigaChad             236
Pat                  234
An Unnamed cell      227
Celebrations           1
Name: count, dtype: int64

Instructions

Put the name of the responsible author in the cell below to check your answer!

In [28]:
answer = "Mallory" # SOLUTION
print("This is not the correct answer..." if not verification(answer) else "You got it!")
You got it!

Once you have the answer...

Click on the "..."

Analysis of the problem

What did she do to manipulate all users?

  • She spammed the app with the same image
  • She modified all images
  • She altered the selector to put her images in priority

Feedback - Click on the "..." below only once you have really tried to answer the question!

She spammed Memeverse with the same image 27 000 times. So when the selector chose a meme at random, there was a very high chance that it would select Mallory's image.
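To put a number on "very high chance", here is a small back-of-the-envelope calculation. It uses the counts printed by `value_counts()` above (27,000 copies for Mallory; the other authors' counts add up to 6,993) and assumes the selector draws rows uniformly at random. The slate size of 5 is an assumption made for illustration.

```python
from math import comb

mallory_copies = 27_000   # Mallory's count in the value_counts() output
other_memes = 6_993       # sum of all the other authors' counts
total = mallory_copies + other_memes

# Chance that a single uniform draw lands on Mallory's image
p_single = mallory_copies / total

# Chance that a slate of k distinct rows contains at least one copy
k = 5  # assumed slate size
p_slate = 1 - comb(other_memes, k) / comb(total, k)

print(f"{p_single:.1%} per draw, {p_slate:.2%} per slate of {k}")
```

Roughly four out of five random draws return the troll image, and under these assumptions almost every slate contains at least one copy.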

Reflection time!

What could be the dangers of this kind of manipulation of the application?

Feedback - Click on the "..." below only once you have really tried to answer the question!

Imagine if false information or political content were spread massively in a similar way... We will see in the videos the effects that exposure to false information at large scale can have and why.


Note

This kind of algorithm manipulation is common on Facebook, Google, Instagram, YouTube, etc. These companies have addressed the issue with fairly good success (e.g. Google Panda), but their countermeasures are sometimes outsmarted by malicious users.
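As an illustration only, one very simple countermeasure against this particular attack is to remove duplicate images before sampling at random. The sketch below runs on a toy DataFrame invented for this example; it is not how the platforms mentioned above work, and a spammer could defeat it by slightly altering each copy.

```python
import pandas as pd

# Toy dataset: the index is the image name, as in the memes DataFrame
memes_toy = pd.DataFrame(
    {"author": ["Mallory"] * 8 + ["Alice", "Bob"]},
    index=["troll.jpeg"] * 8 + ["a.jpg", "b.jpg"],
)

# Keep only the first occurrence of each image name
unique_memes = memes_toy[~memes_toy.index.duplicated(keep="first")]

# Share of rows authored by Mallory, before and after deduplication
share_before = (memes_toy["author"] == "Mallory").mean()
share_after = (unique_memes["author"] == "Mallory").mean()
print(f"before: {share_before:.0%}, after: {share_after:.0%}")
```

After deduplication, a uniform random pick gives each distinct image the same chance again, whoever uploaded it.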

[Optional] Implement your own recommendation algorithm

If you wish to try a more complex recommendation system or experiment with the framework, this is your time to shine!
Note that this exercise is completely optional, feel free to skip this section and jump to the last exercise of the notebook.

Instructions

Execute the following cell to reset the memes dataset.

In [29]:
# To revert the changes
memes = memes.drop('"troll.jpeg"')
memes.shape
Out[29]:
(6993, 7)

Here are some interesting ideas that might represent a coding challenge:

  • Recommend items based on the user's neighborhood (memes seen by other users with similar preferences)
  • Force a user to see things they don't like once they have converged
  • Make a user see every type of content before letting them converge
  • Try to find the best compromise between variety and preference satisfaction

Remember to press Shift+Tab to see the documentation of a function.

In [30]:
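def selector(items, user, k):
    # Only the first return statement is executed: keep the selector you want
    # to test on the first line, or replace it with your own.
    return selector_random(items, user, k)
    return selector_preferences_with_50p_random(items, user, k)
    return selector_advertisement(items, user, k)
    return selector_preferences(items, user, k)
    return ... # your own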

# attributes to visualize (x and y): "humour", "sarcasm", "offensive", "motivational", "overall_sentiment"
col = ["humour", "sarcasm"]
nb_steps = 100 # number of steps of the simulation

simulate_and_render(init_users, memes, selector=selector, col=col, nb_steps=nb_steps)

Synthesis

It is now time to step back and reflect on the implications of what you have discovered in this notebook!

Final reflection time!

List the different ethical issues that you have seen with recommendation algorithms while working for Memeverse.
Try to do it from memory, as this will help you memorize the concepts for longer. But if you need a bit of help, go back to review the content of the notebook.

Feedback - Click on the "..." below only once you have really tried to answer the question!

Random recommendation can lead to accidental exposure to harmful content.

Recommending content from a brand:

  • The product advertised could have negative impacts on people (e.g. on their health), which would be multiplied by the larger exposure.
  • The type of ad (e.g. politically related) and the type of brand (e.g. its values) can aggravate this issue and may even lead to mass manipulation.
  • The users need transparency in terms of which content is an ad and which is not.

Recommending content based on user preferences:

  • Exposing users only to content that matches their preferences might put them in a 'filter bubble', which has the potential to reinforce their views while separating users into groups of opposite views.
  • Increasing the diversity of the content that users see may backfire when the content is violent, negative and inflammatory.

Bad actors can influence the content that users see without tampering with the code, even with the random recommendation algorithm.


References

Check these research papers to learn more:

  • Practical Diversified Recommendations on YouTube with Determinantal Point Processes, Wilhelm et al., 2018.
  • Focusing on the Long-term: It's Good for Users and Business, Hohnhold et al., 2015.
  • Experimental evidence of massive-scale emotional contagion through social networks, Kramer et al., 2014.
  • The Welfare Effects of Social Media, Allcott et al., 2020.
  • Do Recommender Systems Manipulate Consumer Preferences? A Study of Anchoring Effects, Adomavicius et al., 2013.
  • The search engine manipulation effect (SEME) and its possible impact on the outcomes of elections, Epstein et al., 2015.

More explanations on the simulation

At each step, a user is presented with a slate (selection) of items, from which they choose one that will influence their preferences.

Choice of the item: Following this paper (Yao et al., 2021), the choice of an item within the slate follows a Beta distribution over the rank of the items, with parameters $\alpha = 0.5$ and $\beta = 3$.
Put in simpler terms: the user selects higher-ranked items in the slate more often than lower-ranked ones. For 5 items, this leads to the selection of the 1st item 75% of the time, the 2nd item 17% of the time, the 3rd 6%, the 4th < 2% and the 5th < 1%. We think this reflects fairly well how we use recommendation systems (Instagram, YouTube, etc.).
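The rank-based choice can be sketched as follows. This is one plausible reading, not necessarily what Simulation.py does: we assume a Beta(0.5, 3) sample in [0, 1] is mapped to a rank by cutting the interval into k equal-width bins. The helper name `sample_rank` is hypothetical. Under this assumption the estimated shares land within a few percentage points of the figures quoted above.

```python
import random
from collections import Counter

def sample_rank(rng, k, alpha=0.5, beta=3.0):
    """Draw a 0-based rank in [0, k) from a Beta(alpha, beta) sample."""
    x = rng.betavariate(alpha, beta)   # value in [0, 1], skewed towards 0
    return min(int(x * k), k - 1)      # assumed mapping: k equal-width bins

rng = random.Random(0)                 # fixed seed for reproducibility
k, n = 5, 100_000
counts = Counter(sample_rank(rng, k) for _ in range(n))

# Empirical share of choices for each rank, best-ranked first
shares = [counts[r] / n for r in range(k)]
for rank, share in enumerate(shares, start=1):
    print(f"rank {rank}: {share:.1%}")
```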

Dynamics of the preferences: This paper (Curmei et al., 2022) presents 2 psychological effects that have been observed in the context of recommendation systems. These effects have been the subject of studies and meta-studies that attest to their existence, so we concluded they were a solid basis for our simulation.
The 2 effects are the Mere-Exposure effect and the Operant conditioning effect.

  • Mere-Exposure: The more often you are exposed to something, the more likely you are to like it.
  • Operant conditioning: Past experiences and preferences positively or negatively influence how receptive you are to the presentation of an item.

Drawing on the above paper, we describe the dynamics of the preferences with the following formula: $$\pi_{t+1} - \pi_t = \gamma(0.5 + \pi_t \cdot \nu_t + \epsilon)(\nu_t-\pi_t)$$ With:

  • $\pi_t$ : User preferences at time $t$ of the simulation
  • $\nu_t$ : Sentiments conveyed by the item
  • $\gamma$ : Speed of change of the preferences. It determines how much users are influenced by the item they see. For this simulation we set this value to 0.025
  • $\epsilon$ : Random noise ($\epsilon \sim \mathcal{N}(0, 0.15)$) that accounts for unpredictable reactions of the user
  • $0.5 + \pi_t \cdot \nu_t$ : The $0.5$ accounts for the Mere-Exposure effect (no matter what happens, if you see the item you will be positively influenced by it) and $\pi_t \cdot \nu_t$ (scalar product) accounts for the Operant conditioning effect (if a preference is opposed to the item, the scalar product will be negative, lowering or even inverting the Mere-Exposure effect)
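A single step of this update rule can be sketched with NumPy as shown below. The function name `update_preferences` is hypothetical, and two points are assumptions rather than facts about Simulation.py: the noise is treated as one scalar per step, and 0.15 is read as its standard deviation.

```python
import numpy as np

def update_preferences(pi, nu, gamma=0.025, eps=0.0):
    """Apply one step of pi_{t+1} = pi_t + gamma * (0.5 + pi_t . nu_t + eps) * (nu_t - pi_t)."""
    pi = np.asarray(pi, dtype=float)
    nu = np.asarray(nu, dtype=float)
    # Mere-Exposure (0.5) + Operant conditioning (scalar product) + noise
    influence = 0.5 + np.dot(pi, nu) + eps
    return pi + gamma * influence * (nu - pi)

# Without noise: an item aligned with the user's preferences pulls them closer
aligned = update_preferences([0.5, 0.0], [1.0, 0.0])

# Without noise: a strongly opposed item makes the factor negative and pushes them away
opposed = update_preferences([-1.0, 0.0], [1.0, 0.0])

# With noise drawn as N(0, 0.15), assuming 0.15 is the standard deviation
rng = np.random.default_rng(0)
noisy = update_preferences([0.5, 0.0], [1.0, 0.0], eps=rng.normal(0.0, 0.15))

print(aligned, opposed, noisy)
```

In the opposed case the scalar product is -1, so the factor becomes -0.5 and the preference moves away from the item, which is the "inverting" behavior described in the last bullet.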