Reinforcement Learning - UCB and Thompson Sampling
Multi-armed bandit problem
In probability theory, the multi-armed bandit problem is one in which a fixed, limited set of resources must be allocated between competing (alternative) choices in a way that maximizes the expected gain, when each choice's properties are only partially known at the time of allocation and may become better understood as time passes or as resources are allocated to that choice. This is a classic reinforcement learning problem that exemplifies the exploration-exploitation tradeoff dilemma.
Imagine that you are in a casino with 5 slot machines. You have to decide which machine to play, how many times to play each machine, in which order to play them, and whether to continue with the current machine or try a different one, all in order to maximize your profit or win percentage.
In this problem, each machine provides a random reward drawn from a probability distribution specific to that machine. Your objective is to maximize the sum of rewards earned through a sequence of lever pulls.
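To make this concrete, here is a minimal sketch of such a casino. The win probabilities below are made up purely for illustration and are hidden from the player, which is exactly what creates the exploration-exploitation dilemma:

import random

# Hypothetical win probabilities for 5 slot machines (made up for illustration).
# The player does not know these values and can only learn them by pulling levers.
true_probs = [0.10, 0.25, 0.15, 0.30, 0.05]

def pull(machine):
    # Reward of 1 with the machine's unknown probability, else 0
    return 1 if random.random() < true_probs[machine] else 0

# Pulling each lever a few times gives only a noisy picture of which machine is best
for machine in range(len(true_probs)):
    rewards = [pull(machine) for _ in range(10)]
    print(f"Machine {machine}: observed rewards {rewards}")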
Ok, so now you have some intuition about the multi-armed bandit problem. Let's proceed to the actual problem statement that we are focusing on in this tutorial.
Problem Statement
The marketing team of XYZ Company has prepared 10 different ads, and we have collected a dataset of click-through rates for all of those ads. Looking at the dataset, we have to find which ad has the highest probability of being clicked by a new user.
We will solve this problem using two different algorithms: the first is Upper Confidence Bound (UCB), which is a deterministic approach, and the second is Thompson Sampling, a probabilistic approach that uses probability distributions to determine which ad has the highest probability of being clicked by any user.
Upper Confidence Bound
Rather than performing exploration by simply selecting an arbitrary action with a constant probability, the UCB algorithm changes its exploration-exploitation balance as it gathers more knowledge of the environment. It moves from being primarily focused on exploration, when actions that have been tried the least are preferred, to concentrating on exploitation, selecting the action with the highest estimated reward.
Random exploration gives us an opportunity to try out options that we do not know much about. However, due to the randomness, it is possible we end up re-exploring an action that we have already confirmed to be bad in the past (bad luck!). For contrast, the sketch below shows what that kind of constant-probability random exploration looks like.
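A minimal sketch of constant-probability random exploration (commonly called an epsilon-greedy strategy; the epsilon value and reward estimates here are made up purely for illustration, and this is what UCB improves upon):

import random

epsilon = 0.1                            # made-up exploration probability
estimated_rewards = [0.02, 0.05, 0.03]   # made-up current average reward per action

if random.random() < epsilon:
    # explore: pick any action at random, even one already known to be bad
    action = random.randrange(len(estimated_rewards))
else:
    # exploit: pick the action with the highest current estimate
    action = estimated_rewards.index(max(estimated_rewards))
print("chosen action:", action)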
Feeling confused? Don't worry, once we start writing the code, it will all make sense.
Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Load Dataset
df = pd.read_csv('/content/Ads_CTR_Optimisation.csv')
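Each row of the dataset corresponds to one user and each of the 10 columns to one ad, with a 1 meaning that user clicks the ad and a 0 meaning they do not. A quick peek (assuming the file loaded correctly) confirms the shape:

# Each row is one user, each column one ad; values are 0/1 click rewards
print(df.shape)    # expected: (10000, 10)
print(df.head())   # first few rows of rewards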
Implementing UCB
There are 3 steps to implement the UCB algorithm; we will go through them step by step.
Our dataset has the click-through rates of 10,000 users for 10 different ads. So, let N be the number of users, i.e. N = 10000, and d be the number of ads, i.e. d = 10.
We will also create a list to accumulate the ad selected at the end of each iteration and a variable to accumulate the total reward.
# Creating variables
N = 10000            # total number of users to show ads to
d = 10               # number of ads
selected_ads = []    # list of the ad selected at each round, starts empty
Ni = [0] * d         # number of times ad i was selected up to round n
Ri = [0] * d         # sum of rewards of ad i up to round n
total_rewards = 0    # total reward accumulated over all rounds
Step 1. At each round n, we consider two numbers for each ad i:
- Ni(n) - the number of times the ad i was selected up to round n,
- Ri(n) - the sum of rewards of the ad i up to round n.
Step 2. From these two numbers we compute:
- the average reward of ad i up to round n: r̄i(n) = Ri(n) / Ni(n),
- the confidence interval at round n: Δi(n) = sqrt( (3/2) * log(n) / Ni(n) ),
- the upper confidence bound: r̄i(n) + Δi(n).
Step 3. We select the ad i that has the maximum UCB
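As a quick worked example with made-up numbers: suppose at round n = 100 ad i has been shown Ni(n) = 20 times and collected Ri(n) = 5 clicks. Then the average reward is 5/20 = 0.25, the confidence interval is sqrt(1.5 * log(100) / 20) ≈ 0.59, and the upper bound is about 0.84. The sketch below just reproduces that arithmetic:

import math

# Made-up numbers purely to illustrate the three quantities from Step 2
n, Ni_example, Ri_example = 100, 20, 5

average_reward = Ri_example / Ni_example                        # 0.25
conf_interval = math.sqrt((3 / 2) * math.log(n) / Ni_example)   # ~0.59
upper_bound = average_reward + conf_interval                    # ~0.84
print(average_reward, conf_interval, upper_bound)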
import math

for n in range(0, N):                     # iterate through all users (rounds)
    ad = 0
    max_upper_bound = 0
    for i in range(0, d):                 # iterate through all ads
        if Ni[i] == 0:
            # ad i has never been selected: use a huge value (1e500 overflows to
            # infinity) so that every ad gets selected at least once
            upper_bound = 1e500
        else:
            average_reward = Ri[i] / Ni[i]   # average reward of ad i so far
            # confidence interval delta_i(n)
            conf_interval = math.sqrt((3 / 2) * (math.log(n + 1) / Ni[i]))
            upper_bound = average_reward + conf_interval
        if upper_bound > max_upper_bound:
            max_upper_bound = upper_bound
            ad = i
    selected_ads.append(ad)               # record the ad with the maximum upper bound
    Ni[ad] += 1                           # increase the number of times this ad was selected
    reward = df.values[n, ad]             # get the reward for the selected ad from the dataset
    Ri[ad] += reward                      # add the reward to this ad's sum of rewards
    total_rewards += reward               # add the reward to the total reward
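After the loop finishes, a quick check of the variables we already accumulated tells us which ad UCB converged on and how much reward it collected:

# The most frequently selected ad is the one UCB converged on
best_ad = Ni.index(max(Ni))
print("Most selected ad:", best_ad)
print("Times each ad was selected:", Ni)
print("Total reward collected:", total_rewards)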
Visualizing the results
plt.hist(selected_ads)
plt.title('Histogram of ads selections')
plt.xlabel('Ads')
plt.ylabel('Number of times each ad was selected')
plt.show()
Most Rewarded Ad
Congratulations!!! You have successfully implemented the UCB algorithm. Looking at the plot, you can tell the company that ad number 4 has the highest likelihood of being clicked.
Moving on, we will now implement the Thompson Sampling algorithm on the same problem and check whether it can do better than UCB.
Thompson Sampling Algorithm
The basic idea behind Thompson Sampling is to assume a simple prior distribution on the parameters of the reward distribution of every ad, and to choose ads based on samples from that distribution. At each time step, the algorithm draws a sample from the posterior distribution of each ad, selects the best ad according to the samples, and updates the distribution based on the observed reward. Thompson Sampling is a randomised probability-matching algorithm, which makes it robust to being trapped by an early bad decision, and it has been empirically shown to be effective in many problems. It can also be directly applied to solve linear bandit problems.
Implementing Thompson Sampling Algorithm
We will use the same dataset for this implementation. The whole implementation is very similar to the UCB algorithm, with very few changes. There are 3 steps to implement the Thompson Sampling algorithm as well; we will go through them step by step. Our dataset has the click-through rates of 10,000 users for 10 different ads. So, let N be the number of users, i.e. N = 10000, and d be the number of ads, i.e. d = 10.
We will also create a list to accumulate the ad selected at the end of each iteration and a variable to accumulate the total reward.
# Creating variables
N = 10000            # total number of users to show ads to
d = 10               # number of ads
selected_ads = []    # list of the ad selected at each round, starts empty
Ni_0 = [0] * d       # number of times ad i received reward 0
Ni_1 = [0] * d       # number of times ad i received reward 1
total_rewards = 0    # total reward accumulated over all rounds
Step 1. At each round n, we consider two numbers for each ad i:
- Ni1(n) - the number of times the ad i got reward 1 up to round n,
- Ni0(n) - the number of times the ad i got reward 0 up to round n.
Step 2. For each ad i, we take a random draw θi(n) from the Beta distribution below, which is the posterior on the ad's click probability given the rewards observed so far:
θi(n) ~ Beta(Ni1(n) + 1, Ni0(n) + 1)
Step 3. We select the ad i that has the highest θi(n).
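To build some intuition before the full loop, here is a minimal sketch with made-up counts showing how an ad with more observed clicks tends to produce larger draws and therefore gets selected more often:

import random

# Made-up counts purely for illustration: ad A has 8 clicks out of 20 views, ad B has 2 out of 20
theta_A = random.betavariate(8 + 1, 12 + 1)   # Beta(Ni_1 + 1, Ni_0 + 1) for ad A
theta_B = random.betavariate(2 + 1, 18 + 1)   # Beta(Ni_1 + 1, Ni_0 + 1) for ad B
print(theta_A, theta_B)  # theta_A will usually (not always) be larger, so ad A is usually chosen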
import random

for n in range(0, N):                     # iterate through all users (rounds)
    ad = 0
    max_theta = 0
    for i in range(0, d):                 # iterate through all ads
        # random.betavariate(alpha, beta) draws from the Beta distribution;
        # alpha > 0 and beta > 0, and the returned value lies between 0 and 1
        random_theta = random.betavariate(Ni_1[i] + 1, Ni_0[i] + 1)
        if random_theta > max_theta:
            max_theta = random_theta
            ad = i
    selected_ads.append(ad)               # record the ad with the highest draw
    reward = df.values[n, ad]             # get the reward for the selected ad from the dataset
    # if the reward is 0, update Ni_0, otherwise update Ni_1
    if reward == 0:
        Ni_0[ad] += 1
    else:
        Ni_1[ad] += 1
    total_rewards += reward               # add the reward to the total reward
Visualizing the results
plt.hist(selected_ads)
plt.title('Histogram of ads selections')
plt.xlabel('Ads')
plt.ylabel('Number of times each ad was selected')
plt.show()
Clearly, Thompson Sampling has done a better job than UCB. Congratulations!!! You have now implemented two reinforcement learning algorithms, and I hope I have done a good job of explaining all the concepts.
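If you want to back that comparison up with a number rather than the histogram alone, print the accumulated total_rewards after this run and compare it with the value printed at the end of the UCB run earlier (a minimal sketch; total_rewards here is the value just computed by the Thompson Sampling loop):

# Compare this with the total reward printed after the UCB loop above
print("Thompson Sampling total reward:", total_rewards)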