Reinforcement Learning - UCB and Thompson Sampling
Multi-armed bandit problem
In probability theory, the multi-armed bandit problem is one in which a fixed, limited set of resources must be allocated between competing (alternative) choices in a way that maximizes the expected gain, when each choice's properties are only partially known at the time of allocation and may become better understood as time passes or as resources are allocated to that choice. This is a classic reinforcement learning problem that exemplifies the exploration-exploitation tradeoff dilemma.
Imagine that you are in a casino with 5 slot machines. You have to decide which machine to play, how many times to play each machine, in which order to play them, and whether to continue with the current machine or try a different one, all in order to maximize your profit or win percentage.
In this problem, each machine provides a random reward drawn from a probability distribution specific to that machine. Your objective is to maximize the sum of rewards earned through a sequence of lever pulls.
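To make this concrete, here is a minimal sketch of such a casino. The win probabilities below are made up purely for illustration and are hidden from the player, which is exactly what creates the exploration-exploitation dilemma:

import random

# Hypothetical win probabilities for 5 slot machines (made up for illustration).
# The player does not know these values and can only learn them by pulling levers.
true_probs = [0.10, 0.25, 0.15, 0.30, 0.05]

def pull(machine):
    # Reward of 1 with the machine's unknown probability, else 0
    return 1 if random.random() < true_probs[machine] else 0

# Pulling each lever a few times gives only a noisy picture of which machine is best
for machine in range(len(true_probs)):
    rewards = [pull(machine) for _ in range(10)]
    print(f"Machine {machine}: observed rewards {rewards}")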
Ok, so now you have some intuition about the multi-armed bandit problem. Let's proceed to the actual problem statement that we are focusing on in this tutorial.
Problem Statement
The marketing team of XYZ Company has prepared 10 different ads, and we have collected a dataset of click-through rates for all of those ads. Looking at the dataset, we have to find which ad has the highest probability of being clicked by a new user.
We will solve this problem using two different algorithms: the first is Upper Confidence Bound (UCB), which is a deterministic approach, and the second is Thompson Sampling, a probabilistic approach that uses probability distributions to determine which ad has the highest probability of being clicked by any user.
Upper Confidence Bound
Rather than performing exploration by simply selecting an arbitrary action with a constant probability, the UCB algorithm changes its exploration-exploitation balance as it gathers more knowledge of the environment. It moves from being primarily focused on exploration, when actions that have been tried the least are preferred, to concentrating on exploitation, selecting the action with the highest estimated reward.
Random exploration gives us an opportunity to try out options that we do not know much about. However, due to the randomness, it is possible we end up re-exploring an action that we have already confirmed to be bad in the past (bad luck!). For contrast, the sketch below shows what that kind of constant-probability random exploration looks like.
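A minimal sketch of constant-probability random exploration (commonly called an epsilon-greedy strategy; the epsilon value and reward estimates here are made up purely for illustration, and this is what UCB improves upon):

import random

epsilon = 0.1                            # made-up exploration probability
estimated_rewards = [0.02, 0.05, 0.03]   # made-up current average reward per action

if random.random() < epsilon:
    # explore: pick any action at random, even one already known to be bad
    action = random.randrange(len(estimated_rewards))
else:
    # exploit: pick the action with the highest current estimate
    action = estimated_rewards.index(max(estimated_rewards))
print("chosen action:", action)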
Feeling confused? Don't worry, once we start writing the code, it will all make sense.
Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Load Dataset
df = pd.read_csv('/content/Ads_CTR_Optimisation.csv')
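Each row of the dataset corresponds to one user and each of the 10 columns to one ad, with a 1 meaning that user clicks the ad and a 0 meaning they do not. A quick peek (assuming the file loaded correctly) confirms the shape:

# Each row is one user, each column one ad; values are 0/1 click rewards
print(df.shape)    # expected: (10000, 10)
print(df.head())   # first few rows of rewards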
Implementing UCB
There are 3 steps to implement the UCB algorithm; we will go through them step by step.
Our dataset has the click-through rates of 10,000 users for 10 different ads. So, let N be the number of users, i.e. N = 10000, and d be the number of ads, i.e. d = 10.
We will also create a list to accumulate the ad selected at the end of each iteration and a variable to accumulate the total reward.
# Creating variables
N = 10000            # total number of users to show ads to
d = 10               # number of ads
selected_ads = []    # list of the ad selected at each round, starts empty
Ni = [0] * d         # number of times ad i was selected up to round n
Ri = [0] * d         # sum of rewards of ad i up to round n
total_rewards = 0    # total reward accumulated over all rounds
Step 1. At each round n, we consider two numbers for each ad i:
- Ni(n) - the number of times the ad i was selected up to round n,
- Ri(n) - the sum of rewards of the ad i up to round n.
Step 2. From these two numbers we compute:
- the average reward of ad i up to round n: r̄i(n) = Ri(n) / Ni(n),
- the confidence interval at round n: Δi(n) = sqrt( (3/2) * log(n) / Ni(n) ),
- the upper confidence bound: r̄i(n) + Δi(n).
Step 3. We select the ad i that has the maximum UCB
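As a quick worked example with made-up numbers: suppose at round n = 100 ad i has been shown Ni(n) = 20 times and collected Ri(n) = 5 clicks. Then the average reward is 5/20 = 0.25, the confidence interval is sqrt(1.5 * log(100) / 20) ≈ 0.59, and the upper bound is about 0.84. The sketch below just reproduces that arithmetic:

import math

# Made-up numbers purely to illustrate the three quantities from Step 2
n, Ni_example, Ri_example = 100, 20, 5

average_reward = Ri_example / Ni_example                        # 0.25
conf_interval = math.sqrt((3 / 2) * math.log(n) / Ni_example)   # ~0.59
upper_bound = average_reward + conf_interval                    # ~0.84
print(average_reward, conf_interval, upper_bound)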
import math

for n in range(0, N):                     # iterate through all users (rounds)
    ad = 0
    max_upper_bound = 0
    for i in range(0, d):                 # iterate through all ads
        if Ni[i] == 0:
            # ad i has never been selected: use a huge value (1e500 overflows to
            # infinity) so that every ad gets selected at least once
            upper_bound = 1e500
        else:
            average_reward = Ri[i] / Ni[i]   # average reward of ad i so far
            # confidence interval delta_i(n)
            conf_interval = math.sqrt((3 / 2) * (math.log(n + 1) / Ni[i]))
            upper_bound = average_reward + conf_interval
        if upper_bound > max_upper_bound:
            max_upper_bound = upper_bound
            ad = i
    selected_ads.append(ad)               # record the ad with the maximum upper bound
    Ni[ad] += 1                           # increase the number of times this ad was selected
    reward = df.values[n, ad]             # get the reward for the selected ad from the dataset
    Ri[ad] += reward                      # add the reward to this ad's sum of rewards
    total_rewards += reward               # add the reward to the total reward
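After the loop finishes, a quick check of the variables we already accumulated tells us which ad UCB converged on and how much reward it collected:

# The most frequently selected ad is the one UCB converged on
best_ad = Ni.index(max(Ni))
print("Most selected ad:", best_ad)
print("Times each ad was selected:", Ni)
print("Total reward collected:", total_rewards)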
Visualizing the results
plt.hist(selected_ads)
plt.title('Histogram of ads selections')
plt.xlabel('Ads')
plt.ylabel('Number of times each ad was selected')
plt.show()
Most Rewarded Ad
Congratulations!!! You have successfully implemented the UCB algorithm. Looking at the plot, you can tell the company that ad number 4 has the highest likelihood of being clicked.
Moving on, we will now implement the Thompson Sampling algorithm on the same problem and check whether it can do better than UCB.
Thompson Sampling Algorithm
The basic idea behind Thompson Sampling is to assume a simple prior distribution on the parameters of the reward distribution of every ad, and to choose ads based on samples from that distribution. At each time step, the algorithm draws a sample from the posterior distribution of each ad, selects the best ad according to the samples, and updates the distribution based on the observed reward. Thompson Sampling is a randomised probability-matching algorithm, which makes it robust to being trapped by an early bad decision, and it has been empirically shown to be effective in many problems. It can also be directly applied to solve linear bandit problems.
Implementing Thompson Sampling Algorithm
We will use the same dataset for this implementation. The whole implementation is very similar to the UCB algorithm, with very few changes. There are 3 steps to implement the Thompson Sampling algorithm as well; we will go through them step by step. Our dataset has the click-through rates of 10,000 users for 10 different ads. So, let N be the number of users, i.e. N = 10000, and d be the number of ads, i.e. d = 10.
We will also create a list to accumulate the ad selected at the end of each iteration and a variable to accumulate the total reward.
# Creating variables
N = 10000            # total number of users to show ads to
d = 10               # number of ads
selected_ads = []    # list of the ad selected at each round, starts empty
Ni_0 = [0] * d       # number of times ad i received reward 0
Ni_1 = [0] * d       # number of times ad i received reward 1
total_rewards = 0    # total reward accumulated over all rounds
Step 1. At each round n, we consider two numbers for each ad i:
- Ni1(n) - the number of times the ad i got reward 1 up to round n,
- Ni0(n) - the number of times the ad i got reward 0 up to round n.
Step 2. For each ad i, we take a random draw θi(n) from the Beta distribution below, which is the posterior on the ad's click probability given the rewards observed so far:
θi(n) ~ Beta(Ni1(n) + 1, Ni0(n) + 1)
Step 3. We select the ad i that has the highest θi(n).
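To build some intuition before the full loop, here is a minimal sketch with made-up counts showing how an ad with more observed clicks tends to produce larger draws and therefore gets selected more often:

import random

# Made-up counts purely for illustration: ad A has 8 clicks out of 20 views, ad B has 2 out of 20
theta_A = random.betavariate(8 + 1, 12 + 1)   # Beta(Ni_1 + 1, Ni_0 + 1) for ad A
theta_B = random.betavariate(2 + 1, 18 + 1)   # Beta(Ni_1 + 1, Ni_0 + 1) for ad B
print(theta_A, theta_B)  # theta_A will usually (not always) be larger, so ad A is usually chosen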
import random

for n in range(0, N):                     # iterate through all users (rounds)
    ad = 0
    max_theta = 0
    for i in range(0, d):                 # iterate through all ads
        # random.betavariate(alpha, beta) draws from the Beta distribution;
        # alpha > 0 and beta > 0, and the returned value lies between 0 and 1
        random_theta = random.betavariate(Ni_1[i] + 1, Ni_0[i] + 1)
        if random_theta > max_theta:
            max_theta = random_theta
            ad = i
    selected_ads.append(ad)               # record the ad with the highest draw
    reward = df.values[n, ad]             # get the reward for the selected ad from the dataset
    # if the reward is 0, update Ni_0, otherwise update Ni_1
    if reward == 0:
        Ni_0[ad] += 1
    else:
        Ni_1[ad] += 1
    total_rewards += reward               # add the reward to the total reward
Visualizing the results
plt.hist(selected_ads)
plt.title('Histogram of ads selections')
plt.xlabel('Ads')
plt.ylabel('Number of times each ad was selected')
plt.show()
Clearly, Thompson Sampling has done a better job than UCB. Congratulations!!! You have now implemented two reinforcement learning algorithms, and I hope I have done a good job of explaining all the concepts.
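If you want to back that comparison up with a number rather than the histogram alone, print the accumulated total_rewards after this run and compare it with the value printed at the end of the UCB run earlier (a minimal sketch; total_rewards here is the value just computed by the Thompson Sampling loop):

# Compare this with the total reward printed after the UCB loop above
print("Thompson Sampling total reward:", total_rewards)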