Reinforcement Learning - UCB and Thompson Sampling
Multi-armed bandit problem In probability theory, the multi-armed bandit problem is a problem in which a fixed limited set of resources must be allocated between competing (alternative) choices in a way that maximizes their expected gain, when each choice's properties are only partially known at the time of allocation, and may become better understood as time passes or by allocating resources to the choice.This is a classic reinforcement learning problem that exemplifies the exploration–exploitation tradeoff dilemma. Imagine that you are in a casino and there are 5 slot machines, and you have to decide on which machine to play, how many times to play each machine and in which order to play them, and whether to continue with the current machine or try a different machine in order to maximize the profit or win percentage. In the problem, each machine provides a random reward from a probability distribution specific to that machine. Your objective is to ...