Multi-Armed Bandits

Posted by James Rigby on 8/25/2020

Hi everyone! It has been a minute. I recently took a break from posting here because I have been interning at Amazon. The internship is over, but it left me with a ton of interesting ideas. I wanted to share what I have been thinking about, particularly related to policy evaluation, causality, and experimentation. Efficient and robust experimental design requires a whole new set of methodological techniques that are equally interesting and often more difficult to implement effectively than predictive models.

This post is motivated by a fantastic paper written by Kaibel & Biemann (2019) in Organizational Research Methods (ORM). In it, they challenge a tried-and-true method relied on by I-O psychologists, economists, and operations researchers: the randomized control trial (RCT). They compare RCTs to Bayesian multi-armed bandits, a technique for adaptive experimentation.

When I first read the paper I was fascinated; the approach aligns with some of the optimization methods I have been thinking about. I had to learn more, so I downloaded Sutton and Barto's book on reinforcement learning (here) and dove into the world of multi-armed bandits.

Context

The multi-armed bandit is an intuitive method premised on a mathematically inclined gambler. The gambler is faced with several slot machines (AKA one-armed bandits), each with a different reward function. The gambler wants to come up with a strategy for pulling the levers in a way that maximizes their reward (or minimizes their loss).

This generalizes to many other stochastic processes that involve discrete choices. For example, leaders may be deciding between several leadership training programs that may be differentially effective in increasing follower motivation. In this example, the leader is the gambler and the training programs are the slot machines. Even traditionally continuous problems (e.g., price setting) can be discretized and approached from a bandit framework.

Let’s define the gambler’s objective more formally.

Let $R_1, \ldots, R_K$ define a set of real-valued reward distributions, one for each of the $K$ levers the gambler can select through their actions ($a$). Each lever has an expected reward when selected, $q_*(a)$. The gambler's objective is to solve the following optimization problem:


$$\underset{a}{\operatorname{argmax}} q_*(a)$$
The primary challenge of this problem is that the gambler does not know $q_*(a)$; they have to estimate it using sample statistics.
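A natural estimate is the running sample average of the rewards observed from each lever. After the $n$th reward $R_n$ from lever $a$, the running mean updates incrementally, and this is exactly the update used in the code at the end of this post:

$$Q_n(a) = Q_{n-1}(a) + \frac{1}{n}\big(R_n - Q_{n-1}(a)\big)$$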

Defining Reward Functions

Let's create the reward functions for our gambler's machines. For each machine, both whether it pays out on a given pull and the size of the payout are random.

Bandit 1: 50% chance of about a $10 reward (E(Bandit 1) = $5.00)

Bandit 2: 25% chance of about a $17 reward (E(Bandit 2) = $4.25)

Bandit 3: 10% chance of about a $45 reward (E(Bandit 3) = $4.50)

Bandit 4: 70% chance of about a $6 reward (E(Bandit 4) = $4.20)
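In R, each machine can be written as a function that first flips a biased coin to decide whether the machine pays out and, if it does, draws the payout amount from a normal distribution. Bandit 1 from the code appendix, lightly condensed, looks like this:

bandit_1 <- function(){
  payout <- rbinom(1, 1, .5)          # 50% chance the machine pays out
  if (payout) rnorm(1, 10, 2) else 0  # payout of roughly $10 (sd = 2) when it does
}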

Gambler Option 1: Randomized Control Trial

The classically trained people analyst may view the gambler's dilemma as a perfect opportunity for a randomized control trial (RCT). They may recommend the following (sketched in code just after the steps):

  1. Select a machine at random.
  2. Pull the lever.
  3. Record the cost.
  4. Repeat until target sample size is reached.
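In R, assuming the target_n and bandit_list objects defined in the code appendix, that loop might look like the minimal sketch below (rct_rewards is just illustrative storage; the full simulation in the appendix also tracks running means and variances):

pull_count <- rep(0, times = 4)          # pulls per machine so far
rct_rewards <- vector("list", 4)         # illustrative storage for observed rewards

while (any(pull_count < target_n)) {
  eligible <- which(pull_count < target_n)                                # machines still short of target n
  selected <- if (length(eligible) > 1) sample(eligible, size = 1) else eligible
  reward   <- bandit_list[[selected]]()                                   # pull the lever, record the result
  pull_count[selected] <- pull_count[selected] + 1
  rct_rewards[[selected]] <- c(rct_rewards[[selected]], reward)
}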

Since people analysts are typically students of Cohen, they would determine the ballpark sample size necessary to differentiate between groups. To do this, they would need to estimate the effect size associated with between-machine differences. By observing another, less strategic gambler, they determine that the difference between machines should have a relatively small effect size (Cohen's d = .2). Although the experimenter's goal is to find the machine with the maximum expected value rather than to make formal group comparisons (i.e., hypothesis tests), power analysis still provides useful insight into how reliably the groups can be differentiated.
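The power analysis output shown below comes from the pwr package; these are the same calls listed in the code appendix:

library(pwr)

# look up Cohen's conventional "small" effect size for a t-test
pwr::cohen.ES("t", "small")

# sample size needed per group for d = .2, alpha = .05, power = .8
pwr::pwr.t.test(d = .2,
                sig.level = .05,
                power = .8,
                type = "two.sample",
                alternative = "two.sided")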

## 
##      Conventional effect size from Cohen (1982) 
## 
##            test = t
##            size = small
##     effect.size = 0.2

## 
##      Two-sample t test power calculation 
## 
##               n = 393.4057
##               d = 0.2
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

The power analysis suggests that the experimenter would need roughly 393 trials per machine before they would be able to detect effects 80% of the time, given that the effects exist (and are "small").

Let's run a simulation to see how well the experimenter's strategy works. We will evaluate the experiment's performance by 1) the proportion of times the experimenter selects the correct machine and 2) the average reward attained by the end of the experiment.

Since we know the population parameters, we can see how often the experimenter's conclusions are correct. Machine 1 has the highest expected value. As the plot below shows, the RCT identified the optimal decision policy about 70% of the time.

What about the rewards earned throughout the experimentation process? Across all simulations, what was the average reward per trial? Unsurprisingly, the RCT produces a lovely, approximately normal distribution with a mean of 4.49.

The RCT does a reasonable job of identifying the optimal policy (or at least a near-optimal one), but was there a better way to maximize the reward during the experimentation process? The gambler could have:

1. Adapted the machine selection to exploit existing information throughout the trial.
2. Implemented an exploration strategy to keep gathering new information.
3. Not been satisfied with one-off experiments, and instead continued exploring and exploiting!

Gambler Option 2: Epsilon Strategy

To improve upon the RCT, the gambler could implement a semi-random decision strategy for choosing which machine to play at each time point. The vast majority of the time, the gambler exploits their knowledge and plays the machine with the highest estimated reward. However, a small percentage of the time (in this case 10%), the gambler explores the other machines. We start this process with a 20-pull warm-up in which the first 20 pulls are distributed equally across all machines. Let's see how this compares to the RCT above!
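The heart of the strategy is only a few lines of R. This is a simplified version of the selection step inside the efficient_bandits() helper in the code appendix (select_epsilon() is just an illustrative name):

# with probability eps, explore a random machine; otherwise exploit the current best estimate
select_epsilon <- function(estimated_rewards, eps = .10) {
  if (rbinom(1, 1, eps) == 1) {
    sample(seq_along(estimated_rewards), size = 1)   # explore
  } else {
    which.max(estimated_rewards)                     # exploit
  }
}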

The average reward per trial is higher using the epsilon strategy than with the randomized control trial!

Unlike the RCT, the expected reward for the bandit algorithm changes over time because the experiment adapts as more information comes in. Let's take a look at how the bandit algorithm learns. The plot below shows how the algorithm converges over time to the expected reward of the highest-paying lever.

Gambler Option 3: Upper Confidence Limit Selection

The epsilon strategy seems like a smart way to go about selecting which levers to pull; however, it doesn't account for uncertainty in the expected reward estimates. How can we improve on it? What if we calculated a confidence interval for each machine's expected reward and selected the machine with the highest upper bound? That way, if we don't have much information about a certain machine, the algorithm will keep trying it until the confidence interval shrinks.
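In the code appendix this is the upper = TRUE branch of efficient_bandits(). Stripped down, the selection step looks like the sketch below (select_ucb() is an illustrative name; the bound matches the one computed in the appendix):

# choose the machine with the highest one-sided 95% upper confidence bound on its mean reward
select_ucb <- function(reward, sd, t) {
  upper <- reward + qt(.95, t - 1) * sd / sqrt(t)
  which.max(upper)
}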

Gambler Option 4: Optimistic Starting Values

All of the above strategies require the gambler to do a warm-up period to initialize the algorithm. What if the gambler wanted to give the machines the benefit of the doubt and assume that they pay out at an abnormally high rate? While doubtlessly setting themselves up for disappointment, this technique encourages early exploration. After each iteration, the optimistic starting values are updated toward more realistic expectations. This happens for every lever until the algorithm begins to converge. Let's check out how this strategy performs relative to the others by initializing each machine's expected reward to 6, as though it had already been pulled 10 times.
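The initialization is just a small table of optimistic estimates; this is essentially the bdf object created in the code appendix:

library(tidyverse)

bdf <- tibble(bandit = 1:4,
              sd = 10,       # generous starting standard deviation
              reward = 6,    # optimistic: above every machine's true expected value
              t = 10) %>%    # pretend 10 pulls have already been observed
  mutate(upper = reward + qt(.95, t - 1) * sd / sqrt(t))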

Note that this strategy converges to the 4th machine about 20% of the time. This is super interesting and occurs in experiments where the first machine has a low payout frequency early on (just by chance). Since the optimistic starting value strategy, unlike the epsilon strategy, does not rely on random off-policy draws, the algorithm can get stuck if the estimated value of machine 1 drops below that of another machine. These cases can be seen in the bimodal expected value histograms and the divergent expected value trends.

Gambler Option 5: Mix and Match

There are other strategies that can accommodate different scenarios, such as reward functions that change over time, but the gambler already has a lot of options! Let's try combining the epsilon strategy and the UCB strategy to encourage more exploration while still recognizing uncertainty in the expected reward estimates.
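Combining them is just a matter of turning both options on in the simulation helper defined in the code appendix; this is the call used for the results below:

efficient_bandits_df <- efficient_bandits_loop(initial_draws = 5,    # short warm-up: 5 pulls per machine
                                               reward_funs = list(bandit_1, bandit_2, bandit_3, bandit_4),
                                               iterations = 1000,
                                               time = target_n * 4,  # same total pull budget as before
                                               epsilon = .10,        # keep exploring off-policy
                                               upper = TRUE)         # select on the upper confidence bound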

Notice how the standard error of the expected rewards is much smaller than under the previous strategies. The algorithm converges earlier and more consistently, as evidenced by the unimodal and narrower distribution of expected value trends. The increased exploration comes at a cost, though: off-policy exploration can lead to suboptimal decisions, which shows up as a lower expected reward.

Concluding Thoughts

A Comparison of the Different Strategies

| Algorithm                  | Percent Correct | Average Expected Reward |
|----------------------------|-----------------|-------------------------|
| RCT                        | 0.69            | 4.49                    |
| Epsilon Strategy           | 0.79            | 4.73                    |
| UCB Strategy               | 0.76            | 4.77                    |
| Optimistic Starting Values | 0.75            | 4.77                    |
| UCB/Epsilon                | 0.78            | 4.70                    |

Further tuning of the parameters for each of these algorithms may lead to improved performance; I am pretty impressed with the off-the-shelf starting parameters! Using any of the bandit strategies improved upon the RCT in both the probability of selecting the optimal policy and the expected value of the rewards earned along the way. While the relative benefits of bandit algorithms are useful in applied settings, academics interested in precise estimation of non-optimal effects may not find them as useful. For example, the algorithm may exploit the highest rewarding option at the expense of lower rewarding alternatives, resulting in less information about those policy options! As always, it is super important to consider the purpose of the study prior to implementing any algorithm!

Code for Post

# --------------------------------------- Bandit Functions ----------------------------------- #
  bandit_1<-function(){
    payout<-rbinom(1,1, .5)
    if(payout){
      rnorm(1, 10, 2)
    }else{
      0
    }
  }
  
  bandit_2<-function(){
    payout<-rbinom(1,1, .25)
    if(payout){
      rnorm(1, 17, 2)
    }else{
      0
    }
  }
  
  bandit_3<-function(){
    payout<-rbinom(1,1, .1)
    if(payout){
      rnorm(1, 45, 2)
    }else{
      0
    }
  }
  
  bandit_4<-function(){
    payout<-rbinom(1,1, .7)
    if(payout){
      rnorm(1, 6, 2)
    }else{
      0
    }
  }
  
  # --------------------------------------- Sample Size Determination ----------------------------------- #
  library(pwr)
  
  pwr::cohen.ES("t", "small")
  
  pwr::pwr.t.test(d = .2, 
                  sig.level = .05, 
                  power = .8, 
                  type = "two.sample", 
                  alternative = "two.sided")
  
  # --------------------------------------- RCT Simulation ----------------------------------- #
  
  
  ### Simulating randomized control trial with Gambler's Dilemma ###
  library(tidyverse)
  
  ### Defining set-up parameters
  set.seed(555555)
  
  bandit_list<-list(
    bandit_1, 
    bandit_2, 
    bandit_3, 
    bandit_4
  )
  
  
  
  # What is the target within group sample size?
  target_n<-393
  # simulation count
  m<-1000
  
  simulation_storage<-list()
  
  for (i in 1:m){
    
    # how many times has each machine been sampled?
    pull_count<-rep(0, times = 4)
    
    # A trigger to stop the simulation
    bandit_index<-pull_count<target_n
    
    # result storage
    reward_df<-tibble(
      machine = 1:4,
      reward = NA,
      var = NA
    )
    
    while(any(bandit_index)){
      
      # select a machine at random from among those that have not yet reached the target sample size
     machines_left<- c(1:4)[bandit_index]
     
     # necessary because of how sample() handles a single integer passed as x (e.g., sample(4, 1) draws from 1:4)
     if(length(machines_left)>1){
           selected<-sample(machines_left, size = 1)
     }else{
       selected<-machines_left
     }
        
      # make a pull from the machine to see the reward
       new_reward<- bandit_list[[selected]]()
       
      # update the pull count to reflect recent pull
       pull_count[[selected]]<-pull_count[[selected]]+1
       
      # update the expected mean and variance
       ## if this is not the first draw calculate the mean
       if(pull_count[[selected]]!=1){
         
         old_ave_reward<-reward_df$reward[[selected]]
         new_ave_reward<-old_ave_reward+(new_reward-old_ave_reward)/pull_count[[selected]]
         reward_df$reward[selected]<-new_ave_reward
         
         old_var<-reward_df$var[[selected]]
         reward_df$var[[selected]]<-(old_var*(pull_count[[selected]]-2)+(new_reward-old_ave_reward)*(new_reward-new_ave_reward))/(pull_count[[selected]]-1)
         
         # if this is the first draw, no calculations necessary, just use the observed score
       }else{
          reward_df$reward[selected]<-new_reward
           reward_df$var[[selected]]<-0
       }
       bandit_index<-pull_count<target_n
    }
    simulation_storage[[i]]<-reward_df%>%mutate(simulation = i)
  }
  
  # --------------------------------------- RCT Summary ----------------------------------- #
  max_count<-map_dfr(simulation_storage, ~.x%>%
                       group_by(simulation)%>%
                       mutate(max_var = reward == max(reward))%>%
                       ungroup())%>%
    count(machine, max_var)%>%
    ungroup()%>%
    mutate(machine = as.factor(machine),
           n = n/m)
  
  
  max_count%>%
    filter(max_var == TRUE)%>%
    ggplot(aes(x = machine, y = n))+
    geom_bar(stat = "identity")+
    theme(panel.background = element_blank(),
          panel.grid  = element_blank(),
          axis.line = element_line())+
    labs(x = "Machine", 
         y = "Probability",
         title = "Probability of Selecting Machine at the End of the Experiment")+
    lims(y = c(0,1))
  
  rct_pc<-max_count%>%
    filter(machine == "1", max_var == TRUE)%>%
    pull("n")
  
  reward_dat<-map_dfr(simulation_storage, ~.x%>%
                                group_by(simulation)%>%
                                summarise(avg_reward_per_trial = mean(reward), .groups = "keep"))
  reward_dat%>%
    ggplot(aes(x = avg_reward_per_trial))+
    geom_histogram(color = "black", 
                   bins = 20)+
    annotate(geom = "text",
             label =  paste(" Mean =", 
                                 round(
                                   mean(reward_dat$avg_reward_per_trial), 
                                   2)
                            ),
             x = -Inf,
             y = Inf,
             vjust = 1, 
             hjust = "left")+
    labs(x = "Average Reward per Trial",
         y = "Frequency",
         title = "Distribution of Average Reward per Trial for RCT")+
      theme(panel.background = element_blank(),
          panel.grid  = element_blank(),
          axis.line = element_line())
  
  rct_mean<-mean(reward_dat$avg_reward_per_trial)
  
  # --------------------------------------- Defining Helper Functions ----------------------------------- #
  #### Defining Helper Functions ####
  efficient_bandits<-function(bandit_df, eps, funs, upper = FALSE){
    # Implementing the incremental estimation of the multi-armed bandit
    n_bandits<-nrow(bandit_df)
    
    if (!upper){
      # randomly selecting if this should be an off policy draw with probability = epsilon
      if(rbinom(1, 1, eps)){
        selected<-sample(1:n_bandits, size = 1)
      }else{
        selected<-which.max(bandit_df$reward)
      }
    } else{
      if(rbinom(1, 1, eps)){
        selected<-sample(1:n_bandits, size = 1)
      }else{
        selected<-which.max(bandit_df$upper)
      }
    }
    
    # Drawing from the selected function
    r<-funs[[selected]]()
    
    # updating the expected reward for the bandit
    t<-bandit_df$t[[selected]]+1
    q_tm1<-bandit_df$reward[[selected]]
    q_t<-q_tm1+(r-q_tm1)/t
    bandit_df$reward[[selected]]<-q_t
    
    # Updating the time times the reward was drawn
    bandit_df$t[[selected]]<-t
    
    if(upper){
      # implementing online algorithm for sd
      # SS_k = SS_{k-1} + (x_k - M_{k-1})*(x_k - M_k)
      bandit_df$sd[[selected]]<-sqrt((bandit_df$sd[[selected]]^2*(t-2)+(r-q_tm1)*(r-q_t))/(t-1))
      
      # calculating the upper confidence bound
      bandit_df$upper[[selected]]<-bandit_df$reward[[selected]]+qt(.95, t-1)*bandit_df$sd[[selected]]/sqrt(t)
    }
    
    bandit_df
  }
  
  
  efficient_bandits_loop<-function(initial_draws, bandit_storage = NULL, reward_funs, iterations, time, epsilon=.05, upper = FALSE){
    
    #inner loop helper function
    inner_loop<- function(m){
      message("\rIteration ", m)
     
      # Initializing Draws 
      if(is.null(bandit_storage)){
      bandit_storage<-tibble()
      for (f in seq_along(reward_funs)){
        for (draws in 1:initial_draws){
          bandit_storage<-bind_rows(
            bandit_storage,
                    tibble(bandit = f,
                 reward = reward_funs[[f]]())
          )
        }
      }
      
      bandit_storage<-bandit_storage%>%
        group_by(bandit)%>%
        summarise(sd = sd(reward), # compute the sd before reward is overwritten by its mean
                  reward = mean(reward),
                  t = n(),
                  .groups = "keep")%>%
        ungroup()%>%
        mutate(upper = reward+qt(.95, t-1)*sd/sqrt(t))
      }
  
      
      # initializing the inner loop storage for time point storage
      bandit_storage$time<-1
      bdf_list<-list(bandit_storage)
      
      # inner loop for bandit trials within time
      for (i in sum(bandit_storage$t):time){
        t_count<-sum(bandit_storage$t)
        #implementing efficient bandits
        eb_out<-efficient_bandits(bdf_list[[i-(t_count-1)]], epsilon, reward_funs, upper)
        eb_out$time<-i-(t_count-2)
        bdf_list[[i-(t_count-2)]]<-eb_out
      }
      bind_rows(bdf_list)%>%
        mutate(iter = m)
    }
    
    # outer loop for time within simulation
    out_list<-lapply(1:iterations, inner_loop)
    
    bind_rows(out_list) 
  }
  
  # --------------------------------------- Epsilon Strategy ----------------------------------- #
  
  #### Running bandit algorithm ####
  efficient_bandits_df<-efficient_bandits_loop(initial_draws = 5, # 20-pull warm-up: 5 pulls per machine
                                               reward_funs = list(bandit_1,
                                                                  bandit_2,
                                                                  bandit_3,
                                                                  bandit_4),
                                               iterations = m, 
                                               time = target_n*4, # give the bandit the same total pull budget as the RCT (pulls are not split equally across levers)
                                               epsilon = .10)
  
  # --------------------------------------- Epsilon Summary ----------------------------------- #
  ### Plotting 
  max_count<-efficient_bandits_df%>%
    ungroup()%>%
    group_by(iter)%>%
    filter(time == max(time))%>%
    filter(reward == max(reward))%>%
    ungroup()%>%
    count(bandit)%>%
    mutate(p = n/m)
  
  max_count%>%
    ggplot(aes(x = bandit, y = p))+
    geom_bar(stat = "identity")+
      theme(panel.background = element_blank(),
          panel.grid  = element_blank(),
          axis.line = element_line())+
    labs(x = "Machine", 
         y = "Probability",
         title = "Probability of Selecting Machine at the End of the Experiment")+
    lims(y = c(0,1))
  
  reward_dat<-efficient_bandits_df%>%
    ungroup()%>%
    group_by(iter)%>%
    filter(time == max(time))%>%
    summarise(avg_reward_per_trial = sum(reward*t/sum(t)), .groups = "keep")
  
  reward_dat%>%
    ggplot(aes(x = avg_reward_per_trial))+
    geom_histogram(color = "black", 
                   bins = 20)+
   annotate(geom = "text",
             label =  paste(" Mean =", 
                                 round(
                                   mean(reward_dat$avg_reward_per_trial), 
                                   2)
                            ),
             x = -Inf,
             y = Inf,
             vjust = 1, 
             hjust = "left")+
    labs(x = "Average Reward per Trial",
         y = "Frequency",
         title = "Distribution of Average Rewards per Trial for Muli-Armed \nBandit with Epsilon Strategy")+
      theme(panel.background = element_blank(),
          panel.grid  = element_blank(),
          axis.line = element_line())
  
  eps_pc<-max_count%>%
    filter(bandit == "1")%>%
    pull("p")
  
  eps_mean<-mean(reward_dat$avg_reward_per_trial)
  
  bandit_summary<-efficient_bandits_df%>%
    arrange(iter, time)%>%
    group_by(iter, time)%>%
    mutate(cum_reward = sum(reward*t),
           time = sum(t),
           ave_reward = cum_reward/time)%>%
    ungroup()
  
  bandit_means <- bandit_summary%>%
    group_by(time)%>%
    summarise(ave_reward = mean(ave_reward))
  
  bandit_summary%>%
    ggplot(aes(x = time, y = ave_reward))+
    geom_line(alpha = .05, aes(group = iter))+
    geom_line(data = bandit_means, aes(x = time, y = ave_reward), color = "blue")+
      labs(y = "Average Reward per Trial",
         x = "Trial",
         title = "Convergence of Bandit Alogrithm with Epsilon Strategy")+
      theme(panel.background = element_blank(),
          panel.grid  = element_blank(),
          axis.line = element_line())
  
  # --------------------------------------- UCB Strategy ----------------------------------- #
  
  #### Running bandit algorithm ####
  efficient_bandits_df<-efficient_bandits_loop(initial_draws = 5, # warm-up: 5 initial pulls per machine
                                               reward_funs = list(bandit_1,
                                                                  bandit_2,
                                                                  bandit_3,
                                                                  bandit_4),
                                               iterations = m, 
                                               time = target_n*4, 
                                               epsilon = 0,
                                               upper = TRUE)
  
  
  # --------------------------------------- UCB Summary ----------------------------------- #
  
  max_count<-efficient_bandits_df%>%
    ungroup()%>%
    group_by(iter)%>%
    filter(time == max(time))%>%
    filter(reward == max(reward))%>%
    ungroup()%>%
    count(bandit)%>%
    mutate(p = n/m)
  
  # add zero-probability rows for machines that never had the highest estimated reward,
  # then convert bandit to a factor for plotting
  missing<-c(1:4)[!1:4 %in% max_count$bandit]
  
  max_count<-max_count%>%
    add_row(bandit = missing, p = 0)%>%
    mutate(bandit = as.factor(bandit))
  
  
  max_count%>%
    ggplot(aes(x = bandit, y = p))+
    geom_bar(stat = "identity")+
      theme(panel.background = element_blank(),
          panel.grid  = element_blank(),
          axis.line = element_line())+
    labs(x = "Machine", 
         y = "Probability",
         title = "Probability of Selecting Machine at the End of the Experiment")+
    lims(y = c(0,1))
  
  reward_dat<-efficient_bandits_df%>%
    ungroup()%>%
    group_by(iter)%>%
    filter(time == max(time))%>%
    summarise(avg_reward_per_trial = sum(reward*t/sum(t)), .groups = "keep")
  
  reward_dat%>%
    ggplot(aes(x = avg_reward_per_trial))+
    geom_histogram(color = "black", 
                   bins = 20)+
    annotate(geom = "text",
               label =  paste(" Mean =", 
                                   round(
                                     mean(reward_dat$avg_reward_per_trial), 
                                     2)
                              ),
               x = -Inf,
               y = Inf,
               vjust = 1, 
               hjust = "left")+
    labs(x = "Average Reward per Trial",
         y = "Frequency",
         title = "Distribution of Average Rewards per Trial for Muli-Armed \nBandit with UCB Strategy")+
      theme(panel.background = element_blank(),
          panel.grid  = element_blank(),
          axis.line = element_line())
  
  ucb_pc<-max_count%>%
    filter(bandit == "1")%>%
    pull("p")
  
  ucb_mean<-mean(reward_dat$avg_reward_per_trial)
  
  bandit_summary<-efficient_bandits_df%>%
    arrange(iter, time)%>%
    group_by(iter, time)%>%
    mutate(cum_reward = sum(reward*t),
           time = sum(t),
           ave_reward = cum_reward/time)%>%
    ungroup()
  
  bandit_means <- bandit_summary%>%
    group_by(time)%>%
    summarise(ave_reward = mean(ave_reward))
  
  bandit_summary%>%
    ggplot(aes(x = time, y = ave_reward))+
    geom_line(alpha = .05, aes(group = iter))+
    geom_line(data = bandit_means, aes(x = time, y = ave_reward), color = "blue")+
      labs(y = "Average Reward per Trial",
         x = "Trial",
         title = "Convergence of Bandit Alogrithm with Upper Confidence Band Strategy")+
      theme(panel.background = element_blank(),
          panel.grid  = element_blank(),
          axis.line = element_line())
  
  
  # --------------------------------------- Optimistic Starting Values Strategy ----------------------------------- #
  
  bdf<-tibble(bandit = 1:4,
              sd = c(10, 10, 10, 10),
              reward = c(6, 6, 6, 6), 
              t = c(10,10,10,10))%>%
    mutate(upper = reward+qt(.95, t-1)*sd/sqrt(t))
  
  #### Running bandit algorithm ####
  efficient_bandits_df<-efficient_bandits_loop(bandit_storage = bdf, 
                                               reward_funs = list(bandit_1,
                                                                  bandit_2,
                                                                  bandit_3,
                                                                  bandit_4),
                                               iterations = m, 
                                               time = target_n*4, 
                                               epsilon = .05)
  
  
  # --------------------------------------- Optimistic Starting Values Summary ----------------------------------- #
  
  max_count<-efficient_bandits_df%>%
    ungroup()%>%
    group_by(iter)%>%
    filter(time == max(time))%>%
    filter(reward == max(reward))%>%
    ungroup()%>%
    count(bandit)%>%
    mutate(p = n/m)
  
  max_count%>%
    ggplot(aes(x = bandit, y = p))+
    geom_bar(stat = "identity")+
      theme(panel.background = element_blank(),
          panel.grid  = element_blank(),
          axis.line = element_line())+
    labs(x = "Machine", 
         y = "Probability",
         title = "Probability of Selecting Machine at the End of the Experiment")+
    lims(y = c(0,1))
  
  reward_dat<-efficient_bandits_df%>%
    ungroup()%>%
    group_by(iter)%>%
    filter(time == max(time))%>%
    summarise(avg_reward_per_trial = sum((reward*t-6*10)/sum(t-10)), .groups = "keep")
  
  
  reward_dat%>%
    ggplot(aes(x = avg_reward_per_trial))+
    geom_histogram(color = "black", 
                   bins = 20)+
    annotate(geom = "text",
             label =  paste(" Mean =", 
                                 round(
                                   mean(reward_dat$avg_reward_per_trial), 
                                   2)
                            ),
             x = -Inf,
             y = Inf,
             vjust = 1, 
             hjust = "left")+
    labs(x = "Average Reward per Trial",
         y = "Frequency",
         title = "Distribution of Average Rewards per Trial for Muli-Armed \nBandit with Optimistic Starting Values")+
      theme(panel.background = element_blank(),
          panel.grid  = element_blank(),
          axis.line = element_line())
  
  osv_pc<-max_count%>%
    filter(bandit == "1")%>%
    pull("p")
  
  osv_mean<-mean(reward_dat$avg_reward_per_trial)
  
  bandit_summary<-efficient_bandits_df%>%
    arrange(iter, time)%>%
    group_by(iter, time)%>%
    mutate(cum_reward = sum(reward*t-6*10),
           time = sum(t-10),
           ave_reward = cum_reward/time)%>%
    ungroup()
  
  bandit_means <- bandit_summary%>%
    group_by(time)%>%
    summarise(ave_reward = mean(ave_reward))
  
  bandit_summary%>%
    ggplot(aes(x = time, y = ave_reward))+
    geom_line(alpha = .05, aes(group = iter))+
    geom_line(data = bandit_means, aes(x = time, y = ave_reward), color = "blue")+
      labs(y = "Average Reward per Trial",
         x = "Trial",
         title = "Convergence of Bandit Alogrithm with Optimistic Starting Values")+
      theme(panel.background = element_blank(),
          panel.grid  = element_blank(),
          axis.line = element_line())
  
  
  # --------------------------------------- Combined Strategy ----------------------------------- #
  efficient_bandits_df<-efficient_bandits_loop(initial_draws = 5,
                                               reward_funs = list(bandit_1,
                                                                  bandit_2,
                                                                  bandit_3,
                                                                  bandit_4),
                                               iterations = 1000, 
                                               time = target_n*4, 
                                               epsilon = .10, 
                                               upper = TRUE)
  
  max_count<-efficient_bandits_df%>%
    ungroup()%>%
    group_by(iter)%>%
    filter(time == max(time))%>%
    filter(reward == max(reward))%>%
    ungroup()%>%
    count(bandit)%>%
    mutate(p = n/m)
  
  max_count%>%
    ggplot(aes(x = bandit, y = p))+
    geom_bar(stat = "identity")+
      theme(panel.background = element_blank(),
          panel.grid  = element_blank(),
          axis.line = element_line())+
    labs(x = "Machine", 
         y = "Probability",
         title = "Probability of Selecting Machine at the End of the Experiment")+
    lims(y = c(0,1))
  
  reward_dat<-efficient_bandits_df%>%
    ungroup()%>%
    group_by(iter)%>%
    filter(time == max(time))%>%
    summarise(avg_reward_per_trial = sum(reward*t/sum(t)), .groups = "keep")
  
  
  reward_dat%>%
    ggplot(aes(x = avg_reward_per_trial))+
    geom_histogram(color = "black", 
                   bins = 20)+
    annotate(geom = "text",
             label =  paste(" Mean =", 
                                 round(
                                   mean(reward_dat$avg_reward_per_trial), 
                                   2)
                            ),
             x = -Inf,
             y = Inf,
             vjust = 1, 
             hjust = "left")+
    labs(x = "Average Reward per Trial",
         y = "Frequency",
         title = "Distribution of Average Rewards per Trial for Muli-Armed \nBandit using the Epsilon/UCB Strategy")+
      theme(panel.background = element_blank(),
          panel.grid  = element_blank(),
          axis.line = element_line())
  
  cs_pc<-max_count%>%
    filter(bandit == "1")%>%
    pull("p")
  
  cs_mean<-mean(reward_dat$avg_reward_per_trial)
  
  bandit_summary<-efficient_bandits_df%>%
    arrange(iter, time)%>%
    group_by(iter, time)%>%
    mutate(cum_reward = sum(reward*t),
           time = sum(t),
           ave_reward = cum_reward/time)%>%
    ungroup()
  
  bandit_means <- bandit_summary%>%
    group_by(time)%>%
    summarise(ave_reward = mean(ave_reward))
  
  bandit_summary%>%
    ggplot(aes(x = time, y = ave_reward))+
    geom_line(alpha = .05, aes(group = iter))+
    geom_line(data = bandit_means, aes(x = time, y = ave_reward), color = "blue")+
      labs(y = "Average Reward per Trial",
         x = "Trial",
         title = "Convergence of Bandit Alogrithm with Epsion/UCB Strategy")+
      theme(panel.background = element_blank(),
          panel.grid  = element_blank(),
          axis.line = element_line())
  
  ### Final Table
  tibble(Algorithm = c("RCT", 
                       "Epsilon Strategy",
                       "UCB Strategy",
                       "Optimistic Starting Values",
                       "UCB/Epsilon"
                       ),
         `Percent Correct` = c(rct_pc,
                               eps_pc,
                               ucb_pc,
                               osv_pc, 
                               cs_pc),
         `Average Expected Reward` = c(rct_mean,
                                       eps_mean,
                                       ucb_mean,
                                       osv_mean, 
                                       cs_mean))%>%
    knitr::kable(digits = 2, caption = "A Comparison of the Different Strategies")