R: Replacing for loops with data frames

Mark NeedhamApril 27th, 2015Last Updated: April 23rd, 2015

0 13 4 minutes read

In my last blog post I showed how to derive posterior probabilities for the Think Bayes dice problem:

Suppose I have a box of dice that contains a 4-sided die, a 6-sided die, an 8-sided die, a 12-sided die, and a 20-sided die. If you have ever played Dungeons & Dragons, you know what I am talking about.
Suppose I select a die from the box at random, roll it, and get a 6. What is the probability that I rolled each die?

To recap, this was my final solution:

likelihoods = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))  
  names(scores) = names
 
  for(name in names) {
      for(observation in observations) {
        if(name < observation) {
          scores[paste(name)]  = 0
        } else {
          scores[paste(name)] = scores[paste(name)] *  (1.0 / name)
        }        
      }
    }  
  return(scores)
}
 
dice = c(4,6,8,12,20)
l1 = likelihoods(dice, c(6))
 
> l1 / sum(l1)
        4         6         8        12        20 
0.0000000 0.3921569 0.2941176 0.1960784 0.1176471

Although it works we have nested for loops which aren’t very idiomatic R so let’s try and get rid of them.

The first thing we want to do is return a data frame rather than a vector so we tweak the first two lines to read like this:

scores = rep(1.0 / length(names), length(names))  
df = data.frame(score = scores, name = names)

Next we can get rid of the inner for loop and replace it with a call to ifelse wrapped inside a dplyr mutate call:

library(dplyr)
likelihoods2 = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))  
  df = data.frame(score = scores, name = names)
 
  for(observation in observations) {
    df = df %>% mutate(score = ifelse(name < observation, 0, score * (1.0 / name)) )
  }
 
  return(df)
}
 
dice = c(4,6,8,12,20)
l1 = likelihoods2(dice, c(6))
 
> l1
       score name
1 0.00000000    4
2 0.03333333    6
3 0.02500000    8
4 0.01666667   12
5 0.01000000   20

Finally we’ll tidy up the scores so they’re relatively weighted against each other:

likelihoods2 = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))  
  df = data.frame(score = scores, name = names)
 
  for(observation in observations) {
    df = df %>% mutate(score = ifelse(name < observation, 0, score * (1.0 / name)) )
  }
 
  return(df %>% mutate(weighted = score / sum(score)) %>% select(name, weighted))
}
 
dice = c(4,6,8,12,20)
l1 = likelihoods2(dice, c(6))
 
> l1
  name  weighted
1    4 0.0000000
2    6 0.3921569
3    8 0.2941176
4   12 0.1960784
5   20 0.1176471

Now we’re down to just the one for loop. Getting rid of that one is a bit trickier. First we’ll create a data frame which contains a row for every (observation, dice) pair, simulating the nested for loops:

likelihoods3 = function(names, observations) {
  l = list(observation = observations, roll = names)
  obsDf = do.call(expand.grid,l) %>% 
    mutate(likelihood = 1.0 / roll, 
           score = ifelse(roll < observation, 0, likelihood))   
 
  return(obsDf)
}
 
dice = c(4,6,8,12,20)
l1 = likelihoods3(dice, c(6))
 
> l1
  observation roll likelihood      score
1           6    4 0.25000000 0.00000000
2           6    6 0.16666667 0.16666667
3           6    8 0.12500000 0.12500000
4           6   12 0.08333333 0.08333333
5           6   20 0.05000000 0.05000000
 
l2 = likelihoods3(dice, c(6, 4, 8, 7, 7, 2))
> l2
   observation roll likelihood      score
1            6    4 0.25000000 0.00000000
2            4    4 0.25000000 0.25000000
3            8    4 0.25000000 0.00000000
4            7    4 0.25000000 0.00000000
5            7    4 0.25000000 0.00000000
6            2    4 0.25000000 0.25000000
7            6    6 0.16666667 0.16666667
8            4    6 0.16666667 0.16666667
9            8    6 0.16666667 0.00000000
10           7    6 0.16666667 0.00000000
11           7    6 0.16666667 0.00000000
12           2    6 0.16666667 0.16666667
13           6    8 0.12500000 0.12500000
14           4    8 0.12500000 0.12500000
15           8    8 0.12500000 0.12500000
16           7    8 0.12500000 0.12500000
17           7    8 0.12500000 0.12500000
18           2    8 0.12500000 0.12500000
19           6   12 0.08333333 0.08333333
20           4   12 0.08333333 0.08333333
21           8   12 0.08333333 0.08333333
22           7   12 0.08333333 0.08333333
23           7   12 0.08333333 0.08333333
24           2   12 0.08333333 0.08333333
25           6   20 0.05000000 0.05000000
26           4   20 0.05000000 0.05000000
27           8   20 0.05000000 0.05000000
28           7   20 0.05000000 0.05000000
29           7   20 0.05000000 0.05000000
30           2   20 0.05000000 0.05000000

Now we need to iterate over the data frame, grouping by ‘roll’ so that we end up with one row for each one.

We’ll add a new column which stores the posterior probability for each dice. This will be calculated by multiplying the prior probability by the product of the ‘score’ entries.

This is what our new likelihood function looks like:

likelihoods3 = function(names, observations) {
  l = list(observation = observations, roll = names)
  obsDf = do.call(expand.grid,l) %>% 
    mutate(likelihood = 1.0 / roll, 
           score = ifelse(roll < observation, 0, likelihood))   
 
  return (obsDf %>% 
    group_by(roll) %>% 
    summarise(s = (1.0/length(names)) * prod(score) ) %>%
    ungroup() %>% 
    mutate(weighted = s / sum(s)) %>%
    select(roll, weighted))
}
 
l1 = likelihoods3(dice, c(6))
> l1
Source: local data frame [5 x 2]
 
  roll  weighted
1    4 0.0000000
2    6 0.3921569
3    8 0.2941176
4   12 0.1960784
5   20 0.1176471
 
l2 = likelihoods3(dice, c(6, 4, 8, 7, 7, 2))
> l2
Source: local data frame [5 x 2]
 
  roll    weighted
1    4 0.000000000
2    6 0.000000000
3    8 0.915845272
4   12 0.080403426
5   20 0.003751302

We’ve now got the same result as we did with our nested for loops so I think the refactoring has been a success.

Reference:

R: Replacing for loops with data frames from our JCG partner Mark Needham at the Mark Needham Blog blog.