Control Flow

In this lab, we will learn how to use ifelse() for vectorized control flow, and how to avoid writing for loops.

Goal: by the end of this lab, you will be able to assign values conditionally and re-write a for loop using map().

`ifelse()`

The if () ... else syntax is for control flow: it evaluates a single condition and runs one branch. By contrast, ifelse() is a function that takes a logical vector as its condition and returns a vector of the same length, choosing each element from one of two alternatives. It is often useful inside mutate().
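To make the distinction concrete, here is a minimal sketch in base R. The vector `heights` and the labels are made up for illustration and are not part of the starwars data.

```r
# Hypothetical example vector, invented for illustration
heights <- c(96, 172, 202, NA)

# if () ... else expects a single TRUE or FALSE and runs one branch
if (length(heights) > 3) {
  size_note <- "more than three values"
} else {
  size_note <- "three or fewer values"
}
size_note
# "more than three values"

# ifelse() tests every element and returns a vector of the same length
height_class <- ifelse(heights > 180, "tall", "not tall")
height_class
# "not tall" "not tall" "tall" NA
```

Note that an NA in the condition produces an NA in the result. This matters for the species example that follows.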

In the starwars data set, most characters have a species. However, there are many different species.

starwars %>%
  group_by(species) %>%
  count() %>%
  arrange(desc(n))

Suppose that we wanted to lump all of the non-human and non-droid species together. We can use ifelse() to create a new variable.

sw2 <- starwars %>%
  mutate(
    species_update = ifelse(
      !species %in% c("Human", "Droid"), 
      "Other", 
      species
    )
  ) %>%
  select(name, species, species_update)

Note the behavior around NA. Some characters have unknown species.

starwars %>%
  filter(is.na(species))

Our previous construction classified every character who is neither a human nor a droid as Other, including characters whose species is unknown. Arguably, those characters should be left as NA.

sw2 %>%
  group_by(species_update) %>%
  count() %>%
  arrange(desc(n))

By capturing NAs in our condition, we can leave them as NAs.

starwars %>%
  mutate(
    species_update = ifelse(
      !species %in% c("Human", "Droid", NA),
      "Other", species
    )
  ) %>%
  filter(is.na(species)) %>%
  select(name, species, species_update)
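Why does adding NA to the lookup set work? The sketch below, which uses a small made-up `species` vector rather than the full starwars data, compares how == and %in% treat missing values.

```r
# Hypothetical stand-in for the species column
species <- c("Human", "Droid", "Wookiee", NA)

# == propagates NA: comparing an unknown value gives an unknown answer
species == "Droid"
# FALSE TRUE FALSE NA

# %in% never returns NA; an NA only matches if NA is in the lookup set
species %in% c("Human", "Droid")
# TRUE TRUE FALSE FALSE
species %in% c("Human", "Droid", NA)
# TRUE TRUE FALSE TRUE

# Without NA in the set, unknown species are lumped into "Other"
lumped <- ifelse(!species %in% c("Human", "Droid"), "Other", species)
lumped
# "Human" "Droid" "Other" "Other"

# With NA in the set, unknown species fall through and stay NA
kept_na <- ifelse(!species %in% c("Human", "Droid", NA), "Other", species)
kept_na
# "Human" "Droid" "Other" NA
```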

Create a new variable called is_bald and set it to FALSE if the character has hair of any color, TRUE if the character has no hair, and NA if the character is a droid.

# SAMPLE SOLUTION

starwars <- starwars %>%
  mutate(
    is_bald = ifelse(species == "Droid", NA, TRUE), 
    is_bald = ifelse(is_bald & hair_color != "none", 
                     FALSE, is_bald)
  )

Use the following code to check your previous answer. Pay careful attention to NAs. Do you have them in all the right places?

starwars %>%
  select(hair_color, is_bald) %>%
  table(useNA = "always")

`for` loops

As noted in the book, there are many reasons to avoid writing loops in R. I have never written a repeat loop. There are only rare occasions when a while loop is necessary. Unless you need to explicitly access indices, you can and should rewrite a for loop as a map() statement. I will strongly encourage you to do this!!
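As a rough illustration of that rewrite, the sketch below computes the same result with a for loop and with map_dbl() from purrr. The list `samples` is made up for this example.

```r
library(purrr)

# Hypothetical input, invented for illustration
samples <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# for loop: pre-allocate the output, then fill it in by index
means_loop <- numeric(length(samples))
for (i in seq_along(samples)) {
  means_loop[i] <- mean(samples[[i]])
}

# map_dbl(): apply mean() to each element and collect a numeric vector
means_map <- map_dbl(samples, mean)

means_loop
means_map
```

Both versions compute the means 2, 15, and 5. The map_dbl() version needs no index variable and no pre-allocation, and it keeps the names of the input list.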

Vectorized operations

Many operations in R are vectorized already, so you often don’t need a loop at all.

Consider generating the first 10 numbers in some integer sequences. For the perfect squares, you don’t need a loop at all, because the square operator is vectorized. Recall that vectors are built into the fundamental design of R, so things are supposed to work this way!

x <- 1:10

x^2

##  [1]   1   4   9  16  25  36  49  64  81 100

However, consider generating the Fibonacci sequence. This can’t be vectorized, because each entry depends on the previous two entries. You could write a for loop.

fib <- c(1, 1)
for (i in 3:length(x)) {
  fib[i] <- fib[i-1] + fib[i-2]
}
fib

##  [1]  1  1  2  3  5  8 13 21 34 55

If we had the Fibonacci sequence already, we could use R’s vector-based operation lag() to decompose the sequence.

fib_df <- tibble(
  fib, 
  prev_x = lag(fib), 
  prev_prev_x = lag(fib, 2)
)
fib_df

But this won’t help us generate new values in the sequence.

Using `map()`

Instead, we can write a recursive function to generate the \(n\)th value in the sequence, and then map() that function over x.

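fibonacci <- function(x) {
  # Base cases: the first two values in the sequence are both 1
  if (x == 1 || x == 2) {
    return(1L)
  } else {
    # Recursive case: sum of the two preceding values
    return(fibonacci(x - 1) + fibonacci(x - 2))
  }
}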

map_int(x, fibonacci)

##  [1]  1  1  2  3  5  8 13 21 34 55
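The choice of map_int() rather than map() matters here. As a small sketch of the difference, map() always returns a list, while map_int() returns an integer vector. The function is redefined below only so that the block runs on its own.

```r
library(purrr)

# Same recursive definition as above, repeated to keep this block self-contained
fibonacci <- function(x) {
  if (x == 1 || x == 2) {
    return(1L)
  }
  fibonacci(x - 1) + fibonacci(x - 2)
}

fib_list <- map(1:5, fibonacci)      # a list of length 5
fib_int  <- map_int(1:5, fibonacci)  # an integer vector of length 5

is.list(fib_list)
# TRUE
fib_int
# 1 1 2 3 5
```

Because the base case returns 1L, every result is an integer, which is what map_int() requires of each element.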

Choosing a paradigm

Generally, when you have a vector x as input, and you want to produce a vector y of the same length as output, you can use one of two paradigms:

If the operation can be vectorized, write a function that will take the whole input vector x and compute the whole y vector at once. I suspect that this will be the most efficient method in nearly every case.
If the operation can’t be vectorized, write a function that will compute a single value of y for a single value of x, and then map() that function over x.

Only if neither of these is possible should you write a for loop.
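The two paradigms can be sketched side by side. The helper names `abs_vectorized` and `abs_scalar` are invented for this illustration; in practice you would simply call the built-in abs().

```r
library(purrr)

x <- c(-2, 0, 3)

# Paradigm 1: a vectorized function handles the whole vector in one call
abs_vectorized <- function(x) ifelse(x < 0, -x, x)
y1 <- abs_vectorized(x)

# Paradigm 2: a function built on if () ... else handles one value at a time,
# so we map it over x
abs_scalar <- function(x) {
  if (x < 0) -x else x
}
y2 <- map_dbl(x, abs_scalar)

y1
# 2 0 3
y2
# 2 0 3
```

Calling abs_scalar(x) directly on the whole vector would not work as intended, because if () expects a single condition. That is exactly the situation in which the second paradigm applies.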

Recall that we saw map() previously in the context of list-columns.

Use the vectorized nchar() function to compute the number of characters in each character’s name, without writing any kind of loop.

# SAMPLE SOLUTION
nchar(starwars$name)

##  [1] 14  5  5 11 11  9 18  5 17 14 16 14  9  8  6 21 14 16  4  9  9  5  5 16  5
## [26]  6 10 12 21  9 12 11 13 13 12 10  8  5  7 13 14 10 11 11  8  7 14 10 12  9
## [51]  9 10 11 11  8 10 12  5 11 17 15 13  5  5 19 10 10 15  7  7 10 13  6 10  8
## [76]  8  8  7 15  9 10  4  3 11  3 14 13

Now compute the same output, but using map_int() and nchar(). Make sure you understand the difference between these two approaches.

# SAMPLE SOLUTION

map_int(starwars$name, nchar)

##  [1] 14  5  5 11 11  9 18  5 17 14 16 14  9  8  6 21 14 16  4  9  9  5  5 16  5
## [26]  6 10 12 21  9 12 11 13 13 12 10  8  5  7 13 14 10 11 11  8  7 14 10 12  9
## [51]  9 10 11 11  8 10 12  5 11 17 15 13  5  5 19 10 10 15  7  7 10 13  6 10  8
## [76]  8  8  7 15  9 10  4  3 11  3 14 13

Now use map_int() and length() to compute a numeric vector of the number of vehicles associated with each character.

# SAMPLE SOLUTION

map_int(starwars$vehicles, length)

##  [1] 2 0 0 0 1 0 0 0 0 1 2 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## [39] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0
## [77] 1 0 0 0 0 0 0 0 0 0 0

Use map() and nchar() to compute the total number of characters in the names of the starships associated with each character. For example, Luke Skywalker primarily flew an X-wing fighter, but also briefly piloted an Imperial shuttle in Return of the Jedi. So the number of characters in his starships list is 6 + 16 = 22.

# SAMPLE SOLUTION

map_int(starwars$starships, ~sum(nchar(.x)))

##  [1] 22  0  0 15  0  0  0  0  6 96 53  0 33 33  0  0  6  6  0  0  7  0  0 17  0
## [26]  0  0  6  0 17  0  0  0  0  0  0 20  0  0  0  0  8  0  0  0  0  0  0  0  0
## [51]  0  0  0  0 16  0 13  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
## [76]  0 24  0  0  0  0  0  0 19  0  0 48

# SAMPLE SOLUTION

starwars %>%
  pull(starships) %>%
  map(nchar) %>%
  map_int(sum)

##  [1] 22  0  0 15  0  0  0  0  6 96 53  0 33 33  0  0  6  6  0  0  7  0  0 17  0
## [26]  0  0  6  0 17  0  0  0  0  0  0 20  0  0  0  0  8  0  0  0  0  0  0  0  0
## [51]  0  0  0  0 16  0 13  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
## [76]  0 24  0  0  0  0  0  0 19  0  0 48

Rewrite the following for loop as a call to map(). The output should be a list of length 2.

mpg_by_year <- group_split(mpg, year)

mods <- list()

for (i in seq_along(mpg_by_year)) {
  mods[[i]] <- lm(hwy ~ displ + cyl, data = mpg_by_year[[i]])
}

# SAMPLE SOLUTION

map(mpg_by_year, ~lm(hwy ~ displ + cyl, data = .x))

## [[1]]
## 
## Call:
## lm(formula = hwy ~ displ + cyl, data = .x)
## 
## Coefficients:
## (Intercept)        displ          cyl  
##    35.95548     -3.67442     -0.08285  
## 
## 
## [[2]]
## 
## Call:
## lm(formula = hwy ~ displ + cyl, data = .x)
## 
## Coefficients:
## (Intercept)        displ          cyl  
##     40.5275      -0.4355      -2.5437

# SAMPLE SOLUTION
map(mpg_by_year, lm, formula = "hwy ~ displ + cyl")

## [[1]]
## 
## Call:
## .f(formula = "hwy ~ displ + cyl", data = .x[[i]])
## 
## Coefficients:
## (Intercept)        displ          cyl  
##    35.95548     -3.67442     -0.08285  
## 
## 
## [[2]]
## 
## Call:
## .f(formula = "hwy ~ displ + cyl", data = .x[[i]])
## 
## Coefficients:
## (Intercept)        displ          cyl  
##     40.5275      -0.4355      -2.5437
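The two sample solutions differ in how arguments reach the function being mapped. As a smaller sketch of the same idea, using a made-up list `readings`, the formula style spells out the call with .x standing for the current element, while the pass-through style hands any extra named arguments of map() on to the function.

```r
library(purrr)

# Hypothetical input, invented for illustration
readings <- list(c(1, NA, 3), c(4, 5, NA))

# Formula style: write the call explicitly, with .x as the current element
means_formula <- map_dbl(readings, ~ mean(.x, na.rm = TRUE))

# Pass-through style: na.rm = TRUE is forwarded to mean() on every call
means_passthrough <- map_dbl(readings, mean, na.rm = TRUE)

means_formula
# 2.0 4.5
means_passthrough
# 2.0 4.5
```

In the pass-through style, the current element is supplied as the first argument that has not already been matched by name. That is why, in the second lm() solution, naming formula leaves each data frame to be matched to data.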

Engagement

Prompt: What #questions do you still have about control flow and/or loops?