In this lab, we will learn how to subset vectors and lists.

Goal: by the end of this lab, you will be able to explain why different subset operators return objects of different types.

Operators

The three main subset operators are [, [[ and $. In addition, you will see functions that perform subsetting operations like:

  • dplyr::filter(): the tidyverse way to select a subset of the rows of a data frame.
  • dplyr::select(): the tidyverse way to select a subset of the columns of a data frame.
  • dplyr::pull(): a tidyverse function analogous to [[.data.frame.
  • purrr::pluck(): a tidyverse function analogous to [[.

It’s probably best to avoid these other similar functions:

  • subset(): the base R way to select a subset of the rows of a data frame.
  • rvest::pluck(): similar to purrr::pluck(), but not as good.
  • magrittr::extract(): a wrapper around [.
  • magrittr::extract2(): a wrapper around [[.
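
To make the three main operators concrete, here is a minimal sketch in base R. The list x is a hypothetical example and is not part of the lab data. It shows that [ keeps the container, while [[ and $ return the element itself.

```r
# Hypothetical list used only for illustration
x <- list(a = 1:3, b = "hello")

x["a"]    # `[` keeps the container: a list of length 1
x[["a"]]  # `[[` extracts the element itself: an integer vector
x$a       # `$` extracts the element named by the literal name a

class(x["a"])    # "list"
class(x[["a"]])  # "integer"
class(x$a)       # "integer"
```

The difference in return type is the theme of the rest of this lab.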

Indexing

Key idea: there are six different ways to index a vector (or list).

The three most commonly used ways are:

  • with a numeric vector that selects the elements by index
  • with a logical vector that selects the elements that are TRUE
  • with a character vector that selects the elements by name
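
As a small sketch of the three ways listed above, the following uses a hypothetical named vector, heights, invented for illustration. All three index vectors select the same two elements. The last three lines show the remaining ways to index (negative integers, zero, and nothing).

```r
# Hypothetical named vector used only for illustration
heights <- c(luke = 172, leia = 150, han = 180, yoda = 66)

by_position <- heights[c(1, 3)]            # numeric: elements 1 and 3
by_logical  <- heights[heights > 160]      # logical: elements where TRUE
by_name     <- heights[c("luke", "han")]   # character: elements by name

identical(by_position, by_logical)  # TRUE
identical(by_position, by_name)     # TRUE

# The remaining three ways to index
heights[-1]  # negative integers drop elements
heights[0]   # zero returns an empty vector of the same type
heights[]    # nothing returns the whole vector
```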

Note that indexing by logical vector will generally return an object of the same length as the original (or smaller), whereas indexing by numeric vector can return an object of any length.
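
The note about lengths can be checked with a short sketch. The vector x is hypothetical. The last line shows why the note says "generally": in base R, a logical index longer than the vector pads the result with NA.

```r
# Hypothetical vector used only for illustration
x <- c(10, 20, 30, 40)

keep <- c(TRUE, FALSE, TRUE, FALSE)
length(x[keep])                  # 2: no longer than x

length(x[c(1, 1, 2, 2, 3, 3)])   # 6: numeric indices can repeat
length(x[integer(0)])            # 0: an empty index gives an empty result

# A shorter logical vector is recycled to the length of x
x[c(TRUE, FALSE)]                # 10 30

# A longer logical vector pads the result with NA
x[rep(TRUE, 5)]                  # 10 20 30 40 NA
```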

Using dplyr, we would normally find the blue-eyed characters using filter().

starwars %>%
  filter(eye_color == "blue")

Instead, we’ll use the base R functionality for subsetting vectors. First, we compute a logical vector that indicates whether each character has blue eyes.

lgl <- starwars$eye_color == "blue"
lgl
##  [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE
## [13]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [25]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
## [37]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
## [61]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE
  1. Use [ and lgl to compute the subset of starwars characters who have blue eyes.
# SAMPLE SOLUTION

starwars[lgl, ]

Alternatively, we could use the which() function to return an integer vector of the corresponding indices.

num <- which(starwars$eye_color == "blue")
num
##  [1]  1  6  7 11 12 13 18 25 27 31 33 37 52 56 59 61 62 71 78
  1. Use [ and num to compute the subset of starwars characters who have blue eyes.
# SAMPLE SOLUTION

starwars[num, ]
  1. Compute the length of lgl and num. Are they the same? Why or why not?

  2. Make sure that you understand the difference between what is happening in the first two exercises.
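
One practical difference between the two approaches shows up with missing values. The sketch below uses a hypothetical vector, eye, that contains an NA. A logical index carries the NA through to the result, whereas which() keeps only the positions that are TRUE.

```r
# Hypothetical vector with a missing value, used only for illustration
eye <- c("blue", "brown", NA, "blue")

lgl_small <- eye == "blue"      # TRUE FALSE NA TRUE
num_small <- which(lgl_small)   # 1 4

eye[lgl_small]  # "blue" NA "blue": the NA in the index produces an NA
eye[num_small]  # "blue" "blue": which() dropped the NA position
```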

Resampling

In addition to subsetting, you can also use index vectors to resample, or even oversample, a vector.

For example, we could double the previous results by repeating the index vector.

# note error!
starwars[c(lgl, lgl), ]
## Error in `vectbl_as_row_location()`:
## ! Can't subset rows with `c(lgl, lgl)`.
## ✖ Logical subscript `c(lgl, lgl)` must be size 1 or 87, not 174.
# works, but not necessarily as intended -- output suppressed
# as.data.frame(starwars)[c(lgl, lgl), ]

# no error: numeric indices may repeat
starwars[c(num, num), ]
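
The same idea works on plain vectors. The following sketch, using a hypothetical vector x and an arbitrary seed, doubles a vector by repeating its indices and then oversamples it by drawing indices with replacement using sample().

```r
set.seed(1)  # arbitrary seed so the random draw is reproducible

# Hypothetical vector used only for illustration
x <- c(a = 10, b = 20, c = 30)

# Repeat every index twice over: a b c a b c
doubled <- x[rep(seq_along(x), times = 2)]

# Draw 10 indices with replacement, then subset with them
idx <- sample(seq_along(x), size = 10, replace = TRUE)
oversampled <- x[idx]

length(doubled)      # 6
length(oversampled)  # 10
```

Sampling indices rather than values is the approach that carries over to data frames, where the index vector selects whole rows.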

Application

Remember that a data frame is just a list of vectors (of the same length)! Thus, the subsetting rules governing lists also apply to data frames.

  1. What is the type of the result of starwars["name"]?
# SAMPLE SOLUTION

class(starwars["name"])
## [1] "tbl_df"     "tbl"        "data.frame"
  1. What is the type of the result of starwars[["name"]]?
# SAMPLE SOLUTION

class(starwars[["name"]])
## [1] "character"
  1. What is the type of the result of starwars$name?
# SAMPLE SOLUTION

class(starwars$name)
## [1] "character"
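
The same pattern can be seen without dplyr. The sketch below uses a small hypothetical base R data frame, df, to confirm that a data frame is a list and that the three operators behave as they do on lists. Note that the last line is specific to base data frames: a tibble such as starwars keeps the result of [, "name"] as a data frame.

```r
# Hypothetical data frame used only for illustration
df <- data.frame(
  name = c("Luke", "Leia"),
  height = c(172, 150),
  stringsAsFactors = FALSE
)

is.list(df)          # TRUE: a data frame is a list of columns
length(df)           # 2: the number of columns

class(df["name"])    # "data.frame"
class(df[["name"]])  # "character"
class(df$name)       # "character"

# Base data frames drop a single column to a vector here
class(df[, "name"])  # "character"
```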

Subsetting with variable names stored in a vector can be counter-intuitive. Note that [ will work, $ will not, and [[ will work only with vectors of length 1.

vars <- c("name", "height")

# works
starwars[, vars]
# doesn't work
starwars[[vars]]
## Error in `vectbl_as_col_location2()`:
## ! Can't extract column with `vars`.
## ✖ Subscript `vars` must be size 1, not 2.
# doesn't work
starwars$vars
## Warning: Unknown or uninitialised column: `vars`.
## NULL

The behavior is also different when the vector of names is of length one.

my_var <- c("name")

# works
starwars[, my_var]

# works!
starwars[[my_var]]

# doesn't work
starwars$my_var
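
The reason $ fails here is that it treats what follows as a literal name rather than evaluating it. A minimal sketch with a hypothetical base R data frame, df, makes this visible. Unlike the tibble above, a base data frame returns NULL without a warning.

```r
# Hypothetical data frame used only for illustration
df <- data.frame(
  name = c("Luke", "Leia"),
  height = c(172, 150),
  stringsAsFactors = FALSE
)
my_var <- "name"

df[[my_var]]     # evaluates my_var, so this is df[["name"]]
df$my_var        # looks for a column literally called my_var: NULL
df[["my_var"]]   # the equivalent [[ call with a literal string: NULL
```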

These inconsistencies are some of the many reasons to use the selection helpers in select() instead.

?tidyselect::select_helpers
  1. Why does starwars[[vars]] throw an error, while starwars[[my_var]] works? What is the logical inconsistency in the first case?

Engagement

Take a minute to think about what questions you still have about subsetting. Review the questions that other students have recently posted in the #questions channel and either:

  • respond (e.g., react, comment, clarify, or answer)
  • post a new question