In this lab, we will learn how to subset vectors and lists.
Goal: by the end of this lab, you will be able to understand why different subset operators return objects of different types.
The three main subset operators are [, [[ and $. In addition, you will see functions that perform subsetting operations like:
- dplyr::filter(): the tidyverse way to select a subset of the rows of a data frame.
- dplyr::select(): the tidyverse way to select a subset of the columns of a data frame.
- dplyr::pull(): a tidyverse function analogous to [[.data.frame.
- purrr::pluck(): a tidyverse function analogous to [[.
It’s probably best to avoid these other similar functions:
- subset(): the base R way to select a subset of the rows of a data frame.
- rvest::pluck(): similar to purrr::pluck() but not as good
- magrittr::extract(): a wrapper to [
- magrittr::extract2(): a wrapper to [[
This image is helpful:

Key idea: there are six different ways to index a vector (or list).
The three most commonly used ways are:
TRUE
Note that indexing by a logical vector will generally return an object that is the same length as the original or shorter, whereas indexing by a numeric vector can return an object of any length.
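The length behavior noted above is easy to check on a small example. The following is a minimal sketch; the vector x and the index vectors keep and idx are made up for illustration and are not part of the lab's data.

```r
# Hypothetical toy vector; names and values are made up for illustration.
x <- c(a = 10, b = 20, c = 30, d = 40)

# Logical index of the same length as x: keeps the elements where the
# index is TRUE, so the result cannot be longer than x.
keep <- x > 15
x[keep]
length(x[keep])   # 3

# Numeric index: positions may repeat, so the result can be longer than x.
idx <- c(1, 1, 2, 4, 4, 4)
x[idx]
length(x[idx])    # 6
```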
Using dplyr, we would normally find the blue-eyed
characters using filter().
library(dplyr)  # assumed setup: provides starwars, filter(), and %>%
starwars %>%
  filter(eye_color == "blue")
Instead, we’ll use the base R functionality for subsetting vectors. First, we compute a logical vector that indicates whether each character has blue eyes.
lgl <- starwars$eye_color == "blue"
lgl
##  [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE
## [13]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [25]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
## [37]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
## [61]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE
Use [ and lgl to compute the subset of starwars characters who have blue eyes.
# SAMPLE SOLUTION
starwars[lgl, ]
Alternatively, we could use the which() function to return an integer vector of the corresponding indices.
num <- which(starwars$eye_color == "blue")
num
##  [1]  1  6  7 11 12 13 18 25 27 31 33 37 52 56 59 61 62 71 78
Use [ and num to compute the subset of starwars characters who have blue eyes.
# SAMPLE SOLUTION
starwars[num, ]
Compute the length of lgl and num. Are they the same? Why or why not?
Make sure that you understand the difference between what is happening in the first two exercises.
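One way to see the difference is to compare the two kinds of index vector on a tiny example. This is only a sketch: the vector eyes and the names lgl_toy and num_toy are hypothetical stand-ins, not the lab's lgl and num.

```r
# Hypothetical eye colors, not taken from starwars.
eyes <- c("blue", "brown", "blue", "yellow", "blue")

lgl_toy <- eyes == "blue"   # one TRUE/FALSE per element of eyes
num_toy <- which(lgl_toy)   # positions of the TRUE values only

length(lgl_toy)   # 5: same length as eyes
length(num_toy)   # 3: one entry per match
sum(lgl_toy)      # 3: the count of TRUE values equals length(num_toy)

# Both index vectors select the same elements.
identical(eyes[lgl_toy], eyes[num_toy])   # TRUE
```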
In addition to subsetting, you can also use index vectors to resample, or even oversample, a vector.
For example, we could double the previous results by repeating the index vector.
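On a plain character vector the effect is easy to inspect. The sketch below uses a made-up vector x and hypothetical index vectors num_toy and lgl_toy; it assumes nothing about starwars.

```r
# Hypothetical toy vector for illustration.
x <- c("a", "b", "c", "d")
num_toy <- c(2, 4)
lgl_toy <- c(FALSE, TRUE, FALSE, TRUE)

# Repeating a numeric index returns the selected elements twice.
x[c(num_toy, num_toy)]   # "b" "d" "b" "d"

# Repeating a logical index makes it longer than x; on an atomic vector
# the out-of-range TRUE positions come back as NA.
x[c(lgl_toy, lgl_toy)]   # "b" "d" NA NA

# A numeric index can also oversample: draw 10 positions with replacement.
set.seed(1)              # only to make the draw reproducible
resampled <- x[sample(length(x), size = 10, replace = TRUE)]
length(resampled)        # 10, longer than x itself
```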
# note error!
starwars[c(lgl, lgl), ]
## Error in `vectbl_as_row_location()`:
## ! Can't subset rows with `c(lgl, lgl)`.
## ✖ Logical subscript `c(lgl, lgl)` must be size 1 or 87, not 174.
# works, but not necessarily as intended -- output suppressed
# as.data.frame(starwars)[c(lgl, lgl), ]
# no warning
starwars[c(num, num), ]
Remember that a data frame is just a list of vectors (of the same length)! Thus, the subsetting rules governing lists also apply to data frames.
What is the class of starwars["name"]?
# SAMPLE SOLUTION
class(starwars["name"])
## [1] "beanumber"  "tbl_df"     "tbl"        "data.frame"
What is the class of starwars[["name"]]?
# SAMPLE SOLUTION
class(starwars[["name"]])
## [1] "character"
What is the class of starwars$name?
# SAMPLE SOLUTION
class(starwars$name)
## [1] "character"
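The same pattern shows up on any data frame or list. Here is a minimal sketch using a hypothetical base R data frame df and list lst; because df is a plain data.frame rather than a tibble, it reports fewer classes than starwars does.

```r
# Hypothetical base R data frame, made up for illustration.
df <- data.frame(
  name = c("Luke", "Leia"),
  height = c(172, 150),
  stringsAsFactors = FALSE
)

class(df["name"])     # "data.frame": [ keeps the container
class(df[["name"]])   # "character":  [[ extracts the column itself
class(df$name)        # "character":  $ extracts the column by name

# The same pattern holds for an ordinary list.
lst <- list(name = c("Luke", "Leia"), height = c(172, 150))
class(lst["name"])    # "list"
class(lst[["name"]])  # "character"
```

The single bracket always returns an object of the same kind as the one being subset, while the double bracket and $ return the element stored inside it.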
Storing the names of variables in vectors can be counter-intuitive.
Note that [ will work, $ will not, and
[[ will work only with vectors of length 1.
vars <- c("name", "height")
# works
starwars[, vars]
# doesn't work
starwars[[vars]]
## Error in `vectbl_as_col_location2()`:
## ! Can't extract column with `vars`.
## ✖ Subscript `vars` must be size 1, not 2.
# doesn't work
starwars$vars
## Warning: Unknown or uninitialised column: `vars`.
## NULL
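The same three calls can be tried on a base R data frame. This is a sketch under the assumption of a hypothetical two-column data frame df; a plain data.frame fails for a slightly different reason and with a different message than the tibble above, so the error is caught with tryCatch() rather than printed.

```r
# Hypothetical base R data frame, made up for illustration.
df <- data.frame(name = c("Luke", "Leia"), height = c(172, 150))
vars <- c("name", "height")

# [ accepts a vector of names and returns a data frame with those columns.
ncol(df[, vars])    # 2

# On a base data.frame, [[ treats a length-2 vector as recursive indexing
# (column "name", then element "height" inside it), which fails here.
res <- tryCatch(df[[vars]], error = function(e) "error")
res                 # "error"

# $ looks for a column literally named "vars", which does not exist.
is.null(df$vars)    # TRUE
```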
The behavior is also different when the vector of names is of length one.
my_var <- c("name")
# works
starwars[, my_var]
# works!
starwars[[my_var]]
# doesn't work
starwars$my_var
These inconsistencies are some of the many reasons to use the selection operators in select() instead.
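As a sketch of what that looks like, the example below uses all_of(), one of the tidyselect helpers that dplyr re-exports. all_of() is not used elsewhere in this lab, so treat it as one possible approach rather than the intended solution.

```r
library(dplyr)

vars <- c("name", "height")   # two column names
my_var <- "name"              # one column name

# all_of() tells select() to treat the character vector as column names,
# and it behaves the same way for one name or several.
two_cols <- starwars %>% select(all_of(vars))
one_col  <- starwars %>% select(all_of(my_var))

names(two_cols)   # "name" "height"
names(one_col)    # "name"
```

Both results are data frames, regardless of how many names the vector holds.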
?tidyselect::select_helpers
Why does starwars[[vars]] throw an error, but starwars[[my_var]] works? What is the logical inconsistency in the first case?
Take a minute to think about what questions you still have about
subsetting. Review what questions have been posted (in the
#questions channel) recently by other students and
either: