In this lab, we will learn how to subset vectors and lists.
Goal: by the end of this lab, you will be able to understand why different subset operators return objects of different types.
The three main subset operators are [, [[, and $. In addition, you will see functions that perform subsetting operations like:
dplyr::filter(): the tidyverse way to select a subset of the rows of a data frame.
dplyr::select(): the tidyverse way to select a subset of the columns of a data frame.
dplyr::pull(): a tidyverse function analogous to [[.data.frame.
purrr::pluck(): a tidyverse function analogous to [[.
It’s probably best to avoid these other similar functions:
subset(): the base R way to select a subset of the rows of a data frame.
rvest::pluck(): similar to purrr::pluck() but not as good.
magrittr::extract(): a wrapper to [.
magrittr::extract2(): a wrapper to [[.
This image is helpful:
Key idea: there are six different ways to index a vector (or list).
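The three most commonly used ways are: by a numeric vector of positions, by a logical vector (elements are kept where the index is TRUE), and by a character vector of names.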
Note that indexing by logical vector will generally return an object of the same length as the original (or smaller), whereas indexing by numeric vector can return an object of any length.
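As a rough sketch of indexing by position, by logical vector, and by name, consider a toy named vector. The vector x and its names are made up for illustration and are not part of the lab data.

```r
# A toy named vector, invented for illustration
x <- c(a = 10, b = 20, c = 30)

# Numeric index: positions may repeat, so the result can be any length
by_position <- x[c(1, 3, 3)]

# Logical index: keeps the elements where the index is TRUE
by_logical <- x[c(TRUE, FALSE, TRUE)]

# Character index: looks elements up by name
by_name <- x[c("a", "c")]
```

Here by_position has three elements even though only two distinct positions were requested, while by_logical cannot be longer than x.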
Using dplyr, we would normally find the blue-eyed characters using filter().
starwars %>%
  filter(eye_color == "blue")
Instead, we’ll use the base R functionality for subsetting vectors. First, we compute a logical vector that indicates whether each character has blue eyes.
lgl <- starwars$eye_color == "blue"
lgl
## [1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE
## [13] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
## [37] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
## [61] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE
Use [ and lgl to compute the subset of starwars characters who have blue eyes.
# SAMPLE SOLUTION
starwars[lgl, ]
Alternatively, we could use the which() function to return an integer vector of the corresponding indices.
num <- which(starwars$eye_color == "blue")
num
## [1] 1 6 7 11 12 13 18 25 27 31 33 37 52 56 59 61 62 71 78
Use [ and num to compute the subset of starwars characters who have blue eyes.
# SAMPLE SOLUTION
starwars[num, ]
Compute the length of lgl and num. Are they the same? Why or why not?
Make sure that you understand the difference between what is happening in the first two exercises.
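The contrast between the two kinds of index vector can be sketched on a small example. The vector eyes and the names lgl_toy and num_toy are hypothetical stand-ins, chosen so they do not overwrite lgl and num from the exercises.

```r
# A toy stand-in for starwars$eye_color, invented for illustration
eyes <- c("blue", "brown", "blue", "green")

# Logical index: one TRUE/FALSE per element of the original vector
lgl_toy <- eyes == "blue"

# Integer index: only the positions where the condition holds
num_toy <- which(eyes == "blue")

length(lgl_toy)  # 4, the length of eyes
length(num_toy)  # 2, the number of matches

# Both index vectors select the same elements
same_result <- identical(eyes[lgl_toy], eyes[num_toy])
```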
In addition to subsetting, you can also use index vectors to resample, or even oversample, a vector.
For example, we could double the previous results by repeating the index vector.
# note error!
starwars[c(lgl, lgl), ]
## Error in `vectbl_as_row_location()`:
## ! Can't subset rows with `c(lgl, lgl)`.
## ✖ Logical subscript `c(lgl, lgl)` must be size 1 or 87, not 174.
# works, but not necessarily as intended -- output suppressed
# as.data.frame(starwars)[c(lgl, lgl), ]
# no warning
starwars[c(num, num), ]
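To see why repeating a numeric index behaves differently from repeating a logical one, here is a minimal sketch on a plain base R vector. The names x, idx, and keep are invented for illustration, and base vectors are more permissive than tibbles, which raise the error shown above.

```r
# A toy vector, invented for illustration
x <- c("a", "b", "c")

idx  <- c(1, 3)               # numeric index
keep <- c(TRUE, FALSE, TRUE)  # logical index selecting the same elements

# Repeating a numeric index oversamples: each position is returned twice
doubled_num <- x[c(idx, idx)]

# Repeating a logical index makes it longer than x; the TRUE values past
# the end of x point at elements that do not exist and come back as NA
doubled_lgl <- x[c(keep, keep)]
```

Here doubled_num is "a" "c" "a" "c", whereas doubled_lgl is "a" "c" NA NA, which is the sense in which the base R version works but not necessarily as intended.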
Remember that a data frame is just a list of vectors (of the same length)! Thus, the subsetting rules governing lists also apply to data frames.
What is the class of starwars["name"]?
# SAMPLE SOLUTION
class(starwars["name"])
## [1] "beanumber" "tbl_df" "tbl" "data.frame"
What is the class of starwars[["name"]]?
# SAMPLE SOLUTION
class(starwars[["name"]])
## [1] "character"
What is the class of starwars$name?
# SAMPLE SOLUTION
class(starwars$name)
## [1] "character"
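The same pattern can be reproduced without dplyr. The sketch below uses a small base R data frame, df, whose contents are made up for illustration. Because df is a plain data frame rather than a tibble, the class of the first result is just "data.frame".

```r
# A toy data frame, invented for illustration
df <- data.frame(
  name = c("Luke", "Leia"),
  height = c(172, 150),
  stringsAsFactors = FALSE
)

# [ keeps the container: the result is still a data frame (a list)
cls_single <- class(df["name"])

# [[ and $ extract the element itself: here, a character vector
cls_double <- class(df[["name"]])
cls_dollar <- class(df$name)
```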
Storing the names of variables in vectors can be counter-intuitive. Note that [ will work, $ will not, and [[ will work only with vectors of length 1.
vars <- c("name", "height")
# works
starwars[, vars]
# doesn't work
starwars[[vars]]
## Error in `vectbl_as_col_location2()`:
## ! Can't extract column with `vars`.
## ✖ Subscript `vars` must be size 1, not 2.
# doesn't work
starwars$vars
## Warning: Unknown or uninitialised column: `vars`.
## NULL
The behavior is also different when the vector of names is of length one.
my_var <- c("name")
# works
starwars[, my_var]
# works!
starwars[[my_var]]
# doesn't work
starwars$my_var
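The three behaviors can be collected in one sketch on a base R data frame. The data frame df and its contents are hypothetical. A plain data frame reports the [[ failure with a different error message than the tibble output above, and returns NULL from $ without a warning.

```r
# A toy data frame, invented for illustration
df <- data.frame(
  name = c("Luke", "Leia"),
  height = c(172, 150),
  stringsAsFactors = FALSE
)
vars   <- c("name", "height")
my_var <- "name"

# [ accepts a character vector of any length
two_cols <- df[, vars]

# [[ extracts exactly one element, so a length-1 vector works
one_col <- df[[my_var]]

# $ does not evaluate its argument: it looks for a column literally
# called "my_var", finds none, and returns NULL
dollar_result <- df$my_var

# [[ with a length-2 vector fails; try() captures the error
double_fails <- inherits(try(df[[vars]], silent = TRUE), "try-error")
```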
These inconsistencies are some of the many reasons to use the selection operators in select() instead.
?tidyselect::select_helpers
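As one possible illustration, the all_of() selection helper accepts a character vector of column names of any length. This sketch assumes that a reasonably recent dplyr (one that re-exports all_of() from tidyselect) is installed. The data frame df is made up for illustration.

```r
library(dplyr)

# A toy data frame, invented for illustration
df <- data.frame(
  name = c("Luke", "Leia"),
  height = c(172, 150),
  mass = c(77, 49),
  stringsAsFactors = FALSE
)
vars   <- c("name", "height")
my_var <- "name"

# all_of() behaves the same way for one name or several,
# and select() always returns a data frame
picked_many <- select(df, all_of(vars))
picked_one  <- select(df, all_of(my_var))
```

Unlike [[, the result does not change type depending on the length of the name vector.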
Why does starwars[[vars]] throw an error, but starwars[[my_var]] works? What is the logical inconsistency in the first case?
Take a minute to think about what questions you still have about subsetting. Review what questions have been posted (in the #questions channel) recently by other students and either: