map() family of
functionsIn this lab, we will learn how to use variants of map()
to iterate tasks.
Goal: by the end of this lab, you should be able to
use variants of map() to do complex operations with a few
lines of code.
map()In this lab, we are going to be generating a bunch of random data.
First, we’re going to create a vector that holds n,
which represents how much random data we are going to generate in each
simulation. We’re going to store this as a tibble with one
column (for now).
random_dist <- tibble(
n = 1:20 * 100
)Next, we use the map() function to generate a
list of random vectors, each drawn from a normal
distribution with a different number of samples drawn in each case.
random_data <- map(random_dist$n, rnorm)
str(random_data)## List of 20
## $ : num [1:100] 0.793 0.522 1.746 -1.271 2.197 ...
## $ : num [1:200] -0.8109 -2.9438 -0.0188 -0.3547 0.0356 ...
## $ : num [1:300] -1.181 -1.228 0.457 -0.251 -0.338 ...
## $ : num [1:400] -0.00654 2.08749 -0.38644 0.75195 1.63085 ...
## $ : num [1:500] 0.0952 -0.0375 0.9213 1.6603 0.7964 ...
## $ : num [1:600] 0.0171 2.5526 -0.0252 0.4658 -0.0748 ...
## $ : num [1:700] -0.857 0.145 0.376 -0.454 -0.247 ...
## $ : num [1:800] -0.899 0.66 -0.212 0.492 0.198 ...
## $ : num [1:900] 0.156 -0.666 -1.518 0.614 -1.704 ...
## $ : num [1:1000] 1.69 -1.048 0.994 -1.896 -0.806 ...
## $ : num [1:1100] 0.733 -0.779 -1.366 -0.726 0.87 ...
## $ : num [1:1200] 0.346 0.487 0.718 1.327 1.016 ...
## $ : num [1:1300] 1.079 -0.963 -0.828 1.548 0.144 ...
## $ : num [1:1400] -1.4 -2.11 1.04 -2.69 -1.07 ...
## $ : num [1:1500] 0.484 0.888 -1.834 1.014 -0.062 ...
## $ : num [1:1600] 1.026 -1.665 -1.001 -0.537 -0.419 ...
## $ : num [1:1700] -1.456 1.223 -0.462 0.588 0.828 ...
## $ : num [1:1800] 1.599 -0.364 -0.612 -1.786 1.263 ...
## $ : num [1:1900] 0.202 -0.291 -0.709 -0.285 -0.485 ...
## $ : num [1:2000] -0.2624 1.5296 -0.0824 -0.4526 1.2523 ...
To keep things organized, let’s save the list of random
data to our original tibble. Note that data is
now a list-column.
random_dist$data <- random_data
random_distThis is compact way (20 rows and 2 columns) to store what are really
21,000 random numbers! You can see the long version by
unnest()-ing.
random_dist %>%
unnest(cols = data) %>%
nrow()## [1] 21000
Next, we can perform some rudimentary analysis on each randomly generated data set. Here, we compute the means and standard deviations of each simulated data set.
random_dist <- random_dist %>%
mutate(
means = map_dbl(data, mean),
sds = map_dbl(data, sd)
)median() function to compute the median of each data set.
Add them to the random_dist data frame.# SAMPLE SOLUTION
random_dist <- random_dist %>%
mutate(medians = map_dbl(data, median))# SAMPLE SOLUTION
ggplot(random_dist, aes(x = n, y = sds)) +
geom_point()
In the previous examples, the vector that we were iterating over was the first argument to the function that we wanted to apply. This is typical, but not the only situation we may find ourselves in.
Here, we want to experiment with different trimmed means. That is, we want to compute the mean of a single simulated data set after throwing away either 10% or 25% of the data.
In this case, we want to iterate over the trim argument
to mean() instead of the x argument. To do
this, we make use of the .x syntax made possible by the
~.
map_dbl(
c(0.1, 0.25),
~mean(x = random_dist$data[[1]], trim = .x)
)## [1] 0.08418324 0.08603153
map_dbl()
insufficient for this task? What kind of function would we need?Now we’re going to write each one of our data sets to a separate CSV
file. We’ll name each file using the sample size, so that we can tell
them apart later. The path() function from the
fs package will specify the full path. The
tempdir() function returns the path to the temp directory
on your computer.
random_dist <- random_dist %>%
mutate(
filename = paste0("random_data_", n, ".csv"),
file = fs::path(tempdir(), filename)
)Note that each entry in the data list-column is a double
vector. In order to write a CSV, we need this to be a
data.frame. The enframe() function converts a
vector into a tibble. We can use map() to do
this to each data set.
random_dist <- random_dist %>%
mutate(data_frame = map(data, enframe))Finally, we’re ready to write the files. Note that we need to know
both the path to the CSV file that we want to create, and the data set
itself. The pwalk() function will step through our data
frame row-by-row. To make things easier, we’ll rename the
data column x, so it matches the argument name
write_csv() is expecting.
args(write_csv)## function (x, file, na = "NA", append = FALSE, col_names = !append,
## quote = c("needed", "all", "none"), escape = c("double",
## "backslash", "none"), eol = "\n", num_threads = readr_threads(),
## progress = show_progress(), path = deprecated(), quote_escape = deprecated())
## NULL
random_dist %>%
select(x = data_frame, file) %>%
pwalk(write_csv)pwalk() statement
without the renaming (i.e., select())
step. To do this you will need to use the ~
formulation.To confirm that this worked, use fs::dir_ls() to
retrieve the list of CSVs in your temp directory.
csv <- fs::dir_ls(tempdir(), regexp = "\\.csv")
csv## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_100.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_1000.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_1100.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_1200.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_1300.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_1400.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_1500.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_1600.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_1700.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_1800.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_1900.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_200.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_2000.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_300.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_400.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_500.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_600.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_700.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_800.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_900.csv
Another useful application of map() is to read in a
bunch of files. Here, we use the list of CSV paths and
read_csv() to build a large data set of all of the
simulated data. Note that the .id argument to
map_dfr() ensures that the filename is included as a column
in the resulting data frame. Otherwise, we wouldn’t know which data set
came from which file!
all_data <- csv %>%
map_dfr(read_csv, .id = "filename")
all_dataAn application of nest() brings the data back into a
compact form.
nested_data <- all_data %>%
nest(data = c(name, value))
nested_dataPrompt: What questions do you still have about
map()?