map()
family of
functionsIn this lab, we will learn how to use variants of map()
to iterate tasks.
Goal: by the end of this lab, you should be able to
use variants of map()
to do complex operations with a few
lines of code.
map()
In this lab, we are going to be generating a bunch of random data.
First, we’re going to create a vector that holds n
,
which represents how much random data we are going to generate in each
simulation. We’re going to store this as a tibble
with one
column (for now).
<- tibble(
random_dist n = 1:20 * 100
)
Next, we use the map()
function to generate a
list
of random vectors, each drawn from a normal
distribution with a different number of samples drawn in each case.
<- map(random_dist$n, rnorm)
random_data str(random_data)
## List of 20
## $ : num [1:100] 0.793 0.522 1.746 -1.271 2.197 ...
## $ : num [1:200] -0.8109 -2.9438 -0.0188 -0.3547 0.0356 ...
## $ : num [1:300] -1.181 -1.228 0.457 -0.251 -0.338 ...
## $ : num [1:400] -0.00654 2.08749 -0.38644 0.75195 1.63085 ...
## $ : num [1:500] 0.0952 -0.0375 0.9213 1.6603 0.7964 ...
## $ : num [1:600] 0.0171 2.5526 -0.0252 0.4658 -0.0748 ...
## $ : num [1:700] -0.857 0.145 0.376 -0.454 -0.247 ...
## $ : num [1:800] -0.899 0.66 -0.212 0.492 0.198 ...
## $ : num [1:900] 0.156 -0.666 -1.518 0.614 -1.704 ...
## $ : num [1:1000] 1.69 -1.048 0.994 -1.896 -0.806 ...
## $ : num [1:1100] 0.733 -0.779 -1.366 -0.726 0.87 ...
## $ : num [1:1200] 0.346 0.487 0.718 1.327 1.016 ...
## $ : num [1:1300] 1.079 -0.963 -0.828 1.548 0.144 ...
## $ : num [1:1400] -1.4 -2.11 1.04 -2.69 -1.07 ...
## $ : num [1:1500] 0.484 0.888 -1.834 1.014 -0.062 ...
## $ : num [1:1600] 1.026 -1.665 -1.001 -0.537 -0.419 ...
## $ : num [1:1700] -1.456 1.223 -0.462 0.588 0.828 ...
## $ : num [1:1800] 1.599 -0.364 -0.612 -1.786 1.263 ...
## $ : num [1:1900] 0.202 -0.291 -0.709 -0.285 -0.485 ...
## $ : num [1:2000] -0.2624 1.5296 -0.0824 -0.4526 1.2523 ...
To keep things organized, let’s save the list
of random
data to our original tibble
. Note that data
is
now a list-column.
$data <- random_data
random_dist random_dist
This is compact way (20 rows and 2 columns) to store what are really
21,000 random numbers! You can see the long version by
unnest()
-ing.
%>%
random_dist unnest(cols = data) %>%
nrow()
## [1] 21000
Next, we can perform some rudimentary analysis on each randomly generated data set. Here, we compute the means and standard deviations of each simulated data set.
<- random_dist %>%
random_dist mutate(
means = map_dbl(data, mean),
sds = map_dbl(data, sd)
)
median()
function to compute the median of each data set.
Add them to the random_dist
data frame.# SAMPLE SOLUTION
<- random_dist %>%
random_dist mutate(medians = map_dbl(data, median))
# SAMPLE SOLUTION
ggplot(random_dist, aes(x = n, y = sds)) +
geom_point()
In the previous examples, the vector that we were iterating over was the first argument to the function that we wanted to apply. This is typical, but not the only situation we may find ourselves in.
Here, we want to experiment with different trimmed means. That is, we want to compute the mean of a single simulated data set after throwing away either 10% or 25% of the data.
In this case, we want to iterate over the trim
argument
to mean()
instead of the x
argument. To do
this, we make use of the .x
syntax made possible by the
~
.
map_dbl(
c(0.1, 0.25),
~mean(x = random_dist$data[[1]], trim = .x)
)
## [1] 0.08418324 0.08603153
map_dbl()
insufficient for this task? What kind of function would we need?Now we’re going to write each one of our data sets to a separate CSV
file. We’ll name each file using the sample size, so that we can tell
them apart later. The path()
function from the
fs
package will specify the full path. The
tempdir()
function returns the path to the temp directory
on your computer.
<- random_dist %>%
random_dist mutate(
filename = paste0("random_data_", n, ".csv"),
file = fs::path(tempdir(), filename)
)
Note that each entry in the data
list-column is a double
vector. In order to write a CSV, we need this to be a
data.frame
. The enframe()
function converts a
vector into a tibble
. We can use map()
to do
this to each data set.
<- random_dist %>%
random_dist mutate(data_frame = map(data, enframe))
Finally, we’re ready to write the files. Note that we need to know
both the path to the CSV file that we want to create, and the data set
itself. The pwalk()
function will step through our data
frame row-by-row. To make things easier, we’ll rename the
data
column x
, so it matches the argument name
write_csv()
is expecting.
args(write_csv)
## function (x, file, na = "NA", append = FALSE, col_names = !append,
## quote = c("needed", "all", "none"), escape = c("double",
## "backslash", "none"), eol = "\n", num_threads = readr_threads(),
## progress = show_progress(), path = deprecated(), quote_escape = deprecated())
## NULL
%>%
random_dist select(x = data_frame, file) %>%
pwalk(write_csv)
pwalk()
statement
without the renaming (i.e., select()
)
step. To do this you will need to use the ~
formulation.To confirm that this worked, use fs::dir_ls()
to
retrieve the list of CSVs in your temp directory.
<- fs::dir_ls(tempdir(), regexp = "\\.csv")
csv csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_100.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_1000.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_1100.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_1200.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_1300.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_1400.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_1500.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_1600.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_1700.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_1800.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_1900.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_200.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_2000.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_300.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_400.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_500.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_600.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_700.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_800.csv
## /var/folders/d_/6fhql14j7hd93zfzcn1x3ttxn8st2y/T/Rtmpopa0Ng/random_data_900.csv
Another useful application of map()
is to read in a
bunch of files. Here, we use the list of CSV paths and
read_csv()
to build a large data set of all of the
simulated data. Note that the .id
argument to
map_dfr()
ensures that the filename is included as a column
in the resulting data frame. Otherwise, we wouldn’t know which data set
came from which file!
<- csv %>%
all_data map_dfr(read_csv, .id = "filename")
all_data
An application of nest()
brings the data back into a
compact form.
<- all_data %>%
nested_data nest(data = c(name, value))
nested_data
Prompt: What questions do you still have about
map()
?