In this lab, we will learn how to use the lobstr package to get information about objects in our environment.
Goal: by the end of this lab, you will be able to determine whether an operation makes a copy, and compute the amount of memory each object occupies.
First, load the lobstr package. This package contains many functions that will make it easier for us to figure out what R is doing under the hood.
library(lobstr)
Your workspace should be empty. Use ls() to list the objects in your workspace. If it is not empty, use the broom icon to empty it.
# SAMPLE SOLUTION
ls()
## [1] "all_data" "catch_the_dots" "csv"
## [4] "fib" "fib_df" "fibonacci"
## [7] "filter" "func_env" "global_var"
## [10] "i" "li" "mods"
## [13] "mpg_by_year" "my_fun" "my_fun2"
## [16] "my_fun3" "nested_data" "posted"
## [19] "print.my_factor" "projects" "random_data"
## [22] "random_dist" "scale_color_smith" "starwars"
## [25] "sw2" "talks" "unique_values"
## [28] "unique_values_safe" "x"
Before we do anything, how much memory is being used by our R session?
mem_used()
## 106.14 MB
Recall that a byte is eight bits. A byte is a very small amount of information, typically used to store one character. A kilobyte is 1000 bytes, a megabyte is 1000 kilobytes, and so on. You should familiarize yourself briefly with the orders of magnitude of data.
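To make the units concrete, the short sketch below converts a raw byte count into decimal kilobytes and megabytes. The helper name to_units() is hypothetical and is introduced here only for illustration.

```r
# Hypothetical helper: express a byte count in decimal units,
# where 1 kB = 1000 bytes and 1 MB = 1000 kB.
to_units <- function(bytes) {
  c(B = bytes, kB = bytes / 1000, MB = bytes / 1000^2)
}

# 7200 bytes is 7.2 kB, or 0.0072 MB.
sizes <- to_units(7200)
print(sizes)
```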
Now suppose we add some things to our workspace. We can add objects, functions, or load packages. Does loading a package increase the memory used by our session?
Use the library() command to load the broom package. Then check the memory usage with mem_used(). Does loading a package increase the amount of memory used?
# SAMPLE SOLUTION
library(broom)
mem_used()
## 106.23 MB
What about loading a data set?
Use the data() command to load the iris data set. Does that increase the memory usage?
# SAMPLE SOLUTION
data(iris)
mem_used()
## 106.45 MB
Use obj_size() to measure the amount of memory that iris takes up. Was the increase you observed previously equal to this amount?
# SAMPLE SOLUTION
obj_size(iris)
## 7.20 kB
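One way to see why the two numbers can disagree is to compare them for an object you create yourself. The sketch below is an illustration and not part of the lab's sample solutions; the names big, before_big, and delta are made up for this example. mem_used() reports on the whole session, so its change also reflects anything else R allocated or released in the meantime, while obj_size() measures only the one object.

```r
library(lobstr)

before_big <- mem_used()   # session-wide memory before the allocation
big <- numeric(1e6)        # one million doubles, 8 bytes each
delta <- mem_used() - before_big

obj_size(big)              # size of this one object alone
delta                      # change in session memory; need not match exactly
```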
As much fun as it is to make copies, each copy occupies memory. Generally, we want to minimize the amount of memory that our code needs to run.
Let’s store the amount of memory we are currently using.
before <- mem_used()
Note the memory location of the iris data frame.
ref(iris)
## █ [1:0x7fe576e9c9f8] <df[,5]>
## ├─Sepal.Length = [2:0x7fe573864c00] <dbl>
## ├─Sepal.Width = [3:0x7fe573930e00] <dbl>
## ├─Petal.Length = [4:0x7fe56d383a00] <dbl>
## ├─Petal.Width = [5:0x7fe573b80c00] <dbl>
## └─Species = [6:0x7fe56c9ea7d0] <fct>
Note that we can bind a second name, my_iris, to the iris data frame without making a copy.
my_iris <- iris
ref(my_iris)
## █ [1:0x7fe576e9c9f8] <df[,5]>
## ├─Sepal.Length = [2:0x7fe573864c00] <dbl>
## ├─Sepal.Width = [3:0x7fe573930e00] <dbl>
## ├─Petal.Length = [4:0x7fe56d383a00] <dbl>
## ├─Petal.Width = [5:0x7fe573b80c00] <dbl>
## └─Species = [6:0x7fe56c9ea7d0] <fct>
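The same behavior can be checked on a small vector by comparing addresses with obj_addr() from lobstr. This is a minimal sketch with made-up names a and b, not part of the original lab.

```r
library(lobstr)

a <- c(1, 2, 3)
b <- a                          # binds a second name; no copy yet

obj_addr(a) == obj_addr(b)      # TRUE: both names point to the same object

b[[1]] <- 10                    # modifying b triggers a copy (copy-on-modify)
obj_addr(a) == obj_addr(b)      # FALSE: b now refers to a new object
```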
Now let’s change the data frame in a way that forces a copy to be made.
# mutate() and %>% come from dplyr, which must be loaded first
my_iris <- my_iris %>%
  mutate(sepal_area = Sepal.Length * Sepal.Width)
ref(my_iris)
## █ [1:0x7fe575f55b68] <df[,6]>
## ├─Sepal.Length = [2:0x7fe573864c00] <dbl>
## ├─Sepal.Width = [3:0x7fe573930e00] <dbl>
## ├─Petal.Length = [4:0x7fe56d383a00] <dbl>
## ├─Petal.Width = [5:0x7fe573b80c00] <dbl>
## ├─Species = [6:0x7fe56c9ea7d0] <fct>
## └─sepal_area = [7:0x7fe571972600] <dbl>
Note that the memory locations of my_iris and iris are not the same anymore. However, the memory locations of the underlying vectors are the same!
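A small base R data frame shows the same kind of shallow copy. The sketch below uses made-up names df1 and df2 and adds a column with $<- rather than mutate(), so it is an analogy to the step above and not a reproduction of it.

```r
library(lobstr)

df1 <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
df2 <- df1
df2$z <- df1$x * df1$y          # adding a column creates a new data frame object

obj_addr(df1) == obj_addr(df2)  # FALSE: the data frames themselves differ

# The original columns are expected to be shared rather than duplicated,
# so measuring both objects together should cost less than the sum of each.
obj_size(df1, df2)
obj_size(df1) + obj_size(df2)
```

When obj_size() is given several objects, it counts memory they share only once, which is why the combined figure is the more honest measure of what the two data frames cost together.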
Use before and mem_used() to calculate how much extra memory the copy of my_iris occupies.
# SAMPLE SOLUTION
mem_used() - before
## 226.98 kB
# SAMPLE SOLUTION
obj_size(my_iris$sepal_area)
## 1.25 kB
obj_size(iris)
## 7.20 kB
Different representations of the same data may have different memory footprints. Suppose we change the iris data set into its long format.
# pivot_longer() comes from tidyr, which must be loaded first
iris_long <- iris %>%
  pivot_longer(-Species, names_to = "type", values_to = "measurement")
Does iris_long take up the same amount of memory as iris? Why or why not?
# SAMPLE SOLUTION
obj_size(iris_long)
## 13.93 kB
obj_size(iris)
## 7.20 kB
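A rough back-of-the-envelope count helps explain the difference. The sketch assumes a 64-bit build of R, where a double takes 8 bytes, a factor code (an integer) takes 4 bytes, and each element of a character vector is an 8-byte pointer to a shared string. Per-object overhead such as headers and attributes is ignored, so both totals fall short of what obj_size() reports.

```r
# Wide format: 150 rows, four double columns and one factor column.
n_wide <- 150
wide_bytes <- n_wide * 4 * 8 + n_wide * 4            # doubles + factor codes

# Long format: 600 rows, one factor, one character, and one double column.
n_long <- n_wide * 4
long_bytes <- n_long * 4 + n_long * 8 + n_long * 8   # factor + pointers + doubles

wide_bytes   # 5400
long_bytes   # 12000
```

In the long format, the Species value and the measurement name are repeated for every single measurement, which accounts for most of the extra memory.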
We know that tibbles are like data.frames. Do they take up the same amount of memory?
before <- mem_used()
iris_tbl <- iris_long %>%
  as_tibble()
mem_used() - before
## 592 B
class(iris_tbl)
## [1] "tbl_df" "tbl" "data.frame"
Does converting a data.frame to a tbl force a copy?
ref(iris_tbl)
## █ [1:0x7fe576ed5ad8] <tibble[,3]>
## ├─Species = [2:0x7fe56c485a00] <fct>
## ├─type = [3:0x7fe56c1f0200] <chr>
## └─measurement = [4:0x7fe574ae2800] <dbl>
ref(iris_long)
## █ [1:0x7fe577acbcf8] <tibble[,3]>
## ├─Species = [2:0x7fe56c485a00] <fct>
## ├─type = [3:0x7fe56c1f0200] <chr>
## └─measurement = [4:0x7fe574ae2800] <dbl>
Unfortunately, due to various complications and optimizations, it’s
not always possible to reason ahead of time about whether R will make a
copy of an object. Instead, we can use the tracemem()
function to have R tell us whether it makes a copy and why.
First, note the memory location of iris.
tracemem(iris)
## [1] "<0x7fe576e9c9f8>"
We are now tracing this memory location. Some types of computations we make on iris do not require making a copy.
iris %>%
  pull(Petal.Length) %>%
  mean()
## [1] 3.758
iris %>%
  select(contains("Petal")) %>%
  head()
However, if we modify iris using mutate(), a copy does get made.
iris %>%
  mutate(petal_area = Petal.Length * Petal.Width) %>%
  as_tibble()
## tracemem[0x7fe576e9c9f8 -> 0x7fe576dde058]: initialize <Anonymous> mutate_cols mutate.data.frame mutate as_tibble %>% eval eval eval_with_user_handlers withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> withCallingHandlers suppressMessages render_one FUN lapply sapply <Anonymous> <Anonymous>
## tracemem[0x7fe576dde058 -> 0x7fe576dde0c8]: initialize <Anonymous> mutate_cols mutate.data.frame mutate as_tibble %>% eval eval eval_with_user_handlers withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> withCallingHandlers suppressMessages render_one FUN lapply sapply <Anonymous> <Anonymous>
## tracemem[0x7fe576e9c9f8 -> 0x7fe576dde918]: new_data_frame vec_data dplyr_vec_data as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate as_tibble %>% eval eval eval_with_user_handlers withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> withCallingHandlers suppressMessages render_one FUN lapply sapply <Anonymous> <Anonymous>
## tracemem[0x7fe576dde918 -> 0x7fe576dde988]: new_data_frame dplyr_vec_data as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate as_tibble %>% eval eval eval_with_user_handlers withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> withCallingHandlers suppressMessages render_one FUN lapply sapply <Anonymous> <Anonymous>
## tracemem[0x7fe576dde988 -> 0x7fe576dde9f8]: as.list.data.frame as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate as_tibble %>% eval eval eval_with_user_handlers withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> withCallingHandlers suppressMessages render_one FUN lapply sapply <Anonymous> <Anonymous>
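The output above is long because dplyr copies in several internal steps. A smaller case, sketched here with a plain numeric vector and made-up names v and w, shows the basic pattern: binding a second name prints nothing, while modifying through that name makes tracemem() report a copy. The sketch assumes R was built with memory profiling and checks this with capabilities("profmem") before tracing.

```r
v <- c(1, 2, 3)
can_trace <- capabilities("profmem")   # TRUE when tracemem() is supported

if (can_trace) tracemem(v)   # start tracing the object bound to v
w <- v                       # second binding: nothing is reported, no copy yet
w[[1]] <- 10                 # modification: tracemem reports a copy here
if (can_trace) untracemem(v) # stop tracing

v   # unchanged, because the modification went to the copy
w
```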
Experiment with tracemem(). Can you get a feel for what operations induce copies?
Garbage collection is the process of reclaiming memory that is no longer being used. R does this automatically, but you can force the issue with gc().
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 1502722 80.3 2281527 121.9 NA 2281527 121.9
## Vcells 2849784 21.8 8388608 64.0 51200 5154059 39.4
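To watch memory being reclaimed, the sketch below allocates a large vector, removes the only name bound to it, and compares the usage that gc() reports before and after. The helper used_mb() and the name tmp are hypothetical; the helper simply sums the first "(Mb)" column of the table that gc() returns.

```r
# Hypothetical helper: total memory in use, in Mb, summed over Ncells and
# Vcells. Column 2 of the gc() table is the "(Mb)" figure next to "used".
used_mb <- function() sum(gc()[, 2])

tmp <- numeric(1e7)          # roughly 80 MB of doubles
with_tmp <- used_mb()

rm(tmp)                      # drop the only name bound to the vector
without_tmp <- used_mb()     # gc() runs inside used_mb() and reclaims it

with_tmp - without_tmp       # memory released, in Mb
```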
Take a minute to think about what questions you still have about
names, values, and copies. Review what questions have been posted (in
the #questions
channel) recently by other students and
either: