In this lab, we will learn how to use the lobstr package to get information about objects in our environment.

Goal: by the end of this lab, you will be able to determine whether an operation makes a copy, and compute the amount of memory each object occupies.

Measuring memory

First, load the lobstr package. It provides functions that make it easier to see what R is doing under the hood. Later steps also use the pipe (%>%), mutate(), pivot_longer(), and as_tibble(), so load the tidyverse as well.

library(lobstr)
library(tidyverse)

Your workspace should be empty.

  1. Use ls() to list the objects in your workspace. If it is not empty, use the broom icon in RStudio's Environment pane to empty it.
# SAMPLE SOLUTION

ls()
##  [1] "all_data"           "catch_the_dots"     "csv"               
##  [4] "fib"                "fib_df"             "fibonacci"         
##  [7] "filter"             "func_env"           "global_var"        
## [10] "i"                  "li"                 "mods"              
## [13] "mpg_by_year"        "my_fun"             "my_fun2"           
## [16] "my_fun3"            "nested_data"        "posted"            
## [19] "print.my_factor"    "projects"           "random_data"       
## [22] "random_dist"        "scale_color_smith"  "starwars"          
## [25] "sw2"                "talks"              "unique_values"     
## [28] "unique_values_safe" "x"

Before we do anything, how much memory is being used by our R session?

mem_used()
## 106.14 MB

Recall that a byte is eight bits. A byte is a very small amount of information, typically enough to store one character. A kilobyte (kB) is 1000 bytes, a megabyte (MB) is 1000 kilobytes, and so on. You should familiarize yourself briefly with the orders of magnitude of data.
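As a rough worked example of these units, the sketch below assumes the decimal prefixes described above and 8 bytes per double-precision number. The variable names are only illustrative.

```r
# A double-precision number occupies 8 bytes (64 bits).
bytes_per_double <- 8
n_values <- 150                      # e.g. one numeric column of iris

payload_bytes <- n_values * bytes_per_double
payload_bytes                        # 1200 bytes
payload_bytes * 8                    # 9600 bits
payload_bytes / 1000                 # 1.2 kB
payload_bytes / 1000^2               # 0.0012 MB

# The full object is slightly larger than the payload,
# because R also stores a header for every vector.
object.size(numeric(n_values))
```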

Now suppose we add some things to our workspace. We can create objects, define functions, or load packages. Does loading a package increase the memory used by our session?

  1. Use the library() command to load the broom package. Then check the memory usage with mem_used(). Does loading a package increase the amount of memory used?
# SAMPLE SOLUTION

library(broom)
mem_used()
## 106.23 MB

What about loading a data set?

  1. Use the data() command to load the iris data set. Does that increase the memory usage?
# SAMPLE SOLUTION

data(iris)
mem_used()
## 106.45 MB
  1. Use obj_size() to measure the amount of memory that iris takes up. Was the increase you observed previously equal to this amount?
# SAMPLE SOLUTION

obj_size(iris)
## 7.20 kB

Making copies

As much fun as it is to make copies, each copy occupies memory. Generally, we want to minimize the amount of memory that our code needs to run.

Let’s store the amount of memory we are currently using.

before <- mem_used()

Note the memory location of the iris data frame.

ref(iris)
## █ [1:0x7fe576e9c9f8] <df[,5]> 
## ├─Sepal.Length = [2:0x7fe573864c00] <dbl> 
## ├─Sepal.Width = [3:0x7fe573930e00] <dbl> 
## ├─Petal.Length = [4:0x7fe56d383a00] <dbl> 
## ├─Petal.Width = [5:0x7fe573b80c00] <dbl> 
## └─Species = [6:0x7fe56c9ea7d0] <fct>

Note that we can bind a second name my_iris to the iris data frame, without making a copy.

my_iris <- iris
ref(my_iris)
## █ [1:0x7fe576e9c9f8] <df[,5]> 
## ├─Sepal.Length = [2:0x7fe573864c00] <dbl> 
## ├─Sepal.Width = [3:0x7fe573930e00] <dbl> 
## ├─Petal.Length = [4:0x7fe56d383a00] <dbl> 
## ├─Petal.Width = [5:0x7fe573b80c00] <dbl> 
## └─Species = [6:0x7fe56c9ea7d0] <fct>
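One way to confirm that the second name did not cost extra memory is to measure both names together. A minimal sketch, assuming lobstr is installed: obj_size() accepts several objects at once and counts memory they share only once.

```r
library(lobstr)

my_iris <- iris                  # second name bound to the same data frame

obj_size(iris)                   # size of the data frame on its own
obj_size(iris, my_iris)          # combined size; shared memory is counted once

# Compare the addresses of the two names directly
obj_addr(iris) == obj_addr(my_iris)
```

If the two sizes agree and the addresses match, no copy was made.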

Now let’s change the data frame in a way that forces a copy to be made.

my_iris <- my_iris %>%
  mutate(sepal_area = Sepal.Length * Sepal.Width)
ref(my_iris)
## █ [1:0x7fe575f55b68] <df[,6]> 
## ├─Sepal.Length = [2:0x7fe573864c00] <dbl> 
## ├─Sepal.Width = [3:0x7fe573930e00] <dbl> 
## ├─Petal.Length = [4:0x7fe56d383a00] <dbl> 
## ├─Petal.Width = [5:0x7fe573b80c00] <dbl> 
## ├─Species = [6:0x7fe56c9ea7d0] <fct> 
## └─sepal_area = [7:0x7fe571972600] <dbl>

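Note that the memory locations of my_iris and iris are not the same anymore. However, the memory locations of the underlying vectors are the same! In other words, mutate() made a shallow copy: a new data frame object whose existing columns still point to the original vectors, plus one new vector for sepal_area.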
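The same behaviour can be checked without dplyr. The sketch below uses a small made-up data frame (df, df2, and the column names are hypothetical) and base R's $<- in place of mutate(), then compares addresses with obj_addr() and obj_addrs().

```r
library(lobstr)

df <- data.frame(a = runif(10), b = runif(10))   # toy data for illustration
df2 <- df                                        # second name, no copy yet
df2$c <- df2$a * df2$b                           # modifying df2 forces a copy

# The two data frame objects should now live at different addresses ...
obj_addr(df) == obj_addr(df2)

# ... while the untouched columns should still be shared
unname(obj_addrs(df)) == unname(obj_addrs(df2))[1:2]
```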

  1. Use before and mem_used() to calculate how much extra memory the copy of my_iris occupies.
# SAMPLE SOLUTION

mem_used() - before
## 226.98 kB
  1. Is the difference you observed above equal to the size of the new column we created? Why or why not?
# SAMPLE SOLUTION

obj_size(my_iris$sepal_area)
## 1.25 kB
obj_size(iris)
## 7.20 kB

Different representations of the same data may have different memory footprints. Suppose we change the iris data set into its long format.

iris_long <- iris %>%
  pivot_longer(-Species, names_to = "type", values_to = "measurement")
  1. Does iris_long take up the same amount of memory as iris? Why or why not?
# SAMPLE SOLUTION

obj_size(iris_long)
## 13.93 kB
obj_size(iris)
## 7.20 kB
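To see where the extra memory in the long format goes, it can help to look at the shape of each table and the size of each column separately. This is a sketch that assumes tidyr and lobstr are installed. The anonymous helper inside sapply() is only for illustration.

```r
library(lobstr)
library(tidyr)

iris_long <- pivot_longer(iris, -Species,
                          names_to = "type", values_to = "measurement")

dim(iris)        # 150 rows, 5 columns
dim(iris_long)   # 600 rows, 3 columns

# Size of each column of the long table on its own, in bytes
sapply(iris_long, function(col) as.numeric(obj_size(col)))
```

The long table stores the same 600 measurements, but it also repeats Species and a type label on every row.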

We know that tibbles are like data.frames. Do they take up the same amount of memory?

before <- mem_used()
iris_tbl <- iris_long %>%
  as_tibble()
mem_used() - before
## 592 B
class(iris_tbl)
## [1] "tbl_df"     "tbl"        "data.frame"

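Does converting a data.frame to a tbl force a copy? Note that iris_long is already a tibble (pivot_longer() returns one), so the conversion above goes from a tibble to a tibble. Keep that in mind when comparing the addresses below.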

ref(iris_tbl)
## █ [1:0x7fe576ed5ad8] <tibble[,3]> 
## ├─Species = [2:0x7fe56c485a00] <fct> 
## ├─type = [3:0x7fe56c1f0200] <chr> 
## └─measurement = [4:0x7fe574ae2800] <dbl>
ref(iris_long)
## █ [1:0x7fe577acbcf8] <tibble[,3]> 
## ├─Species = [2:0x7fe56c485a00] <fct> 
## ├─type = [3:0x7fe56c1f0200] <chr> 
## └─measurement = [4:0x7fe574ae2800] <dbl>
  1. Discuss how using a tibble changes the memory footprint relative to using a data.frame.
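Because the ref() output above reports iris_long as a tibble already, a cleaner comparison starts from the plain data.frame iris. The sketch below assumes the tibble and lobstr packages are installed and does not presume any particular result.

```r
library(lobstr)
library(tibble)

class(iris)                      # a plain data.frame
iris_tbl2 <- as_tibble(iris)     # hypothetical name, to avoid clobbering iris_tbl
class(iris_tbl2)

obj_size(iris)                   # size of the data.frame
obj_size(iris_tbl2)              # size of the tibble
obj_size(iris, iris_tbl2)        # combined size; shared columns are counted once

# Inspect whether the column vectors are shared between the two objects
ref(iris, iris_tbl2)
```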

Tracing memory

Unfortunately, due to various complications and optimizations, it’s not always possible to reason ahead of time about whether R will make a copy of an object. Instead, we can use the tracemem() function to have R tell us whether it makes a copy and why.

First, note the memory location of iris.

tracemem(iris)
## [1] "<0x7fe576e9c9f8>"

We are now tracing this memory location. Some types of computations we make on iris do not require making a copy.

iris %>%
  pull(Petal.Length) %>%
  mean()
## [1] 3.758
iris %>%
  select(contains("Petal")) %>%
  head()

However, if we modify iris using mutate(), a copy does get made.

iris %>%
  mutate(petal_area = Petal.Length * Petal.Width) %>%
  as_tibble()
## tracemem[0x7fe576e9c9f8 -> 0x7fe576dde058]: initialize <Anonymous> mutate_cols mutate.data.frame mutate as_tibble %>% eval eval eval_with_user_handlers withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> withCallingHandlers suppressMessages render_one FUN lapply sapply <Anonymous> <Anonymous> 
## tracemem[0x7fe576dde058 -> 0x7fe576dde0c8]: initialize <Anonymous> mutate_cols mutate.data.frame mutate as_tibble %>% eval eval eval_with_user_handlers withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> withCallingHandlers suppressMessages render_one FUN lapply sapply <Anonymous> <Anonymous> 
## tracemem[0x7fe576e9c9f8 -> 0x7fe576dde918]: new_data_frame vec_data dplyr_vec_data as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate as_tibble %>% eval eval eval_with_user_handlers withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> withCallingHandlers suppressMessages render_one FUN lapply sapply <Anonymous> <Anonymous> 
## tracemem[0x7fe576dde918 -> 0x7fe576dde988]: new_data_frame dplyr_vec_data as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate as_tibble %>% eval eval eval_with_user_handlers withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> withCallingHandlers suppressMessages render_one FUN lapply sapply <Anonymous> <Anonymous> 
## tracemem[0x7fe576dde988 -> 0x7fe576dde9f8]: as.list.data.frame as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate as_tibble %>% eval eval eval_with_user_handlers withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> withCallingHandlers suppressMessages render_one FUN lapply sapply <Anonymous> <Anonymous>
  1. Experiment with different operations after invoking tracemem(). Can you get a feel for what operations induce copies?
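A small starting point for this experiment, using only base R and a plain numeric vector. It assumes your R build has memory profiling enabled, which is the case for the standard CRAN binaries. Calling untracemem() afterwards turns the tracing off again.

```r
x <- c(1, 2, 3)
tracemem(x)        # start tracing x and return its current address

y <- x             # binds a second name to the same vector; no copy yet
y[[3]] <- 4        # modifying y forces a copy, which tracemem reports

y[[2]] <- 5        # y now has its own vector, so R may modify it in place

untracemem(x)      # stop tracing
```

Results can differ inside RStudio, because its environment pane holds additional references to your objects.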

Garbage collection

Garbage collection is the process of reclaiming memory that is no longer being used. R does this automatically, but you can force the issue with gc().

gc()
##           used (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
## Ncells 1502722 80.3    2281527 121.9         NA  2281527 121.9
## Vcells 2849784 21.8    8388608  64.0      51200  5154059  39.4
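To watch memory being reclaimed, one option is to create a large object, remove its only name, and measure before and after. This is a sketch assuming lobstr is installed. The object big is hypothetical, and the exact numbers will vary by session.

```r
library(lobstr)

before <- mem_used()
big <- numeric(1e7)          # ten million doubles, about 80 MB
during <- mem_used()

rm(big)                      # drop the only name bound to the vector
invisible(gc())              # explicit collection, shown for clarity
after <- mem_used()

during - before              # roughly the size of big
after - before               # much smaller once the memory is reclaimed
```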

Engagement

Take a minute to think about what questions you still have about names, values, and copies. Review what questions have been posted (in the #questions channel) recently by other students and either:

  • respond (e.g., react, comment, clarify, or answer)
  • post a new question