class: center, middle, inverse, title-slide

.title[
# Vectors
]
.subtitle[
## Mini-Lecture 2
]
.author[
### Albert Y. Kim
]
.date[
### SDS 270<br>2022-09-15
]

---

## Garbage collection

> When does [R] determine that a file is no longer needed, so that it can be considered for **garbage collection**?

--

- Garbage collection is *lazy*

--

- Your OS reclaims memory from R only when it needs it

---

## Tibbles

> I can’t seem to figure out why there is such a difference in the memory usage of a **tibble vs a dataframe**? Is it because a dataframe is more like a proper file whereas a tibble is like a preview?

--

```r
library(tidyverse)  # as_tibble(), %>%, pivot_longer(), map_dbl() used in these slides
library(lobstr)
obj_size(as_tibble(iris)) - obj_size(iris)
```

```
## 136 B
```

--

```r
obj_size(attr(iris, "class"))
```

```
## 120 B
```

```r
obj_size(attr(as_tibble(iris), "class"))
```

```
## 256 B
```

---

## Wide vs. long

> I understand why "long" and "wide" data would take up different amounts of memory in Exercise 7 but am curious **why long format data takes up more**? Intuitively I would have thought that fewer vectors would take up less space than more vectors, even if they're longer.

--

```r
dim(iris)
```

```
## [1] 150   5
```

--

```r
iris_long <- iris %>%
  pivot_longer(-Species, names_to = "type", values_to = "measurement")
dim(iris_long)
```

```
## [1] 600   3
```

```r
obj_size(iris_long) / obj_size(iris)
```

```
## 1.93 B
```

```r
prod(dim(iris_long)) / prod(dim(iris))
```

```
## [1] 2.4
```

--

- There is overhead because `pivot_longer()` adds a new variable

---

## Wide vs. long (cont'd)

- But **you're right** otherwise (about factors, at least)!

```r
iris %>%
  map_dbl(obj_size)
```

```
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##         1248         1248         1248         1248         1248
```

```r
iris_long %>%
  map_dbl(obj_size)
```

```
##     Species        type measurement 
##        3048        5104        4848
```

```r
class(iris_long$Species)
```

```
## [1] "factor"
```

---

## When does it matter?

> When does object size start to make a noticeable difference in the efficiency/speed of the code? For example, if you had a long data frame vs.
a wide one, is there a number of rows/columns at which the long data frame would slow your code down significantly compared with the wide one, or does it depend entirely on what kind of program you're running?

--

- Sounds like a great project!

---
class: center, inverse, middle

# Vectors

---

## Clarification

- Lists *always* store references to other objects

.center[]

.footnote[https://adv-r.hadley.nz/names-values.html#list-references]

---

## Coercion

> character → double → integer → logical

.footnote[https://adv-r.hadley.nz/vectors-chap.html#testing-and-coercion]

--

- Makes more sense to me that the arrows go the other way!

---
class: center, middle
background-image: url(https://d33wubrfki0l68.cloudfront.net/baa19d0ebf9b97949a7ad259b29a1c4ae031c8e2/8e9b8/diagrams/vectors/summary-tree-s3-1.png)
background-size: contain
background-position: center

---

## `data.frame`s and `tibble`s

.center[]

.footnote[https://adv-r.hadley.nz/vectors-chap.html#tibble]

---

## Differences

- `tibble()` never coerces an input
- `tibble()` won't transform non-syntactic names
- `tibble()` only recycles vectors of length 1
- `tibble()` allows references to created variables
- `[` always returns a tibble
- `$` doesn't do partial matching

---

## Method-oriented programming?

- Suppose we have an `instrument` object called `violin` and a method called `play()`

--

- in Java:

```java
// the object owns the method: call play() on the instance
Instrument myViolin = new Violin();
myViolin.play();
```

--

- but in R:

```r
# constructor sets class attribute to "instrument"
my_violin <- violin()
# generic function dispatches on class attribute
play(my_violin)
```

--

```r
# what actually happens!!!
*play.instrument(my_violin)
```

---

## List-columns

- Where have you seen this before?


--

- `sf` objects have a `geometry` list-column

--

- fitting many models

---

## `sf` list-columns

```r
library(sf)
library(macleish)
boundary <- macleish_layers[["boundary"]]
boundary %>%
  as_tibble()
```

```
## # A tibble: 1 × 2
##     area                                                                geometry
##   [acre]                                                           <POLYGON [°]>
## 1   255. ((-72.68133 42.45536, -72.68108 42.45539, -72.68111 42.45549, -72.6811…
```

```r
boundary$geometry %>%
  class()
```

```
## [1] "sfc_POLYGON" "sfc"
```

```r
boundary$geometry %>%
  typeof()
```

```
## [1] "list"
```

---

## `nest()` and `unnest()`

```r
library(tidyr)
nrow(starwars)
```

```
## [1] 87
```

```r
starwars_person_film <- starwars %>%
  unnest(films)
nrow(starwars_person_film)
```

```
## [1] 173
```

```r
starwars_person_film %>%
  nest(films) %>%
  nrow()
```

```
## Warning: All elements of `...` must be named.
## Did you want `data = films`?
```

```
## [1] 87
```
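
---

## `nest()` with a named list-column

- The warning on the previous slide comes from the unnamed `nest(films)` call. Below is a minimal sketch of the named form on a toy tibble. The objects `people`, `long`, `nested`, and `chopped` are made up for illustration and are not part of the lecture data. It also shows `chop()`, which appears to be the closer inverse of `unnest()` when the list-column holds plain vectors rather than data frames.

```r
library(tibble)
library(tidyr)

# Hypothetical toy data: one row per person, `films` is a list-column
people <- tibble(
  name  = c("Luke", "Leia"),
  films = list(c("A New Hope", "Return of the Jedi"), "A New Hope")
)

# unnest() expands the list-column: one row per (person, film) pair
long <- unnest(people, films)
nrow(long)

# Naming the new list-column (new_name = columns_to_nest) avoids the warning;
# each element of nested$films is a one-column tibble
nested <- nest(long, films = films)
nrow(nested)

# chop() collapses rows into a list-column of character vectors instead
chopped <- chop(long, films)
lengths(chopped$films)
```

- With the unnamed call, tidyr falls back to a list-column called `data`, as the warning text suggests; the named form just makes that choice explicit.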