In this lab, we will learn how to investigate the underlying data structures of R objects.
Goal: by the end of this lab, you will be able to determine the base class of any object.
Objects in R can have attributes.
Use the attributes() function to figure out what they are.
attributes(starwars)
## $names
## [1] "name" "height" "mass" "hair_color" "skin_color"
## [6] "eye_color" "birth_year" "sex" "gender" "homeworld"
## [11] "species" "films" "vehicles" "starships" "is_bald"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## [51] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## [76] 76 77 78 79 80 81 82 83 84 85 86 87
##
## $class
## [1] "beanumber" "tbl_df" "tbl" "data.frame"
Unlike in many other programming languages, attributes in R – including the class of an object – are changeable!
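As a minimal sketch of what "changeable" means, the snippet below attaches, reads, and removes an attribute on a plain numeric vector. The vector `heights` and the attribute name `"units"` are made up for illustration and are not part of the lab data.

```r
# Toy vector; `heights` and the "units" attribute are hypothetical.
heights <- c(172, 167, 96)
attributes(heights)                 # a bare vector has no attributes

# attr<- attaches (or overwrites) a single named attribute.
attr(heights, "units") <- "cm"
units_while_set <- attr(heights, "units")
units_while_set

# Assigning NULL removes the attribute again.
attr(heights, "units") <- NULL
attributes(heights)
```

The same `attr(x, "name") <- value` form works for any attribute, including `"class"`.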
Use the assignment operator (<-) and the attr() function to change the class of starwars to sds_is_awesome.
# SAMPLE SOLUTION
attr(starwars, "class") <- "sds_is_awesome"
attributes(starwars)
## $names
## [1] "name" "height" "mass" "hair_color" "skin_color"
## [6] "eye_color" "birth_year" "sex" "gender" "homeworld"
## [11] "species" "films" "vehicles" "starships" "is_bald"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## [51] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## [76] 76 77 78 79 80 81 82 83 84 85 86 87
##
## $class
## [1] "sds_is_awesome"
Is starwars a data.frame now? How do you know? Try to select() a column.
# SAMPLE SOLUTION
starwars %>%
  select(name)
Use rm(starwars) to delete the bad copy. Now run starwars again. Why does this work?
# SAMPLE SOLUTION
rm(starwars)
starwars
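One likely explanation, assuming starwars originally comes from an attached package such as dplyr: the assignment created a modified copy in your workspace that masked the packaged object, and rm() removes only the workspace copy. The sketch below shows the same masking behavior with the base constant `pi`, chosen here only because it needs no packages.

```r
# `pi` is provided by the base package.
pi

# Assigning to the same name creates a separate object in the workspace.
# It masks the base version but does not overwrite it.
pi <- 3
pi

# rm() deletes only the workspace copy, so the name resolves
# to the base package's value again.
rm(pi)
pi
```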
S3 is the name of the simplest and most common object-oriented paradigm in R. We’ll learn more about S3 later. For now, we’ll explore common vector classes that are not atomic.
Note first that starwars has multiple classes, and these classes are ordered.
class(starwars)
## [1] "tbl_df" "tbl" "data.frame"
The basic data type of starwars is a list, because all tbl_dfs and data.frames are lists.
typeof(starwars)
## [1] "list"
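To make the "data frames are lists" point concrete, here is a small sketch on a toy base R data frame. The object `df` and its values are invented for illustration.

```r
# Toy data frame; names and values are hypothetical.
df <- data.frame(name = c("Luke", "Leia"), height = c(172, 150))

typeof(df)       # the underlying type is "list"
is.list(df)      # TRUE
length(df)       # number of list elements, i.e. columns

# List-style extraction returns a column as a plain vector.
df[["height"]]

# unclass() drops the class attribute and shows the bare list underneath.
unclass(df)
```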
When you type starwars at the console, what actually gets called is print(starwars). That is, the default action when you type the name of an object is to run the print() command on that object.
Thus, when you type starwars, R runs print(starwars), and since it knows that print() is a generic function, and starwars is a tbl_df, it looks for a method called print.tbl_df(). If it can’t find one, it will look for a method called print.tbl(). If it can’t find one, it will look for print.data.frame(). If it can’t find that, it will look for print.default().
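The lookup order described above can be sketched with a toy generic. Everything here (the generic `describe()` and the classes `child` and `parent`) is hypothetical and exists only to show how dispatch walks the class vector from left to right.

```r
# Hypothetical generic with a default method and one class-specific method.
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "default method"
describe.parent  <- function(x, ...) "parent method"

# An object whose class vector is ordered: "child" first, then "parent".
obj <- structure(list(), class = c("child", "parent"))

# No describe.child() exists yet, so dispatch falls through to describe.parent().
before <- describe(obj)
before

# Once a method for the first class exists, it takes precedence.
describe.child <- function(x, ...) "child method"
after <- describe(obj)
after

# An object with neither class ends up at describe.default().
fallback <- describe(1:3)
fallback
```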
In this case, there are print() methods defined for tbl and data.frame. Note the difference between:
starwars
print.data.frame(starwars)
Examine the output of print.data.frame(starwars) and as.data.frame(starwars). Are they the same? What is the difference between what is actually executed?
Examine the output of as.numeric(starwars$name) and as.numeric(factor(starwars$name)). What is going on?
# SAMPLE SOLUTION
x <- factor(starwars$name)
attributes(x)
## $levels
## [1] "Ackbar" "Adi Gallia" "Anakin Skywalker"
## [4] "Arvel Crynyd" "Ayla Secura" "Bail Prestor Organa"
## [7] "Barriss Offee" "BB8" "Ben Quadinaros"
## [10] "Beru Whitesun lars" "Bib Fortuna" "Biggs Darklighter"
## [13] "Boba Fett" "Bossk" "C-3PO"
## [16] "Captain Phasma" "Chewbacca" "Cliegg Lars"
## [19] "Cordé" "Darth Maul" "Darth Vader"
## [22] "Dexter Jettster" "Dooku" "Dormé"
## [25] "Dud Bolt" "Eeth Koth" "Finis Valorum"
## [28] "Finn" "Gasgano" "Greedo"
## [31] "Gregar Typho" "Grievous" "Han Solo"
## [34] "IG-88" "Jabba Desilijic Tiure" "Jango Fett"
## [37] "Jar Jar Binks" "Jek Tono Porkins" "Jocasta Nu"
## [40] "Ki-Adi-Mundi" "Kit Fisto" "Lama Su"
## [43] "Lando Calrissian" "Leia Organa" "Lobot"
## [46] "Luke Skywalker" "Luminara Unduli" "Mace Windu"
## [49] "Mas Amedda" "Mon Mothma" "Nien Nunb"
## [52] "Nute Gunray" "Obi-Wan Kenobi" "Owen Lars"
## [55] "Padmé Amidala" "Palpatine" "Plo Koon"
## [58] "Poe Dameron" "Poggle the Lesser" "Quarsh Panaka"
## [61] "Qui-Gon Jinn" "R2-D2" "R4-P17"
## [64] "R5-D4" "Ratts Tyerell" "Raymus Antilles"
## [67] "Rey" "Ric Olié" "Roos Tarpals"
## [70] "Rugor Nass" "Saesee Tiin" "San Hill"
## [73] "Sebulba" "Shaak Ti" "Shmi Skywalker"
## [76] "Sly Moore" "Tarfful" "Taun We"
## [79] "Tion Medon" "Wat Tambor" "Watto"
## [82] "Wedge Antilles" "Wicket Systri Warrick" "Wilhuff Tarkin"
## [85] "Yarael Poof" "Yoda" "Zam Wesell"
##
## $class
## [1] "factor"
as.numeric(x)
## [1] 46 15 62 21 44 54 10 64 12 53 3 84 17 33 30 35 82 38 86 56 13 34 14 43 45
## [26] 1 50 4 83 51 61 52 27 37 69 70 68 81 73 60 75 20 11 5 25 29 9 48 40 41
## [51] 26 2 71 85 57 49 31 19 18 59 47 7 24 23 6 36 87 22 42 78 39 65 63 80 72
## [76] 74 32 77 66 76 79 28 67 58 8 16 55
as.numeric(starwars$name)
## Warning: NAs introduced by coercion
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [76] NA NA NA NA NA NA NA NA NA NA NA NA
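The same behavior can be reproduced on a three-element toy vector, which makes it easier to see that a factor is an integer vector of codes plus a levels attribute. The values below are invented for illustration.

```r
# Toy character vector (hypothetical values).
x <- c("b", "a", "c")
f <- factor(x)

levels(f)        # sorted unique values: "a" "b" "c"
typeof(f)        # a factor is stored as "integer"
as.numeric(f)    # the integer codes, i.e. each value's position in levels(f)
unclass(f)       # codes plus the levels attribute, with the class removed

# Character strings that do not look like numbers become NA (with a warning).
not_numbers <- suppressWarnings(as.numeric(x))
not_numbers

# Strings that do look like numbers are converted as expected.
as.numeric(c("10", "2.5"))
```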
Since data.frames are lists, their columns can be objects of arbitrary type. In particular, they can be lists.
The films column in starwars is a list-column. Each entry contains a list of the movies that the corresponding character has appeared in.
films <- starwars %>%
  pull(films)
films
Note that the length() of films is 87, but that each entry in films holds a vector of arbitrary length. To see these lengths, we have to map() over the entries in films.
length(films)
## [1] 87
map_int(films, length)
## [1] 5 6 7 4 5 3 3 1 1 6 3 2 5 4 1 3 3 1 5 5 3 1 1 2 1 2 1 1 1 1 1 3 1 2 1 1 1 2
## [39] 1 1 2 1 1 3 1 1 1 3 3 3 2 2 2 1 3 2 1 1 1 2 2 1 1 2 2 1 1 1 1 1 1 1 2 1 1 2
## [77] 1 1 2 2 1 1 1 1 1 1 3
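For comparison, here is a base R sketch of the same idea on a toy list. `vapply()` with an `integer(1)` template plays a role similar to `map_int()`, and `lengths()` is a base shortcut for this particular case. The list `toy_films` and its entries are hypothetical.

```r
# Toy list with entries of different lengths (hypothetical values).
toy_films <- list(c("film_a", "film_b"), "film_c", character(0))

# length() counts the entries of the outer list.
length(toy_films)

# Apply length() to each entry; integer(1) states that every result
# must be a single integer, much like map_int() enforces.
n_each <- vapply(toy_films, length, integer(1))
n_each

# lengths() computes the same per-entry lengths directly.
lengths(toy_films)
```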
nest() and unnest()
List-columns can be expanded by unnest(). This has the effect of lengthening the data frame (sort of like an accordion). Each row is repeated once for each element of its entry in the list-column.
Note that each row in starwars corresponds to one character, while films stores the list of films that character has appeared in. If we unnest() the data frame by expanding out the films, we get a data frame that is much longer, because each row now represents one character in one film.
library(tidyr)
starwars %>%
  unnest(films)
Note that films is no longer a list-column – it’s now a character vector.
The nest() function performs the opposite operation of “rolling up” the data frame to create a new list-column.
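To see what expanding and rolling up amount to, the sketch below mimics both steps in base R on a toy data frame. This is not how tidyr implements unnest() or nest(); it only illustrates the shape of the result. The data frame `chars` and its values are hypothetical.

```r
# Toy data frame with a list-column (hypothetical values).
chars <- data.frame(name = c("x", "y"))
chars$films <- list(c("f1", "f2", "f3"), "f1")

# "Unnest" by hand: repeat each name once per film, then flatten the list.
n <- lengths(chars$films)
long <- data.frame(
  name  = rep(chars$name, times = n),
  films = unlist(chars$films)
)
long
nrow(long)           # one row per character-film pair
is.list(long$films)  # FALSE: films is now a plain character vector

# "Nest" by hand: split the flat column back into one vector per name.
rolled <- split(long$films, long$name)
rolled
```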
starwars data frame.
Suppose now we want to add the number of films for each character to the starwars data set. A simple mutate() like this will not throw an error, but also won’t do what we want.
oops <- starwars %>%
  mutate(num_films = length(films)) %>%
  arrange(desc(num_films)) %>%
  select(name, num_films)
oops
This just made all of the entries equal to length(films).
all(oops$num_films == length(starwars$films))
## [1] TRUE
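The reason is recycling: length() returns a single number for the whole column, and that one value is repeated down every row. A base R sketch on a toy data frame (hypothetical names and values) shows the effect next to a per-row count.

```r
# Toy data frame with a list-column (hypothetical values).
chars <- data.frame(name = c("x", "y"))
chars$films <- list(c("f1", "f2", "f3"), "f1")

# length() of the whole column is one number (the number of rows),
# which gets recycled into every row.
chars$num_films_wrong <- length(chars$films)

# A per-entry count has one value per row.
chars$num_films <- lengths(chars$films)

chars[, c("name", "num_films_wrong", "num_films")]
```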
To get this right, we need to map() inside our mutate().
starwars %>%
  mutate(num_films_actual = map_int(films, length)) %>%
  arrange(desc(num_films_actual)) %>%
  select(name, num_films_actual, films)
Take a minute to think about what questions you still have about vectors. Review what questions have been posted (in the #questions channel) recently by other students and either:
Here is a prompt to prime your thinking:
Where did you get stuck in this lab?