Fri Oct 21, 2016

Last Time: 5MV

  1. select() columns by variable name: front of cheatsheet, bottom right
  2. filter() rows matching criteria: front of cheatsheet, bottom middle.
  3. summarise() numerical variables that are group_by() categorical variables
  4. mutate() existing variables to create new ones
  5. arrange() rows

Today: Finishing 5MV

  1. select() columns by variable name: front of cheatsheet, bottom right
  2. filter() rows matching criteria: front of cheatsheet, bottom middle.
  3. summarise() numerical variables that are group_by() categorical variables
  4. mutate() existing variables to create new ones
  5. arrange() rows

and _join(): backside of cheatsheet, right hand column, top.

Arrange

Really simple. Either

  • DATASET_NAME %>% arrange(VARIABLE_NAME) or
  • DATASET_NAME %>% arrange(desc(VARIABLE_NAME))

Arrange Example

library(dplyr)

# Create data frame with two variables
test_data <- data_frame(
  name=c("Abbi", "Abbi ", "Ilana", "Ilana", "Ilana"),
  value_1=c(0, 1, 0, 1, 0),
  value_2=c(4, 6, 3, 2, 5)
)

# See contents in console
test_data

Arrange Example

Run these bits of code. Notice the subtle diff between 2 and 3:

# Bit1: Arrange in ascending order of value_1 (the default)
test_data %>% arrange(value_1)

# Bit2: Arrange in descending order of value_1
test_data %>% arrange(desc(value_1))

# Bit3: Arrange in decending order of value_1, and then within
# value_1, arrange in ascending order of value_2
test_data %>% arrange(desc(value_1), value_2)

Combining Data Sets via Joins

And now the last component of data manipulation, and the data portion of this class: joining/merging two data sets.

Combining Data Sets via Joins

Imagine we have two data frames x and y. Run the following:

x <- data_frame(x1=c("A","B","C"), x2=c(1,2,3))
y <- data_frame(x1=c("A","B","D"), x3=c(TRUE,FALSE,TRUE))
x
y

Combining Data Sets via Joins

We join by the "x1" variable. Note how it is in quotation marks.

left_join(x, y, by = "x1")
full_join(x, y, by = "x1")

Different Join Variable Name?

What the variable we want to join by has a different name in each data set? Here a vs b:

i <- data_frame(a=c("A","B","C"), x2=c(1,2,3))
j <- data_frame(b=c("A","B","D"), x3=c(TRUE,FALSE,TRUE))
i
j

Different Join Variable Name?

We use the c() function.

left_join(i, j, by = c("a" = "b"))

It is a bit confusing unfortunately. It's

  • by = c("a"="b") and not
  • by = c("a"=="b")

Different Join Variable Name?

There was an example of such a join in Chapter 4.7.1 on Barplots where we join

  • flights data, which had airport codes in origin
  • airports data, which had airport codes in faa
inner_join(flights, airports, by = c("origin" = "faa"))

This creates a single data set with all information combined.

Types of Joins

  • There are many types of join (right-hand column of back of cheatsheet)
  • To keep things simple, we'll try to only use:
    • left_join
    • full_join
  • This illustration succinctly summarizes all of them.