Massive shout out to Dr. Jenny Smetzer, Lecturer in Statistical & Data Sciences, for her work setting this up!

Let’s load a movies dataset, pare down the rows and columns a bit, and then show the first 10 rows using slice():

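library(tidyverse)
# movies_raw is a placeholder name for the raw movies data frame
# (for example, the result of a read_csv() call)
movies <- movies_raw %>%
filter(type %in% c("action", "comedy", "drama", "animated", "fantasy", "rom comedy")) %>%
select(-over200)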

movies %>%
slice(1:10)

# How do I deal with missing values?

You can see that the revenue (in millions of dollars) for the movie “2 Fast 2 Furious” is NA (missing). As a result, computing the median revenue also returns NA:

movies %>%
summarize(median_revenue = median(millions))

You should always think about why a data value might be missing and what that missingness may mean. For example, imagine you are conducting a study on the effects of smoking on lung cancer and a lot of your patients’ data is missing because they died of lung cancer. If you just “sweep these patients under the rug” and ignore them, you are clearly biasing the results.

While there are statistical methods to deal with missing data, they are beyond the scope of this class. The easiest thing to do is to remove all missing cases, but you should always at the very least report to the reader if you do so, as by removing the missing values you may be biasing your results.

You can do this with a na.rm = TRUE argument like so:

movies %>%
summarize(median_revenue = median(millions, na.rm = TRUE))
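To see why the first call returns NA, here is a minimal sketch on a made-up vector. The name revenue_toy and its values are invented for illustration and are not part of the movies data.

```r
# Made-up revenue values (in millions) with one missing entry
revenue_toy <- c(12.5, NA, 30, 47.5)

# A single NA makes the result NA, because the true median is unknown
median_with_na <- median(revenue_toy)

# na.rm = TRUE drops the NA before computing the median of 12.5, 30, 47.5
median_without_na <- median(revenue_toy, na.rm = TRUE)

median_with_na
median_without_na
```

The same pattern applies to mean(), sum(), sd(), and most other summary functions in base R.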

If you decide you want to remove the row with the missing data, you can use the filter function like so:

movies <- movies %>%
filter(!is.na(millions))

movies %>%
slice(1:10)

We see “2 Fast 2 Furious” is now gone.
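Since you should report how many rows you dropped, it can help to count the missing values before filtering. The following is a rough sketch on an invented data frame; toy_movies and its contents are assumptions for illustration only.

```r
library(dplyr)

# Invented data frame with one missing revenue value
toy_movies <- tibble(
  name = c("Movie A", "Movie B", "Movie C"),
  millions = c(12.5, NA, 47.5)
)

# Count the rows where millions is missing
n_missing <- toy_movies %>%
  summarize(n_missing = sum(is.na(millions))) %>%
  pull(n_missing)

# Keep only the rows where millions is present
toy_complete <- toy_movies %>%
  filter(!is.na(millions))

# A one-line note you could adapt for your write-up
message(n_missing, " of ", nrow(toy_movies), " rows dropped because millions was missing")
```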

# How do I reorder bars in a barplot?

Let’s compute the total revenue for each movie type and plot a barplot.

revenue_by_type <- movies %>%
group_by(type) %>%
summarize(total_revenue = sum(millions))
revenue_by_type
ggplot(revenue_by_type, aes(x = type, y = total_revenue)) +
geom_col() +
labs(x = "Movie genre", y = "Total boxoffice revenue (in millions of $)")

Say we want to reorder the categorical variable type so that the bars show in a different order. We can reorder the bars by manually defining the order of the levels in the factor() command:

type_levels <- c("rom comedy", "action", "drama", "animated", "comedy", "fantasy")

revenue_by_type <- revenue_by_type %>%
mutate(type = factor(type, levels = type_levels))

ggplot(revenue_by_type, aes(x = type, y = total_revenue)) +
geom_col() +
labs(x = "Movie genre", y = "Total boxoffice revenue (in millions of $)")

Or if you want to reorder type in ascending order of total_revenue, use reorder():

revenue_by_type <- revenue_by_type %>%
mutate(type = reorder(type, total_revenue))

ggplot(revenue_by_type, aes(x = type, y = total_revenue)) +
geom_col() +
labs(
x = "Movie genre", y = "Total boxoffice revenue (in millions of $)"
)

Or if you want to reorder type in descending order of total_revenue, just put a - sign in front of total_revenue in reorder():

revenue_by_type <- revenue_by_type %>%
mutate(type = reorder(type, -total_revenue))

ggplot(revenue_by_type, aes(x = type, y = total_revenue)) +
geom_col() +
labs(
x = "Movie genre", y = "Total boxoffice revenue (in millions of $)"
)

For more advanced categorical variable (i.e. factor) manipulations, check out the forcats package. Note: forcats is an anagram of factors.

# How do I show money on an axis?

movies <- movies %>%
mutate(revenue = millions * 10^6)

ggplot(data = movies, aes(x = rating, y = revenue)) +
geom_boxplot() +
labs(x = "rating", y = "Revenue in $", title = "Revenue for different movie ratings")

Google “ggplot2 axis scale dollars”, click on the first link, and search for the word “dollars”. You’ll find:

# Don't forget to load the scales package first!
library(scales)

ggplot(data = movies, aes(x = rating, y = revenue)) +
geom_boxplot() +
labs(x = "rating", y = "Revenue in $", title = "Revenue for different movie ratings") +
scale_y_continuous(labels = dollar)

# How do I change values inside cells?

dplyr::rename() renames column/variable names. To “rename” values inside cells of a particular column, you need to mutate() the column using one of the three functions below. There might be other ones too, but these are the three I’ve seen the most. In these examples, we’ll change values in the variable type.

1. ifelse()
2. recode()
3. case_when()

## ifelse()

Replace all instances of rom comedy with romantic comedy using ifelse(). If a particular row has type == "rom comedy", then return "romantic comedy", else return whatever was originally in type. Save everything in a new variable type_new:

movies %>%
mutate(type_new = ifelse(type == "rom comedy", "romantic comedy", type)) %>%
slice(1:10)

Do the same here, but return "not romantic comedy" if type is not "rom comedy", and this time overwrite the original type variable:

movies %>%
mutate(type = ifelse(type == "rom comedy", "romantic comedy", "not romantic comedy")) %>%
slice(1:10)
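One thing worth knowing is how ifelse() treats missing values. Here is a small sketch on an invented vector (type_toy is a made-up name) that also shows dplyr's stricter if_else(), which has a missing argument for handling NA explicitly.

```r
library(dplyr)

# Invented vector of movie types with one missing value
type_toy <- c("action", "rom comedy", NA, "drama")

# Base R ifelse(): a missing test value gives a missing result
base_result <- ifelse(type_toy == "rom comedy", "romantic comedy", type_toy)

# dplyr::if_else(): same logic, but NA can be replaced via the missing argument
dplyr_result <- if_else(
  type_toy == "rom comedy", "romantic comedy", type_toy,
  missing = "unknown"
)

base_result
dplyr_result
```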

## recode()

ifelse() is rather limited, however. What if we want to “rename” all values of type so that they start with an uppercase letter? Use recode():

movies %>%
mutate(type_new = recode(type,
"action" = "Action",
"animated" = "Animated",
"comedy" = "Comedy",
"drama" = "Drama",
"fantasy" = "Fantasy",
"rom comedy" = "Romantic Comedy"
)) %>%
slice(1:10)
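What happens to values you don't list? The sketch below, on an invented character vector (type_toy is a made-up name), suggests that recode() leaves unlisted values alone unless you supply the .default argument.

```r
library(dplyr)

# Invented vector; "horror" is deliberately not listed in the recoding
type_toy <- c("action", "rom comedy", "horror")

# Unlisted values of a character vector are kept as they are
kept <- recode(type_toy,
  "action" = "Action",
  "rom comedy" = "Romantic Comedy"
)

# .default replaces every value that has no explicit match
defaulted <- recode(type_toy,
  "action" = "Action",
  "rom comedy" = "Romantic Comedy",
  .default = "Other"
)

kept
defaulted
```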

## case_when()

case_when() is a little trickier, but allows you to evaluate boolean operations using ==, >, >=, &, |, etc:

movies %>%
mutate(type_new =
case_when(
type == "action" & millions > 40 ~ "Big budget action",
type == "rom comedy" & millions < 40 ~ "Small budget romcom",
# Need this for everything else that isn't one of the two cases above:
TRUE ~ "Rest"
))
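Part of what makes case_when() trickier is that conditions are checked in order and the first one that matches wins. A quick sketch on an invented vector (millions_toy is a made-up name) shows why the most specific condition should usually come first.

```r
library(dplyr)

# Invented revenue values in millions
millions_toy <- c(5, 50, 500)

# General condition first: 500 is caught by "> 10" before "> 100" is ever checked
general_first <- case_when(
  millions_toy > 10 ~ "over 10",
  millions_toy > 100 ~ "over 100",
  TRUE ~ "rest"
)

# Specific condition first: 500 now gets the label we intended
specific_first <- case_when(
  millions_toy > 100 ~ "over 100",
  millions_toy > 10 ~ "over 10",
  TRUE ~ "rest"
)

general_first
specific_first
```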

# How do I convert a numerical variable to a categorical one?

Sometimes we want to turn a numerical, continuous variable into a categorical variable. For instance, what if we wanted a variable that tells us whether a movie made one hundred million dollars or more? In other words, a binary variable: a categorical variable with 2 levels. We can again use the mutate() function:

movies %>%
mutate(big_budget = millions >= 100) %>%
slice(1:10)
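A handy side effect of storing the result as TRUE/FALSE is that sum() counts the TRUEs and mean() gives their proportion. Here is a minimal sketch on an invented data frame (toy_movies, made_100m, and the values are made up for illustration).

```r
library(dplyr)

# Invented data frame of revenues in millions
toy_movies <- tibble(
  name = c("Movie A", "Movie B", "Movie C"),
  millions = c(99.9, 100, 250)
)

# Logical variable: did the movie make 100 million or more?
toy_movies <- toy_movies %>%
  mutate(made_100m = millions >= 100)

# TRUE counts as 1 and FALSE as 0, so sum() counts and mean() gives a proportion
toy_summary <- toy_movies %>%
  summarize(n_big = sum(made_100m), prop_big = mean(made_100m))

toy_summary
```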

What if you want to convert a numerical variable into a categorical variable with more than 2 levels? One way is to use the cut() command. For instance, below, we cut() the score variable, to recode it into 4 categories:

1. 0 - 40 = bad
2. 40.1 - 60 = so-so
3. 60.1 - 80 = good
4. 80.1+ = great

We set the breaking points for cutting the numerical variable with the breaks = c(0, 40, 60, 80, 100) part, and set the labels for each of these bins with the labels = c("bad", "so-so", "good", "great") part. All this action happens inside the mutate command, so the new categorical variable score_categ is added to the data frame.

movies %>%
mutate(score_categ = cut(score,
breaks = c(0, 40, 60, 80, 100),
labels = c("bad", "so-so", "good", "great")
)) %>%
slice(1:10)

Other options with the cut function:

• By default, if a value is exactly the upper bound of an interval, it’s included in the lower category (e.g. 60.0 is ‘so-so’, not ‘good’). To flip this, include the argument right = FALSE.
• You could also have R divide the range of the variable into equal-width intervals. For example, specifying breaks = 3 creates 3 intervals of equal width. Note that the number of values in each interval is not necessarily balanced.
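The boundary behavior of cut() is easiest to see on values that sit exactly on the breaks. The sketch below uses invented vectors (scores_toy and skewed_toy are made-up names) and also shows the base R include.lowest argument, which closes the otherwise open end of the first or last interval.

```r
cut_breaks <- c(0, 40, 60, 80, 100)
cut_labels <- c("bad", "so-so", "good", "great")

# Invented scores that sit exactly on the break points
scores_toy <- c(0, 40, 60, 80, 100)

# Default (right = TRUE): intervals are (0,40], (40,60], (60,80], (80,100],
# so a score of exactly 0 falls outside every interval and becomes NA
default_cut <- cut(scores_toy, breaks = cut_breaks, labels = cut_labels)

# include.lowest = TRUE closes the first interval, making it [0,40]
closed_cut <- cut(scores_toy, breaks = cut_breaks, labels = cut_labels,
                  include.lowest = TRUE)

# right = FALSE flips the intervals to [0,40), [40,60), [60,80), [80,100),
# so now a score of exactly 100 becomes NA instead
left_cut <- cut(scores_toy, breaks = cut_breaks, labels = cut_labels,
                right = FALSE)

# A single number for breaks gives equal-width intervals, not equal counts
skewed_toy <- c(1, 2, 3, 10)
width_counts <- table(cut(skewed_toy, breaks = 3))

default_cut
closed_cut
left_cut
width_counts
```

If you want groups with roughly equal numbers of values instead, one option is dplyr's ntile().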

# How do I compute proportions?

Use a group_by() followed not by a summarize(), as is often the case, but by a mutate(). Say we compute the total revenue (millions) for each movie rating and type:

rating_by_type_millions <- movies %>%
group_by(rating, type) %>%
summarize(millions = sum(millions)) %>%
arrange(rating, type)

rating_by_type_millions

Say within each movie rating (G, PG, PG-13, R), we want to know the proportion of total_millions that was made by each movie type (animated, action, comedy, etc). We can:

rating_by_type_millions %>%
group_by(rating) %>%
mutate(
# Compute a new column of the sum of millions split by rating:
total_millions = sum(millions),
# Compute the proportion within each rating:
prop = millions/total_millions
)

So, for example, the 4 proportions corresponding to R-rated movies sum to 1 (up to rounding): 0.596 + 0.142 + 0.213 + 0.0491 ≈ 1.
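To check the group_by() plus mutate() pattern on numbers you can verify by hand, here is a small sketch on an invented data frame (toy_revenue and its values are made up, not taken from the movies data).

```r
library(dplyr)

# Invented totals: G sums to 40, R sums to 100
toy_revenue <- tibble(
  rating = c("G", "G", "R", "R", "R"),
  type = c("animated", "comedy", "action", "comedy", "drama"),
  millions = c(30, 10, 50, 30, 20)
)

toy_props <- toy_revenue %>%
  group_by(rating) %>%
  mutate(
    # Group total, repeated on every row of the group
    total_millions = sum(millions),
    # Share of the group total for each row
    prop = millions / total_millions
  ) %>%
  ungroup()

# Sanity check: the proportions within each rating should sum to 1
prop_check <- toy_props %>%
  group_by(rating) %>%
  summarize(prop_sum = sum(prop))

toy_props
prop_check
```

Because mutate() keeps every row, the data frame still has 5 rows afterwards, whereas summarize() would have collapsed it to one row per rating.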

# How do I deal with %, commas, and $?

Say you have numerical data that are recorded as percentages, have commas, or are in dollar form and hence are character strings. How do you convert these to numerical values? Using readr::parse_number() inside a mutate()! Shout out to Stack Overflow.

library(readr)
parse_number("10.5%")
## [1] 10.5
parse_number("145,897")
## [1] 145897
parse_number("$1,234.5")
## [1] 1234.5
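Here is roughly what that looks like inside a mutate(), sketched on an invented data frame (toy_raw and its column names are made up for illustration). Note that parse_number() only strips the symbols: "10.5%" becomes 10.5, so divide by 100 yourself if you want a proportion.

```r
library(dplyr)
library(readr)

# Invented data frame where the numbers were read in as character strings
toy_raw <- tibble(
  share = c("10.5%", "45%"),
  budget = c("$1,234.5", "$20,000")
)

toy_clean <- toy_raw %>%
  mutate(
    # Strip the % sign; the result is still on the 0-100 scale
    share = parse_number(share),
    # Convert to a proportion on the 0-1 scale
    share_prop = share / 100,
    # Strip the $ sign and the thousands separator
    budget = parse_number(budget)
  )

toy_clean
```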

What about the other way around? Use the scales package!

library(scales)
percent(0.105)
## [1] "10%"
comma(145897)
## [1] "145,897"
dollar(1234.5)
## [1] "$1,234.50"
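Notice that percent(0.105) dropped the decimal. If you want to keep it, the accuracy argument of percent() controls the rounding. A minimal sketch, with variable names invented for illustration:

```r
library(scales)

# accuracy = 0.1 asks for one decimal place in the printed percentage
share_label <- percent(0.105, accuracy = 0.1)

# dollar() on the same value used above, stored so it can go in a table or label
budget_label <- dollar(1234.5)

share_label
budget_label
```

These functions return character strings, so use them for display (tables, axis labels) and keep the numeric columns around for any further calculations.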

Congratulations. You are now an R Ninja 