Massive shout out to Dr. Jenny Smetzer, Lecturer in Statistical & Data Sciences, for her work setting this up!
Let’s load a movies dataset, pare down the rows and columns a bit, and then show the first 10 rows using
You can see that the revenue in `millions` value for the movie “2 Fast 2 Furious” is `NA` (missing). So the following occurs when computing the median revenue:
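A minimal sketch of this behavior, using a small made-up data frame in place of the movies dataset. Only the `millions` column name comes from the text; the titles and values are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the movies data; titles and values are made up
movies_toy <- tibble(
  name = c("Movie A", "Movie B", "Movie C"),
  millions = c(120.5, NA, 80)
)

# A single NA in the column makes the median NA as well
median_with_na <- movies_toy %>%
  summarize(median_revenue = median(millions))
median_with_na
```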
You should always think about why a data value might be missing and what that missingness may mean. For example, imagine you are conducting a study on the effects of smoking on lung cancer and a lot of your patients’ data is missing because they died of lung cancer. If you just “sweep these patients under the rug” and ignore them, you are clearly biasing the results.
While there are statistical methods to deal with missing data, they are beyond the scope of this class. The easiest thing to do is to remove all missing cases, but you should always at the very least report to the reader if you do so, since removing the missing values may bias your results.
You can do this with a `na.rm = TRUE` argument like so:
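As a sketch, again on made-up values rather than the real movies data, the `na.rm = TRUE` argument goes inside the summary function itself:

```r
library(dplyr)

# Hypothetical stand-in for the movies data; values are made up
movies_toy <- tibble(
  name = c("Movie A", "Movie B", "Movie C"),
  millions = c(120.5, NA, 80)
)

# na.rm = TRUE drops the NA before the median is computed
median_without_na <- movies_toy %>%
  summarize(median_revenue = median(millions, na.rm = TRUE))
median_without_na
```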
If you decide you want to remove the row with the missing data, you can use the filter function like so:
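One way to write that filter, sketched on made-up data, is to keep only the rows where `millions` is not missing:

```r
library(dplyr)

# Hypothetical stand-in for the movies data; values are made up
movies_toy <- tibble(
  name = c("Movie A", "Movie B", "Movie C"),
  millions = c(120.5, NA, 80)
)

# Keep only rows whose revenue is not missing
movies_no_missing <- movies_toy %>%
  filter(!is.na(millions))
movies_no_missing
```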
We see “2 Fast 2 Furious” is now gone.
Let’s compute the total revenue for each movie type and plot a barplot.
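A sketch of this step on made-up data. The names `revenue_by_type` and `total_revenue` match the code used later in the text; the rows themselves are invented.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the movies data; values are made up
movies_toy <- tibble(
  type = c("action", "action", "comedy", "drama"),
  millions = c(100, 50, 30, 20)
)

# One row per movie type, holding the summed revenue
revenue_by_type <- movies_toy %>%
  group_by(type) %>%
  summarize(total_revenue = sum(millions))

# geom_col() draws bars whose heights are the values in total_revenue
revenue_plot <- ggplot(revenue_by_type, aes(x = type, y = total_revenue)) +
  geom_col()
revenue_plot
```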
Say we want to reorder the categorical variable `type` so that the bars show in a different order. We can reorder the bars by manually defining the order of the `levels` in the `factor()` call:
    type_levels <- c("rom comedy", "action", "drama", "animated", "comedy", "fantasy")

    revenue_by_type <- revenue_by_type %>%
      mutate(type = factor(type, levels = type_levels))

    ggplot(revenue_by_type, aes(x = type, y = total_revenue)) +
      geom_col() +
      labs(x = "Movie genre", y = "Total boxoffice revenue (in millions of $)")
Or if you want to reorder `type` in ascending order of `total_revenue`, we use
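The function name is cut off in the text above. One option that does this, shown here as an assumption rather than as the author's exact code, is base R's `reorder()` inside `aes()`. The totals below are made up.

```r
library(ggplot2)

# Hypothetical totals; column names follow the text
revenue_by_type <- data.frame(
  type = c("action", "comedy", "drama"),
  total_revenue = c(150, 30, 80)
)

# reorder() sorts the levels of type by total_revenue, smallest first
ascending_levels <- levels(reorder(revenue_by_type$type, revenue_by_type$total_revenue))

ascending_plot <- ggplot(revenue_by_type,
                         aes(x = reorder(type, total_revenue), y = total_revenue)) +
  geom_col() +
  labs(x = "type")
ascending_plot
```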
Or if you want to reorder `type` in descending order of `total_revenue`, just put a `-` sign in front of
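A sketch of the descending version, assuming `reorder()` is the function in use and again with made-up totals. Negating the sorting variable flips the order.

```r
library(ggplot2)

# Hypothetical totals; column names follow the text
revenue_by_type <- data.frame(
  type = c("action", "comedy", "drama"),
  total_revenue = c(150, 30, 80)
)

# Sorting by the negated totals puts the largest bar first
descending_levels <- levels(reorder(revenue_by_type$type, -revenue_by_type$total_revenue))

descending_plot <- ggplot(revenue_by_type,
                          aes(x = reorder(type, -total_revenue), y = total_revenue)) +
  geom_col() +
  labs(x = "type")
descending_plot
```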
For more advanced categorical variable (i.e. factor) manipulations, check out the `forcats` package. Note: `forcats` is an anagram of “factors”.
Google “ggplot2 axis scale dollars” and click on the first link and search for the word “dollars”. You’ll find:
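The snippet from that page is not reproduced here. A common approach, assuming the `scales` package is installed, is to pass its `dollar` formatter to `scale_y_continuous()`. The data below are made up.

```r
library(ggplot2)
library(scales)

# Hypothetical totals; column names follow the text
revenue_by_type <- data.frame(
  type = c("action", "comedy", "drama"),
  total_revenue = c(150, 30, 80)
)

# labels = dollar formats the y-axis tick labels as dollar amounts
dollar_plot <- ggplot(revenue_by_type, aes(x = type, y = total_revenue)) +
  geom_col() +
  scale_y_continuous(labels = dollar)
dollar_plot

# The same formatter applied directly to a number
formatted_value <- dollar(150)
```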
`dplyr::rename()` renames column/variable names. To “rename” values inside cells of a particular column, you need to `mutate()` the column using one of the three functions below. There might be other ones too, but these are the three I’ve seen the most. In these examples, we’ll change values in the variable `type`.
Switch all instances of `rom comedy` with `romantic comedy` using `ifelse()`. If a particular row has `type == "rom comedy"`, then return `"romantic comedy"`, else return whatever was originally in `type`. Save everything in a new variable.
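A sketch of this recoding on made-up rows. The new variable's name is not given in the text, so `type_new` below is a hypothetical choice.

```r
library(dplyr)

# Hypothetical stand-in for the movies data; type is a character column here
movies_toy <- tibble(type = c("rom comedy", "action", "rom comedy"))

# ifelse(test, value if TRUE, value if FALSE), evaluated row by row
movies_recoded <- movies_toy %>%
  mutate(type_new = ifelse(type == "rom comedy", "romantic comedy", type))
movies_recoded
```

Note that this assumes `type` is a character vector. If it is a factor, `ifelse()` returns the underlying integer codes for the unchanged rows, so convert with `as.character()` first.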
Do the same here, but return `"not romantic comedy"` if `type` is not `"rom comedy"`, and this time overwrite the original `type`.
`ifelse()` is rather limited, however. What if we want to “rename” all values of `type` so that they start with uppercase? Use
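The function name is cut off in the text above. One function that handles several replacements at once, offered here as an assumption about what was meant, is `dplyr::recode()`, which takes `old = new` pairs. The rows are made up.

```r
library(dplyr)

# Hypothetical stand-in for the movies data
movies_toy <- tibble(type = c("action", "rom comedy", "drama"))

# Each pair maps an existing value (left) to its replacement (right);
# values that are not listed are left unchanged
movies_upper <- movies_toy %>%
  mutate(type = recode(type,
    "action" = "Action",
    "rom comedy" = "Rom comedy",
    "drama" = "Drama"
  ))
movies_upper
```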
case_when() is a little trickier, but allows you to evaluate boolean operations using
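A sketch of `case_when()` on made-up rows. Each line is a condition, a `~`, and the value to return; the first condition that is true for a row wins. The `"blockbuster action"` label and the 100 million cutoff are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the movies data; values are made up
movies_toy <- tibble(
  type = c("rom comedy", "action", "drama"),
  millions = c(50, 150, 20)
)

movies_cases <- movies_toy %>%
  mutate(type = case_when(
    type == "rom comedy" ~ "romantic comedy",
    # Conditions can combine several columns with & and |
    type == "action" & millions >= 100 ~ "blockbuster action",
    # Fallback: keep the original value for every other row
    TRUE ~ type
  ))
movies_cases
```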
Sometimes we want to turn a numerical, continuous variable into a categorical variable. For instance, what if we wanted a variable that tells us whether a movie made one hundred million dollars or more? In other words, a binary variable, i.e. a categorical variable with 2 levels. We can again use the
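A sketch of such a binary variable, assuming `ifelse()` is the function meant and using made-up revenues. The variable name `big_hit` and its `"yes"`/`"no"` labels are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for the movies data; values are made up
movies_toy <- tibble(
  name = c("Movie A", "Movie B", "Movie C"),
  millions = c(150, 100, 99.9)
)

# >= includes movies that made exactly 100 million
movies_binary <- movies_toy %>%
  mutate(big_hit = ifelse(millions >= 100, "yes", "no"))
movies_binary
```

Rows where `millions` is `NA` get `NA` in the new variable as well.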
What if you want to convert a numerical variable into a categorical variable with more than 2 levels? One way is to use the `cut()` command. For instance, below, we `cut()` the `score` variable to recode it into 4 categories:
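A sketch of this call on made-up scores. The break points, labels, and the `scorecats` name are the ones quoted in the text.

```r
library(dplyr)

# Hypothetical scores; the real data are not shown in the text
movies_toy <- tibble(score = c(35, 55, 60, 75, 95))

# Five break points define four bins: (0,40], (40,60], (60,80], (80,100]
movies_cut <- movies_toy %>%
  mutate(scorecats = cut(score,
    breaks = c(0, 40, 60, 80, 100),
    labels = c("bad", "so-so", "good", "great")
  ))
movies_cut
```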
We set the breaking points for cutting the numerical variable with the `c(0, 40, 60, 80, 100)` part, and set the labels for each of these bins with the `labels = c("bad", "so-so", "good", "great")` part. All this action happens inside the `mutate()` command, so the new categorical variable `scorecats` is added to the data frame.
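Other options with `cut()`: by default each bin includes its upper break point (so a score of exactly 60 falls in "so-so", not "good"); to flip this so that bins include their lower break point instead, add `right = FALSE`. You can also pass a single number instead of a vector of break points: `breaks = 3` would create 3 groups of equal width spanning the range of the data, which do not necessarily contain the same number of values.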
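A short sketch of both options on made-up scores, using base R only:

```r
# Hypothetical scores, with the break points and labels quoted in the text
scores <- c(35, 55, 60, 75, 95)
score_breaks <- c(0, 40, 60, 80, 100)
score_labels <- c("bad", "so-so", "good", "great")

# Default (right = TRUE): bins are (40,60], so 60 lands in "so-so"
default_bins <- cut(scores, breaks = score_breaks, labels = score_labels)

# right = FALSE: bins are [60,80), so 60 lands in "good"
left_closed_bins <- cut(scores, breaks = score_breaks, labels = score_labels,
                        right = FALSE)

# A single number asks for that many equal-width bins over the data's range
three_bins <- cut(scores, breaks = 3)
table(three_bins)
```

With `right = FALSE`, a value equal to the largest break point (100 here) falls outside the last bin and becomes `NA` unless `include.lowest = TRUE` is also set.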
By using a `group_by()` followed not by a `summarize()`, as is often the case, but rather a `mutate()`. So say we compute the total revenue in millions for each movie rating and type:
Say within each movie rating (G, PG, PG-13, R), we want to know the proportion of `total_millions` that was made by each movie type (animated, action, comedy, etc). We can:
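A sketch of both steps on made-up rows, so the numbers will not match the figures quoted in the text. The `total_millions` name comes from the text; `prop` is a hypothetical name for the new column.

```r
library(dplyr)

# Hypothetical stand-in for the movies data; values are made up
movies_toy <- tibble(
  rating = c("R", "R", "R", "PG", "PG"),
  type = c("action", "action", "comedy", "animated", "comedy"),
  millions = c(40, 20, 40, 30, 70)
)

# Step 1: total revenue for each rating and type combination
revenue_by_rating_type <- movies_toy %>%
  group_by(rating, type) %>%
  summarize(total_millions = sum(millions), .groups = "drop")

# Step 2: group by rating only, then mutate() so every row is kept;
# sum(total_millions) is computed separately within each rating
proportions <- revenue_by_rating_type %>%
  group_by(rating) %>%
  mutate(prop = total_millions / sum(total_millions)) %>%
  ungroup()
proportions
```

Because `mutate()` does not collapse rows the way `summarize()` does, each rating and type pair keeps its own row and the proportions within a rating add up to 1.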
So for example, the 4 proportions corresponding to R rated movies sum to 1, up to rounding: 0.596 + 0.142 + 0.213 + 0.0491 ≈ 1.
Say you have numerical data that are recorded as percentages, have commas, or are in dollar form and hence are character strings. How do you convert these to numerical values? Using `readr::parse_number()` inside a `mutate()`! Shout out to Stack Overflow.
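A sketch of `parse_number()` on a plain character vector. The input strings are assumptions chosen to be consistent with the output shown in the text.

```r
library(readr)

# Character strings with a percent sign, a thousands separator, and a dollar sign
raw_values <- c("10.5%", "145,897", "$1,234.5")

# parse_number() drops the non-numeric characters and returns numeric values
parsed_values <- parse_number(raw_values)
parsed_values
```

Inside a data pipeline the same call goes in a `mutate()`, for example `mutate(revenue = parse_number(revenue))` for a hypothetical character column named `revenue`.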
##  10.5
##  145897
##  1234.5
What about the other way around? Use the
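The name that followed is cut off in the text. Assuming it refers to the `scales` package, its formatting helpers turn numbers into labelled character strings:

```r
library(scales)

# Each helper returns a character string, not a number
percent_label <- percent(0.1)
comma_label <- comma(145897)
dollar_label <- dollar(1234.5)

percent_label
comma_label
dollar_label
```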
##  "10%"
##  "145,897"
##  "$1,234.50"
Congratulations. You are now an R Ninja