Massive shout out to Dr. Jenny Smetzer, Lecturer in Statistical & Data Sciences, for her work setting this up!

Let’s load a movies dataset, pare down the rows and columns a bit, and then show the first 10 rows using slice().
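The original dataset isn't reproduced here, so the following is only a minimal sketch of the paring-down step. The data frame movies_toy, its made-up rows, and the column names name, score, millions, and notes are all assumptions for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the movies data: 12 made-up rows plus a column to drop
movies_toy <- tibble(
  name     = paste("Movie", LETTERS[1:12]),
  score    = c(35, 48, 52, 61, 66, 70, 74, 79, 83, 88, 91, 95),
  millions = c(12, 30, 45, 51, 60, 72, 88, 95, 110, 150, 210, 300),
  notes    = "not needed"
)

movies_small <- movies_toy %>%
  select(name, score, millions) %>%  # keep only the columns of interest
  slice(1:10)                        # keep the first 10 rows

movies_small
```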

How do I deal with missing values?

You can see that the revenue (in millions) for the movie “2 Fast 2 Furious” is NA (missing). So the following occurs when computing the median revenue:
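As a minimal sketch of that behavior, here is a hypothetical toy data frame (movies_toy, with an assumed revenue column called millions) in which only the NA for “2 Fast 2 Furious” mirrors the text; the other rows are made up.

```r
library(dplyr)

# Hypothetical toy data; the revenue values are invented for illustration
movies_toy <- tibble(
  name     = c("2 Fast 2 Furious", "Movie A", "Movie B", "Movie C"),
  millions = c(NA, 40, 100, 250)
)

median_with_na <- movies_toy %>%
  summarize(median_revenue = median(millions))

median_with_na  # median_revenue is NA because one value is missing
```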

You should always think about why a data value might be missing and what that missingness may mean. For example, imagine you are conducting a study on the effects of smoking on lung cancer and a lot of your patients’ data is missing because they died of lung cancer. If you just “sweep these patients under the rug” and ignore them, you are clearly biasing the results.

While there are statistical methods for dealing with missing data, they are beyond the scope of this class. The easiest thing to do is to remove all missing cases, but you should, at the very least, always report to the reader when you do so, since removing the missing values may bias your results.

If you decide to ignore the missing values when computing a summary, you can do so with the na.rm = TRUE argument like so:
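A minimal sketch of na.rm = TRUE, again on a hypothetical toy data frame (the name movies_toy, the column millions, and all non-missing values are assumptions):

```r
library(dplyr)

# Hypothetical toy data with one missing revenue
movies_toy <- tibble(
  name     = c("2 Fast 2 Furious", "Movie A", "Movie B", "Movie C"),
  millions = c(NA, 40, 100, 250)
)

median_no_na <- movies_toy %>%
  summarize(median_revenue = median(millions, na.rm = TRUE))

median_no_na  # the median of 40, 100, and 250 is 100
```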

If you decide you want to remove the row with the missing data, you can use the filter function like so:
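One way this could look, sketched on a hypothetical toy data frame (movies_toy and the column millions are assumed names): is.na() flags the missing values, and ! keeps the rows that are not missing.

```r
library(dplyr)

# Hypothetical toy data with one missing revenue
movies_toy <- tibble(
  name     = c("2 Fast 2 Furious", "Movie A", "Movie B", "Movie C"),
  millions = c(NA, 40, 100, 250)
)

movies_no_missing <- movies_toy %>%
  filter(!is.na(millions))  # keep only rows where millions is not NA

movies_no_missing
```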

We see “2 Fast 2 Furious” is now gone.

How do I reorder bars in a barplot?

Let’s compute the total revenue for each movie type and plot a barplot.
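A minimal sketch of this step, assuming a hypothetical toy data frame movies_toy with made-up values and an assumed revenue column millions:

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data; values are invented for illustration
movies_toy <- tibble(
  type     = c("action", "action", "comedy", "comedy", "drama", "rom comedy"),
  millions = c(200, 150, 80, 60, 120, 90)
)

revenue_by_type <- movies_toy %>%
  group_by(type) %>%
  summarize(total_revenue = sum(millions))

# geom_col() uses the already-computed total_revenue as the bar height
revenue_plot <- ggplot(revenue_by_type, aes(x = type, y = total_revenue)) +
  geom_col()

revenue_plot
```

With a character variable on the x-axis, the bars appear in alphabetical order by default.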

Say we want to reorder the categorical variable type so that the bars show in a different order. We can reorder the bars by manually defining the order of the levels in the factor() command:
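A minimal sketch of manual reordering, using a hypothetical summary table revenue_by_type with made-up totals and an arbitrary order stored in type_order (both names are assumptions):

```r
library(dplyr)
library(ggplot2)

# Hypothetical totals per movie type
revenue_by_type <- tibble(
  type          = c("action", "comedy", "drama", "rom comedy"),
  total_revenue = c(350, 140, 120, 90)
)

# An arbitrary order chosen purely for illustration
type_order <- c("rom comedy", "action", "drama", "comedy")

reordered <- revenue_by_type %>%
  mutate(type = factor(type, levels = type_order))

# The bars now follow the factor levels rather than alphabetical order
ggplot(reordered, aes(x = type, y = total_revenue)) +
  geom_col()
```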

Or, if you want to reorder type in ascending order of total_revenue, use reorder():

Or, if you want to reorder type in descending order of total_revenue, just put a - sign in front of total_revenue in reorder():
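Both directions can be sketched on a hypothetical summary table (revenue_by_type and its totals are made up). Here reorder() is applied inside mutate() so the resulting level order is easy to inspect; the same reordered column can then be mapped to the x-axis of a barplot.

```r
library(dplyr)

# Hypothetical totals per movie type
revenue_by_type <- tibble(
  type          = c("action", "comedy", "drama", "rom comedy"),
  total_revenue = c(350, 140, 120, 90)
)

ascending <- revenue_by_type %>%
  mutate(type = reorder(type, total_revenue))

descending <- revenue_by_type %>%
  mutate(type = reorder(type, -total_revenue))

levels(ascending$type)   # "rom comedy" "drama" "comedy" "action"
levels(descending$type)  # "action" "comedy" "drama" "rom comedy"
```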

For more advanced categorical variable (i.e. factor) manipulations, check out the forcats package. Note: forcats is an anagram of factors.

How do I change values inside cells?

dplyr::rename() renames columns/variables. To “rename” values inside the cells of a particular column, you need to mutate() the column using one of the three functions below. There might be other ones too, but these are the three I’ve seen the most. In these examples, we’ll change values in the variable type.

  1. ifelse()
  2. recode()
  3. case_when()

ifelse()

Replace all instances of rom comedy with romantic comedy using ifelse(). If a particular row has type == "rom comedy", then return "romantic comedy", else return whatever was originally in type. Save everything in a new variable type_new:

Do the same here, but return "not romantic comedy" if type is not "rom comedy", and this time overwrite the original type variable:
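Both ifelse() variants can be sketched as follows, on a hypothetical toy data frame (movies_toy and its rows are made up for illustration):

```r
library(dplyr)

# Hypothetical toy data
movies_toy <- tibble(
  name = c("Movie A", "Movie B", "Movie C", "Movie D"),
  type = c("rom comedy", "action", "rom comedy", "drama")
)

# Variant 1: keep the original value when the condition is FALSE,
# and store the result in a new column type_new
with_type_new <- movies_toy %>%
  mutate(type_new = ifelse(type == "rom comedy", "romantic comedy", type))

# Variant 2: return a fixed label when the condition is FALSE,
# and overwrite the original type column
overwritten <- movies_toy %>%
  mutate(type = ifelse(type == "rom comedy", "romantic comedy", "not romantic comedy"))

with_type_new
overwritten
```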

recode()

ifelse() is rather limited, however. What if we want to “rename” every value of type so that it starts with an uppercase letter? Use recode():
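A minimal sketch with recode(), where each old value is matched to its replacement as old = new. The toy data frame movies_toy and its values are assumptions. Note that recent dplyr versions mark recode() as superseded, though it still works.

```r
library(dplyr)

# Hypothetical toy data
movies_toy <- tibble(
  name = c("Movie A", "Movie B", "Movie C", "Movie D"),
  type = c("action", "comedy", "drama", "rom comedy")
)

recoded <- movies_toy %>%
  mutate(type = recode(type,
    "action"     = "Action",
    "comedy"     = "Comedy",
    "drama"      = "Drama",
    "rom comedy" = "Rom comedy"
  ))

recoded
```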

case_when()

case_when() is a little trickier, but allows you to evaluate boolean operations using ==, >, >=, &, |, etc:
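Here is a minimal sketch, assuming a hypothetical toy data frame with made-up score and millions columns and an invented output column label. Conditions are checked top to bottom, the first one that is TRUE wins, and the final TRUE acts as a catch-all.

```r
library(dplyr)

# Hypothetical toy data
movies_toy <- tibble(
  name     = c("Movie A", "Movie B", "Movie C", "Movie D"),
  type     = c("action", "comedy", "rom comedy", "drama"),
  score    = c(85, 45, 72, 90),
  millions = c(250, 30, 90, 40)
)

labelled <- movies_toy %>%
  mutate(label = case_when(
    type == "rom comedy"            ~ "romantic comedy",
    score >= 80 & millions >= 100   ~ "acclaimed hit",
    score >= 80 | millions >= 100   ~ "acclaimed or hit",
    TRUE                            ~ "other"
  ))

labelled
```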

How do I convert a numerical variable to a categorical one?

Sometimes we want to turn a numerical, continuous variable into a categorical variable. For instance, what if we wanted a variable that tells us whether a movie made a hundred million dollars or more? In other words, a binary variable, i.e. a categorical variable with 2 levels. We can again use the mutate function:
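One possible sketch, on hypothetical toy data: the new column name made_100m and its two labels are made up, and millions is an assumed revenue column.

```r
library(dplyr)

# Hypothetical toy data
movies_toy <- tibble(
  name     = c("Movie A", "Movie B", "Movie C", "Movie D"),
  millions = c(250, 30, 100, 99.9)
)

with_binary <- movies_toy %>%
  mutate(made_100m = ifelse(millions >= 100, "100 million or more", "under 100 million"))

with_binary
```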

What if you want to convert a numerical variable into a categorical variable with more than 2 levels? One way is to use the cut() command. For instance, below, we cut() the score variable, to recode it into 4 categories:

  1. 0 - 40 = bad
  2. 40.1 - 60 = so-so
  3. 60.1 - 80 = good
  4. 80.1+ = great

We set the breaking points for cutting the numerical variable with the c(0, 40, 60, 80, 100) part, and set the labels for each of these bins with the labels=c("bad","so-so","good","great") part. All this action happens inside the mutate command, so the new categorical variable scorecats is added to the data frame.
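The description above can be sketched as follows, using a hypothetical toy data frame with made-up scores:

```r
library(dplyr)

# Hypothetical toy data
movies_toy <- tibble(
  name  = c("Movie A", "Movie B", "Movie C", "Movie D"),
  score = c(35, 60, 60.1, 95)
)

scored <- movies_toy %>%
  mutate(scorecats = cut(score,
    breaks = c(0, 40, 60, 80, 100),
    labels = c("bad", "so-so", "good", "great")
  ))

scored  # scorecats: bad, so-so, good, great
```

With these defaults each interval excludes its lower bound, so a score of exactly 0 would become NA unless you also pass include.lowest = TRUE.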

Other options with the cut function:

  • By default, if a value is exactly the upper bound of an interval, it is included in the lower category (e.g. 60.0 is ‘so-so’, not ‘good’). To flip this, include the argument right = FALSE.
  • You could also have R divide the range of the variable into equal-width groups. For example, specifying breaks = 3 would create 3 groups, each spanning an interval of the same width; the number of values in each group is not necessarily balanced.
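A small sketch of both options on made-up vectors (scores and skewed are hypothetical names). It shows how boundary values move when right = FALSE is set, and that a single number for breaks yields that many equal-width bins.

```r
# Hypothetical scores that sit exactly on the bin boundaries
scores      <- c(40, 60, 80)
bin_breaks  <- c(0, 40, 60, 80, 100)
bin_labels  <- c("bad", "so-so", "good", "great")

default_bins <- cut(scores, breaks = bin_breaks, labels = bin_labels)
flipped_bins <- cut(scores, breaks = bin_breaks, labels = bin_labels, right = FALSE)

as.character(default_bins)  # "bad" "so-so" "good"
as.character(flipped_bins)  # "so-so" "good" "great"

# A single number asks for that many equal-width intervals
skewed      <- c(0, 10, 20, 90, 100)
equal_width <- cut(skewed, breaks = 3)

nlevels(equal_width)  # 3
table(equal_width)    # counts per bin are 3, 0, 2: equal widths, unequal counts
```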

How do I compute proportions?

By using a group_by() followed not by a summarize(), as is often the case, but rather by a mutate(). So say we compute the total revenue in millions for each movie rating and type:

Say that within each movie rating (G, PG, PG-13, R), we want to know the proportion of total_millions that was made by each movie type (animated, action, comedy, etc). We can:
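Both steps can be sketched on a hypothetical toy data frame (movies_toy, its values, and the output column prop are assumptions). Because the mutate() runs on data grouped by rating, sum(total_millions) is computed separately within each rating.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration
movies_toy <- tibble(
  rating   = c("PG", "PG", "PG", "R", "R", "R"),
  type     = c("animated", "animated", "comedy", "action", "action", "comedy"),
  millions = c(120, 80, 50, 200, 100, 100)
)

# Step 1: total revenue for each rating and type combination
totals <- movies_toy %>%
  group_by(rating, type) %>%
  summarize(total_millions = sum(millions), .groups = "drop")

# Step 2: group by rating only, then mutate() to get within-rating proportions
props <- totals %>%
  group_by(rating) %>%
  mutate(prop = total_millions / sum(total_millions)) %>%
  ungroup()

props  # PG: 0.8 and 0.2; R: 0.75 and 0.25
```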

So, for example, the 4 proportions corresponding to R rated movies sum to 1 (up to rounding): 0.596 + 0.142 + 0.213 + 0.0491 ≈ 1.

How do I deal with %, commas, and $?

Say you have numerical data that are recorded as percentages, have commas, or are in dollar form, and hence are stored as character strings. How do you convert these to numerical values? Using readr::parse_number() inside a mutate()! Shout out to Stack Overflow.
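The original code isn't reproduced here, so this is only a sketch: the data frame sales_toy and its column price are hypothetical, and the three single-string inputs at the end are guesses chosen to be consistent with the output shown.

```r
library(dplyr)
library(readr)

# Hypothetical data frame whose numbers were read in as character strings
sales_toy <- tibble(price = c("$1,234.5", "$99", "$10"))

sales_num <- sales_toy %>%
  mutate(price_num = parse_number(price))  # strips $ and , and returns numbers

# The same function applied to single strings (inputs are assumptions)
parse_number("10.5%")
parse_number("145,897")
parse_number("$1,234.5")
```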

## [1] 10.5
## [1] 145897
## [1] 1234.5

What about the other way around? Use the scales package!
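A minimal sketch using three formatting functions from scales. The numeric inputs are assumptions chosen to be consistent with the output shown; note that percent() expects a proportion, so 0.1 is formatted as 10%.

```r
library(scales)

# Inputs are assumptions for illustration
percent(0.1)     # proportion to percentage string
comma(145897)    # adds a thousands separator
dollar(1234.5)   # adds a dollar sign, separator, and cents
```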

## [1] "10%"
## [1] "145,897"
## [1] "$1,234.50"

Congratulations. You are now an R Ninja