3.1

Learning Check

Consider the following data on the prices of three stocks (named x, y, and z) over 5 days. This data is not in tidy data format. How would you reformat it so that it is?

date x y z
2009-01-01 0.318 0.287 -2.292
2009-01-02 -0.173 3.169 0.938
2009-01-03 -0.081 -0.999 -4.668
2009-01-04 -0.521 -0.900 -7.417
2009-01-05 -0.463 0.026 -1.009

Solution

We want

  • Each row to represent one value, in this case one stock price
  • Each column to represent one variable of information. In our case, we have three: date, price, and the name of the stock

“Tidy data” format is also known as long format. The original data, by contrast, was in wide format.

date stock_name price
2009-01-01 x 0.318
2009-01-02 x -0.173
2009-01-03 x -0.081
2009-01-04 x -0.521
2009-01-05 x -0.463
2009-01-01 y 0.287
2009-01-02 y 3.169
2009-01-03 y -0.999
2009-01-04 y -0.900
2009-01-05 y 0.026
2009-01-01 z -2.292
2009-01-02 z 0.938
2009-01-03 z -4.668
2009-01-04 z -7.417
2009-01-05 z -1.009
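
The reshaping above can also be done in code. The following is a minimal sketch, assuming the tidyr and dplyr packages are installed; the data frame names stocks and stocks_tidy are made up for illustration, and pivot_longer() is only one of several ways to go from wide to long format.

```r
library(tidyr)
library(dplyr)

# The wide-format data from the question (one column per stock).
stocks <- data.frame(
  date = as.Date("2009-01-01") + 0:4,
  x = c(0.318, -0.173, -0.081, -0.521, -0.463),
  y = c(0.287, 3.169, -0.999, -0.900, 0.026),
  z = c(-2.292, 0.938, -4.668, -7.417, -1.009)
)

# Gather the three stock columns into two columns:
# stock_name holds the old column name, price holds the value.
stocks_tidy <- stocks %>%
  pivot_longer(cols = c(x, y, z), names_to = "stock_name", values_to = "price") %>%
  arrange(stock_name, date)  # order rows as in the table above

print(stocks_tidy)
```

Each of the 5 dates now appears once per stock, giving 15 rows and 3 columns.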

3.2

Learning Check

What among the following choices does any ONE row in this flights dataset refer to?

  • A. Data on an airline
  • B. Data on a flight
  • C. Data on an airport
  • D. Data on multiple flights

Solution

Run the following in your console to View() the flights data set again:

library(nycflights13)
data(flights)
View(flights)

Each row represents data on one flight.

3.3-3.7

Learning Check

  • How many different columns are in this dataset?
  • How many different rows are in this dataset?
  • What are some other examples in this dataset of numerical variables?
  • What are some other examples in this dataset of categorical variables? What makes them different than quantitative variables?
  • What do you think int, dbl, chr, and dttm mean in the output above?

Solution

This is where the glimpse() function from the dplyr package for data manipulation comes in handy. Remember, to use this function you need to load the dplyr package first.

library(dplyr)
glimpse(flights)
## Observations: 336,776
## Variables: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
  • There are 19 columns (i.e. variables) and 336,776 rows (i.e. observations, which in this case are individual flights).
  • Numerical variables include: any variables of type
    • <int>: integer AKA whole numbers like 1, 2, 3. Used for counting discrete quantities. Ex: year, day, etc.
    • <dbl>: double AKA decimal numbers like 1.0, 2.0, 3.1. Used for measuring continuous quantities, like height, weight, time. Other programming languages refer to this as “floats”. Ex: air_time, dep_delay.
  • Categorical variables include (in this case) any variables of type <chr>: character AKA text data. Ex: carrier, tailnum, and dest. Categorical variables distinguish groups, i.e. categories.
  • <dttm> is a special kind of “date/time” variable. We won’t cover these in depth in this class; we leave that to a more advanced class.
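
The same counts can be checked without reading the glimpse() output by eye. This is a rough sketch using base R only, assuming the nycflights13 package is installed; the names n_rows, n_cols, col_types, and type_counts are made up for illustration. Note that base R reports class names rather than the abbreviations glimpse() prints: "integer" for int, "numeric" for dbl, "character" for chr, and "POSIXct" for dttm.

```r
library(nycflights13)

# Dimensions of the data frame.
n_rows <- nrow(flights)
n_cols <- ncol(flights)

# First class name of every column (date/time columns have two classes,
# so we keep only the first one).
col_types <- sapply(flights, function(col) class(col)[1])

# How many columns there are of each type.
type_counts <- table(col_types)
print(type_counts)
```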

Note: This dataset is sloppily coded:

  1. flight is coded as <int>, but it is not a numerical variable. It is a categorical variable whose labels happen to be numbers. Ex: a flight numbered “400” is not “twice as much” as a flight numbered “200”.
  2. Some of the time-related variables are coded as <int> and others as <dbl>. R treats them the same when doing math, so it’s not a big deal, but to be thorough, we should treat anything time-related as <dbl>, i.e. you measure time, you don’t count it.
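
If you wanted to tidy up this coding yourself, one possible approach is sketched below. It assumes the dplyr and nycflights13 packages are installed; flights_recoded is a made-up name, and storing flight as character text is just one reasonable choice for a categorical variable.

```r
library(dplyr)
library(nycflights13)

flights_recoded <- flights %>%
  mutate(
    # flight numbers are labels, not quantities, so store them as text
    flight = as.character(flight),
    # treat the time-related <int> columns as <dbl>, like the other time columns
    dep_time = as.double(dep_time),
    sched_dep_time = as.double(sched_dep_time),
    arr_time = as.double(arr_time),
    sched_arr_time = as.double(sched_arr_time)
  )

glimpse(flights_recoded)
```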

3.8-3.9

Learning Check

  • For each of the datasets weather, planes, airports, and airlines, load them using the data() function and then View() them. Identify what the observational unit is.
  • Say you wanted to see if weather patterns are associated with delays. Sketch out how you would use the flights and weather data frames to test this.

Solution

The observational units are:

  • weather: year/month/day/hour for one of the three NYC airports
  • planes: a physical aircraft
  • airports: an airport
  • airlines: a carrier

If you want to see which weather patterns are associated with delays, you need to join (AKA merge) the flights data set with the weather data set. We leave specifics until later, but for now, we need to key the join by the following variables to match the two data sets:

View(flights)
View(weather)
  • In flights: year, month, day, hour, origin
  • In weather: year, month, day, hour, origin
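
As a preview, here is a minimal sketch of what such a join could look like, assuming dplyr's left_join() is used and the nycflights13 package is installed. The name flights_weather is made up for illustration, and a left join is only one of several join types that could be chosen here.

```r
library(dplyr)
library(nycflights13)

# Keep every flight and attach the weather recorded at its origin airport
# for the same year, month, day, and hour.
flights_weather <- flights %>%
  left_join(weather, by = c("year", "month", "day", "hour", "origin"))

# Delay and weather variables now sit side by side in each row.
flights_weather %>%
  select(origin, year, month, day, hour, dep_delay, temp, wind_speed, precip) %>%
  head()
```

Any other column that appears in both data frames but is not listed in by is kept twice, with .x and .y suffixes. Flights with no matching weather record get NA in the weather columns.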