Consider the following data of the price of three stocks (with names x, y, z) over 5 days. This data is not in tidy data format. How would you re-format it so that it is?
date | x | y | z |
---|---|---|---|
2009-01-01 | 0.318 | 0.287 | -2.292 |
2009-01-02 | -0.173 | 3.169 | 0.938 |
2009-01-03 | -0.081 | -0.999 | -4.668 |
2009-01-04 | -0.521 | -0.900 | -7.417 |
2009-01-05 | -0.463 | 0.026 | -1.009 |
We want the data in the following form. “Tidy data” format is also known as long format; the original data was in wide format.
date | stock_name | price |
---|---|---|
2009-01-01 | x | 0.318 |
2009-01-02 | x | -0.173 |
2009-01-03 | x | -0.081 |
2009-01-04 | x | -0.521 |
2009-01-05 | x | -0.463 |
2009-01-01 | y | 0.287 |
2009-01-02 | y | 3.169 |
2009-01-03 | y | -0.999 |
2009-01-04 | y | -0.900 |
2009-01-05 | y | 0.026 |
2009-01-01 | z | -2.292 |
2009-01-02 | z | 0.938 |
2009-01-03 | z | -4.668 |
2009-01-04 | z | -7.417 |
2009-01-05 | z | -1.009 |
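One way to carry out this wide-to-long reshaping in R is sketched below. The solution above does not name a function, so the use of pivot_longer() from the tidyr package (version 1.0.0 or later) is an assumption, as are the object names stocks_wide and stocks_long. The values are copied from the wide table above.

```r
library(tidyr)

# The wide-format data from the table above: one column per stock
stocks_wide <- data.frame(
  date = as.Date("2009-01-01") + 0:4,
  x = c(0.318, -0.173, -0.081, -0.521, -0.463),
  y = c(0.287, 3.169, -0.999, -0.900, 0.026),
  z = c(-2.292, 0.938, -4.668, -7.417, -1.009)
)

# Stack the x, y, z columns: column names go to stock_name, values to price
stocks_long <- pivot_longer(
  stocks_wide,
  cols = c(x, y, z),
  names_to = "stock_name",
  values_to = "price"
)

# pivot_longer() returns rows date by date; sort by stock, then date,
# to match the row order of the long table above
stocks_long <- stocks_long[order(stocks_long$stock_name, stocks_long$date), ]

print(stocks_long)
```

Each of the 5 dates now appears once per stock, giving 15 rows with one price per row.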
What among the following choices does any ONE row in this flights dataset refer to?
Run the following in your console to View() the flights data set again:
library(nycflights13)
data(flights)
View(flights)
Each row represents data on one flight.
What do int, dbl, chr, and dttm mean in the output above?
This is where the glimpse() function within the dplyr package for data manipulation is handy. Remember, to use this function you need to load the dplyr package first.
library(dplyr)
glimpse(flights)
## Observations: 336,776
## Variables: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
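The type tags in the glimpse() output can also be checked one column at a time with base R. The sketch below is not part of the original solution: it builds a tiny stand-in data frame (the hypothetical name mini) from the first two rows shown above, so it runs without nycflights13 installed. The "UTC" time zone is an assumption made only to keep the example reproducible.

```r
# A tiny stand-in for four flights columns, values copied from the output above
mini <- data.frame(
  year = c(2013L, 2013L),      # the L suffix makes these integers
  dep_delay = c(2, 4),         # plain numbers are doubles
  carrier = c("UA", "UA"),     # quoted text is character
  time_hour = as.POSIXct(
    c("2013-01-01 05:00:00", "2013-01-01 05:00:00"),
    tz = "UTC"                 # assumed time zone, for illustration only
  ),
  stringsAsFactors = FALSE
)

typeof(mini$year)       # "integer"
typeof(mini$dep_delay)  # "double"
typeof(mini$carrier)    # "character"

# Date-times are stored as POSIXct objects; glimpse() abbreviates this as <dttm>
inherits(mini$time_hour, "POSIXct")  # TRUE
```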
<int>: integer AKA whole numbers like 1, 2, 3. Used for counting discrete quantities. Ex: year, day, etc.
<dbl>: double AKA decimal numbers like 1.0, 2.0, 3.1. Used for measuring continuous quantities, like height, weight, time. Other programming languages refer to these as “floats”. Ex: air_time, dep_delay.
<chr>: character AKA text data. Ex: carrier, tailnum, and dest. Categorical variables separate observations into groups, i.e. categories.
<dttm>: a special kind of “date/time” variable. We won’t go in depth with these in this class, but leave them to a more advanced class.
Note: this dataset is sloppily coded:
flight is coded as <int>, but it is not a numerical variable; rather, it is a categorical variable whose labels happen to be numbers. Ex: a flight numbered “400” is not “twice as much” as a flight numbered “200”.
Some time-related variables are coded as <int> and others as <dbl>. R treats them the same when doing math, so it’s not a big deal, but to be thorough, we should treat anything time related as <dbl>, i.e. you measure time, not count time.
For the other data sets weather, planes, airports, and airlines, load them using the data() function and then View() them. Identify what the observational unit is.
flights and weather data frames to test this?
The observational units are:
weather: year/month/day/hour for one of the three NYC airports
planes: a physical aircraft
airports: an airport
airlines: a carrier
If you want to see which weather patterns are associated with delays, you need to join (AKA merge) the flights data set with the weather data set. We leave specifics until later, but for now, we need to key the join by the following variables to match the two data sets:
View(flights)
View(weather)
flights: year, month, day, hour, origin
weather: year, month, day, hour, origin
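As a preview of how these key variables might be used, here is a minimal sketch of such a join. It is not the method the course goes on to teach: the choice of left_join() from dplyr is an assumption, and the two small data frames (the hypothetical names flights_mini and weather_mini) stand in for the real data sets. The flight rows reuse values from the glimpse() output above, while the temp values are made-up placeholders, not real weather readings.

```r
library(dplyr)

# Three flights, using the key columns and dep_delay from the first rows above
flights_mini <- data.frame(
  year = c(2013L, 2013L, 2013L),
  month = c(1L, 1L, 1L),
  day = c(1L, 1L, 1L),
  hour = c(5, 5, 5),
  origin = c("EWR", "LGA", "JFK"),
  dep_delay = c(2, 4, 2)
)

# One weather record per airport and hour; temp values are placeholders
weather_mini <- data.frame(
  year = c(2013L, 2013L, 2013L),
  month = c(1L, 1L, 1L),
  day = c(1L, 1L, 1L),
  hour = c(5, 5, 5),
  origin = c("EWR", "JFK", "LGA"),
  temp = c(39, 39, 40)
)

# Match each flight to the weather at its origin airport in the same hour
joined <- left_join(
  flights_mini,
  weather_mini,
  by = c("year", "month", "day", "hour", "origin")
)

print(joined)
```

All five keys are needed because a weather record is only identified by the combination of airport and hour; joining on origin alone would pair each flight with every hour recorded at that airport.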