Let’s look at a random sample of 5 of the movies:
| title | budget | rating | 
|---|---|---|
| Swamp Thing | 3000000 | 5.0 | 
| Incident at Loch Ness | 1400000 | 6.4 | 
| Princess and the Pirate, The | 2000000 | 6.7 | 
| Prime Time, The | 150000 | 1.5 | 
| Monty Python and the Holy Grail | 250000 | 8.4 | 
Both variables are numerical. Here are the components of the Grammar of Graphics:
| `data` variable | `aes()`thetic attribute | `geom_`etric object | 
|---|---|---|
| budget | x | point | 
| rating | y | point | 
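The table above can be translated almost word for word into ggplot2 code. The following is a minimal sketch that rebuilds only the five sampled rows shown earlier; the names `movies_sample` and `scatter_plot` are hypothetical, and the full data set is assumed to have the same `budget` and `rating` columns.

```r
library(ggplot2)

# Only the five sampled movies from the table above, not the full data set.
movies_sample <- data.frame(
  title = c("Swamp Thing", "Incident at Loch Ness",
            "Princess and the Pirate, The", "Prime Time, The",
            "Monty Python and the Holy Grail"),
  budget = c(3000000, 1400000, 2000000, 150000, 250000),
  rating = c(5.0, 6.4, 6.7, 1.5, 8.4)
)

# data variable -> aesthetic: budget maps to x, rating maps to y.
# Geometric object: one point per movie.
scatter_plot <- ggplot(data = movies_sample,
                       mapping = aes(x = budget, y = rating)) +
  geom_point()

print(scatter_plot)
```

With only five points no trend can be judged; the same call on the full data set is what the question below refers to.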
Does spending more on a movie yield higher IMDB ratings?
Let’s look at a random sample of 5 of the dates:
| date | n | 
|---|---|
| 2013-01-13 | 828 | 
| 2013-01-07 | 933 | 
| 2013-01-19 | 674 | 
| 2013-01-25 | 922 | 
| 2013-01-29 | 890 | 
Both variables are numerical (dates are stored internally as numbers, so they can be treated as numerical). Here are the components of the Grammar of Graphics:
| `data` variable | `aes()`thetic attribute | `geom_`etric object | 
|---|---|---|
| date | x | line | 
| n | y | line | 
Note: Why did we use line as the `geom_`etric object? Because a line connects consecutive observations and so suggests a sequence over time, whereas isolated points do not.
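As a rough illustration of the mapping above, here is a sketch using only the five sampled dates. The names `flights_sample` and `line_plot` are hypothetical. Converting the strings with `as.Date()` is what lets ggplot2 treat `date` as a continuous, ordered variable.

```r
library(ggplot2)

# Only the five sampled dates from the table above, not the full data set.
flights_sample <- data.frame(
  date = as.Date(c("2013-01-13", "2013-01-07", "2013-01-19",
                   "2013-01-25", "2013-01-29")),
  n = c(828, 933, 674, 922, 890)
)

# Dates are stored as a count of days, which is why they behave numerically.
date_as_number <- as.numeric(flights_sample$date)

# date maps to x, n maps to y; geom_line() connects the observations
# in order of x, regardless of the row order in the data frame.
line_plot <- ggplot(data = flights_sample,
                    mapping = aes(x = date, y = n)) +
  geom_line()

print(line_plot)
```

Swapping `geom_line()` for `geom_point()` would draw the same five observations without the visual cue of sequence.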
Why are there drops in the number of flights?
Let’s look at a random sample of 5 of the car year/make/model matchings:
| name | trans | hwy | 
|---|---|---|
| 1996 Acura NSX | Manual | 22 | 
| 2013 Buick LaCrosse eAssist | Automatic | 36 | 
| 1996 Chevrolet C1500 Pickup 2WD | Manual | 18 | 
| 2002 Volkswagen Jetta Wagon | Manual | 26 | 
| 1984 Chevrolet G10/20 Sport Van 2WD | Automatic | 15 | 
The transmission type (trans) is categorical, whereas hwy is numerical. Here are the components of the Grammar of Graphics:
| `data` variable | `aes()`thetic attribute | `geom_`etric object | 
|---|---|---|
| trans | x | boxplot | 
| hwy | y | boxplot | 
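A sketch of this mapping, again built from just the five sampled rows (the names `vehicles_sample` and `box_plot` are hypothetical), might look like the following. With so few rows the boxes are not informative; the point is only to show how a categorical x and a numerical y combine with `geom_boxplot()`.

```r
library(ggplot2)

# Only the five sampled car models from the table above, not the full data set.
vehicles_sample <- data.frame(
  name = c("1996 Acura NSX", "2013 Buick LaCrosse eAssist",
           "1996 Chevrolet C1500 Pickup 2WD", "2002 Volkswagen Jetta Wagon",
           "1984 Chevrolet G10/20 Sport Van 2WD"),
  trans = c("Manual", "Automatic", "Manual", "Manual", "Automatic"),
  hwy = c(22, 36, 18, 26, 15)
)

# trans (categorical) maps to x, hwy (numerical) maps to y;
# geom_boxplot() draws one box per transmission type.
box_plot <- ggplot(data = vehicles_sample,
                   mapping = aes(x = trans, y = hwy)) +
  geom_boxplot()

print(box_plot)
```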
About what proportion of manual car models sold between 1984 and 2015 got 20 mpg or less?
Let’s look at all the data:
| name | n | 
|---|---|
| Carlos | 155711 | 
| Ethan | 359506 | 
| Hayden | 105716 | 
The variable name is categorical, and n is the count for each name. Here are the components of the Grammar of Graphics:
| `data` variable | `aes()`thetic attribute | `geom_`etric object | 
|---|---|---|
| name | x | bar | 
| n | y | bar | 
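Because the counts in `n` are already tabulated, the bar heights come straight from the data rather than being counted by ggplot2. One way to sketch this, assuming a hypothetical data frame `names_counts` holding the three rows above, is to use `geom_bar()` with `stat = "identity"`:

```r
library(ggplot2)

# The three rows shown in the table above.
names_counts <- data.frame(
  name = c("Carlos", "Ethan", "Hayden"),
  n = c(155711, 359506, 105716)
)

# name maps to x, n maps to y. stat = "identity" tells geom_bar() to use
# the values of n as bar heights instead of counting rows per name.
bar_plot <- ggplot(data = names_counts,
                   mapping = aes(x = name, y = n)) +
  geom_bar(stat = "identity")

print(bar_plot)
```

`geom_col()` is a shorthand for the same thing; `geom_bar()` is used here only to match the "bar" entry in the table.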
About how many babies were named “Hayden” between 1990 and 2014?
Let’s look at a random sample of 5 of the users:
| sex | height | 
|---|---|
| f | 65 | 
| m | 75 | 
| m | 65 | 
| f | 64 | 
| m | 69 | 
Height is numerical. Here are the components of the Grammar of Graphics:
| `data` variable | `aes()`thetic attribute | `geom_`etric object | 
|---|---|---|
| height | x | histogram | 
Note: There is no explicit y aesthetic here because no variable in the data maps to it. As we’ll see later, the height of each bar (the count of observations in that bin) is computed internally.
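The internally computed y values can be inspected directly. The sketch below uses only the five sampled users; the names `profiles_sample`, `hist_plot`, and `bin_counts` are hypothetical, and the bin width of 1 is an arbitrary choice made for illustration.

```r
library(ggplot2)

# Only the five sampled users from the table above, not the full data set.
profiles_sample <- data.frame(
  sex = c("f", "m", "m", "f", "m"),
  height = c(65, 75, 65, 64, 69)
)

# Only x is mapped; geom_histogram() bins height and counts rows per bin.
hist_plot <- ggplot(data = profiles_sample,
                    mapping = aes(x = height)) +
  geom_histogram(binwidth = 1)

# layer_data() returns the data ggplot2 computed for the layer,
# including the per-bin counts used as bar heights.
bin_counts <- layer_data(hist_plot)$count

print(hist_plot)
```

The counts in `bin_counts` add up to the number of rows in the data, which is one way to confirm that y is derived from x rather than supplied by the user.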
What are the smallest and largest visible heights and what do you think of them? Also, think of one graph improvement to better convey information about SF OkCupid users.