Datasets

Neat and approachable datasets

Iris

This is the go-to dataset for pedagogical use. It’s not particularly interesting, but you should become familiar with it because many lessons depend on it.

The dataset is preloaded in R:

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
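
Beyond head(), a few base R calls give a quick feel for the data. The following is a minimal sketch, not a prescribed workflow: it uses only functions that ship with base R, and the variable name species_means is chosen here purely for illustration.

```r
# iris ships with base R (the datasets package), so nothing needs installing.
data(iris)

# Number of rows and columns
dim(iris)

# Column types: four numeric measurements plus a Species factor
str(iris)

# Number of observations recorded for each species
table(iris$Species)

# Mean of each measurement, grouped by species
# (species_means is an illustrative name, not part of the dataset)
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)
```

The formula `. ~ Species` tells aggregate() to summarize every remaining column by species, so the call does not need changing if you subset the measurement columns first.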

 

Penguins

Description and link TBD

 

Fish

Description and link TBD

 

Survey datasets

American Community Survey

An ongoing survey from the U.S. Census Bureau. Like the decennial Census, but much more detailed.

census.gov/programs-surveys/acs

 

American Time Use Survey

Measures the amount of time Americans spend on various activities in a given day. Combination of survey and diary data.

bls.gov/tus

 

The General Social Survey

Long-running survey, conducted since 1972, covering sociological and attitudinal trends in the United States. Topics include “civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.”

gss.norc.org

 

National Health and Nutrition Examination Survey

Designed to assess the health and nutritional status of adults and children in the United States. Combination of interviews, physical exams, and lab tests.

cdc.gov/nchs/nhanes

 

Great but messy data

NYC subway turnstile data

A large, messy dataset containing every entry to and exit from the NYC subway system.

web.mta.info/…/Turnstile

 

NYC cab data

Another huge dataset from the city. Contains every yellow and green cab ride. See blogger Todd Schneider’s post analyzing this dataset for ideas.

nyc.gov/…/trip record data