Neat and approachable datasets
Iris
This is the go-to dataset for teaching. It’s not particularly interesting, but you should become familiar with it because many lessons depend on it.
The dataset is preloaded in R:
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
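Beyond head(), a few base R calls give a quick feel for the data. The following is a minimal sketch that assumes a stock R installation; the variable name species_means is just an illustrative choice.

```r
# iris ships with base R (the datasets package), so nothing needs downloading.
data(iris)

# Rows and columns: 150 flowers, 5 variables.
dim(iris)

# Number of flowers recorded for each species.
table(iris$Species)

# Mean of every measurement, grouped by species.
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)
```

Each species has the same number of rows, which is part of why the dataset works well for small classification and plotting exercises.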
Penguins
Description and link TBD
Fish
Description and link TBD
Survey datasets
American Community Survey
Like the decennial Census, but conducted every year and much more detailed.
census.gov/programs-surveys/acs
American Time Use Survey
Measures the amount of time Americans spend on various activities in a given day. Combination of survey and diary data.
The General Social Survey
Long-running survey, conducted since 1972, covering sociological and attitudinal trends in the United States. Topics include “civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.”
National Health and Nutrition Examination Survey
Designed to assess the health and nutritional status of adults and children in the United States. Combination of interviews, physical exams, and lab tests.
Great but messy data
NYC subway turnstile data
A large, messy dataset containing every entry and exit to the NYC subway system.
NYC cab data
Another huge dataset from the city. Contains every yellow and green cab ride. See blogger Todd Schneider’s post analyzing this dataset for ideas.