Getting and Cleaning Data
Course page: https://class.coursera.org/getdata-034
By Jeff Leek, PhD, Roger D. Peng, PhD, Brian Caffo, PhD
- Basic concepts:
  - Find and extract raw data
  - Tidy data principles
  - Practical R packages
- Interesting datasets from: https://data.baltimorecity.gov/
- Pipeline: raw data -> processing script -> tidy data -> data analysis -> data communication
- Components of tidy data:
  - Raw data: can have multiple levels
  - Tidy data
  - A code book (metadata):
    - could be in markdown
    - should have a section called “Study design” (e.g., how the raw data was collected)
    - must have a section called “Code book”: a description of each variable and its units
  - An explicit and exact recipe to go from raw to tidy data (instruction list):
    - R script
    - input = raw data, output = processed data
    - the script takes no parameters
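
The recipe described above can be sketched as a small R script. This is only an illustration under stated assumptions: the file names, column names, and sample rows are invented for the example and do not come from the course. The point it tries to show is that all paths are fixed inside the script, so it runs from raw input to processed output without any parameters.

```r
# Hypothetical file locations, fixed inside the script so it needs no parameters.
raw_path  <- file.path(tempdir(), "raw_measurements.csv")
tidy_path <- file.path(tempdir(), "tidy_measurements.csv")

# Stand-in raw data (invented for this sketch). In a real project the raw
# file would already exist on disk and this step would be skipped.
writeLines(c(
  "Subject ID,Weight (kg),Visit Date",
  "1, 70.5 ,2015-01-03",
  "2,NA,2015-01-04",
  "2,NA,2015-01-04",
  "3, 82.0 ,2015-01-05"
), raw_path)

# Step 1: read the raw data, keeping the original column headers.
raw <- read.csv(raw_path, stringsAsFactors = FALSE, check.names = FALSE,
                strip.white = TRUE)

# Step 2: give each variable a descriptive, machine-friendly name
# (the units go into the name here and into the code book).
names(raw) <- c("subject_id", "weight_kg", "visit_date")

# Step 3: drop exact duplicate rows and parse the date column.
tidy <- unique(raw)
tidy$visit_date <- as.Date(tidy$visit_date)

# Step 4: write the processed data set.
write.csv(tidy, tidy_path, row.names = FALSE)
```

Because every step is written down in order, re-running the script on the same raw file should reproduce the same processed file, which is what makes it usable as the instruction list.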