forked from idc9/stor390
-
Notifications
You must be signed in to change notification settings - Fork 0
/
index.Rmd
217 lines (144 loc) · 11.2 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
---
title: "**STOR 390: Introduction to Data Science**"
output:
html_document:
theme: cosmo
toc: yes
toc_float: yes
---
This course is an application-driven introduction to data science. Statistical and computational tools are valued throughout the modern workplace from Silicon Valley startups, to marine biology labs, to Wall Street firms. These tools require technical skills such as programming and statistics. They also require professional skills such as communication, teamwork, problem solving, and critical thinking.
You will learn these tools and hone these skills through hands-on experience working with datasets such as: Museum of Modern Art records, TCGA Gene Expressions and the text script of Beauty and the Beast. The first half of the semester will cover R programming skills. The second half will cover a number of topics: exploratory data analysis, web scraping, text processing, and effective visualization through a series of modules.
- Instructor: [Iain Carmichael](https://idc9.github.io/)
- Instructional Assistant: [Brendan Brown](http://stat-or-old.oasis.unc.edu/people/graduate-students/brendan-brown)
- Graduate Research Consultant: [Varun Goel](https://varungoel.web.unc.edu/)
See the **[course syllabus](https://idc9.github.io/stor390/course_info/syllabus.html)** for more information.
# **Course Material**
| Date | Lecture | Notes | Slides |
|------|---------|-------|-------|
|January 12 |install R, basic commands | [getting started](notes/getting_started/getting_started.html) | [welcome](slides/welcome.pdf)|
|January 17 | R Markdown, working directory, R projects |[workflow](notes/workflow/workflow.html) | |
|January 19 | select, filter, mutate | [dplyr](notes/dplyr/dplyr.html)| |
|January 24 | pipes, group_by, summarise| [dplyr](notes/dplyr/dplyr.html)| [dplyr2](slides/dplyr2.html)|
| January 26 | if/else, loops, functions|[intro programming](notes/programming/prog_notes.html) |[intro prog](slides/intro_prog.html) |
| January 31 |vectors, lists and eda |[more prog](notes/programming/prog_notes.html) and [EDA](notes/EDA/EDA.html) | [intro prog](slides/intro_prog.html) and [EDA](slides/EDA.html)|
|February 2| spread, gather |[tidy data](notes/tidy_data/tidy_data.html) | |
|February 7| inner, outer joins ([Marshall Markham](https://www.linkedin.com/in/marshall-markham-840ab24a/)) | | [joins](slides/joins.pdf)|
|February 9| match, extract, replace with `stringr`| [regular expressions](notes/strings/strings.html) |[regex](slides/regex.html) |
|February 14| look ahead/behinds|[regular expressions](notes/strings/strings.html) | |
|February 16| | | |
|February 21| least squares, factors, `lm()`| [linear regression](notes/linear_regression/linear_regression.html)| |
|February 23|test set, nonlinear, interactions|[predictive modeling](notes/predictive_modeling/predictive_modeling.html) | [modeling](slides/predictive_modeling.html) |
|February 28| exploratory data analysis| | |
|March 2| more exploratory analysis| | |
|March 7| nearest centroid, KNN | [classification](notes/classification/classification.html)|[classification](slides/classification.html) |
|March 9| cross-validation| [cross-validation](notes/cross_validation/cross_validation.html) | [cv](slides/cv_slides.html)|
|March 21|support vector machine, kernels| [more classification](notes/more_classification/more_classification.html) | |
|March 23| |[more classification](notes/more_classification/more_classification.html) |[more classification](slides/slides_more_classification.html) |
|March 28| more classification| [custom viz](notes/custom_viz/custom_viz.html) | |
|March 30| APIs ([Marshall Markham](https://www.linkedin.com/in/marshall-markham-840ab24a/))|[APIs](notes/API/APIs.html) | |
|April 4| | | |
|April 6| [Shiny](https://shiny.rstudio.com/) ([Frances Tong](https://www.linkedin.com/in/francestong)| | |
|April 11| guest Dan Yang | | |
|April 13| | | |
|April 18| guest Lucia Gjeltema| | |
|April 20| guest [Ryan Thornburg](https://ryanthornburg.com/) | | |
|April 25| | | |
|April 27| | | |
- the datasets use in the class are on [data.world](https://data.world/iain/stor-390/) and [github](https://github.com/idc9/stor390/tree/master/data/)
- all of the course material is on the [github repo](https://github.com/idc9/stor390) including
- [.Rmd files](https://github.com/idc9/stor390/tree/master/notes/) for the notes
- [example code](https://github.com/idc9/stor390/tree/master/example_code/) are also on github
- most of the course material is in the lecture notes (linked to above) and reading -- the slides are visual aids for the lectures.
- options for [extra credit](course_info/extra_credit.html)
# **Reading**
Readings should be complete by the following class (r4ds = [R for Data Science](http://r4ds.had.co.nz/))
**January 12**
- read through the [getting started notes](https://idc9.github.io/stor390/notes/getting_started/getting_started.html)
- read [before we start](http://www.datacarpentry.org/R-ecology-lesson/00-before-we-start.html) from data carpentry
- sections 1, 2, 3.1-3.5 of r4ds
- I suggest copying/pasting and running some of the example code
- [Data science done well looks easy - and that is a big problem for data scientists](http://simplystatistics.org/2015/03/17/data-science-done-well-looks-easy-and-that-is-a-big-problem-for-data-scientists/)
- This tutorial may be helpful for [getting started with R Markdown](http://stat545.com/block007_first-use-rmarkdown.html)
- (Optional) [For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights](https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html) and [David Mimno's response](http://www.mimno.org/articles/carpentry/)
- [big data is like teenage sex](https://www.facebook.com/dan.ariely/posts/904383595868)
**January 17**
- read sections 3.5 - 3.10 ([data viz](http://r4ds.had.co.nz/data-visualisation.html)) and section 8 ([workflow](http://r4ds.had.co.nz/workflow-projects.html)) from r4ds
- read [Reproducibility is not just for researchers](http://www.dataschool.io/reproducibility-is-not-just-for-researchers/)
- (optionally) [The real reason reproducible research is important](http://simplystatistics.org/2014/06/06/the-real-reason-reproducible-research-is-important/) from Simply Statistics
**January 19**
- read section 5 ([data transformation](http://r4ds.had.co.nz/transform.html)) from r4ds
- (optionally) [the dplyr flights vignettes](https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html)
**January 24**
- read section 18 ([pipes](http://r4ds.had.co.nz/pipes.html)) and sections 19.1-19.4 ([functions](http://r4ds.had.co.nz/functions.html)) from r4ds
**January 26**
- read section 19.5-7 and sections 21.1-21.3 ([loops](http://r4ds.had.co.nz/iteration.html)) from r4ds
**January 31**
- read r4ds section 12 ([tidy data](http://r4ds.had.co.nz/tidy-data.html))
**February 2**
- read r4ds section 13.1-13.4 ([relational data](http://r4ds.had.co.nz/relational-data.html))
- read [about non-tidy data](http://simplystatistics.org/2016/02/17/non-tidy-data/)
- (optional) [how to share data with a statistician](https://github.com/jtleek/datasharing)
**February 7**
- read r4ds chapter 14 ([strings](http://r4ds.had.co.nz/strings.html#introduction-8))
- (optional) read the rest of the relational data chapter (13.5-13.7)
**February 16**
- read r4ds section 22, 23.1-23.3 ([models](http://r4ds.had.co.nz/model-intro.html))
**February 21**
- read r4ds sections 23.4-23.6 ([models](http://r4ds.had.co.nz/model-intro.html))
- alternate sources for regression (optional)
- [ISLR](http://www-bcf.usc.edu/~gareth/ISL/) sections 3.1-3.2
- [Machine Learning for Hackers](https://www.amazon.com/Machine-Learning-Hackers-Studies-Algorithms/dp/1449303714) chapter 5
- [Machine Learning a Probabilistic Perspective]() chapter 7
**February 23**
- read [ISLR](http://www-bcf.usc.edu/~gareth/ISL/) sections 2.1-2.2 for an overview of modeling
- read r4ds [chapter 24](http://r4ds.had.co.nz/model-building.html)
**February 28**
- read [ISLR](http://www-bcf.usc.edu/~gareth/ISL/) sections 6.1-6.1 on variable selection
**March 2**
- make sure you have read and understand linear regression and model selection from [ISLR](http://www-bcf.usc.edu/~gareth/ISL/) i.e. section 2.1, 2.2, 3.1, 3.2, 6.1, 6.2
- read ISLR sections 4.1-4.3 about classification
**March 9**
- read [What is artificial intelligence?](http://simplystatistics.org/2017/01/19/what-is-artificial-intelligence/) by Jeff Leek
**March 21**
- read [An example that isn't that artificial or intelligent
](http://simplystatistics.org/2017/01/20/not-artificial-not-intelligent/)
- be prepared to discuss this article and the other AI article from Jeff Leek in class on Thursday
- read [ISLR sections 9.1-9.2](http://www-bcf.usc.edu/~gareth/ISL/) on support vector machines
**March 28**
- read [ISLR section 9.3, 9.4](http://www-bcf.usc.edu/~gareth/ISL/) on SVMs.
- read [ISLR section 5.1](http://www-bcf.usc.edu/~gareth/ISL/) on cross validation.
- read through the [custom viz notes](https://idc9.github.io/stor390/notes/custom_viz/custom_viz.html)
# **Coding**
The due date is in the link.
| Assigned | Labs | Assignments | In class exercises |
|----------|------|-------------|--------------------|
| January 12 | [data.gov lab 1](labs/1/gov_data.html) | | |
| January 17 |[reproducible data.gov lab 2](labs/2/reproducible_gov_data.html) | | [command line](class_exercises/command_line/command_line.html) and [swirl](class_exercises/swirl/swirl.html)|
| January 19 | |[dplyr and unc departments](assignments/unc_depts/unc_depts.html) | |
| January 26 |[prog lab 3](labs/3/prog_lab.html) | | |
| February 2 | [whales and tidy data lab 4](labs/4/tidy_lab.html) | | |
| February 7 |[joins lab 5](labs/5/joins_lab.html)|| |
| February 9 |[strings lab 6](labs/6/strings_lab.html)| | |
| February 16 | | [harry potter](assignments/harry_potter/harry_potter.html)| |
| February 23 |[Ira Glass on overfitting](labs/7/overfitting_with_scissors) | | |
| February 28 | | [bikes, EDA and predictive modeling](assignments/bikes/bike_sharing.html) |
| March 21 | | [classification: does your iPhone know what you're doing?](assignments/iphone/classification.html) |
| March 21 | | [extra credit: naive bayes](assignments/iphone/naive_bayes.html) | |
| March 23 | | |[What is AI?](class_exercises/AI/what_is_AI.html) |
| March 30 |[APIs](labs/8/moives_api.html) | | |
# **Final project**
See [**this document for the final project description**](final_project/description.html).
| Deliverable | Due Date |
| ------------|----------|
|[proposal](final_project/proposal.html) | 4/6 |
|exploratory analysis | 4/20|
|techincal documents | 5/2|
|blog post | 5/9 |
I've posted some ideas for project ideas in [this document](final_project/project_ideas.html).
# **Additional resources**
- [the extra resources page](resources/extra_resources.html) conatins a number of books, blogs, MOOCS and other courses about data science
- [a collection of lot's of datasets](resources/dataset_collection.html)
- [finding a job/internship](resources/jobs.html)
# **Miscellaneous**
This course was made possible by a grant from the [Data@Carolina](http://data.web.unc.edu/) initiative and a ton of [**input from lots of very smart people**](course_info/acknowledgments.html).
This page was last updated on `r Sys.time()` Eastern Time.