---
title: "Data Science Preparation"
author: "[Rui Wang](http://www.rui-wang.com/)"
date: "2018/03/24"
output:
  xaringan::moon_reader:
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE)
```
# Outline
### SQL
### Probability
### Statistics
### Coding
### Machine Learning
### Case Studies
---
# SQL
--
- PRIMARY KEY
  * Uniquely identifies each record in a table
  * Cannot contain NULL values
  * Only one PRIMARY KEY per table
--
- FOREIGN KEY
  * Links two tables together
  * Can contain NULL values
  * Refers to the PRIMARY KEY of the linked table
--
- Structure
```
select ... from ...
join ... on ...
where ...
group by ... having ...
order by ...
```
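As a minimal sketch of how the keys and the clause order fit together, the Python snippet below runs one query through the standard-library `sqlite3` module. The `users` and `orders` tables and their rows are hypothetical, made up only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    user_id  INTEGER REFERENCES users(user_id),  -- foreign key, may be NULL
    amount   REAL
);
INSERT INTO users VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO orders VALUES (10, 1, 30.0), (11, 1, 20.0), (12, 2, 5.0), (13, NULL, 9.0);
""")

# Same clause order as the template: select, from, join, where, group by, having, order by
query = """
SELECT u.name, SUM(o.amount) AS total
FROM users AS u
JOIN orders AS o ON o.user_id = u.user_id
WHERE o.amount > 1
GROUP BY u.name
HAVING SUM(o.amount) > 10
ORDER BY total DESC
"""
result = conn.execute(query).fetchall()
print(result)  # [('Ann', 50.0)]

# A foreign key that points at a missing user is rejected
try:
    conn.execute("INSERT INTO orders VALUES (14, 99, 1.0)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
print(fk_rejected)  # True
```

The order with a NULL `user_id` is accepted by the table but drops out of the inner join, which is why only matched rows are aggregated.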
---
# SQL
--
- [Intro to Databases](https://lagunita.stanford.edu/courses/Engineering/db/2014_1/about)
  * [Practice Questions](https://github.com/wangruinju/SQL_Resources/blob/master/Stanford%20SQL%20practice/SQL%20exercise.Rmd)
--
- Online Platforms
  * [LeetCode Database](https://leetcode.com/problemset/database/)
  * [HackerRank](https://www.hackerrank.com/)
  * [Vertabelo Academy](https://academy.vertabelo.com/)
--
- [Analytical Cases](https://community.modeanalytics.com/sql/tutorial/sql-business-analytics-training/)
  * Investigating a Drop in User Engagement
  * Understanding Search Functionality
  * Validating A/B Test Results
--
- Reference
  * [W3Schools](https://www.w3schools.com/SQl/default.asp)
---
# SQL
- Content Type
Table columns: content_id | content_type (comment / post) | target_id
* For a comment, target_id is the user_id of the user who posted it.
* For a post, target_id is NULL.
What is the distribution of comments?
--
```
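-- count comments per target_id, then count how many target_ids share each count
select cnt, count(*) as freq
from
  (select target_id, count(*) as cnt
   from content  -- `content` stands in for the table described above
   where content_type = 'comment'
   group by target_id) a
group by cnt;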
```
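One way to sanity-check a query like this is to run it on a few rows. The sketch below assumes one reading of the question: count the comments per `target_id`, then count how many `target_id` values share each count. The table name `content` and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE content ("
    "content_id INTEGER PRIMARY KEY, content_type TEXT, target_id INTEGER)"
)
rows = [
    (1, "post", None),
    (2, "post", None),
    (3, "comment", 101),
    (4, "comment", 101),
    (5, "comment", 102),
    (6, "comment", 103),
]
conn.executemany("INSERT INTO content VALUES (?, ?, ?)", rows)

# Inner query: comments per target_id. Outer query: frequency of each count.
query = """
SELECT cnt, COUNT(*) AS freq
FROM (SELECT target_id, COUNT(*) AS cnt
      FROM content
      WHERE content_type = 'comment'
      GROUP BY target_id) AS a
GROUP BY cnt
ORDER BY cnt
"""
distribution = conn.execute(query).fetchall()
print(distribution)  # [(1, 2), (2, 1)]
```

Here two targets have one comment each and one target has two, so the distribution is `[(1, 2), (2, 1)]`.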
---
# Probability
--
- [Statistical Inference](https://www.amazon.com/Statistical-Inference-George-Casella/dp/0534243126)
  * Probability and Basic Statistics: Chapters 1-5
  * Hypothesis Testing: Chapters 6-12
--
- [MIT: Introduction to Probability and Statistics](https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/)
  * [YouTube Video List](https://www.youtube.com/playlist?list=PLUl4u3cNGP61Oq3tWYp6V_F-5jb5L2iHb)
--
- [The Green Book: A Practical Guide to Quantitative Finance Interviews](https://www.amazon.com/Practical-Guide-Quantitative-Finance-Interviews/dp/1438236662)
  * Data Science: Chapter 4
--
- [Cheat Sheet](https://static1.squarespace.com/static/54bf3241e4b0f0d81bf7ff36/t/55e9494fe4b011aed10e48e5/1441352015658/probability_cheatsheet.pdf)
  * Covers most of the key concepts
---
# Probability
--
- Cards
What is the probability of drawing a pair (two cards of the same rank) when you draw two cards from a deck of 52 cards?
--
3/51
--
What is the probability of drawing two cards of the same suit from a deck of 52 cards?
--
12/51
--
What is the probability of drawing two cards that are neither of the same suit nor a pair from a deck of 52 cards?
--
36/51
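
The three card answers can be checked by brute force. This is a small sketch that enumerates every ordered two-card draw; the rank and suit encoding is an arbitrary choice made for illustration.

```python
from fractions import Fraction
from itertools import permutations

# Encode each card as (rank, suit): 13 ranks x 4 suits = 52 cards
deck = [(rank, suit) for rank in range(13) for suit in range(4)]
draws = list(permutations(deck, 2))  # all 52 * 51 ordered two-card draws

def prob(event):
    """Exact probability of an event over all equally likely draws."""
    hits = sum(1 for first, second in draws if event(first, second))
    return Fraction(hits, len(draws))

p_pair = prob(lambda a, b: a[0] == b[0])                      # same rank
p_suit = prob(lambda a, b: a[1] == b[1])                      # same suit
p_neither = prob(lambda a, b: a[0] != b[0] and a[1] != b[1])  # neither

print(p_pair, p_suit, p_neither)  # 1/17 4/17 12/17
```

`Fraction` prints reduced values, so 1/17, 4/17 and 12/17 are the same as 3/51, 12/51 and 36/51.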
--
- Coins
You randomly draw a coin from 100 coins: 1 unfair coin (heads on both sides) and 99 fair coins (heads and tails), and toss it 10 times. If the result is 10 heads, what is the probability that the coin is unfair?
--
$\frac{1/100 \times 1}{1/100 \times 1 + 99/100 \times (1/2)^{10}} = \frac{1024}{1123}$
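
The Bayes' rule arithmetic for the coin problem is easy to get wrong by hand, so here is a short exact check with Python's `fractions` module. The variable names are only illustrative.

```python
from fractions import Fraction

prior_unfair = Fraction(1, 100)   # 1 double-headed coin out of 100
prior_fair = Fraction(99, 100)    # 99 fair coins out of 100
like_unfair = Fraction(1)         # P(10 heads | unfair coin)
like_fair = Fraction(1, 2) ** 10  # P(10 heads | fair coin)

# Bayes' rule: P(unfair | 10 heads)
evidence = prior_unfair * like_unfair + prior_fair * like_fair
posterior = prior_unfair * like_unfair / evidence

print(posterior)  # 1024/1123
```

Even after 10 heads in a row, the posterior is about 0.91 rather than near certainty, because fair coins are so much more common in the pool.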
---
# Probability
- Seattle Raining
You’re about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it’s raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that “Yes” it is raining. What is the probability that it’s actually raining in Seattle?
--
Let $p$ denote the prior probability that it is raining in Seattle. By Bayes' rule, the probability of rain given three "Yes" answers is
$$\frac{(2/3)^3 \times p}{(2/3)^3 \times p + (1/3)^3 \times (1-p)}$$
--
If $p = 1/4$, the answer is 8/11.
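
Because the answer depends on the prior $p$, it helps to treat the formula as a function of $p$. A minimal sketch follows; the function name and default arguments are illustrative assumptions.

```python
from fractions import Fraction

def p_rain_given_all_yes(p, truth=Fraction(2, 3), friends=3):
    """Posterior probability of rain when every friend independently says yes."""
    yes_if_rain = truth ** friends       # all friends tell the truth
    yes_if_dry = (1 - truth) ** friends  # all friends lie
    return yes_if_rain * p / (yes_if_rain * p + yes_if_dry * (1 - p))

print(p_rain_given_all_yes(Fraction(1, 4)))  # 8/11
print(p_rain_given_all_yes(Fraction(1, 2)))  # 8/9
```

In an interview it is worth stating the prior explicitly, since the three answers alone do not determine the result.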
---
# Probability
- Birthday
There are 30 people in a class. What is the probability that at least two of them share a birthday?
--
Consider the complement: the probability that all 30 people have different birthdays.
$P = 1 - \frac{365 \times 364 \times \cdots \times 336}{365^{30}}$
What is the probability that exactly two people share a birthday and everyone else has a distinct birthday? (This is similar to the suit and pair card problems.)
--
Choose one pair out of the 30 people and assign one of the 365 days to their shared birthday. Then the remaining 28 people must all have different birthdays, none equal to the pair's.
$P = \frac{\binom{30}{2} \times 365 \times 364 \times \cdots \times 337}{365^{30}}$
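
Both birthday formulas can be evaluated exactly with `math.perm` and `math.comb` (Python 3.8+). This sketch assumes 365 equally likely birthdays, and the function names are illustrative.

```python
from fractions import Fraction
from math import comb, perm

def p_shared(n, days=365):
    """P(at least two of n people share a birthday)."""
    return 1 - Fraction(perm(days, n), days ** n)

def p_exactly_one_pair(n, days=365):
    """P(exactly one pair shares a birthday and all other birthdays are distinct)."""
    # comb(n, 2) picks the pair; perm(days, n - 1) assigns distinct days
    # to the pair (as one unit) and to the remaining n - 2 people.
    return Fraction(comb(n, 2) * perm(days, n - 1), days ** n)

print(round(float(p_shared(30)), 3))            # about 0.706
print(round(float(p_exactly_one_pair(30)), 3))  # about 0.38
```

With two people both functions reduce to 1/365, which is a quick sanity check on the counting.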
---
# Statistics
- Linear regression
  - Assumptions: linearity, independence, homogeneity of variance (homoscedasticity) and normality of errors
  - Inference and interpretation
  - Issues such as outliers and multicollinearity
- Logistic regression
  - Which linear regression assumptions no longer hold
  - Interpretation
- Hypothesis testing
  - p-value, $H_0$, $H_1$, type I error, type II error, power
  - Confidence intervals
- Online course
  - [Biostatistics Module](http://sphweb.bumc.bu.edu/otlt/mph-modules/menu/)
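
To see how type I error, type II error and power relate, here is a minimal sketch for a two-sided one-sample z-test with known standard deviation. The function name and the example numbers are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Probability of rejecting H0 in a two-sided one-sample z-test (known sigma)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # rejection threshold under H0
    shift = effect * sqrt(n) / sigma            # standardized true effect
    # Reject when the test statistic falls in either tail
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

type_1_error = z_test_power(effect=0.0, sigma=1.0, n=50)  # H0 true: equals alpha
power = z_test_power(effect=0.5, sigma=1.0, n=50)         # H1 true
type_2_error = 1 - power

print(round(type_1_error, 3))  # 0.05
```

When the true effect is zero the rejection probability is exactly the significance level, and power grows with both the effect size and the sample size.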
---
# Coding - Data Wrangling Using Pandas
- Basics
  - Selection
  - Missing values
  - String
  - Merge
  - Groupby
  - Plot
  - Time series
  - Data in/out
- Resources
  - [Python for Data Analysis](https://github.com/wesm/pydata-book)
  - [Pandas Documentation](https://pandas.pydata.org/)
  - [Pandas Exercises](https://github.com/guipsamora/pandas_exercises)
---
# Coding - Data Structure Using Python
- Python
  * [Problem Solving with Algorithms and Data Structures using Python](http://interactivepython.org/runestone/static/pythonds/index.html)
  * [Google Python Course](https://developers.google.com/edu/python/)
  * [MIT: Introduction to Algorithms](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-spring-2008/)
  * [YouTube Video List](https://www.youtube.com/playlist?list=PLUl4u3cNGP61Oq3tWYp6V_F-5jb5L2iHb)
--
- [Leetcode](https://leetcode.com/problemset/algorithms/)
* [My Python Solution](https://github.com/wangruinju/Rui_Python_Leetcode)
--
.pull-left[
* Part I
  * Linked List
  * Binary Search
  * Two Pointers
  * Bit Manipulation
  * Math
  * Stack
  * Hash Table]
.pull-right[
* Part II
  * Backtracking
  * Tree
  * DFS
  * BFS
  * String
  * Array
  * Dynamic Programming]
---
# Machine Learning
- Basics
  - Supervised vs. unsupervised learning
  - Linear regression
  - Logistic regression and Linear Discriminant Analysis
  - Cross-validation and bootstrapping
  - Regularization: L1/L2
  - Tree models: CART, Random Forest, Gradient Boosting Trees
  - SVM and kernels
  - Clustering: k-means, Gaussian mixture models and hierarchical methods
  - PCA for dimensionality reduction
- Resources
  - [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf)
  - [The Elements of Statistical Learning](https://web.stanford.edu/~hastie/Papers/ESLII.pdf)
  - [Andrew Ng: Machine Learning on Coursera](https://www.coursera.org/learn/machine-learning/home/welcome) and [my solutions](https://github.com/wangruinju/Machine-Learning-Coursera)
  - [CS229: Machine Learning](http://cs229.stanford.edu/)
  - [Hung-yi Lee videos](https://www.youtube.com/playlist?list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49)
  - [Other great videos](https://www.youtube.com/playlist?list=PLaXDtXvwY-oDvedS3f4HW0b4KxqpJ_imw)
---
# Case Studies
- Structure
  - Define the business problem
  - Make a plan for the analysis
  - Segment the data
  - Summarize the insights
  - Make business decisions
- Methods
  - Segment analysis
  - Trend analysis
  - Funnel analysis
  - User behavior analysis
  - Retention analysis
  - A/B test: sample size, effect size, significance level, power, potential issues
- Resources
  - [A/B Testing on Udacity](https://www.udacity.com/course/ab-testing--ud257)
  - [Design of Experiments](https://onlinecourses.science.psu.edu/stat503/node/1)
  - [A/B testing articles](https://engineering.linkedin.com/blog/topic/ab-testing)
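
Sample size, effect size, significance level and power are tied together by one formula. The sketch below uses the common normal approximation for comparing two proportions; the function name and the 10% to 12% conversion rates are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users needed per group for a two-sided two-proportion test."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # significance level
    z_beta = std_normal.inv_cdf(power)           # desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control             # absolute effect size
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

n_small_lift = sample_size_per_group(0.10, 0.12)  # detect a 2-point lift
n_large_lift = sample_size_per_group(0.10, 0.15)  # detect a 5-point lift
print(n_small_lift, n_large_lift)
```

Detecting the 2-point lift needs a few thousand users per group, several times more than the 5-point lift, which is why the minimum detectable effect should be agreed on before the test starts.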
---
# These notes come from my own preparation for data science interviews. Many friends encouraged me when I started, and I am happy to share my limited experience with anyone who needs it. Cheers!