{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to Statistics in Python\n",
"\n",
"## Summary Statistics\n",
"\n",
"### Mean and median\n",
"\n",
"In this chapter, you'll be working with the [2018 Food Carbon Footprint\n",
"Index](https://www.nu3.de/blogs/nutrition/food-carbon-footprint-index-2018)\n",
"from nu3. The `food_consumption` dataset contains information about the\n",
"kilograms of food consumed per person per year in each country in each\n",
"food category (`consumption`) as well as information about the carbon\n",
"footprint of that food category (`co2_emissions`) measured in kilograms\n",
"of carbon dioxide, or CO<sub>2</sub>, per person per year in each\n",
"country.\n",
"\n",
"In this exercise, you'll compute measures of center to compare food\n",
"consumption in the US and Belgium using your `pandas` and `numpy`\n",
"skills.\n",
"\n",
"`pandas` is imported as `pd` for you and `food_consumption` is\n",
"pre-loaded.\n",
"\n",
"**Instructions**\n",
"\n",
"- Import `numpy` with the alias `np`.\n",
"- Create two DataFrames: one that holds the rows of `food_consumption`\n",
" for `'Belgium'` and another that holds rows for `'USA'`. Call these\n",
" `be_consumption` and `usa_consumption`.\n",
"- Calculate the mean and median of kilograms of food consumed per person\n",
" per year for both countries.\n",
"\n",
"<!-- -->\n",
"\n",
"- Subset `food_consumption` for rows with data about Belgium and the\n",
" USA.\n",
"- Group the subsetted data by `country` and select only the\n",
" `consumption` column.\n",
"- Calculate the mean and median of the kilograms of food consumed per\n",
" person per year in each country using `.agg()`.\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import numpy with alias np\n",
"import numpy as np\n",
"\n",
"# Filter for Belgium\n",
"be_consumption = food_consumption[food_consumption['country'] == 'Belgium']\n",
"\n",
"# Filter for USA\n",
"usa_consumption = food_consumption[food_consumption['country'] == 'USA']\n",
"\n",
"# Calculate mean and median consumption in Belgium\n",
"print(np.mean(be_consumption['consumption']))\n",
"print(np.median(be_consumption['consumption']))\n",
"\n",
"# Calculate mean and median consumption in USA\n",
"print(np.mean(usa_consumption['consumption']))\n",
"print(np.median(usa_consumption['consumption']))\n",
"\n",
"\n",
"# Import numpy as np\n",
"import numpy as np\n",
"\n",
"# Subset for Belgium and USA only\n",
"be_and_usa = food_consumption[(food_consumption['country'] == \"Belgium\") | (food_consumption['country'] == 'USA')]\n",
"\n",
"# Group by country, select consumption column, and compute mean and median\n",
"print(be_and_usa.groupby('country')['consumption'].agg([np.mean, np.median]))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Mean vs. median\n",
"\n",
"In the video, you learned that the mean is the sum of all the data\n",
"points divided by the total number of data points, and the median is the\n",
"middle value of the dataset where 50% of the data is less than the\n",
"median, and 50% of the data is greater than the median. In this\n",
"exercise, you'll compare these two measures of center.\n",
"\n",
"`pandas` is loaded as `pd`, `numpy` is loaded as `np`, and\n",
"`food_consumption` is available.\n",
"\n",
"**Instructions**\n",
"\n",
"- Import `matplotlib.pyplot` with the alias `plt`.\n",
"- Subset `food_consumption` to get the rows where `food_category` is\n",
" `'rice'`.\n",
"- Create a histogram of `co2_emission` for rice and show the plot.\n",
"- Use `.agg()` to calculate the mean and median of `co2_emission` for rice.\n",
"\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import matplotlib.pyplot with alias plt\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Subset for food_category equals rice\n",
"rice_consumption = food_consumption[food_consumption['food_category'] == 'rice']\n",
"\n",
"# Histogram of co2_emission for rice and show plot\n",
"rice_consumption['co2_emission'].hist()\n",
"plt.show()\n",
"\n",
"\n",
"# Subset for food_category equals rice\n",
"rice_consumption = food_consumption[food_consumption['food_category'] == 'rice']\n",
"\n",
"# Calculate mean and median of co2_emission with .agg()\n",
"print(rice_consumption['co2_emission'].agg([np.mean, np.median]))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Quartiles, quantiles, and quintiles\n",
"\n",
"Quantiles are a great way of summarizing numerical data since they can\n",
"be used to measure center and spread, as well as to get a sense of where\n",
"a data point stands in relation to the rest of the data set. For\n",
"example, you might want to give a discount to the 10% most active users\n",
"on a website.\n",
"\n",
"In this exercise, you'll calculate quartiles, quintiles, and deciles,\n",
"which split up a dataset into 4, 5, and 10 pieces, respectively.\n",
"\n",
"Both `pandas` as `pd` and `numpy` as `np` are loaded and\n",
"`food_consumption` is available.\n",
"\n",
"**Instructions**\n",
"\n",
"- Calculate the quartiles of the `co2_emission` column of\n",
" `food_consumption`.\n",
"\n",
"<!-- -->\n",
"\n",
"- Calculate the six quantiles that split up the data into 5 pieces\n",
" (quintiles) of the `co2_emission` column of `food_consumption`.\n",
"\n",
"<!-- -->\n",
"\n",
"- Calculate the eleven quantiles of `co2_emission` that split up the\n",
" data into ten pieces (deciles).\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Calculate the quartiles of co2_emission\n",
"print(np.quantile(food_consumption['co2_emission'], [0, 0.25, 0.5, 0.75, 1]))\n",
"\n",
"# Calculate the quintiles of co2_emission\n",
"print(np.quantile(food_consumption['co2_emission'], [0, 0.2, 0.4, 0.6, 0.8, 1]))\n",
"\n",
"# Calculate the deciles of co2_emission\n",
"print(np.quantile(food_consumption['co2_emission'], np.linspace(0, 1, 11)))\n"
]
},
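{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note, the same quantiles can be computed directly with pandas'\n",
"`.quantile()` method, which accepts a list of probabilities. Here is a\n",
"minimal sketch on a small stand-in Series, since `food_consumption` is\n",
"only available in the course environment:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Stand-in data in place of food_consumption['co2_emission']\n",
"co2 = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])\n",
"\n",
"# pandas .quantile() matches np.quantile for the same probabilities\n",
"print(co2.quantile([0, 0.25, 0.5, 0.75, 1]))\n",
"print(np.quantile(co2, [0, 0.25, 0.5, 0.75, 1]))\n"
]
},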
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Variance and standard deviation\n",
"\n",
"Variance and standard deviation are two of the most common ways to\n",
"measure the spread of a variable, and you'll practice calculating these\n",
"in this exercise. Spread is important since it can help inform\n",
"expectations. For example, if a salesperson sells a mean of 20 products\n",
"a day, but has a standard deviation of 10 products, there will probably\n",
"be days where they sell 40 products, but also days where they only sell\n",
"one or two. Information like this is important, especially when making\n",
"predictions.\n",
"\n",
"Both `pandas` as `pd` and `numpy` as `np` are loaded, and\n",
"`food_consumption` is available.\n",
"\n",
"**Instructions**\n",
"\n",
"- Calculate the variance and standard deviation of `co2_emission` for\n",
" each `food_category` by grouping and aggregating.\n",
"- Import `matplotlib.pyplot` with alias `plt`.\n",
"- Create a histogram of `co2_emission` for the `beef` `food_category`\n",
" and show the plot.\n",
"- Create a histogram of `co2_emission` for the `eggs` `food_category`\n",
" and show the plot.\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print variance and sd of co2_emission for each food_category\n",
"print(food_consumption.groupby('food_category')['co2_emission'].agg([np.var, np.std]))\n",
"\n",
"# Import matplotlib.pyplot with alias plt\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Create histogram of co2_emission for food_category 'beef'\n",
"food_consumption[food_consumption['food_category'] == 'beef']['co2_emission'].hist()\n",
"# Show plot\n",
"plt.show()\n",
"\n",
"# Create histogram of co2_emission for food_category 'eggs'\n",
"food_consumption[food_consumption['food_category'] == 'eggs']['co2_emission'].hist()\n",
"# Show plot\n",
"plt.show()\n"
]
},
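{
"cell_type": "markdown",
"metadata": {},
"source": [
"Standard deviation is just the square root of variance, so the two\n",
"columns produced by the aggregation above are consistent with each\n",
"other. A quick check on a small stand-in Series (pandas' `.var()` and\n",
"`.std()` compute the sample versions, with `ddof=1`):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])\n",
"\n",
"# Sample variance and standard deviation (ddof=1)\n",
"print(x.var())\n",
"print(x.std())\n",
"\n",
"# std is the square root of var\n",
"print(np.isclose(x.std(), np.sqrt(x.var())))\n"
]
},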
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Finding outliers using IQR\n",
"\n",
"Outliers can have big effects on statistics like mean, as well as\n",
"statistics that rely on the mean, such as variance and standard\n",
"deviation. Interquartile range, or IQR, is another way of measuring\n",
"spread that's less influenced by outliers. IQR is also often used to\n",
"find outliers. If a value is less than \\\\\\text{Q1} - 1.5 \\times\n",
"\\text{IQR}\\\\ or greater than \\\\\\text{Q3} + 1.5 \\times \\text{IQR}\\\\, it's\n",
"considered an outlier. In fact, this is how the lengths of the whiskers\n",
"in a `matplotlib` box plot are calculated.\n",
"\n",
"![Diagram of a box plot showing median, quartiles, and\n",
"outliers](https://assets.datacamp.com/production/repositories/5758/datasets/ca7e6e1832be7ec1842f62891815a9b0488efa83/Screen%20Shot%202020-04-28%20at%2010.04.54%20AM.png)\n",
"\n",
"In this exercise, you'll calculate IQR and use it to find some outliers.\n",
"`pandas` as `pd` and `numpy` as `np` are loaded and `food_consumption`\n",
"is available.\n",
"\n",
"**Instructions**\n",
"\n",
"- Calculate the total `co2_emission` per country by grouping by country\n",
" and taking the sum of `co2_emission`. Store the resulting DataFrame as\n",
" `emissions_by_country`.\n",
"- Compute the first and third quartiles of `emissions_by_country` and store these as `q1` and `q3`.\n",
"- Calculate the interquartile range of `emissions_by_country` and store it as `iqr`.\n",
"- Calculate the lower and upper cutoffs for outliers of `emissions_by_country`, and store these as lower and `upper`.\n",
"- Subset `emissions_by_country` to get countries with a total emission greater than the `upper` cutoff or a total emission less than the `lower` cutoff.\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Calculate total co2_emission per country: emissions_by_country\n",
"emissions_by_country = food_consumption.groupby('country')['co2_emission'].sum()\n",
"\n",
"print(emissions_by_country)\n",
"\n",
"\n",
"# Calculate total co2_emission per country: emissions_by_country\n",
"emissions_by_country = food_consumption.groupby('country')['co2_emission'].sum()\n",
"\n",
"# Compute the first and third quartiles and IQR of emissions_by_country\n",
"q1 = np.quantile(emissions_by_country, 0.25)\n",
"q3 = np.quantile(emissions_by_country, 0.75)\n",
"iqr = q3 - q1\n",
"\n",
"\n",
"# Calculate total co2_emission per country: emissions_by_country\n",
"emissions_by_country = food_consumption.groupby('country')['co2_emission'].sum()\n",
"\n",
"# Compute the first and third quantiles and IQR of emissions_by_country\n",
"q1 = np.quantile(emissions_by_country, 0.25)\n",
"q3 = np.quantile(emissions_by_country, 0.75)\n",
"iqr = q3 - q1\n",
"\n",
"# Calculate the lower and upper cutoffs for outliers\n",
"lower = q1 - 1.5 * iqr\n",
"upper = q3 + 1.5 * iqr\n",
"\n",
"\n",
"# Calculate total co2_emission per country: emissions_by_country\n",
"emissions_by_country = food_consumption.groupby('country')['co2_emission'].sum()\n",
"\n",
"# Compute the first and third quantiles and IQR of emissions_by_country\n",
"q1 = np.quantile(emissions_by_country, 0.25)\n",
"q3 = np.quantile(emissions_by_country, 0.75)\n",
"iqr = q3 - q1\n",
"\n",
"# Calculate the lower and upper cutoffs for outliers\n",
"lower = q1 - 1.5 * iqr\n",
"upper = q3 + 1.5 * iqr\n",
"\n",
"# Subset emissions_by_country to find outliers\n",
"outliers = emissions_by_country[(emissions_by_country < lower) | (emissions_by_country > upper)]\n",
"print(outliers)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Random Numbers and Probability\n",
"\n",
"### Calculating probabilities\n",
"\n",
"You're in charge of the sales team, and it's time for performance\n",
"reviews, starting with Amir. As part of the review, you want to randomly\n",
"select a few of the deals that he's worked on over the past year so that\n",
"you can look at them more deeply. Before you start selecting deals,\n",
"you'll first figure out what the chances are of selecting certain deals.\n",
"\n",
"Recall that the probability of an event can be calculated by \\$\\$\n",
"P(\\text{event}) = \\frac{\\text{# ways event can happen}}{\\text{total \\#\n",
"of possible outcomes}} \\$\\$\n",
"\n",
"Both `pandas` as `pd` and `numpy` as `np` are loaded and `amir_deals` is\n",
"available.\n",
"\n",
"**Instructions**\n",
"\n",
"- Count the number of deals Amir worked on for each `product` type and\n",
" store in `counts`.\n",
"- Calculate the probability of selecting a deal for the different product types by dividing the counts by the total number of deals Amir worked on. Save this as `probs`.\n",
"\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Count the deals for each product\n",
"counts = amir_deals['product'].value_counts()\n",
"print(counts)\n",
"\n",
"\n",
"# Count the deals for each product\n",
"counts = amir_deals['product'].value_counts()\n",
"\n",
"# Calculate probability of picking a deal with each product\n",
"probs = counts / amir_deals.shape[0]\n",
"print(probs)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sampling deals\n",
"\n",
"In the previous exercise, you counted the deals Amir worked on. Now it's\n",
"time to randomly pick five deals so that you can reach out to each\n",
"customer and ask if they were satisfied with the service they received.\n",
"You'll try doing this both with and without replacement.\n",
"\n",
"Additionally, you want to make sure this is done randomly and that it\n",
"can be reproduced in case you get asked how you chose the deals, so\n",
"you'll need to set the random seed before sampling from the deals.\n",
"\n",
"Both `pandas` as `pd` and `numpy` as `np` are loaded and `amir_deals` is\n",
"available.\n",
"\n",
"**Instructions**\n",
"\n",
"- Set the random seed to `24`.\n",
"- Take a sample of `5` deals **without** replacement and store them as\n",
" `sample_without_replacement`.\n",
"- Take a sample of 5 deals with replacement and save as `sample_with_replacement`.\n",
"\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set random seed\n",
"np.random.seed(24)\n",
"\n",
"# Sample 5 deals without replacement\n",
"sample_without_replacement = amir_deals.sample(5)\n",
"print(sample_without_replacement)\n",
"\n",
"\n",
"# Set random seed\n",
"np.random.seed(24)\n",
"\n",
"# Sample 5 deals with replacement\n",
"sample_with_replacement = amir_deals.sample(5, replace=True)\n",
"print(sample_with_replacement)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Creating a probability distribution\n",
"\n",
"A new restaurant opened a few months ago, and the restaurant's\n",
"management wants to optimize its seating space based on the size of the\n",
"groups that come most often. On one night, there are 10 groups of people\n",
"waiting to be seated at the restaurant, but instead of being called in\n",
"the order they arrived, they will be called randomly. In this exercise,\n",
"you'll investigate the probability of groups of different sizes getting\n",
"picked first. Data on each of the ten groups is contained in the\n",
"`restaurant_groups` DataFrame.\n",
"\n",
"Remember that expected value can be calculated by multiplying each\n",
"possible outcome with its corresponding probability and taking the sum.\n",
"The `restaurant_groups` data is available. `pandas` is loaded as `pd`,\n",
"`numpy` is loaded as `np`, and `matplotlib.pyplot` is loaded as `plt`.\n",
"\n",
"**Instructions**\n",
"\n",
"- Create a histogram of the `group_size` column of `restaurant_groups`,\n",
" setting `bins` to `[2, 3, 4, 5, 6]`. Remember to show the plot.\n",
"\n",
"<!-- -->\n",
"\n",
"- Count the number of each `group_size` in `restaurant_groups`, then\n",
" divide by the number of rows in `restaurant_groups` to calculate the\n",
" probability of randomly selecting a group of each size. Save as\n",
" `size_dist`.\n",
"- Reset the index of `size_dist`.\n",
"- Rename the columns of `size_dist` to `group_size` and `prob`.\n",
"\n",
"<!-- -->\n",
"\n",
"- Calculate the expected value of the `size_dist`, which represents the\n",
" expected group size, by multiplying the `group_size` by the `prob` and\n",
" taking the sum.\n",
"\n",
"<!-- -->\n",
"\n",
"- Calculate the probability of randomly picking a group of 4 or more\n",
" people by subsetting for groups of size 4 or more and summing the\n",
" probabilities of selecting those groups.\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a histogram of restaurant_groups and show plot\n",
"restaurant_groups['group_size'].hist(bins=np.linspace(2,6,5))\n",
"plt.show()\n",
"\n",
"\n",
"# Create probability distribution\n",
"size_dist = restaurant_groups['group_size'].value_counts() / restaurant_groups.shape[0]\n",
"\n",
"# Reset index and rename columns\n",
"size_dist = size_dist.reset_index()\n",
"size_dist.columns = ['group_size', 'prob']\n",
"\n",
"print(size_dist)\n",
"\n",
"\n",
"# Create probability distribution\n",
"size_dist = restaurant_groups['group_size'].value_counts() / restaurant_groups.shape[0]\n",
"# Reset index and rename columns\n",
"size_dist = size_dist.reset_index()\n",
"size_dist.columns = ['group_size', 'prob']\n",
"\n",
"# Expected value\n",
"expected_value = np.sum(size_dist['group_size'] * size_dist['prob'])\n",
"print(expected_value)\n",
"\n",
"\n",
"# Create probability distribution\n",
"size_dist = restaurant_groups['group_size'].value_counts() / restaurant_groups.shape[0]\n",
"# Reset index and rename columns\n",
"size_dist = size_dist.reset_index()\n",
"size_dist.columns = ['group_size', 'prob']\n",
"\n",
"# Expected value\n",
"expected_value = np.sum(size_dist['group_size'] * size_dist['prob'])\n",
"\n",
"# Subset groups of size 4 or more\n",
"groups_4_or_more = size_dist[size_dist['group_size'] >= 4]\n",
"\n",
"# Sum the probabilities of groups_4_or_more\n",
"prob_4_or_more = np.sum(groups_4_or_more['prob'])\n",
"print(prob_4_or_more)\n"
]
},
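{
"cell_type": "markdown",
"metadata": {},
"source": [
"The expected-value recipe (multiply each outcome by its probability and\n",
"sum) can be sanity-checked on a distribution with a known answer: a\n",
"fair six-sided die, whose expected value is 3.5. The die DataFrame here\n",
"is a stand-in, built the same way as `size_dist`:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Known-answer check: a fair six-sided die has expected value 3.5\n",
"die = pd.DataFrame({'outcome': [1, 2, 3, 4, 5, 6], 'prob': [1/6] * 6})\n",
"expected = np.sum(die['outcome'] * die['prob'])\n",
"print(expected)\n"
]
},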
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data back-ups\n",
"\n",
"The sales software used at your company is set to automatically back\n",
"itself up, but no one knows exactly what time the back-ups happen. It is\n",
"known, however, that back-ups happen exactly every 30 minutes. Amir\n",
"comes back from sales meetings at random times to update the data on the\n",
"client he just met with. He wants to know how long he'll have to wait\n",
"for his newly-entered data to get backed up. Use your new knowledge of\n",
"continuous uniform distributions to model this situation and answer\n",
"Amir's questions.\n",
"\n",
"**Instructions**\n",
"\n",
"- To model how long Amir will wait for a back-up using a continuous\n",
" uniform distribution, save his lowest possible wait time as `min_time`\n",
" and his longest possible wait time as `max_time`. Remember that\n",
" back-ups happen every 30 minutes.\n",
"\n",
"<!-- -->\n",
"\n",
"- Import `uniform` from `scipy.stats` and calculate the probability that\n",
" Amir has to wait less than 5 minutes, and store in a variable called\n",
" `prob_less_than_5`.\n",
"\n",
"<!-- -->\n",
"\n",
"- Calculate the probability that Amir has to wait more than 5 minutes,\n",
" and store in a variable called `prob_greater_than_5`.\n",
"\n",
"<!-- -->\n",
"\n",
"- Calculate the probability that Amir has to wait between 10 and 20\n",
" minutes, and store in a variable called `prob_between_10_and_20`.\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Min and max wait times for back-up that happens every 30 min\n",
"min_time = 0\n",
"max_time = 30\n",
"\n",
"# Import uniform from scipy.stats\n",
"from scipy.stats import uniform\n",
"\n",
"# Calculate probability of waiting less than 5 mins\n",
"prob_less_than_5 = uniform.cdf(5, min_time, max_time)\n",
"print(prob_less_than_5)\n",
"\n",
"\n",
"# Min and max wait times for back-up that happens every 30 min\n",
"min_time = 0\n",
"max_time = 30\n",
"\n",
"# Import uniform from scipy.stats\n",
"from scipy.stats import uniform\n",
"\n",
"# Calculate probability of waiting more than 5 mins\n",
"prob_greater_than_5 = 1 - uniform.cdf(5, min_time, max_time)\n",
"print(prob_greater_than_5)\n",
"\n",
"\n",
"# Min and max wait times for back-up that happens every 30 min\n",
"min_time = 0\n",
"max_time = 30\n",
"\n",
"# Import uniform from scipy.stats\n",
"from scipy.stats import uniform\n",
"\n",
"# Calculate probability of waiting 10-20 mins\n",
"prob_between_10_and_20 = uniform.cdf(20, min_time, max_time) - uniform.cdf(10, min_time, max_time)\n",
"print(prob_between_10_and_20)\n"
]
},
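{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a continuous uniform distribution the CDF has a simple closed form,\n",
"$P(X < x) = \\frac{x - \\text{loc}}{\\text{scale}}$, so the results above\n",
"can be checked by hand:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from scipy.stats import uniform\n",
"\n",
"# For a continuous uniform on [0, 30], P(X < x) = x / 30\n",
"print(uniform.cdf(5, 0, 30))\n",
"print(5 / 30)\n",
"\n",
"# Complements and interval probabilities follow the same arithmetic\n",
"print(1 - uniform.cdf(5, 0, 30))\n",
"print(uniform.cdf(20, 0, 30) - uniform.cdf(10, 0, 30))\n"
]
},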
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Simulating wait times\n",
"\n",
"To give Amir a better idea of how long he'll have to wait, you'll\n",
"simulate Amir waiting 1000 times and create a histogram to show him what\n",
"he should expect. Recall from the last exercise that his minimum wait\n",
"time is 0 minutes and his maximum wait time is 30 minutes.\n",
"\n",
"As usual, `pandas` as `pd`, `numpy` as `np`, and `matplotlib.pyplot` as\n",
"`plt` are loaded.\n",
"\n",
"**Instructions**\n",
"\n",
"- Set the random seed to `334`.\n",
"- Import `uniform` from `scipy.stats`.\n",
"- Generate 1000 wait times from the continuous uniform distribution that models Amir's wait time. Save this as `wait_times`.\n",
"- Create a histogram of the simulated wait times and show the plot.\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set random seed to 334\n",
"np.random.seed(334)\n",
"\n",
"# Import uniform\n",
"from scipy.stats import uniform\n",
"\n",
"# Generate 1000 wait times between 0 and 30 mins\n",
"wait_times = uniform.rvs(0, 30, size=1000)\n",
"\n",
"print(wait_times)\n",
"\n",
"\n",
"# Set random seed to 334\n",
"np.random.seed(334)\n",
"\n",
"# Import uniform\n",
"from scipy.stats import uniform\n",
"\n",
"# Generate 1000 wait times between 0 and 30 mins\n",
"wait_times = uniform.rvs(0, 30, size=1000)\n",
"\n",
"# Create a histogram of simulated times and show plot\n",
"plt.hist(wait_times)\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Simulating sales deals\n",
"\n",
"Assume that Amir usually works on 3 deals per week, and overall, he wins\n",
"30% of deals he works on. Each deal has a binary outcome: it's either\n",
"lost, or won, so you can model his sales deals with a binomial\n",
"distribution. In this exercise, you'll help Amir simulate a year's worth\n",
"of his deals so he can better understand his performance.\n",
"\n",
"`numpy` is imported as `np`.\n",
"\n",
"**Instructions**\n",
"\n",
"- Import `binom` from `scipy.stats` and set the random seed to 10.\n",
"\n",
"<!-- -->\n",
"\n",
"- Simulate 1 deal worked on by Amir, who wins 30% of the deals he works\n",
" on.\n",
"\n",
"<!-- -->\n",
"\n",
"- Simulate a typical week of Amir's deals, or one week of 3 deals.\n",
"\n",
"<!-- -->\n",
"\n",
"- Simulate a year's worth of Amir's deals, or 52 weeks of 3 deals each,\n",
" and store in `deals`.\n",
"- Print the mean number of deals he won per week.\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import binom from scipy.stats\n",
"from scipy.stats import binom\n",
"\n",
"# Set random seed to 10\n",
"np.random.seed(10)\n",
"\n",
"# Simulate a single deal\n",
"print(binom.rvs(1, 0.3, size=1))\n",
"\n",
"\n",
"# Import binom from scipy.stats\n",
"from scipy.stats import binom\n",
"\n",
"# Set random seed to 10\n",
"np.random.seed(10)\n",
"\n",
"# Simulate 1 week of 3 deals\n",
"print(binom.rvs(3, 0.3, size=1))\n",
"\n",
"\n",
"# Import binom from scipy.stats\n",
"from scipy.stats import binom\n",
"\n",
"# Set random seed to 10\n",
"np.random.seed(10)\n",
"\n",
"# Simulate 52 weeks of 3 deals\n",
"deals = binom.rvs(3, 0.3, size=52)\n",
"\n",
"# Print mean deals won per week\n",
"print(np.mean(deals))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Calculating binomial probabilities\n",
"\n",
"Just as in the last exercise, assume that Amir wins 30% of deals. He\n",
"wants to get an idea of how likely he is to close a certain number of\n",
"deals each week. In this exercise, you'll calculate what the chances are\n",
"of him closing different numbers of deals using the binomial\n",
"distribution.\n",
"\n",
"`binom` is imported from `scipy.stats`.\n",
"\n",
"**Instructions**\n",
"\n",
"- What's the probability that Amir closes all 3 deals in a week? Save\n",
" this as `prob_3`.\n",
"- What's the probability that Amir closes 1 or fewer deals in a week? Save this as `prob_less_than_or_equal_1`.\n",
"- What's the probability that Amir closes more than 1 deal? Save this as `prob_greater_than_1`.\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Probability of closing 3 out of 3 deals\n",
"prob_3 = binom.pmf(3, 3, 0.3)\n",
"\n",
"print(prob_3)\n",
"\n",
"\n",
"# Probability of closing <= 1 deal out of 3 deals\n",
"prob_less_than_or_equal_1 = binom.cdf(1, 3, 0.3)\n",
"\n",
"print(prob_less_than_or_equal_1)\n",
"\n",
"\n",
"# Probability of closing > 1 deal out of 3 deals\n",
"prob_greater_than_1 = 1 - binom.cdf(1, 3, 0.3)\n",
"\n",
"print(prob_greater_than_1)\n"
]
},
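{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two useful sanity checks on these binomial results: the `pmf` over all\n",
"possible outcomes sums to 1, and `cdf(1)` is the same as\n",
"`pmf(0) + pmf(1)`:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy.stats import binom\n",
"\n",
"n, p = 3, 0.3\n",
"\n",
"# The pmf over all possible outcomes sums to 1\n",
"print(sum(binom.pmf(k, n, p) for k in range(n + 1)))\n",
"\n",
"# cdf(1) equals pmf(0) + pmf(1)\n",
"print(np.isclose(binom.cdf(1, n, p), binom.pmf(0, n, p) + binom.pmf(1, n, p)))\n"
]
},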
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How many sales will be won?\n",
"\n",
"Now Amir wants to know how many deals he can expect to close each week\n",
"if his win rate changes. Luckily, you can use your binomial distribution\n",
"knowledge to help him calculate the expected value in different\n",
"situations. Recall from the video that the expected value of a binomial\n",
"distribution can be calculated by \\\\n \\times p\\\\.\n",
"\n",
"**Instructions**\n",
"\n",
"- Calculate the expected number of sales out of the **3** he works on\n",
" that Amir will win each week if he maintains his 30% win rate.\n",
"- Calculate the expected number of sales out of the 3 he works on that\n",
" he'll win if his win rate drops to 25%.\n",
"- Calculate the expected number of sales out of the 3 he works on that\n",
" he'll win if his win rate rises to 35%.\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Expected number won with 30% win rate\n",
"won_30pct = 3 * 0.3\n",
"print(won_30pct)\n",
"\n",
"# Expected number won with 25% win rate\n",
"won_25pct = 3 * 0.25\n",
"print(won_25pct)\n",
"\n",
"# Expected number won with 35% win rate\n",
"won_35pct = 3 * 0.35\n",
"print(won_35pct)\n"
]
},
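{
"cell_type": "markdown",
"metadata": {},
"source": [
"The $n \\times p$ formula can also be verified by simulation: the mean\n",
"of many simulated weeks should land close to $3 \\times 0.3 = 0.9$. This\n",
"is a stand-alone sketch (seed chosen arbitrarily):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy.stats import binom\n",
"\n",
"np.random.seed(0)\n",
"\n",
"# The mean of many simulated weeks should be close to n * p = 0.9\n",
"sims = binom.rvs(3, 0.3, size=100000)\n",
"print(np.mean(sims))\n",
"print(3 * 0.3)\n"
]
},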
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## More Distributions and the Central Limit Theorem\n",
"\n",
"### Distribution of Amir's sales\n",
"\n",
"Since each deal Amir worked on (both won and lost) was different, each\n",
"was worth a different amount of money. These values are stored in the\n",
"`amount` column of `amir_deals` As part of Amir's performance review,\n",
"you want to be able to estimate the probability of him selling different\n",
"amounts, but before you can do this, you'll need to determine what kind\n",
"of distribution the `amount` variable follows.\n",
"\n",
"Both `pandas` as `pd` and `matplotlib.pyplot` as `plt` are loaded and\n",
"`amir_deals` is available.\n",
"\n",
"**Instructions**\n",
"\n",
"- Create a histogram with 10 bins to visualize the distribution of the\n",
" `amount`. Show the plot.\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Histogram of amount with 10 bins and show plot\n",
"amir_deals['amount'].hist(bins=10)\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Probabilities from the normal distribution\n",
"\n",
"Since each deal Amir worked on (both won and lost) was different, each\n",
"was worth a different amount of money. These values are stored in the\n",
"`amount` column of `amir_deals` and follow a normal distribution with a\n",
"mean of 5000 dollars and a standard deviation of 2000 dollars. As part\n",
"of his performance metrics, you want to calculate the probability of\n",
"Amir closing a deal worth various amounts.\n",
"\n",
"`norm` from `scipy.stats` is imported as well as `pandas` as `pd`. The\n",
"DataFrame `amir_deals` is loaded.\n",
"\n",
"**Instructions**\n",
"\n",
"- What's the probability of Amir closing a deal worth less than \\$7500?\n",
"\n",
"- What's the probability of Amir closing a deal worth more than \\$1000?\n",
"\n",
"- What's the probability of Amir closing a deal worth between \\$3000 and\n",
" \\$7000?\n",
"\n",
"- What amount will 25% of Amir's sales be *less than*?\n",
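"\n",
"Note that `norm.cdf(x, loc, scale)` with `loc=5000` and `scale=2000` is\n",
"equivalent to standardizing to a z-score first and using the standard\n",
"normal. A minimal sketch, assuming `scipy` is installed:\n",
"\n",
"``` python\n",
"from scipy.stats import norm\n",
"\n",
"# P(amount < 7500) computed two equivalent ways\n",
"p_direct = norm.cdf(7500, 5000, 2000)      # loc/scale form\n",
"p_zscore = norm.cdf((7500 - 5000) / 2000)  # z = 1.25, standard normal\n",
"print(p_direct, p_zscore)  # both approximately 0.894\n",
"```\n",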
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Probability of deal < 7500\n",
"prob_less_7500 = norm.cdf(7500, 5000, 2000)\n",
"\n",
"print(prob_less_7500)\n",
"\n",
"\n",
"# Probability of deal > 1000\n",
"prob_over_1000 = 1 - norm.cdf(1000, 5000, 2000)\n",
"\n",
"print(prob_over_1000)\n",
"\n",
"\n",
"# Probability of deal between 3000 and 7000\n",
"prob_3000_to_7000 = norm.cdf(7000, 5000, 2000) - norm.cdf(3000, 5000, 2000)\n",
"\n",
"print(prob_3000_to_7000)\n",
"\n",
"\n",
"# Calculate amount that 25% of deals will be less than\n",
"pct_25 = norm.ppf(0.25, 5000, 2000)\n",
"\n",
"print(pct_25)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Simulating sales under new market conditions\n",
"\n",
"The company's financial analyst is predicting that next quarter, the\n",
"worth of each sale will increase by 20% and the volatility, or standard\n",
"deviation, of each sale's worth will increase by 30%. To see what Amir's\n",
"sales might look like next quarter under these new market conditions,\n",
"you'll simulate new sales amounts using the normal distribution and\n",
"store these in a variable called `new_sales`.\n",
"\n",
"In addition, `norm` from `scipy.stats`, `pandas` as `pd`, and\n",
"`matplotlib.pyplot` as `plt` are loaded.\n",
"\n",
"**Instructions**\n",
"\n",
"- Currently, Amir's average sale amount is \\$5000. Calculate what his\n",
" new average amount will be if it increases by 20% and store this in\n",
" `new_mean`.\n",
"- Amir's current standard deviation is \\$2000. Calculate what his new\n",
" standard deviation will be if it increases by 30% and store this in\n",
" `new_sd`.\n",
"- Create a variable called `new_sales`, which contains 36 simulated\n",
" amounts from a normal distribution with a mean of `new_mean` and a\n",
" standard deviation of `new_sd`.\n",
"- Plot the distribution of the simulated `new_sales` amounts using a\n",
"  histogram and show the plot.\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Calculate new average amount\n",
"new_mean = 5000 * 1.2\n",
"\n",
"# Calculate new standard deviation\n",
"new_sd = 2000 * 1.3\n",
"\n",
"# Simulate 36 new sales\n",
"new_sales = norm.rvs(new_mean, new_sd, size=36)\n",
"\n",
"# Create histogram and show\n",
"plt.hist(new_sales)\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The CLT in action\n",
"\n",
"The central limit theorem states that a sampling distribution of a\n",
"sample statistic approaches the normal distribution as you take more\n",
"samples, no matter the original distribution being sampled from.\n",
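"\n",
"A minimal simulation sketch of this idea (illustrative, separate from\n",
"the exercise): draw repeated samples from a clearly non-normal\n",
"exponential distribution and watch the sample means cluster around the\n",
"population mean:\n",
"\n",
"``` python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(42)\n",
"# Heavily skewed source distribution with population mean 2\n",
"data = rng.exponential(scale=2, size=100_000)\n",
"# Take the mean of many samples of size 50\n",
"sample_means = [rng.choice(data, size=50).mean() for _ in range(1_000)]\n",
"print(np.mean(sample_means))  # close to 2; a histogram would look normal\n",
"```\n",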
"\n",
"In this exercise, you'll focus on the sample mean and see the central\n",
"limit theorem in action while examining the `num_users` column of\n",
"`amir_deals` more closely, which contains the number of people who\n",