{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to Statistics in Python\n",
"\n",
"## Summary Statistics\n",
"\n",
"### Mean and median\n",
"\n",
"In this chapter, you'll be working with the [2018 Food Carbon Footprint\n",
"Index](https://www.nu3.de/blogs/nutrition/food-carbon-footprint-index-2018)\n",
"from nu3. The `food_consumption` dataset contains information about the\n",
"kilograms of food consumed per person per year in each country in each\n",
"food category (`consumption`) as well as information about the carbon\n",
"footprint of that food category (`co2_emissions`) measured in kilograms\n",
"of carbon dioxide, or CO<sub>2</sub>, per person per year in each\n",
"country.\n",
"\n",
"In this exercise, you'll compute measures of center to compare food\n",
"consumption in the US and Belgium using your `pandas` and `numpy`\n",
"skills.\n",
"\n",
"`pandas` is imported as `pd` for you and `food_consumption` is\n",
"pre-loaded.\n",
"\n",
"**Instructions**\n",
"\n",
"- Import `numpy` with the alias `np`.\n",
"- Create two DataFrames: one that holds the rows of `food_consumption`\n",
" for `'Belgium'` and another that holds rows for `'USA'`. Call these\n",
" `be_consumption` and `usa_consumption`.\n",
"- Calculate the mean and median of kilograms of food consumed per person\n",
" per year for both countries.\n",
"\n",
"<!-- -->\n",
"\n",
"- Subset `food_consumption` for rows with data about Belgium and the\n",
" USA.\n",
"- Group the subsetted data by `country` and select only the\n",
" `consumption` column.\n",
"- Calculate the mean and median of the kilograms of food consumed per\n",
" person per year in each country using `.agg()`.\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import numpy with alias np\n",
"import numpy as np\n",
"\n",
"# Filter for Belgium\n",
"be_consumption = food_consumption[food_consumption['country'] == 'Belgium']\n",
"\n",
"# Filter for USA\n",
"usa_consumption = food_consumption[food_consumption['country'] == 'USA']\n",
"\n",
"# Calculate mean and median consumption in Belgium\n",
"print(np.mean(be_consumption['consumption']))\n",
"print(np.median(be_consumption['consumption']))\n",
"\n",
"# Calculate mean and median consumption in USA\n",
"print(np.mean(usa_consumption['consumption']))\n",
"print(np.median(usa_consumption['consumption']))\n",
"\n",
"\n",
"# Import numpy as np\n",
"import numpy as np\n",
"\n",
"# Subset for Belgium and USA only\n",
"be_and_usa = food_consumption[(food_consumption['country'] == \"Belgium\") | (food_consumption['country'] == 'USA')]\n",
"\n",
"# Group by country, select consumption column, and compute mean and median\n",
"print(be_and_usa.groupby('country')['consumption'].agg([np.mean, np.median]))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Mean vs. median\n",
"\n",
"In the video, you learned that the mean is the sum of all the data\n",
"points divided by the total number of data points, and the median is the\n",
"middle value of the dataset where 50% of the data is less than the\n",
"median, and 50% of the data is greater than the median. In this\n",
"exercise, you'll compare these two measures of center.\n",
"\n",
"`pandas` is loaded as `pd`, `numpy` is loaded as `np`, and\n",
"`food_consumption` is available.\n",
"\n",
"**Instructions**\n",
"\n",
"- Import `matplotlib.pyplot` with the alias `plt`.\n",
"- Subset `food_consumption` to get the rows where `food_category` is\n",
" `'rice'`.\n",
"- Create a histogram of `co2_emission` for rice and show the plot.\n",
"- Use `.agg()` to calculate the mean and median of `co2_emission` for rice.\n",
"\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import matplotlib.pyplot with alias plt\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Subset for food_category equals rice\n",
"rice_consumption = food_consumption[food_consumption['food_category'] == 'rice']\n",
"\n",
"# Histogram of co2_emission for rice and show plot\n",
"rice_consumption['co2_emission'].hist()\n",
"plt.show()\n",
"\n",
"\n",
"# Subset for food_category equals rice\n",
"rice_consumption = food_consumption[food_consumption['food_category'] == 'rice']\n",
"\n",
"# Calculate mean and median of co2_emission with .agg()\n",
"print(rice_consumption['co2_emission'].agg([np.mean, np.median]))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Quartiles, quantiles, and quintiles\n",
"\n",
"Quantiles are a great way of summarizing numerical data since they can\n",
"be used to measure center and spread, as well as to get a sense of where\n",
"a data point stands in relation to the rest of the data set. For\n",
"example, you might want to give a discount to the 10% most active users\n",
"on a website.\n",
"\n",
"In this exercise, you'll calculate quartiles, quintiles, and deciles,\n",
"which split up a dataset into 4, 5, and 10 pieces, respectively.\n",
"\n",
"Both `pandas` as `pd` and `numpy` as `np` are loaded and\n",
"`food_consumption` is available.\n",
"\n",
"**Instructions**\n",
"\n",
"- Calculate the quartiles of the `co2_emission` column of\n",
" `food_consumption`.\n",
"\n",
"<!-- -->\n",
"\n",
"- Calculate the six quantiles that split up the data into 5 pieces\n",
" (quintiles) of the `co2_emission` column of `food_consumption`.\n",
"\n",
"<!-- -->\n",
"\n",
"- Calculate the eleven quantiles of `co2_emission` that split up the\n",
" data into ten pieces (deciles).\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Calculate the quartiles of co2_emission\n",
"print(np.quantile(food_consumption['co2_emission'], [0, 0.25, 0.5, 0.75, 1]))\n",
"\n",
"# Calculate the quintiles of co2_emission\n",
"print(np.quantile(food_consumption['co2_emission'], [0, 0.2, 0.4, 0.6, 0.8, 1]))\n",
"\n",
"# Calculate the deciles of co2_emission\n",
"print(np.quantile(food_consumption['co2_emission'], np.linspace(0, 1, 11)))\n"
]
},
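{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note, the same quantiles can be computed directly with pandas'\n",
"`.quantile()` method, which accepts a list of probabilities. Here is a\n",
"minimal sketch on a small stand-in Series, since `food_consumption` is\n",
"only available in the course environment:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Stand-in data in place of food_consumption['co2_emission']\n",
"co2 = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])\n",
"\n",
"# pandas .quantile() matches np.quantile for the same probabilities\n",
"print(co2.quantile([0, 0.25, 0.5, 0.75, 1]))\n",
"print(np.quantile(co2, [0, 0.25, 0.5, 0.75, 1]))\n"
]
},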
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Variance and standard deviation\n",
"\n",
"Variance and standard deviation are two of the most common ways to\n",
"measure the spread of a variable, and you'll practice calculating these\n",
"in this exercise. Spread is important since it can help inform\n",
"expectations. For example, if a salesperson sells a mean of 20 products\n",
"a day, but has a standard deviation of 10 products, there will probably\n",
"be days where they sell 40 products, but also days where they only sell\n",
"one or two. Information like this is important, especially when making\n",
"predictions.\n",
"\n",
"Both `pandas` as `pd` and `numpy` as `np` are loaded, and\n",
"`food_consumption` is available.\n",
"\n",
"**Instructions**\n",
"\n",
"- Calculate the variance and standard deviation of `co2_emission` for\n",
" each `food_category` by grouping and aggregating.\n",
"- Import `matplotlib.pyplot` with alias `plt`.\n",
"- Create a histogram of `co2_emission` for the `beef` `food_category`\n",
" and show the plot.\n",
"- Create a histogram of `co2_emission` for the `eggs` `food_category`\n",
" and show the plot.\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print variance and sd of co2_emission for each food_category\n",
"print(food_consumption.groupby('food_category')['co2_emission'].agg([np.var, np.std]))\n",
"\n",
"# Import matplotlib.pyplot with alias plt\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Create histogram of co2_emission for food_category 'beef'\n",
"food_consumption[food_consumption['food_category'] == 'beef']['co2_emission'].hist()\n",
"# Show plot\n",
"plt.show()\n",
"\n",
"# Create histogram of co2_emission for food_category 'eggs'\n",
"food_consumption[food_consumption['food_category'] == 'eggs']['co2_emission'].hist()\n",
"# Show plot\n",
"plt.show()\n"
]
},
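{
"cell_type": "markdown",
"metadata": {},
"source": [
"Standard deviation is just the square root of variance, so the two\n",
"columns produced by the aggregation above are consistent with each\n",
"other. A quick check on a small stand-in Series (pandas' `.var()` and\n",
"`.std()` compute the sample versions, with `ddof=1`):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])\n",
"\n",
"# Sample variance and standard deviation (ddof=1)\n",
"print(x.var())\n",
"print(x.std())\n",
"\n",
"# std is the square root of var\n",
"print(np.isclose(x.std(), np.sqrt(x.var())))\n"
]
},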
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Finding outliers using IQR\n",
"\n",
"Outliers can have big effects on statistics like mean, as well as\n",
"statistics that rely on the mean, such as variance and standard\n",
"deviation. Interquartile range, or IQR, is another way of measuring\n",
"spread that's less influenced by outliers. IQR is also often used to\n",
"find outliers. If a value is less than \\\\\\text{Q1} - 1.5 \\times\n",
"\\text{IQR}\\\\ or greater than \\\\\\text{Q3} + 1.5 \\times \\text{IQR}\\\\, it's\n",
"considered an outlier. In fact, this is how the lengths of the whiskers\n",
"in a `matplotlib` box plot are calculated.\n",
"\n",
"![Diagram of a box plot showing median, quartiles, and\n",
"outliers](https://assets.datacamp.com/production/repositories/5758/datasets/ca7e6e1832be7ec1842f62891815a9b0488efa83/Screen%20Shot%202020-04-28%20at%2010.04.54%20AM.png)\n",
"\n",
"In this exercise, you'll calculate IQR and use it to find some outliers.\n",
"`pandas` as `pd` and `numpy` as `np` are loaded and `food_consumption`\n",
"is available.\n",
"\n",
"**Instructions**\n",
"\n",
"- Calculate the total `co2_emission` per country by grouping by country\n",
" and taking the sum of `co2_emission`. Store the resulting DataFrame as\n",
" `emissions_by_country`.\n",
"- Compute the first and third quartiles of `emissions_by_country` and store these as `q1` and `q3`.\n",
"- Calculate the interquartile range of `emissions_by_country` and store it as `iqr`.\n",
"- Calculate the lower and upper cutoffs for outliers of `emissions_by_country`, and store these as lower and `upper`.\n",
"- Subset `emissions_by_country` to get countries with a total emission greater than the `upper` cutoff or a total emission less than the `lower` cutoff.\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Calculate total co2_emission per country: emissions_by_country\n",
"emissions_by_country = food_consumption.groupby('country')['co2_emission'].sum()\n",
"\n",
"print(emissions_by_country)\n",
"\n",
"\n",
"# Calculate total co2_emission per country: emissions_by_country\n",
"emissions_by_country = food_consumption.groupby('country')['co2_emission'].sum()\n",
"\n",
"# Compute the first and third quartiles and IQR of emissions_by_country\n",
"q1 = np.quantile(emissions_by_country, 0.25)\n",
"q3 = np.quantile(emissions_by_country, 0.75)\n",
"iqr = q3 - q1\n",
"\n",
"\n",
"# Calculate total co2_emission per country: emissions_by_country\n",
"emissions_by_country = food_consumption.groupby('country')['co2_emission'].sum()\n",
"\n",
"# Compute the first and third quantiles and IQR of emissions_by_country\n",
"q1 = np.quantile(emissions_by_country, 0.25)\n",
"q3 = np.quantile(emissions_by_country, 0.75)\n",
"iqr = q3 - q1\n",
"\n",
"# Calculate the lower and upper cutoffs for outliers\n",
"lower = q1 - 1.5 * iqr\n",
"upper = q3 + 1.5 * iqr\n",
"\n",
"\n",
"# Calculate total co2_emission per country: emissions_by_country\n",
"emissions_by_country = food_consumption.groupby('country')['co2_emission'].sum()\n",
"\n",
"# Compute the first and third quantiles and IQR of emissions_by_country\n",
"q1 = np.quantile(emissions_by_country, 0.25)\n",
"q3 = np.quantile(emissions_by_country, 0.75)\n",
"iqr = q3 - q1\n",
"\n",
"# Calculate the lower and upper cutoffs for outliers\n",
"lower = q1 - 1.5 * iqr\n",
"upper = q3 + 1.5 * iqr\n",
"\n",
"# Subset emissions_by_country to find outliers\n",
"outliers = emissions_by_country[(emissions_by_country < lower) | (emissions_by_country > upper)]\n",
"print(outliers)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Random Numbers and Probability\n",
"\n",
"### Calculating probabilities\n",
"\n",
"You're in charge of the sales team, and it's time for performance\n",
"reviews, starting with Amir. As part of the review, you want to randomly\n",
"select a few of the deals that he's worked on over the past year so that\n",
"you can look at them more deeply. Before you start selecting deals,\n",
"you'll first figure out what the chances are of selecting certain deals.\n",
"\n",
"Recall that the probability of an event can be calculated by \\$\\$\n",
"P(\\text{event}) = \\frac{\\text{# ways event can happen}}{\\text{total \\#\n",
"of possible outcomes}} \\$\\$\n",
"\n",
"Both `pandas` as `pd` and `numpy` as `np` are loaded and `amir_deals` is\n",
"available.\n",
"\n",
"**Instructions**\n",
"\n",
"- Count the number of deals Amir worked on for each `product` type and\n",
" store in `counts`.\n",
"- Calculate the probability of selecting a deal for the different product types by dividing the counts by the total number of deals Amir worked on. Save this as `probs`.\n",
"\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Count the deals for each product\n",
"counts = amir_deals['product'].value_counts()\n",
"print(counts)\n",
"\n",
"\n",
"# Count the deals for each product\n",
"counts = amir_deals['product'].value_counts()\n",
"\n",
"# Calculate probability of picking a deal with each product\n",
"probs = counts / amir_deals.shape[0]\n",
"print(probs)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sampling deals\n",
"\n",
"In the previous exercise, you counted the deals Amir worked on. Now it's\n",
"time to randomly pick five deals so that you can reach out to each\n",
"customer and ask if they were satisfied with the service they received.\n",
"You'll try doing this both with and without replacement.\n",
"\n",
"Additionally, you want to make sure this is done randomly and that it\n",
"can be reproduced in case you get asked how you chose the deals, so\n",
"you'll need to set the random seed before sampling from the deals.\n",
"\n",
"Both `pandas` as `pd` and `numpy` as `np` are loaded and `amir_deals` is\n",
"available.\n",
"\n",
"**Instructions**\n",
"\n",
"- Set the random seed to `24`.\n",
"- Take a sample of `5` deals **without** replacement and store them as\n",
" `sample_without_replacement`.\n",
"- Take a sample of 5 deals with replacement and save as `sample_with_replacement`.\n",
"\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set random seed\n",
"np.random.seed(24)\n",
"\n",
"# Sample 5 deals without replacement\n",
"sample_without_replacement = amir_deals.sample(5)\n",
"print(sample_without_replacement)\n",
"\n",
"\n",
"# Set random seed\n",
"np.random.seed(24)\n",
"\n",
"# Sample 5 deals with replacement\n",
"sample_with_replacement = amir_deals.sample(5, replace=True)\n",
"print(sample_with_replacement)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Creating a probability distribution\n",
"\n",
"A new restaurant opened a few months ago, and the restaurant's\n",
"management wants to optimize its seating space based on the size of the\n",
"groups that come most often. On one night, there are 10 groups of people\n",
"waiting to be seated at the restaurant, but instead of being called in\n",
"the order they arrived, they will be called randomly. In this exercise,\n",
"you'll investigate the probability of groups of different sizes getting\n",
"picked first. Data on each of the ten groups is contained in the\n",
"`restaurant_groups` DataFrame.\n",
"\n",
"Remember that expected value can be calculated by multiplying each\n",
"possible outcome with its corresponding probability and taking the sum.\n",
"The `restaurant_groups` data is available. `pandas` is loaded as `pd`,\n",
"`numpy` is loaded as `np`, and `matplotlib.pyplot` is loaded as `plt`.\n",
"\n",
"**Instructions**\n",
"\n",
"- Create a histogram of the `group_size` column of `restaurant_groups`,\n",
" setting `bins` to `[2, 3, 4, 5, 6]`. Remember to show the plot.\n",
"\n",
"<!-- -->\n",
"\n",
"- Count the number of each `group_size` in `restaurant_groups`, then\n",
" divide by the number of rows in `restaurant_groups` to calculate the\n",
" probability of randomly selecting a group of each size. Save as\n",
" `size_dist`.\n",
"- Reset the index of `size_dist`.\n",
"- Rename the columns of `size_dist` to `group_size` and `prob`.\n",
"\n",
"<!-- -->\n",
"\n",
"- Calculate the expected value of the `size_dist`, which represents the\n",
" expected group size, by multiplying the `group_size` by the `prob` and\n",
" taking the sum.\n",
"\n",
"<!-- -->\n",
"\n",
"- Calculate the probability of randomly picking a group of 4 or more\n",
" people by subsetting for groups of size 4 or more and summing the\n",
" probabilities of selecting those groups.\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a histogram of restaurant_groups and show plot\n",
"restaurant_groups['group_size'].hist(bins=np.linspace(2,6,5))\n",
"plt.show()\n",
"\n",
"\n",
"# Create probability distribution\n",
"size_dist = restaurant_groups['group_size'].value_counts() / restaurant_groups.shape[0]\n",
"\n",
"# Reset index and rename columns\n",
"size_dist = size_dist.reset_index()\n",
"size_dist.columns = ['group_size', 'prob']\n",
"\n",
"print(size_dist)\n",
"\n",
"\n",
"# Create probability distribution\n",
"size_dist = restaurant_groups['group_size'].value_counts() / restaurant_groups.shape[0]\n",
"# Reset index and rename columns\n",
"size_dist = size_dist.reset_index()\n",
"size_dist.columns = ['group_size', 'prob']\n",
"\n",
"# Expected value\n",
"expected_value = np.sum(size_dist['group_size'] * size_dist['prob'])\n",
"print(expected_value)\n",
"\n",
"\n",
"# Create probability distribution\n",
"size_dist = restaurant_groups['group_size'].value_counts() / restaurant_groups.shape[0]\n",
"# Reset index and rename columns\n",
"size_dist = size_dist.reset_index()\n",
"size_dist.columns = ['group_size', 'prob']\n",
"\n",
"# Expected value\n",
"expected_value = np.sum(size_dist['group_size'] * size_dist['prob'])\n",
"\n",
"# Subset groups of size 4 or more\n",
"groups_4_or_more = size_dist[size_dist['group_size'] >= 4]\n",
"\n",
"# Sum the probabilities of groups_4_or_more\n",
"prob_4_or_more = np.sum(groups_4_or_more['prob'])\n",
"print(prob_4_or_more)\n"
]
},
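{
"cell_type": "markdown",
"metadata": {},
"source": [
"The expected-value recipe (multiply each outcome by its probability and\n",
"sum) can be sanity-checked on a distribution with a known answer: a\n",
"fair six-sided die, whose expected value is 3.5. The die DataFrame here\n",
"is a stand-in, built the same way as `size_dist`:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Known-answer check: a fair six-sided die has expected value 3.5\n",
"die = pd.DataFrame({'outcome': [1, 2, 3, 4, 5, 6], 'prob': [1/6] * 6})\n",
"expected = np.sum(die['outcome'] * die['prob'])\n",
"print(expected)\n"
]
},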
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data back-ups\n",
"\n",
"The sales software used at your company is set to automatically back\n",
"itself up, but no one knows exactly what time the back-ups happen. It is\n",
"known, however, that back-ups happen exactly every 30 minutes. Amir\n",
"comes back from sales meetings at random times to update the data on the\n",
"client he just met with. He wants to know how long he'll have to wait\n",
"for his newly-entered data to get backed up. Use your new knowledge of\n",
"continuous uniform distributions to model this situation and answer\n",
"Amir's questions.\n",
"\n",
"**Instructions**\n",
"\n",
"- To model how long Amir will wait for a back-up using a continuous\n",
" uniform distribution, save his lowest possible wait time as `min_time`\n",
" and his longest possible wait time as `max_time`. Remember that\n",
" back-ups happen every 30 minutes.\n",
"\n",
"<!-- -->\n",
"\n",
"- Import `uniform` from `scipy.stats` and calculate the probability that\n",
" Amir has to wait less than 5 minutes, and store in a variable called\n",
" `prob_less_than_5`.\n",
"\n",
"<!-- -->\n",
"\n",
"- Calculate the probability that Amir has to wait more than 5 minutes,\n",
" and store in a variable called `prob_greater_than_5`.\n",
"\n",
"<!-- -->\n",
"\n",
"- Calculate the probability that Amir has to wait between 10 and 20\n",
" minutes, and store in a variable called `prob_between_10_and_20`.\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Min and max wait times for back-up that happens every 30 min\n",
"min_time = 0\n",
"max_time = 30\n",
"\n",
"# Import uniform from scipy.stats\n",
"from scipy.stats import uniform\n",
"\n",
"# Calculate probability of waiting less than 5 mins\n",
"prob_less_than_5 = uniform.cdf(5, min_time, max_time)\n",
"print(prob_less_than_5)\n",
"\n",
"\n",
"# Min and max wait times for back-up that happens every 30 min\n",
"min_time = 0\n",
"max_time = 30\n",
"\n",
"# Import uniform from scipy.stats\n",
"from scipy.stats import uniform\n",
"\n",
"# Calculate probability of waiting more than 5 mins\n",
"prob_greater_than_5 = 1 - uniform.cdf(5, min_time, max_time)\n",
"print(prob_greater_than_5)\n",
"\n",
"\n",
"# Min and max wait times for back-up that happens every 30 min\n",
"min_time = 0\n",
"max_time = 30\n",
"\n",
"# Import uniform from scipy.stats\n",
"from scipy.stats import uniform\n",
"\n",
"# Calculate probability of waiting 10-20 mins\n",
"prob_between_10_and_20 = uniform.cdf(20, min_time, max_time) - uniform.cdf(10, min_time, max_time)\n",
"print(prob_between_10_and_20)\n"
]
},
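{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a continuous uniform distribution the CDF has a simple closed form,\n",
"$P(X < x) = \\frac{x - \\text{loc}}{\\text{scale}}$, so the results above\n",
"can be checked by hand:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from scipy.stats import uniform\n",
"\n",
"# For a continuous uniform on [0, 30], P(X < x) = x / 30\n",
"print(uniform.cdf(5, 0, 30))\n",
"print(5 / 30)\n",
"\n",
"# Complements and interval probabilities follow the same arithmetic\n",
"print(1 - uniform.cdf(5, 0, 30))\n",
"print(uniform.cdf(20, 0, 30) - uniform.cdf(10, 0, 30))\n"
]
},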
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Simulating wait times\n",
"\n",
"To give Amir a better idea of how long he'll have to wait, you'll\n",
"simulate Amir waiting 1000 times and create a histogram to show him what\n",
"he should expect. Recall from the last exercise that his minimum wait\n",
"time is 0 minutes and his maximum wait time is 30 minutes.\n",
"\n",
"As usual, `pandas` as `pd`, `numpy` as `np`, and `matplotlib.pyplot` as\n",
"`plt` are loaded.\n",
"\n",
"**Instructions**\n",
"\n",
"- Set the random seed to `334`.\n",
"- Import `uniform` from `scipy.stats`.\n",
"- Generate 1000 wait times from the continuous uniform distribution that models Amir's wait time. Save this as `wait_times`.\n",
"- Create a histogram of the simulated wait times and show the plot.\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set random seed to 334\n",
"np.random.seed(334)\n",
"\n",
"# Import uniform\n",
"from scipy.stats import uniform\n",
"\n",
"# Generate 1000 wait times between 0 and 30 mins\n",
"wait_times = uniform.rvs(0, 30, size=1000)\n",
"\n",
"print(wait_times)\n",
"\n",
"\n",
"# Set random seed to 334\n",
"np.random.seed(334)\n",
"\n",
"# Import uniform\n",
"from scipy.stats import uniform\n",
"\n",
"# Generate 1000 wait times between 0 and 30 mins\n",
"wait_times = uniform.rvs(0, 30, size=1000)\n",
"\n",
"# Create a histogram of simulated times and show plot\n",
"plt.hist(wait_times)\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Simulating sales deals\n",
"\n",
"Assume that Amir usually works on 3 deals per week, and overall, he wins\n",
"30% of deals he works on. Each deal has a binary outcome: it's either\n",
"lost, or won, so you can model his sales deals with a binomial\n",
"distribution. In this exercise, you'll help Amir simulate a year's worth\n",
"of his deals so he can better understand his performance.\n",
"\n",
"`numpy` is imported as `np`.\n",
"\n",
"**Instructions**\n",
"\n",
"- Import `binom` from `scipy.stats` and set the random seed to 10.\n",
"\n",
"<!-- -->\n",
"\n",
"- Simulate 1 deal worked on by Amir, who wins 30% of the deals he works\n",
" on.\n",
"\n",
"<!-- -->\n",
"\n",
"- Simulate a typical week of Amir's deals, or one week of 3 deals.\n",
"\n",
"<!-- -->\n",
"\n",
"- Simulate a year's worth of Amir's deals, or 52 weeks of 3 deals each,\n",
" and store in `deals`.\n",
"- Print the mean number of deals he won per week.\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import binom from scipy.stats\n",
"from scipy.stats import binom\n",
"\n",
"# Set random seed to 10\n",
"np.random.seed(10)\n",
"\n",
"# Simulate a single deal\n",
"print(binom.rvs(1, 0.3, size=1))\n",
"\n",
"\n",
"# Import binom from scipy.stats\n",
"from scipy.stats import binom\n",
"\n",
"# Set random seed to 10\n",
"np.random.seed(10)\n",
"\n",
"# Simulate 1 week of 3 deals\n",
"print(binom.rvs(3, 0.3, size=1))\n",
"\n",
"\n",
"# Import binom from scipy.stats\n",
"from scipy.stats import binom\n",
"\n",
"# Set random seed to 10\n",
"np.random.seed(10)\n",
"\n",
"# Simulate 52 weeks of 3 deals\n",
"deals = binom.rvs(3, 0.3, size=52)\n",
"\n",
"# Print mean deals won per week\n",
"print(np.mean(deals))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Calculating binomial probabilities\n",
"\n",
"Just as in the last exercise, assume that Amir wins 30% of deals. He\n",
"wants to get an idea of how likely he is to close a certain number of\n",
"deals each week. In this exercise, you'll calculate what the chances are\n",
"of him closing different numbers of deals using the binomial\n",
"distribution.\n",
"\n",
"`binom` is imported from `scipy.stats`.\n",
"\n",
"**Instructions**\n",
"\n",
"- What's the probability that Amir closes all 3 deals in a week? Save\n",
" this as `prob_3`.\n",
"- What's the probability that Amir closes 1 or fewer deals in a week? Save this as `prob_less_than_or_equal_1`.\n",
"- What's the probability that Amir closes more than 1 deal? Save this as `prob_greater_than_1`.\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Probability of closing 3 out of 3 deals\n",
"prob_3 = binom.pmf(3, 3, 0.3)\n",
"\n",
"print(prob_3)\n",
"\n",
"\n",
"# Probability of closing <= 1 deal out of 3 deals\n",
"prob_less_than_or_equal_1 = binom.cdf(1, 3, 0.3)\n",
"\n",
"print(prob_less_than_or_equal_1)\n",
"\n",
"\n",
"# Probability of closing > 1 deal out of 3 deals\n",
"prob_greater_than_1 = 1 - binom.cdf(1, 3, 0.3)\n",
"\n",
"print(prob_greater_than_1)\n"
]
},
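{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two useful sanity checks on these binomial results: the `pmf` over all\n",
"possible outcomes sums to 1, and `cdf(1)` is the same as\n",
"`pmf(0) + pmf(1)`:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy.stats import binom\n",
"\n",
"n, p = 3, 0.3\n",
"\n",
"# The pmf over all possible outcomes sums to 1\n",
"print(sum(binom.pmf(k, n, p) for k in range(n + 1)))\n",
"\n",
"# cdf(1) equals pmf(0) + pmf(1)\n",
"print(np.isclose(binom.cdf(1, n, p), binom.pmf(0, n, p) + binom.pmf(1, n, p)))\n"
]
},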
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How many sales will be won?\n",
"\n",
"Now Amir wants to know how many deals he can expect to close each week\n",
"if his win rate changes. Luckily, you can use your binomial distribution\n",
"knowledge to help him calculate the expected value in different\n",
"situations. Recall from the video that the expected value of a binomial\n",
"distribution can be calculated by \\\\n \\times p\\\\.\n",
"\n",
"**Instructions**\n",
"\n",
"- Calculate the expected number of sales out of the **3** he works on\n",
" that Amir will win each week if he maintains his 30% win rate.\n",
"- Calculate the expected number of sales out of the 3 he works on that\n",
" he'll win if his win rate drops to 25%.\n",
"- Calculate the expected number of sales out of the 3 he works on that\n",
" he'll win if his win rate rises to 35%.\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Expected number won with 30% win rate\n",
"won_30pct = 3 * 0.3\n",
"print(won_30pct)\n",
"\n",
"# Expected number won with 25% win rate\n",
"won_25pct = 3 * 0.25\n",
"print(won_25pct)\n",
"\n",
"# Expected number won with 35% win rate\n",
"won_35pct = 3 * 0.35\n",
"print(won_35pct)\n"
]
},
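{
"cell_type": "markdown",
"metadata": {},
"source": [
"The $n \\times p$ formula can also be verified by simulation: the mean\n",
"of many simulated weeks should land close to $3 \\times 0.3 = 0.9$. This\n",
"is a stand-alone sketch (seed chosen arbitrarily):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy.stats import binom\n",
"\n",
"np.random.seed(0)\n",
"\n",
"# The mean of many simulated weeks should be close to n * p = 0.9\n",
"sims = binom.rvs(3, 0.3, size=100000)\n",
"print(np.mean(sims))\n",
"print(3 * 0.3)\n"
]
},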
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## More Distributions and the Central Limit Theorem\n",
"\n",
"### Distribution of Amir's sales\n",
"\n",
"Since each deal Amir worked on (both won and lost) was different, each\n",
"was worth a different amount of money. These values are stored in the\n",
"`amount` column of `amir_deals` As part of Amir's performance review,\n",
"you want to be able to estimate the probability of him selling different\n",
"amounts, but before you can do this, you'll need to determine what kind\n",
"of distribution the `amount` variable follows.\n",
"\n",
"Both `pandas` as `pd` and `matplotlib.pyplot` as `plt` are loaded and\n",
"`amir_deals` is available.\n",
"\n",
"**Instructions**\n",
"\n",
"- Create a histogram with 10 bins to visualize the distribution of the\n",
" `amount`. Show the plot.\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Histogram of amount with 10 bins and show plot\n",
"amir_deals['amount'].hist(bins=10)\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Probabilities from the normal distribution\n",
"\n",
"Since each deal Amir worked on (both won and lost) was different, each\n",
"was worth a different amount of money. These values are stored in the\n",
"`amount` column of `amir_deals` and follow a normal distribution with a\n",
"mean of 5000 dollars and a standard deviation of 2000 dollars. As part\n",
"of his performance metrics, you want to calculate the probability of\n",
"Amir closing a deal worth various amounts.\n",
"\n",
"`norm` from `scipy.stats` is imported as well as `pandas` as `pd`. The\n",
"DataFrame `amir_deals` is loaded.\n",
"\n",
"**Instructions**\n",
"\n",
"- What's the probability of Amir closing a deal worth less than \\$7500?\n",
"\n",
"- What's the probability of Amir closing a deal worth more than \\$1000?\n",
"\n",
"- What's the probability of Amir closing a deal worth between \\$3000 and\n",
" \\$7000?\n",
"\n",
"- What amount will 25% of Amir's sales be *less than*?\n",
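"\n",
"Note that `norm.cdf(x, loc, scale)` with `loc=5000` and `scale=2000` is\n",
"equivalent to standardizing to a z-score first and using the standard\n",
"normal. A minimal sketch, assuming `scipy` is installed:\n",
"\n",
"``` python\n",
"from scipy.stats import norm\n",
"\n",
"# P(amount < 7500) computed two equivalent ways\n",
"p_direct = norm.cdf(7500, 5000, 2000)      # loc/scale form\n",
"p_zscore = norm.cdf((7500 - 5000) / 2000)  # z = 1.25, standard normal\n",
"print(p_direct, p_zscore)  # both approximately 0.894\n",
"```\n",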
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Probability of deal < 7500\n",
"prob_less_7500 = norm.cdf(7500, 5000, 2000)\n",
"\n",
"print(prob_less_7500)\n",
"\n",
"\n",
"# Probability of deal > 1000\n",
"prob_over_1000 = 1 - norm.cdf(1000, 5000, 2000)\n",
"\n",
"print(prob_over_1000)\n",
"\n",
"\n",
"# Probability of deal between 3000 and 7000\n",
"prob_3000_to_7000 = norm.cdf(7000, 5000, 2000) - norm.cdf(3000, 5000, 2000)\n",
"\n",
"print(prob_3000_to_7000)\n",
"\n",
"\n",
"# Calculate amount that 25% of deals will be less than\n",
"pct_25 = norm.ppf(0.25, 5000, 2000)\n",
"\n",
"print(pct_25)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Simulating sales under new market conditions\n",
"\n",
"The company's financial analyst is predicting that next quarter, the\n",
"worth of each sale will increase by 20% and the volatility, or standard\n",
"deviation, of each sale's worth will increase by 30%. To see what Amir's\n",
"sales might look like next quarter under these new market conditions,\n",
"you'll simulate new sales amounts using the normal distribution and\n",
"store these in a variable called `new_sales`.\n",
"\n",
"In addition, `norm` from `scipy.stats`, `pandas` as `pd`, and\n",
"`matplotlib.pyplot` as `plt` are loaded.\n",
"\n",
"**Instructions**\n",
"\n",
"- Currently, Amir's average sale amount is \\$5000. Calculate what his\n",
" new average amount will be if it increases by 20% and store this in\n",
" `new_mean`.\n",
"- Amir's current standard deviation is \\$2000. Calculate what his new\n",
" standard deviation will be if it increases by 30% and store this in\n",
" `new_sd`.\n",
"- Create a variable called `new_sales`, which contains 36 simulated\n",
" amounts from a normal distribution with a mean of `new_mean` and a\n",
" standard deviation of `new_sd`.\n",
"- Plot the distribution of the simulated `new_sales` amounts using a\n",
"  histogram and show the plot.\n",
"\n",
"**Answer**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Calculate new average amount\n",
"new_mean = 5000 * 1.2\n",
"\n",
"# Calculate new standard deviation\n",
"new_sd = 2000 * 1.3\n",
"\n",
"# Simulate 36 new sales\n",
"new_sales = norm.rvs(new_mean, new_sd, size=36)\n",
"\n",
"# Create histogram and show\n",
"plt.hist(new_sales)\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The CLT in action\n",
"\n",
"The central limit theorem states that a sampling distribution of a\n",
"sample statistic approaches the normal distribution as you take more\n",
"samples, no matter the original distribution being sampled from.\n",
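"\n",
"A minimal simulation sketch of this idea (illustrative, separate from\n",
"the exercise): draw repeated samples from a clearly non-normal\n",
"exponential distribution and watch the sample means cluster around the\n",
"population mean:\n",
"\n",
"``` python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(42)\n",
"# Heavily skewed source distribution with population mean 2\n",
"data = rng.exponential(scale=2, size=100_000)\n",
"# Take the mean of many samples of size 50\n",
"sample_means = [rng.choice(data, size=50).mean() for _ in range(1_000)]\n",
"print(np.mean(sample_means))  # close to 2; a histogram would look normal\n",
"```\n",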
"\n",
"In this exercise, you'll focus on the sample mean and see the central\n",
"limit theorem in action while examining the `num_users` column of\n",
"`amir_deals` more closely, which contains the number of people who\n",