Exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

rakibhhridoy/ExploratoryDataAnalysis-Python


Exploratory Data Analysis in Python & Hypothesis Testing

1: Defining Exploratory Data Analysis with an overview of the whole project.

2: Importing libraries and Exploring the Dataset.

3: Checking missing values and Outliers.

4: Creating visual methods to analyze the data.

5: Analyzing trends, patterns, and relationships in the data, and hypothesis testing.

Exploratory Data Analysis

In statistics, exploratory data analysis is an approach to analyzing 
data sets to summarize their main characteristics, often with visual methods. 
A statistical model can be used or not, but primarily EDA is for seeing what 
the data can tell us beyond the formal modeling or hypothesis testing task. 
Exploratory data analysis was promoted by John Tukey to encourage statisticians 
to explore the data, and possibly formulate hypotheses that could lead to new 
data collection and experiments. EDA is different from initial data analysis (IDA),
which focuses more narrowly on checking assumptions required for model fitting and 
hypothesis testing, and handling missing values and making transformations of variables 
as needed. EDA encompasses IDA.

Importing libraries and Exploring the Dataset

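import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import scipy.stats as stats
from sklearn.preprocessing import LabelEncoder
import copy
sns.set()  # Apply seaborn's default plot styling

# Assumption: the dataset is stored locally as 'insurance.csv'; adjust the path to your copy
insurance_df = pd.read_csv('insurance.csv')
insurance_df.info()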
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

Expected output:

The data should consist of 1338 instances with 7 attributes: 2 of integer type, 2 of float type, and 3 of object type (string columns). No column contains null values.

Checking missing values and Outliers

insurance_df.isna().sum()  # Number of missing values in each column
insurance_df.describe().T  # Summary statistics, one row per numeric column

Output should include this analysis:

  • All the statistics seem reasonable.

  • Age column: the data looks representative of the true age distribution of the adult population, with a mean of about 39.

  • Children column: few people have more than 2 children (75% of the people have 2 or fewer children).

  • The claimed amount is highly skewed, as most people require only basic medical care and only a few suffer from diseases that are expensive to treat.
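
The section title mentions outliers, but the summary statistics above do not flag them directly. A minimal sketch of one common approach, the 1.5 × IQR rule, is shown below. The helper `iqr_outliers` and the `demo` series are hypothetical and only stand in for a numeric column such as `insurance_df['charges']`.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Hypothetical stand-in for a numeric column such as insurance_df['charges']
demo = pd.Series([10.0, 12.0, 11.0, 13.0, 12.5, 11.5, 95.0])
print(iqr_outliers(demo))  # Only the value 95.0 falls outside the fences
```

For a heavily right-skewed column, this rule tends to flag many of the largest values, so flagged rows are best inspected rather than dropped automatically.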

png

Output should include this Analysis:

  • bmi looks normally distributed.

  • Age looks uniformly distributed.

  • As seen in the previous step, charges are highly skewed.
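
The visual impressions above can be backed by a number. The following is a rough sketch using `DataFrame.skew()`; `demo_df` is synthetic data with made-up distribution parameters, not the insurance dataset. A skewness near 0 suggests a symmetric distribution, while a large positive value indicates a long right tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the numeric columns (all parameters are assumptions)
demo_df = pd.DataFrame({
    "age": rng.integers(18, 65, size=1000),                 # roughly uniform
    "bmi": rng.normal(30, 6, size=1000),                    # roughly normal
    "charges": rng.lognormal(mean=9, sigma=1, size=1000),   # right-skewed
})

# Sample skewness of each column
skewness = demo_df.skew()
print(skewness.round(2))
```

On the real data, the same call would be `insurance_df[['age', 'bmi', 'charges']].skew()`.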

png

Output should include this Analysis:

  • There are a lot more non-smokers than smokers.

  • Instances are distributed evenly across all regions.

  • Gender is also distributed evenly.

  • Most instances have fewer than 3 children and very few have 4 or 5 children.
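
The category balance read from the count plots can also be tabulated. Below is a small sketch using `value_counts(normalize=True)`; the `demo_df` rows are invented for illustration and do not come from the insurance dataset.

```python
import pandas as pd

# Invented rows standing in for the categorical columns of insurance_df
demo_df = pd.DataFrame({
    "smoker": ["no", "no", "no", "yes", "no", "yes", "no", "no"],
    "sex": ["male", "female", "female", "male", "female", "male", "male", "female"],
})

# Share of each category per column, as fractions that sum to 1
shares = {col: demo_df[col].value_counts(normalize=True) for col in demo_df.columns}
for col, share in shares.items():
    print(share.round(2), end="\n\n")
```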

# Label encoding the categorical variables before the pairplot because pairplot ignores string columns

insurance_df_encoded = insurance_df.copy()
categorical_cols = ['sex', 'smoker', 'region']
# Assign with [] rather than .loc so the columns take the encoded integer dtype
insurance_df_encoded[categorical_cols] = insurance_df_encoded[categorical_cols].apply(LabelEncoder().fit_transform)

sns.pairplot(insurance_df_encoded)  # Pairwise scatter plots and per-variable distributions
plt.show()

png

Output should include this Analysis:

  • There is an obvious correlation between 'charges' and 'smoker'

  • Looks like smokers claimed more money than non-smokers

  • There's an interesting pattern between 'age' and 'charges'. Notice that older people are charged more than the younger ones
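
The relationships read off the pairplot can be quantified with a correlation matrix. The sketch below uses synthetic data in which charges are constructed to rise with age and with smoking; the coefficients, the column name `smoker_code`, and `demo_df` itself are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Synthetic data: charges are built to increase with age and smoking status
smoker = rng.choice(["yes", "no"], size=n, p=[0.2, 0.8])
age = rng.integers(18, 65, size=n)
charges = (3000 + 250 * age
           + np.where(smoker == "yes", 24000, 0)
           + rng.normal(0, 3000, size=n))
demo_df = pd.DataFrame({"age": age, "smoker": smoker, "charges": charges})

# Encode the binary category as 0/1 so it can enter a Pearson correlation
demo_df["smoker_code"] = (demo_df["smoker"] == "yes").astype(int)

corr = demo_df[["age", "smoker_code", "charges"]].corr()
print(corr["charges"].round(2))
```

A correlation with a 0/1 column is the point-biserial correlation, so its sign shows which group has the higher mean.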

Analyzing trends, patterns, and relationships in the Data.

print("Do charges of people who smoke differ significantly from the people who don't?")
insurance_df.smoker.value_counts()
Do charges of people who smoke differ significantly from the people who don't?

no     1064
yes     274
Name: smoker, dtype: int64

png png

There is no apparent relation between gender and charges

# T-test to check whether charges depend on smoking status
Ho = "Charges of smoker and non-smoker are same"   # Stating the Null Hypothesis
Ha = "Charges of smoker and non-smoker are not the same"   # Stating the Alternate Hypothesis

x = np.array(insurance_df[insurance_df.smoker == 'yes'].charges)  # Selecting charges corresponding to smokers as an array
y = np.array(insurance_df[insurance_df.smoker == 'no'].charges) # Selecting charges corresponding to non-smokers as an array

t, p_value  = stats.ttest_ind(x,y, axis = 0)  #Performing an Independent t-test

if p_value < 0.05:  # Setting our significance level at 5%
    print(f'{Ha} as the p_value ({p_value}) < 0.05')
else:
    print(f'{Ho} as the p_value ({p_value}) > 0.05')
Charges of smoker and non-smoker are not the same as the p_value (8.271435842177219e-283) < 0.05

Thus, smokers appear to claim significantly more money than non-smokers.
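
The test above is two-sided and assumes equal variances in both groups, so it establishes a difference but not its direction. A hedged sketch of a variant follows: Welch's t-test (`equal_var=False`), which drops the equal-variance assumption, combined with the difference in group means to show the direction. The two arrays are synthetic, and their means and standard deviations are made-up values rather than figures from the dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Synthetic charges for two groups of unequal size and spread (assumed values)
smokers = rng.normal(30000, 10000, size=274)
non_smokers = rng.normal(8000, 6000, size=1064)

# Welch's t-test: does not assume equal variances in the two groups
t_stat, p_value = stats.ttest_ind(smokers, non_smokers, equal_var=False)

# The sign of the mean difference gives the direction of the effect
mean_diff = smokers.mean() - non_smokers.mean()
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, mean difference = {mean_diff:.0f}")
```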

#Does bmi of males differ significantly from that of females?
print ("Does bmi of males differ significantly from that of females?")
insurance_df.sex.value_counts()   #Checking the distribution of males and females
Does bmi of males differ significantly from that of females?

male      676
female    662
Name: sex, dtype: int64
# T-test to check dependency of bmi on gender
Ho = "Gender has no effect on bmi"   # Stating the Null Hypothesis
Ha = "Gender has an effect on bmi"   # Stating the Alternate Hypothesis

x = np.array(insurance_df[insurance_df.sex == 'male'].bmi)  # Selecting bmi values corresponding to males as an array
y = np.array(insurance_df[insurance_df.sex == 'female'].bmi) # Selecting bmi values corresponding to females as an array

t, p_value  = stats.ttest_ind(x,y, axis = 0)  #Performing an Independent t-test

if p_value < 0.05:  # Setting our significance level at 5%
    print(f'{Ha} as the p_value ({p_value.round(3)}) < 0.05')
else:
    print(f'{Ho} as the p_value ({p_value.round(3)}) > 0.05')
Gender has no effect on bmi as the p_value (0.09) > 0.05

There is no statistically significant difference in mean bmi between the genders.

#Is the proportion of smokers significantly different in different genders?


# Chi_square test to check if smoking habits are different for different genders
Ho = "Gender has no effect on smoking habits"   # Stating the Null Hypothesis
Ha = "Gender has an effect on smoking habits"   # Stating the Alternate Hypothesis

crosstab = pd.crosstab(insurance_df['sex'],insurance_df['smoker'])  # Contingency table of sex and smoker attributes

chi, p_value, dof, expected =  stats.chi2_contingency(crosstab)

if p_value < 0.05:  # Setting our significance level at 5%
    print(f'{Ha} as the p_value ({p_value.round(3)}) < 0.05')
else:
    print(f'{Ho} as the p_value ({p_value.round(3)}) > 0.05')
crosstab
Gender has an effect on smoking habits as the p_value (0.007) < 0.05

Proportion of smokers in males is significantly different from that of the females
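
To see where a chi-square result comes from, it helps to compare the observed contingency table with the counts expected under independence. The sketch below does this on invented counts (100 people per gender); the numbers are illustrative and are not the counts from the insurance dataset.

```python
import pandas as pd
from scipy import stats

# Invented data: 30 of 100 males and 15 of 100 females smoke
demo_df = pd.DataFrame({
    "sex": ["male"] * 100 + ["female"] * 100,
    "smoker": ["yes"] * 30 + ["no"] * 70 + ["yes"] * 15 + ["no"] * 85,
})

crosstab = pd.crosstab(demo_df["sex"], demo_df["smoker"])  # Observed counts
chi2, p_value, dof, expected = stats.chi2_contingency(crosstab)

# Wrap the expected counts in a DataFrame with the same labels as the observed table
expected_df = pd.DataFrame(expected, index=crosstab.index, columns=crosstab.columns)

print(crosstab, expected_df, sep="\n\n")
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")
```

For a 2×2 table, `chi2_contingency` applies Yates' continuity correction by default, which makes the test slightly more conservative.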

# Chi_square test to check if smoking habits are different for people of different regions
Ho = "Region has no effect on smoking habits"   # Stating the Null Hypothesis
Ha = "Region has an effect on smoking habits"   # Stating the Alternate Hypothesis

crosstab = pd.crosstab(insurance_df['smoker'], insurance_df['region'])  # Contingency table of smoker and region attributes

chi, p_value, dof, expected =  stats.chi2_contingency(crosstab)

if p_value < 0.05:  # Setting our significance level at 5%
    print(f'{Ha} as the p_value ({p_value.round(3)}) < 0.05')
else:
    print(f'{Ho} as the p_value ({p_value.round(3)}) > 0.05')
crosstab
Region has no effect on smoking habits as the p_value (0.062) > 0.05
  • Smoking habits of people in different regions do not differ significantly.
# Is the distribution of bmi across women with no children, one child and two children, the same ?
# Test to see if the distributions of bmi values for females having different number of children, are significantly different

Ho = "No. of children has no effect on bmi"   # Stating the Null Hypothesis
Ha = "No. of children has an effect on bmi"   # Stating the Alternate Hypothesis


female_df = copy.deepcopy(insurance_df[insurance_df['sex'] == 'female'])

zero = female_df[female_df.children == 0]['bmi']
one = female_df[female_df.children == 1]['bmi']
two = female_df[female_df.children == 2]['bmi']


f_stat, p_value = stats.f_oneway(zero,one,two)


if p_value < 0.05:  # Setting our significance level at 5%
    print(f'{Ha} as the p_value ({p_value.round(3)}) < 0.05')
else:
    print(f'{Ho} as the p_value ({p_value.round(3)}) > 0.05')
No. of children has no effect on bmi as the p_value (0.716) > 0.05
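
One-way ANOVA assumes that the groups have similar variances. A minimal sketch of checking that assumption with Levene's test before running `f_oneway` is given below. The three samples are synthetic, drawn from the same distribution with assumed parameters and group sizes, and only stand in for the `zero`, `one`, and `two` series.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic bmi samples for three groups (sizes and parameters are assumptions)
zero = rng.normal(30, 6, size=280)
one = rng.normal(30, 6, size=150)
two = rng.normal(30, 6, size=110)

# Levene's test: null hypothesis is that all groups have equal variance
levene_stat, levene_p = stats.levene(zero, one, two)

# One-way ANOVA: null hypothesis is that all group means are equal
f_stat, p_value = stats.f_oneway(zero, one, two)

print(f"Levene p = {levene_p:.3f}, ANOVA F = {f_stat:.3f}, ANOVA p = {p_value:.3f}")
```

If Levene's test rejects equal variances, a rank-based alternative such as `stats.kruskal` may be a safer choice.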

Get in Touch With Me

Connect- Linkedin
Website- RakibHHridoy
