Niko Seino 2023-05-05
- About This Project
- Data Preparation
- Data Cleaning
- Data Visualization
- Data Exploration
- Using Data to Make Predictions
This project contains an analysis of survey data containing 37 variables related to the health and demographics of survey participants.
My goal was to explore and visualize the data, then determine the best demographic variables for predicting depression.
To do this, I focused on four specific variables:
ADDEPEV2: Shows whether or not participants have been diagnosed with a depressive disorder. In the dataset, this variable is numeric but will need to be treated as nominal. The possible responses are: 1=Yes, 2=No, 7=Don’t know/Not sure, 9=Refused
POORHLTH: Shows how many days in the past 30 that poor physical or mental health kept participant from doing usual day-to-day activities.
The values in this variable are integer values referring to days between 1-30, and ‘88’ is used to indicate “None”, or 0 days. Other values include 77, meaning “Don’t know”,and 99, meaning “refused”.
EDUCA Shows participants’ level of education.This is a numeric variable with the numbers representing categories of education level. There are 6 categories 1-6 starting with 1 representing no formal education or only kindergarten, up to 6 representing 4 or more years of college. An additional category, 9, represents participants who did not give an answer.
CHILDREN Shows the number of children in each participant’s household. This is a numeric variable representing the number of children. The value has a range of 1-87, and 88 represents “none” or 0, which was important to note when doing statistical calculations. ‘99’ represents those who did not give an answer.
# Clear memory
rm(list = ls())
# Load packages
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(lm.beta))
# Set working directory and load dataset
#setwd("~/DTSC650")
brfss <- read_csv("BRFSS2015_650.csv", show_col_types = FALSE)
- Transforming variables
#Viewing the "Poor Health" data:
options(scipen = 999)
ggplot(brfss, aes(x = brfss$POORHLTH, na.rm = TRUE)) +
geom_histogram(binwidth = 3)
# The graph shows that the data is varied between 1-30 with no real outliers.
# I can also see that the majority of responses were 88, or 0 days.
# To better represent this and clean up the graph, I will update the data
# to show the zeros and also eliminate the 77s and 99s.
brfss$POORHLTH <- replace(brfss$POORHLTH, brfss$POORHLTH == 88, 0)
brfss <- brfss %>%
filter(POORHLTH <= 30, na.rm = TRUE)
# Updated graph:
ggplot(brfss, aes(x = POORHLTH)) +
geom_histogram(binwidth = 2)
# Now I can clearly see the most responses are at 0, and there are no tails on
# either end of the graph indicating no outliers.
# ADDEPEV2: Transforming into a binary variable with options "Yes" or "no"
# Eliminating the non responses ('7's and '9's)
# Converting variables to factors with levels
brfss <- brfss %>%
filter(ADDEPEV2 == 1 | ADDEPEV2 == 2)
brfss$ADDEPEV2 <- factor(brfss$ADDEPEV2, levels = c(1, 2), labels = c("Yes", "No"))
# Converting variables to factors and removing non responses
brfss <- brfss %>%
filter(EDUCA > 0, EDUCA < 7)
# Adding labels to decrease confusion
brfss$EDUCA <- factor(brfss$EDUCA,
levels = c("1", "2", "3", "4", "5", "6"),
labels = c("None", "Elementary", "Some High School",
"High School Graduate", "Some College",
"College Graduate"))
- Removing Outliers
# Creating a graph to view the "Children" data:
ggplot(brfss, aes(x = brfss$CHILDREN, na.rm = TRUE)) +
geom_histogram(binwidth = 3)
# The 88 and 99 values are making the data difficult to read, so I need to
# transform the 88 to 0 and remove the 99's since they are not relevant.
# I also want to know what the max is so I can better fit the graph.
brfss$CHILDREN <- replace(brfss$CHILDREN, brfss$CHILDREN == 88, 0)
brfss <- brfss %>%
filter(CHILDREN <= 87)
max(brfss$CHILDREN)
## [1] 77
#max number of children in a household is 77. This seems odd and a possible outlier.
# Creating a boxplot to view quantiles
ggplot(brfss, aes(x = CHILDREN)) +
geom_boxplot()
# Here we can see several possible outliers.
childrenupper <- quantile(brfss$CHILDREN, .9985, na.rm = TRUE)
# This shows that 99.85% of the values are at 6 or below.
# Values above 6 are more than 3 standard deviations from the mean, indicating
# they are outliers.
# Update the dataset with outliers removed
children_out <- which(brfss$CHILDREN > 6)
length(children_out)
## [1] 217
brfss <- brfss[-children_out, ]
# Visualizing poor health days
Q12a <- ggplot(brfss, aes(x = POORHLTH)) +
geom_bar() +
labs(title = "Days in the Past 30 in which Poor Health
Prevented Usual Daily Activities")
Q12a
# Poor health days by depression dx
ggplot(brfss, aes(x = POORHLTH)) +
geom_bar() +
facet_wrap(~ ADDEPEV2) +
labs(title = "Days in the Past 30 in which Poor Health
Affected Daily Activities of People
With and Without Depression")
Interpretation:
We can see that there are peaks at both ends, 0 and 30, meaning most people either were unaffected by poor health in the last 30 days, or were affected every day. This pattern is similar in those who have been told they have depression, however in this population the proportion of 0 poor health days is much lower, and the proportion of 30 poor health days is higher.
# Simple bar graph to view counts of participants with depression dx
ggplot(brfss, aes(x = ADDEPEV2, fill = ADDEPEV2)) +
geom_bar() +
labs(title = "Number of People in Survey Who Have Had Depression",
x = "Had Depression")
# Depression in relation to education level
Q12b <- ggplot(brfss, aes(x = ADDEPEV2, fill = EDUCA)) +
geom_bar(position = "dodge") +
labs(title = 'Education Level of People Who Have Had Depression')
Q12b
Interpretation:
Among Respondents who reported they have NOT been diagnosed with depression, there is a higher proportion of people who have completed 4 or more years of college. The proportions of other education levels are similar.
# Graphing education levels and number of children
Q12c <- ggplot(brfss) +
geom_bar(aes(x = EDUCA, fill = EDUCA)) +
labs(title = "Education Level of Survey Participants", x = "Education Level") +
theme(axis.text.x = element_text(angle = 90))
Q12c
Q12d <- ggplot(brfss) +
geom_bar(aes(x = CHILDREN)) +
labs(title = "Number of Children in Households of Survey Participants") +
scale_x_continuous(breaks = seq(0, 6, 1))
Q12d
# Creating a table to show frequencies of education level per # of children
edu_kids <- table(brfss$EDUCA, brfss$CHILDREN)
edu_kids
##
## 0 1 2 3 4 5 6
## None 197 41 30 20 9 1 2
## Elementary 4369 601 473 317 149 46 21
## Some High School 9049 1482 1053 648 270 92 36
## High School Graduate 46587 6579 4647 2229 818 275 115
## Some College 45940 7347 5681 2489 1021 304 116
## College Graduate 53195 8527 8706 3212 1051 318 98
# Plotting a comparison of education level vs number of children
ggplot(brfss, aes(x = EDUCA, fill = EDUCA)) +
geom_bar() +
scale_fill_brewer("blues") +
facet_grid(~ CHILDREN) +
labs(title = "Education Level vs Number of Children", x = "Education Level") +
theme(axis.text.x = element_text(angle = 90))
Interpretation:
The majority of participants had 0 children in their households, and for all numbers of children (0-6), counts increased as education level increased, with “College Graduate” being the most common education level. The one exception was participants who had 6 children–“Some College” was the most frequent education level and “High School Graduate” the second most frequent.
Conducting descriptive statistical calculations to gain further insights and answer some questions.
- What is the average number of days poor health affected daily activities of people who completed only elementary school, vs those who graduated college?
Q13a <- brfss %>%
filter(EDUCA == "Elementary" | EDUCA == "College Graduate") %>%
group_by(EDUCA) %>%
summarize(mean_poorhlth = round(mean(POORHLTH), 2))
Q13a
## # A tibble: 2 × 2
## EDUCA mean_poorhlth
## <fct> <dbl>
## 1 Elementary 8.51
## 2 College Graduate 3.65
- What are the frequencies of each education level for participants who had depression?
Q13b <- brfss %>%
filter(ADDEPEV2 == "Yes") %>%
group_by(EDUCA) %>%
summarize(count = n()) %>%
arrange(desc(count))
Q13b
## # A tibble: 6 × 2
## EDUCA count
## <fct> <int>
## 1 Some College 20549
## 2 College Graduate 19877
## 3 High School Graduate 19125
## 4 Some High School 4721
## 5 Elementary 1971
## 6 None 95
- What is the most common education level for participants who reported poor health affected daily activities for more than 15 out of the past 30 days?
Q13c <- brfss %>%
filter(POORHLTH > 15) %>%
group_by(EDUCA) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
head(1)
Q13c
## # A tibble: 1 × 2
## EDUCA count
## <fct> <int>
## 1 High School Graduate 9596
- What is the mean and standard deviation of number of children for those who did vs did not report having depression?
Q13d <- brfss %>%
group_by(ADDEPEV2) %>%
summarize(mean_children = mean(CHILDREN), sd_children = sd(CHILDREN)) %>%
as.data.frame() %>%
mutate_if(is.numeric, round, 2)
Q13d
## ADDEPEV2 mean_children sd_children
## 1 Yes 0.50 0.98
## 2 No 0.53 1.02
- Creating a dataframe with just the variables I want to focus on
df_lr <- brfss %>%
select(POORHLTH, ADDEPEV2, EDUCA, CHILDREN)
- Looking at summary stats
summary(df_lr)
## POORHLTH ADDEPEV2 EDUCA CHILDREN
## Min. : 0.000 Yes: 66338 None : 300 Min. :0.0000
## 1st Qu.: 0.000 No :151823 Elementary : 5976 1st Qu.:0.0000
## Median : 0.000 Some High School :12630 Median :0.0000
## Mean : 5.262 High School Graduate:61250 Mean :0.5193
## 3rd Qu.: 5.000 Some College :62898 3rd Qu.:1.0000
## Max. :30.000 College Graduate :75107 Max. :6.0000
- Performing logistic regressions to predict the likelihood of a participant having depression:
# First, predicting depression from poor health days:
mod1 <- glm(ADDEPEV2 ~ POORHLTH, binomial(link = "logit"), df_lr)
summary(mod1)
##
## Call:
## glm(formula = ADDEPEV2 ~ POORHLTH, family = binomial(link = "logit"),
## data = df_lr)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6727 -1.2361 0.7530 0.7692 1.3301
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.1154794 0.0056256 198.3 <0.0000000000000002 ***
## POORHLTH -0.0489213 0.0004765 -102.7 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 268024 on 218160 degrees of freedom
## Residual deviance: 257348 on 218159 degrees of freedom
## AIC: 257352
##
## Number of Fisher Scoring iterations: 4
# Converting coefficients from log odds to odds ratio
exp(coef(mod1))
## (Intercept) POORHLTH
## 3.051030 0.952256
Interpretation:
POORHLTH odds ratio of .95 indicates that for each additional day of poor health, the odds of not having had depression decreases by about 5%. So there is a negative relationship between poor health days and not having had depression. The POORHLTH variable is statistically significant in this model, however the AIC appears to be quite high. Will test other models to see if we can get a lower AIC.
- Adding another predictor variable
mod2 <- glm(ADDEPEV2 ~ POORHLTH + EDUCA, binomial(), df_lr)
summary(mod2)
##
## Call:
## glm(formula = ADDEPEV2 ~ POORHLTH + EDUCA, family = binomial(),
## data = df_lr)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7206 -1.1986 0.7495 0.7884 1.3954
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.153513 0.128359 8.987 <0.0000000000000002 ***
## POORHLTH -0.047969 0.000482 -99.515 <0.0000000000000002 ***
## EDUCAElementary 0.004921 0.131439 0.037 0.9701
## EDUCASome High School -0.213689 0.129705 -1.648 0.0995 .
## EDUCAHigh School Graduate -0.038998 0.128609 -0.303 0.7617
## EDUCASome College -0.144379 0.128595 -1.123 0.2615
## EDUCACollege Graduate 0.068568 0.128584 0.533 0.5939
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 268024 on 218160 degrees of freedom
## Residual deviance: 256958 on 218154 degrees of freedom
## AIC: 256972
##
## Number of Fisher Scoring iterations: 4
The EDUCA variable does not appear to be a good fit for this model as none of the categories have statistically significant coefficients.
- Predicting depression from poor health days and number of children
mod3 <- glm(ADDEPEV2 ~ POORHLTH + CHILDREN, binomial(), df_lr)
summary(mod3)
##
## Call:
## glm(formula = ADDEPEV2 ~ POORHLTH + CHILDREN, family = binomial(),
## data = df_lr)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6761 -1.2212 0.7505 0.7667 1.3637
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.1230464 0.0062528 179.61 < 0.0000000000000002 ***
## POORHLTH -0.0490393 0.0004785 -102.49 < 0.0000000000000002 ***
## CHILDREN -0.0133104 0.0047708 -2.79 0.00527 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 268024 on 218160 degrees of freedom
## Residual deviance: 257340 on 218158 degrees of freedom
## AIC: 257346
##
## Number of Fisher Scoring iterations: 4
exp(coef(mod3))
## (Intercept) POORHLTH CHILDREN
## 3.0742051 0.9521437 0.9867778
The CHILDREN coefficient is statistically significant and has a negative relationship with the outcome variable.
- Choosing the model with the best predictor variables
# Calculating AIC
AIC(mod3)
## [1] 257345.8
AIC(mod1)
## [1] 257351.6
Result:
The AIC of our model with 2 predictors and the first model with just one predictor are very close, but the AIC of the model with both POORHLTH and CHILDREN is slightly lower, indicating it as the better model.