The purpose of this repository is to create predictive models and automate R Markdown reports. The analyses are performed on the Online News Popularity Data Set from UCI. Additional information about this data can be accessed here.
- url: URL of the article (non-predictive)
- timedelta: Days between the article publication and the dataset acquisition (non-predictive)
- n_tokens_title: Number of words in the title
- n_tokens_content: Number of words in the content
- n_unique_tokens: Rate of unique words in the content
- n_non_stop_words: Rate of non-stop words in the content
- n_non_stop_unique_tokens: Rate of unique non-stop words in the content
- num_hrefs: Number of links
- num_self_hrefs: Number of links to other articles published by Mashable
- num_imgs: Number of images
- num_videos: Number of videos
- average_token_length: Average length of the words in the content
- num_keywords: Number of keywords in the metadata
- data_channel_is_lifestyle: Is data channel 'Lifestyle'?
- data_channel_is_entertainment: Is data channel 'Entertainment'?
- data_channel_is_bus: Is data channel 'Business'?
- data_channel_is_socmed: Is data channel 'Social Media'?
- data_channel_is_tech: Is data channel 'Tech'?
- data_channel_is_world: Is data channel 'World'?
- kw_min_min: Worst keyword (min. shares)
- kw_max_min: Worst keyword (max. shares)
- kw_avg_min: Worst keyword (avg. shares)
- kw_min_max: Best keyword (min. shares)
- kw_max_max: Best keyword (max. shares)
- kw_avg_max: Best keyword (avg. shares)
- kw_min_avg: Avg. keyword (min. shares)
- kw_max_avg: Avg. keyword (max. shares)
- kw_avg_avg: Avg. keyword (avg. shares)
- self_reference_min_shares: Min. shares of referenced articles in Mashable
- self_reference_max_shares: Max. shares of referenced articles in Mashable
- self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
- weekday_is_monday: Was the article published on a Monday?
- weekday_is_tuesday: Was the article published on a Tuesday?
- weekday_is_wednesday: Was the article published on a Wednesday?
- weekday_is_thursday: Was the article published on a Thursday?
- weekday_is_friday: Was the article published on a Friday?
- weekday_is_saturday: Was the article published on a Saturday?
- weekday_is_sunday: Was the article published on a Sunday?
- is_weekend: Was the article published on the weekend?
- LDA_00: Closeness to LDA topic 0
- LDA_01: Closeness to LDA topic 1
- LDA_02: Closeness to LDA topic 2
- LDA_03: Closeness to LDA topic 3
- LDA_04: Closeness to LDA topic 4
- global_subjectivity: Text subjectivity
- global_sentiment_polarity: Text sentiment polarity
- global_rate_positive_words: Rate of positive words in the content
- global_rate_negative_words: Rate of negative words in the content
- rate_positive_words: Rate of positive words among non-neutral tokens
- rate_negative_words: Rate of negative words among non-neutral tokens
- avg_positive_polarity: Avg. polarity of positive words
- min_positive_polarity: Min. polarity of positive words
- max_positive_polarity: Max. polarity of positive words
- avg_negative_polarity: Avg. polarity of negative words
- min_negative_polarity: Min. polarity of negative words
- max_negative_polarity: Max. polarity of negative words
- title_subjectivity: Title subjectivity
- title_sentiment_polarity: Title polarity
- abs_title_subjectivity: Absolute subjectivity level
- abs_title_sentiment_polarity: Absolute polarity level
- shares: Number of shares (target)
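The six data_channel_is_* indicators listed above can be collapsed into a single channel column, which makes per-channel subsetting straightforward. The following is a minimal sketch on a made-up data frame; the `news` object, the label strings, and the handling of rows with no indicator set are assumptions for illustration, not the repository's actual code.

```r
# Toy data frame with the six indicator columns (values are made up)
news <- data.frame(
  data_channel_is_lifestyle     = c(1, 0, 0, 0),
  data_channel_is_entertainment = c(0, 1, 0, 0),
  data_channel_is_bus           = c(0, 0, 0, 0),
  data_channel_is_socmed        = c(0, 0, 0, 0),
  data_channel_is_tech          = c(0, 0, 1, 0),
  data_channel_is_world         = c(0, 0, 0, 0),
  shares                        = c(593, 711, 1500, 1200)
)

# Hypothetical labels, in the same order as the indicator columns
channel_cols   <- grep("^data_channel_is_", names(news), value = TRUE)
channel_labels <- c("Lifestyle", "Entertainment", "Business",
                    "SocialMedia", "Tech", "World")

# For each row, find the indicator that is set; rows with none set become NA
idx <- apply(news[channel_cols], 1, function(r) {
  hit <- which(r == 1)
  if (length(hit) == 1) hit else NA_integer_
})
news$channel <- channel_labels[idx]

# Subset for a single channel, as one report would use
tech <- subset(news, channel == "Tech")
```

With the real data, the same idea would be applied to the full data frame before splitting it by channel.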
In this project, subsets by data_channel_is_* were produced to automate the R Markdown reports. The predictive models used include linear regression models, a random forest model, and a boosted tree model. These models were fit on a training data set and then tested on a testing data set. The best model was selected based on the lowest RMSE.
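The train, test, and compare-by-RMSE workflow can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are simulated, the 70/30 split ratio and the two candidate formulas are invented for the example, and the project itself fits its models through caret. Random forest and boosted tree fits would be scored on the test set in the same way.

```r
set.seed(123)

# Simulated stand-in for one channel's data (not the real data set)
n <- 200
dat <- data.frame(
  num_hrefs = rpois(n, 10),
  num_imgs  = rpois(n, 4)
)
dat$shares <- 1000 + 50 * dat$num_hrefs + 20 * dat$num_imgs + rnorm(n, sd = 200)

# 70/30 train/test split (the split ratio is an assumption)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Root mean squared error between observed and predicted values
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Two candidate linear regression models fit on the training set
fits <- list(
  lm_hrefs = lm(shares ~ num_hrefs, data = train),
  lm_both  = lm(shares ~ num_hrefs + num_imgs, data = train)
)

# Test-set RMSE for each candidate; the smallest value wins
test_rmse <- sapply(fits, function(f) rmse(test$shares, predict(f, newdata = test)))
best_model <- names(which.min(test_rmse))
```

Scoring every candidate on the same held-out rows keeps the comparison fair, since no model is evaluated on data it was fit to.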
- caret: To run the regression and ensemble methods with a train/test split and cross-validation.
- dplyr: A part of the tidyverse, used for manipulating data.
- GGally: To create ggcorr() and ggpairs() correlation plots.
- glmnet: To access best subset selection.
- ggplot2: A part of the tidyverse, used for creating graphics.
- gridExtra: To plot with multiple grid objects.
- gt: To test a low-dimensional null hypothesis against high-dimensional alternative models.
- knitr: To get nice table printing formats, mainly for the contingency tables.
- leaps: To identify different best models of different sizes.
- rmarkdown: To render several output formats.
- MASS: To access forward and backward selection algorithms.
- randomForest: To access random forest algorithms.
- tidyr: A part of the tidyverse, used for data cleaning.
The analysis for Lifestyle articles is available here.
The analysis for Entertainment articles is available here.
The analysis for Business articles is available here.
The analysis for Social Media articles is available here.
The analysis for Tech articles is available here.
The analysis for World articles is available here.
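library(rmarkdown)
library(tibble)

# One report per data channel found in the data
selectID <- unique(newData$channel)
# Output file name for each channel: "<channel>Analysis.md"
output_file <- paste0(selectID, "Analysis.md")
# One parameter list per channel, passed to the Rmd as params$channel
params <- lapply(selectID, FUN = function(x) { list(channel = x) })
# Pair each output file with its parameter list
reports <- tibble(output_file, params)

# Render Project_3.Rmd once per row of reports
apply(reports, MARGIN = 1,
      FUN = function(x) {
        render(input = "./Project_3.Rmd",
               output_format = "github_document",
               output_file = x[[1]],
               params = x[[2]])
      })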
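To see what the render loop above iterates over without knitting anything, the sketch below builds the same file-name and parameter pairs from two hypothetical channel labels and walks them with Map(). Here `render_stub` is an invented stand-in for rmarkdown::render() that only reports what it would be called with; the channel labels are also assumptions.

```r
# Hypothetical channel labels standing in for unique(newData$channel)
selectID <- c("Lifestyle", "Tech")

output_file <- paste0(selectID, "Analysis.md")
params <- lapply(selectID, function(x) list(channel = x))

# Stand-in for rmarkdown::render(): describes the call instead of knitting
render_stub <- function(output_file, params) {
  paste0(output_file, " <- channel = ", params$channel)
}

# Map() pairs each file name with its parameter list, one call per report
calls <- unlist(Map(render_stub, output_file, params))
print(unname(calls))
```

Each element of `calls` corresponds to one report: the output file name on the left and the channel value the Rmd would receive as params$channel on the right.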