This code was developed as a study tool for the Predictive Modeling, Model Fitting, and Regression Analysis course offered by the University of California, Irvine on Coursera. It uses the Buzz in Social Media data set, available at the UCI Machine Learning Repository, to identify the attributes of social media content that correlate most strongly with the amount of attention (buzz) it received. To do so, several linear regression models are built and then ranked by their model fit measure (R-squared).
- Clone the repository
- Fetch the data set (`regression.tar.gz`)
- Extract it inside `{PROJECT_ROOT}/assets/dataset` so you have the following directories:
  - `{PROJECT_ROOT}/assets/dataset/regression/Twitter`
  - `{PROJECT_ROOT}/assets/dataset/regression/TomsHardware` (won't be used)
- Install the requirements: `pip install -r requirements.txt`
- Run the `social_media_buzz` module: `python -m social_media_buzz`
- Check the results under `/assets/results/`
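Before running the module, the extraction step can be sanity-checked with a few lines of Python. This is only a sketch: the helper name `missing_dataset_dirs` is hypothetical and not part of the project, and it assumes the project root is passed in explicitly.

```python
from pathlib import Path


def missing_dataset_dirs(project_root):
    """Return the expected dataset directories that do not exist yet.

    Hypothetical helper: it only checks the two directories listed in
    the setup steps, not the files inside them.
    """
    base = Path(project_root) / "assets" / "dataset" / "regression"
    expected = [base / "Twitter", base / "TomsHardware"]
    return [path for path in expected if not path.is_dir()]


if __name__ == "__main__":
    # Assumes the script is started from the project root.
    for path in missing_dataset_dirs("."):
        print(f"missing: {path}")
```

An empty result means both directories are in place.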
Special thanks to François Kawala, Ahlame Douzal, Eric Gaussier, and Eustache Diemert (from Université Joseph Fourier and BestofMedia Group) for providing the data set used here.
I'd also like to thank the University of California, Irvine for hosting the UCI Machine Learning Repository, where the data set can be downloaded.
- Load data from file
- Split data into training (80%) and testing (20%) sets
- Create linear regression model for a pair of variables (1 predictor)
- Cycle through features
- Get R-squared for each attribute
- Rank attributes based on R-squared value
- Write short report
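The steps above can be sketched in plain Python as follows. This is not the project's actual code: the function names are made up for illustration, the toy rows stand in for the real data set, and R-squared is computed on the training split, which is an assumption since the steps do not say which split is scored.

```python
import random


def fit_simple_ols(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0  # constant feature: fall back to a flat line
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot if ss_tot else 0.0


def rank_features(rows, target_index, train_fraction=0.8, seed=0):
    """Rank every non-target column by the R-squared of its one-predictor model."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    train = rows[:cut]  # rows[cut:] is the testing split, unused in this sketch
    ys = [row[target_index] for row in train]
    scores = []
    for col in range(len(train[0])):
        if col == target_index:
            continue
        xs = [row[col] for row in train]
        slope, intercept = fit_simple_ols(xs, ys)
        scores.append((col, r_squared(xs, ys, slope, intercept)))
    # Highest R-squared first
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy data: column 0 determines the target (column 2) exactly, column 1 is noise.
rng = random.Random(1)
toy = [(float(x), rng.random(), 3.0 * x + 1.0) for x in range(50)]
ranking = rank_features(toy, target_index=2)
```

Here `ranking` is a list of `(column_index, r_squared)` pairs, so the first entry is the attribute whose single-predictor model fits the target best.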
- Create several folds for Training/Testing data (Cross-validation)
- Cycle through folds
- Rank attributes based on testing data accuracy
- Generate charts
- Fetch data set automatically
- Compare both rankings automatically
- Optimize with threads
- Optimize with Cython?
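For the cross-validation items in the list above, one possible shape is a k-fold loop that fits a one-predictor model on the training folds and scores it on the held-out fold. The sketch below is an illustration under assumptions: the function names are hypothetical, held-out R-squared is used as the "testing data accuracy" measure, and folds are built by striding over row indices rather than shuffling.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs; every row is tested exactly once."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def held_out_r_squared(xs, ys, slope, intercept):
    """R-squared of an already fitted line, evaluated on unseen rows."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot if ss_tot else 0.0


def cross_validated_score(xs, ys, k=5):
    """Average held-out R-squared over k folds (assumes len(xs) >= k)."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        slope, intercept = fit_line(
            [xs[i] for i in train_idx], [ys[i] for i in train_idx]
        )
        scores.append(
            held_out_r_squared(
                [xs[i] for i in test_idx], [ys[i] for i in test_idx], slope, intercept
            )
        )
    return sum(scores) / len(scores)


# Toy check: a perfectly linear relationship should score close to 1.
xs = [float(i) for i in range(40)]
ys = [2.0 * x + 5.0 for x in xs]
score = cross_validated_score(xs, ys, k=5)
```

Computing this score for each attribute and sorting by it would give a second ranking that can be compared with the one based on R-squared from a single split.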