To help Reddit understand their topics, categorize comments attitude, and predict comments’ likes and dislike scores. We would like to conduct text analysis to better understand the current Reddit community through the subreddits.
- Target popular topics using word clouds.
- Categorize the arritude based on the comments.
- Predict scores for each comment accordingly.
- Google Cloud Platform (BigQuery, Dataproc, Cloud SQL)
- UChicago Research Computing Center
All analysis and data collection is found within jupyter_notebooks folder.
- Data Source:
- Understand people's opinions from a post
- Potentially help Reddit gain an overview of the wider public opinion behind certain topics
- Transformer pipeline:
- Regular Expression Tokenizer
- StopWords Tokenizer
- CountVectorizer
- StringIndexer
- HashingTF
- DF
- Based on the body of the post, predict a post’s success before it’s submitted
- Potentially help Redditors gain upvotes, and predict which posts will get popular enough to hit the front page