The purpose of this project is to clean and analyze raw Amazon Reviews data to determine whether reviews written as part of the Amazon Vine program are more likely to be positive.
Amazon Vine invites the most trusted reviewers on Amazon to post opinions about products to help their fellow customers make informed purchase decisions. Amazon invites customers to become Vine Voices based on the insightful reviews they published on their past purchases and helpfulness of their reviews. Amazon offers Vine members free products that have been submitted to the program by participating selling partners. Vine reviews are the independent opinions of the Vine Voices and the selling partners cannot influence, modify or edit the reviews.
Description pulled from https://www.amazon.com/vine/about
The raw data used in this project came from an S3 bucket hosted on AWS. The dataset I chose to use for this project was the US video game reviews dataset.
Data source: https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Video_Games_v1_00.tsv.gz
- Created Amazon RDS database to store data
- Connected PgAdmin to Amazon RDS
- Created new database in PgAdmin on Amazon RDS server
- In PgAdmin, created tables according to the challenge_schema.sql file: customers_table, products_table, review_id_table, vine_table
In Google Colab
- Installed Spark and Java
- Downloaded the Postgres driver to connect Spark with Postgres
- Started Spark Session
- Loaded data into notebook as a dataframe
Once the data was loaded in as a dataframe, it was filtered into four additional dataframes to match the tables in PgAdmin.
Once all four dataframes were created, they were then written to their corresponding AWS RDS tables.
Set-up of connection variables
Writing dataframes to AWS RDS tables
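The connection set-up and write step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names (build_jdbc_config, write_tables), the dataframe variable names in the comment, and the "append" write mode are assumptions, and the endpoint and credentials are placeholders. It relies on PySpark's DataFrame.write.jdbc and the standard Postgres JDBC driver class.

```python
from typing import Dict, Tuple


def build_jdbc_config(host: str, database: str, user: str, password: str,
                      port: int = 5432) -> Tuple[str, Dict[str, str]]:
    """Return the JDBC URL and connection properties for a Postgres database."""
    jdbc_url = f"jdbc:postgresql://{host}:{port}/{database}"
    properties = {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }
    return jdbc_url, properties


def write_tables(frames, jdbc_url, properties, mode="append"):
    """Write each Spark dataframe to the table named by its dictionary key.

    `frames` maps a table name to a PySpark DataFrame.
    """
    for table_name, frame in frames.items():
        frame.write.jdbc(url=jdbc_url, table=table_name, mode=mode,
                         properties=properties)


# Placeholder values, not a real endpoint or real credentials.
jdbc_url, properties = build_jdbc_config(
    host="<rds-endpoint>", database="<database-name>",
    user="<user>", password="<password>",
)

# With the four dataframes in scope (hypothetical variable names):
# write_tables({"customers_table": customers_df,
#               "products_table": products_df,
#               "review_id_table": review_id_df,
#               "vine_table": vine_df},
#              jdbc_url, properties)
```

Keeping the URL and properties in one helper means the same settings are reused for all four writes, so a typo in the endpoint only needs fixing in one place.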
Example Query in PgAdmin
Example Output
The vine_table was used to analyze whether paid Vine reviews were more likely to be positive than unpaid reviews. It contains the following columns:
- review_id = ID of the review
- star_rating = what the reviewer rated the game (1-5)
- helpful_votes = number of times the review was voted as being 'helpful'
- total_votes = total number of times the review was voted as 'helpful' or 'not helpful'
- vine = whether the review was part of the Vine program (Y or N)
- verified_purchase = whether the review was submitted by a verified purchaser of the game
Two additional dataframes, Vine (paid) and non-Vine (unpaid), were then created with the following criteria:
- Reviews must have at least 20 total votes
- At least 50% of votes for each review were 'helpful'
The following were then calculated for each dataframe:
- total number of reviews
- total number of five-star reviews
- percentage of five-star reviews
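The filtering criteria and the three summary figures can be sketched as below. This is a plain-Python stand-in over a list of dictionaries rather than the Spark dataframes used in the project, so that it runs anywhere. The function names and the sample rows are made up for illustration and are not taken from the dataset; only the column names come from vine_table.

```python
def filter_helpful(rows, min_votes=20, min_helpful_ratio=0.5):
    """Keep reviews with enough total votes and a high enough helpful share."""
    return [
        r for r in rows
        # The vote-count check runs first, so the division never sees zero.
        if r["total_votes"] >= min_votes
        and r["helpful_votes"] / r["total_votes"] >= min_helpful_ratio
    ]


def five_star_summary(rows):
    """Return total reviews, five-star reviews, and the five-star percentage."""
    total = len(rows)
    five_star = sum(1 for r in rows if r["star_rating"] == 5)
    pct = 100 * five_star / total if total else 0.0
    return {"total_reviews": total,
            "five_star_reviews": five_star,
            "five_star_pct": pct}


# Made-up rows for illustration only.
rows = [
    {"review_id": "R1", "star_rating": 5, "helpful_votes": 18, "total_votes": 20, "vine": "Y"},
    {"review_id": "R2", "star_rating": 3, "helpful_votes": 15, "total_votes": 25, "vine": "Y"},
    {"review_id": "R3", "star_rating": 5, "helpful_votes": 2, "total_votes": 5, "vine": "N"},   # too few votes
    {"review_id": "R4", "star_rating": 4, "helpful_votes": 5, "total_votes": 30, "vine": "N"},  # under 50% helpful
    {"review_id": "R5", "star_rating": 5, "helpful_votes": 40, "total_votes": 50, "vine": "N"},
]

helpful = filter_helpful(rows)
vine_summary = five_star_summary([r for r in helpful if r["vine"] == "Y"])
non_vine_summary = five_star_summary([r for r in helpful if r["vine"] == "N"])
print(vine_summary)
print(non_vine_summary)
```

Applying the vote filter once and then splitting on the vine column keeps both groups subject to exactly the same criteria.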
The analysis determined that 51.1% of paid Vine reviews were five-star reviews, versus only 38.7% of unpaid reviews. This indicates a potential bias toward positive reviews in the Vine program, which makes sense since paid reviewers may be more motivated to write positive reviews.
One major caveat to this analysis is that there are far fewer Vine reviews (n=94) than non-Vine reviews (n=40,471), so the Vine percentage rests on a much smaller sample and may skew the comparison.
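One way to gauge how much the small Vine sample matters is to put an approximate 95% interval around each percentage. The sketch below uses the normal approximation for a proportion, which is a technique added here for illustration and was not part of the original analysis. The function name is hypothetical; the inputs are the percentages and sample sizes reported above.

```python
import math


def proportion_interval(p, n, z=1.96):
    """Approximate 95% interval for a proportion (normal approximation)."""
    standard_error = math.sqrt(p * (1 - p) / n)
    return p - z * standard_error, p + z * standard_error


# Reported five-star shares and sample sizes from the analysis.
vine_low, vine_high = proportion_interval(0.511, 94)
non_vine_low, non_vine_high = proportion_interval(0.387, 40471)

print(f"Vine:     {vine_low:.3f} to {vine_high:.3f}")
print(f"Non-Vine: {non_vine_low:.3f} to {non_vine_high:.3f}")
```

The Vine interval comes out considerably wider than the non-Vine one, which is the practical meaning of the caveat: the Vine figure is a much less precise estimate.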