
Amazon Dataset

潜心 edited this page Sep 1, 2020 · 2 revisions

Amazon

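Amazon provides a product dataset containing Amazon product reviews and metadata, including 142.8 million reviews spanning May 1996 to July 2014. It is split into many sub-datasets, such as Book, Electronics, and Movies and TV. Our experiments mainly use the Electronics sub-dataset.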

Electronics

The Amazon-Electronics dataset has two parts: reviews_Electronics_5.json holds the user behavior data, and meta_Electronics holds the ad (item) metadata. A single sample from reviews looks like this:

{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano.  He is having a wonderful time playing these old hymns.  The music  is at times hard to read because we think the book was published for singing from more than playing from.  Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}

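The fields are:

  • reviewerID: user ID;
  • asin: item ID;
  • reviewerName: user name;
  • helpful: helpfulness rating of the review, e.g. [2, 3] above means 2/3;
  • reviewText: text of the review;
  • overall: rating given to the item;
  • summary: summary of the review;
  • unixReviewTime: time of the review as a Unix timestamp;
  • reviewTime: time of the review in raw text form.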
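
As a minimal sketch of how the fields above might be read, the snippet below parses one review record with Python's standard `json` module, turns `helpful` into a ratio, and converts `unixReviewTime` into a UTC date. The helper name `parse_review` and the output keys are hypothetical, and the sample line is a shortened copy of the record shown above; it assumes each line of the reviews file is one JSON object.

```python
import json
from datetime import datetime, timezone

# Shortened copy of the sample record above (assumption: one JSON object per line).
sample_line = (
    '{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", '
    '"helpful": [2, 3], "overall": 5.0, "unixReviewTime": 1252800000}'
)

def parse_review(line):
    """Hypothetical helper: keep only the fields used for behavior data."""
    record = json.loads(line)
    up_votes, total_votes = record["helpful"]
    review_date = datetime.fromtimestamp(record["unixReviewTime"], tz=timezone.utc)
    return {
        "user": record["reviewerID"],
        "item": record["asin"],
        "rating": record["overall"],
        # Guard against reviews that received no votes at all.
        "helpful_ratio": up_votes / total_votes if total_votes else None,
        "date": review_date.date().isoformat(),
    }

parsed = parse_review(sample_line)
print(parsed)
```

Keeping `unixReviewTime` rather than the `reviewTime` string is usually more convenient when behaviors need to be sorted chronologically.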

A sample from meta looks like this:

{
  "asin": "0000031852",
  "title": "Girls Ballet Tutu Zebra Hot Pink",
  "price": 3.17,
  "imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg",
  "related":
  {
    "also_bought": ["B00JHONN1S", "B002BZX8Z6", "B00D2K1M3O", "0000031909", ..., "B00E1YRI4C", "B008UBQZKU", "B00D103F8U", "B007R2RM8W"],
    "also_viewed": ["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", "B00D23MC6W", ..., "B00BFXLZ8M"],
    "bought_together": ["B002BZX8Z6"]
  },
  "salesRank": {"Toys & Games": 211836},
  "brand": "Coxlures",
  "categories": [["Sports & Outdoors", "Other Sports", "Dance"]]
}

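The fields are:

  • asin: item ID;
  • title: item name;
  • price: item price;
  • imUrl: URL of the item image;
  • related: related products (also bought, also viewed, bought together, buy after viewing);
  • salesRank: sales rank information;
  • brand: brand name;
  • categories: list of categories the item belongs to.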
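
The following sketch shows one way the metadata fields above could be reduced to an item-to-category mapping. It makes two assumptions that are not stated on this page: that each line of the meta file is a Python dict literal rather than strict JSON (hence `ast.literal_eval`; use `json.loads` if your copy is strict JSON), and that taking the last entry of the last category path is an acceptable single label. The helper name `parse_meta` is hypothetical, and the sample line is a shortened copy of the record above.

```python
import ast

# Shortened copy of the sample record above, written as a Python dict literal
# (assumption: the raw meta file uses single-quoted dict literals).
sample_line = (
    "{'asin': '0000031852', 'price': 3.17, "
    "'salesRank': {'Toys & Games': 211836}, "
    "'categories': [['Sports & Outdoors', 'Other Sports', 'Dance']]}"
)

def parse_meta(line):
    """Hypothetical helper: map an item to a single category label."""
    record = ast.literal_eval(line)
    # categories is a list of category paths; fall back to one empty path.
    category_paths = record.get("categories") or [[]]
    last_path = category_paths[-1]
    return {
        "item": record["asin"],
        "price": record.get("price"),  # price may be missing for some items
        "category": last_path[-1] if last_path else None,
    }

parsed_meta = parse_meta(sample_line)
print(parsed_meta)
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for reading lines of this kind.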

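For the detailed dataset processing steps, see the Zhihu article "2018阿里CTR预估模型---DIN(深度兴趣网络),后附TF2.0复现代码" (Alibaba's 2018 CTR prediction model, DIN (Deep Interest Network), with a TF2.0 reimplementation attached).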
