
# The Facebook scandal


This repository contains the code base for the Social Network Analysis course offered by the Master's degree in Computer Science at the University of Pisa.

## The case story

On Saturday, the 17th of March 2018, The New York Times and The Guardian / The Observer broke the story of how the consulting firm Cambridge Analytica harvested private information from the Facebook profiles of more than 50 million users without their permission, making it one of the largest data leaks in the social network’s history.

Cambridge Analytica described itself as a company providing consumer research, targeted advertising, and other data-related services to both political and corporate clients. The whistleblower Christopher Wylie, a data scientist and former director of research at Cambridge Analytica, revealed to the Observer how the firm used personal information, taken without authorisation in early 2014, to build a system that could profile individual US voters in order to target them with personalised political advertisements.

Christopher Wylie, who worked with a Cambridge University academic to obtain the data, told the Observer:

> We exploited Facebook to harvest millions of people’s profiles. And built models to exploit what we knew about them and target their inner demons. That was the basis the entire company was built on.

## The network

We considered a network composed of the authors of tweets about the case, published during the initial outbreak of the scandal. The data were collected via the Twitter API, and the network was built through the following consecutive steps:

1. Crawling of all the available tweets over a period of more than 15 days, starting from the 17th of March, containing at least one of the most popular hashtags about the case:
   - #cambridgeanalytica
   - #facebookgate
   - #deletefacebook
   - #zuckerberg
2. Cleaning of the crawled tweets, selecting and storing in a MongoDB database only the user information about the authors of the tweets, excluding retweets and mentions.
3. Selection of the outbreak time period by observing the temporal distribution of the tweets. The selected period spans 8 days, from the 17th to the 24th of March included (considering the Italian timezone).
4. Crawling of the following list of each selected author, extracting the following/follower relationships. A minimal sketch of this pipeline is given below.
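
The snippet below is a minimal sketch of such a crawling pipeline, assuming Tweepy for the Twitter API and PyMongo for storage. The credentials, database and collection names, and query details are illustrative and may differ from the actual code in this repository.

```python
import tweepy
from pymongo import MongoClient

# Illustrative credentials and names: replace with real API keys and the
# repository's actual database/collection names.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

db = MongoClient("mongodb://localhost:27017")["facebook_scandal"]

HASHTAGS = ["#cambridgeanalytica", "#facebookgate", "#deletefacebook", "#zuckerberg"]

# Steps 1-2: crawl tweets containing at least one of the hashtags, skip
# retweets, and store only the user information about the authors.
query = "(" + " OR ".join(HASHTAGS) + ") -filter:retweets since:2018-03-17"
for tweet in tweepy.Cursor(api.search, q=query).items():
    db.users.update_one(
        {"_id": tweet.user.id},
        {"$set": {"screen_name": tweet.user.screen_name},
         # keep the earliest observed tweet time, useful for the
         # time-period selection of step 3
         "$min": {"first_tweet_at": tweet.created_at}},
        upsert=True,
    )

# Step 4: for each selected author, crawl the list of accounts they follow
# and store the resulting following/follower edges.
for user in db.users.find():
    for followed_id in tweepy.Cursor(api.friends_ids, user_id=user["_id"]).items():
        db.edges.insert_one({"source": user["_id"], "target": followed_id})
```

The follower crawl is heavily rate-limited by Twitter, hence `wait_on_rate_limit=True`; in practice, the time-window filtering of step 3 would be applied to the stored timestamps before running this second crawl.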