This is a machine learning project that recommends friends by applying link prediction techniques to a graph.
- The data was obtained from Kaggle: https://www.kaggle.com/c/FacebookRecruiting.
- This is graph data.
- The data contains only pairs of nodes that have an edge between them (y=1).
- The dataset has approximately 1.86M nodes and 9.43M edges.
- The data was highly imbalanced, as only one class label (y=1) was present.
- For the other class label, i.e. links that are not present in the graph (y=0), we randomly sampled node pairs that have no edge between them.
- The resulting training and test data were exactly balanced.
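
The random sampling of missing links described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `sample_missing_edges`, the toy edge list, and the fixed seed are all assumptions made for the example.

```python
import random

def sample_missing_edges(edges, num_samples, seed=0):
    """Randomly draw directed node pairs (u, v) that are NOT edges in the graph."""
    rng = random.Random(seed)
    existing = set(edges)
    nodes = sorted({node for edge in edges for node in edge})
    # Conservative upper bound on how many distinct missing pairs exist.
    max_pairs = len(nodes) * (len(nodes) - 1) - len(existing)
    if num_samples > max_pairs:
        raise ValueError("not enough missing pairs to sample from")
    missing = set()
    while len(missing) < num_samples:
        u, v = rng.choice(nodes), rng.choice(nodes)
        # Skip self-loops and pairs that are already connected.
        if u != v and (u, v) not in existing:
            missing.add((u, v))
    return sorted(missing)

# Toy graph (hypothetical data): every listed pair is a positive example (y=1).
edges = [(1, 2), (2, 3), (3, 1), (3, 4)]

# Sample as many negatives (y=0) as positives so the classes are balanced.
negatives = sample_missing_edges(edges, num_samples=len(edges))
labelled = [(u, v, 1) for u, v in edges] + [(u, v, 0) for u, v in negatives]
```

Rejection sampling like this works well when the graph is sparse, as it is here (9.43M edges among roughly 1.86M nodes), because a random pair is almost always unconnected.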
Featurization is the most important part of this project. Features were built from the following measures:
1. Similarity measures
- Jaccard distance
- Cosine distance
2. Ranking measures
- PageRank
3. Graph features
- Shortest Path
- Checking for same community
- Adamic/Adar Index
- Is following back
- Katz Centrality
- HITS score
- Number of followers
- Number of followees
4. Weight features
- weight of incoming edges
- weight of outgoing edges
- weight of incoming edges + weight of outgoing edges
- weight of incoming edges * weight of outgoing edges
- 2*weight of incoming edges + weight of outgoing edges
- weight of incoming edges + 2*weight of outgoing edges
5. SVD features from the adjacency matrix (number of components: 6)
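
To make a few of the measures above concrete, here is a minimal pure-Python sketch for a single candidate edge `u -> v`. It is an illustration under stated assumptions rather than the project's implementation: the helper names, the toy graph, the choice of followee sets as the neighbourhoods, the use of follower count as the Adamic/Adar degree, and the `-1` value for unreachable nodes are all choices made for this example. The Jaccard and cosine functions return similarities; the corresponding distance is one minus that value.

```python
import math
from collections import deque

def build_graph(edges):
    """Return (succ, pred): followees and followers of every node."""
    succ, pred = {}, {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
        pred.setdefault(v, set()).add(u)
    return succ, pred

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two neighbour sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine(a, b):
    """|A ∩ B| / sqrt(|A| * |B|) for two neighbour sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def adamic_adar(succ, pred, u, v):
    """Sum of 1 / log(degree) over common followees (degree = follower count, an assumption)."""
    score = 0.0
    for w in succ.get(u, set()) & succ.get(v, set()):
        degree = len(pred.get(w, set()))
        if degree > 1:  # log(1) == 0 would divide by zero
            score += 1.0 / math.log(degree)
    return score

def shortest_path_length(succ, source, target):
    """Breadth-first search distance along directed edges, or -1 if unreachable."""
    if source == target:
        return 0
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in succ.get(node, set()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return -1

def pair_features(succ, pred, u, v):
    """Feature dictionary for the candidate edge u -> v."""
    fu, fv = succ.get(u, set()), succ.get(v, set())
    return {
        "jaccard_followees": jaccard(fu, fv),
        "cosine_followees": cosine(fu, fv),
        "adamic_adar": adamic_adar(succ, pred, u, v),
        "follows_back": int(u in fv),  # does v already follow u?
        "shortest_path": shortest_path_length(succ, u, v),
        "num_followers_u": len(pred.get(u, set())),
        "num_followees_u": len(fu),
    }

def weight_combinations(w_in, w_out):
    """The four combined weight features listed above, given the two edge weights."""
    return {
        "sum": w_in + w_out,
        "product": w_in * w_out,
        "2in_plus_out": 2 * w_in + w_out,
        "in_plus_2out": w_in + 2 * w_out,
    }

# Toy directed graph (hypothetical data).
edges = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 1), (4, 2)]
succ, pred = build_graph(edges)
features = pair_features(succ, pred, 1, 4)
```

In this toy graph, node 4 already follows node 1, so `follows_back` is 1 for the candidate edge `1 -> 4`, while `shortest_path` is -1 because no directed path leads from 1 to 4.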
The following models were trained:
- Random Forest
- XGBoost
The models were evaluated using:
- Confusion matrix
- F1 score
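
For reference, both evaluation metrics can be computed from binary labels as in the sketch below. The function names and the example labels are assumptions for illustration and are not taken from the project; the matrix layout (rows are true labels, columns are predicted labels) is also a choice made here.

```python
def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0/1."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return [[tn, fp], [fn, tp]]

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (y=1)."""
    (tn, fp), (fn, tp) = confusion_matrix(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: 1 = link exists, 0 = link does not exist.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

matrix = confusion_matrix(y_true, y_pred)
score = f1_score(y_true, y_pred)
```

Because the training and test sets are balanced, the F1 score here is not distorted by class imbalance, and the confusion matrix shows directly whether errors come from missed links or from falsely predicted ones.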
Key observations:
- Understanding the graph and feature engineering were the most important parts of this project.
- For Random Forest, Follow_back was the most important feature, followed by weight, inter_follower and shortest_path.
- For XGBoost, follows_back was the most important feature, followed by cosine_follower and weight_f1.
- XGBoost produced better results than Random Forest.