🗃️ Systems and Methods for Big and Unstructured Data

This is the repository for the projects of the Systems and Methods for Big and Unstructured Data (SMBUD) course held at Polimi.

The purpose of the project is to design and implement different databases that store bibliographic information on major journals, papers and authors, inspired by existing systems such as DBLP. Different aspects of the data are highlighted with different data management technologies:

Relational Database
Graph Database - Neo4j
Document Database - MongoDB
Computational Engine - Spark

For each technology, a database is designed, data are extracted from existing sources (or generated randomly for didactic purposes), and queries are written to practise the corresponding query language.

📚 A final report (and presentation) is available in which all the phases of the 3 projects are described in detail.

1) Graph Database

Neo4j is used to model, as a graph database, the main relationships between publications, authors, institutions and so on. The data is derived from the DBLP XML dump (several GB), and some records are generated randomly with Python scripts. The linked folder contains:

XML parser
Dataset obtained (fraction of the entire set)
Queries executed and Load Commands

DBLP database dump
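
The parsing step can be sketched as follows. This is a minimal illustration, not the parser shipped in the folder: the tag set, the field names and the two sample records are assumptions made up for the example. The real DBLP dump also relies on a DTD for character entities, which this sketch does not handle.

```python
import io
import xml.etree.ElementTree as ET

# Assumed subset of record tags to keep (DBLP uses element names like these).
PUBLICATION_TAGS = {"article", "inproceedings"}


def iter_publications(source):
    """Yield one dict per publication, reading the XML as a stream."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in PUBLICATION_TAGS:
            yield {
                "key": elem.get("key"),
                "type": elem.tag,
                "title": elem.findtext("title", default=""),
                "year": elem.findtext("year", default=""),
                "authors": [a.text for a in elem.findall("author")],
            }
            # Drop the children of the processed record to limit memory use.
            elem.clear()


# Invented sample data, shaped like a tiny DBLP file.
SAMPLE = b"""<dblp>
<article key="journals/x/Doe20"><author>Jane Doe</author><author>John Roe</author><title>A Title</title><year>2020</year></article>
<inproceedings key="conf/y/Roe21"><author>John Roe</author><title>Another Title</title><year>2021</year></inproceedings>
</dblp>"""

records = list(iter_publications(io.BytesIO(SAMPLE)))
for record in records:
    print(record["key"], record["year"], record["authors"])
```

Streaming with `iterparse` matters here because the full dump is several GB; each yielded dict could then be written to a CSV file for loading into Neo4j.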

2) Document Database

MongoDB is used to represent the main features and data of a collection of documents (citations, sections, figures). Records are uploaded as JSON: a Python script converts the data of the first project from CSV to JSON (the text of the documents is generated randomly). The linked folder contains:

Python script to convert data
Dataset (queries present in the report)
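
A conversion of this kind might look like the sketch below. The CSV layout (one row per paper and author), the column names, the word list and the single generated section are all hypothetical; the actual script and schema are described in the report.

```python
import csv
import io
import json
import random

# Hypothetical CSV layout: one row per (paper, author) pair.
SAMPLE_CSV = """paper_id,title,author
p1,Graph Stores,Jane Doe
p1,Graph Stores,John Roe
p2,Document Stores,John Roe
"""

WORDS = ["data", "graph", "query", "index", "model", "store"]


def random_text(rng, n_words=8):
    """Generate placeholder text for a section body."""
    return " ".join(rng.choice(WORDS) for _ in range(n_words))


def csv_to_documents(csv_file, seed=0):
    """Group flat CSV rows into one nested document per paper."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    docs = {}
    for row in csv.DictReader(csv_file):
        paper_id = row["paper_id"]
        if paper_id not in docs:
            docs[paper_id] = {
                "_id": paper_id,
                "title": row["title"],
                "authors": [],
                "sections": [
                    {"title": "Introduction", "text": random_text(rng)}
                ],
            }
        docs[paper_id]["authors"].append(row["author"])
    return list(docs.values())


documents = csv_to_documents(io.StringIO(SAMPLE_CSV))
json_text = json.dumps(documents, indent=2)
print(json_text)
```

Nesting authors and sections inside each paper reflects the document model: a single MongoDB document carries what would be several joined rows in the CSV.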

3) Spark

Spark is used to explore the main features of the dataset from the first project, exploiting the efficiency of the computational engine. The linked folder contains a Jupyter Notebook that takes the dataset as input, transforms it into an RDD and runs the queries on it.
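
As an illustration of the RDD workflow, the sketch below counts publications per year. It is not taken from the notebook: the CSV layout, the position of the year column and the function names are assumptions. The functions only use standard RDD operations (`map`, `reduceByKey`, `sortByKey`), so the block itself does not import PySpark.

```python
import csv
from operator import add


def parse_line(line):
    """Split one CSV line into fields, respecting quoted commas."""
    return next(csv.reader([line]))


def year_pair(fields, year_index=2):
    """Map a parsed row to a (year, 1) pair ready for counting."""
    return (fields[year_index], 1)


def publications_per_year(lines_rdd, year_index=2):
    """Count rows per year.

    lines_rdd is assumed to be an RDD of CSV lines without the header row.
    """
    return (
        lines_rdd
        .map(parse_line)
        .map(lambda fields: year_pair(fields, year_index))
        .reduceByKey(add)   # sum the 1s for each year
        .sortByKey()        # order the result by year
    )


# The per-row helpers can be checked without a Spark installation.
print(year_pair(parse_line('p1,"Graphs, Stores",2020')))
```

With PySpark installed, one way to run it would be to create a context with `sc = SparkContext("local[*]", "smbud")`, load the file with `sc.textFile(...)`, filter out the header line, and call `publications_per_year(...).collect()` on the result.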

✔️ Final Evaluation: 10.8/11

Repository contents:

1-GraphDB
2-DocumentalDB
3-Spark
README.md
final_presentation.pdf
final_report.pdf

GppCalcagno/SMBUD-Project