GitHub - gkontogiannhs/Cassandra-Python

Project Aim

This project aims to introduce the reader to technologies used for storing, retrieving and analyzing Big Data, which can come in the form of various data sources. The focus is on practical data management using open source tools, specifically Apache Cassandra. The retrieval process is done through Python Cassandra API, which allows for easy interaction with the Cassandra database.

Why Bother

Conventional solutions are often unable to effectively manage such data, but Big Data analysis systems can handle large, heterogeneous and non-uniform data from relational databases, retrieve large data sets from various systems, perform natural language processing, machine learning, artificial intelligence, and prediction based on unstructured data.

Getting Started

Data URL: https://www.kaggle.com/grouplens/movielens-20m-dataset
Install and Cassandra Documentation: https://cassandra.apache.org/doc/latest/cassandra/getting_started/installing.html
OR
Create your Database in a real cloud-based database service: https://www.datastax.com/products/datastax-astra
AstraDB & DataStax driver documentation: https://docs.astra.datastax.com/en/landing_page/doc/landing_page/apiDocs.html
CQL reference: https://docs.datastax.com/en/cql-oss/3.3/cql/cqlIntro.html

Cassandra Data Modelling

Introduction to Cassandra Data Modelling: https://cassandra.apache.org/doc/latest/cassandra/data_modeling/intro.html
A. Chebotko, A. Kashlev and S. Lu, "A Big Data Modeling Methodology for Apache Cassandra," 2015 IEEE International Congress on Big Data, 2015, pp. 238-245, doi: 10.1109/BigDataCongress.2015.41.

When building a NoSQL database, it is important to start by considering the potential queries that a user might make. From there, the database tables can be constructed in a way that allows for efficient retrieval of relevant data. This approach differs from traditional relational databases, which require tables to be designed based on the schema and relationships between entities. In NoSQL databases, the focus is on optimizing the data for specific use cases and query patterns, rather than adhering to a pre-defined schema.

Let's now pretend that that the database we want to build supports a user-interface search and selection of movies to watch (e.g. like Netflix).

Possible Queries

Show the movies that are popular (have a good rating) within a period of time (e.g. the last 2 months) – this may constitute well the initial "suggestions" screen to the user
Show the he movie(s) that contain some keywords in the title
Show the movies based on category (genre) and receive them based on some classification (year or average rating)
Show the detailed information about a movie (category, average rating, top-k tags)
Show the top-n movies related to some tag

Type	Diagram
Entity Relationship
Application Workflow
Chebotko

DataBase Implementation

The structure of the database will be created by shell statements. The data insertion is done via python scripts cassandra_insert.ipynb using dsbulk, after data are modelled (denormalization etc.) based on the queries requirements.

DDL Statements

# Create an empty Cassandra keyspace
CREATE KEYSPACE <big_data>

# 1st Query DDL
CREATE TABLE big_data.popular_movies_by_date (
    year int,
    month int,
    day int,
    rating float,
    movieid int,
    genres text,
    title text,
    PRIMARY KEY (year, month, day, rating, movieid)
) WITH CLUSTERING ORDER BY (month ASC, day ASC, rating DESC, movieid ASC)

# 2nd Query DDL
CREATE TABLE big_data.movies_by_keyword (
    keyword text,
    rating float,
    movieid int,
    genres text,
    title text,
    PRIMARY KEY (keyword, rating, movieid)
) WITH CLUSTERING ORDER BY (rating DESC, movieid ASC)

# 3rd Query DDL
CREATE TABLE big_data.movies_by_genre (
    genre text,
    release_date text,
    avg_rating float,
    movieid int,
    title text,
    PRIMARY KEY (genre, release_date, avg_rating, movieid)
) WITH CLUSTERING ORDER BY (release_date ASC, avg_rating DESC, movieid ASC)

# 4th Query DDL
CREATE TABLE big_data.movie_info_by_title (
    title text,
    tag_count int,
    avg_rating float,
    movieid int,
    tag text,
    genres text static,
    PRIMARY KEY (title, tag_count, avg_rating, movieid, tag)
) WITH CLUSTERING ORDER BY (tag_count DESC, avg_rating DESC, movieid ASC, tag ASC

# 5th Query DDL
CREATE TABLE big_data.movies_by_tag (
    tag text,
    avg_rating float,
    movieid int,
    genres text,
    title text,
    PRIMARY KEY (tag, avg_rating, movieid)
) WITH CLUSTERING ORDER BY (avg_rating DESC, movieid ASC)

Queries to implement:

Show the first 30 movies with the highest average rating between 01/01/2015 and 15/01/2015
Show movies containing the word “star”
Show the movies from "adventure" category sorted by year production
Show all the details about Jumanji movie (category, average rating, top-5 tags)
Show the first 20 movies with the highest average rating for the tag "comedy".

In cassandra_retrieve.ipynb jupyter file, the client side is simulated by retrieving the results for the above queries via the Python - Cassandra API. Execution time per consistency level (ALL, QUORUM, ONE) was also recorded as an experiment, for both READ and WRITE.

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
snaps		snaps
LICENSE		LICENSE
README.md		README.md
cassandra_insert.ipynb		cassandra_insert.ipynb
cassandra_retrieve.ipynb		cassandra_retrieve.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Project Aim

Why Bother

Getting Started

OR

Cassandra Data Modelling

Possible Queries

DataBase Implementation

DDL Statements

Queries to implement:

About

Releases

Packages

Languages

License

gkontogiannhs/Cassandra-Python

Folders and files

Latest commit

History

Repository files navigation

Project Aim

Why Bother

Getting Started

OR

Cassandra Data Modelling

Possible Queries

DataBase Implementation

DDL Statements

Queries to implement:

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages