
This project aims to classify Netflix shows into distinct clusters using machine learning algorithms. The objective is to enhance user experience by providing personalized show recommendations based on users' preferences.


Datawithabhishek/Netflix-Movies-and-TV-Shows-Clustering


🔬 Unsupervised-Clustering-Analysis-on-Netflix-Movies-And-TV Shows.

--------------------------------------------------------------------------------------------

This repository contains the dataset and the research notebook, which explores various approaches to clustering the movies and TV shows in the Netflix dataset. The titles in the dataset were released between 1925 and 2021.

--------------------------------------------------------------------------------------------


Problem Statement

The problem at hand involves exploring the Netflix dataset to gain insights into the content available on the platform. The dataset provides information about movies and TV shows, their attributes, and their availability in different countries. By integrating this dataset with external sources such as IMDB ratings and Rotten Tomatoes, we can extract further valuable information.

The specific tasks to be performed in this project include:

  • Exploratory Data Analysis (EDA): Clean the data, unnest the nested Netflix content columns, handle null/missing values, and conduct a thorough analysis of the dataset to uncover trends, patterns, and correlations among different attributes.
  • Understanding Content Availability: Determine the types of content available in different countries and identify any variations or preferences.
  • Analyzing Netflix's Focus: Investigate whether Netflix has been increasingly focusing on TV shows rather than movies in recent years.
  • Clustering Similar Content: Utilize text-based features to cluster similar content, enabling the development of a content-based recommender system.
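
As a rough illustration of the country and TV-versus-movie tasks above, the following pandas sketch runs on a few made-up rows. The column names (`type`, `country`, `date_added`) and the date format are assumptions about the CSV, and the rows are toy data, not results from the project.

```python
import pandas as pd

# Toy rows standing in for the real CSV. Column names and the date format
# are assumptions; the values are invented for illustration only.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "country": ["United States, India", "India", None, "United States"],
    "date_added": ["January 1, 2018", "October 5, 2019",
                   "December 9, 2019", "March 3, 2020"],
})

# Unnest the comma-separated country column: one row per (title, country).
by_country = (
    df.dropna(subset=["country"])
      .assign(country=lambda d: d["country"].str.split(","))
      .explode("country")
      .assign(country=lambda d: d["country"].str.strip())
)

# Content type counts per country.
type_by_country = pd.crosstab(by_country["country"], by_country["type"])

# Share of TV shows among the titles added in each year.
year_added = pd.to_datetime(
    df["date_added"].str.strip(), format="%B %d, %Y"
).dt.year
tv_share_by_year = (df["type"] == "TV Show").groupby(year_added).mean()

print(type_by_country)
print(tv_share_by_year)
```

A rising `tv_share_by_year` on the real data would support the claim that Netflix is shifting toward TV shows. Titles with a missing country are dropped here, which is only one of several reasonable ways to handle them.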

💾 Project Files Description

This project includes:

  • Google Colab notebook
  • Project summary (with the GitHub repo link inside)
  • Presentation video
  • Input files:
    • NETFLIX MOVIES AND TV SHOW CLUSTERING.csv - Contains the records of movies and TV shows released from 1925 to 2021.
  • Data source:
    • Dataset - Taken from AlmaBetter

    🗺️ Dataset Description

    The dataset contains 7,787 records and 11 features, covering movies and TV shows released from 1925 to 2021. It includes two types of content: movies (71.1%) and TV shows (28.3%). The most frequent rating is TV-MA (mature audiences only), followed by ratings for teens and older. More titles were added in 2019 than in any other year, and October, December, and January are the months in which the most titles were added.

    🔎 Key Findings

    • Most of the content is movies (71.1%).
    • The largest number of acquired/created titles was added in 2019.
    • The months in which the most titles were added are October, December, and January.
    • The top genres where most of the movies and TV shows are listed are Dramas, International Movies, and Comedies.
    • The optimal number of clusters for the dataset is four.
    • The developed recommendation system is expected to improve the user experience and reduce subscriber churn for Netflix, which currently has over 220 million subscribers.
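
One common way to turn text features into content-based recommendations is cosine similarity over TF-IDF vectors. The sketch below assumes that approach; the README does not confirm how the project's recommender is built. The titles, texts, and the `recommend` helper are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical titles and text features (in the project these would come
# from attributes such as description, genre, cast, and director).
titles = ["Space Cops", "Galaxy Patrol", "Baking Duel", "Cake Masters"]
texts = [
    "space police action adventure galaxy",
    "galaxy patrol space action crew",
    "baking competition cake dessert chefs",
    "cake dessert baking contest chefs",
]

# Vectorize the text and compute pairwise cosine similarity.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(texts)
sim = cosine_similarity(X)


def recommend(title, top_n=2):
    """Return the top_n titles most similar to `title` (illustrative helper)."""
    idx = titles.index(title)
    scores = sim[idx].copy()
    scores[idx] = -1.0  # exclude the query title itself
    best = np.argsort(scores)[::-1][:top_n]
    return [titles[i] for i in best]


print(recommend("Space Cops", top_n=1))
```

On the full dataset a dense similarity matrix grows quadratically with the number of titles, so a nearest-neighbor index or per-cluster lookup may be preferable.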

    🛠️ Built with

    Python

    NumPy

    Pandas

    Matplotlib

    Seaborn

    scikit-learn

    SciPy

    Plotly

    Google Colab

    Project Summary

    The aim of this project is to analyze the Netflix dataset, which includes information on movies and TV shows available on the platform until 2019. With over 220 million subscribers, Netflix is the world's largest online streaming service provider. By analyzing and clustering the content, we can enhance the user experience through a personalized recommendation system, potentially reducing subscriber churn.

    The project follows a step-by-step process, as outlined below:

    • Handling Missing Values: Address any null or missing values present in the dataset.
    • Dealing with Nested Columns: Process nested columns such as director, cast, listed_in, and country to facilitate clear visualization and analysis.
    • Rating Binning: Categorize ratings into appropriate categories, including adult, children's, family-friendly, and not rated content.
    • Exploratory Data Analysis (EDA): Perform in-depth EDA on various attributes, uncovering valuable findings to aid in churn prevention.
    • Cluster Creation: Create clusters using attributes such as director, cast, country, genre, rating, and description. Tokenize, preprocess, and vectorize the attribute values using TF-IDF vectorizer.
    • Clustering Algorithms: Employ K-Means Clustering and Agglomerative Hierarchical Clustering algorithms to construct two distinct types of clusters. Determine the optimal number of clusters using methods like the Elbow method and Dendrogram.
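
The steps above can be sketched end to end on toy data. This is a minimal sketch, not the notebook's code: the rows are invented, the rating-to-category mapping and its labels are assumptions, and `n_clusters=2` is chosen only because the toy data has six rows.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented rows; column names follow the attributes listed above.
df = pd.DataFrame({
    "director": ["Ann Lee", None, "Raj Rao", "Raj Rao", None, "Ann Lee"],
    "cast": ["Kim Park", "Tom Hill", None, "Sara Khan", "Tom Hill", "Kim Park"],
    "country": ["United States", "Canada", "India", "India", "Canada", None],
    "listed_in": ["Dramas, Thrillers", "Kids TV", "Comedies",
                  "Dramas", "Kids TV, Comedies", "Thrillers"],
    "rating": ["TV-MA", "TV-Y", "PG-13", "R", "TV-G", None],
    "description": [
        "A detective hunts a killer in a dark city.",
        "Friendly animals learn to share and play.",
        "Two friends start a chaotic wedding business.",
        "A family feud turns violent in a small town.",
        "A puppy and a kitten go on silly adventures.",
        "A journalist uncovers a deadly conspiracy.",
    ],
})

# 1. Missing values: empty string for text columns, "NR" for rating.
text_cols = ["director", "cast", "country", "listed_in", "description"]
df[text_cols] = df[text_cols].fillna("")
df["rating"] = df["rating"].fillna("NR")

# 2. Rating binning (this mapping is an illustrative assumption).
rating_bins = {
    "TV-MA": "adult", "R": "adult", "NC-17": "adult",
    "TV-Y": "children", "TV-Y7": "children", "G": "children", "TV-G": "children",
    "PG": "family_friendly", "TV-PG": "family_friendly",
    "PG-13": "family_friendly", "TV-14": "family_friendly",
}
df["rating_group"] = df["rating"].map(rating_bins).fillna("not_rated")

# 3. Combine the attributes into one text field and vectorize with TF-IDF.
df["text"] = df[text_cols + ["rating_group"]].apply(
    lambda row: " ".join(row), axis=1
)
X = TfidfVectorizer(stop_words="english").fit_transform(df["text"])

# 4. Elbow method: inspect K-Means inertia over a range of k.
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 5)
}

# 5. Fit both algorithms (Agglomerative needs a dense array).
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
agg_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(
    X.toarray()
)

# 6. Linkage matrix; scipy.cluster.hierarchy.dendrogram(Z) would plot it.
Z = linkage(X.toarray(), method="ward")
```

Plotting `inertias` against k and looking for the bend gives the elbow estimate, while the dendrogram built from `Z` suggests a cut height for the hierarchical clusters.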
