SowmeshSharma0411/YADFS_BigDataProject


YaDFS: Yet Another Distributed File System

YaDFS (Yet Another Distributed File System) is inspired by the Google File System (GFS), a technology developed by Google to handle vast amounts of data efficiently and reliably. GFS was designed to support Google's intensive data-processing needs, such as crawling and indexing the web, by distributing storage across many machines while ensuring fault tolerance and high availability.

GFS

Key Features of GFS

  • GFS is a distributed file system tailored for handling large files and batch-processing workloads.
  • Every file is replicated across multiple machines, and many clients may read or write the same file concurrently.
  • Its architecture consists of a single master server and multiple chunkservers, where data is stored in 64 MB chunks.
  • The master server manages metadata, including file and chunk namespaces, chunk locations, and replica information.
  • To maintain high availability, GFS replicates data across multiple chunkservers and server racks, ensuring that data remains accessible even if individual machines or disks fail.
  • The system is optimized for large streaming reads and appends, since web crawling and indexing rely heavily on these operations.
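Because the chunk size is fixed at 64 MB, a GFS client can translate any byte offset in a file into a chunk index locally, and only then ask the master where that chunk lives. A minimal sketch of that translation (function name is illustrative):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB, the fixed GFS chunk size

def chunk_location(offset: int):
    """Translate a byte offset in a file into (chunk index, offset within chunk).

    A client performs this locally, then asks the master only for the
    chunkserver locations of that chunk index.
    """
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE
```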

GFS Architecture

(GFS architecture diagram)

Challenges and Evolution

Despite its robust design, GFS faced scalability challenges as Google’s data needs grew. The single master server became a bottleneck, and the system struggled with memory limitations and increased latency for user-facing applications. These limitations led to the development of successor technologies like Colossus, which introduced distributed metadata management using BigTable, enhancing scalability and performance.

YaDFS: Building on GFS Principles

YaDFS leverages the foundational principles of GFS. Developed using Python, Flask, MongoDB, and Docker, YaDFS facilitates efficient file operations, health monitoring, and metadata persistence. Docker containers ensure portability and scalability, making it easier to manage distributed files across various environments.

YaDFS Features

  1. Upload file (upload_file)
  2. Download file (get_file)
  3. Both upload_file and get_file are coordinated by the NameNode in YaDFS (unlike Hadoop/GFS, where the NameNode handles only metadata operations).
  4. Multithreaded NameNode that monitors the health of the DataNodes/chunkservers via a heartbeat mechanism.
  5. Metadata persistence ensured using MongoDB.
  6. File system commands supported: list_directories (ls), create_directory (mkdir), get_directory, delete_file, delete_folder, move_file, move_folder, copy_file.
  7. A custom CLI built in Python serves as the client-side interface, sending instructions such as create_directory, upload_file, and get_file to the NameNode.
  8. get_info: reports the distributed chunk organization.
  9. datanode_status: reports the status of all DataNodes in the system.
  10. Chunks are replicated (Replication Factor (rf) = 3) for high availability and faster, parallel chunk reads. During a file write, each chunk's replication and distribution happens on a separate thread.
  11. chunks and replication_chunks are the collections holding chunk-storage metadata. Chunks are stored linearly, one after the other regardless of file. Future improvement: a tree-like database storage structure for faster retrieval of chunk metadata.
  12. re_replicate: a manual way to re-replicate chunks when they are under-replicated, e.g. when multiple DataNodes fail and the get_file endpoint fails.
  13. delete_folder deletes recursively: all files within the folder, their metadata, and their chunks and replicated chunks across the DataNodes.
  14. All of this is dockerized. Any number of DataNodes can be spun up just by adding another service to docker-compose.
  15. Variable chunk size, determined by the "Number of chunks" parameter requested by the user.
  16. M chunks are mapped onto N DataNodes using a simple round-robin algorithm.
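Items 15 and 16 above can be sketched together: the chunk size falls out of the requested number of chunks, primaries are placed round-robin, and each chunk gets rf copies on consecutive nodes in the rotation. This is a sketch under assumptions (function and field names, and the replica-selection rule, are illustrative, and it assumes at least rf DataNodes), not YaDFS's actual code:

```python
import math

def plan_upload(file_size: int, num_chunks: int, datanodes: list, rf: int = 3):
    """Plan a file write: variable chunk size derived from the requested
    number of chunks, chunks assigned to DataNodes round-robin, and rf
    total copies of each chunk on distinct consecutive nodes."""
    chunk_size = math.ceil(file_size / num_chunks)
    n = len(datanodes)
    plan = []
    for i in range(num_chunks):
        primary = i % n  # simple round-robin placement
        targets = [(primary + j) % n for j in range(rf)]  # rf distinct nodes
        plan.append({
            "chunk": i,
            "size": min(chunk_size, file_size - i * chunk_size),
            "nodes": [datanodes[k] for k in targets],
        })
    return plan
```

For example, a 10-byte file split into 3 chunks over 4 DataNodes yields chunk sizes of 4, 4, and 2 bytes, with each chunk's three copies landing on a rotating window of nodes.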
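The heartbeat mechanism from item 4 can be illustrated as a NameNode-side daemon thread that marks any DataNode silent for too long as dead. The timeout, interval, and all names here are invented for illustration, not taken from the project:

```python
import threading
import time

TIMEOUT = 5.0  # seconds without a heartbeat before a node is declared dead

def check_health(last_seen, now=None):
    """Classify each DataNode as alive or dead from its last heartbeat time.

    last_seen maps node name -> timestamp of its most recent heartbeat
    (updated elsewhere by the heartbeat endpoint).
    """
    now = time.monotonic() if now is None else now
    return {node: ("alive" if now - t <= TIMEOUT else "dead")
            for node, t in last_seen.items()}

def start_monitor(last_seen, status, interval=2.0):
    """Run check_health periodically on a daemon thread, as a multithreaded
    NameNode would, writing results into the shared status dict."""
    def loop():
        while True:
            status.update(check_health(last_seen))
            time.sleep(interval)
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t
```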
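The linear layout from item 11 can be illustrated with plain dicts standing in for MongoDB documents (the field names are assumptions, not the project's actual schema): because chunks from all files sit in one flat collection, retrieving one file's chunks means filtering the whole collection, which is what the proposed tree-like structure would avoid.

```python
# Stand-in for the `chunks` collection: one document per stored chunk,
# for all files, in a single flat list (field names are hypothetical).
chunks = [
    {"file": "/Sowmesh_BigData/dummy2.txt", "chunk_id": 0, "datanode": "dn1"},
    {"file": "/Sowmesh_BigData/dummy2.txt", "chunk_id": 1, "datanode": "dn2"},
    {"file": "/other/file.txt", "chunk_id": 0, "datanode": "dn3"},
]

def chunks_for(path):
    """Linear scan over every chunk document, ordered by chunk_id --
    the dict equivalent of a find() + sort over the whole collection."""
    return sorted((c for c in chunks if c["file"] == path),
                  key=lambda c: c["chunk_id"])
```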

YaDFS Architecture

(YaDFS architecture diagram)

Screenshots

Screenshot 1
NameNode and DataNodes as Docker Containers
Screenshot 2
DataNode Health Check by NN Thread
Screenshot 3
mkdir and Other File System Ops
Screenshot 4
mongosh logs
Screenshot 5
Another dir created
Screenshot 6
DataNode status when all nodes are alive vs. when some are dead
Screenshot 7
upload_file
Screenshot 8
get_info after upload_file
Screenshot 9
get_file
Screenshot 10
New file uploaded
Screenshot 11
Recursive File and Chunk Deletion
Screenshot 12
2 DataNodes are down
Screenshot 13
Fault Tolerant Download despite 2 DNs being down
Screenshot 14
dummy2.txt uploaded to /Sowmesh_BigData
Screenshot 15
Recursive Folder Deletion: Including deleting all files within it and all their chunks from all DNs

https://medium.com/@dhammikasamankumara/what-is-hadoop-distributed-file-system-hdfs-36a3503f9c60
