This project is a real-time news aggregator for Bulgarian news sources, with a focus on implementing various data engineering skills.
Here's where it gets interesting from a data engineering perspective:
- ETL Pipeline: A full Extract, Transform, Load pipeline built with Python, SQL, and PySpark. Data is extracted from web sources, transformed, and loaded into the database.
- Big Data Processing: PySpark handles the data processing. While this may be overkill for the current data volume, it means the project can easily scale and handle much larger batches of data in the future.
- Data Quality: Data is cleaned with PySpark by removing special characters, normalizing source names, and standardizing timestamps.
- Scalability: The architecture is designed to scale: we can add considerably more news sources or increase the scraping frequency, and the PySpark processing can handle the load.
- Automation: The app automatically fetches news from all sources once every X minutes through the schedule Python library (X is currently 10, but can be lowered to get even closer to real-time).
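The cleaning rules from the Data Quality point above are applied as PySpark transformations in the project itself; here is a plain-Python sketch of the same three rules so they are easy to test in isolation. The function names, the source mapping, and the input timestamp format are illustrative, not taken from the repo:

```python
import re
from datetime import datetime, timezone

def remove_special_characters(text: str) -> str:
    # Keep letters (including Cyrillic), digits, whitespace and basic punctuation.
    return re.sub(r"[^\w\s.,!?\-]", "", text, flags=re.UNICODE).strip()

def normalize_source(source: str) -> str:
    # Map raw source strings to one canonical name per site (hypothetical mapping).
    canonical = {"24chasa.bg": "24chasa", "www.dnevnik.bg": "Dnevnik", "fakti.bg": "Fakti"}
    return canonical.get(source.lower(), source)

def standardize_timestamp(raw: str, fmt: str = "%d.%m.%Y %H:%M") -> str:
    # Parse a site-local timestamp format and emit ISO 8601 in UTC.
    return datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc).isoformat()
```

In the real pipeline these would map onto PySpark column expressions (e.g. regexp_replace) or UDFs so they run distributed over the whole batch.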
- Aggregates news from multiple Bulgarian news sources including 24chasa, Dnevnik, and Fakti
- Provides direct links to original news articles for further reading
- Automatically updates the news database at regular intervals to ensure fresh content
- Offers a simple and intuitive user interface for browsing the latest news
- Clone the repository:
git clone https://github.com/dvelkow/real_time_bulgarian_news_aggregator.git
cd real_time_bulgarian_news_aggregator/backend
- Install dependencies:
pip install -r requirements.txt
- Create a .env file in the main directory and add your MySQL credentials. You need:
DB_HOST=localhost
DB_NAME=db_news (the database needs to be created in MySQL first)
DB_USER=root
DB_PASSWORD=your_mysql_password
- Run the Flask application:
python app.py
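Inside app.py, the .env values above would be read from the environment before connecting to MySQL. A minimal sketch of that step — the function name and the defaults are illustrative, only the variable names come from the README:

```python
import os

def load_db_config() -> dict:
    """Read MySQL settings from the environment (a .env loader such as
    python-dotenv would populate these first). Defaults mirror the README."""
    return {
        "host": os.environ.get("DB_HOST", "localhost"),
        "database": os.environ.get("DB_NAME", "db_news"),
        "user": os.environ.get("DB_USER", "root"),
        "password": os.environ["DB_PASSWORD"],  # required; fail loudly if missing
    }
```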
Some of the hurdles I stumbled upon while building it:
- Web scraping is rather finicky. Every news site has a different HTML structure, so the fetch function had to be rewritten for each site, and some were trickier than others.
- Balancing real-time aggregation against flooding the site with filler, since sources sometimes post several small news items one after another.
- Setting up PySpark locally proved to be... painful.
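The per-site fetch problem above usually ends up as one parse rule per source plus a dispatch table. A stdlib-only sketch of that pattern — the tag/class selectors here are made up for illustration, not the sites' real markup:

```python
from html.parser import HTMLParser

class HeadlineExtractor(HTMLParser):
    """Collects the text of elements matching a given tag and class attribute."""
    def __init__(self, tag: str, headline_class: str):
        super().__init__()
        self.tag, self.headline_class = tag, headline_class
        self._inside = False
        self.headlines: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and dict(attrs).get("class") == self.headline_class:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == self.tag:  # sketch: assumes headline tags are not nested
            self._inside = False

    def handle_data(self, data):
        if self._inside and data.strip():
            self.headlines.append(data.strip())

# One rule per source, because each site's HTML differs (selectors are hypothetical).
SITE_RULES = {
    "24chasa": ("h2", "news-title"),
    "dnevnik": ("h3", "article-title"),
    "fakti":   ("a",  "headline"),
}

def extract_headlines(source: str, html: str) -> list[str]:
    tag, cls = SITE_RULES[source]
    parser = HeadlineExtractor(tag, cls)
    parser.feed(html)
    return parser.headlines
```

Adding a new source then only means adding an entry (or a dedicated parser) to the table, rather than touching the shared pipeline.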
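For the spam-balancing hurdle, one simple policy is to cap how many items a single source can contribute per hour. A sketch of that idea — the cap value and field names are illustrative, not the project's actual policy:

```python
from collections import Counter

def throttle_per_source(articles: list[dict], max_per_hour: int = 3) -> list[dict]:
    """articles: dicts with 'source', 'published_at' (datetime) and 'title'.
    Keeps at most `max_per_hour` items per source per calendar hour,
    preserving input order (assumed newest-relevant-first)."""
    seen = Counter()
    kept = []
    for a in articles:
        hour = a["published_at"].replace(minute=0, second=0, microsecond=0)
        key = (a["source"], hour)
        if seen[key] < max_per_hour:
            seen[key] += 1
            kept.append(a)
    return kept
```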
There's always room for more:
- Adding tags derived from news headlines and grouping articles into topics on the site
- Adding an option to see trending news, which could be done by taking each site's average views per hour and marking an article as trending when it substantially beats that average
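The trending heuristic above reduces to a couple of small functions. A sketch, where the 2x threshold factor is a made-up default rather than a tuned value:

```python
def site_average(views_per_hour: list[float]) -> float:
    """Average views/hour across a site's recent articles."""
    return sum(views_per_hour) / len(views_per_hour) if views_per_hour else 0.0

def is_trending(article_views_per_hour: float,
                site_avg_views_per_hour: float,
                factor: float = 2.0) -> bool:
    """An article counts as trending when its views/hour beats the
    site average by `factor` (illustrative threshold)."""
    return article_views_per_hour > factor * site_avg_views_per_hour
```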
I might work on these features in the future, but since they are not closely related to data engineering, I won't be implementing them now; my focus is elsewhere.
This project has been a great way to get hands-on experience with data engineering tools and techniques I had only read or watched about. It's one thing to read about these concepts, but actually implementing them is where the real learning happened.