A comprehensive script for web scraping and NLP analysis, providing detailed insights from extracted articles.

Onaga08/scrape-and-sense

Scrape & Sense

Description

Scrape & Sense is a web scraping and NLP project that analyzes sentiment and readability metrics of articles extracted from websites. Using Natural Language Processing techniques, it computes Positive Score, Negative Score, Polarity Score, Subjectivity Score, and readability metrics like Average Sentence Length, Percentage of Complex Words, and Fog Index.
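The sentiment scores named above can be illustrated with a minimal sketch. The formulas here are commonly used dictionary-based definitions and are an assumption, not necessarily the ones this project implements (those are documented in ANALYSIS_DETAILS.md). The function name `sentiment_scores`, the tokenizer, and the small smoothing constant are all hypothetical.

```python
import re


def sentiment_scores(text, positive_words, negative_words):
    """Return dictionary-based sentiment scores for a piece of text."""
    # Lowercase word tokens; apostrophes are kept so "don't" stays one token.
    tokens = re.findall(r"[a-z']+", text.lower())

    positive = sum(1 for token in tokens if token in positive_words)
    negative = sum(1 for token in tokens if token in negative_words)

    # Assumed smoothing term that avoids division by zero on empty input.
    eps = 1e-6
    polarity = (positive - negative) / (positive + negative + eps)
    subjectivity = (positive + negative) / (len(tokens) + eps)

    return {
        "positive_score": positive,
        "negative_score": negative,
        "polarity_score": polarity,
        "subjectivity_score": subjectivity,
    }


# Tiny illustrative word lists; the real ones would come from the Dict/ folder.
scores = sentiment_scores(
    "The launch was a great success despite one bad delay.",
    positive_words={"great", "success"},
    negative_words={"bad", "delay"},
)
print(scores)
```

With these definitions, polarity ranges from roughly -1 (all matched words negative) to +1 (all positive), and subjectivity is the share of tokens that appear in either word list.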

The workflow consists of scripts for data extraction, stop-word removal, and analysis, with setup instructions detailed enough to reproduce the results. Results are written to a CSV file (output.csv) for easy interpretation and further analysis.

Getting Started

Installation

Navigate to the folder where you want to keep the project, then run the following command in your terminal:

git clone https://github.com/Onaga08/scrape-and-sense.git

This repository uses several Python libraries and dependencies. Install all requirements through the command below.

pip install -r requirements.txt

Usage

This project has two broad functions:

  1. Web Scraping Using BeautifulSoup
  2. NLP Analysis

The runnable scripts, along with the required input and expected output of each Python file, are explained in instructions.md.
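As a rough illustration of the scraping step, the sketch below pulls a title and body text out of an HTML string with BeautifulSoup. It assumes the title sits in an `<h1>` tag and the body in `<p>` tags; real pages usually need site-specific selectors, and downloading the page is left out. The function name `extract_article` and the sample HTML are hypothetical and not taken from the repository.

```python
from bs4 import BeautifulSoup


def extract_article(html):
    """Return (title, body_text) extracted from an HTML document."""
    # "html.parser" is Python's built-in parser, so no extra backend is needed.
    soup = BeautifulSoup(html, "html.parser")

    # Assumption: the article title is the first <h1> on the page.
    heading = soup.find("h1")
    title = " ".join(heading.get_text().split()) if heading else ""

    # Assumption: every <p> tag belongs to the article body.
    paragraphs = [" ".join(p.get_text().split()) for p in soup.find_all("p")]
    body = "\n".join(p for p in paragraphs if p)
    return title, body


sample_html = """
<html><body>
  <h1>Sample Article</h1>
  <p>First paragraph.</p>
  <p>Second <b>paragraph</b>.</p>
</body></html>
"""

title, body = extract_article(sample_html)
print(title)
print(body)
```

Splitting and re-joining the text normalizes whitespace, so inline tags such as `<b>` do not leave stray spaces or line breaks in the saved article.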

Pre-Requisite Directories/Files

  1. Input.xlsx - Contains links to articles hosted on the BlackCoffer website.
  2. Dict/ - Contains txt files of positive and negative words used in the sentiment analysis.
  3. Stop Words/ - Contains txt files for three different types of stop words.
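A minimal sketch of how word lists like these might be loaded and applied is shown below. It assumes one word per line in each txt file, which may not match the actual file layout. The helper names `load_word_set` and `remove_stop_words` are hypothetical.

```python
from pathlib import Path


def load_word_set(directory):
    """Collect the lowercased words from every .txt file in a directory."""
    words = set()
    for path in sorted(Path(directory).glob("*.txt")):
        # errors="ignore" is an assumption to tolerate mixed file encodings.
        text = path.read_text(encoding="utf-8", errors="ignore")
        for line in text.splitlines():
            word = line.strip().lower()
            if word:  # skip blank lines
                words.add(word)
    return words


def remove_stop_words(text, stop_words):
    """Drop whitespace-separated words whose cleaned form is a stop word."""
    kept = [
        word
        for word in text.split()
        # Strip surrounding punctuation before the lookup, keep the original word.
        if word.lower().strip(".,;:!?\"'()") not in stop_words
    ]
    return " ".join(kept)


cleaned = remove_stop_words(
    "The results of the analysis are clear.",
    stop_words={"the", "of", "are"},
)
print(cleaned)
```

In a real run the stop-word set would come from a call such as `load_word_set("Stop Words")` instead of the inline set used here.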

Project Workflow

graph TD
    A[Input.csv] -->|main.py| B[text_files directory]
    B -->|check.py| C{Articles with < 3 lines?}
    C -->|Yes| D[Error in data extraction]
    D -->A
    C -->|No| E[text_files directory]
    E -->|rmv_StopWords.py| F[updated_text_files directory]
    F -->|Analysis.py + output.py| G[output.csv]
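The validation step in the diagram (flagging articles with fewer than 3 lines) could look roughly like the sketch below. It assumes each article is stored as a .txt file and that only non-empty lines count; the actual logic in check.py may differ. The function name `find_short_articles` is hypothetical.

```python
from pathlib import Path


def find_short_articles(directory, min_lines=3):
    """Return the names of .txt files with fewer than min_lines non-empty lines."""
    short = []
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        # Assumption: blank lines do not count toward the article length.
        non_empty = [line for line in text.splitlines() if line.strip()]
        if len(non_empty) < min_lines:
            short.append(path.name)
    return short
```

Calling `find_short_articles("text_files")` would list the files whose extraction likely failed, so their links can be scraped again.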

Detailed Analysis

For detailed information on the formulas and logic used in the NLP analysis, please refer to the ANALYSIS_DETAILS.md file.
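For orientation only, the readability metrics can be sketched as follows. This uses the standard Gunning Fog formula, 0.4 * (average sentence length + percentage of complex words), and treats a word as complex when it has more than two syllables. Both choices, and the vowel-group syllable heuristic, are assumptions and may differ from the definitions in ANALYSIS_DETAILS.md. The function names are hypothetical.

```python
import re


def count_syllables(word):
    """Approximate syllables by counting vowel groups (a rough heuristic)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Heuristic: a trailing "es" or "ed" is usually silent.
    if word.endswith(("es", "ed")) and count > 1:
        count -= 1
    return max(count, 1)


def readability_metrics(text):
    """Return average sentence length, percentage of complex words, Fog Index."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return {"avg_sentence_length": 0.0, "pct_complex_words": 0.0, "fog_index": 0.0}

    avg_sentence_length = len(words) / len(sentences)
    # Assumption: "complex" means more than two syllables.
    complex_count = sum(1 for w in words if count_syllables(w) > 2)
    pct_complex_words = 100 * complex_count / len(words)
    fog_index = 0.4 * (avg_sentence_length + pct_complex_words)

    return {
        "avg_sentence_length": avg_sentence_length,
        "pct_complex_words": pct_complex_words,
        "fog_index": fog_index,
    }


metrics = readability_metrics("Readability matters. Short words help.")
print(metrics)
```

The syllable counter is deliberately simple. A pronunciation dictionary would be more accurate, at the cost of an extra dependency.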

The license is included in the repository.
