Skip to content

rezansrv/Natural-language-processing-NLP-

Repository files navigation

Multilingual Text Processing in Python

Overview

This Python project focuses on efficient text processing for both English and Persian documents. The primary operations include tokenization, stop word removal, parsing, and text cleaning. The goal is to enhance the readability, analysis, and information extraction from diverse textual content.

Operations

Tokenization

The project employs robust tokenization techniques to break down the input text into meaningful units, such as words or phrases. Tokenization facilitates subsequent analysis and processing steps.

Delete Stop Words

Stop words, common words that contribute little to the overall meaning, are identified and removed from the text. This step enhances the relevance of the remaining words in the document.

Parse

The parsing phase involves analyzing the grammatical structure, sentence composition, and word dependencies within the text. This operation provides valuable insights into the linguistic aspects of the document.

Clean

The text cleaning process ensures the removal of unnecessary characters, punctuation, and artifacts introduced during previous operations. This final step aims to produce a refined and standardized version of the document.

Language Support

The project seamlessly handles both English and Persian documents, making it versatile for multilingual text processing tasks.

Applications

  • Information retrieval
  • Document summarization
  • Sentiment analysis
  • Language-agnostic text analysis

Note

This project is designed to be adaptable, allowing users to configure and customize parameters based on specific requirements. The modular structure facilitates easy integration into various natural language processing pipelines.

Setting Up the Project

To clone and run this project, follow these steps:

1. Clone the Repository:

git clone https://github.com/rezansrv/Natural-language-processing-NLP-.git
cd Natural-language-processing-NLP-

2. Install Dependencies:

Ensure you have Python installed. Then, install the required packages using:

pip install nltk beautifulsoup4

3. Download NLTK Resources:

Run the following Python script to download the necessary NLTK resources:

python -c "import nltk; nltk.download('punkt')"

4. Organize Input Files:

Place your input documents (e.g., Eng.txt, Per.docx) in the 'inputs' directory.

5. Run the Project:

Execute the provided Python script to clean, parse, and tokenize the documents:

python Clean.py

Follow the on-screen instructions to enter the input file name.

6. Retrieve Output:

The cleaned text will be saved in the 'outputs' directory. Check the 'cleaned_text.txt' file for the processed result.

This project follows a modular approach, with each script (Clean.py, Parse.py, etc.) working sequentially. Users can customize and configure parameters based on specific requirements.

Releases

No releases published

Packages

No packages published

Languages