Skip to content

Search engine that collects and indexes documents from a list of URLs so that they can be searched. Machine learning portion predicts topic of given URLs.

License

Notifications You must be signed in to change notification settings

mr-rjh3/python-search-engine

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Python Search Engine

Purpose

Collects documents from a curated list of URLs to index and search querys on. Each URL pertains to a specific topic and by training a machine learning algorithm on the collected documents the program can predict the topic of a given URL.

Starting

Before starting, you need to have Git and Python installed.

# Clone the Repository

$ git clone https://github.com/mr-rjh3/python-search-engine

# Access the project directory

$ cd python-search-engine

# Install the requirements

$ pip install -r requirements.txt

# Run searchengine.py

$ python ./searchengine.py

Usage

Once started the program will offer 6 options:

Select an option: 

        1 - Collect new documents.
        2 - Index documents.      
        3 - Search for a query.   
        4 - Train ML classifier.  
        5 - Predict a link.       
        6 - Quit.

Enter your choice: 

Collect new documents

  • Collect new documents will scrape the URLs listed in 'sources.txt' and store the text content of that page with stopwords removed in it's given topic directory. It will continue this for all outgoing links on the page that stay in it's root domain to a maximum depth of 2.

Index documents

  • Index documents will create an inverted index of the collected documents from the previous option.

Search for a query

  • Search for a query will take in a query as input and search for the documents with the greatest cosine similarity using the inverted index.

Train ML classifier

  • Train ML classifier will train an SVM machine learning classifier on the gathered documents using their topics as the label. The scores and confusion matrix will be shown to the user once training is complete.

Predict a link

  • Predict a link will take in a URL as user input and using the classifier trained in the previous option, predict the topic of the given link along with the confidence of the prediction.

Quit

  • Quit will simply exit the program.

License

This project is under license from MIT. For more details, see the LICENSE file.

About

Search engine that collects and indexes documents from a list of URLs so that they can be searched. Machine learning portion predicts topic of given URLs.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages