Navnedia/Building-A-Search-Engine

Building a Search Engine

Warning: work in progress!

A basic search engine that indexes a corpus so you can search it and rank the matching documents. Built in Python using object-oriented programming principles to keep the project extensible and maintainable.

Features:

  • Inverted Index - maps each term to the documents that contain it, to improve search times.
  • Results Ranking - uses term frequency–inverse document frequency (TF-IDF) to order results by relevance.
  • Query Expansion - automatically adds query terms (such as synonyms) to improve result relevance (see my testing analysis).
  • Result Evaluation - tests and compares results against human-evaluated relevance scores to gauge performance.
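
To make the first two features concrete, here is a minimal sketch of an inverted index with TF-IDF ranking. It is an illustration only, not this project's actual code: the names `tokenize`, `build_index`, and `search` are hypothetical, the tokenizer is deliberately naive, and the scoring uses raw term frequency with `log(N / df)` as the IDF, which is just one common variant.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive normalizer)."""
    return text.lower().split()


def build_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(query, index, num_docs):
    """Score documents by summed TF-IDF over the query terms, best first."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur anywhere in the corpus
        # Rarer terms get a higher weight; a term in every document gets 0.
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy corpus, keyed by document ID.
docs = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat",
    3: "dogs and cats make good pets",
}
index = build_index(docs)
results = search("cat mat", index, len(docs))
print(results)  # document 1 ranks above document 2; document 3 does not match
```

Because a lookup only touches the postings of the query terms, search cost depends on how many documents contain those terms rather than on the size of the whole corpus. Note that document 3 is missed because "cats" is not normalized to "cat", which is the kind of gap that better query normalization or query expansion would close.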

This started out as a course project, and I'm now building it out further with more features. I'm also planning a front-end web interface so I can demo the project better.

ToDo:

  • Split up files and organize into packages.
  • Write Documentation!
  • Finish implementing stop words functionality.
  • Build a front-end web interface to demo the project.
  • Result snippet generation.
  • Implement advanced search operators (OR, NOT).
  • Improve query normalization.
  • Ranking improvements.
  • Add caching and on-demand loading to improve memory efficiency.

I hope to write more comprehensive documentation for this project in the near future.

Stay tuned :)