PDF-Querying-using-TF-IDF-from-Scratch

Given a set of PDFs and a query, this project finds the most relevant PDF using TF-IDF. The TF-IDF computation is written from scratch, without any TF-IDF library.

Explanation

The code uses only the pdfminer and glob libraries, to read the PDFs and to traverse a directory for PDF files. TF-IDF is computed manually, without using any library. To understand the code, please read its comments.
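As a rough illustration of what computing TF-IDF by hand involves, here is a minimal sketch. It is not the repository's code: the function names are hypothetical, and the weighting variant (relative term frequency multiplied by log(N / df)) and the scoring rule (summing the weights of matching query words) are assumptions, since the README does not say which ones the scripts use. Only the standard-library math and collections modules are needed.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """Relative frequency of each word within one document."""
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def inverse_document_frequency(documents):
    """log(N / df) per word, where df is the number of documents containing it."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def tfidf_vectors(documents):
    """One {word: weight} dictionary per document."""
    idf = inverse_document_frequency(documents)
    return [
        {word: tf * idf[word] for word, tf in term_frequency(tokens).items()}
        for tokens in documents
    ]


def score(query_tokens, vector):
    """Assumed scoring rule: sum the weights of query words found in the document."""
    return sum(vector.get(word, 0.0) for word in set(query_tokens))


# Toy documents, already split into words (illustrative data only).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stock prices fell sharply".split(),
]
vectors = tfidf_vectors(docs)
scores = [score("dog cat".split(), v) for v in vectors]
best = max(range(len(docs)), key=scores.__getitem__)  # index of the top document
```

In this toy run the second document ranks first, because it contains both query words and "dog" appears in no other document, which gives it the largest IDF weight.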

PDF Files

A sample folder with a few PDFs is included for trying out the code.

PDF_querying.py

Reads the PDF files using the pdfminer library
Extracts the words from each PDF
Takes the query as input from the user
Computes TF-IDF for the PDFs and the query
Ranks the PDFs that share words with the query
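The first two steps above (reading the PDFs and extracting their words) might look roughly like the sketch below. It assumes the pdfminer.six distribution, whose `pdfminer.high_level.extract_text` function returns the text of a PDF as a string; the script itself may use a different pdfminer interface. The helper names `extract_words` and `read_pdf_words` and the tokenization rule (lowercase runs of letters and digits) are hypothetical.

```python
import glob
import os
import re


def extract_words(text):
    """Lowercase the text and return its runs of letters and digits as words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def read_pdf_words(folder):
    """Map each PDF path directly inside `folder` to the list of words it contains."""
    # Imported here so extract_words() can be used without pdfminer installed.
    from pdfminer.high_level import extract_text  # assumes pdfminer.six

    words_by_file = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.pdf"))):
        words_by_file[path] = extract_words(extract_text(path))
    return words_by_file
```

Calling `read_pdf_words("path/to/pdfs")` (a placeholder path) would return one word list per PDF, ready to be weighted and ranked against the words of the query.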

text querying.py

The text of the documents is supplied directly as strings.
The rest of the process is the same as in PDF_querying.py.
