Given a set of PDFs and a query, this code finds the most relevant PDF using TF-IDF.
It uses only the pdfminer and glob libraries, to read the PDFs and to traverse a directory for PDF files. The TF-IDF computation is implemented manually, without any library. To understand the code, please read the comments in it.
A sample folder with a few PDFs is included for trying out the code.
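
As a rough illustration of the reading step, the sketch below collects the text of every PDF in a folder. It is not the repository's own code: the function name `load_pdf_texts` and its `extract` parameter are hypothetical, and it assumes the pdfminer.six package, whose `pdfminer.high_level.extract_text` helper returns a PDF's text as a string.

```python
import glob
import os


def load_pdf_texts(folder, extract=None):
    """Return {file name: extracted text} for every PDF directly inside folder.

    extract is a hypothetical hook: any callable that takes a path and
    returns a string. It defaults to pdfminer's high-level extractor.
    """
    if extract is None:
        # Assumption: pdfminer.six is installed.
        from pdfminer.high_level import extract_text as extract

    texts = {}
    # glob matches only files ending in .pdf; sorting keeps the order stable.
    for path in sorted(glob.glob(os.path.join(folder, "*.pdf"))):
        texts[os.path.basename(path)] = extract(path)
    return texts
```

Calling `load_pdf_texts("sample")` would then return one string per PDF, assuming the sample folder is named `sample`.
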
- Reads the PDF files using the pdfminer library
- Extracts the words from each PDF
- Takes the query as input from the user
- Computes TF-IDF for the PDFs and the query
- Ranks the PDFs that share words with the query
- The text from the documents is initially taken as a string
- The rest of the process is the same as in the other code
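
The scoring and ranking steps can be sketched with the standard library alone. This is a simplified illustration, not the repository's implementation: the names `tokenize` and `rank_documents` are hypothetical, the inverse document frequency uses a smoothed form, log(N / df) + 1, chosen here as an assumption so that a word found in every document still counts, and each document is scored by summing the TF-IDF weights of the query words it contains.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_documents(docs, query):
    """Rank documents by the summed TF-IDF weight of the query words.

    docs maps a document name to its raw text. Only documents that
    contain at least one query word appear in the result.
    """
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)

    # Document frequency: number of documents that contain each word.
    df = Counter()
    for words in tokens.values():
        df.update(set(words))

    query_words = set(tokenize(query))
    scores = {}
    for name, words in tokens.items():
        counts = Counter(words)
        score = 0.0
        for word in query_words:
            if counts[word] == 0:
                continue
            tf = counts[word] / len(words)           # term frequency
            idf = math.log(n_docs / df[word]) + 1.0  # smoothed IDF (assumption)
            score += tf * idf
        if score > 0:
            scores[name] = score

    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up documents standing in for text extracted from PDFs.
docs = {
    "a.pdf": "cats chase mice",
    "b.pdf": "dogs chase cats and cats",
    "c.pdf": "birds sing",
}
for name, score in rank_documents(docs, "cats"):
    print(name, round(score, 3))
```

Summing weights keeps the sketch short. A fuller version could also build a TF-IDF vector for the query and compare it with each document vector, for example by cosine similarity.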