A Node.js based REST PDF Text Extraction API using pdf-parse.

pdf-extract-api-digitalocean

Version 0.0.1 | License: MIT | Built with Node.js

This project implements a simulated Optical Character Recognition (OCR) service that extracts text from PDF files uploaded by users. It does not perform image-based recognition: the text is read from the PDF with pdf-parse. The application is built with Node.js, Express, Multer, and pdf-parse, and is designed to be easy to set up and to integrate into other systems that need PDF text extraction.

Features

  • PDF Text Extraction: Allows users to upload PDF files and extracts readable text from them.
  • File Upload Management: Utilizes Multer for efficient handling of file uploads with customizable storage options.
  • Error Handling: Robust error management to ensure stability and provide meaningful error messages to the client.

Dependencies

  • Node.js: The script runs in a Node.js environment.
  • express: Web framework for Node.js.
  • multer: Middleware for handling multipart/form-data, used for uploading files.
  • pdf-parse: Library to parse and extract text from PDF files.
  • fs.promises: Promise-based API of the Node.js File System module, used for file operations.
  • path: Utilities for handling and transforming file paths.
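
To illustrate how these pieces might fit together, here is a minimal sketch of an Express-style handler for an uploaded PDF. It is not the project's actual code. The factory name `makeExtractHandler`, the `text` response field, and the use of Multer's memory storage (which places the upload in `req.file.buffer`) are all assumptions made for illustration. The parser is injected so the sketch runs without installing anything; in a real setup it would presumably be pdf-parse, which accepts a Buffer and resolves to an object with a `text` property.

```javascript
// Hypothetical factory: `parsePdf` is injected so the sketch has no dependencies.
// In a real setup it would presumably be the function exported by pdf-parse.
function makeExtractHandler(parsePdf) {
  // Express-style handler; assumes Multer (memory storage) has populated req.file.
  return async function extractHandler(req, res, next) {
    try {
      if (!req.file) {
        // Nothing was uploaded under the expected form field.
        return res.status(400).json({ error: "No file uploaded" });
      }
      const result = await parsePdf(req.file.buffer);
      // The `text` field name in the response is an assumption.
      return res.json({ text: result.text });
    } catch (err) {
      // Hand unexpected failures to Express error-handling middleware.
      return next(err);
    }
  };
}
```

With the real libraries installed, the wiring could look roughly like `app.post("/extract", upload.single("file"), makeExtractHandler(pdfParse))`, where `upload` is a Multer instance configured with `multer.memoryStorage()`.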

Installing Node.js

Before installing, ensure that Node.js and npm (Node Package Manager) are installed on your system. You can download and install Node.js from the official Node.js website.

Installing pdf-extract-api-digitalocean

To install and use pdf-extract-api-digitalocean, follow these steps:

Clone the Repository: Begin by cloning the pdf-extract-api-digitalocean repository to your local machine.

git clone https://github.com/samestrin/pdf-extract-api-digitalocean/

Set the PORT environment variable to define the port on which the server listens. The default is 3000.

Navigate to your project's root directory and run:

npm start
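
The PORT fallback described above can be sketched as follows. This is a minimal illustration, not the project's code: `resolvePort` is a hypothetical helper, and the validation of the port range is an added assumption. Only the `PORT` variable name and the default of 3000 come from this README.

```javascript
// Hypothetical helper: choose the listening port from an environment object.
function resolvePort(env = process.env) {
  const parsed = Number.parseInt(env.PORT ?? "", 10);
  // Fall back to the documented default when PORT is unset or not a valid port.
  const valid = Number.isInteger(parsed) && parsed > 0 && parsed < 65536;
  return valid ? parsed : 3000;
}

console.log(`Server would listen on port ${resolvePort()}`);
```

On a Unix-like shell, the variable can be set for a single run with `PORT=8080 npm start`.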

Endpoints

Extract

Endpoint: /extract Method: POST

Extract text from a PDF file.

Parameters

  • file: PDF file

Example Usage

Use a tool like Postman or curl to make a request:

curl -F "file=@path_to_pdf_file.pdf" http://localhost:[PORT]/extract

The server will process the uploaded file and return the extracted text in JSON format.
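
The same request can be made from JavaScript. The sketch below assumes Node.js 18 or later, where `fetch`, `FormData`, and `Blob` are available as globals. The helper names `buildExtractForm` and `extractText` are hypothetical, and the shape of the JSON response is not specified in this README, so the parsed body is returned as-is.

```javascript
// Hypothetical helper: wrap PDF bytes in a multipart form under the "file" field.
function buildExtractForm(pdfBytes, filename = "document.pdf") {
  const form = new FormData();
  const blob = new Blob([pdfBytes], { type: "application/pdf" });
  form.append("file", blob, filename);
  return form;
}

// Hypothetical helper: POST the form to /extract and return the parsed JSON body.
async function extractText(baseUrl, pdfBytes, filename) {
  const response = await fetch(new URL("/extract", baseUrl), {
    method: "POST",
    body: buildExtractForm(pdfBytes, filename),
  });
  if (!response.ok) {
    throw new Error(`Extraction failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

A caller could read the file with `fs.promises.readFile("path_to_pdf_file.pdf")` and pass the resulting Buffer to `extractText("http://localhost:3000", bytes)`, adjusting the port to match the PORT setting.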

Error Handling

The API handles errors gracefully and returns appropriate error responses.

  • 400 Bad Request: Invalid request parameters.
  • 500 Internal Server Error: Unexpected server error.
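
One way to produce these two responses in an Express application is a final error-handling middleware. The sketch below is an assumption about how such a mapping could look, not the project's implementation: `statusForError` and `errorHandler` are hypothetical names, the `error` response field is illustrative, and treating upload errors named `MulterError` as client errors is a design choice made for this example.

```javascript
// Hypothetical mapping from an error to one of the two documented status codes.
function statusForError(err) {
  // Assumption: client-side problems carry `status: 400` or are upload errors
  // raised by Multer, whose error objects are named "MulterError".
  if (err && (err.status === 400 || err.name === "MulterError")) {
    return 400;
  }
  return 500;
}

// Express identifies error-handling middleware by its four-argument signature.
function errorHandler(err, req, res, next) {
  const status = statusForError(err);
  // Avoid leaking internal details on unexpected failures.
  const message = status === 400 ? err.message || "Bad Request" : "Internal Server Error";
  res.status(status).json({ error: message });
}
```

Such a middleware would be registered after all routes, for example with `app.use(errorHandler)`.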

Contribute

Contributions to this project are welcome. Please fork the repository and submit a pull request with your changes or improvements.

License

This project is licensed under the MIT License - see the LICENSE file for details.
