PDF Data Extractor

Automated batch processing tool built for a client to extract structured data from PDFs.

Description

  • Batch processes multiple PDF files from a single folder
  • Extracts text with layout preservation using pdfplumber
  • Regex pattern matching to identify and parse dates
  • Handles multi-line entries by combining split data
  • Cleans and normalizes extracted data
  • Outputs structured Excel spreadsheet with pandas
  • 99% time reduction compared to manual data entry
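
The steps above could be sketched roughly as follows. This is a minimal illustration, not the client tool itself: the date format (DD/MM/YYYY or DD-MM-YYYY), the column names, and the function names parse_entries and extract_folder are all assumptions made for the example. It assumes each record starts on a line beginning with a date, and that any following line without a date continues the previous record.

```python
import re
from pathlib import Path

# Hypothetical date format (DD/MM/YYYY or DD-MM-YYYY); the real patterns
# used by the tool are not given, so this is only an assumption.
DATE_RE = re.compile(r"^\s*(\d{1,2}[/-]\d{1,2}[/-]\d{4})\b\s*(.*)$")


def parse_entries(text):
    """Split extracted text into records, one per line that starts with a date.

    Lines without a leading date are treated as continuations of the
    previous record and appended to its description.
    """
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines left by layout-preserving extraction
        match = DATE_RE.match(line)
        if match:
            records.append({"date": match.group(1), "description": match.group(2)})
        elif records:
            records[-1]["description"] += " " + line.strip()
        # Lines seen before the first date (headers, titles) are ignored.

    # Normalize: collapse runs of whitespace into single spaces.
    for record in records:
        record["description"] = re.sub(r"\s+", " ", record["description"]).strip()
    return records


def extract_folder(folder, output_path):
    """Run parse_entries over every PDF in a folder and write one Excel file."""
    # Third-party imports are kept local so the parser above can be used
    # and tested without them installed.
    import pdfplumber
    import pandas as pd  # writing .xlsx relies on openpyxl being installed

    rows = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                # layout=True asks pdfplumber to approximate the page layout;
                # extract_text can return None or "" for pages with no text.
                text = page.extract_text(layout=True) or ""
                for record in parse_entries(text):
                    rows.append({"file": pdf_path.name, **record})

    columns = ["file", "date", "description"]
    pd.DataFrame(rows, columns=columns).to_excel(output_path, index=False)
```

Calling extract_folder("invoices", "output.xlsx") (both paths are placeholders) would process every PDF in the folder and write one row per record. Keeping the parsing in a pure function over plain text makes the regex and multi-line logic easy to check without any PDF files on hand.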

Technologies Used

Python, pdfplumber, pandas, Regular Expressions, OpenPyXL