Announcements and Chatter
PDF to JSON Conversion Tools?
PDF to JSON Conversion Tools?
  Harvey Mushman
  All
  Mar 13, 2024 @ 04:47am

Anyone here ever needed to integrate converting PDF files into JSON or some other data format? The PDF files are scanned images of typewritten originals, most of which are very clear and follow a somewhat standard format. That format includes a list of items in a table, which can span several pages.

The tools I have looked at so far include FormX.ai and PDF.co. Both of these products offer template solutions and APIs. I'm looking for other possible products or any information about these two companies.

re: PDF to JSON Conversion Tools?
  Rick Strahl
  Harvey Mushman
  Mar 13, 2024 @ 12:47pm

What exactly is PDF to JSON doing? Not clear on how that would be useful. Text extraction?

+++ Rick ---

re: PDF to JSON Conversion Tools?
  Harvey Mushman
  Rick Strahl
  Mar 14, 2024 @ 04:41am

Hi Rick,

The State of California licenses all farmers who want to sell produce in any of the 800 farmers' markets throughout the state. The license lists all the information about the farmer, including all the crops they grow, when they harvest, and the expected production yield. This licensing work is done by the County Commissioner in the county of production; there are 58 counties in California. The State collects copies of all the licenses (about 5000, renewed yearly or when new crops are added) as PDF files. Although there is a requirement for what information is included, no law defines the format of that information. As a result there are many different layouts, and all the documents are raster scans of what were originally plain-text documents. I can't get the plain-text documents, so the conversion needs to include an OCR process.

Several of the conversion solutions I have looked at let you define a template that allows the OCR label/data pairs to be parsed into JSON, where the label becomes the JSON tag for the data that is extracted. But there are many such tools/services on the market, many of which are from Chinese companies... I would prefer to shop locally if possible.
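To make the label-to-tag idea concrete, here is a minimal Python sketch. It assumes the OCR step has already produced plain text with one `Label: value` pair per line; the sample text, the field names, and the `labels_to_json` helper are all hypothetical and are not taken from FormX.ai, PDF.co, or any real license.

```python
import json
import re

# Hypothetical OCR output: one "Label: value" pair per line.
SAMPLE_OCR_TEXT = """\
Producer Name: Example Farm
County: Fresno
Certificate Number: 12345
"""

# Everything before the first colon is the label, the rest is the value.
LABEL_VALUE = re.compile(r"^\s*([^:]+?)\s*:\s*(.*?)\s*$")


def label_to_key(label):
    """Turn a printed label such as 'Producer Name' into 'producer_name'."""
    return re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")


def labels_to_json(ocr_text):
    """Collect every 'Label: value' line into one JSON object."""
    record = {}
    for line in ocr_text.splitlines():
        match = LABEL_VALUE.match(line)
        if match:  # lines without a colon are skipped
            record[label_to_key(match.group(1))] = match.group(2)
    return json.dumps(record, indent=2)


print(labels_to_json(SAMPLE_OCR_TEXT))
```

Real scans with many different layouts would probably need one such rule set per layout, which is roughly what the template feature in these products appears to provide.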

The end result of this project will be to build a database of all farmers and the products they grow. If I can get JSON from the conversion software, I can use WC to import the data and then build a searchable tool from there... Sound like fun?

re: PDF to JSON Conversion Tools?
  Rick Strahl
  Harvey Mushman
  Mar 14, 2024 @ 09:35am

I don't think you need the JSON conversion - you can do that yourself. I imagine you basically need something that scans a PDF into text.

Depending on how the PDF was created, it may already contain text (Postscript) that can be read and retrieved. Unfortunately, other PDFs contain only images, and those have to be scanned and OCR converted. There are dedicated libraries that do this, but most are expensive or use Python (which isn't useful for applications).
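As a rough sketch of the "do the JSON yourself" part, the Python below takes text that some OCR step has already produced, one string per page, and turns a table that continues across pages into JSON rows. Everything in it is an assumption for illustration: the sample pages, the column names, the rule that columns are separated by two or more spaces, and the rule that the first multi-column line is the header.

```python
import json
import re

# Hypothetical OCR text, one string per scanned page.
PAGES = [
    """\
Crop          Harvest      Yield
Tomatoes      Jun-Sep      2000 lbs
Peaches       Jul-Aug      1500 lbs
""",
    """\
Crop          Harvest      Yield
Walnuts       Oct-Nov      800 lbs
""",
]

# Assumption: two or more spaces separate one column from the next.
COLUMN_GAP = re.compile(r"\s{2,}")


def table_rows(pages):
    """Merge table rows from all pages into one list of dicts."""
    header = None
    rows = []
    for page in pages:
        for line in page.splitlines():
            cells = COLUMN_GAP.split(line.strip())
            if len(cells) < 2:
                continue  # blank line or text outside the table
            lowered = [cell.lower() for cell in cells]
            if header is None:
                header = lowered  # first multi-column line is the header
            elif lowered == header:
                continue  # header repeated at the top of a later page
            elif len(cells) == len(header):
                rows.append(dict(zip(header, cells)))
    return rows


print(json.dumps(table_rows(PAGES), indent=2))
```

The extraction of text from the scans is deliberately left out here; this only shows that once text is available, the JSON step itself is small.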

+++ Rick ---
