Searching PDFs and Text files for keywords - West Wind Technologies Support

Web Connection

Searching PDFs and Text files for keywords
Warren
All

Jun 17, 2019 @ 06:44am

Hello Everyone,

What's the best way to set up an employee knowledge base portal? I know in the past there was a DBSearch(?) library that would index all the PDF/TXT files and let you search them without much fuss. I see GoFish; is that what is typically used today to index files and get quick search results from within FoxPro?

Warren

re: Searching PDFs and Text files for keywords
Rick Strahl
Warren

Jun 17, 2019 @ 02:44pm

I don't have any good answers for you, other than that you'll need some sort of third-party indexing solution to get decent search performance across text and binary content. There are a number of options out there: .NET has Lucene, which can potentially be automated from FoxPro via wwDotnetBridge. I talked to somebody who'd done some work with Lucene for a project with good success. All these solutions require a bit of up-front work and active updates to keep the indexes current, so it's usually not just a 'drop it in' solution.
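To make the "active updates" point concrete, here is a minimal sketch (in Python, purely for illustration, not tied to Lucene or any other product) of finding the files that need re-indexing. The function name `files_needing_reindex` and the `indexed_mtimes` mapping are hypothetical; the assumption is that the indexer remembers each file's last-modified time.

```python
import os

def files_needing_reindex(root, indexed_mtimes):
    """Return PDF/TXT paths under root that are new or changed since indexing.

    indexed_mtimes is a hypothetical dict mapping a file path to the
    modification time that was recorded when the file was last indexed.
    """
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Only consider the file types discussed in this thread.
            if not name.lower().endswith((".pdf", ".txt")):
                continue
            path = os.path.join(dirpath, name)
            # A missing entry (new file) or a different mtime means stale.
            if indexed_mtimes.get(path) != os.path.getmtime(path):
                stale.append(path)
    return sorted(stale)
```

Running something like this on a schedule and re-indexing only the returned paths is one way to keep the up-front work from being repeated on every pass.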

GoFish is a code search tool, so it doesn't apply here.

Let us know what you find...

+++ Rick ---

re: Searching PDFs and Text files for keywords
Warren
Warren

Jun 18, 2019 @ 06:47am

I found this yesterday: http://www.foxweb.com/fwFullText/

So far I am searching the text from a sample of 30 PDFs, and it seems to be pretty much what I needed. It's 100% FoxPro code, and it creates the keyword search index very quickly.

re: Searching PDFs and Text files for keywords
Rick Strahl
Warren

Jun 19, 2019 @ 12:55pm

Interesting. Briefly looking at this, I can't imagine it works well with PDF files beyond matching the file itself; it won't locate the content at a specific position inside the document. This looks text based, not context based. IOW, you can find that something contains the searched string, but not where in the document (at least not accurately).

+++ Rick ---

re: Searching PDFs and Text files for keywords
Warren
Rick Strahl

Jun 20, 2019 @ 06:46am

Sorry, I didn't really provide enough information. Here is what I am doing:

I run FILETOSTR() on each PDF (about 6,000 of them) and look for text that tells me whether the PDF is text searchable. I then use CreateObject("WScript.Shell") to launch a PDF viewer (sometimes Acrobat Reader, other times Foxit) and open the text-searchable PDF. I send it Ctrl-A and Ctrl-C to grab the text. I add or update a record in my table with the directory, the filename, whether the file is text searchable, the timestamp of the last file change, and a memo field that holds the grabbed text. I can then run the index builder code, search for keywords in the text I copied from the PDFs, and show users the relevant PDFs.
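For readers who want to see the keyword-index idea itself, here is a rough sketch in Python (for illustration only; it is not the fwFullText code, and the names `build_keyword_index` and `search` are made up). It assumes the text has already been extracted and is held per document, much like the memo field described above.

```python
import re
from collections import defaultdict

def build_keyword_index(documents):
    """Map each lowercase word to the set of document names that contain it.

    documents is a dict of {document name: extracted text}.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index

def search(index, query):
    """Return the names of documents containing every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    # Start with the documents for the first word, then intersect the rest.
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return sorted(matches)
```

An index like this answers "which documents mention these words" but stores no positions, so it cannot say where in a document the match occurs.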

I have another job that scans my table for non-searchable PDFs and processes them through ABBYY FineReader Server to convert them from image-based PDFs to text-searchable PDFs.

So I am not trying to read the raw PDFs (other than checking whether I can tell if there is text in there). Some of my PDFs are encoded and I am not sure how to decode them, so I fire off a PDF viewer and do the Ctrl-A and Ctrl-C to grab the actual text after the viewer has decoded it.

I have perhaps overthought this process. There may be easier ways to achieve what I want, but I like keeping things as clean as possible. So it's 90% Visual FoxPro, 8% Web Connection, 2% PDF tools.

re: Searching PDFs and Text files for keywords
Warren
Warren

Jun 25, 2019 @ 10:06am

Well... I gave up on auto-detecting whether the PDF is searchable... I just fire off the PDF viewer and do Select All / Copy, and if the length of _CLIPTEXT is 0 or < 100(?) characters I know I have to send it off to ABBYY to OCR it.
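The length check boils down to something like the sketch below (Python for illustration; `needs_ocr` and `MIN_TEXT_CHARS` are hypothetical names). The cutoff of 100 comes from the "< 100(?)" guess above and would need tuning, and stripping whitespace before counting is an added assumption, not part of the original check.

```python
MIN_TEXT_CHARS = 100  # assumed cutoff, taken from the "< 100(?)" guess

def needs_ocr(copied_text, min_chars=MIN_TEXT_CHARS):
    """Return True when the copied text is too short to call the PDF searchable."""
    # Strip whitespace first so a clipboard full of blanks still counts as empty.
    return len(copied_text.strip()) < min_chars
```

For example, `needs_ocr("")` is true, while a page of copied text with 100 or more non-blank characters is not flagged.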