Ephesoft’s FuzzyDB feature uses values extracted from a document
to retrieve an index value from a database table.
We’ve used this feature to identify a vendor when the vendor’s name
appears only inside a logo on the document and not in the
actual text. We configured our
FuzzyDB table with one column that contains the vendor name, and several other columns that contain the vendor’s address and/or unique strings from the vendor's paperwork (like a company slogan).
When Ephesoft processes a document, it compares the text of the
document to an indexed version of the contents of those searchable columns, and
returns the vendor name for the row with the best match.
Sometimes the results returned by the FuzzyDB search aren’t what
we expected to find, and it can be helpful to troubleshoot the search using
Luke, a Lucene tool included in the Ephesoft installation.
- Open Luke by running luke.bat from the following directory: <Ephesoft-Home>\Dependencies\luke
- The Open Lucene Index dialog should appear by default. If it doesn’t, choose File->Open Lucene Index.
- Browse to the path of the Lucene index you want to view, then click OK. For Ephesoft FuzzyDB work, choose the path to your FuzzyDB table name in the following folder: <Ephesoft-Home>\SharedFolders\BC<Number>\fuzzydb-index\ephesoft\<table-name> (It’s best to open the index in Read-Only mode, just to be safe.)
- This opens the index in Luke’s main window.
- Click on the Search tab at the top, then go to the Analysis tab on the right and change the drop-down list to “org.apache.lucene.analysis.standard.StandardAnalyzer”.
- Enter some search criteria in the upper-left text box and click Search (ensure that there are no punctuation characters in your search string).
- Results will be displayed in the Results list. Note the "rowID" values that are returned.
- Open your FuzzyDB table in a SQL tool like HeidiSQL, and you can see that the rowID values returned from the search above map to rowID values in the FuzzyDB table.
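If you have many rowID values to check, the lookup in the last step can be scripted. The following is only a sketch: it uses Python's built-in sqlite3 module with an in-memory table as a stand-in for the real FuzzyDB database, and the table name `vendor_lookup`, the columns `vendor_name` and `address`, and the sample rows are all hypothetical. Only the `rowID` column name comes from the Luke results described above.

```python
import sqlite3


def vendors_for_row_ids(conn, row_ids):
    """Return (rowID, vendor_name) pairs for the given rowID values."""
    if not row_ids:
        return []
    # One "?" placeholder per rowID keeps the query parameterized.
    placeholders = ", ".join("?" for _ in row_ids)
    sql = (
        "SELECT rowID, vendor_name FROM vendor_lookup "
        f"WHERE rowID IN ({placeholders}) ORDER BY rowID"
    )
    return conn.execute(sql, list(row_ids)).fetchall()


def demo_connection():
    """Build an in-memory stand-in table with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE vendor_lookup ("
        "rowID INTEGER PRIMARY KEY, vendor_name TEXT, address TEXT)"
    )
    conn.executemany(
        "INSERT INTO vendor_lookup VALUES (?, ?, ?)",
        [
            (1, "Acme Supply", "100 main street springfield"),
            (2, "Globex", "42 harbor road quality and service"),
            (3, "Initech", "po box 123 riverside"),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    # rowID values as they might be copied from Luke's Results list.
    for row_id, vendor in vendors_for_row_ids(conn, [3, 1]):
        print(row_id, vendor)
```

Against a real FuzzyDB table you would swap in the connection for your own database and the actual table and column names.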
The above procedure is
useful for generic testing. However, if
you have a document in a specific batch that is giving you problems, you can use
the following steps to test FuzzyDB retrieval for an entire page.
- On the Ephesoft server, find the HOCR.xml file for that page, then copy the HOCR value from the last line of the XML file.
- Paste that entire line into an editor like Notepad++ and remove all special characters (see the end of this page for an easy way to clean out special characters in Notepad++).
- Convert the entire string to lowercase (Edit -> Convert Case to -> Lowercase).
- Paste the cleaned, lowercase string into the Luke search dialog, and click Search.
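If you run this test often, the cleanup steps above (strip special characters, convert to lowercase) can be scripted instead of done by hand in Notepad++. This is a minimal sketch, not part of Ephesoft or Luke: the function name `clean_for_luke` is made up for illustration, and it assumes you have already copied the HOCR text out of the XML file as a plain string. It replaces punctuation with spaces rather than deleting it, so that words separated only by punctuation don't run together.

```python
import re


def clean_for_luke(hocr_text):
    """Return a lowercase, punctuation-free, single-spaced search string."""
    # Replace anything that is not a letter, digit, or whitespace with a space.
    # The "|_" branch is needed because \w also matches the underscore.
    no_punct = re.sub(r"[^\w\s]|_", " ", hocr_text)
    # Collapse the runs of whitespace left behind by the replacement.
    collapsed = re.sub(r"\s+", " ", no_punct).strip()
    return collapsed.lower()


if __name__ == "__main__":
    sample = 'ACME Supply Co., Inc.  "Quality AND Service" - P.O. Box 123'
    print(clean_for_luke(sample))
    # prints: acme supply co inc quality and service p o box 123
```

The printed string can be pasted straight into the Luke search box. Because the result is entirely lowercase, it also contains no uppercase AND or OR.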
When you do your page-specific search test, you may find that multiple vendor rows are being returned, or maybe you're simply getting the wrong vendor. Typically this means that the values you've entered in the searchable columns of your FuzzyDB table aren't clear enough to return the proper result.
Edit the values in the searchable columns of the FuzzyDB table, and repeat your search. Continue editing/searching until you get the desired results. Look for values that are unique to that vendor's paperwork, such as company name, address, ZIP code, company slogans, or distinct wording that appears on each of that vendor's documents. You can even use personal names if the vendor's paperwork always has the same person's name on it.
When you're satisfied with the results inside Luke, make sure to do a Learn DB inside Ephesoft to recreate the Lucene indexes within Ephesoft based on the changes you've made to the FuzzyDB table.
Avoid Reserved Words in your Luke Search Expression
The words “AND” and
“OR” in uppercase are reserved words in Luke’s query syntax. Ending a search string with
either of those words will cause an error. Using AND or OR inside a
search string is likely to return unexpected results, because those words will be
treated as query operators rather than as search terms. Converting the search string to lowercase before searching avoids both problems.
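A quick check for reserved words before pasting a string into Luke might look like the sketch below. The function names are hypothetical. The set also includes NOT, which is an addition to the two words named above: Lucene's classic query syntax treats uppercase NOT as an operator as well.

```python
# Uppercase words treated as operators by the query syntax.
# AND and OR come from the text above; NOT is an added assumption.
RESERVED = {"AND", "OR", "NOT"}


def find_reserved(query):
    """Return the reserved uppercase words found in the query, in order."""
    return [token for token in query.split() if token in RESERVED]


def neutralize(query):
    """Lowercase only the reserved words so they are read as ordinary terms."""
    return " ".join(
        token.lower() if token in RESERVED else token
        for token in query.split()
    )


if __name__ == "__main__":
    query = "Smith AND Sons OR"
    print(find_reserved(query))  # ['AND', 'OR']
    print(neutralize(query))     # Smith and Sons or
```

Only whole, whitespace-separated words are matched, so a word like "ANDERSON" is left alone.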
Removing Special Characters in Notepad++
To remove special
characters from a string in Notepad++, open the Find/Replace dialog (Ctrl+F)
and click on the Replace tab. Select the Regular Expression radio button
at the bottom left, then type [[:punct:]] in
the Find What field, leave the Replace With field empty, and click the Replace All button to remove those
characters ([[:punct:]] is
a regular expression that will find all punctuation characters in the string).