Semi-Structured: The Data Keyword Pairs Approach

The semi-structured, or data keyword pairs, approach differs from the other approaches in that coordinates are rarely considered. When they are, they are relative to the page being processed. This approach also has a training stage, in which an operator loads a set of sample images of a particular document type, analyzes the variance across the images, and determines the best logic for locating each field. Most fields are located using keywords, although graphics, lines, and white space can also serve as anchors. Once a keyword or other object has been identified for a field, the operator provides the logic that tells the software where the field is located relative to it.


For example, to find the invoice number on an invoice page, the logic starts by looking for “Invoice No.”, “Invoice Number”, “Invoice #”, or any similar phrase that appears on the page. Once the keyword is found, the next step tells the software where the value sits relative to it: to the right of the keyword, or some number or percentage of pixels below the top of the keyword, or some number or percentage of pixels above the bottom of the keyword. The logic would probably also specify the type of characters an invoice number can contain. The software then generates guesses for where the information is located and, if there are several, picks the best one.
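The invoice number logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular capture product works: it assumes OCR has already produced words with pixel bounding boxes in reading order, and the names Word, KEYWORDS, VALUE_PATTERN, find_keyword, and find_invoice_number, along with the tolerance parameter and the "nearest candidate wins" rule, are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR word with its bounding box in page pixels (assumed input format)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


# Hypothetical keyword variants, stored as lowercase two-word phrases.
KEYWORDS = {("invoice", "no."), ("invoice", "number"), ("invoice", "#")}

# Hypothetical character rule: letters, digits, and hyphens, at least 3 characters.
VALUE_PATTERN = re.compile(r"[A-Z0-9][A-Z0-9-]{2,}", re.IGNORECASE)


def find_keyword(words):
    """Return (left, top, right, bottom) of the first keyword phrase, or None."""
    for first, second in zip(words, words[1:]):
        if (first.text.lower(), second.text.lower()) in KEYWORDS:
            return (
                first.left,
                min(first.top, second.top),
                second.right,
                max(first.bottom, second.bottom),
            )
    return None


def find_invoice_number(words, tolerance=0.5):
    """Pick the best-guess value to the right of the keyword, or None."""
    box = find_keyword(words)
    if box is None:
        return None
    _, top, right, bottom = box
    # Vertical slack is a fraction of the keyword height, not a fixed coordinate.
    slack = (bottom - top) * tolerance

    candidates = []
    for word in words:
        if word.left < right:
            continue  # not to the right of the keyword
        if word.top < top - slack or word.bottom > bottom + slack:
            continue  # outside the keyword's vertical band
        if not VALUE_PATTERN.fullmatch(word.text):
            continue  # wrong type of characters
        candidates.append(word)

    if not candidates:
        return None
    # Several guesses may remain; here the one closest to the keyword wins.
    return min(candidates, key=lambda word: word.left - right).text
```

Because every position is measured from the keyword's own bounding box, the same logic keeps working when the keyword moves to a different place on another vendor's invoice.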

Logic Flexibility

In the setup stage of this approach, there is a tug of war between how flexible the logic should be and how constrained it should be. The right amount of flexibility depends on the variance across pages and the complexity of the document type, and it is usually set at the field level. For example, in most commercial invoices the total amount due is located in the bottom right portion of the last page, so the logic for that field can be constrained to that area.
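One way to picture field-level flexibility is as a per-field search scope. The sketch below is a hypothetical illustration, not a description of any real package: Region, FieldRule, RULES, and in_scope are invented names, and the choice to express the region as fractions of the page size is an assumption made so that the constraint stays relative to the page.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    """Search area as fractions of page width and height (0.0 to 1.0)."""
    x_min: float = 0.0
    y_min: float = 0.0
    x_max: float = 1.0
    y_max: float = 1.0

    def contains(self, x, y, page_width, page_height):
        return (
            self.x_min <= x / page_width <= self.x_max
            and self.y_min <= y / page_height <= self.y_max
        )


@dataclass(frozen=True)
class FieldRule:
    """How tightly one field's search is constrained (hypothetical structure)."""
    name: str
    region: Region = field(default_factory=Region)  # default: the whole page
    last_page_only: bool = False


RULES = {
    # Constrained: bottom right portion of the last page.
    "total_due": FieldRule("total_due", Region(x_min=0.5, y_min=0.5), last_page_only=True),
    # Flexible: anywhere on any page.
    "invoice_number": FieldRule("invoice_number"),
}


def in_scope(rule, x, y, page_index, page_count, page_width, page_height):
    """Return True if a candidate at pixel (x, y) falls inside the rule's scope."""
    if rule.last_page_only and page_index != page_count - 1:
        return False
    return rule.region.contains(x, y, page_width, page_height)
```

A tight rule such as the one for total_due rejects stray matches elsewhere in the document, but it will miss the field on a layout that breaks the pattern. A loose rule has the opposite trade-off.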


Some data capture packages combine the iterative template and data keyword pairs approaches, either as a single method or as a choice offered to the operator depending on the document type and the setup operator’s skill set. Other, fully automated solutions incorporate the assisted capture approach for quality assurance.