Understanding Data Capture Accuracy
Data capture accuracy is often confused with full-page optical character recognition (OCR) accuracy, which is a single percentage based on the number of correctly recognized characters. This confusion can cause many problems when an organization is deciding how to evaluate data capture products. Data capture accuracy is derived from a series of accuracy calculations. The calculations proceed stepwise, and each step affects the next; failure in one step often prevents the later accuracy levels from being calculated. The best way to assess the accuracy of a package is to measure both the actual accuracy achieved against truth data and the percentage of uncertainty. Uncertainty is the percentage of characters a software package flags for manual review. If there are one hundred characters in a document and the percentage of uncertainty is five percent, then an operator will review five characters. In data capture, it is important for organizations to understand that, regardless of the reported accuracy rate, there is always a potential for false positives. A false positive is a result that the technology reports as accurate but that is in fact incorrect. False positives are combated during setup with business rules and data types. Below are the stages of accuracy calculation.
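To make the distinction between measured accuracy, uncertainty, and false positives concrete, here is a minimal Python sketch. The `CharResult` record and the `summarize` function are hypothetical names invented for this illustration and do not come from any particular data capture product; the sketch assumes truth data is available for every character.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CharResult:
    recognized: str  # character produced by the software
    truth: str       # character from the truth data
    flagged: bool    # True if the software marked it for manual review


def summarize(results: List[CharResult]) -> Dict[str, float]:
    """Return accuracy, uncertainty, and false-positive count for a set of characters."""
    total = len(results)
    if total == 0:
        raise ValueError("at least one character is required")
    correct = sum(r.recognized == r.truth for r in results)
    flagged = sum(r.flagged for r in results)
    # A false positive is wrong but was NOT flagged, so no operator will see it.
    false_positives = sum((not r.flagged) and r.recognized != r.truth for r in results)
    return {
        "accuracy_percent": 100 * correct / total,
        "uncertainty_percent": 100 * flagged / total,
        "false_positives": false_positives,
    }


# Illustrative data only: one hundred characters, five of them flagged.
results = (
    [CharResult("A", "A", False)] * 94   # correct and not flagged
    + [CharResult("B", "A", False)] * 1  # wrong but not flagged: a false positive
    + [CharResult("A", "A", True)] * 2   # correct but flagged for review
    + [CharResult("B", "A", True)] * 3   # wrong and flagged for review
)
summary = summarize(results)
print(summary)
```

With five percent uncertainty on one hundred characters, the operator reviews five characters, yet the single unflagged error still slips through, which is why business rules and data types matter.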
Page Identification
This is the identification of a page and its associated document type. If a page is not identified as any type, the accuracy is zero percent; if the page type is accurately identified, its accuracy is one hundred percent. Pages identified as the wrong type (false positives) also result in zero percent accuracy. Page ID is an all-or-nothing accuracy calculation. A zero percent accuracy may stop processing of a document entirely or result in false positives.
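The all-or-nothing nature of page identification can be sketched in a few lines of Python. The function name `page_id_accuracy` and the use of `None` for an unidentified page are assumptions made for this illustration.

```python
from typing import Optional


def page_id_accuracy(predicted_type: Optional[str], true_type: str) -> int:
    """Return 100 when the page type is correctly identified, otherwise 0.

    predicted_type is None when the page was not identified as any type.
    """
    return 100 if predicted_type == true_type else 0


print(page_id_accuracy("invoice", "invoice"))  # correctly identified
print(page_id_accuracy(None, "invoice"))       # not identified at all
print(page_id_accuracy("check", "invoice"))    # wrong type (false positive)
```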
Field Location
This is the process of zoning the fields to be recognized on the document. If a field is not located at all, it is zero percent accurate. If a field is partially located, it is one to ninety-nine percent accurate, depending on the total possible field length and the length identified. A field located in full is one hundred percent accurate.
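One plausible way to express field location accuracy is as the ratio of the length identified to the total possible field length. The sketch below assumes both lengths are measured in characters; the function name `field_location_accuracy` is hypothetical, and real products may measure zones differently (for example, by pixel area).

```python
def field_location_accuracy(located_length: int, total_length: int) -> float:
    """Percentage of the field that was located, from 0 to 100."""
    if total_length <= 0:
        raise ValueError("total_length must be positive")
    # Clamp so an over-long zone never reports more than 100 percent.
    located = max(0, min(located_length, total_length))
    return 100 * located / total_length


print(field_location_accuracy(0, 10))   # field not located
print(field_location_accuracy(7, 10))   # field partially located
print(field_location_accuracy(10, 10))  # field fully located
```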
Character Level OCR
This is the accuracy reported by the OCR engine per character. Character Level OCR is the most widely known and used measurement of accuracy. Character level accuracy ranges from zero to one hundred percent. If a field has ten characters and nine are correct, the field is ninety percent accurate.
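The ten-character example above can be worked through as a short Python sketch. It compares the recognized text with truth data position by position, which assumes both strings are aligned and of equal length; an edit-distance measure would be more robust when characters are inserted or dropped. The function name and the sample field values are made up for illustration.

```python
def character_accuracy(recognized: str, truth: str) -> float:
    """Percentage of truth characters matched at the same position."""
    if not truth:
        raise ValueError("truth must not be empty")
    matches = sum(r == t for r, t in zip(recognized, truth))
    return 100 * matches / len(truth)


# Hypothetical field: the digit "1" was misread as the letter "l".
truth = "INV0012345"
recognized = "INV00l2345"
print(character_accuracy(recognized, truth))  # nine of ten characters correct
```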
Business Rules and Data Types
This accuracy calculation varies between software packages, but it is applied at the final step of recognition. If a field does not match a particular data type, for example a proper date format, the accuracy of that field could be reduced by one hundred percent, or by a percentage representing the share of characters that do not match the data type. If a business rule states that a particular field should be five characters long and it is recognized as seven characters long, the accuracy of that field could be reduced by one hundred percent or less. Business rules and data types are the final tools to enhance accuracy and avoid false positives.
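Because this step differs between packages, the following Python sketch shows only one possible policy: a field that fails a data type or business rule drops to zero percent, which corresponds to the full one hundred percent reduction described above. The function names, the `%m/%d/%Y` date format, and the all-or-nothing policy are assumptions chosen for illustration.

```python
from datetime import datetime
from typing import Optional


def matches_date_type(value: str, fmt: str = "%m/%d/%Y") -> bool:
    """Return True if value parses as a date in the given format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def apply_rules(
    value: str,
    ocr_accuracy: float,
    expected_length: Optional[int] = None,
    date_format: Optional[str] = None,
) -> float:
    """Return the field accuracy after rules; any failed rule yields 0.0."""
    if expected_length is not None and len(value) != expected_length:
        return 0.0
    if date_format is not None and not matches_date_type(value, date_format):
        return 0.0
    return ocr_accuracy


# A five-character rule applied to a field recognized as seven characters.
print(apply_rules("1234567", 95.0, expected_length=5))
# A field that is not a valid date under the assumed format.
print(apply_rules("13/45/2020", 90.0, date_format="%m/%d/%Y"))
```

A package that applies a partial reduction instead would scale `ocr_accuracy` by the share of characters that satisfy the rule rather than returning zero.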
All four of the above accuracy calculations can be rolled into a single percentage that becomes the final calculation of data capture accuracy per page. Production environment accuracy is usually determined by running documents through production for a set period of time and averaging the per-page accuracies. Companies that rely heavily on business-process-driven automation consider accuracy only at the complete document level rather than the page level. For example, consider accounts payable (AP) automation of invoices, purchase orders (POs), and checks. There may be an average page-level accuracy of ninety-five percent on the invoices, ninety-five percent on the POs, and only seventy-five percent on checks. Because a document is only as strong as its least accurate type, the accuracy of AP automation would be considered to be seventy-five percent on average.
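The roll-up can be sketched in Python as follows. The text does not specify how the four stages are combined, so multiplying the stage percentages in `page_accuracy` is purely an assumption, chosen because a zero at any stage then zeroes the page. The averaging and least-accurate-type calculations follow the description above; all function names are hypothetical.

```python
from typing import Dict, List


def page_accuracy(stage_percents: List[float]) -> float:
    """Combine stage accuracies into one page percentage (ASSUMED: product)."""
    result = 100.0
    for percent in stage_percents:
        result *= percent / 100
    return result


def production_accuracy(page_percents: List[float]) -> float:
    """Average the per-page accuracies collected over a production period."""
    if not page_percents:
        raise ValueError("at least one page is required")
    return sum(page_percents) / len(page_percents)


def document_level_accuracy(percent_by_type: Dict[str, float]) -> float:
    """A document is only as strong as its least accurate page type."""
    return min(percent_by_type.values())


# The AP automation example: invoices, POs, and checks.
ap_accuracy = document_level_accuracy({"invoice": 95, "PO": 95, "check": 75})
print(ap_accuracy)  # 75
```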