Language plays an important role in document classification and data extraction. It also matters in downstream processing and whenever humans are in the loop. Each document is written in a particular language, your users speak many languages, and your business deals with companies in countries that use yet other languages.
Detecting the language of a document early on in the processing workflow can improve accuracy and throughput.
Why the simplest way is not always the best
Let’s say you have an AP Automation system in place. It centrally processes invoices that your organization receives from many countries. An OCR engine does text recognition. Human operators review incorrectly extracted data in a browser-based UI. The simplest way of setting up the system with regard to language support is this:
- In the OCR engine configuration, select all document languages the system can encounter.
- Configure the machine-learning-based data extraction so that it builds an AI model for all languages at once.
- Send invoices for review to a large pool of all operators so you can efficiently burn through the review tasks.
This approach keeps the configuration of the automation system simple, but each of the above “default settings” has disadvantages, which we will now look into.
Language Detection helps configure the OCR engine
OCR engine language settings work like this: the engine segments the image into characters, classifies each character to know which one it is, and then forms words based on the spacing. Lastly, it uses a language-specific dictionary to improve the raw recognition, because a word is more likely to be correct if it appears in a dictionary.
For more details take a look at our other posts on how OCR works:
This works well if you select only one or two languages, but many engines allow you to select dozens of languages at once. If you do that, the OCR engine applies all the selected dictionaries. The chances increase that it reads, for example, a French word in an English document.
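To see why extra dictionaries can hurt, here is a minimal sketch of dictionary-based correction. It is a toy and not how any particular OCR engine works: the word lists, the misread token `prixe`, and the helper `correct` are all made up for illustration, and the fuzzy lookup uses Python's standard `difflib` module.

```python
import difflib

# Hypothetical mini-dictionaries; real engines ship far larger ones.
ENGLISH = ["price", "total", "invoice"]
FRENCH = ["prix", "total", "facture"]

def correct(token, dictionary):
    """Snap a raw OCR token to the closest dictionary word, if one is close enough."""
    matches = difflib.get_close_matches(token, dictionary, n=1, cutoff=0.7)
    return matches[0] if matches else token

raw = "prixe"  # assumed misread of "price" in an English invoice

print(correct(raw, ENGLISH))           # only the English dictionary is active
print(correct(raw, ENGLISH + FRENCH))  # English and French dictionaries are both active
```

With only the English list, the token snaps to "price". With both lists active, "prix" is the closer match, so the French word wins even though the document is English.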
You can avoid this in many automation products by using language detection. Instead of processing all documents with OCR engine settings that include every language, you can first detect the language and then route the document to a processing step whose OCR engine settings use only that language. This means more configuration effort, but likely results in better recognition.
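The detect-then-route idea can be sketched in a few lines. This is a simplified illustration under several assumptions: a rough first-pass text is already available, the stopword lists are tiny and hypothetical, and `detect_language` and `ocr_languages_for` are made-up helper names rather than features of a specific product.

```python
import re

# Tiny, illustrative stopword lists; a real detector would use far more evidence.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "invoice", "date", "total"},
    "fr": {"le", "la", "et", "de", "facture", "montant", "du"},
    "de": {"der", "die", "und", "rechnung", "datum", "betrag", "von"},
}

def detect_language(text):
    """Return the language code with the most stopword hits, or None if nothing matches."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(w in vocab for w in words) for lang, vocab in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def ocr_languages_for(rough_text):
    """Choose the OCR language list for the second, more careful OCR pass."""
    lang = detect_language(rough_text)
    # Fall back to all configured languages only when detection fails.
    return [lang] if lang else sorted(STOPWORDS)

print(ocr_languages_for("Facture no 12 du 3 mai, montant total de la commande"))
```

The fallback keeps the old "all languages" behavior for documents the detector cannot place, so nothing gets stuck in the workflow.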
Language Detection improves the data extraction model
Many capture products include machine learning tools for data extraction. In the old days, consultants came in and set up regular expressions and dozens of rules per field that needed extraction. This becomes unmanageable quickly. Machine learning to the rescue. Instead of writing rules, users train the system by clicking on the value of each field in several documents. The system builds a model from all these sample documents and figures out the good old rules by itself.
Now, this model’s performance also depends on what you throw into the samples used to train it. Training the same model on invoices from all countries can cause effects similar to the dictionary problem explained earlier. The more languages the engine sees in the samples, the less consistent the keywords and labels for each field become.
Again, you can avoid this with some design effort. Most capture automation products include classification features. You can use language detection based on classification to route the document to a language-specific sub-class. In that sub-class, the fields are extracted with a language-specific AI model only, which likely provides better accuracy. Designing the capture project like this has other great side effects, too:
- You can use the language information to verify the extracted vendor. For example, it is unlikely that a Spanish vendor sends a French invoice.
- You can use the language to derive the country of the vendor in many cases. This allows you to fine-tune the local tax validations and other country-specific rules.
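Both side effects boil down to a lookup from language to plausible countries. The sketch below assumes a small, incomplete mapping that you would maintain yourself; the table contents and the function names `vendor_is_plausible` and `candidate_countries` are hypothetical.

```python
# Illustrative and incomplete: ISO language code -> countries where it is commonly used.
LANGUAGE_COUNTRIES = {
    "es": {"ES", "MX", "AR"},
    "fr": {"FR", "BE", "CH", "CA"},
    "de": {"DE", "AT", "CH"},
}

def vendor_is_plausible(invoice_language, vendor_country):
    """True when the extracted vendor's country fits the detected invoice language."""
    countries = LANGUAGE_COUNTRIES.get(invoice_language)
    if countries is None:
        return True  # unknown language: do not flag anything
    return vendor_country in countries

def candidate_countries(invoice_language):
    """Countries whose tax rules might apply, narrowed down by language."""
    return sorted(LANGUAGE_COUNTRIES.get(invoice_language, set()))

print(vendor_is_plausible("fr", "ES"))  # French invoice, Spanish vendor
print(candidate_countries("de"))
```

A failed check is a hint for review and not proof of an error, since vendors do sometimes invoice in a foreign language.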
Language Detection gets the right document to the right operator
When you automate the Accounts Payable process, the people reviewing the incorrectly extracted invoices or doing the downstream processing (approvals, GL-coding, etc.) are often experts. If you centralize the processing of all invoices, it makes sense to route them to users fluent in the invoice language. It doesn’t make much sense to have a French accountant review a Chinese invoice.
Language detection can solve that problem, too. You can configure many AP Automation systems to route tasks to individuals or groups based on custom criteria. Why not use the detected invoice language as a criterion to route the invoice to the person who speaks the invoice’s language?
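As a rough sketch, such a routing rule is just a lookup with a fallback. The queue names, the `ReviewTask` class, and the `queue_for` function below are assumptions for illustration; real AP Automation systems expose this through their own configuration.

```python
from dataclasses import dataclass

@dataclass
class ReviewTask:
    invoice_id: str
    language: str  # detected invoice language, e.g. "fr"

# Hypothetical operator groups, one per language the team covers.
REVIEW_QUEUES = {"fr": "ap-review-fr", "de": "ap-review-de", "zh": "ap-review-zh"}
DEFAULT_QUEUE = "ap-review-general"

def queue_for(task):
    """Pick the operator group that speaks the invoice's language."""
    return REVIEW_QUEUES.get(task.language, DEFAULT_QUEUE)

print(queue_for(ReviewTask("INV-001", "zh")))
print(queue_for(ReviewTask("INV-002", "pt")))  # no dedicated group, goes to the general pool
```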
How does language detection work?
Language detection is usually done with classification technology. There are three kinds of tools for this:
- Built into the OCR engine. You can configure some OCR engines to either use manually selected languages or auto-detect the right one. The auto-detection is done internally in the engine. Often, it does the first pass in a language with Western characters, such as English, and then classifies the text with an AI model trained on all languages. But for you as a user of the OCR engine, it is just a checkbox: “Auto-detect language”.
- Built into the Automation platform. Sometimes, language detection is a feature there that you can use to steer the workflow or modify settings like OCR languages dynamically.
- Build your own. Most Capture automation platforms include classification. You can create your own classification model for languages. If you need samples to train such a language classifier, you can find them easily on Kaggle.com, or you can use existing labeled samples from your system of record.
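For the build-your-own option, a language classifier can be surprisingly small. The sketch below is one possible approach, assuming character trigram counts compared by cosine similarity; the four training sentences are invented stand-ins for real labeled samples, and the function names are hypothetical.

```python
import math
from collections import Counter

def trigrams(text):
    """Count overlapping three-character sequences, padded so word edges count too."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def train(samples):
    """samples: iterable of (text, language) pairs -> one trigram profile per language."""
    profiles = {}
    for text, lang in samples:
        profiles.setdefault(lang, Counter()).update(trigrams(text))
    return profiles

def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify(text, profiles):
    """Return the language whose profile is most similar to the text."""
    grams = trigrams(text)
    return max(profiles, key=lambda lang: cosine(grams, profiles[lang]))

# Invented stand-ins for labeled samples from your system of record.
SAMPLES = [
    ("invoice number and date of the payment", "en"),
    ("the total amount is due within thirty days", "en"),
    ("numéro de facture et date de paiement", "fr"),
    ("le montant total est à payer sous trente jours", "fr"),
]
profiles = train(SAMPLES)
print(classify("please pay the amount of this invoice", profiles))
```

With only a handful of samples per language this is fragile; more text per language makes the profiles more reliable.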
When should I care?
Consider your business case. You might not need to worry about language detection if you process only documents in one language and if all your people working on them speak the same language. But in all other cases, consider language detection capabilities when selecting OCR engines or buying software that solves your document automation problems.
Check out our OCR engine buying guide for more information on the different types and benefits of the OCR engines on the market: