Optical Character Recognition (OCR) in Legal Tech


What is OCR?

Optical character recognition (OCR) is foundational for many legal tech tools. OCR is the technology that turns images or scans of documents into plain text that a computer can recognize. For example, you might have scanned a paper contract and saved it as a JPEG image file. What you really want, and what computer algorithms need, is a document file or even a plain text file that can be edited. OCR is the intermediate step necessary before any sort of manipulation or analysis.

Image of document and results of OCR testing
Image to text includes handwritten notes.
Source: https://source.opennews.org/articles/so-many-ocr-options/

Why is OCR the foundation of most legal tech?

If you have an inbox filled with scanned images, you’ll struggle to find text in a specific document. On the other hand, if all your files are Microsoft Word documents, you can press Ctrl+F to find text in any of them. That’s just scratching the surface. If you can fully digitize your documents, you can view and analyze them programmatically. You can run them through Grammarly for spelling and grammar checks. You can use AI to generate summaries. You can ask questions about your documents. You can automatically redact phrases. Essentially, you vastly increase the number of tasks you can automate if your computer can read the text. That makes changing an image file to a text file extremely valuable.
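
To make two of these tasks concrete, here is a minimal Python sketch of phrase search and automatic redaction. It assumes OCR has already run and that the extracted text is available as strings. The file names, the sample text, and the helper names `find_phrase` and `redact` are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical sample data: file names mapped to text that OCR has already extracted.
documents = {
    "lease.txt": "The Tenant shall pay rent to Acme Holdings LLC on the first day of each month.",
    "nda.txt": "Recipient shall keep all Confidential Information secret.",
}


def find_phrase(docs, phrase):
    """Return the names of documents containing the phrase (case-insensitive)."""
    needle = phrase.lower()
    return [name for name, text in docs.items() if needle in text.lower()]


def redact(text, phrases, mask="[REDACTED]"):
    """Replace every occurrence of each phrase with a mask (case-insensitive)."""
    for phrase in phrases:
        # re.escape treats the phrase as literal text, not as a regex pattern.
        text = re.sub(re.escape(phrase), mask, text, flags=re.IGNORECASE)
    return text


print(find_phrase(documents, "confidential information"))
print(redact(documents["lease.txt"], ["Acme Holdings LLC"]))
```

Neither function is possible on a JPEG of the same contract, which is the point: a few lines of code become useful only once the text exists as text.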

Thumbnail image of the blog post "Bootstrap Your Text Summarization Solution with the Latest Release from NLP-Recipes"
Law has a reputation for long documents. However, computer algorithms have the potential to change how legal professionals work.
Source: https://techcommunity.microsoft.com/t5/ai-customer-engineering-team/bootstrap-your-text-summarization-solution-with-the-latest/ba-p/1268809

The State of OCR

One of the best things about OCR is that recognizing characters in images is a massive challenge shared across many industries. If it were up to the legal profession to create this innovation alone, we would be very far behind. But today OCR matters well beyond law, whether for self-driving cars recognizing signage or for medical professionals sorting through their documents.

As a result, OCR has advanced incredibly quickly. In the last 2 years, the accuracy of the top cloud products (Google Cloud Vision and AWS Textract) has increased from the high 80s to above 99%. Most of the remaining inaccuracies relate to handwriting recognition, which is imperfect even for a live person. In my view, these recent advances have made OCR viable for mainstream usage in the legal profession.
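
The accuracy figures above do not say how they were measured. One common approach, assumed here purely for illustration, is character-level accuracy: count the single-character edits needed to turn the OCR output into the correct text, then divide by the length of the correct text. The sample strings below are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, ocr_output):
    """Share of reference characters recovered correctly, floored at 0."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1 - errors / len(reference))


reference = "Indemnification"
ocr_output = "lndemnificatlon"  # hypothetical OCR output: two characters misread as "l"
print(f"{character_accuracy(reference, ocr_output):.1%}")
```

Under this measure, a hypothetical 2,000-character page read at 88% accuracy contains about 240 wrong characters, while the same page at 99% contains about 20. That difference is what separates text that needs a full proofread from text that is usable as is.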

However, the progress in OCR is primarily in extracting text. Though there would be significant value in recreating a document in Word format, that has not been the goal of OCR. All significant cloud OCR products output plain text with all formatting stripped. They are not interested in, or rather not yet confident in, recreating formatting choices. This is a challenge for some of the uses a legal professional might want, but not all of them. Data analytics is one large area utterly unaffected by the lack of formatting.

Image of document and results of OCR testing
OCR typically retains minimal formatting. Legal formatting is particularly challenging.
Source: https://source.opennews.org/articles/so-many-ocr-options/

OCR and the legal profession

The tough pill to swallow is that so many of our legal tech resources are being used to reinvent OCR. Many companies in the contract lifecycle management space are tweaking their in-house OCR and advertising best-in-class text recognition. My frustration is that text extraction is an artificial problem that shouldn’t need to exist. And worse yet, the solution is costly and resource-intensive. I wish this effort were spent creating and improving actual tools rather than an intermediate step.

Digital-first and standardization

If you think about how we currently draft contracts, we actively trick and confuse computers. For example, the law firm letterhead adds unnecessary logos, columns, charts, and lists to a document. We also use multiple columns, which reflow the text around charts and images differently. Generally, we love odd formatting choices. The file format matters too: Microsoft Word is a terrible file format because it mixes formatting and text. Modern formats, like HTML, tend to separate the formatting and the text. It isn’t that the firm shouldn’t use a letterhead. It just shouldn’t come at the cost of usability.
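
As a small illustration of why a format like HTML is friendlier to computers, the sketch below recovers the plain text of a clause using only Python's standard-library `html.parser` module. The clause and the `TextExtractor` class name are hypothetical, and real contract HTML would need more care (for example, skipping script or style content).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text content of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; keep only non-empty pieces.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


# Hypothetical contract clause with bold and italic markup.
clause = "<p><b>Section 1.</b> The <i>Tenant</i> shall pay rent monthly.</p>"

parser = TextExtractor()
parser.feed(clause)
parser.close()
text = " ".join(parser.chunks)
print(text)
```

Because the formatting lives in tags that are clearly marked as tags, the text comes out clean without any image recognition at all.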

The fascinating thing is that if we as a profession could agree on certain standards and adhere to certain design choices, we would be much better off. Even better, if we embraced digital contracting, we could completely ignore the OCR problem.

Nevertheless, the silver lining is that OCR matters less as digital contracting is adopted more widely. Companies are investing increasing amounts in OCR but getting diminishing returns as more of them move to fully digital contracting. So my personal plea is that we should move towards digital as quickly as possible.