From several PDFs to a single spreadsheet – improve customer data collection with OCR in PDFfiller

OCR, Optical Character Recognition, OCR software, scan PDF, digital workflow, PDFfiller, extract data, merge PDF

OCR stands for Optical Character Recognition, which is a technology that converts an image of scanned or handwritten text into an editable computer format. OCR software is widely used in programs and apps to streamline workflow and avoid manually retyping text. This simple yet effective tool can bring a number of benefits to businesses when properly utilized. In this post, we’ll go over how PDFfiller implements OCR in our Extract in Bulk feature and why it’s so useful.


How do we use OCR?

Our users often have to deal with data collection from multiple documents and are constantly looking for ways to optimize their workflow. PDFfiller is well aware of this demand and created the Extract in Bulk feature, which allows users to copy lines of data from one or several documents and compile them into an Excel Spreadsheet.

Now let’s meet Lucy, an administrative assistant at The Best Medical Center. She has one hundred Patient Intake Forms and her task is to copy specific fields of patient information and compile them in a separate document. All data has to be transferred accurately and as soon as possible. Before, the most common way for Lucy to complete her task was to individually highlight the information in each file and then to manually copy and paste it into a separate document. Imagine how much time she spent on this physically and mentally draining task each day.

But with Extract in Bulk, Lucy only has to highlight the necessary fields in one document, which will serve as her template. After that, she can upload the remaining files and click a few buttons. The OCR technology will recognize the similar fields of information and convert them into Excel format for her, thus automating the process.

Get detailed instruction on how to Extract in Bulk in this video:


How does it work?

OCR technology is not that modern: it originated in the early 1800s and has been gradually evolving ever since. One of it’s more popular uses has been for reading text aloud to visually challenged people. And within recent decades, its use has become more widespread and sophisticated.

The main problem of this software is that a computer cannot “see”, unlike the human eye which automatically recognizes handwritten or printed letters and digits. An operating system perceives any scanned  symbols as a set of pixels. This makes correctly capturing an image and decoding it into machine-readable text an extremely complicated task for OCR. The software is challenged with recognizing a pattern of pixels and comparing it to the font existing in its database.

In the 1960s, a uniform font called OCR-A was invented and used for documents such as bank checks, bills and invoices. However, the ability to recognize just one font was not enough, so the system was taught to analyze the common features of letters, digits and punctuation marks. For example, the system was taught to recognize that the letter “A” contains two lines meeting at the top with a line crossing them in the middle. In this way, it could be taught how to identify multiple other fonts. As time went on, OCR programs gained more advanced features such as spellcheck to “guess” the right word, searching for it in the closest variant of its built-in dictionary whenever a symbol was not clearly recognized.



Save time by eliminating unnecessary document-filing routines with PDFfiller. 

Get a 30-day free trial