PDF Text Extraction

Text-based vs. Image-based PDF Files

When extracting text from PDF files, it is important to know how the text is represented in the file. The two most common types of PDF are text-based and image-based. Creating a raw text representation of the former is fairly straightforward, as the text is directly selectable. The latter requires additional tools, as described below.

An example of a text-based PDF

An example of an image-based PDF

As demonstrated in the above example, extracting the contents of a text-based PDF is fairly straightforward. Simply highlight the portion of the text that you want to extract (Ctrl+A is the usual shortcut to select all if you'll be using the entire document), copy it, and paste it into a text editor. Some PDF readers, such as Adobe Reader, also have the option to export PDF files to plain text. To do this in Adobe Reader, perform the following actions:

  1. Select "File" → "Save As Other..." → "Text..."
  2. A dialog box will appear, prompting you to save the file. Choose an appropriate file name and inspect the file.

You can see the results of the above process here.
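For more than a handful of files, the same export can be scripted. The following is a minimal sketch, assuming the third-party pypdf package is installed; the function names and the file paths in the usage note are illustrative and not part of any tool mentioned above. It only works on text-based PDFs: pages without a text layer will come back empty.

```python
from pathlib import Path


def join_pages(page_texts):
    """Join per-page strings into one document, one blank line between pages."""
    return "\n\n".join(text.strip() for text in page_texts)


def pdf_to_text(pdf_path, txt_path):
    """Write the text layer of a PDF to a plain-text file; return the page count."""
    # Assumption: pypdf is installed (pip install pypdf). It is imported here
    # so that the helper above can be used without it.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can return an empty string for pages with no text layer.
    page_texts = [page.extract_text() or "" for page in reader.pages]
    Path(txt_path).write_text(join_pages(page_texts), encoding="utf-8")
    return len(page_texts)
```

A call such as pdf_to_text("report.pdf", "report.txt") (hypothetical file names) would then produce a plain-text file that should be inspected just like the one saved from the PDF reader.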

Using OCR

Not every PDF document that one encounters will contain pure text. Scans of books, for instance, are a common source material for PDF files, and while some scanners allow the user to save their scans as a text-searchable PDF, this will not always be the case. A common tool for extracting text from such documents is optical character recognition, or OCR. A number of software products include an OCR feature, both freeware and commercial. For this example, we'll be using Google Docs, which, despite its input limit of 2 MB, will suffice for our demonstration:

  1. To begin, log in to Google Docs. You'll need to instruct the site to attempt to convert your PDF (or image) files to text. To do so, click the Settings button in the upper-right corner. From there, select Upload Settings, and check the "Convert text from uploaded PDF and image files" option.
  2. Select the Upload button and choose files. Navigate to your PDF, and select it.
  3. After Google Docs has processed the file, select it in the browser window. This will open a new window, where the original PDF and the OCR-transcribed version will be displayed. There may be a delay during which a blank document is displayed, especially if the PDF is particularly large or complex. Simply wait for the text to appear.
  4. Make sure to verify the contents of the document, checking for any obvious errors in transcription. Keep in mind that the ability of any OCR software to read a PDF is highly dependent on the quality of the images it contains. Whenever possible, try to obtain the highest-quality scans available.
  5. Google Docs contains a save-to-text feature, which can be employed by selecting "File" → "Download as" → "Plain Text (.txt)".

You can see the results of the above process here.
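The verification in step 4 can be partly automated once the plain-text file has been downloaded. The sketch below is one possible heuristic, not a feature of Google Docs or any OCR product: it flags words that mix letters with digits or contain symbols that rarely occur in running prose, two common signs of misread characters. The function name and the list of odd symbols are assumptions, and legitimate tokens such as "2nd" will also be flagged, so the output is a list to review by eye rather than a list of confirmed errors.

```python
# Symbols that seldom appear inside ordinary words (an assumed, adjustable list).
ODD_SYMBOLS = set("|\\~^`{}\ufffd")


def suspicious_tokens(text):
    """Return words from OCR output that look like possible misreads."""
    flagged = []
    for token in text.split():
        # Drop surrounding punctuation so "word." is treated as "word".
        word = token.strip(".,;:!?\"'()[]")
        if not word:
            continue
        has_alpha = any(c.isalpha() for c in word)
        has_digit = any(c.isdigit() for c in word)
        has_odd = any(c in ODD_SYMBOLS for c in word)
        if (has_alpha and has_digit) or has_odd:
            flagged.append(word)
    return flagged


sample = "The c0mputer read the tit|e on page 12."
print(suspicious_tokens(sample))
```

Purely numeric tokens such as page numbers are left alone, since they are usually transcribed correctly.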

Although Google Docs is suitable for this simple example, you may find that it does not suit your individual needs. It is by no means the only OCR option, as many other programs are available. Additionally, for the conversion of physical documents to PDF files, newer scanners will often have an option to convert the scanned images to a PDF, or plain text, using an internal OCR capability.

Note that the PDF file format can hold combinations of text and images (as well as less common forms of data, such as audio and video). Because of this, the text that you require from a PDF might not be represented exclusively as text or exclusively as images. If you are unsure whether your document contains purely text or a mixture of text and images, check it beforehand. Most OCR software can process specific pages or ranges of pages, so take advantage of whichever of those capabilities are appropriate for your project.
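One way to check a mixed document is to look at how much text each page yields and send only the sparse pages to OCR. The sketch below assumes you already have a list of per-page strings from some extraction tool (how they are obtained is left open); the function names and the 20-character threshold are illustrative choices, not established conventions.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is too short to trust."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


def to_ranges(pages):
    """Collapse sorted page numbers into (first, last) ranges."""
    ranges = []
    for page in pages:
        if ranges and page == ranges[-1][1] + 1:
            # Extend the current range by one page.
            ranges[-1] = (ranges[-1][0], page)
        else:
            ranges.append((page, page))
    return ranges


# Pages 2, 3 and 5 have little or no text layer in this made-up example.
texts = ["x" * 50, "", "  ", "y" * 30, ""]
print(to_ranges(pages_needing_ocr(texts)))
```

The resulting ranges can be entered into OCR software that accepts page ranges, so that pages with a usable text layer are not re-recognized unnecessarily.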

Character Sets

When planning to perform OCR on a text, PDF or otherwise, keep in mind that some OCR programs are limited to recognizing Latin script. For those who are only concerned with English-language texts, this may not pose a problem, but it adds another layer of complexity to the text-extraction process when dealing with texts in other scripts, such as Cyrillic or the CJK scripts. Google Docs' PDF conversion software does allow specifying the language of the text, but, again, it is not always the optimal solution. Identify the needs of your project, and base any product-purchasing decisions on what best suits those needs.
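If a sample of the text is already available in digital form (for example, a few pages that were extracted successfully), a quick tally of the scripts it contains can help decide which OCR language support is needed. The following is a rough sketch using only Python's standard unicodedata module. It approximates the script by the first word of each character's Unicode name, which is an assumption that works for common cases such as LATIN, CYRILLIC and CJK but is not a full script classifier; the function name is illustrative.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count letters per script, using the first word of the Unicode name."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces and punctuation
        name = unicodedata.name(ch, "")
        # e.g. "CYRILLIC SMALL LETTER EM" -> "CYRILLIC"
        script = name.split(" ")[0] if name else "UNKNOWN"
        counts[script] += 1
    return counts


print(script_counts("Hello мир 世界"))
```

A document whose counts are dominated by a non-Latin script is a signal to confirm that the chosen OCR program supports that script before committing to it.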

