Reading “clean” text from a PDF with PHP

Portable Document Format (PDF) is a file format created for document exchange. Each PDF file encapsulates a complete description of a fixed-layout 2D document (and, with Acrobat 3D, embedded 3D documents), including the text, fonts, images, and 2D vector graphics that make up the document.

PDF file structure

First, let’s look inside a PDF file.

PDF implements documents as a hierarchy of tagged objects organized into trees and/or linked lists. The objects can encapsulate various types of content, or attributes, or pointers to external resources.

There are eight basic kinds of objects in PDF: Booleans, numbers, names, strings, arrays, dictionaries, streams and the null object. Let’s take a look at some of these objects we need to work with.

Strings

In PDF, a literal string is a series of 8-bit bytes enclosed in parentheses. A string can be split across several lines by ending a line with a backslash (\); neither the backslash nor the line break that follows it is part of the string. For example:

( This is a string. )
( This is a longer \
string. )

Any 8-bit value can be written inside a literal string as an octal escape of the form \ddd, where ddd is the octal number. Alternatively, a whole string can be written in hexadecimal, two hex digits per byte, enclosed in angle brackets, for example <48656C6C6F>. Later we will search these strings for the text data.
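The escaping rules above can be sketched in PHP. This is a minimal illustration and not part of the article's source code: decodeLiteralString and decodeHexString are hypothetical helpers that expect the string body without its outer delimiters.

```php
<?php
// Hypothetical helper: decode the body of a PDF literal string,
// i.e. the bytes between the outer parentheses.
function decodeLiteralString(string $body): string
{
    $named = ['n' => "\n", 'r' => "\r", 't' => "\t", 'b' => "\x08", 'f' => "\x0C"];
    $out = '';
    $len = strlen($body);
    for ($i = 0; $i < $len; $i++) {
        $ch = $body[$i];
        if ($ch !== '\\') {
            $out .= $ch;
            continue;
        }
        if (++$i >= $len) {
            break; // lone trailing backslash: nothing left to escape
        }
        $next = $body[$i];
        if (isset($named[$next])) {
            $out .= $named[$next];
        } elseif (strpos('01234567', $next) !== false) {
            // Up to three octal digits form one byte.
            $octal = $next;
            while (strlen($octal) < 3 && $i + 1 < $len
                && strpos('01234567', $body[$i + 1]) !== false) {
                $octal .= $body[++$i];
            }
            $out .= chr(octdec($octal) & 0xFF);
        } elseif ($next === "\r" || $next === "\n") {
            // Backslash at end of line: line continuation, emit nothing.
            if ($next === "\r" && $i + 1 < $len && $body[$i + 1] === "\n") {
                $i++;
            }
        } else {
            $out .= $next; // \( \) \\ and unknown escapes keep the character
        }
    }
    return $out;
}

// Hypothetical helper: decode the body of a hex string (between < and >).
function decodeHexString(string $body): string
{
    $hex = preg_replace('/[^0-9A-Fa-f]/', '', $body);
    if (strlen($hex) % 2 === 1) {
        $hex .= '0'; // a missing final digit is treated as 0
    }
    return hex2bin($hex);
}
```

For instance, decodeHexString('48656C6C6F') yields the bytes of "Hello". How those bytes map to characters still depends on the font encoding, which this sketch does not touch.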

Arrays

An array is a sequence of PDF objects, enclosed in square brackets. For example:

[(Hello,)10(world!)]
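A flat array like the one above can be split into its items with a regular expression. The following is a minimal sketch rather than the article's code: parseFlatArray is a hypothetical helper, and it assumes the array holds only numbers and literal strings without nested parentheses.

```php
<?php
// Hypothetical helper: split a flat PDF array into string and number items.
// Assumption: no nested arrays, dictionaries, names or hex strings inside.
function parseFlatArray(string $array): array
{
    $items = [];
    // Either a literal string (escaped characters allowed) or a number.
    $pattern = '/\(((?:\\\\.|[^()\\\\])*)\)|([+-]?(?:\d+\.?\d*|\.\d+))/s';
    preg_match_all($pattern, trim($array, "[] \r\n\t"), $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        if (isset($m[2]) && $m[2] !== '') {
            $items[] = ['type' => 'number', 'value' => (float) $m[2]];
        } else {
            // The string body is returned as written, escapes included.
            $items[] = ['type' => 'string', 'value' => $m[1]];
        }
    }
    return $items;
}
```

Applied to the example, this returns the string "Hello,", the number 10 and the string "world!" in that order.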

Dictionaries

A dictionary is a set of key/value pairs enclosed between two left angle brackets (<<) and two right angle brackets (>>):

<< /Length 4 0 R
   /Filter /FlateDecode
>>

A dictionary assigns properties to an object. We will use this data to determine how to decode the stream, to find its length, or, for example, to skip the current object (if it is an image).
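Reading such properties can be sketched as follows. parseFlatDictionary is a hypothetical helper, not the getObjectOptions function from the article's source. It assumes a flat dictionary (no nested dictionaries or arrays) whose names consist of letters, digits and underscores only.

```php
<?php
// Hypothetical helper: read the first flat dictionary found in an object
// into an associative array of name => raw value.
function parseFlatDictionary(string $object): array
{
    $options = [];
    if (!preg_match('/<<(.*?)>>/s', $object, $dict)) {
        return $options; // no dictionary present
    }
    // A key is a name (/Key); its value is either another name (/Value)
    // or everything up to the next slash.
    preg_match_all('#/(\w+)\s*(/\w+|[^/]*)#', $dict[1], $pairs, PREG_SET_ORDER);
    foreach ($pairs as $pair) {
        $options[$pair[1]] = ltrim(trim($pair[2] ?? ''), '/');
    }
    return $options;
}
```

With the dictionary shown above, the result maps Length to "4 0 R" (an indirect reference that would still need resolving) and Filter to "FlateDecode".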

Streams

A stream is a sequence of 8-bit bytes between the keywords stream and endstream. Any type of content made up of raw binary data is represented by a stream.

Streams are represented as objects (see below), which also means the stream will be bracketed by obj and endobj keywords. Before the stream keyword there must be a stream attribute dictionary, giving information about stream length (/Length key) and, often, the kind of compression employed (/Filter key).

As an example, a small text stream might look like:

2 0 obj
<<
/Length 51
>>
stream
BT
/F1 12 Tf
72 712 Td (A short text stream.) Tj
ET
endstream
endobj

In this example, the text itself is given as a string followed by the display text operator Tj.
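Getting at the stream bytes and undoing the common FlateDecode filter might look like the sketch below. extractStreamData and inflateStream are hypothetical helpers and not the article's getDecodedStream. The sketch assumes PHP's zlib extension is available, that /Length is a direct integer when present, and that only FlateDecode (or no filter) is used.

```php
<?php
// Hypothetical helper: return the raw bytes of the stream inside one object.
function extractStreamData(string $object): ?string
{
    // The data starts right after the end-of-line that follows "stream".
    if (!preg_match('/stream\r?\n/', $object, $m, PREG_OFFSET_CAPTURE)) {
        return null;
    }
    $start = $m[0][1] + strlen($m[0][0]);
    $header = substr($object, 0, $m[0][1]);

    // Prefer a direct integer /Length; references like "4 0 R" are not resolved here.
    if (preg_match('#/Length\s+(\d+)\b(?!\s+\d+\s+R\b)#', $header, $len)) {
        return substr($object, $start, (int) $len[1]);
    }
    // Fallback: cut at "endstream". This can drop trailing line-break bytes
    // that really belong to binary data.
    $end = strpos($object, 'endstream', $start);
    return $end === false ? null : rtrim(substr($object, $start, $end - $start), "\r\n");
}

// Hypothetical helper: undo the filter named in the stream dictionary.
function inflateStream(string $data, ?string $filter): ?string
{
    if ($filter === null || $filter === '') {
        return $data; // stored uncompressed
    }
    if ($filter !== 'FlateDecode') {
        return null; // other filters are out of scope for this sketch
    }
    $decoded = @gzuncompress($data); // FlateDecode data is in zlib format
    return $decoded === false ? null : $decoded;
}
```

A full reader would also handle filter chains and predictor parameters, and would resolve an indirect /Length before slicing the data.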

Objects

An object can wrap a value of any PDF data type (Boolean, number, name, string, etc.) between the obj and endobj keywords. We are primarily interested in objects that contain streams.
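Splitting a file into its objects can be sketched with one regular expression. splitObjects is a hypothetical helper. It assumes each object is introduced by an object number and a generation number before the obj keyword, and that the word endobj does not occur inside binary stream data.

```php
<?php
// Hypothetical helper: index the bodies of all indirect objects in a PDF
// by "objectNumber generationNumber".
function splitObjects(string $pdf): array
{
    $objects = [];
    preg_match_all('/(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj\b/s', $pdf, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $objects[$m[1] . ' ' . $m[2]] = trim($m[3]);
    }
    return $objects;
}
```

Keeping the object and generation numbers as keys makes it possible to look up indirect references such as "4 0 R" later.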

How to get “clean” text?

So, where should we look for text objects in a PDF document? The answer is simple: we look for objects that contain streams.

A few more things to consider:

  • The text in a stream is enclosed between the BT (begin text) and ET (end text) keywords.
  • Text is displayed when a Tj (show text) or TJ (show text with individual character positioning) operator follows a text string or an array of strings.
  • PDF supports individual character positioning, so the distance between each pair of characters can be set to an arbitrary value.
  • PDF supports composite fonts, in which a single character is encoded by one or more bytes of the string. In this case the code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap. PDF also uses special ToUnicode CMaps to map character codes to Unicode values.
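The first three points can be sketched as a small extractor for decoded content streams. extractShownText is a hypothetical helper, not the article's getDirtyTexts. It assumes single-byte text with no CMap lookup and literal strings without nested parentheses. The gap threshold of -200 is an arbitrary illustration value: numbers in a TJ array are subtracted from the current position, so a large negative number widens the gap and is treated here as a word space.

```php
<?php
// Hypothetical helper: collect the text shown by Tj and TJ inside each
// BT ... ET block of an already decoded content stream.
function extractShownText(string $content, float $gapThreshold = -200.0): array
{
    $str = '\((?:\\\\.|[^()\\\\])*\)';      // a literal string
    $num = '[+-]?(?:\d+\.?\d*|\.\d+)';      // a number
    $opPattern = '/(' . $str . ')\s*Tj|\[((?:' . $str . '|[^\]()])*)\]\s*TJ/s';
    $itemPattern = '/(' . $str . ')|(' . $num . ')/s';
    // Drop the outer parentheses and unescape \( \) \\ only.
    $unescape = fn(string $s): string =>
        preg_replace('/\\\\([()\\\\])/', '$1', substr($s, 1, -1));

    $lines = [];
    preg_match_all('/\bBT\b(.*?)\bET\b/s', $content, $blocks);
    foreach ($blocks[1] as $block) {
        $text = '';
        preg_match_all($opPattern, $block, $ops, PREG_SET_ORDER);
        foreach ($ops as $op) {
            if ($op[1] !== '') {            // (string) Tj
                $text .= $unescape($op[1]);
                continue;
            }
            // [ ... ] TJ: strings interleaved with position adjustments
            preg_match_all($itemPattern, $op[2], $items, PREG_SET_ORDER);
            foreach ($items as $item) {
                if ($item[1] !== '') {
                    $text .= $unescape($item[1]);
                } elseif ((float) $item[2] < $gapThreshold) {
                    $text .= ' ';           // wide gap: assume a word break
                }
            }
        }
        if ($text !== '') {
            $lines[] = $text;
        }
    }
    return $lines;
}
```

For composite fonts the extracted bytes are character codes rather than readable text, so a real reader would still have to translate them through the font's ToUnicode CMap.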

Let’s read!

Now we have enough theory to read our first PDF file. Below are the most interesting parts of the code, with comments, and a link to the full source code.

function pdf2text($filename) { 

    // Read the raw bytes of the PDF file.
    $infile = @file_get_contents($filename);
    if ($infile === false || $infile === "")
        return "";

    // Get all text data.
    $transformations = array();
    $texts = array();

    // Get the list of all objects.
    preg_match_all("#obj(.*)endobj#ismU", $infile, $objects);
    $objects = @$objects[1];

    // Select objects with streams.
    for ($i = 0; $i < count($objects); $i++) {
        $currentObject = $objects[$i];

        // Check if an object includes data stream.
        if (preg_match("#stream(.*)endstream#ismU", $currentObject, $stream)) {
            $stream = ltrim($stream[1]);

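            // Read the object's dictionary. Streams that carry /Length1, /Type
            // or /Subtype (embedded fonts, images and the like) are not page
            // text, so skip them.
            $options = getObjectOptions($currentObject);
            if (!(empty($options["Length1"]) && empty($options["Type"]) && empty($options["Subtype"])))
                continue;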

            // So, we have text data. Decode it.
            $data = getDecodedStream($stream, $options); 
            if (strlen($data)) {
                if (preg_match_all("#BT(.*)ET#ismU", $data, $textContainers)) {
                    $textContainers = @$textContainers[1];
                    getDirtyTexts($texts, $textContainers);
                } else
                    getCharTransformations($transformations, $data);
            }
        }

    }

    // Analyze text blocks taking into account character transformations and return results.
    return getTextUsingTransformations($texts, $transformations);
}

You can find the source code HERE.

Note that this code parses simple PDF files correctly. You can use it as a basis and improve it according to your needs.

Usage example

To read a PDF file (e.g. sample.pdf) and display the extracted plain text in the browser window, add the following code to the source file before the first function.

$result = pdf2text('sample.pdf');
echo $result;

Do not forget to replace sample.pdf with your PDF file name.
