Extract Text from PDF using PHP

PDF (Portable Document Format) files are used to store text and image data for offline use. Sometimes a PDF file is also used to display text or graphics on a web page. Generally, a web viewer is used to embed the PDF file in the browser. When a PDF file is embedded this way, its text and graphics are not added to the HTML of the page. Since the PDF content is not part of the page markup, this can hurt SEO. To overcome this issue, you can extract the text content from the PDF and include it in the web page.

The PDF Parser library is very helpful for extracting elements from PDF files using PHP. This PHP library parses PDF files and extracts the text content from all the pages. Objects, headers, metadata, and text can be parsed from the PDF file. This tutorial will show you how to extract text from PDF files using PHP.

In this example script, we will use the PDF Parser library to extract text from PDF with PHP. Also, we will show how you can upload PDF files and extract text data on the fly using PHP.

Install PDF Parser Library

Run the following command to install the PDF Parser library using Composer.

composer require smalot/pdfparser

Note: You don’t need to download the PDF Parser library separately; all the required files are included in the source code. Download the source code if you want to install and use PDF Parser without Composer.

Include the autoloader to load the PDF Parser library and helper functions in the PHP script.

include 'vendor/autoload.php';

Extract Text from PDF

The following code snippet extracts all the text content from a PDF file using PHP.

  • Initialize and load the PDF Parser library.
  • Specify the source PDF file from which the text content will be retrieved.
  • Parse the PDF file using the parseFile() method of the Parser class.
  • Extract text from the PDF using the getText() method of the document object returned by parseFile().
// Initialize and load PDF Parser library
$parser = new \Smalot\PdfParser\Parser();

// Source PDF file to extract text
$file = 'path-to-file/Brochure.pdf';

// Parse pdf file using Parser library
$pdf = $parser->parseFile($file);

// Extract text from PDF
$textContent = $pdf->getText();
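
If you need the text of each page separately, for example to show it page by page, a minimal sketch could look like the following. It assumes the document object returned by parseFile() provides a getPages() method and that each page object provides getText(). The helper name extractPageTexts() is hypothetical and not part of the library.

```php
<?php
// Hypothetical helper: collect the text of each page separately.
// $pdf is expected to be the document object returned by parseFile();
// only its getPages() method and each page's getText() method are used.
function extractPageTexts(object $pdf): array
{
    $pages  = [];
    $number = 1; // page numbers shown to readers usually start at 1

    foreach ($pdf->getPages() as $page) {
        $pages[$number++] = trim($page->getText());
    }

    return $pages;
}

// Usage with the library (assumes vendor/autoload.php and the sample file exist):
// include 'vendor/autoload.php';
// $parser = new \Smalot\PdfParser\Parser();
// $pdf    = $parser->parseFile('path-to-file/Brochure.pdf');
// foreach (extractPageTexts($pdf) as $number => $text) {
//     echo "Page $number: " . strlen($text) . " bytes\n";
// }
```

Keeping the helper independent of the Parser class means it can be tried with any object that exposes the same two methods.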

Upload PDF File and Extract Text

This example code snippet shows you the step-by-step process to upload PDF files and extract the text using PHP.

PDF File Upload Form:
Define HTML elements for file uploading form.

<form action="submit.php" method="post" enctype="multipart/form-data">
    <div class="form-input">
        <label for="pdf_file">PDF File</label>
        <input type="file" name="pdf_file" id="pdf_file" accept=".pdf,application/pdf" required="">
    </div>
    <input type="submit" name="submit" class="btn" value="Extract Text">
</form>

On form submission, the selected file is submitted to the server-side script for further processing.

Server-side Script (submit.php) to Extract Text from Uploaded PDF:
The following code is used to handle the submitted file and extract text from the PDF.

  • Retrieve file name using $_FILES in PHP.
  • Get file extension using pathinfo() function with PATHINFO_EXTENSION filter.
  • Validate the file to check whether it is a valid PDF file.
  • Retrieve file path using tmp_name in $_FILES.
  • Parse uploaded PDF file and extract text content using PDF Parser library.
  • Format text content by replacing the new line (\n) with line break (<br/>) using nl2br() function in PHP.
$pdfText   = '';
$statusMsg = '';

if (isset($_POST['submit'])) {
    // If file is selected
    if (!empty($_FILES["pdf_file"]["name"])) {
        // Get the file name and extension
        $fileName = basename($_FILES["pdf_file"]["name"]);
        $fileType = strtolower(pathinfo($fileName, PATHINFO_EXTENSION));

        // Allow certain file formats
        $allowTypes = array('pdf');
        if (in_array($fileType, $allowTypes)) {
            // Include autoloader file
            include 'vendor/autoload.php';

            // Initialize and load PDF Parser library
            $parser = new \Smalot\PdfParser\Parser();

            // Source PDF file to extract text
            $file = $_FILES["pdf_file"]["tmp_name"];

            try {
                // Parse pdf file using Parser library
                $pdf = $parser->parseFile($file);

                // Extract text from PDF
                $text = $pdf->getText();

                // Escape HTML special characters, then add line breaks
                $pdfText = nl2br(htmlspecialchars($text));
            } catch (\Exception $e) {
                // The parser throws an exception when the file cannot be read as a PDF
                $statusMsg = '<p>Sorry, the PDF file could not be parsed.</p>';
            }
        } else {
            $statusMsg = '<p>Sorry, only PDF files are allowed.</p>';
        }
    } else {
        $statusMsg = '<p>Please select a PDF file to extract text.</p>';
    }
}

// Display status message and text content
echo $statusMsg;
echo $pdfText;
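
The extension check above relies on the file name sent by the browser, which the user controls. As an optional hardening step, the sketch below also checks the upload error code and the first bytes of the file, assuming a PDF starts with the "%PDF-" signature. The function names hasPdfSignature() and validatePdfUpload() are hypothetical helpers, not part of PHP or the PDF Parser library.

```php
<?php
// Hypothetical helper: return true when the file starts with the "%PDF-" signature
function hasPdfSignature(string $path): bool
{
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }

    // Read only the first five bytes of the file
    $header = fread($handle, 5);
    fclose($handle);

    return $header === '%PDF-';
}

// Hypothetical helper: validate one entry of $_FILES.
// Returns an error message, or null when the upload looks usable.
function validatePdfUpload(array $upload): ?string
{
    // Any code other than UPLOAD_ERR_OK means the upload did not complete
    if (($upload['error'] ?? UPLOAD_ERR_NO_FILE) !== UPLOAD_ERR_OK) {
        return 'The file could not be uploaded.';
    }

    if (!hasPdfSignature($upload['tmp_name'])) {
        return 'The uploaded file does not look like a PDF.';
    }

    return null;
}
```

In submit.php, this could be called as validatePdfUpload($_FILES["pdf_file"]) before parsing, with the returned message assigned to $statusMsg when it is not null. In a real upload handler you may also want to confirm the temporary file with PHP's is_uploaded_file() function.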
