HTML to PDF Generator

Posted on: December 29th, 2013 by taff

I've just finished building an HTML to PDF converter. The client couldn't find any plugin that met the requirements so they decided upon a custom solution which I had the privilege of building.

The first step was to find a PHP library that could convert HTML to PDF. I decided to go with MPDF, as a few tests had shown it to convert HTML to PDF really fast.

With the library chosen, the first part of the conversion was to get the HTML. I simply used the PHP function file_get_contents:

$sourceCode = file_get_contents("http://www.domain.com/path/to/page.html");

We now have the Source Code stored in our $sourceCode variable. Simple and effective...unless the HTML isn't valid, which would cause problems later on.
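The fetch above silently yields false when the page cannot be read. As a minimal sketch of guarding against that, assuming PHP 7 or later: the helper name fetchSource is hypothetical and not part of the original script, and fetching a URL this way also assumes allow_url_fopen is enabled.

```php
<?php
// Hypothetical helper (not from the original script): read a page's
// source from a URL or file path, or fail loudly.
function fetchSource(string $location): string
{
    // file_get_contents() returns false on failure; the @ hides the
    // warning so the problem can be reported as an exception instead.
    $source = @file_get_contents($location);
    if ($source === false) {
        throw new RuntimeException("Could not read: " . $location);
    }
    return $source;
}
```

It would be called as $sourceCode = fetchSource("http://www.domain.com/path/to/page.html"); inside a try/catch block, so that a failed fetch never reaches the PDF step.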

The next step I took in converting the HTML to PDF was to remove the markup that I don't need. The advantages here are twofold:

1. Less HTML Markup to parse into PDF
2. Lowers the risk of invalid HTML Markup causing PDF generation to fail

Before I go "throwing away" markup that I don't need in my HTML to PDF script, I need to ensure that anything of use is kept. Theoretically there are two things that could be useful:

1. The contents of the title tag. I can use this to create a filename later on
2. The Print.CSS if I have one

Extract the page title with a regular expression

The first stage is to extract the page title with a regular expression. We'll save what we find in a variable. If we don't find anything we will create a generic title.

if (preg_match('/<title>(.*?)<\/title>/is', $sourceCode, $found)) {
    $title = "PDF: " . $found[1];
} else {
    $title = "A PDF from somewhere";
}
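A slightly more defensive version of the title extraction could look like the sketch below. The function name extractTitle is hypothetical. Allowing attributes on the title tag, decoding entities and trimming whitespace are assumptions about what real pages contain, not something the original script did.

```php
<?php
// Hypothetical helper: pull the page title out of raw HTML, falling
// back to a generic title when none is found.
function extractTitle(string $sourceCode): string
{
    // [^>]* tolerates attributes on the opening tag.
    // i = case-insensitive, s = the dot also matches newlines.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $sourceCode, $found)) {
        $text = trim(html_entity_decode($found[1], ENT_QUOTES, 'UTF-8'));
        if ($text !== '') {
            return "PDF: " . $text;
        }
    }
    return "A PDF from somewhere";
}

echo extractTitle("<head><TITLE>\n My &amp; Page </TITLE></head>"), "\n"; // PDF: My & Page
echo extractTitle("<body>no title here</body>"), "\n";                    // A PDF from somewhere
```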

If we wanted to use a media='print' CSS stylesheet we would extract it in a similar way, but I have one defined especially to make the PDF look glossier 😉 We don't need anything else from between the head tags, so let's clear that out. The PHP function strstr is perfect for this.

$noheader = strstr($sourceCode, '<body');
$body = strstr($noheader, '</body>', true);

which should get the opening body tag and everything up to the closing body tag into our $body variable.
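strstr returns false when the needle is missing, so a page without a body tag would leave $body empty. The sketch below shows one way to handle that. The function name extractBody and the fallback behaviour are assumptions, and stristr is swapped in for strstr so that an uppercase BODY tag is found too.

```php
<?php
// Hypothetical helper: keep everything from the opening body tag up
// to (not including) the closing body tag.
function extractBody(string $sourceCode): string
{
    // stristr() is the case-insensitive variant of strstr().
    $fromBody = stristr($sourceCode, '<body');
    if ($fromBody === false) {
        // Assumption: with no body tag, fall back to the whole document.
        return $sourceCode;
    }
    // Third argument true = return the part BEFORE the needle.
    $beforeClose = stristr($fromBody, '</body>', true);
    // If the closing tag is missing, keep everything after the opening tag.
    return $beforeClose === false ? $fromBody : $beforeClose;
}

echo extractBody('<html><head><title>x</title></head><body class="a"><p>Hi</p></body></html>');
// <body class="a"><p>Hi</p>
```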

Removing ID Attribute with a Regular Expression

Because certain IDs had been used more than once, which of course is invalid and would cause an exception to be thrown, I decided to simply remove all the IDs. A simple one-liner does this, so that the HTML can be parsed by MPDF later:

$body = preg_replace('#\s(id)="[^"]+"#', '', $body);
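The pattern above only matches lowercase id attributes in double quotes. If the markup might also contain single-quoted or uppercase ID attributes, a wider pattern along these lines could be used. This is a sketch under that assumption, and the function name stripIds is hypothetical.

```php
<?php
// Hypothetical helper: remove id attributes regardless of case or
// quote style, leaving attributes such as data-id alone.
function stripIds(string $html): string
{
    // \s requires whitespace directly before "id", so "data-id" is
    // not matched; the alternation accepts either quote style.
    return preg_replace('#\sid\s*=\s*("[^"]*"|\'[^\']*\')#i', '', $html);
}

echo stripIds('<div id="main" class="content"><p ID=\'main\'>Text</p><span data-id="7"></span></div>');
// <div class="content"><p>Text</p><span data-id="7"></span></div>
```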

My example page had two kinds of elements that I needed to parse, both contained in a div with the class "content": a title in an h5 tag (the only h5 tag in the markup) and the content in one or more 'p' tags. PHP comes with the very useful DOMXPath, and my past knowledge of XPath made things a breeze.

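$dom = new DOMDocument();
@$dom->loadHTML($body); // DOMXPath needs a DOMDocument, not a string; @ hides warnings about invalid markup
$xpath = new DOMXPath($dom);
$htmlElements = array();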

Above we are simply preparing the HTML to be parsed.
The next step is to get our h5 element into a variable ready for parsing

// query() returns a DOMNodeList, so take the first (and only) h5 node
$contentTitle = $xpath->query("//*[@class='content']/h5")->item(0);

$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($contentTitle, true));
$headerElement = $tmp_dom->saveHTML();
$htmlElements['header'] = $headerElement;

The same applies below, but because we don't know how many elements we have we need a foreach loop.

$elements = $xpath->query("//*[@class='content']/p");
$contentElements = array();
foreach ($elements as $e):
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($e, true));
$contentElements[] = $tmp_dom->saveHTML();
endforeach;
$htmlElements['p'] = $contentElements;

All the content we need to convert from HTML to PDF is now stored in our $htmlElements array.
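To see the extraction end to end, here is a self-contained sketch run against a small sample string. The sample markup and the helper name outerHtml are assumptions made for illustration. It also passes the node straight to saveHTML() instead of importing it into a temporary document, which is a different technique from the one used above and needs PHP 5.3.6 or later.

```php
<?php
// Hypothetical helper: serialise a single node to its HTML markup.
function outerHtml(DOMNode $node): string
{
    return $node->ownerDocument->saveHTML($node);
}

// Sample markup, assumed for illustration only.
$body = '<body><div class="content"><h5>Title</h5><p>One</p><p>Two</p></div></body>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($body);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$htmlElements = array('header' => '', 'p' => array());

// item(0) is null when the query matches nothing, so check before use.
$h5 = $xpath->query("//*[@class='content']/h5")->item(0);
if ($h5 !== null) {
    $htmlElements['header'] = outerHtml($h5);
}

foreach ($xpath->query("//*[@class='content']/p") as $p) {
    $htmlElements['p'][] = outerHtml($p);
}

print_r($htmlElements);
```

Note that @class='content' only matches elements whose class attribute is exactly "content"; an element with several classes would need a contains() test instead.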

The rest is pretty easy thanks to MPDF.

include("path/to/libraries/mpdf.php");

$mpdf=new mPDF('c', 'A4', 0, 'Arial', 10, 10, 10, 10, 10, 10);
//The title we extracted earlier
$mpdf->SetTitle($title);
$mpdf->SetDisplayMode('fullpage');

Because I want to load a custom stylesheet just for the PDF generation, I simply do this

$stylesheet = file_get_contents('path/to/pdf_print.css');
$mpdf->WriteHTML($stylesheet,1);

The second parameter with the value 1 simply tells MPDF that this is a stylesheet and nothing else. MPDF then does the magic.
We now just need to add our markup to the $mpdf variable like this:

$mpdf->WriteHTML("<body>");
$mpdf->WriteHTML("<h1>" . $htmlElements['header'] . "</h1>");
foreach ($htmlElements['p'] as $paragraph):
	$mpdf->WriteHTML($paragraph);
endforeach;
$mpdf->WriteHTML("</body>");
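// Assumption: build the filename from the title we extracted earlier
$filename = preg_replace('/[^A-Za-z0-9_-]+/', '_', $title) . '.pdf';
$mpdf->Output($filename,'I');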

When we now execute our script, the 'I' destination sends the PDF straight to the browser, which displays it inline; passing 'D' instead forces the browser to offer it as a download. I hope this helps to show how easy it is to convert HTML to PDF.