HTML to PDF Generator

Posted on: December 29th, 2013 by taff No Comments


I've just finished building an HTML to PDF converter. The client couldn't find any plugin that met the requirements so they decided upon a custom solution which I had the privilege of building.

The first step was to find a PHP library that could convert HTML to PDF. I decided to go with MPDF, as a few tests had shown it to convert HTML to PDF really fast.

The first part of converting HTML to PDF was to get the HTML. I simply used the PHP function

$sourceCode = file_get_contents("");

We now have the Source Code stored in our $sourceCode variable. Simple and effective...unless the HTML isn't valid, which would cause problems later on.

The next step I took in converting the HTML to PDF was to remove the markup that I don't need. The advantages here are twofold:

1. Less HTML Markup to parse into PDF
2. Lowers the risk of invalid HTML Markup causing PDF generation to fail

Before I go "throwing away" markup that I don't need in my HTML to PDF script, I need to ensure that anything of use is kept. Theoretically there are two things that could be useful:

1. The contents of the title tag. I can use this to create a filename later on
2. The Print.CSS if I have one

Extract the page title with a regular expression

The first stage is to extract the page title with a regular expression. We'll save what we find in a variable. If we don't find anything we will create a generic title.

if (preg_match('/<title>(.*?)<\/title>/is', $sourceCode, $found)) {
    $title = "PDF: " . $found[1];
} else {
    $title = "A PDF from somewhere";
}
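As a rough sketch of how the extracted title could later become a filename: the helper names extractTitle and titleToFilename and the sample markup below are invented for illustration and are not part of the original script.

```php
<?php
// Hypothetical helper: returns the text inside <title>, or a fallback.
function extractTitle(string $sourceCode, string $fallback = "A PDF from somewhere"): string
{
    // /i ignores case, /s lets the dot match newlines inside the title
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $sourceCode, $found)) {
        return "PDF: " . trim($found[1]);
    }
    return $fallback;
}

// Hypothetical helper: reduce a title to characters that are safe in a filename.
function titleToFilename(string $title): string
{
    $slug = preg_replace('/[^A-Za-z0-9]+/', '-', $title);
    return trim($slug, '-') . '.pdf';
}

// Sample markup, made up for this sketch.
$html = "<html><head><title>My Test Page</title></head><body></body></html>";
$title = extractTitle($html);
echo $title, "\n";
echo titleToFilename($title), "\n";
```

Replacing every run of non-alphanumeric characters with a hyphen is only one possible choice; it keeps the filename portable at the cost of dropping umlauts and other non-ASCII letters.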

If we wanted to reuse a media='print' CSS stylesheet we could extract it in a similar way, but I have one defined especially to make the PDF look glossier 😉 That means we don't need anything else from between the head tags, so let's clear that out. The PHP function strstr is perfect for this.

// The two needles are an assumption; the post shows them as empty strings
$noheader = strstr($sourceCode, '<body');
$body = strstr($noheader, '</html>', true);

This should leave the body tags and whatever is in between in our $body variable.
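To see what the two strstr calls do, here is a small self-contained sketch. The sample markup and the needles '<body' and '</html>' are my assumptions, chosen so that the result matches the description above.

```php
<?php
// Sample markup, invented for this sketch.
$sourceCode = "<html><head><title>Demo</title></head>"
            . "<body class=\"page\"><p>Hello</p></body></html>";

// Everything from the opening body tag onwards (the needle is included).
$noheader = strstr($sourceCode, '<body');

// With the third argument set to true, strstr() returns the part
// BEFORE the needle, so cutting at </html> keeps the closing </body>.
$body = strstr($noheader, '</html>', true);

echo $body, "\n";
```

Note that strstr() returns false when the needle is not found, so a real script would want to check both results before carrying on.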

Removing ID Attribute with a Regular Expression

Because certain IDs had been used more than once, which is of course invalid and would cause an exception to be thrown, I decided to simply remove all the IDs. For this I used a simple one-liner, which allows the HTML to be parsed by mPDF later:

$body = preg_replace('#\s(id)="[^"]+"#', '', $body);
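A quick sketch of what this pattern does and does not catch. The sample markup and the second, quote-agnostic pattern are my own additions for illustration.

```php
<?php
// Sample markup, invented for this sketch: one double-quoted id,
// one single-quoted id and an unrelated data-id attribute.
$body = '<div id="main" class="content"><p id=\'intro\'>Hi</p><p data-id="7">x</p></div>';

// The pattern from the post: only double-quoted id attributes are removed.
$doubleOnly = preg_replace('#\s(id)="[^"]+"#', '', $body);

// Assumed variant: accept either quote style via a backreference.
$either = preg_replace('#\sid=(["\'])[^"\']*\1#i', '', $body);

echo $doubleOnly, "\n";
echo $either, "\n";
```

Because the pattern insists on whitespace directly before id, attributes such as data-id are left alone in both versions.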

My example HTML to PDF script needed two different kinds of element to be parsed, both contained in a div with the class "content": a title in an h5 tag (the only h5 tag in the markup) and the content in one or more p tags. PHP comes with the very useful DOMXPath class which, together with my past knowledge of XPath, made things a breeze.

$dom = new DOMDocument();
@$dom->loadHTML($body); // DOMXPath needs a DOMDocument, not a string
$xpath = new DOMXPath($dom);
$htmlElements = array();

Above we are simply preparing the HTML to be parsed.
The next step is to get our h5 element into a variable ready for parsing

// query() returns a DOMNodeList; item(0) is the single h5 node
$contentTitle = $xpath->query("//*[@class='content']/h5")->item(0);

$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($contentTitle, true));
$headerElement = $tmp_dom->saveHTML();
$htmlElements['header'] = $headerElement;

The same applies below, but because we don't know how many elements we have we need a foreach loop.

$elements = $xpath->query("//*[@class='content']/p");
$contentElements = array();
foreach ($elements as $e):
    $tmp_dom = new DOMDocument();
    $tmp_dom->appendChild($tmp_dom->importNode($e, true));
    $contentElements[] = $tmp_dom->saveHTML();
endforeach;
$htmlElements['p'] = $contentElements;

All the content we need to convert from HTML to PDF is now stored in our $htmlElements array.
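For anyone who wants to try the extraction step in isolation, here is a minimal self-contained sketch. The sample markup is invented, and instead of the temporary DOMDocument used above it passes each node straight to saveHTML(), which is a swapped-in shortcut rather than what my script did.

```php
<?php
// Sample markup, invented for this sketch.
$body = '<body><div class="content"><h5>Report</h5><p>First</p><p>Second</p></div></body>';

$dom = new DOMDocument();
// Collect libxml warnings about imperfect markup instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($body);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$htmlElements = array('header' => '', 'p' => array());

// query() returns a DOMNodeList, so take the first (only) h5.
$titleNode = $xpath->query("//*[@class='content']/h5")->item(0);
if ($titleNode !== null) {
    // saveHTML($node) serialises just that node.
    $htmlElements['header'] = $dom->saveHTML($titleNode);
}

// Each matching paragraph is serialised the same way.
foreach ($xpath->query("//*[@class='content']/p") as $p) {
    $htmlElements['p'][] = $dom->saveHTML($p);
}

print_r($htmlElements);
```

The XPath expression only matches elements whose class attribute is exactly "content"; an element with several classes would need contains() or a similar test.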

The rest is pretty easy thanks to MPDF.


$mpdf = new mPDF('c', 'A4', 0, 'Arial', 10, 10, 10, 10, 10, 10);
// The title we extracted earlier
$mpdf->SetTitle($title);

Because I want to load a custom stylesheet just for the PDF generation, I simply do this

$stylesheet = file_get_contents('path/to/pdf_print.css');
$mpdf->WriteHTML($stylesheet, 1);

The second parameter with the value 1 simply tells mPDF that this is a stylesheet and nothing else; the magic is then done.
We now just need to add our markup to the $mpdf variable like this:

$mpdf->WriteHTML("<h1>" . $htmlElements['header'] . "</h1>");
foreach ($htmlElements['p'] as $paragraph):
    $mpdf->WriteHTML($paragraph);
endforeach;

Finishing with a call to $mpdf->Output() using the 'D' (download) destination makes the script force the browser to offer the PDF as a download. I hope this helps to show how easy it is to convert HTML to PDF.

Uncaught Gearman Exception

Posted on: October 20th, 2013 by taff No Comments


Using an older version of the most excellent Gearman (0.8.1), I came across a phenomenon that wouldn't let me catch the ErrorException. Because I couldn't upgrade due to compatibility issues, I had to find a workaround.

Gearman Timeout Exception

The Gearman Timeout Exception occurred when the worker wasn't running when the runTasks() method was called. I thought I could prevent my worker file from expiring by using set_time_limit(0); unfortunately, that didn't help. So these were the problems that I faced:

  • Uncaught Gearman Exception setTimeout
  • Some way to reinstantiate the client worker, ideally when the exception happens
  • A cronjob to periodically check if the GearmanWorker is working

The Uncaught Gearman Exception

Although this was initially the largest problem, it was in fact the easiest to solve, by registering my own handler with set_error_handler like so:

set_error_handler( array($this, 'gearman_error_handler'), -1 & ~E_NOTICE & ~E_USER_NOTICE);
// ... run the Gearman tasks here ...
//reset error handler.
restore_error_handler();

public function gearman_error_handler($errno, $errstr, $errfile, $errline) {
    if ($errno == 2) {
        echo("A search went wrong");
    } else {
        $_log = \Logger::getInstance("preisvergleich_gearman_worker");
        echo("A search went wrong because the Gearman Workers weren't running");
    }
}

The call to restore_error_handler() resets back to the default error handling class that is implemented by the application.
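The same pattern can be tried without Gearman installed. In this sketch the class name SearchClient, the messages property and the simulated warning are all stand-ins that I made up; only the set_error_handler and restore_error_handler calls mirror the approach described above.

```php
<?php
// Hypothetical stand-in for the class that talks to Gearman.
class SearchClient
{
    public $messages = array();

    public function gearman_error_handler($errno, $errstr, $errfile, $errline)
    {
        $this->messages[] = "Caught [$errno]: $errstr";
        return true; // handled, so PHP's built-in handler is skipped
    }

    public function run()
    {
        // Route everything except notices to our own handler.
        set_error_handler(
            array($this, 'gearman_error_handler'),
            -1 & ~E_NOTICE & ~E_USER_NOTICE
        );

        // Stand-in for the Gearman call that raised the timeout warning.
        trigger_error("simulated timeout", E_USER_WARNING);

        // Hand control back to whatever handler was active before.
        restore_error_handler();
    }
}

$client = new SearchClient();
$client->run();
print_r($client->messages);
```

Returning true from the handler is what stops the warning from bubbling up any further; returning false would let PHP's normal error handling continue as well.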

Reinstantiate the Gearman Client

I found something on StackOverflow that got me thinking. Why not build a class to monitor the Gearman Worker? In particular, as I had more than one worker running, albeit for different tasks, being able to check if the task was running would also come in handy for the cronjob check. Using shell_exec I was then able to restart the worker by executing the PHP file if and when the worker dies on me again.
That obviously goes a bit beyond catching Gearman Exceptions but could well turn up in another blog post if someone finds it of interest. This is the method that I used:

protected static function callWorkerFile($strPathToWorker) {
    // start the worker in the background and return its PID
    return shell_exec("nohup php " . escapeshellarg($strPathToWorker) . " > /dev/null & echo $!");
}
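Since the command echoes the PID of the new process, the cronjob check could be built on it. The helper below is a hypothetical sketch for Unix-like systems, not the monitoring class I actually wrote; the name isWorkerRunning, the /proc fallback and the worker path in the comment are assumptions.

```php
<?php
// Hypothetical helper, not from the original class: checks whether a
// process with the given PID still exists on a Unix-like system.
function isWorkerRunning(int $pid): bool
{
    if ($pid <= 0) {
        return false;
    }
    if (function_exists('posix_kill')) {
        // Signal 0 only performs the existence/permission check;
        // no signal is actually delivered.
        return posix_kill($pid, 0);
    }
    // Assumed fallback for Linux when the posix extension is missing.
    return file_exists('/proc/' . $pid);
}

// shell_exec() returns the echoed PID with a trailing newline, so it
// would need trimming and casting before being stored, for example:
// $pid = (int) trim(callWorkerFile('/path/to/worker.php')); // placeholder path

// Checking our own PID should always report a running process.
var_dump(isWorkerRunning(getmypid()));
```

A cronjob could then call the helper with the stored PID and start the worker file again whenever it returns false.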

Which is the best PHP Framework in 2013?

Posted on: August 26th, 2013 by taff 4 Comments


Finding the best PHP framework is hard. Asking for PHP framework suggestions on a forum gets you nearly as many different opinions as replies. Asking on Stack Overflow normally gets the question closed because it doesn't fit their Q&A format.

There are so many aspects to take into account when choosing a PHP framework and I think it goes a lot further than how the syntax looks and what feels right.

Wikipedia offers a large tabled comparison of PHP frameworks allowing you to see what features are offered. There are however a few missing and I am not sure if all those green fields should be that colour.

We don't want to have to dabble in a new PHP framework in six months because a new project has performance as its highest priority and we decided to go with Zend. Maybe we don't want to go with CodeIgniter when it has an uncertain future. Maybe the fact that CakePHP has no templating engine puts us off; maybe just that fact appeals to us. To that end I think the best way is to set up a matrix listing the points that we feel are most relevant. Your grid may contain a lot more PHP frameworks. The following list has already been hit by the best-PHP-framework equivalent of the nerf bat, based on general performance.

Best PHP Framework based upon Performance

  • Yii
  • CakePHP / Liquify
  • Phalcon
  • Kohana
  • Codeigniter

I can already hear lots of objections about Cake and performance, but I left it in as I have used it in the past for a large project and it scaled well. Already knowing it would also speed up development.

How to choose the best PHP framework

My criteria when choosing the best PHP framework for me were based on the following:

What do I want out of the framework?

Generally I want it to be fast and functional. As long as I can include a class, or have some other way of pulling in third-party code, and get things working quickly, I only really need the following to do the bulk of the heavy lifting:

PHP Framework Functionality

  • Caching: All PHP frameworks have various caching capabilities so this isn't going to help with my decision.
  • Form validation / generation: This is one of the biggies. I want the ability to generate forms automatically and be able to implement validation for 90% of my fields whilst still half asleep 😉
  • Unit testing: I want to be able to use PHPUnit to automate app testing but may look at alternatives if the rest of the framework is godlike.
  • Session Management: What session features does the framework bring with it?
  • ACL: Some type of user management / access control out of the box is a must. Fortunately most offer just that.
  • Templating: Smarty, Twig, which template engine do I want? Do I even need one and if so what is the performance price I have to pay to avoid lots of php tags in my view?

Other Criteria

Just because the frameworks in our shortlist have (nearly) all the above functionality out of the box, there are a few other points we should take into account if we want a long and happy relationship with our framework.

  • Age: I'm not sure how you feel about this point, but I don't want my new framework to be too young. I also don't want it to still be supporting PHP4, so a happy medium needs to be found.
  • Javascript: If we have a favourite Javascript library, does the framework support it?
  • Documentation: We all know how important good documentation is. It will allow us to get so much more out of the framework, save us time and stop us swearing quite so often.
  • Community: When you are dabbling in your new framework prior to the final decision, take some time to look at the community. Active and friendly communities are a game winner.
  • Pros and Cons: These are the points that don't fit in somewhere else, but will influence your decision either positively or negatively. This could be the fact that a framework still supports an old version of PHP, maybe they don't have a dedicated forum etc.

These are the points that I use to help when deciding which is the best PHP framework for me. I look forward to hearing what criteria you use to help you make a decision.
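The matrix idea mentioned earlier could be sketched as a small weighted score sheet. Everything in this example is a placeholder: the criteria chosen, the weights, the names "Framework A" and "Framework B" and their scores are made up to show the mechanics, not to rate any real framework.

```php
<?php
// Weights say how much each criterion matters to you (made-up values).
$weights = array('performance' => 3, 'forms' => 3, 'documentation' => 2, 'community' => 1);

// Placeholder frameworks with placeholder scores from 1 (poor) to 5 (great).
$scores = array(
    'Framework A' => array('performance' => 5, 'forms' => 2, 'documentation' => 3, 'community' => 3),
    'Framework B' => array('performance' => 3, 'forms' => 5, 'documentation' => 4, 'community' => 4),
);

// Sum of weight * score over all criteria; a missing score counts as 0.
function weightedTotal(array $row, array $weights): int
{
    $total = 0;
    foreach ($weights as $criterion => $weight) {
        $total += $weight * ($row[$criterion] ?? 0);
    }
    return $total;
}

$totals = array();
foreach ($scores as $name => $row) {
    $totals[$name] = weightedTotal($row, $weights);
}
arsort($totals); // highest total first, keys preserved
print_r($totals);
```

The numbers matter less than the exercise of writing the weights down; it forces you to decide which criteria you actually care about before a shiny feature sways you.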

Alternative Syntax for Control Structures

Posted on: August 26th, 2013 by taff No Comments


PHP offers an alternative syntax, also known as colon syntax, for a number of control structures, including if, for, foreach and switch.

Here is an example of a nested foreach, if, else and switch using standard syntax.

Control Structures - Standard Syntax

$allowedExtensions = ["jpg", "gif", "png", "pdf"];
$filenames = ["test.jpg", "foo.gif", "temp.pdf", "spreadsheet.xls"];
foreach ($filenames as $filename) {
    $file_extn = substr($filename, strrpos($filename, '.') + 1);
    if (true === in_array($file_extn, $allowedExtensions)) {
        switch ($file_extn) {
            case "jpg":
            case "gif":
            case "png":
                echo "Ooh, a picture!";
                break;
            case "pdf":
                echo "A PDF";
                break;
        }
        echo "Output after switch";
    } else {
        echo "File type not allowed";
    }
    echo "Output after if";
}

Using tab stops we can make this readable, but what happens when the number of switch cases becomes so large that we need to scroll back up to see which structure opened the curly bracket we are currently looking at?

In my opinion the alternative syntax makes it easier, when browsing code, to see what is being closed where.

Control Structures - Alternative Syntax

$allowedExtensions = ["jpg", "gif", "png", "pdf"];
$filenames = ["test.jpg", "foo.gif", "temp.pdf", "spreadsheet.xls"];
foreach ($filenames as $filename):
    $file_extn = substr($filename, strrpos($filename, '.') + 1);
    if (true === in_array($file_extn, $allowedExtensions)):
        switch ($file_extn): case "jpg":
            case "gif":
            case "png":
                echo "Ooh, a picture!";
                break;
            case "pdf":
                echo "A PDF";
                break;
        endswitch;
        echo "Output after switch";
    else:
        echo "File type not allowed";
    endif;
    echo "Output after if";
endforeach;

It is important to point out that switch statements written in this format require the first case to be included with the statement. If you put the first case in a separate PHP block, you will get the following error:
Parse error: syntax error, unexpected T_INLINE_HTML, expecting T_ENDSWITCH or T_CASE or T_DEFAULT
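PHP treats anything between the switch statement and the first case, including whitespace outside the PHP tags, as output, and output is not allowed at that point. It is a commonly made mistake that is not often explained or warned against.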

You may want to find out more about the alternative syntax.

Encode Email Address with PHP and Javascript

Posted on: June 21st, 2012 by taff No Comments


Posting your email address anywhere on the web without encoding it in some way is not something you should do unless you want plenty of spam. In my opinion, "encryption" techniques like person[at]domain[dot]com aren't going to stop a lot of bots either. If they are smart enough to build a crawler that looks for email addresses with regular expressions, they are also going to be looking for [at]. If you have PHP and JavaScript available, you can protect yourself to a large extent with this useful snippet to encode mailto addresses.
This is a little PHP script that I use a lot.

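function encryptAddress($address){
	$output = "<script type=\"text/javascript\">";
	$output .= "var listOfEncryptedLetters=[";
	// append the ASCII code of every character, separated by commas
	for ($i = 0; $i < strlen($address); $i++) {
		$output .= ord(substr($address, $i, 1)) . ",";
	}
	$output = substr($output, 0, -1); // drop the trailing comma
	$output .= "]\n";
	$output .= "var newName='';\n";
	$output .= "for (var i=0; i<listOfEncryptedLetters.length; i++)\n";
	$output .= "newName += String.fromCharCode(listOfEncryptedLetters[i]);\n";
	$output .= "document.write('<a href=\"mailto:'+newName+'\">'+newName+'</a>')\n";
	$output .= "</script>";
	return $output;
}
echo encryptAddress("info@test.html");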

Using a simple PHP loop, we consecutively convert each letter of the string passed to the function into its equivalent ASCII value and add it to a JavaScript array.

var listOfEncryptedLetters=[105,110,102,111,64,116,101,115,116,46,104,116,109,108]
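The conversion step on its own can also be written without a loop and without trimming the trailing comma. This is just an alternative sketch; the function name addressToCharCodes is made up, and like the original it assumes single-byte (ASCII) addresses.

```php
<?php
// Sketch: build the comma-separated character-code list in one expression.
function addressToCharCodes(string $address): string
{
    // str_split() yields single bytes; ord() gives each one's ASCII code;
    // implode() joins them with commas, so no trailing comma is produced.
    return implode(',', array_map('ord', str_split($address)));
}

echo "var listOfEncryptedLetters=[" . addressToCharCodes("info@test.html") . "]\n";
```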

The next step is to output our encrypted ASCII code as Javascript, with which we generate our anchor with a mailto:.
The output should look something like this now:

<script type="text/javascript">var listOfEncryptedLetters=[105,110,102,111,64,116,101,115,116,46,104,116,109,108]
var newName='';
for (var i=0; i<listOfEncryptedLetters.length; i++)
newName += String.fromCharCode(listOfEncryptedLetters[i]);
document.write('<a href="mailto:'+newName+'">'+newName+'</a>')
</script>

If you don't have PHP available but would like this script, holler and I'll throw up a form to automatically generate code so you can just copy and paste. I hope this script to encode your email address with JavaScript helps. I am not even sure whether the crawlers execute JavaScript, so you may even get away with a plain document.write.


Deleting related items (i.e. all posts when a thread is deleted) with the CakePHP model is a breeze. All you need to do is make the association dependent (not dependant, a typo that cost me 20 minutes). The key belongs on the hasMany (or hasOne) side of the relationship, so in my Thread model I would have:

var $hasMany = array(
	'Post' => array(
	  'className' => 'Post',
	  'foreignKey' => 'thread_id',
	  'dependent' => true)
);

Calling $this->Thread->del($id) will now delete not only the thread, but also any related posts which have a corresponding thread_id. This removes the risk of orphaned data filling up your database.

Retrieving related data is also easy with the CakePHP model. A simple

$this->recursive = 1;

will get related data from the database, including data in a HasAndBelongsToMany relationship. Easy as pie...err cake.