PHP, XML, and Unicode

Update: in a comment John Cowan points out the obvious, that a UTF-8 escape sequence can never contain an ASCII character (because the high bit is always set, as I knew but failed to register). As a result, my xml_escape() function is way over-complicated. Thanks, John.

Update #2: in a comment, Jirka Kosek points out that PHP5 is actually using the also-excellent libxml instead of Expat — the PHP developers actually ported the expat-based, low-level interface to libxml so that it wouldn’t break legacy code. In that case, I’m especially impressed that my script produces byte-for-byte identical output with PHP4 and PHP5. I’m still looking for a problem with PHP’s XML+Unicode handling (other than the inconvenience of working with UTF-8 on the byte level).

Update #3: here’s a good summary of XML support in PHP5

A couple of weeks ago, Tim Bray posted about PHP and received a firestorm of comments, just as I did when I posted about PHP and Ruby on Rails almost a year ago. PHP generates a lot of passion, for good or for ill: my posting still gets a new comment every week or two.

As Tim updated his posting with comments, he linked to a two-year-old posting by Steve Minutillo about PHP4’s inability to detect character encodings in XML files and other Unicode bugs. That caught me by surprise — after all, PHP uses the venerable Expat as its XML parsing engine (the same engine used in most programming environments other than Java), and if Expat wasn’t getting things right, then the PHP people must have gone way out of their way to misconfigure it.

Testing Unicode support

To test XML character-encoding support in PHP, I used two PHP versions: 4.4.0, and 5.0.5 (which happen to be the current PHP4 and PHP5 heads in Ubuntu). I wrote a simple identity transform script (available for download at http://www.megginson.com/Software/xml-identity-transform.php — please consider it Public Domain) to read an XML file and write a simplified version of it back out again (I forgot to include processing instructions — sorry. I’ll fix that later.) The script always produces UTF-8 output, regardless of the input encoding. I ran it under both PHP4 and PHP5 against two XML source files with accented characters: one encoded in UTF-8, and the other encoded in ISO-8859-1 (with a suitable XML declaration). The script produces identical and correct UTF-8 output under both PHP4 and PHP5 (at least, the versions I tested). There is no conditional code based on the PHP version, but I did have to set a couple of options carefully.

Setting up a PHP XML parser

Here’s how I set up my XML parser in PHP:

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);

The first line creates the parser (I’m not using Namespaces for this example, or it would look a little different.) The second line requests that the parser report element names, attribute names and values, content, and everything else to my application using UTF-8, no matter what the input encoding was. The final option undoes a mind-numbingly stupid default in PHP, where all element and attribute names are converted to upper case before being passed on.
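The effect of the case-folding option is easy to see in isolation. The following is only a sketch: the helper names record_name(), ignore_end(), and element_names() are made up for illustration and are not part of the identity-transform script.

```php
<?php
// Illustrative helpers; record_name(), ignore_end(), and element_names()
// are hypothetical names, not part of the original script.
function record_name($parser, $name, $atts)
{
  global $seen;
  $seen[] = $name;  // remember each element name as the parser reports it
}

function ignore_end($parser, $name)
{
  // end tags are not needed for this check
}

function element_names($xml, $fold)
{
  global $seen;
  $seen = array();
  $parser = xml_parser_create();
  xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, $fold);
  xml_set_element_handler($parser, 'record_name', 'ignore_end');
  xml_parse($parser, $xml, true);
  return $seen;
}

echo implode(',', element_names('<doc><item/></doc>', true)), "\n";   // DOC,ITEM
echo implode(',', element_names('<doc><item/></doc>', false)), "\n";  // doc,item
```

With folding left on, a case-sensitive comparison against the names in the source document would fail, which is why the option is switched off before any handlers are registered.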

Next, I register my event handlers with the parser (this step should be familiar to anyone who has ever programmed with Expat or SAX):

xml_set_element_handler($parser, 'start_element', 'end_element');
xml_set_character_data_handler($parser, 'character_data');

The handlers themselves are naively simple, attempting to recreate the XML markup reported to them:

function start_element ($parser, $name, $atts)
{
  echo("<$name");
  foreach ($atts as $aname => $avalue) {
    echo " $aname=\"" . xml_escape($avalue) . '"';
  }
  echo(">");
}

function end_element ($parser, $name)
{
  echo("</$name>");
}

function character_data ($parser, $data)
{
  echo(xml_escape($data));
}

The only complicated bit happens in the xml_escape function. Unfortunately, since I’m dealing with raw UTF-8, I have to know a bit about UTF-8 encoding to do the escaping; otherwise, my code might mistake part of a multi-byte sequence for an ampersand and replace it with an entity reference (note: this is all unnecessary; see John Cowan’s comment):

function xml_escape ($s)
{
  $result = '';
  $len = strlen($s);
  for ($i = 0; $i < $len; $i++) {
    if ($s[$i] == '&') {
      $result .= '&amp;';
    } else if ($s[$i] == '<') {
      $result .= '&lt;';
    } else if ($s[$i] == '>') {
      $result .= '&gt;';
    } else if ($s[$i] == '\'') {
      $result .= '&apos;';
    } else if ($s[$i] == '"') {
      $result .= '&quot;';
    } else if (ord($s[$i]) > 127) {
      // skipping UTF-8 escape sequences requires a bit of work
      if ((ord($s[$i]) & 0xf0) == 0xf0) {
        // lead byte of a four-byte sequence: copy all four bytes
        $result .= $s[$i++];
        $result .= $s[$i++];
        $result .= $s[$i++];
        $result .= $s[$i];
      } else if ((ord($s[$i]) & 0xe0) == 0xe0) {
        // lead byte of a three-byte sequence
        $result .= $s[$i++];
        $result .= $s[$i++];
        $result .= $s[$i];
      } else if ((ord($s[$i]) & 0xc0) == 0xc0) {
        // lead byte of a two-byte sequence
        $result .= $s[$i++];
        $result .= $s[$i];
      }
    } else {
      // plain ASCII character that needs no escaping
      $result .= $s[$i];
    }
  }
  return $result;
}
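Following John Cowan’s comment, the escaping needs no knowledge of UTF-8 at all, because no byte of a multi-byte sequence can collide with an ASCII character. A minimal sketch of the simpler version, using the made-up name xml_escape_simple() (an assumption for illustration, not the function in the downloadable script):

```php
<?php
// Hypothetical simplified replacement for xml_escape(); the name is illustrative.
// Every byte of a multi-byte UTF-8 sequence has its high bit set, so a plain
// byte-wise replacement of the five ASCII markup characters is safe.
function xml_escape_simple($s)
{
  // strtr() with an array makes a single pass, so the "&" inside an
  // already-inserted "&amp;" is never escaped a second time.
  return strtr($s, array(
    '&'  => '&amp;',
    '<'  => '&lt;',
    '>'  => '&gt;',
    '\'' => '&apos;',
    '"'  => '&quot;',
  ));
}

// "é" is the two bytes 0xC3 0xA9 in UTF-8; they pass through untouched.
echo xml_escape_simple("caf\xC3\xA9 & <tea>"), "\n";
```

Unlike the longer version, this one also passes malformed byte sequences through unchanged rather than dropping them.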

The rest of my code is just the normal Expat parsing loop: open a file (or URL), feed it to Expat in buffered chunks, and then report that the input is finished.
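For readers who have not written that loop before, it might look roughly like this. This is a sketch under assumptions: the function name parse_xml_file(), the 4096-byte buffer, and the file name example.xml are all illustrative, and the actual script may differ.

```php
<?php
// Sketch of a chunked parsing loop. The name parse_xml_file() and the
// 4096-byte buffer size are illustrative assumptions.
function parse_xml_file($parser, $path)
{
  $fp = @fopen($path, 'rb');  // works for local files and, where enabled, URLs
  if (!$fp) {
    return false;
  }
  while (!feof($fp)) {
    $data = fread($fp, 4096);
    // The third argument tells the parser whether this is the final chunk.
    if (!xml_parse($parser, $data, feof($fp))) {
      fclose($fp);
      return false;
    }
  }
  fclose($fp);
  return true;
}

$parser = xml_parser_create();
// Handlers would be registered here, as shown earlier.
if (!parse_xml_file($parser, 'example.xml')) {  // hypothetical file name
  echo "parse failed or file not readable\n";
}
```

When the function returns false after a parse error, xml_error_string(xml_get_error_code($parser)) and xml_get_current_line_number($parser) give the details.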

So where are the problems?

There may be huge problems that I somehow missed in my brief test.
The PHP XML documentation is not entirely clear about input and output character encodings, probably because the documentation writers were themselves a bit confused about this stuff.
It is possible (even likely) that bugs existed in both the PHP4 and PHP5 codebases two years ago when Steve wrote his piece, but have since been fixed.
It is a bit tricky working with UTF-8, since you have to remember to detect escape sequences. A PHP library would be nice. Or better yet, hide it completely, like Java does. Still, it’s only a nuisance, not a show-stopper.
Steve referred to the PHP XML parser’s mangling numeric character references. Expat doesn’t do that. However, it is possible that people think numerical character references refer to their current encoding, rather than to the abstract Unicode character set, and that will get them into serious trouble.
Expat does not support all character encodings out of the box. In fact, XML parsers are required to support only UTF-8 and UTF-16 — use any other encoding (even ISO-8859-1) at your peril, since there’s no guarantee that other XML software will be able to read it.
People often forget to declare what encoding they’re using.
Anyone who serves XML documents as text/xml is going to get in trouble no matter what language people use, because of the reencoding that might take place.
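The point about numeric character references can be checked directly. In the sketch below (the names collect_data() and collect_text() are invented for illustration), an ISO-8859-1 document contains the same character twice, once as a raw byte and once as a numeric character reference. Because the reference names a Unicode code point rather than a byte in the document's encoding, both should come out as the same UTF-8 sequence.

```php
<?php
// Illustrative helpers; collect_data() and collect_text() are hypothetical names.
function collect_data($parser, $data)
{
  global $collected;
  $collected .= $data;  // character data may arrive in several pieces
}

function collect_text($xml)
{
  global $collected;
  $collected = '';
  $parser = xml_parser_create();
  xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
  xml_set_character_data_handler($parser, 'collect_data');
  xml_parse($parser, $xml, true);
  return $collected;
}

// 0xE9 is "é" in ISO-8859-1; &#233; is U+00E9, the same character.
$doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><p>\xE9 &#233;</p>";
echo bin2hex(collect_text($doc)), "\n";  // c3a920c3a9
```

The hex dump shows the UTF-8 bytes C3 A9 on both sides of the space, whichever way the character was written in the source.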

Most of these problems are not unique to PHP — XML is hard and confusing, Unicode is hard and confusing, and when you put the two together, there’s lots of opportunity for human error.

I’d be interested in the URLs of well-formed XML documents in supported encodings (UTF-8, UTF-16, US-ASCII, or ISO-8859-1, I think) that do not work properly in recent versions of PHP4 or PHP5 with the simple identity-transformation script I posted. If there are deep problems with PHP, XML, and Unicode, rather than just user confusion, I’d like to know about them.

8 Responses to PHP, XML, and Unicode

Jirka Kosek says:

March 1, 2006 at 1:00 pm

XML support in PHP5 is completely reworked and it is using libxml2 as its base, not expat.

If you want to work with XML seriously in PHP, you need at least version 5.1. Former versions were missing critical features like ability to bind prefixes to namespaces for XPath evaluation and so on.

PHP doesn’t support Unicode, it treats strings as a sequence of bytes. So you are responsible for correct string operations. This can be overcome using the mbstring extension, which can make many PHP functions UTF-8 aware.

Even in PHP 5.1 there are some unresolved issues:

SAX like parser — doesn’t report all XML events (compared to original Java SAX2); doesn’t have OO interface — handlers are just plain functions

SimpleXML (simple XML2OO mapping) — doesn’t support mixed content; namespaces are supported in a very inconvenient way

XMLReader (pull parser) — is missing several critical methods, including readString()

Due to missing Unicode support and some problems in XML APIs PHP is still far behind Java and .NET in XML support.

david says:

March 1, 2006 at 1:30 pm

Are you certain, Jirka, that the old xml_parser_create() interface isn’t still using Expat? If not, then I’m especially impressed that my script gives byte-for-byte identical output with PHP4 and PHP5.

John Cowan says:

March 1, 2006 at 3:17 pm

No multi-byte UTF-8 sequence can contain an ASCII character; that’s one of the design points of UTF-8. So you are taking precautions against a problem that doesn’t exist. (It does exist in UTF-16, however.)

Jirka Kosek says:

March 1, 2006 at 5:42 pm

[2] You can see which XML library is actually used in phpinfo() output in “xml” section.

Authors of XML extensions in PHP5 carefully modelled behaviour of xml_ functions using new underlying library. This is good for backward compatibility, OTOH some problems were transferred to the new API (e.g. see http://www.codecomments.com/archive222-2005-9-598406.html).

Jirka Kosek says:

March 1, 2006 at 5:46 pm

And one additional note. If you are using XML under PHP5 it is possible to read documents in any encoding supported by libxml2. AFAIK libxml2 uses iconv for encoding handling, so you can load documents in virtually any encoding, including iso-8859-x, windows-125x and so on.

david says:

March 1, 2006 at 5:48 pm

Thanks for the info, Jirka. phpinfo() shows versions for both libxml and expat with PHP4, and libxml and libxml2 for PHP5.

Aristotle Pagaltzis says:

March 3, 2006 at 6:23 pm

Meta note: for some reason, most of your links have xhref instead of href attributes, and in the one tag where the attribute is spelled href its value is empty.

david says:

March 3, 2006 at 8:46 pm

Thanks, Aristotle. WordPress’s new GUI editor was mangling my postings badly, and I figured out how to disable it halfway through making the posting. I have no idea why it changed my hrefs, but I fixed them by hand.
