Update: in a comment John Cowan points out the obvious, that a UTF-8 escape sequence can never contain an ASCII character (because the high bit is always set, as I knew but failed to register). As a result, my xml_escape() function is way over-complicated. Thanks, John.
Update #2: in a comment, Jirka Kosek points out that PHP5 is actually using the also-excellent libxml instead of Expat — the PHP developers actually ported the expat-based, low-level interface to libxml so that it wouldn’t break legacy code. In that case, I’m especially impressed that my script produces byte-for-byte identical output with PHP4 and PHP5. I’m still looking for a problem with PHP’s XML+Unicode handling (other than the inconvenience of working with UTF-8 on the byte level).
Update #3: here’s a good summary of XML support in PHP5
A couple of weeks ago, Tim Bray posted about PHP and received a firestorm of comments, just as I did when I posted about PHP and Ruby on Rails almost a year ago. PHP generates a lot of passion, for good or for ill: my posting still gets a new comment every week or two.
As Tim updated his posting with comments, he linked to a two-year-old posting by Steve Minutillo about PHP4’s inability to detect character encodings in XML files and other Unicode bugs. That caught me by surprise — after all, PHP uses the venerable Expat as its XML parsing engine (the same engine used in most programming environments other than Java), and if Expat wasn’t getting things right, then the PHP people must have gone way out of their way to misconfigure it.
Testing Unicode support
To test XML character-encoding support in PHP, I used two PHP versions: 4.4.0, and 5.0.5 (which happen to be the current PHP4 and PHP5 heads in Ubuntu). I wrote a simple identity transform script (available for download at http://www.megginson.com/Software/xml-identity-transform.php — please consider it Public Domain) to read an XML file and write a simplified version of it back out again (I forgot to include processing instructions — sorry. I’ll fix that later.) The script always produces UTF-8 output, regardless of the input encoding. I ran it under both PHP4 and PHP5 against two XML source files with accented characters: one encoded in UTF-8, and the other encoded in ISO-8859-1 (with a suitable XML declaration). The script produces identical and correct UTF-8 output under both PHP4 and PHP5 (at least, the versions I tested). There is no conditional code based on the PHP version, but I did have to set a couple of options carefully.
Setting up a PHP XML parser
Here’s how I set up my XML parser in PHP:
$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);
The first line creates the parser (I’m not using Namespaces for this example, or it would look a little different.) The second line requests that the parser report element names, attribute names and values, content, and everything else to my application using UTF-8, no matter what the input encoding was. The final option undoes a mind-numbingly stupid default in PHP, where all element and attribute names are converted to upper case before being passed on.
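To see what the case-folding option changes, here is a minimal sketch that parses the same tiny document twice, once with folding on and once with it off. The helper name reported_names is hypothetical (it is not part of the identity-transform script), and the closures assume PHP 5.3 or later rather than the PHP4 and PHP5 versions tested above.

```php
<?php
// Hypothetical helper: collect the element names the parser reports for $xml.
function reported_names($xml, $fold_case)
{
    $names = array();
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, $fold_case);
    xml_set_element_handler(
        $parser,
        // Start handler: remember each element name exactly as reported.
        function ($p, $name, $atts) use (&$names) { $names[] = $name; },
        // End handler: nothing to do for this demonstration.
        function ($p, $name) {}
    );
    xml_parse($parser, $xml, true);
    return $names;
}

$xml = '<doc><para/></doc>';
echo implode(',', reported_names($xml, true)), "\n";   // folding on: DOC,PARA
echo implode(',', reported_names($xml, false)), "\n";  // folding off: doc,para
```

Since XML names are case-sensitive, only the second form can be written back out unchanged.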
Next, I register my event handlers with the parser (this step should be familiar to anyone who has ever programmed with Expat or SAX):
xml_set_element_handler($parser, 'start_element', 'end_element');
xml_set_character_data_handler($parser, 'character_data');
The handlers themselves are naively simple, attempting to recreate the XML markup reported to them:
function start_element ($parser, $name, $atts)
{
    echo("<$name");
    foreach ($atts as $aname => $avalue) {
        echo " $aname=\"" . xml_escape($avalue) . '"';
    }
    echo(">");
}
function end_element ($parser, $name)
{
    echo("</$name>");
}
function character_data ($parser, $data)
{
    echo(xml_escape($data));
}
The only complicated bit happens in the xml_escape function. Unfortunately, since I’m dealing with raw UTF-8, I have to know a bit about UTF-8 encoding to do the escaping; otherwise, my code might mistake part of a multi-byte escape sequence for an ampersand and replace it with an entity reference (note: this is all unnecessary, as John Cowan’s comment points out):
function xml_escape ($s)
{
    $result = '';
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        if ($s[$i] == '&') {
            $result .= '&amp;';
        } else if ($s[$i] == '<') {
            $result .= '&lt;';
        } else if ($s[$i] == '>') {
            $result .= '&gt;';
        } else if ($s[$i] == '\'') {
            $result .= '&apos;';
        } else if ($s[$i] == '"') {
            $result .= '&quot;';
        } else if (ord($s[$i]) > 127) {
            // skipping UTF-8 escape sequences requires a bit of work
            if ((ord($s[$i]) & 0xf0) == 0xf0) {
                // lead byte of a four-byte sequence
                $result .= $s[$i++];
                $result .= $s[$i++];
                $result .= $s[$i++];
                $result .= $s[$i];
            } else if ((ord($s[$i]) & 0xe0) == 0xe0) {
                // lead byte of a three-byte sequence
                $result .= $s[$i++];
                $result .= $s[$i++];
                $result .= $s[$i];
            } else if ((ord($s[$i]) & 0xc0) == 0xc0) {
                // lead byte of a two-byte sequence
                $result .= $s[$i++];
                $result .= $s[$i];
            }
        } else {
            $result .= $s[$i];
        }
    }
    return $result;
}
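For comparison, here is a sketch of the simpler version that John Cowan’s observation allows. Every byte of a UTF-8 multi-byte sequence has its high bit set, so none of them can collide with the five ASCII characters being escaped, and a plain byte-level substitution is enough. The name xml_escape_simple is hypothetical, and this is one possible simplification rather than the code in the downloadable script.

```php
<?php
// Simplified escaping: bytes >= 0x80 pass through untouched, because they
// can never be equal to '&', '<', '>', "'" or '"'.
function xml_escape_simple($s)
{
    // strtr() with an array makes a single pass over the input, so the '&'
    // in an inserted '&lt;' is never escaped a second time.
    return strtr($s, array(
        '&'  => '&amp;',
        '<'  => '&lt;',
        '>'  => '&gt;',
        '\'' => '&apos;',
        '"'  => '&quot;',
    ));
}

// "\xC3\xA9" is the UTF-8 encoding of e-acute.
echo xml_escape_simple("caf\xC3\xA9 <b> & \"tea\""), "\n";
// prints: café &lt;b&gt; &amp; &quot;tea&quot;
```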
The rest of my code is just the normal Expat parsing loop: open a file (or URL), feed it to Expat in buffered chunks, and then report that the input is finished.
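A parsing loop of that shape might look like the sketch below. It is not the code from the downloadable script: the helper name parse_in_chunks, the chunk size, and the temporary test file are all assumptions made for illustration, and the closure assumes PHP 5.3 or later.

```php
<?php
// Hypothetical helper: stream the file at $path through $parser in chunks,
// then signal that the input is finished.
function parse_in_chunks($parser, $path, $chunk_size = 4096)
{
    $in = fopen($path, 'rb');
    if ($in === false) {
        return false;
    }
    $ok = true;
    while ($ok && !feof($in)) {
        $chunk = fread($in, $chunk_size);
        if ($chunk === false) {
            $ok = false;
        } else if ($chunk !== '') {
            // Third argument false: more input may follow this chunk.
            $ok = (xml_parse($parser, $chunk, false) == 1);
        }
    }
    fclose($in);
    // Third argument true: tell the parser that the input is complete.
    return $ok && (xml_parse($parser, '', true) == 1);
}

// Usage: parse a small temporary file and collect its character data.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($path, "<doc>caf\xC3\xA9 <b>au lait</b></doc>");

$text = '';
$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
xml_set_character_data_handler(
    $parser,
    function ($p, $data) use (&$text) { $text .= $data; }
);
$ok = parse_in_chunks($parser, $path, 8);  // tiny chunks, to force several reads
unlink($path);
echo $ok ? $text : 'parse error', "\n";
```

Making the final call with an empty string keeps the end-of-input signal separate from the read loop, so it is sent even when the file length is an exact multiple of the chunk size.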
So where are the problems?
- There may be huge problems that I somehow missed in my brief test.
- The PHP XML documentation is not entirely clear about input and output character encodings, probably because the documentation writers were themselves a bit confused about this stuff.
- It is possible (even likely) that bugs existed in both the PHP4 and PHP5 codebases two years ago when Steve wrote his piece, but have since been fixed.
- It is a bit tricky working with UTF-8, since you have to remember to detect escape sequences. A PHP library would be nice. Or better yet, hide it completely, like Java does. Still, it’s only a nuisance, not a show-stopper.
- Steve referred to the PHP XML parser’s mangling numeric character references. Expat doesn’t do that. However, it is possible that people think numeric character references refer to their current encoding, rather than to the abstract Unicode character set, and that will get them into serious trouble.
- Expat does not support all character encodings out of the box. In fact, XML parsers are required to support only UTF-8 and UTF-16 — use any other encoding (even ISO-8859-1) at your peril, since there’s no guarantee that other XML software will be able to read it.
- People often forget to declare what encoding they’re using.
- Anyone who serves XML documents as text/xml is going to get in trouble no matter what language people use, because of the reencoding that might take place.
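The point about numeric character references in the list above can be checked with a small sketch. It parses an ISO-8859-1 document containing the same character twice, once as a raw Latin-1 byte and once as a numeric character reference, and prints the resulting character data as hex. The helper name text_as_utf8 is hypothetical, the closure assumes PHP 5.3 or later, and the sketch assumes the parser reads the encoding from the XML declaration, as it did in my tests.

```php
<?php
// Hypothetical helper: return all character data in $xml, reported as UTF-8.
function text_as_utf8($xml)
{
    $text = '';
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
    xml_set_character_data_handler(
        $parser,
        // The handler may be called several times, so append each piece.
        function ($p, $data) use (&$text) { $text .= $data; }
    );
    xml_parse($parser, $xml, true);
    return $text;
}

// "\xE9" is a raw e-acute byte in ISO-8859-1; &#233; refers to the Unicode
// code point U+00E9 (also e-acute), whatever the document's encoding is.
$latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
        . "<doc>\xE9&#233;</doc>";
echo bin2hex(text_as_utf8($latin1)), "\n";
```

If both forms are decoded to U+00E9, the output should be c3a9c3a9, the UTF-8 encoding of that character twice. The number in the reference is a Unicode code point, which here happens to match the Latin-1 byte value only because ISO-8859-1 and Unicode agree on their first 256 characters.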
Most of these problems are not unique to PHP — XML is hard and confusing, Unicode is hard and confusing, and when you put the two together, there’s lots of opportunity for human error.
I’d be interested in the URLs of well-formed XML documents in supported encodings (UTF-8, UTF-16, US-ASCII, or ISO-8859-1, I think) that do not work properly in recent versions of PHP4 or PHP5 with the simple identity-transformation script I posted. If there are deep problems with PHP, XML, and Unicode, rather than just user confusion, I’d like to know about them.