Quoderat

Two small, useful Nautilus shell scripts

Posted on March 29, 2006 by David Megginson

If you use a Unix-family operating system with the Gnome desktop and its default Nautilus file browser, you might know that you can extend Nautilus using simple shell scripts. Here are two short and simple scripts.

Terminal window

This script, which I saved in my ~/.gnome2/nautilus-scripts/ directory as Shell, pops up a terminal window already set to the directory you’re browsing. If you want to do anything too complicated for Nautilus (or too tedious to do using a mouse), this is much more convenient than manually opening a shell window and changing to the directory you’re already browsing:

#!/bin/sh

/usr/bin/gnome-terminal

Software build

This script, which I saved in my ~/.gnome2/nautilus-scripts/ directory as Make, builds a Makefile-based application inside GNU Emacs, so that you can easily step through any errors in the source files (it would be easy to modify this to use Apache Ant or something similar):

#!/bin/sh

/usr/bin/emacs --eval '(compile "/usr/bin/make")'

It wouldn’t be too hard to rig up a variant of this to do well-formedness checking and validation of XML documents.
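One way such a variant might look, as a minimal sketch rather than a tested Nautilus script: it assumes xmllint (from libxml2) is installed, and it relies on the NAUTILUS_SCRIPT_SELECTED_FILE_PATHS variable that Nautilus sets to the newline-separated paths of the selected files. The XML_CHECKER variable and the check_wellformed function are names made up for this illustration.

```sh
#!/bin/sh
# Sketch: report which selected files are well-formed XML.
# Assumption: xmllint is on the PATH. XML_CHECKER is a hypothetical
# override so that a different checker command can be swapped in.
XML_CHECKER="${XML_CHECKER:-xmllint --noout}"

# Read newline-separated paths on stdin; print one result line per file.
check_wellformed() {
  while IFS= read -r path; do
    [ -n "$path" ] || continue   # skip blank lines in the list
    # $XML_CHECKER is deliberately unquoted so its options are split.
    if $XML_CHECKER "$path" >/dev/null 2>&1; then
      echo "OK   $path"
    else
      echo "FAIL $path"
    fi
  done
}

# Only run when Nautilus has supplied a selection.
if [ -n "${NAUTILUS_SCRIPT_SELECTED_FILE_PATHS:-}" ]; then
  printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS" | check_wellformed
fi
```

Results go to standard output, which Nautilus does not display on its own, so in practice the output would need to be redirected to a file or a dialog. Adding --valid to the xmllint command also asks for DTD validation.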

Simple is beautiful

I wasn’t lying when I wrote that these are short and simple scripts — it’s hard to believe how useful they are until you actually use them for a couple of days. It’s possible to do much more elaborate things with Nautilus and shell scripts, including operating on files selected in the GUI window, but as usual in tech, the biggest benefit comes from the lowest-hanging fruit.

Does anyone else have any nice 1- or 2-liners? I assume that KDE’s file browser has similar functionality, so scripts from there would also be interesting.

Tagged tips | 2 Comments

The REST schism and the REST contradiction

Posted on March 25, 2006 by David Megginson

Update: a proposal for a better name.

Don Box got people talking last week in a posting where he distinguishes between two kinds of REST: lo-REST, which uses only HTTP GET and POST, and hi-REST, which also uses HTTP PUT and DELETE.

The schism

If this distinction doesn’t seem very important, don’t worry — it’s not. Tim Bray captured the most important point, that Don Box (who is heavily involved in REST’s nemesis, Web Services) is talking positively about REST at all. For the RESTafarians and some of their friends, however, Box’s heresy was even worse than his former non-belief, because heresy can easily lead the faithful astray: witness strong reactions from Dimitri Glazkov, Jonnay (both via Dare Obasanjo), and Dare Obasanjo himself. There is even a holy scripture, frequently cited to clinch arguments.

The contradiction

I do not yet have a strong opinion on which approach is better, but I do see a contradiction between the two arguments I hear most often from REST supporters:

REST is superior to Web Services/SOAP/SOA because it’s been proven to work on the Web.
Almost nobody on the Web uses REST correctly.

Pick one, and only one, of these arguments, please. As far as I can see, apart from a few rare exceptions (like WebDAV), Don’s lo-REST (HTTP GET and POST only) is what’s been proven on the Web. The pure Book of Fielding, hi-REST GET/POST/PUT/DELETE version is every bit as speculative and unproven as Web Services/SOAP/SOA themselves (that’s not to say that it’s wrong; simply that it’s unproven). Some REST supporters, like Ryan Tomayko, acknowledge this contradiction.

(Update) A better name?

Tim Bray proposes throwing out the REST name altogether and talking instead about Web Style. I like that idea, though the REST name may be too sticky to get rid of by now. Dumping the REST dogma along with the name would clear up a lot of confusion: HTTP GET and POST have actually been proven to work and scale across almost unimaginable volumes; on the other hand, like the WS-* stack, using HTTP PUT and DELETE remains a clever design idea that still needs to be proven practical and scalable.

Tagged architecture, programming | 4 Comments

XML 2006: Paper Tracks

Posted on March 21, 2006 by David Megginson

For XML 2006, which will be held in Boston from 5-7 December, we’ve decided to introduce four paper tracks. Each track will extend the full three days and will serve as its own mini-conference, concentrating on a specific area of interest (though we hope to see a lot of people moving among tracks):

Enterprise XML Computing: XML in the world of big business and government — legacy system integration, service-oriented architecture, REST and web services, etc.
XML on the Web: XML outside the firewall — AJAX, blogging technologies (RSS and Atom), Web 2.0, Semantic Web, publish/subscribe, tagging, etc.
Documents and Publishing: authoring, managing and publishing information using XML — DITA, Docbook, XSL(T/-FO), XHTML, and much, much more.
Hands-on XML: practical, workshop-oriented sessions, including last year’s popular Masters Series, case studies, tutorials, workshops, and live demos.

The official call for papers will go out at XTech 2006 in Amsterdam on Wednesday 17 May, and I hope to see many of you there. In the meantime, we’re counting on you to keep coming up with papers that educate, dazzle, and challenge, so please start thinking about what you’d like to propose for one or more of these tracks. Comments are, of course, very welcome.

(Technorati: xml2006)

Tagged conferences | 4 Comments

RFC: (Java) SAX exceptions and new minor SAX version

Posted on March 12, 2006 by David Megginson

(Note that this is not a major API change, and does not affect non-Java versions of SAX.)

Over on the sax-devel mailing list, Norman Walsh, who is involved with JAXP at Sun, has requested a small change to the SAXException class (see the archived thread).

When we were designing SAX quite a few years back, we needed the ability to embed an exception in another exception but Java did not support that, so we designed our own support. Starting with JDK 1.4, Java has supported embedded exceptions through the getCause method. Implementing getCause in SAXException would allow for more accurate stack traces and debugging, among other things.

Unfortunately, there is never such a thing as a perfectly backwards-compatible change. Chris Burdess pointed out that this change will break Java code that was calling initCause manually, and obviously, there will be some other differences in behaviour depending on which version of SAX people use. I believe that bringing SAX in line with modern Java usage (JDK 1.4 has itself been around for a while) is worth the trouble, and that very few applications would experience problems, but I’d like to see some wider discussion before I decide to put out a minor SAX release. Please let me know what you think, either by subscribing to the sax-devel list, posting a comment here, or posting your own blog entry and pinging this one.

Tagged programming | 2 Comments

Programming languages of distinction

Posted on March 6, 2006 by David Megginson

Via Ongoing, I read some interesting discussions of programming languages — mainly Python vs. Ruby, with most people happily dumping on Java.

Steve Yegge, in particular, argues that language success is based mainly on marketing, and that Python is doomed to obscurity because of the community’s lack of marketing savvy.

The programming language cycle

While I agree that Python probably is doomed to perpetual obscurity at this point, I think that Yegge’s focus on marketing is oversimplistic; instead, I’d argue that there’s a self-perpetuating cycle at work for successful programming languages:

Elite (guru) developers notice too many riff-raff using their current programming language, and start looking for something that will distinguish them better from their mediocre colleagues.
Elite developers take their shopping list of current annoyances and look for a new, little-known language that apparently has fewer of them.
Elite developers start to drive the development of the new language, contributing code, writing libraries, etc., then evangelize the new language.
Sub-elite (senior) developers follow the elite developers to the new language, creating a market for books, training, etc., and also accelerating the development and testing of the language.
Sub-elite developers, who have huge influence (elite developers tend to work in isolation on research projects rather than on production development teams), begin pushing for the new language in the workplace.
The huge mass of regular developers realize that they have to start buying books and taking courses to learn a new language.
Elite developers notice too many riff-raff using their current programming language, and start looking for something that will distinguish them better from their mediocre colleagues.

You’ll notice that there’s no step here called “marketing”; instead, there are several distinct stages of evangelization and community building. Major vendors (other than the language’s owner, if it’s a vendor) will start to notice the language once the second wave (sub-elite) developers arrive, and IT managers will notice it because of books, magazine articles, and pressure from the high-end developers. Some — possibly a lot — of marketing will come out of those steps, but it is as much a result of the language’s success as a cause.

Points of failure

In this cycle, there are a few highly probable points of failure:

Timing: A new language might not be at the right stage of development (too raw, or too stale) at the time when elite developers decide to make a mass migration.
Features: If the new language’s features don’t answer the elite developers’ annoyance list, not enough of them will migrate to it.
Openness: Elite developers are used to having a lot of influence, and if the new language’s development process does not allow them sufficient say in the new language’s evolution, they will leave before they attract enough sub-elite developers.
Tools: Sub-elite developers might find the language unsuitable for day-to-day production use, especially if enough basic tools are not available (libraries, testing, debugging, GUI tools, performance measurement, etc.).
General acceptance: Regular developers might object to the new language and sabotage projects using it, either by producing poor-quality code or by missing deadlines (and blaming the new language in both cases).

Most programming languages stumble over one or more of these — it’s as much luck as clever design when a language like C++ or Java makes it past the hurdles and into the workplace. Success tends to draw more success, money draws more money, etc.

The final and most important point here is that a programming language’s perceived coolness will always suffer from its success. Java cannot possibly still be cool when there are thousands of regular developers slaving away in the bowels of ACME Widgets using it to write enterprise applications. If, in fact, Ruby displaces Java in the enterprise (which may not happen, since Ruby has no advantage over Java to match Java’s memory-management advantage over C++), it will suffer precisely the same fate, and we can expect Bruce Tate to write a book Beyond Ruby in five years or so.

By that measure, Python’s very failure is a kind of success: as long as it never really takes hold in the workplace, it will always carry a small degree of distinction with it, and at least a few elite developers won’t feel pressured to move on. Like a movie or band that never becomes too popular, Python will hang onto its snob appeal.

Posted in Uncategorized | Tagged programming | 18 Comments

PHP, XML, and Unicode

Posted on March 1, 2006 by David Megginson

Update: in a comment John Cowan points out the obvious, that a UTF-8 escape sequence can never contain an ASCII character (because the high bit is always set, as I knew but failed to register). As a result, my xml_escape() function is way over-complicated. Thanks, John.

Update #2: in a comment, Jirka Kosek points out that PHP5 is actually using the also-excellent libxml instead of Expat — the PHP developers actually ported the expat-based, low-level interface to libxml so that it wouldn’t break legacy code. In that case, I’m especially impressed that my script produces byte-for-byte identical output with PHP4 and PHP5. I’m still looking for a problem with PHP’s XML+Unicode handling (other than the inconvenience of working with UTF-8 on the byte level).

Update #3: here’s a good summary of XML support in PHP5

A couple of weeks ago, Tim Bray posted about PHP and received a firestorm of comments, just as I did when I posted about PHP and Ruby on Rails almost a year ago. PHP generates a lot of passion, for good or for ill: my posting still gets a new comment every week or two.

As Tim updated his posting with comments, he linked to a two-year-old posting by Steve Minutillo about PHP4’s inability to detect character encodings in XML files and other Unicode bugs. That caught me by surprise — after all, PHP uses the venerable Expat as its XML parsing engine (the same engine used in most programming environments other than Java), and if Expat wasn’t getting things right, then the PHP people must have gone way out of their way to misconfigure it.

Testing Unicode support

To test XML character-encoding support in PHP, I used two PHP versions: 4.4.0, and 5.0.5 (which happen to be the current PHP4 and PHP5 heads in Ubuntu). I wrote a simple identity transform script (available for download at http://www.megginson.com/Software/xml-identity-transform.php — please consider it Public Domain) to read an XML file and write a simplified version of it back out again (I forgot to include processing instructions — sorry. I’ll fix that later.) The script always produces UTF-8 output, regardless of the input encoding. I ran it under both PHP4 and PHP5 against two XML source files with accented characters: one encoded in UTF-8, and the other encoded in ISO-8859-1 (with a suitable XML declaration). The script produces identical and correct UTF-8 output under both PHP4 and PHP5 (at least, the versions I tested). There is no conditional code based on the PHP version, but I did have to set a couple of options carefully.

Setting up a PHP XML parser

Here’s how I set up my XML parser in PHP:

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);

The first line creates the parser (I’m not using Namespaces for this example, or it would look a little different.) The second line requests that the parser report element names, attribute names and values, content, and everything else to my application using UTF-8, no matter what the input encoding was. The final option undoes a mind-numbingly stupid default in PHP, where all element and attribute names are converted to upper case before being passed on.

Next, I register my event handlers with the parser (this step should be familiar to anyone who has ever programmed with Expat or SAX):

xml_set_element_handler($parser, 'start_element', 'end_element');
xml_set_character_data_handler($parser, 'character_data');

The handlers themselves are naively simple, attempting to recreate the XML markup reported to them:

function start_element ($parser, $name, $atts)
{
  echo("<$name");
  foreach ($atts as $aname => $avalue) {
    echo " $aname=\"" . xml_escape($avalue) . '"';
  }
  echo(">");
}

function end_element ($parser, $name)
{
  echo("</$name>");
}

function character_data ($parser, $data)
{
  echo(xml_escape($data));
}

The only complicated bit happens in the xml_escape function. Unfortunately, since I’m dealing with raw UTF-8, I have to know a bit about UTF-8 encoding to do the escaping; otherwise, my code might mistake part of a multi-byte escape sequence for an ampersand and replace it with an entity reference (note: this is all unnecessary; see John Cowan’s comment):

function xml_escape ($s)
{
  $result = '';
  $len = strlen($s);
  for ($i = 0; $i < $len; $i++) {
    if ($s{$i} == '&') {
      $result .= '&amp;';
    } else if ($s{$i} == '<') {
      $result .= '&lt;';
    } else if ($s{$i} == '>') {
      $result .= '&gt;';
    } else if ($s{$i} == '\'') {
      $result .= '&apos;';
    } else if ($s{$i} == '"') {
      $result .= '&quot;';
    } else if (ord($s{$i}) > 127) {
      // skipping UTF-8 escape sequences requires a bit of work
      if ((ord($s{$i}) & 0xf0) == 0xf0) {
        $result .= $s{$i++};
        $result .= $s{$i++};
        $result .= $s{$i++};
        $result .= $s{$i};
      } else if ((ord($s{$i}) & 0xe0) == 0xe0) {
        $result .= $s{$i++};
        $result .= $s{$i++};
        $result .= $s{$i};
      } else if ((ord($s{$i}) & 0xc0) == 0xc0) {
        $result .= $s{$i++};
        $result .= $s{$i};
      }
    } else {
      $result .= $s{$i};
    }
  }
  return $result;
}
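John Cowan’s point can be checked from a shell. This sketch dumps the bytes of a string with od and confirms that every byte of a multi-byte UTF-8 sequence is above 127, so none of them can collide with the five ASCII characters being escaped. The byte_values and all_high_bit helpers are hypothetical names used only here.

```sh
#!/bin/sh
# Print the decimal value of each byte on stdin, one per line.
byte_values() {
  od -An -v -tu1 | tr -s ' ' '\n' | sed '/^$/d'
}

# Succeed only if every byte on stdin has its high bit set (value > 127).
all_high_bit() {
  byte_values | awk '$1 <= 127 { low++ } END { exit (low > 0) }'
}

# The UTF-8 encoding of e-acute is the two bytes 0xC3 0xA9.
printf '\303\251' | byte_values    # prints 195 and 169, one per line
```

Since no byte of such a sequence is an ASCII character, xml_escape could simply replace the five special characters and copy everything else through unchanged.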

The rest of my code is just the normal Expat parsing loop: open a file (or URL), feed it to Expat in buffered chunks, and then report that the input is finished.

So where are the problems?

There may be huge problems that I somehow missed in my brief test.
The PHP XML documentation is not entirely clear about input and output character encodings, probably because the documentation writers were themselves a bit confused about this stuff.
It is possible (even likely) that bugs existed in both the PHP4 and PHP5 codebases two years ago when Steve wrote his piece, but have since been fixed.
It is a bit tricky working with UTF-8, since you have to remember to detect escape sequences. A PHP library would be nice. Or better yet, hide it completely, like Java does. Still, it’s only a nuisance, not a show-stopper.
Steve referred to the PHP XML parser’s mangling numeric character references. Expat doesn’t do that. However, it is possible that people think numerical character references refer to their current encoding, rather than to the abstract Unicode character set, and that will get them into serious trouble.
Expat does not support all character encodings out of the box. In fact, XML parsers are required to support only UTF-8 and UTF-16 — use any other encoding (even ISO-8859-1) at your peril, since there’s no guarantee that other XML software will be able to read it.
People often forget to declare what encoding they’re using.
Anyone who serves XML documents as text/xml is going to get in trouble no matter what language people use, because of the reencoding that might take place.

Most of these problems are not unique to PHP — XML is hard and confusing, Unicode is hard and confusing, and when you put the two together, there’s lots of opportunity for human error.

I’d be interested in the URLs of well-formed XML documents in supported encodings (UTF-8, UTF-16, US-ASCII, or ISO-8859-1, I think) that do not work properly in recent versions of PHP4 or PHP5 with the simple identity-transformation script I posted. If there are deep problems with PHP, XML, and Unicode, rather than just user confusion, I’d like to know about them.

Tagged programming, tips | 8 Comments

A new Namespaces discussion

Posted on February 26, 2006 by David Megginson

Eliot Kimber and I were both on the old W3C XML Working Group during the development of the Namespaces in XML specification. Late in the process, pressure from outside the WG forced us to make a major change to the specification, angering many of the members. Eliot, who was already pretty unhappy with the Namespaces spec, left; I decided to stay.

Eliot has recently had the grace and integrity to make a posting where he admits to being wrong about Namespaces, and states that he is now, with only a few caveats, a big fan of the spec. He even goes so far as to write the following:

If you’re not using namespaces you should be–I can’t see any excuse for anyone defining any set of XML elements that is not in a namespace. It should be required and it’s too bad that XML, for compatibility reasons, has to allow no-namespace documents.

The context problem

While I wasn’t originally as strongly opposed to Namespaces as Eliot was, I cannot claim to be as strongly in favour now. For me, the biggest problem with the Namespaces spec is the requirement for a context to interpret prefixed names. That’s no biggie as far as XML element and attribute names go:

<foo:bar xmlns:foo="http://www.example.org/foo/" foo:a="b"/>

Here, there’s no doubt that foo:bar stands for “{http://www.example.org/foo/}bar” (or however you want to notate it), while foo:a stands for “{http://www.example.org/foo/}a.”

QNames in content and attribute values

What happens, however, when the prefixed name appears in an attribute value or content?

<foo:bar xmlns:foo="http://www.example.org/foo/" foo:a="foo:b">foo:c</foo:bar>

Simply looking at this XML document in isolation, there’s no way to know whether the attribute value “foo:b” and the content “foo:c” are meant as literal strings or qualified names. The context (the xmlns declaration) is still there to allow software to expand the prefix, but you need something else (an external schema, hard-coded application logic, prompting a human operator) to decide whether it’s safe to expand the name. Any feature that requires the use of schemas to perform basic XML processing should raise red flags.

QNames in XPath expressions

The biggest problem, however, comes with referring to parts of an XML document in non-XML syntax. Consider the following XPath expression:

//foo:bar/@foo:a

Unlike the XML document, this expression does not provide any way to expand the foo: prefix. It needs some kind of external context. That means that you can never simply pass this around as a string argument in a programming language, for example, without also passing around a whole set of Namespace declarations. Namespace processors cannot safely discard prefixes, because they might still be important later on. XML transformation filters have to try to preserve original prefixes whenever possible. In short, in non-trivial XML processing, the distinction between the Namespace prefix and the Namespace URI quickly becomes blurred. And this is not simply a problem for tool makers — it’s one that bites developers, script writers, database administrators, and even information authors.

Namespaces if necessary, but not necessarily Namespaces

I don’t know an easy fix for this (perhaps including the full Namespace URI in XPath expressions would have been smarter), but given all of this hassle, I cannot agree with Eliot that Namespaces should always be mandatory. Where Namespaces are not needed for disambiguation — where an XML document isn’t meant to be published to the web for general use — avoiding Namespaces (or at least, using them sparsely) removes a huge amount of complexity from XML development, authoring, and information management. A script kiddie, for example, can easily write PHP code to deal with non-Namespaces qualified XML documents, but may quickly fall out of his or her depth once we stir Namespaces into the mix.

I do still believe that Namespaces are valuable, and in general, I’m not unhappy with the current specification; however, I also believe that simpler XML markup still has its place for a huge range of applications, especially when the XML document will be used in a specific way and not published to the world at large.

Tagged programming | 9 Comments

Earthquakes and high tech

Posted on February 25, 2006 by David Megginson

Ottawa had a little earthquake (magnitude 4.5) yesterday evening at 8:39 pm EST. Ottawa is Canada’s biggest high tech centre (or at least was before the dot.bomb, drawing more investment than Toronto). Like the San Francisco Bay area, Ottawa is built on top of a series of geological fault lines; however, ours never result in worse than a minor tremor every 5-10 years. Our tech industry is (relatively) minor as well. Does the severity of fault lines correlate with high tech success?

Maybe a little danger gives people an edge. Tech people in the Bay area live every day wondering if they’re going to fall into the Pacific tomorrow, and bus ads in San Francisco talk about stocking up on food and flashlights (I don’t think anyone’s ever going to count on timely help from FEMA again). What are we worried about in Ottawa? A bad skating season on the Rideau Canal?

Note to Route 128 companies: to find the edge you’ll need to compete seriously with the Bay area, you’ll have to come up with a looming natural disaster. A mega tsunami caused by a volcano in the Canary Islands might fit the bill.

Tagged news | 1 Comment

Two Web Services Questions (what actually works?)

Posted on February 23, 2006 by David Megginson

My biggest frustration with the current Web Services debate (triggered innocently in a posting by Don Box, with followups by nearly everyone) is the lack of verifiable information. We need a big, independent study to answer two important questions about each part of the WS-* stack:

Does it actually work as specified in each individual implementation?
Does it actually work as specified across many different implementations?

Any WS-* feature that receives a ‘no’ answer to either of these questions is excluded from the debate — WS advocates cannot credibly claim that WS-* is more appropriate for complex, enterprise interfaces unless the complex enterprise features actually work, portably.

On the other hand, any WS-* feature that receives a ‘yes’ answer to both of these questions needs to be taken seriously by the REST advocates. They’ve gotten used to throwing mud at WS-*, assuming that everything is broken; where the WS people have managed to get something working robustly and portably, let’s at least start by giving them the benefit of the doubt that they might have solved a real business problem.

Tagged business, programming | 3 Comments

Remembering the Y2K panic

Posted on February 20, 2006 by David Megginson

Steven Levitt (of Freakonomics fame) has started a small controversy by casually mentioning that the Y2K crisis was a false prophecy (his more detailed followup posting is here; he also points to a paper that I didn’t bother reading, but that probably does a better job than my posting of going over the issue).

While I never advertised myself as a Y2K consultant, I made money from the Y2K panic like everyone else in IT: even if I didn’t do Y2K projects directly, systems were being replaced early because of Y2K, IT departments were getting bigger budgets and spending on whatever they wanted, etc. And like many (most?) people reading this weblog, I went out of my way to try to explain to my customers at every opportunity why the Y2K threat was exaggerated.

The logic was simple: the scare stories in the press talked about everything shutting down at midnight on December 31, 1999, but in fact, times and dates in IT systems are much more complicated than that: information and events go through lifecycles that have starts, ends, and often many stages in between. Here are some examples:

If you took out a 20-year mortgage in 1980, the expiry date would have been 2000.
If you were 55 in 1990, you would have been 65 in 2000.
If you received a new credit card with a five-year term in 1995, the expiry date would have been 2000.
When your credit card bill arrived on 15 December 1999, payment was probably due in 2000.

So how many of you received notices in 1981 that your mortgages were 81 years overdue? Or how many of you received pension benefits for 156-year-olds in 1991? How many of you found that your credit cards were declined in 1996 because they were 96 years past expiry? Or how many of you were charged 99 years’ interest for an unpaid credit-card bill in 2000?
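The arithmetic behind the mortgage example can be sketched in a few lines of shell. This is an illustration only: overdue_two_digit and overdue_full are hypothetical helpers, and real systems stored and compared dates in many different ways.

```sh
#!/bin/sh
# Years overdue as a system that keeps only two-digit years would see it.
# usage: overdue_two_digit CURRENT_YEAR EXPIRY_YEAR
overdue_two_digit() {
  echo $(( $1 % 100 - $2 % 100 ))
}

# Years overdue using full four-digit years (negative means not yet due).
overdue_full() {
  echo $(( $1 - $2 ))
}

# A mortgage expiring in 2000, checked in 1981:
overdue_two_digit 1981 2000   # prints 81
overdue_full 1981 2000        # prints -19
```

The two-digit version is wrong from the day the mortgage is issued, not just at the rollover, which is why the failures should have shown up gradually over decades.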

Of course, some of these things did happen to some people in the decades leading up to Y2K, but only very rarely — rarely enough, in fact, that every case was considered newsworthy. 2000 was going to be the peak of a curve that started decades before and ended decades after, but since the curve was still so close to zero by the 1990s, it was obvious to anyone who cared to spend time thinking (even a statistical numbskull like me) that the Y2K consultants screaming doom and gloom were either not fully competent or not fully honest. It was important, of course, to check the most critical systems, like hospital equipment or nuclear power plants, but Y2K was hardly going to be a real operational problem for most organizations.

Those same consultants defend themselves now, of course, by claiming that they averted a catastrophe, but that is trivially easy to disprove: countries that spent very little on Y2K preparedness, like France, had no more problems than countries that spent a lot, like the U.S. and Canada. Of course, France benefitted from some spill-over from the North American IT work, but there still should have been a significant, measurable difference between the two. There wasn’t. QED.

Tagged business, news | Comments Off