Saturday, January 15, 2005

XHTML 2 as the 'Universal Document'

One of my 'hats' is as an Invited Expert on the HTML Working Group. That may sound grand (and of course, it may not), but as it happens just about everyone I have met who is on a W3C Working Group is an expert, and the two groups I am involved with have more than their fair share of clever people. So, 'Invited Experts' are usually just individuals who may have something to contribute to a group, but because their companies are not large enough to be W3C members, they end up being invited on.

Now, it may not be obvious why I should be interested in HTML from the point of view of 'internet applications', but I am. In the model I'm working to, HTML is the host language of choice when putting together an IA architecture, and the latest version of HTML -- XHTML 2.0 [XHTML2] -- goes a long way to providing an 'abstract document' container that is ideal for our needs.

To explain what I am getting at, I should perhaps clarify my terminology. When I use the word document I mean pretty much anything that you would commonly refer to as a document -- a spreadsheet, a memo, a chapter of your autobiography, a piece of music, a blog entry, even a vector graphics image. However, I also mean pretty much anything that you would not commonly refer to as a document -- a list of RSS feeds, a list of RSS articles, a friend-of-a-friend (FoaF) file, even a configuration file for your email software. Why I include these in my list will hopefully become clear.

What's a Document?

By document then, I mean the thing itself that you have created, for some purpose. It's not the actual physical file itself, but it's pretty close to it. This document has a number of properties just by being a document -- for example, someone (or some thing) created it, on a certain date, and perhaps modified it on one or more other dates. With this in mind, we could therefore say that pretty much any document you care to create has a basic structure of:

  • metadata about the document;

  • the actual content of the document.


Expressed in XHTML 2 mark-up this general structure looks something like this:

<html
xmlns="http://www.w3.org/2002/06/xhtml2/"
>
<head>
<title>My Expenses Spreadsheet</title>
...
</head>
<body>
...
</body>
</html>

Now we need to be clear here that we're not drawing any hard and fast rules -- so we're not for example saying that all the metadata about this document that exists must somehow be squeezed into head, since user access rights (for example) might be stored in the operating system, or in some WebDAV server. And we're also not saying that there might not be other types of more specialised document structure which might be more appropriate in certain situations.

But the structure outlined works pretty well as a basic format for most of the documents we are likely to create or come across as we go about our business in Internetland.

Base Class

If you like to think in terms of object-oriented programming (and let's face it, who doesn't) then you can see the XHTML 2.0 'document' as a base class, onto which other document types can be built. For example, we could use XHTML 2.0 as a base format for creating a document format for email messages (the example below is adapted from [RFC2822]):

<html
xmlns="http://www.w3.org/2002/06/xhtml2/"
xmlns:email="http://www.faqs.org/rfcs/rfc2822.html#"
>
<head>
<meta property="email:From">John Doe &lt;jdoe@machine.example&gt;</meta>
<meta property="email:To">Mary Smith &lt;mary@example.net&gt;</meta>
<meta property="email:Subject">Saying Hello</meta>
<meta property="email:Date">Fri, 21 Nov 1997 09:55:06 -0600</meta>
<meta property="email:Message-ID">&lt;1234@local.machine.example&gt;</meta>
<title>An email from John to Mary saying 'hello'</title>
</head>
<body>
This is a message just to say hello.
So, "Hello".
</body>
</html>

The body element is pretty straightforward -- it's your actual message. The head is a bit more complex, but not much since all we've done is put a stack of metadata in, which is what we've always done with HTML documents. You probably noticed that we've used QNames to express the properties -- i.e., we have both a namespace prefix and a local name -- and this is one of the changes that has been made to HTML as it moves to XHTML 2.0 in order to make it more 'metadata friendly'.
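
To make the 'stack of metadata' idea concrete, here is a minimal sketch of how a program might read the head of the email document above. It uses Python's standard xml.etree.ElementTree module; the function name head_metadata is made up for this illustration, and the property values are left as prefixed strings rather than being expanded against the namespace declarations.

```python
import xml.etree.ElementTree as ET

XHTML2 = "http://www.w3.org/2002/06/xhtml2/"

# The email document from the example above.
EMAIL_DOC = """\
<html xmlns="http://www.w3.org/2002/06/xhtml2/"
      xmlns:email="http://www.faqs.org/rfcs/rfc2822.html#">
  <head>
    <meta property="email:From">John Doe &lt;jdoe@machine.example&gt;</meta>
    <meta property="email:To">Mary Smith &lt;mary@example.net&gt;</meta>
    <meta property="email:Subject">Saying Hello</meta>
    <title>An email from John to Mary saying 'hello'</title>
  </head>
  <body>This is a message just to say hello.</body>
</html>
"""


def head_metadata(xml_text):
    """Return {property: text} for every meta element directly inside head.

    Hypothetical helper: property names stay in their prefixed form
    (e.g. 'email:From'); no namespace expansion is attempted.
    """
    root = ET.fromstring(xml_text)
    head = root.find(f"{{{XHTML2}}}head")
    return {
        meta.get("property"): (meta.text or "").strip()
        for meta in head.findall(f"{{{XHTML2}}}meta")
    }


metadata = head_metadata(EMAIL_DOC)
for name, value in metadata.items():
    print(f"{name}: {value}")
```

Because the metadata always lives in the same place, the same function would work unchanged on any other document type built on this base.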

And of course ... RSS

With an example like an email it's pretty easy to see how it corresponds to our notion of a 'document'. But what about a list of RSS feeds? First, some background.

RSS Readers

As a little test to see what further features we needed to add to formsPlayer to make building IAs as easy as possible, we decided to develop an RSS reader. There are a number of readers available, some of them free, some open source, and some for sale. Most take the same form: a standalone application that maintains a list of feeds, periodically retrieves a list of article references from each feed, and then lets users read each article.

Our take on this is that RSS readers don't actually warrant being applications that are separate from your normal day-to-day browsing tasks. If we could put the list of articles from the feeds into a side-bar, then you could read each article in the browser just like you read any other article. Which of course means you can do things with this article that you normally do with any other article, such as bookmarking it, running your voice reader on it, and so on.

Anyway, that's a little off the main subject, but the point is that as we began to write the application we obviously realised that we needed a format for storing the users' lists of feeds. We looked around, and the most common syntax we came across was OPML.

OPML

OPML stands for Outline Processor Mark-up Language, and was actually intended to be part of a mechanism for navigating hierarchical lists, even to the level that part of its format is to say which nodes should be expanded and which contracted in the user interface, as the user moves through the data. It's quite a short specification [OPMLSPEC], but what little it does, it does well.

However, as RSS feeds became increasingly popular, the means of navigating a set of lists became the list itself, and before long OPML became the language of choice for storing collections of RSS feeds. Of course this isn't the worst thing that could happen, but we were disinclined to use OPML as our storage format for the formsPlayer RSS reader for two reasons:

  1. We really didn't want to have to design YACF -- Yet Another Configuration Form -- to allow users to maintain their list of favourite feeds. Every time we designed a new IA we were having to come up with a new format to store the data, and then we had to build YACF to manage that data.

  2. We also had this idea that any configuration file should be readable in a web-browser. We weren't completely wedded to this idea in the sense that if we found a great format that we couldn't make work with this feature we'd live without this feature. But we did like the idea that if you stumbled across a configuration file in a directory somewhere you could just double-click on it and see what the file was for. And if you edited the file you could work out pretty easily where to change stuff if you wanted to.


HTML

These two reasons led to using HTML as the format for storing our configuration information; obviously it's readable in a browser, but it also provided us with a common structure for our data, which meant that we didn't have to keep re-inventing the wheel with each new Internet Application that we devised. Let's look at a typical feed list, first in OPML, and then XHTML 2.0.

The OPML version of my list of RSS feeds looks something like this:

<?xml version="1.0"?>
<opml version="1.0">
<head>
<title>Mark Birbeck's Feeds</title>
</head>
<body>
<outline
text="BBC News | UK | World Edition"
title="BBC News | UK | World Edition"
type="rss"
version="RSS"
xmlUrl="http://news.bbc.co.uk/rss/newsonline_world_edition/uk_news/rss091.xml"
htmlUrl="http://news.bbc.co.uk/"
description="Updated every minute of every day - FOR PERSONAL USE ONLY."
/>
</body>
</opml>

The attributes version, xmlUrl, htmlUrl and description are not actually part of OPML, but have been added by convention to deal with RSS feed lists. What is part of OPML though is that the outline element could contain further outline elements if we wanted. For example:

<body>
<outline text="News">
<outline
text="BBC News | UK | World Edition"
title="BBC News | UK | World Edition"
type="rss"
version="RSS"
xmlUrl="http://news.bbc.co.uk/rss/newsonline_world_edition/uk_news/rss091.xml"
htmlUrl="http://news.bbc.co.uk/"
description="Updated every minute of every day - FOR PERSONAL USE ONLY."
/>
</outline>
</body>
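
As a rough sketch of what that nesting means for a reader application, the following Python walks the outline tree and collects every feed it finds, remembering which folders it passed through on the way. It assumes the conventional xmlUrl attribute marks a feed; the function name collect_feeds is invented for this illustration.

```python
import xml.etree.ElementTree as ET

# The nested feed list from the example above, wrapped in a full OPML document.
OPML_DOC = """\
<opml version="1.0">
  <head><title>Mark Birbeck's Feeds</title></head>
  <body>
    <outline text="News">
      <outline
        text="BBC News | UK | World Edition"
        type="rss"
        xmlUrl="http://news.bbc.co.uk/rss/newsonline_world_edition/uk_news/rss091.xml"
        htmlUrl="http://news.bbc.co.uk/"/>
    </outline>
  </body>
</opml>
"""


def collect_feeds(parent, path=()):
    """Yield (folder path, title, feed URL) for each outline with an xmlUrl.

    Hypothetical helper: treats any outline carrying xmlUrl as a feed
    and any other outline as a folder.
    """
    for child in parent.findall("outline"):
        if child.get("xmlUrl"):
            yield path, child.get("text"), child.get("xmlUrl")
        # outline elements may nest to any depth, so recurse into each one.
        yield from collect_feeds(child, path + (child.get("text"),))


feeds = list(collect_feeds(ET.fromstring(OPML_DOC).find("body")))
for path, title, url in feeds:
    print(" / ".join(path), "->", title, url)
```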

Anyway, returning to the first example, the first thing that strikes you is that we have head, title and body elements -- OPML is already looking like a strong candidate for being converted to HTML. But it gets better; XHTML 2 has a new type of list element for marking up navigational lists, and when they are nested they pretty much mirror OPML's outline layout. The following XHTML 2.0 could be used to represent the same RSS feed list as our previous OPML example:

<html xmlns="http://www.w3.org/2002/06/xhtml2/"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<head>
<title>Mark Birbeck's Feeds</title>
</head>
<body>
<nl>
<label>News Feeds</label>
<li href="http://news.bbc.co.uk/rss/newsonline_world_edition/uk_news/rss091.xml">
<meta property="dc:description">
Updated every minute of every day - FOR PERSONAL USE ONLY
</meta>
BBC News | UK | World Edition
</li>
</nl>
</body>
</html>

Here we can see another new feature of XHTML 2.0, which is that the meta element can appear anywhere in the document, and when it does it need not apply to the document as a whole. When placed in the head element it will usually apply to the document -- as it has done traditionally -- but if, as in this example, the meta element sits inside an element that carries an @href, then our metadata applies to the resource that @href points to.
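
The rule is easy enough to sketch in code. The Python below is a simplified reading of it, not the full resolution rules from the XHTML 2.0 draft: it walks the tree, and the nearest enclosing @href (if any) becomes the subject of each meta element. The dc:creator line in head, the dc namespace declaration, and the names statements and DOCUMENT are all assumptions added for this illustration.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/2002/06/xhtml2/}"
DOCUMENT = ""  # stand-in subject meaning "the document as a whole"

FEED_LIST = """\
<html xmlns="http://www.w3.org/2002/06/xhtml2/"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <head>
    <title>Mark Birbeck's Feeds</title>
    <!-- dc:creator is not in the original example; added for illustration -->
    <meta property="dc:creator">Mark Birbeck</meta>
  </head>
  <body>
    <nl>
      <label>News Feeds</label>
      <li href="http://news.bbc.co.uk/rss/newsonline_world_edition/uk_news/rss091.xml">
        <meta property="dc:description">
          Updated every minute of every day - FOR PERSONAL USE ONLY
        </meta>
        BBC News | UK | World Edition
      </li>
    </nl>
  </body>
</html>
"""


def statements(element, subject=DOCUMENT):
    """Yield (subject, property, value) for every meta element.

    Simplified rule: an element with @href becomes the subject of any
    metadata nested inside it; otherwise the subject is inherited.
    """
    subject = element.get("href", subject)
    for child in element:
        if child.tag == NS + "meta":
            value = " ".join((child.text or "").split())  # tidy whitespace
            yield subject, child.get("property"), value
        else:
            yield from statements(child, subject)


triples = list(statements(ET.fromstring(FEED_LIST)))
for subject, prop, value in triples:
    print(subject or "(this document)", prop, value)
```

The metadata in head ends up attached to the document, while the description ends up attached to the BBC feed URL.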

Universal Documents

What's great about this format is that there is actually nothing RSS-specific about it; the same structure could be used to store a list of bookmarks or your friends' web-sites. The fact that the URLs in this document are specifically RSS feeds is significant at the level of the application, not the document. (Of course, if we have a document that has both bookmarks and RSS feeds then we do need to know the difference -- but that's a topic for another day.)

And as we discussed before, the document can be read in a browser ... once browsers support XHTML 2.0!

Links

[OPMLSPEC] OPML Specification, http://www.opml.org/spec
[RFC2822] RFC 2822, http://www.faqs.org/rfcs/rfc2822.html
[XHTML2] XHTML™ 2.0, http://www.w3.org/TR/xhtml2/
