Saturday, February 3, 2007

XML - It's not just for data. Really.

I had an illuminating moment a few weeks ago when I met with some developers at a client about an XML project that I am working on for them. The project involves some pretty sophisticated Java code to execute a transformation of XML content into multiple formats like HTML, PDF and Microsoft HTML Help. Essentially, the work involves extending an existing framework to support transforming any Document Markup Language (DITA, DocBook, TEI, custom DTD/Schema, you name it) instance to whatever the desired output should be.

Yet after a few days with these extremely bright developers, it became quite clear to me that they viewed XML strictly as a data format, something that contained configuration data, or content manifests. They had no experience or context to think about XML as content. And here I was explaining to them that XML is a very rich medium for technical manuals and even commercial publications (this is a whole other topic).

Having worked with DocBook for the last six years, and more recently with DITA, I had come to view XML primarily as a format for documentation. Yes, I've created lots of XML for configuration files and other data (SOAP packets) in my development work. But generally I correlate XML with technical document content.

And then it dawned on me: within the development community at large, there appears to be a big disconnect about how XML is and can be used. For some, like the developers I spoke about, XML is a rich data format with the ability to hold all kinds of data in a hierarchical way, something that Java .properties files and flat text files can't do easily. For others, like myself, XML is a rich content format that provides a way of semantically describing the content. And in fact, the content becomes data, not just text.
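To make the two views concrete, here is a minimal Python sketch using only the standard library. The element names (`config`, `server`, `output`, `para`, `command`, `filename`) and the sample values are invented for illustration; they are not taken from the client project, and the content fragment is only DocBook-like, not validated DocBook. The point is simply that the same parser reads both documents, but the questions you ask of them differ.

```python
import xml.etree.ElementTree as ET

# "XML as data": a hypothetical configuration file.
# The values live in attributes and the hierarchy groups related settings.
CONFIG_XML = """
<config>
  <server host="example.org" port="8080"/>
  <output format="pdf"/>
</config>
"""

# "XML as content": a hypothetical DocBook-like paragraph.
# Text and markup are interleaved (mixed content), and the tags
# say what each phrase means rather than how it should look.
CONTENT_XML = """
<para>Run <command>build</command> to produce <filename>manual.pdf</filename>.</para>
"""


def read_config(xml_text):
    """Pull typed settings out of the data-centric document."""
    root = ET.fromstring(xml_text)
    server = root.find("server")
    return {
        "host": server.get("host"),
        "port": int(server.get("port")),
        "format": root.find("output").get("format"),
    }


def read_content(xml_text):
    """Return the readable text plus the semantically tagged phrases."""
    root = ET.fromstring(xml_text)
    # itertext() walks the mixed content in document order.
    plain_text = "".join(root.itertext())
    # Each child element labels a phrase, so the prose is queryable like data.
    tagged = [(child.tag, child.text) for child in root]
    return plain_text, tagged
```

Calling `read_config(CONFIG_XML)` yields a dictionary of settings, while `read_content(CONTENT_XML)` yields both the sentence a reader would see and a list of labeled phrases such as `("command", "build")`. That second result is what it means for content to become data, not just text.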

There is a proliferation of XML standards: RSS, WSDL, SMIL and so on. These are powerful data structures that have advanced the Web from serving up static HTML pages to delivering a wide variety of new services and applications. Yet, ironically, all of these XML standards have their origins (indirectly) in SGML, whose earliest applications were used to mark up content, HTML being perhaps the best known. DocBook started out as an SGML application in the early '90s and was ported to XML later on.

This goes to show that XML is an immensely powerful technology that lives up to its name: eXtensible Markup Language. It has become firmly entrenched in a great many technologies. Interestingly, it's because XML is both data and content that we find two naturally occurring views in the development community. We can't know everything, and as a consequence, we tend to focus on specific technologies to become expert in (myself included). Look at Java or C# as examples: there are so many distinct areas in which these languages can be used that we can't possibly be expert in all of them.

This meeting proved to be illuminating in other ways. I began thinking about yesterday's post and Microsoft's decision not to support XSLT 2.0 or XPath 2.0. Perhaps the view of XML as data is much more common than the view of XML as content. And maybe this distinction is an artifact of "conventional" ways of thinking about "data" and "human-readable" content. Data lives in databases and configuration files; content lives in Word and HTML files, and never the twain shall meet. And for the most part, they didn't.

XML puts data and content on an equal footing, which opens up a whole variety of possible applications, some of which are only beginning to emerge. It has also opened up existing applications to new ways of sharing information. For example, Microsoft's Office Open XML standard will allow interchange to and from a wide variety of sources, perhaps even from DocBook XML content!
