Saturday, February 17, 2007

Why XML Content Interoperability Is Critical

Content re-use, sharing, and re-purposing have long been the "Holy Grail" of publishing. There have been numerous attempts at solving this problem, all with varying degrees of success. Yet as publishing has grown and evolved over time, technologies have emerged to solve one set of problems, only to find that other technical challenges have "filled the void."

Back In The Day...

In the 1980s, desktop publishing tools opened the door for a brand new set of professionals to create and produce high-quality, lower-cost publications. Tools like Frame, Quark, WordStar, WordPerfect, and Word allowed veritable novices to quickly and easily create professional-looking typeset documents.

Yet, while these tools opened the door to cheaper, faster publication cycles, they also created a very significant problem: each tool stored content in its own proprietary format, and these formats were not compatible with other tools! Sure, this wasn't a problem if all your content was in the same format and you didn't have to leverage, share, or reuse content from other tools. But the moment you needed to incorporate information from any other tool, life became very difficult.

Trying to convert incompatible formats to your own usually resulted in "one-offs," but more importantly, the incurred costs (including the psychological and physical trauma) were very high. It wasn't uncommon for a single writer/editor/DTP professional to spend several weeks converting a manual. And what if your organization was sharing content from several partners, each using a different format? Now we're talking about real money!

Enter SGML...

In the late '80s and early '90s, the Standard Generalized Markup Language (SGML) emerged with the promise that it would solve the format conundrum that desktop publishing tools had wrought on us. The idea was brilliant: all content would be stored as plain text, but formatting and other semantics would be included using "tags" and "attributes" that could easily be interpreted.

Robert Burns could not have foreseen that his oft-quoted verse would apply here:

"The best-laid schemes o' mice an' men
Gang aft agley,
An' lea'e us nought but grief an' pain,
For promis'd joy"

The problem with SGML was that there were few tools that supported this very complex standard. The tools that were available were very expensive, out of reach for all but the larger organizations that could afford the investment.

HTML and the World Wide Web

The mid-'90s introduced HTML and the World Wide Web, along with a brand new medium for sharing content. HTML is in fact an SGML application. But unlike other SGML applications, it was widely adopted because of a unique piece of software: the web browser, made popular by the likes of Marc Andreessen. Another factor in its adoption was the very small, easy-to-learn tag set - in essence, it contained the basic structural components found in most publications (see the sketch after this list):

  • Headings (Sections)
  • Paragraphs
  • Lists
  • Tables
  • Images
  • Inline formatting elements (bold, italic, monospace)
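
A minimal page using this core tag set might look something like the following sketch (the content and element choices are purely illustrative, and table markup is omitted for brevity):

    <html>
      <body>
        <h1>Installing the Widget</h1>
        <p>Follow these steps to install the <b>Widget</b>:</p>
        <ol>
          <li>Unpack the archive.</li>
          <li>Run <tt>setup</tt>.</li>
        </ol>
        <img src="widget.png" alt="The Widget">
      </body>
    </html>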

Still, there were several problems with HTML. The markup was primarily focused on presentation, so content semantics were lost: was an ordered list item intended to be a procedural step, or simply the ith item in a numbered list? Was a monospaced phrase a command or an environment variable?
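
To make this concrete, here is a hypothetical step marked up first in presentational HTML and then in a semantic vocabulary like DocBook, which distinguishes commands and environment variables explicitly:

    <!-- HTML: just a list item with monospaced phrases -->
    <ol>
      <li>Set <tt>JAVA_HOME</tt>, then run <tt>install.sh</tt>.</li>
    </ol>

    <!-- DocBook: a procedural step with explicit semantics -->
    <procedure>
      <step>
        <para>Set <envar>JAVA_HOME</envar>, then run
        <command>install.sh</command>.</para>
      </step>
    </procedure>

In the HTML version, a downstream tool has no way to tell that install.sh is a command; in the DocBook version, that fact survives any re-use.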

The second problem was that web browsers tolerated the misuse and abuse of elements to such a degree that even the most basic semantics were buried in a mishmash of tags.

XML!

In the late 1990s, the W3C developed the Extensible Markup Language (XML). It promised a lightweight version of SGML that was suitable for the web. It has lived up to its promise, and then some. Like HTML, XML is now widely adopted because of the availability of cheap (even free) tools. Along with these tools, major programming languages and platforms (Java, VB, C++, .NET) support XML through easy-to-use APIs.

SGML applications, like DocBook, soon ported their DTDs to XML. Editing tools like XMetaL and Arbortext Editor supported XML and made it easier for the "uninitiated" to create structured content.

The widespread adoption of XML also brought with it another problem: the number of XML document standards has grown significantly in the past 10 years. Today, standards like TEI, DocBook, DITA, and even ODF (the Open Document Format, used by the OpenOffice.org tools) are widely used. There are also countless variants of some of these standards.

Now the problem isn't the lack of semantics or incompatible proprietary formats; it's the proliferation of XML document standards, each of which has taken a different approach toward producing content. All of these standards have valid practical applications (which I will not discuss here - suffice it to say that each takes into account different forms and functions toward producing content).
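
As a rough illustration of how differently these standards model the same information, here is a hypothetical installation procedure sketched first as a DITA task and then as a DocBook section (both fragments are simplified and omit the usual document-type declarations):

    <task id="install">
      <title>Installing the Widget</title>
      <taskbody>
        <steps>
          <step><cmd>Unpack the archive.</cmd></step>
          <step><cmd>Run the installer.</cmd></step>
        </steps>
      </taskbody>
    </task>

    <section id="install">
      <title>Installing the Widget</title>
      <procedure>
        <step><para>Unpack the archive.</para></step>
        <step><para>Run the installer.</para></step>
      </procedure>
    </section>

The content is identical, but the element names, nesting, and underlying content models are not - and that is precisely where the interoperability pain begins.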

In an environment where collaboration and new types of partnerships are emerging, the conundrum now is: how do I share/re-use/re-purpose content from multiple partners using different XML standards? How do I mitigate risks to my budget and schedule by leveraging disparate content? In my experience, this is sort of like a Chinese finger puzzle: it's easy to adopt a standard (one that maps to your own processes), but it's a whole lot more complicated to work around it when other standards are introduced!
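
To give a feel for the kind of mapping work involved, here is a minimal, hypothetical XSLT sketch that converts the DITA fragment above into its DocBook equivalent. It handles only the handful of elements shown earlier; a production-grade transform would need to cover the full content models of both standards:

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- A DITA task becomes a DocBook section -->
      <xsl:template match="task">
        <section id="{@id}"><xsl:apply-templates/></section>
      </xsl:template>
      <xsl:template match="title">
        <title><xsl:apply-templates/></title>
      </xsl:template>
      <!-- taskbody is a wrapper with no DocBook counterpart -->
      <xsl:template match="taskbody">
        <xsl:apply-templates/>
      </xsl:template>
      <xsl:template match="steps">
        <procedure><xsl:apply-templates/></procedure>
      </xsl:template>
      <xsl:template match="step">
        <step><xsl:apply-templates/></step>
      </xsl:template>
      <xsl:template match="cmd">
        <para><xsl:apply-templates/></para>
      </xsl:template>
    </xsl:stylesheet>

Even this toy example hints at the real cost: every pair of standards (and every variant) multiplies the number of mappings someone has to write and maintain.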

This is why content interoperability is so important. In my experience, this problem has often been a gating factor in collaboration. As I mentioned in a previous entry, organizations often develop their "language" and semantics around well-established processes. Many organizations will also "reject" processes that run counter to their own. The end result is an "informational impasse": information that doesn't fit the organizational model is quickly dismissed.

Yet many organizations have been asked to collaborate with other organizations using different "languages." This also applies to XML standards. Some organizations we've consulted with are producing books and collections of articles, for which DocBook is best suited. Yet in the future, they may partner with other organizations to produce learning content, which is not one of DocBook's strengths. Another organization has an OEM partnership in which both parties use different DITA specializations. And within that same company, there are organizations using DocBook variants. All of this content is leveraged and reused in many different ways.

There are several strategies for dealing with disparate standards. And I'll discuss these in future posts.
