Monday, February 19, 2007

Types of XML Content Interoperability: Pros and Cons

In my last post, I talked about why we need XML interoperability. Now, let's talk about different strategies for implementing interoperability. We'll also discuss the pros and cons for each approach.

There is a common thread running through each approach: XSLT. What makes XML remarkably flexible and resilient (and widely adopted) is its ability to be transformed into so many different formats for both human and computer consumption. It's also why XML interoperability can even be discussed.

Types of XML Interoperability

There are three basic strategies for achieving interoperability between XML document standards:

  • Content Model Interoperability
  • Processing Interoperability
  • Roundtrip Interoperability

Each of these approaches has valid use case scenarios and should not be dismissed out of hand. Yet each makes certain assumptions about business processes and environments that hold in some circumstances but are less than optimal in others.


Content Model Interoperability

Content Model Interoperability centers on enabling all or part of one standard's content model to be included as part of another standard. For example, DITA's specialization capabilities could be employed to create custom topic types for DocBook sections or refentries (in a DITA-like way). Conversely, DocBook's DTDs are designed to support customizations layered on top of the core content model.

In addition to customizing the DTDs (or Schemas), there is an additional step to support the new content in the standard: You need to account for these custom elements in the XSLT stylesheets - for each intended output format.
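To make that second step concrete, here's a rough sketch of what a stylesheet customization layer might look like for HTML output, assuming (purely for illustration) that you've pulled a DITA-style codeblock element into your DocBook customization. The import path and element name are hypothetical; a real layer would also need to handle attributes, nesting, and every other output format you produce:

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Customization layer: import the stock DocBook HTML stylesheets,
         then add a rule for the "foreign" element admitted into the DTD. -->
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <!-- Adjust this path for your own DocBook XSL installation. -->
      <xsl:import href="docbook-xsl/html/docbook.xsl"/>

      <!-- Hypothetical element borrowed from the other standard's content
           model; without a rule like this it falls through to the default
           templates and loses its formatting. -->
      <xsl:template match="codeblock">
        <pre class="codeblock">
          <xsl:apply-templates/>
        </pre>
      </xsl:template>

    </xsl:stylesheet>

And you would need an equivalent rule in the print (FO) stylesheets, and in any other output you generate, which is exactly why this step is easy to underestimate.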

While on the surface this approach appears to be the most logical way to ensure that your content can interoperate with another standard, it is not an approach for the faint of heart. Working with DTDs and schemas is doable, but it requires a thorough understanding of both standards before you begin. There are other limitations:

  1. This approach allows you to accept content from one standard, but it doesn't allow you to share or leverage this content with other collaboration partners. In effect, this approach is "shoehorning" content from one standard into yours. However, if you are receiving content from only one partner (and you aren't sharing content elsewhere), this could be a viable approach. But keep in mind...
  2. You and your partner are now both bound to fixed versions of the standards you are sharing content between. If either of you decides to move to a later version of the respective standard, you may have to rework your customizations to support the new content model. You also run the risk that your legacy content won't validate against the new DTDs or schemas.
  3. Be aware that while placing content in different namespaces may provide "short-term" relief, it can also cause "long-term" headaches (much in the same way that Microsoft's COM architecture introduced us all to "DLL Hell"). It also means that your content must be in a namespace (even if it is the default one).

Processing Interoperability

In this approach, content from one standard is either transformed or pre-processed into the other using XSLT. This approach is less risky in some ways than Content Model Interoperability: You don't have to maintain a customized set of DTDs to enable content interoperability, and it's a whole lot easier to share the content with partners once it's transformed into a single DTD.
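To give a flavor of what that preprocessing looks like, here is a deliberately tiny, hedged sketch that maps a handful of DITA elements to rough DocBook equivalents. A real transform would need many more rules, plus attribute handling and decisions about structures that don't line up:

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Minimal DITA-to-DocBook preprocessing sketch (not production-ready). -->
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <!-- A DITA topic becomes a DocBook section; its title carries over. -->
      <xsl:template match="topic">
        <section>
          <xsl:apply-templates/>
        </section>
      </xsl:template>

      <xsl:template match="title">
        <title><xsl:apply-templates/></title>
      </xsl:template>

      <!-- Simple one-to-one renames. -->
      <xsl:template match="p">
        <para><xsl:apply-templates/></para>
      </xsl:template>

      <xsl:template match="ul">
        <itemizedlist><xsl:apply-templates/></itemizedlist>
      </xsl:template>

      <xsl:template match="li">
        <listitem><xsl:apply-templates/></listitem>
      </xsl:template>

    </xsl:stylesheet>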

There is a slightly different angle you can take: Rather than preprocessing the content into your DTD, you could use your XSLT stylesheets to incorporate the "foreign" content into the final output. In some cases, where you are simply "rebranding" content, this might be a viable approach, but keep in mind that it may mean additional investment in incorporating other tools into your tool chain. For example, DITA and DocBook content employ very different processing models (i.e., the DITA Open Toolkit vs. the DocBook XSLT stylesheets), and it may take a hefty development effort to integrate these tools properly in your environment. And if you intend to leverage the content elsewhere in your own content, this angle becomes a lot harder to implement.

For organizations sharing content back and forth, or for groups that are receiving content from one partner and are sharing it with other partners in the pipeline, this could be a reasonable approach. Still, there are potential risks here:

  1. This "uni-directional" approach is more flexible than Content Model Interoperability, but you still potentially have the same DTD/Schema version problem. And it only works realistically for one pair of standards, for example DocBook and DITA.
  2. If your partner begins creating content in a newer version of their DTD, you may have to upgrade your transforms before you can use that content.
  3. You still need to be well-versed in both standards to ensure each plays nicely in your environment.

  4. Be prepared to deal with validation issues. While each standard includes markup for common content components like lists, tables, and images, there are structures that do not map cleanly. In those cases, you will need to make some pretty hard decisions about how they will (or will not) be marked up.


Roundtrip Interoperability

This is perhaps the most ambitious approach to creating interoperable content: it encompasses being able to transform one standard into another and then round-trip that content back into the original standard. As with Processing Interoperability, you still have some very tricky issues to contend with:

  1. How do you handle round tripping between different versions of the standards? The net result is that you will need multiple stylesheets to support each version permutation.

  2. It's bi-directional, meaning that the round trip only works between the two standards (and with specific versions of those standards).

The following figures (taken from Scott Hudson's and my presentation at DITA 2007 West) illustrate the problem:




In this example, we're only dealing with two standards, DocBook and DITA. But as you can see, there are numerous permutations that are potential round-trip candidates. Now let's add another standard, like ODF:






You can see that this quickly becomes a very unmanageable endeavor.
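To put a rough number on it, suppose (purely for illustration) that you need to support two versions of each standard and a direct stylesheet for every ordered pair of document types. The count grows roughly as n x (n - 1):

    2 standards x 2 versions = 4 document types  ->  4 x 3 = 12 directed transforms
    3 standards x 2 versions = 6 document types  ->  6 x 5 = 30 directed transforms

Add one more version or one more partner variant and the maintenance burden climbs even faster.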



Conclusion

I've gone over three different strategies for approaching XML interoperability, situations where they work well, and some of the problems you may encounter when choosing one of these strategies. In my next post, I'll look at another approach for handling XML interoperability.

Saturday, February 17, 2007

Why XML Content Interoperability Is Critical

Content re-use, sharing, and re-purposing have long been the "Holy Grail" of publishing. There have been numerous attempts at solving this problem, all with varying degrees of success. Yet as publishing has grown and evolved over time, technologies have emerged to solve one set of problems, only to find that other technical challenges have "filled the void."

Back In The Day...

In the 1980's, desktop publishing tools opened the doors for a brand new set of professionals to create and produce high-quality, lower-cost publications. Tools like Frame, Quark, WordStar, WordPerfect, and Word allowed veritable novices to quickly and easily create professional-looking typeset documents.

Yet, while these tools opened the door to cheaper, faster publication cycles, they also created a very significant problem: Each tool created content in its own proprietary format. And these formats were not compatible with other tools! Sure, this wasn't a problem if all your content was in the same format and you didn't have to leverage, share, or reuse content from other tools. But the moment you needed to incorporate information from any other tool, life became very difficult.

The unintended consequences of trying to convert incompatible formats to your own usually resulted in "one-offs," but more importantly, the incurred costs (including the psychological and physical trauma) were very high. It wasn't uncommon for a single writer/editor/DTP professional to spend several weeks converting a manual. Now imagine your organization sharing content with several partners using different formats - now we're talking about real money!

Enter SGML...

In the late 80's and early 90's, Standard Generalized Markup Language (SGML) emerged with the promise that it would solve the format conundrum that desktop publishing tools had wrought on us. The idea was brilliant: All content would be stored as text, and formatting and other semantics would be included using "tags" and "attributes" that could easily be interpreted.

Robert Burns could not foresee that his oft quoted verse would apply here:

"The best-laid schemes o' mice an' men
Gang aft agley,
An' lea'e us nought but grief an' pain,
For promis'd joy"

The problem with SGML was that there were few tools that supported this very complex standard. What tools were available were very expensive and out of reach for all but the larger organizations that could afford the investment.

HTML and World Wide Web

The mid-90's introduced HTML and the World Wide Web, along with a brand new medium for sharing content. HTML is in fact an SGML application. But unlike other SGML applications, it was widely adopted because of a unique piece of software, the web browser, made popular by the likes of Marc Andreessen. Another factor in its adoption was its very small, easy-to-learn tag set - in essence, it contained the basic structural components found in most publications:

  • Headings (Sections)
  • Paragraphs
  • Lists
  • Tables
  • Images
  • Inline formatting elements (bold, italic, monospace)

Still, there were several problems with HTML. The markup was primarily focused on presentation, so content semantics were lost: Was an ordered list item intended to be a procedural step or just the ith item in a numbered list? Is that monospaced phrase a command or an environment variable?

The second problem was that web browsers enabled the misuse and abuse of elements in such a way that even the most basic semantics were buried in the mishmash of tags.

XML!

In the late 1990's, the W3C developed the eXtensible Markup Language (XML). It promised a lightweight version of SGML that could be used on the web. It has lived up to its promise, and then some. Like HTML, XML is now widely adopted because of the availability of cheap (even free) tools. Along with these tools, major programming languages (Java, VB, C++, .NET) support XML through easy-to-use APIs.

SGML applications, like DocBook, soon ported their DTDs to XML. Editing tools like XMetaL and Arbortext Editor supported XML and made it easier for the "uninitiated" to create structured content.

The widespread adoption of XML also brought with it another problem: The number of XML Document Standards has grown significantly in the past 10 years. Today, standards like TEI, DocBook, DITA, and even ODF (Open Document Format, used by the OpenOffice tools) are widely used. There are also countless variants of some of these standards.

Now the problem isn't a lack of semantics or incompatible formats; it's the proliferation of XML document standards, each of which has taken a different approach toward producing content. All of these standards have valid practical applications (which I will not discuss here - suffice it to say that each takes into account different forms and functions for producing content).

In an environment where collaboration and new types of partnerships are emerging, the conundrum now is: how do I share/re-use/re-purpose content from multiple partners using different XML standards? How do I mitigate risks to my budget and schedule by leveraging disparate content? In my experience, this is sort of like a Chinese finger puzzle: It's easy to adopt a standard (one that maps to your own processes), but it's a whole lot more complicated to work around it when other standards are introduced!

This is why content interoperability is so important. In my experience, this problem has often been a gating factor in collaboration. As I mentioned in a previous entry, organizations often develop their own "language" and semantics around well-established processes. It is also true that many organizations will "reject" processes that run counter to their own. The end result is an "informational impasse": information that doesn't fit the organization's model is quickly dismissed.

Yet many organizations have been asked to collaborate with other organizations using different "languages." This also applies to XML standards. Some organizations we've consulted with are producing books and collections of articles, a job best suited to DocBook. Yet in the future, they may partner with other organizations to produce learning content, which is not one of DocBook's strengths. Another organization has an OEM partnership in which both partners are using different DITA specializations. And within that same company, there are organizations using DocBook variants. All of this content is leveraged and reused in many different ways.

There are several strategies for dealing with disparate standards. And I'll discuss these in future posts.

Thursday, February 8, 2007

A "Sociology" of XML Languages

In a perfect world, XML content could easily be shared between different users and organizations because everyone would be sharing the same markup and semantics. Information interchange could be seamless; content could be repurposed and reused with minimal effort between different functional teams; XML processing tools could be optimized.

Yet, there are numerous reasons why we see so many different XML grammars used by different organizations. I'll focus on two of these briefly:
  1. Organizational Dynamics
  2. Multiple XML Standards

Organizational Dynamics

I never thought that my background in Sociology would ever be useful, but it certainly is applicable here (though I'm very rusty, since I haven't consciously thought about the subject in over 10 years): Looking back at the works of Emile Durkheim, Max Weber, and Frederick Winslow Taylor, we see that organizations are structured around distinct divisions of labor to enable individuals to specialize their skills and work on discrete aspects of the "production process" (keep in mind that most of these theories emerged during the peak of the Industrial Revolution).

What's more interesting here is how groups are organized. In part, this is shaped by many different factors, including the industry vertical, the size of the organization, and its relationships with other organizations. There is plenty of literature on these subjects, so I won't delve into them here.

The key takeaway is that all these factors have a direct effect on the organization's processes, meaning that for information development groups (Tech Pubs, Training, etc.), this affects how information (content) is created, managed, and distributed.

An organization's processes also have an interesting side effect on language. Organizations create their own vocabulary to express, and even rationalize, their processes (there are other implications, like "group identification," at work here too). For example, there is an often-quoted line from the movie Office Space: "Did you get the memo about the TPS reports?" Even within the same industry, where there are common terms (like "GUI" or "menu" in software), distinct "dialects" evolve over time, much in the same way that there are different Spanish dialects: A Spanish speaker from Spain could probably converse with a Spanish speaker from Argentina, but there might be words or phrases that aren't understood.

And this manifests itself in the XML syntax these organizations adopt to create content. A logical strategy for these organizations is to adopt a known XML standard, like DocBook or DITA, that fits their processes the closest, and then modify that standard to incorporate words or phrases of their own into their XML syntax.

Multiple XML Standards

One of the incredibly powerful aspects of XML is its ability to evolve over time to support different syntaxes. The unintended consequence, however, is that we now have several well-known XML document standards like DocBook and DITA. While they are different architecturally, and to some extent semantically, they're both targeted at virtually the same audience (information developers), produce the same kinds of output formats (PDF, HTML Help, HTML, JavaHelp), and, probably more important, contain the same kinds of structural components (paragraphs, lists, tables, images, formatting markup), albeit using different element names ("A rose by any other name would smell as sweet" - Romeo and Juliet).
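As a trivial illustration of the "rose by any other name" point, here is the same paragraph-plus-list structure in the two vocabularies (the sentence itself is made up):

    <!-- DocBook -->
    <para>Install the widget:</para>
    <itemizedlist>
      <listitem><para>Unpack the box.</para></listitem>
    </itemizedlist>

    <!-- DITA -->
    <p>Install the widget:</p>
    <ul>
      <li>Unpack the box.</li>
    </ul>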

Yet having multiple standards can create an "informational impasse," where the DTDs get in the way of sharing content across organizations. And for many organizations, this is a real problem. Across all industry verticals, we're seeing new forms of collaboration and partnering across companies (and organizations), along with consolidation (mergers and acquisitions). From an XML content perspective, the question is, "My content is in 'X' and my partners' content is in 'Y' and 'Z'. How do I reconcile these disparate document types?"

And therein lies the need for a Doc Standards Interoperability Framework, which I will describe in future posts.

DITA 2007 West: Day 3

Had some great conversations with France Baril and Yas Etessam on a wide variety of topics, including DTD development and maintenance, data and process modelling, and observations and trends in publishing - definitely very interesting.

Scott and I gave our presentation on our proposed DocStandards Interoperability Framework. We got great feedback from France and Eric Hennum. The main purpose of the presentation was to stimulate some interest and further discussion on enabling interchange between disparate standards, and we accomplished what we set out to do. The next step is to use the momentum we got from this conference and start the process of making this an OASIS Technical Committee.

Overall, a very good conference. I met some really sharp people whom I definitely hope to keep in touch with.

On to OpenPublish in Baltimore...

Wednesday, February 7, 2007

DITA West 2007: Day 2 - Dimented and Topic-oriented

Amber Swope from JustSystems gave a great keynote presentation on localization issues with DITA content. Interestingly, it doesn't just apply to DITA; it has broad implications for any organization that has (or will have) L10N or I18N as part of its deliverable stream, regardless of whether it's using structured or unstructured content. With that said, DITA does have unique advantages in minimizing L10N/I18N costs in that it can encapsulate (read: mitigate) the volume of content that needs to be translated.

In the same vein, Amber gave another presentation on "controlling" your content using a CMS and other processes to provide efficiency, content integrity, and protection from liability (life, limb, and legal).

The highlight, however, was Paul Masalsky's (EMC) presentation. He spoke about the trials and tribulations of integrating DITA in an enterprise environment. The most memorable part of his presentation (and the conference thus far) was his DITA rap. It was AMAZING! I, along with the rest of the audience, was completely wowed! If I can get a hold of the lyrics, I will post them here. I don't remember all of them, but one verse rhymed "dimented" and "topic-oriented." It was fantastic! Who says geeks are one-dimensional?!!!

Scott Hudson and I are making the last-minute preparations for our presentation tomorrow. It promises to be thought-provoking and to provide some neat demonstrations of how content from different standards like DocBook, DITA, ODF, and others can interoperate. I think this definitely has potential for enabling heterogeneous environments to solve the difficult problem of "how do I reconcile content that doesn't quite fit my model?"

Hope to see you there.

Monday, February 5, 2007

DITA West 2007: Day 1

It was finally nice to put faces to the names I've worked with, in addition to meeting some new folks. So far it looks like the conference has about 80 or so attendees. It could be a result of the Super Bowl that attendance was a little low. Hopefully more will show up tomorrow. Unfortunately, Michael Priestley and Don Day aren't here. I was looking forward to talking with them more about the Interoperability Framework.

Lou Iuppa from XyEnterprise gave the opening keynote. Essentially, the thesis was that DITA would benefit from the use of a CMS.

Looks like the majority of the attendees are technical writers either new to DITA or in the process of implementing a DITA solution in their organization. Definitely hoping to see some more technical presos, though Yas Etessam's presentation on "Enabling Specializations in XMetaL Author" was very interesting.

There are a fair number of vendor booths in the hallway outside the conference rooms, including one from Flatirons Solutions. JustSystems, PTC, and MarkLogic were among the others.

Microsoft to Support XSLT 2.0

Things are looking up all over! Microsoft announces that it will support the XSLT 2.0 Standard: http://blogs.msdn.com/xmlteam/archive/2007/01/29/xslt-2-0.aspx

Saturday, February 3, 2007

XML - It's not just for data. Really.

I had an illuminating moment a few weeks ago when I met with some developers at a client about an XML project that I am working on for them. The project involves some pretty sophisticated Java code to execute a transformation of XML content into multiple formats like HTML, PDF and Microsoft HTML Help. Essentially, the work involves extending an existing framework to support transforming any Document Markup Language (DITA, DocBook, TEI, custom DTD/Schema, you name it) instance to whatever the desired output should be.

Yet after a few days with these extremely bright developers, it became quite clear to me that they viewed XML strictly as a data format - something that contained configuration data or content manifests. They had no experience with or context for thinking about XML as content. And here I was, explaining to them that XML is a very rich medium for technical manuals and even commercial publications (but that's a whole other topic).

Having worked with DocBook for the last 6 years, and more recently DITA, my world view of XML was focused strongly on XML for documentation. Yes, I've created lots of XML for configuration files and other data (SOAP packets) in my development work. But generally I associate XML with technical document content.

And then it dawned on me: Within the development community at large, there appears to be a large disconnect about how XML is and can be used. For developers like the ones I spoke with, XML is a rich data format with the ability to hold all kinds of data in a hierarchical way - something that Java .properties files and flat text files can't do easily. For others like myself, XML is a rich content format that provides a way of semantically describing the content. And in fact, the content becomes data, not just text.
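A quick side-by-side shows the two world views. The first snippet is a made-up configuration file (XML as data); the second is a DocBook-style fragment where the markup describes what the words mean (XML as content):

    <!-- XML as data: a hypothetical configuration file -->
    <server>
      <host>localhost</host>
      <port>8080</port>
    </server>

    <!-- XML as content: a DocBook-style fragment -->
    <para>Set the <envar>JAVA_HOME</envar> variable before running the
    <command>build</command> script.</para>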

There is a proliferation of XML standards: RSS, WSDL, SMIL, and so on. These are powerful data structures that have advanced the Web from serving up static HTML pages to delivering a wide variety of new services and applications. Yet, ironically, all of these XML standards have their origins (indirectly) in SGML, some of whose earliest applications were used to mark up content - HTML being perhaps the most well known. DocBook started out as an SGML application in the early 90's and was ported to XML later on.

This goes to show that XML is an immensely powerful technology that lives up to its name: eXtensible Markup Language. It has become firmly entrenched in so many technologies. Interestingly, it's because XML is both data and content that we find two naturally occurring views in the development community. We can't know everything, and as a consequence, we tend to focus on specific technologies to become expert in (myself included). Look at Java or C# as examples: there are so many distinct areas in which these languages can be used that we can't possibly be expert in all of them.

This meeting proved to be illuminating in other ways. I began thinking about yesterday's post and Microsoft's decision not to support XSLT 2.0 or XPath 2.0. Perhaps the view of XML as data is much more common than the view of XML as content. And maybe this distinction is an artifact of "conventional" ways of thinking about "data" and "human-readable" content: Data lives in databases and configuration files; content lives in Word and HTML files; and ne'er the twain shall meet - and for the most part, they didn't.

XML puts data and content on an equal playing field, which opens up a whole variety of possible applications, some of which are only beginning to emerge. It has also opened up existing applications to new ways of sharing information. For example, Microsoft's Office Open XML standard will allow interchange to and from a wide variety of sources, perhaps even from DocBook XML content!

My Technorati Profile

I registered for a Technorati Profile. Still busy preparing for the DITA 2007 West Conference. My day job keeps getting in the way :)

Friday, February 2, 2007

XSLT 2.0 is Fantastic, but there are some hurdles

When XSLT 1.0 became a W3C Recommendation back in 1999, I thought it was the coolest thing out there. Oh, the things I could do with XML + XSLT 1.0 + Xalan/Saxon! Later on, when I wanted to do things like grouping and outputting to more than one result file, I realized this wasn't built in. Even now, I can't fully wrap my head around the Muenchian Method for grouping; and for outputting multiple result files, I had to rely on XSLT extensions. This meant that my stylesheets were now bound to a particular XSLT processor, which completely sent shivers up my spine - the whole idea behind XSLT, in my (perhaps idealistic, naive) view, was that you should be able to take an XML file and any compliant XSLT engine and create an output result (set). Still, despite the warts and shortcomings, XSLT 1.0 proved to be a faithful companion to my XML content.
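For anyone who hasn't had the pleasure, Muenchian grouping in 1.0 means pairing xsl:key with a generate-id() comparison. A minimal sketch (the element and attribute names are invented for the example):

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- XSLT 1.0 Muenchian grouping: group <item> elements by @category. -->
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <xsl:key name="items-by-category" match="item" use="@category"/>

      <xsl:template match="/catalog">
        <!-- Visit only the first item in each category... -->
        <xsl:for-each select="item[generate-id() =
                                   generate-id(key('items-by-category', @category)[1])]">
          <h2><xsl:value-of select="@category"/></h2>
          <!-- ...then pull back every item in that category. -->
          <xsl:for-each select="key('items-by-category', @category)">
            <p><xsl:value-of select="."/></p>
          </xsl:for-each>
        </xsl:for-each>
      </xsl:template>

    </xsl:stylesheet>

It works, but it's hardly the first thing you'd guess from reading the spec.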

Enter XSLT 2.0. In so many ways it is so much better than its predecessor! Built-in grouping functionality and multiple output result documents are now part of the specification! Huzzah!
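Here's the same grouping problem in 2.0, with each group written to its own output file - no keys, no generate-id(), and no processor-specific extensions (again, the element names are invented):

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- XSLT 2.0: built-in grouping plus multiple result documents. -->
    <xsl:stylesheet version="2.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <xsl:template match="/catalog">
        <!-- xsl:for-each-group replaces the whole Muenchian key dance... -->
        <xsl:for-each-group select="item" group-by="@category">
          <!-- ...and xsl:result-document writes one file per group. -->
          <xsl:result-document href="{current-grouping-key()}.html">
            <html>
              <body>
                <h2><xsl:value-of select="current-grouping-key()"/></h2>
                <xsl:for-each select="current-group()">
                  <p><xsl:value-of select="."/></p>
                </xsl:for-each>
              </body>
            </html>
          </xsl:result-document>
        </xsl:for-each-group>
      </xsl:template>

    </xsl:stylesheet>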

But wait! There's more! In-memory DOMs (very nice!), user-defined functions (very handy), XQuery, XPath 2.0, unparsed-text processing (very handy for things like embedding CSS stylesheets or processing CSV files), and better string manipulation, including regex support. This is just a taste of what's in the latest version.

It just became a W3C Recommendation (along with XQuery and XPath 2.0) last month. Yeah! Finally!

Still, this latest version has major obstacles to overcome before it can enjoy widespread adoption. There's only one notable XSLT 2.0-compliant engine: Saxon 8, by Dr. Michael Kay. It is developed in Java, but there is a .NET port (via the IKVM libraries).

Not that I have anything against Saxon - it is outstanding. Yet where is Xalan? MSXSL? Why haven't they come to the party? Scouring the blogs and mailing lists, there doesn't appear to be any activity on Xalan toward an XSLT 2.0 implementation. Microsoft's current priority is XLinq, and it has decided that it will support XQuery, but not XSLT 2.0 or XPath 2.0.

Microsoft's decision not to implement XSLT 2.0 and XPath 2.0 could have an unfortunate effect on adoption of these standards. While XQuery is extremely powerful (and wicked fast) and can do all the things that XSLT can do, I wouldn't necessarily recommend trying to create XQuery scripts to transform a DocBook XML instance (the XSLT is already complex enough).

I would rather write matches against the appropriate templates than attempt to write a long, complex set of switch cases to handle DocBook's content model. That said, it could be done; it just wouldn't be a trivial task.

XSLT 2.0 is amazingly powerful, with many of the features that were lacking in the 1.0 Recommendation. In fact, the DocStandards Interop Framework intends to use XSLT 2.0 to take advantage of many of these new features to support things like generating topic maps or bookmaps from the interchange format. It looks like Saxon will be the de facto engine of choice - though that's not a hard choice to make.

Go to DITA West, Young Man

My colleague Scott Hudson and I are presenting a paper at the DITA 2007 West Conference in San Jose, February 5-7. I am very excited about this.

The thesis of the paper focuses on proposing a DocStandards Interoperability Framework to enable various document markup languages like (but not limited to) DocBook, DITA, and ODF to share and leverage content by using an interchange format that each standard can write to and read from.

There are several advantages to this approach:
  • It doesn't impede future development of any standard, since the interchange format is "neutral." This means that new versions of a document markup standard can leverage content from earlier versions
  • Since it is neutral, it can potentially be used by virtually any document markup standard

This work stems from Scott's and my involvement in the DocStandards Interoperability List, an OASIS forum. We're hoping to spark interest in the XML community to push this along and create a new OASIS Technical Committee for DocStandards Interoperability.

We're still in the process of editing the whitepaper, which will be posted on Flatirons Solutions' website in the near future.