Saturday, October 6, 2007

DITA East 2007

Being heads-down with other work, I haven't found (or made) the time to add entries.

Amber Swope opened up the conference with a presentation about the business case for DITA in the context of localization.

DITA East 2007 just wrapped up. I gave three presentations: the Interoperability Framework, a new one about directions we would like the DITA standard to move toward, and Kevin Dorr's presentation about DITA and Content Exchange, which I delivered on his behalf. Overall, I think they were well received.

There were a lot of good presentations. Robert Anderson (IBM) gave several good presentations around the DITA OT and specialization. France Baril gave a presentation about reuse strategies.

Joe Gollner gave a very good and very insightful impromptu presentation on Saturday. Essentially, standards (like DITA and S1000D) are tools, not solutions. They should be used to enable process, but they should not define it.

The discussion panel at the end was truly illuminating for me. The key takeaway for me was that several people are very interested in the "best practices" for implementing and using DITA in their environment.

For so long, I've been focused on DITA-as-technology, which is intriguing in its own right. Interoperability and specialization are definitely fascinating and important to understand. Still, from the discussion, I interpreted the "best practices" remarks to reflect a fundamental facet of the standard that needs more focus: DITA-as-process.

Many know the basic benefits of DITA: modularity, reuse, localization, the concept and benefits of specialization, etc. But most are really interested in answering very gut-level questions: How will DITA affect how I create and publish content? How does DITA change the way I design my content? Where and when do I reuse? When should I conref content?
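To make the conref question concrete, here is a minimal sketch in Python of what conref-style reuse boils down to: an element carrying a reference is replaced by the library element it points to. The ids and topic shapes are invented, and a real DITA processor handles far more (cross-file references, validity constraints), so treat this strictly as an illustration.

```python
import xml.etree.ElementTree as ET

# A hypothetical "library" topic holding a reusable step (ids invented).
library = ET.fromstring(
    '<topic id="shared"><step id="save">Click Save.</step></topic>'
)

# A topic that pulls that step in by reference, conref-style.
doc = ET.fromstring(
    '<topic id="install"><step conref="#shared/save"/></topic>'
)

def resolve_conrefs(tree, lib):
    """Replace each element carrying a conref with the library element it names."""
    for parent in tree.iter():
        for i, child in enumerate(list(parent)):
            ref = child.get("conref")
            if ref:
                target_id = ref.rsplit("/", 1)[-1]
                target = lib.find(f".//*[@id='{target_id}']")
                if target is not None:
                    parent[i] = target

resolve_conrefs(doc, library)
print(ET.tostring(doc, encoding="unicode"))
# -> <topic id="install"><step id="save">Click Save.</step></topic>
```

The point of the sketch is the "where and when": the shared step is written once, and every topic that references it picks up the current wording at build time.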

With that in mind, I learned quite a bit about where DITA needs to focus.

Monday, April 16, 2007

Where is the ROI with XML Document Interoperability?

Up to this point, most of my posts about XML Interoperability have focused on identifying the problem space that a standards-based Interoperability Framework would attempt to solve from a technical perspective. But technical merit alone isn't a sufficient reason for an organization to adopt and embrace a technology.

Instead, typical organizations want solutions to real problems that have real implications for productivity, resources, costs, or profit. They want a certain degree of confidence that any technology investment will a) solve the problem, and b) yield a return on that investment in a reasonable amount of time.

So where is the return on investment for XML interoperability?

Significant Investments Already Made

Consider the investment organizations have made thus far to migrate their content from proprietary formats to XML. Typically, the "up front" costs include:

  • content model development: DTDs, schemas
  • XML-based production tools: rendering engines, servers, XSLT, FO, etc.
  • new editing tools
  • content management systems
  • training

For many organizations this is a substantial investment, made with the intent that it will remain in service at least long enough to recover these costs, and hopefully longer.

Bill Trippe wrote:

Organizations are more diverse, more likely to be sharing content between operating groups and with other organizations, and more likely to be sourcing content from a variety of partners, customers, and suppliers. Needless to say, not all of these sources of content will be using the same XML vocabulary;

With organizations already vested in their own XML infrastructure, changes to this environment to support one or more additional XML vocabularies from different partners are bound to meet with resistance.

Content Sharing Today

Despite this, partners do share content today. They convert (transform) the content from one vocabulary into another, or modify DTDs or schemas to fit the other content model. Yet these solutions make two assumptions:

  1. Each partner is a terminus in the content sharing pipeline
  2. Each partner's XML vocabulary will not change

In some cases, these assumptions are valid, and XML interoperability on the scale of one-way content conversion or schema/DTD integration is quite manageable and efficient. In this case, implementing the proposed XML interoperability framework may simply add more overhead than it returns in ROI. However, if either one of these assumptions is not true, then interoperability (and scalability) is a real issue, and the framework may provide a mechanism for mitigating the costs and risks of trying to interoperate between numerous or changing vocabularies.

The Shortest Distance Between Two Points isn't a Straight Line

XML itself doesn't make any claims to enabling content reusability. But standards like DITA, and to some extent DocBook, provide mechanisms that enable content fragments to be reused in many target documents. For example, a procedural topic (section) written for a user's manual could also be used in training material or support documentation. I've frequently seen cases where a Tech Pubs group is using a different XML grammar than "downstream" partners like Training or Support. There's the added cost in time and resources to convert the content to fit their DTDs. More importantly, the semantics from the original source have now gone through two different transformations. It's almost like the children's game of "Telephone," where one child whispers a phrase in the next child's ear and so on down the line until the final child hears something entirely different. By enabling a shared interchange, you can reduce the number of semantic deviations to only one.

The Only Constant is Change

The other reality is that even when all partners are using XML standards like DITA, DocBook, ODF, or S1000D, the standards continually evolve, adding and changing content models to meet their constituents' needs. Since these standards aren't explicitly interoperable, the cost of managing changes between different standards goes up considerably. And here is another area where having an interoperability format makes the most sense: if interchange is channeled through a neutral format, it can be interpreted to and from different standards (and versions) with far fewer transformation permutations. So if one partner moves to a different version of a standard, using an interchange format reduces the costs and risks to your own toolchain and processes.
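The arithmetic behind that claim is easy to sketch. With point-to-point conversion, every pair of vocabularies (or versions) needs a transform in each direction; with a neutral interchange format, each vocabulary needs only one transform to the hub and one from it:

```python
def direct_transforms(n):
    # Point-to-point: each of n formats needs a transform to each of the
    # other n - 1 formats, in both directions.
    return n * (n - 1)

def hub_transforms(n):
    # Hub-and-spoke: each format needs one transform to the interchange
    # format and one back from it.
    return 2 * n

for n in (2, 3, 4, 6):
    print(n, direct_transforms(n), hub_transforms(n))
```

With only two formats the direct approach is cheaper (2 transforms vs. 4), but the totals cross over quickly: six formats (or versions of formats) need 30 direct transforms versus 12 through a hub.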

Lingua Franca

One of the design principles we've proposed with the Interoperability format is to leverage existing standards like XHTML. Because of this, we minimize the learning curve required for organizations to come up to speed to enable interoperability between different XML grammars. Just as English is the lingua franca for international business, aviation and science, a standardized interchange format for XML grammars provides a vehicle to enable content sharing between XML standards.

For example, an organization that has invested in a DITA XML infrastructure will not likely have a lot of in-house expertise in DocBook, ODF, or S1000D. Now the amount of time and effort to enable content interoperability goes up significantly. Add subsequent XML grammars into the interoperability mix and the level of complexity and cost climbs even higher.

With that said, a common XML interchange format isn't intended to be a translation vehicle; that role is more in line with the Content and Processing interchange strategies I've described in previous posts. Those strategies do have a place in the whole discourse around interoperability. They make perfect sense for linear, end-to-end interchange where both parties understand each other's "language" very well. And in reality, they are likely to be more cost-effective than employing a "middle man."

Rather, a standardized XML Interoperability Framework will provide the highest ROI under the following conditions:

  • Content is leveraged/shared among many different consumers using different languages
  • A corollary to the above: Content sharing is non-linear
  • Business demands (time-to-market, lack of in-house expertise, partner relationships) make direct XML grammar translation cost-prohibitive

Friday, March 9, 2007

XHTML2 Working Group Gets Charter

Very interesting news:

7 March 2007: XHTML2 Working Group created The XHTML2 Working Group was chartered today with the mission of fulfilling the promise of XML for applying XHTML to a wide variety of platforms with proper attention paid to internationalization, accessibility, device-independence, usability and document structuring. The group will provide an essential piece for supporting rich Web content that combines XHTML with other W3C work on areas such as math, scalable vector graphics, synchronized multimedia, and forms, in cooperation with other Working Groups.

Let's hope that we will see real results soon. You can find out more at:

Open Publish 2007 - Baltimore, Day 1

It's been a while since I've posted. That's what happens during project crunch time.

I presented the Interoperability Framework that Scott Hudson and I have been developing at the Open Publish conference in Baltimore. This is the first time this conference has been offered in the US (Allette Systems hosts this conference in Australia - need to find a way to get down under!).

It's a small, but very enthusiastic group of 50 or so attendees. About a dozen attended my presentation which was received with great interest. From my read of the audience, most haven't yet had to worry too much about leveraging content from different standards. Still, it led to great discussions afterward. I will be writing more about this topic later on!

In the first keynote, Michael Wash from the Government Printing Office presented how the GPO is changing to incorporate content-centric publishing, from a wide variety of sources. Very enlightening to learn about how the GPO is required to take content from any government agency in virtually any format including napkins!

In the second keynote, Paul Jensen from Wolters-Kluwer provided some very interesting insight into the "state" of the publishing industry. It's often assumed that the entire publishing industry is diminishing. Yet Paul analyzed different segments within the industry and demonstrated that while certain segments are certainly in trouble, like newspaper publishing, other segments like professional publishing (legal, tax, etc.) are doing very well.

Ann Michael gave a presentation about an often overlooked aspect of implementing CMS solutions in an environment: the people! Who are the stakeholders? Who stands to win? Who stands to lose? How do we identify dissenters (skeptics, passive-aggressives, even us - the consultants!)? How do we identify and manage organizational/environmental changes like management turnover? Certainly, managing a CMS implementation project is no trivial task, and managing the people and personalities must be an integral part of the process!

Steve Manning from the Rockley Group presented how choosing the right CMS is often as challenging as planning to implement a CMS!

Bob DuCharme is slated to be speaking today. I'm looking forward to seeing his presentation.

Monday, February 19, 2007

Types of XML Content Interoperability: Pros and Cons

In my last post, I talked about why we need XML interoperability. Now, let's talk about different strategies for implementing interoperability. We'll also discuss the pros and cons for each approach.

There is a common thread in each approach: XSLT. What makes XML remarkably flexible and resilient (and widely adopted) is its ability to be transformed into so many different formats for both human and computer consumption. It's also why XML interoperability can even be discussed.
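As a toy illustration of what such a transformation does (in practice it would be an XSLT stylesheet, and real conversion restructures content rather than just renaming it), here is a Python sketch that maps a few DocBook-like element names onto DITA-like ones. The tag mapping is invented and deliberately tiny:

```python
import xml.etree.ElementTree as ET

# Invented, minimal mapping of DocBook-like names to DITA-like names.
TAG_MAP = {"section": "topic", "para": "p", "itemizedlist": "ul", "listitem": "li"}

def rename(elem):
    """Recursively rewrite element names according to TAG_MAP."""
    elem.tag = TAG_MAP.get(elem.tag, elem.tag)
    for child in elem:
        rename(child)

src = ET.fromstring(
    "<section><title>Install</title><para>Insert the disc.</para></section>"
)
rename(src)
print(ET.tostring(src, encoding="unicode"))
# -> <topic><title>Install</title><p>Insert the disc.</p></topic>
```

The hard part of any of the three strategies below is everything this sketch leaves out: structures that have no counterpart in the target vocabulary.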

Types of XML Interoperability

There are three basic strategies for achieving interoperability between XML document standards:

  • Content Model Interoperability
  • Processing Interoperability
  • Roundtrip Interoperability

Each of these approaches has valid use case scenarios, and none should be dismissed out of hand. Yet each makes certain assumptions about business processes and environments that hold in some circumstances but are less than optimal in others.

Content Model Interoperability

Content Model Interoperability is centered around enabling all or part of one standard's content model to be included as part of another standard. For example, DITA's specialization capabilities could be employed to create custom topic types for DocBook sections or refentries (in a DITA-like way). Conversely, DocBook's DTDs are designed to create customizations on top of the core content model.

In addition to customizing the DTDs (or Schemas), there is an additional step to support the new content in the standard: You need to account for these custom elements in the XSLT stylesheets - for each intended output format.

While on the surface this approach appears to be the most logical way to ensure that your content can interoperate with another standard, it is not an approach for the faint of heart. Working with DTDs and schemas is doable, but it requires a thorough understanding of both standards before you begin. There are other limitations:

  1. This approach allows you to accept content from one standard, but doesn't allow you to share or leverage this content with other collaboration partners. In effect, this approach is "shoehorning" content from one standard into yours. However, if you are receiving content from only one partner (and you aren't sharing content elsewhere), this could be a viable approach. But keep in mind...
  2. You and your partner are now both bound to a fixed version of the standards that will be sharing content. If either you or your partner decide to move to a later version of the respective standards, you may have to rework your customizations to support the new content models. You also run the risk that your legacy content won't validate against the new DTDs or schemas.
  3. Be aware that while content in different namespaces may provide "short-term" relief, it can also cause "long-term" headaches (much in the same way that Microsoft's COM architecture introduced us all to "DLL Hell"). It also means that your content must also be in a namespace (even if it is the default one).
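As a small illustration of the namespace point above, here is what a namespaced "island" of foreign content looks like to a consumer: every tool in the chain now has to be namespace-aware just to address that content. The host vocabulary below is invented for illustration; the namespace URI is DocBook 5's real one:

```python
import xml.etree.ElementTree as ET

# A host document embedding a DocBook-namespaced island.
doc = ET.fromstring(
    '<manual xmlns:db="http://docbook.org/ns/docbook">'
    '<db:para>Shared safety notice.</db:para>'
    '<note>Native host content.</note>'
    '</manual>'
)

# Any query for the embedded content must now carry the namespace mapping.
ns = {"db": "http://docbook.org/ns/docbook"}
island = doc.find("db:para", ns)
print(island.text)  # -> Shared safety notice.
```

Multiply that mapping across every stylesheet, query, and validation step in your toolchain and the "long-term headache" becomes apparent.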

Processing Interoperability

In this approach, content from one standard is either transformed or pre-processed into the other using XSLT. This approach is less risky in some ways than Content Model Interoperability: you don't have to maintain a set of DTDs to enable content interoperability, and it's a whole lot easier to share the content with partners once it's transformed into a single DTD.

There is a slightly different angle you can take: you could decide not to preprocess the content into your DTD, but instead use your XSLT stylesheets to incorporate the "foreign" content into the final output. For some cases, where you may simply be "rebranding" content, this might be a viable approach, but keep in mind that it might mean some additional investment in incorporating other tools into your tool chain. For example, DITA and DocBook content employ very different processing models (i.e., the DITA Open Toolkit vs. the DocBook XSLT stylesheets). This may require a hefty development effort to integrate these tools properly in your environment. And if you intend to leverage the content elsewhere in your own content, this angle can become a lot harder to implement.

For organizations sharing content back and forth, or for groups that are receiving content from one partner and are sharing it with other partners in the pipeline, this could be a reasonable approach. Still, there are potential risks here:

  1. This "uni-directional" approach is more flexible than Content Model Interoperability, but you still potentially have the same DTD/schema version problem. And it realistically works for only one pair of standards, for example DocBook and DITA.
  2. If your partner begins creating content in a newer version of their DTD, you may have to upgrade your transforms to enable the content to be used by you.
  3. You still need to be well-versed in both standards to ensure each plays nicely in your environment.

  4. Be prepared for dealing with validation issues. While each standard does include markup for standard content components like lists, tables, images, etc., there are structures that do not map cleanly. In these cases, you will need to make some pretty hard decisions about how they will (or will not) be marked up.

Roundtrip Interoperability

This is perhaps the most ambitious approach to creating interoperable content and encompasses being able to transform one standard into another and round trip that content back into the original standard. Like Processing Interoperability, you still have some very tricky issues to contend with:

  1. How do you handle round tripping between different versions of the standards? The net result is that you will need multiple stylesheets to support each version permutation.

  2. It's bi-directional, meaning that the round trip only works between the two standards (and with specific versions of those standards).

The following figures (taken from Scott Hudson's and my presentation at DITA 2007 West) illustrate the problem:

In this example, we're only dealing with two standards, DocBook and DITA. But as you can see, there are numerous permutations that are potential round trip candidates. Now let's add another standard, like ODF.

You can see that this quickly becomes a very unmanageable endeavor.


I've gone over three different strategies for approaching XML interoperability, situations where they work well, and some of the problems you may encounter when choosing one of these strategies. In my next post, I'll look at another approach for handling XML interoperability.

Saturday, February 17, 2007

Why XML Content Interoperability Is Critical

Content re-use, sharing, and re-purposing has long been the "Holy Grail" of publishing. There have been numerous attempts at solving this problem, all with varying degrees of success. Yet as publishing has grown and evolved over time, technologies have emerged to solve one set of problems, only for other technical challenges to "fill the void."

Back In The Day...

In the 1980's, desktop publishing tools opened the doors to a brand new set of professionals to create and produce high quality and lower cost publications. Tools like Frame, Quark, WordStar, WordPerfect, and Word would allow veritable novices to quickly and easily create professional-looking typeset documents.

Yet, while these tools opened the door to cheaper, faster publication cycles, they also created a very significant problem: each tool stored content in its own proprietary format, and these formats were not compatible with other tools! Sure, this wasn't a problem if all your content was in the same format and you didn't have to leverage, share, or reuse content from other tools. But the moment you needed to incorporate information from any other tool, life became very difficult.

The unintended consequences of trying to convert incompatible formats to your own usually resulted in "one-offs," and more importantly, the incurred costs (including the psychological and physical trauma) were very high. It wasn't uncommon for converting a single manual to take a writer/editor/DTP professional several weeks. Now what if your organization was sharing content from several partners using different formats - now we're talking about real money!

Enter SGML...

In the late 80's and early 90's, Standard Generalized Markup Language (SGML) emerged with the promise that it would solve the format conundrum that desktop publishing tools had wrought on us. The idea was brilliant: all content would be stored as text, with formatting and other semantics included using "tags" and "attributes" that could easily be interpreted.

Robert Burns could not foresee that his oft quoted verse would apply here:

"The best-laid schemes o' mice an' men
Gang aft agley,
An' lea'e us nought but grief an' pain,
For promis'd joy"

The problem with SGML was that few tools supported this very complex standard. The tools that were available were very expensive and out of reach for all but the larger organizations that could afford the investment.

HTML and World Wide Web

The mid-90's introduced HTML and the World Wide Web, along with a brand new medium for sharing content. HTML is in fact an SGML application. But unlike other SGML applications, it was widely adopted because of a unique piece of software, made popular by the likes of Marc Andreessen: the web browser. Another factor in its adoption was its very small, easy-to-learn tag set - in essence, it contained the basic structural components found in most publications:

  • Headings (Sections)
  • Paragraphs
  • Lists
  • Tables
  • Images
  • Inline formatting elements (bold, italic, monospace)

Still, there were several problems with HTML. The markup was primarily focused on presentation. Content semantics were lost: Was an ordered list item intended to be a procedural step or the ith item in a numbered list? Is the monospaced phrase a command or environment variable?

The second problem was that web browsers enabled the misuse and abuse of elements in such a way that even the most basic semantics were buried in the mishmash of tags.


Enter XML...

In the late 1990's, the W3C developed the eXtensible Markup Language (XML). It promised a lightweight version of SGML that could be enabled for the web. It has lived up to its promise, and then some. Like HTML, XML is now widely adopted because of the availability of cheap (even free) tools. Along with these tools, major programming languages (Java, VB, C++, .NET) support XML through easy-to-use APIs.

SGML applications, like DocBook, soon ported their DTDs to XML. Editing tools like XMetaL and Arbortext Editor supported XML and made it easier for the "uninitiated" to create structured content.

The widespread adoption of XML also brought with it another problem: the number of XML document standards has grown significantly in the past 10 years. Today, standards like TEI, DocBook, DITA, and even ODF (Open Document Format - used by the Open Office tools) are widely used. There are also countless variants of some of these standards.

Now the problem isn't the lack of semantics, or incompatible formats, it's the proliferation of XML Document standards, each of which has taken a different approach toward producing content. All of these standards have valid practical applications (which I will not discuss here - suffice it to say that each application takes into account different forms and functions toward producing content).

In an environment where collaboration and new types of partnerships are emerging, the conundrum now is: how do I share/re-use/re-purpose content from multiple partners using different XML standards? How do I mitigate risks to my budget and schedule by leveraging disparate content? In my experience, this is sort of like a Chinese finger puzzle: it's easy to adopt a standard (that maps to your own processes), but it's a whole lot more complicated to work around it when other standards are introduced!

This is why content interoperability is so important. In my experience, this problem has often been a gating factor in collaboration. As I mentioned in a previous entry, organizations often develop their "language" and semantics around well established processes. Also true of many organizations, they will "reject" processes that run counter to their own. The end result is an "informational impasse" - information that doesn't fit the organization model is quickly dismissed.

Yet many organizations have been asked to collaborate with other organizations using different "languages." This also applies to XML standards. Some organizations we've consulted with are producing books and collections of articles, which are best suited for DocBook. Yet in the future, they may partner with other organizations to produce learning content, which is not one of DocBook's strengths. Another organization has an OEM partnership where both partners are using different DITA specializations. And within that same company, there are organizations using DocBook variants. All of this content is leveraged and reused in many different ways.

There are several strategies for dealing with disparate standards. And I'll discuss these in future posts.

Thursday, February 8, 2007

A "Sociology" of XML Languages

In a perfect world, XML content could easily be shared between different users and organizations because everyone would be sharing the same markup and semantics. Information interchange could be seamless; content could be repurposed and reused with minimal effort between different functional teams; XML processing tools could be optimized.

Yet, there are numerous reasons why we see so many different XML grammars used by different organizations. I'll focus on two of these briefly:
  1. Organizational Dynamics
  2. Multiple XML Standards

Organizational Dynamics

I never thought that my background in Sociology would ever be useful, but it certainly is applicable here (I'm very rusty since I haven't consciously thought about this subject in over 10 years): looking back at the works of Emile Durkheim, Max Weber, and Frederick Winslow Taylor, we see that organizations are structured around distinct divisions of labor to enable individuals to specialize their skills and work on discrete aspects of the "production process" (keep in mind that most of these theories emerged during the peak of the Industrial Revolution).

What's more interesting here is how groups are organized. In part, this is shaped by many different factors, including the industry vertical, the size of the organization, and its relationships with other organizations. There is plenty of literature about these subjects, so I won't delve into them here.

The key takeaway is that all these factors have a direct effect on the organization's processes, meaning that for information development groups (Tech Pubs, Training, etc.), this affects how information (content) is created, managed, and distributed.

An organization's processes also have an interesting side-effect on language. Organizations create their own vocabulary to express, and even rationalize, their processes (there are other implications, like "group identification," at work here too). For example, there is an often quoted line from the movie Office Space: "Did you get the memo about the TPS Reports?" Even within the same industry where there are common terms (like "GUI" or "Menu" in software), there are distinct "dialects" that evolve over time, much in the same way that there are different Spanish dialects: a Spanish speaker from Spain could probably converse with a Spanish speaker from Argentina, but there might be words or phrases that aren't understood.

And this manifests itself in the XML syntax adopted by these organizations used to create content. A logical strategy for these organizations is to adopt known XML standards like DocBook or DITA that fit their organization's process the closest, and modify these standards to incorporate words or phrases of their own into their XML syntax.

Multiple XML Standards

One of the incredibly powerful aspects of XML is its ability to evolve over time to support different vocabularies. The unintended consequence, however, is that we now have several well known XML document standards like DocBook and DITA. While they are different architecturally, and to some extent semantically, they're both targeted at virtually the same audience (information developers), produce the same kinds of output formats (PDF, HTML Help, HTML, JavaHelp), and, probably more importantly, contain the same kinds of structural components (paragraphs, lists, tables, images, formatting markup), albeit using different element names ("A rose by any other name would smell as sweet" - Romeo and Juliet).

Yet having multiple standards can create an "informational impasse," where the DTDs get in the way of sharing content across organizations. And for many organizations, this is a real problem. Across all industry verticals, we're seeing new forms of collaboration and partnering across companies (and organizations), along with consolidation (mergers and acquisitions). And from an XML content perspective, the question is, "My content is in 'X' and my partners' content is in 'Y' and 'Z'. How do I reconcile these disparate document types?"

And therein lies the need for a Doc Standards Interoperability Framework, which I will describe in future posts.

DITA 2007 West: Day 3

Had some great conversations with France Baril and Yas Etessam on a wide variety of topics including DTD development and maintenance, data and process modelling, and observations and trends in publishing - definitely very interesting.

Scott and I gave our presentation on our proposed DocStandards Interoperability Framework. We got great feedback from France and Eric Hennum. The main purpose of the presentation was to stimulate some interest and further discussion on enabling interchange between disparate standards, and we accomplished what we set out to do. The next step is to use the momentum we got from this conference and start the process of making this an OASIS Technical Committee.

Overall, a very good conference. I met some really sharp people that I'll definitely hope to keep in touch with.

On to OpenPublish in Baltimore...

Wednesday, February 7, 2007

DITA West 2007: Day 2 - Dimented and Topic-oriented

Amber Swope from Just Systems gave a great keynote presentation on localization issues with DITA content. Interestingly, it doesn't just apply to DITA; it has broad implications for any organization that has (or will have) L10N or I18N as part of its deliverable stream, regardless of whether it's using structured or unstructured content. With that said, DITA does have unique advantages in minimizing L10N/I18N costs in that it can encapsulate (read: mitigate) the volume of content that needs to be translated.

In the same vein, Amber gave another presentation on "controlling" your content using a CMS and other processes to provide efficiency, content integrity, and protection from liability (life, limb, and legal).

The highlight, however, was Paul Masalsky's (EMC) presentation. He spoke about the trials and tribulations of integrating DITA in an enterprise environment. The most memorable part of his presentation (and the conference thus far) was his DITA rap. It was AMAZING! I, along with the rest of the audience, was completely wowed! If I can get a hold of the lyrics, I will post them here. I don't remember all of them, but one verse rhymed "dimented" and "topic-oriented." It was fantastic! Who says geeks are one-dimensional?!!!

Scott Hudson and I are working on the last-minute preparations for our presentation tomorrow. It promises to be thought-provoking, and will provide some neat demonstrations of how content from different standards like DocBook, DITA, ODF and others can interoperate. I think this definitely has potential for enabling heterogeneous environments to solve the difficult problem of "how do I reconcile content that doesn't quite fit my model?"

Hope to see you there.

Monday, February 5, 2007

DITA West 2007: Day 1

It was finally nice to put faces to names I've worked with, in addition to meeting some new folks. So far it looks like the conference has about 80 or so attendees. The attendance may be a little low as a result of the Super Bowl; hopefully more will show up tomorrow. Unfortunately, Michael Priestley and Don Day aren't here. I was looking forward to talking with them more about the Interoperability Framework.

Lou Iuppa from XyEnterprise gave the opening keynote. Essentially, the thesis was that DITA would benefit from the use of a CMS.

Looks like the majority of the attendees are Technical Writers either new to DITA, or in the process of implementing a DITA solution in their organization. Definitely hoping to see some more technical presos, though Yas Etessam's presentation on "Enabling Specializations in XMetaL Author" was very interesting.

There were a fair number of vendor booths in the hallway outside the conference rooms, including one from Flatirons Solutions. Just Systems, PTC, and MarkLogic were among the others.

Microsoft to Support XSLT 2.0

Things are looking up all over! Microsoft has announced that it will support the XSLT 2.0 standard.

Saturday, February 3, 2007

XML - It's not just for data. Really.

I had an illuminating moment a few weeks ago when I met with some developers at a client about an XML project that I am working on for them. The project involves some pretty sophisticated Java code to execute a transformation of XML content into multiple formats like HTML, PDF and Microsoft HTML Help. Essentially, the work involves extending an existing framework to support transforming any Document Markup Language (DITA, DocBook, TEI, custom DTD/Schema, you name it) instance to whatever the desired output should be.

Yet after a few days with these extremely bright developers, it became quite clear to me that they viewed XML strictly as a data format, something that contained configuration data, or content manifests. They had no experience or context to think about XML as content. And here I was explaining to them that XML is a very rich medium for technical manuals and even commercial publications (this is a whole other topic).

Having worked with DocBook for the last 6 years, and more recently DITA, my world view of XML was focused strongly on XML for documentation. Yes, I've created lots of XML for configuration files and other data (SOAP packets) in my development work. But generally I correlate XML with technical document content.

And then it dawned on me: within the development community at large, there appears to be a large disconnect about how XML is and can be used. For developers like the ones I spoke with, XML is a rich data format with the ability to hold all kinds of data in a hierarchical way - something that Java .properties files and flat text files can't do easily. For others, like myself, XML is a rich content format that provides a way of semantically describing the content. And in fact, the content becomes data, not just text.
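To make the two views concrete, here is a quick side-by-side sketch (both fragments are invented for illustration - the element names are not from any real schema):

```xml
<!-- XML as data: a configuration fragment; the markup is just structure -->
<connection>
  <host>db.example.com</host>
  <port>5432</port>
</connection>

<!-- XML as content: a DocBook-style paragraph; the markup carries meaning,
     telling us that "restart" is a command, not just an italicized word -->
<para>Restart the server with the <command>restart</command> command.</para>
```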

There is a proliferation of XML standards: RSS, WSDL, SMIL, and so on. These are powerful data structures that have advanced the Web from serving up static HTML pages to delivering a wide variety of new services and applications. Yet, ironically, all of these XML standards have their origins (indirectly) in SGML, some of whose earliest applications were used to mark up content: HTML being perhaps the most well known. DocBook started out as an SGML application in the early 90's and was ported to XML later on.

This goes to show that XML is an immensely powerful technology that lives up to its name: eXtensible Markup Language. It has become firmly entrenched in a great many technologies. Interestingly, it's because XML is both data and content that we find two naturally occurring views in the development community. We can't know everything, and as a consequence, we tend to pick specific technologies to become expert in (myself included). Look at Java or C#: there are so many distinct areas where these languages can be used that we can't possibly be expert in all of them.

This meeting proved to be illuminating in other ways. I began thinking about yesterday's post and Microsoft's decision not to support XSLT 2.0 or XPath 2.0. Perhaps the view of XML as data is much more common than the view of XML as content. And maybe this distinction is an artifact of "conventional" ways of thinking about "data" and "human-readable" content: data lives in databases and configuration files; content lives in Word and HTML files; and never the twain shall meet - and for the most part, they didn't.

XML puts data and content on an equal playing field, which opens up a whole variety of possible applications, some of which are only beginning to emerge. It has also opened up existing applications to new ways of sharing information. For example, Microsoft's Office Open XML standard will allow interchange to and from a wide variety of sources, perhaps even from DocBook XML content!

My Technorati Profile

I registered for a Technorati Profile. Still busy preparing for the DITA 2007 West Conference. My day job keeps getting in the way :)

Friday, February 2, 2007

XSLT 2.0 is Fantastic, but there are some hurdles

When XSLT 1.0 became a W3C Recommendation back in 1999, I thought it was the coolest thing out there. Oh, the things I could do with XML + XSLT 1.0 + Xalan or Saxon! Later on, when I wanted to do things like grouping and outputting to more than one result file, I realized these weren't built in. Even now, I can't fully wrap my head around the Muenchian Method for grouping; and for outputting multiple result files, I had to rely on XSLT extensions. This meant that my stylesheets were now bound to a particular XSLT processor. That completely sent shivers up my spine - the whole idea behind XSLT, in my (perhaps idealistic, naive) view, was that you should be able to take an XML file and any compliant XSLT engine and create an output result (set). Still, despite the warts and shortcomings, XSLT 1.0 proved to be a faithful companion to my XML content.
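For readers who haven't run into it, the Muenchian Method fakes grouping in XSLT 1.0 with xsl:key and generate-id(). A minimal sketch - the input vocabulary (item elements with a category attribute) is invented for illustration:

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Index every item by its category attribute -->
  <xsl:key name="by-cat" match="item" use="@category"/>

  <xsl:template match="/items">
    <!-- The Muenchian idiom: visit only the FIRST item in each category
         by comparing generated ids -->
    <xsl:for-each select="item[generate-id() =
                               generate-id(key('by-cat', @category)[1])]">
      <group category="{@category}">
        <!-- Pull in all members of this group via the key -->
        <xsl:copy-of select="key('by-cat', @category)"/>
      </group>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```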

Enter XSLT 2.0. In so many ways it is so much better than its predecessor! Built-in grouping functionality and multiple output result documents are now part of the specification! Huzzah!

But wait! There's more! In-memory DOMs (very nice!), stylesheet functions (very handy), XQuery and XPath 2.0 as sibling specs, unparsed-text processing (very handy for things like embedding CSS stylesheets or processing CSV files), and better string manipulation functions, including regex processing. This is just a taste of what's in the latest version.
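As a sketch of the built-in grouping and multiple-output support (the item/category input vocabulary here is invented for illustration):

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/items">
    <!-- Built-in grouping: no keys, no generate-id() tricks -->
    <xsl:for-each-group select="item" group-by="@category">
      <!-- One result document per group: part of the spec, no extensions -->
      <xsl:result-document href="{current-grouping-key()}.xml">
        <group category="{current-grouping-key()}">
          <xsl:copy-of select="current-group()"/>
        </group>
      </xsl:result-document>
    </xsl:for-each-group>
  </xsl:template>
</xsl:stylesheet>
```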

It just became a W3C Recommendation (along with XQuery and XPath 2.0) last month. Yeah! Finally!

Still, this latest version has major obstacles to overcome before it can enjoy widespread adoption. There's only one notable XSLT 2.0-compliant engine: Saxon 8, by Dr. Michael Kay. It is developed in Java, but there is a .NET port (via the IKVM libraries).

Not that I have anything against Saxon - it is outstanding. Yet where is Xalan? MSXSL? Why haven't they come to the party? Scouring the blogs and mailing lists, there doesn't appear to be any activity on Xalan toward an XSLT 2.0 implementation. Microsoft's current priority is XLinq; it has decided that it will support XQuery, but not XSLT 2.0 or XPath 2.0.

Microsoft's decision not to implement XSLT 2.0 and XPath 2.0 could have an unfortunate effect on adoption of these standards. While XQuery is extremely powerful (and wicked fast) and can do all the things that XSLT can do, I wouldn't necessarily recommend trying to create XQuery scripts to transform a DocBook XML instance (the XSLT is already complex enough).

I would rather write template rules that match the appropriate elements than attempt a long, complex set of switch cases to handle DocBook's intricate content model. It could be done, but it wouldn't be a trivial task.
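A short fragment of the kind of template-rule style I mean - these are real DocBook element names, but a deliberately tiny subset of a DocBook-to-HTML transform:

```xml
<!-- Push-style XSLT: each rule handles one element type, and
     xsl:apply-templates lets the document's content model drive
     processing instead of spelling it out in one place -->
<xsl:template match="para">
  <p><xsl:apply-templates/></p>
</xsl:template>

<xsl:template match="emphasis">
  <em><xsl:apply-templates/></em>
</xsl:template>

<!-- The XQuery equivalent would need an explicit typeswitch at every
     level of the content model -->
```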

XSLT 2.0 is amazingly powerful, with many of the features that were lacking in the 1.0 Recommendation. In fact, the DocStandards Interop Framework intends to use XSLT 2.0 to take advantage of many of these new features, such as generating topic maps or bookmaps from the interchange format. It looks like Saxon will be the de facto engine of choice - not a hard choice to make.

Go to DITA West, Young Man

My colleague Scott Hudson and I are presenting a paper at the DITA 2007 West Conference in San Jose, February 5-7. I am very excited about this.

The thesis of the paper focuses on proposing a DocStandards Interoperability Framework to enable various document markup languages like (but not limited to) DocBook, DITA, and ODF to share and leverage content by using an interchange format that each standard can write to and read from.

There are several advantages to this approach:
  • It doesn't impede future development for any standard, since the interchange is a "neutral" format. This means that new versions of a document markup standard can leverage content from earlier versions
  • Since it is neutral, it can potentially be used by virtually any document markup standard

This work stems from Scott's and my involvement in the DocStandards Interoperability List, an OASIS forum. We're hoping to spark interest in the XML community to push this along and create a new OASIS Technical Committee for DocStandards Interoperability.

We're still in the process of editing the whitepaper, which will be posted on Flatirons Solutions' website in the near future.