
Saturday, January 9, 2016

It's been a while

It's been a while since I've posted on this blog.  A lot has happened in the intervening months (years). Mostly, I've moved forward and backward in the tech world.  I've hummed and hawed over the direction of my career.  I've also been somewhat distracted by local events that required my attention. You can label it a higher calling; a change in priorities from a completely geeky world that I have embraced as my own to one that encompasses the future of my geeks-in-training.

Needless to say, I haven't abandoned the world of XML, XSLT, XPath, and XQuery entirely.  I've evolved.  I've had a gap year (or two) and seen the world outside the comfortable confines of angle brackets and FLWOR expressions, and it has changed me - a bit.

For those who've read some of my posts, I drank the kool-aid in the 90's, and wanted everyone else to share from the same cup.  What the last few years have shown is that kool-aid is for kids.  It's time to grow up.  The technology and content worlds have changed, and I need to change with them.

Primarily, what has changed is my thinking about the role of XML technologies in the landscape.  XML has a place, and honestly, it's a very important player in a wide-ranging landscape - just not in the way I perceived it 5, 10, or even 15 years ago.  In fact, I'm not sure that Sir Tim Berners-Lee would have envisioned the path that markup languages have taken.  Nonetheless, it's time to embrace these changes for what they are.

  1. XML is here to stay.  It's mature, it's lived up to its promise of extensibility, and it won't go away.
  2. XML technologies are stable.  There is little in the way of implementation variability among different providers now.  Whether you are using Java, a classic 'P' (Python, PHP, Perl) language, or any one of the newer languages, they all must honor XML in order to be complete.
  3. Incremental changes in XML technologies are principally to support scale.  DOMs are nice, elegant, and easy-to-use structures, but quickly turn into boat-anchors when we attempt to embrace internet-scale data within them.  Streaming XML is the new sexy.
  4. Virtually any data model can be represented in XML for a myriad of business purposes, with self-describing semantics and the capability to flex its node hierarchy based on the data.  For this reason alone, XML has been, and will continue to be, a workhorse. Think about Spring, one of the web's most successful Java frameworks.  XML is the underlying data format for nearly every part of it.
  5. As a data persistence layer, XML plays well with tabular, relational, and hierarchical structures. With its rich semantics and vendor-agnostic format, XML technologies are powerful, flexible, and scalable. Yes, it's also a great storage model for pernicious content models - like DITA, DocBook, and, gulp, OOXML (I'll shower later for that).
  6. From XML, I can deliver and/or display content/data in virtually any format imaginable, even to the point of backward compatibility with legacy formats (ask me about HTML 3.2/EDGAR transformations sometime).
With all that XML has going for it, what can go wrong? Well, depending on who you ask, the answer will vary. Some criticize XML for not living up to the hype of Web 2.0. XML's initial purpose was to be the "SGML for the web."  To some degree, it is, but it is far from ubiquitous.  That isn't to say that we didn't try. From XML Data Islands to the XMLHttpRequest object in JavaScript, XML was given first-class status on the web.  The problem was (and is) that extracting data from an XML DOM often relied on a lot of additional code to recurse through the content.  For some, the browser's tools felt like a blunt instrument when finer-grained precision was needed.  Eventually, JSON became the lingua franca for web data, and rightfully so.

Perhaps its biggest failure lies in the countless attempts to make XML usable for the masses. I'll admit that I was one of the biggest evangelists.  I honestly believed that we could build authoring tools that were intuitive and easy to use, backed by powerful semantic markup.  We would be able to enrich the web (and by proxy, the world) with content that had meaning - it could be searched intelligently, reused, repurposed, translated, and delivered anywhere.  As my friend and mentor Eric Severson said, XML has the capability of making content personal: designed for a wide audience and personalized for an "audience of one."

Intrinsically, I still have some faith in the idea, but the implementation never lived up to the hype.  For over twenty years, we've tried to build tools that could manage XML authoring workflows from creation to delivery.  Back in the late 90's and early 2000's, I remember evangelizing XML authoring solutions to a group of technical writers at a big technology firm.  I was surprised by the resistance and push back I got.  Despite the benefits of XML authoring, the tools were still too primitive; instead of making the writers more productive, they slowed them down.  Nevertheless, I kept evangelizing like Linus in the pumpkin patch.

Eventually, the tools did improve.  They did make authoring easier... for some. What we often glossed over was the level of effort required to make the tools easier to use.  Instead of being tools that could be used by virtually anybody who didn't want to see angle brackets (tools for the masses), we made built-for-purpose applications.  For folks like me who understood the magical incantations and sorcery behind these applications, they were fantastic.  They were powerful.  They also came with a hefty price tag.  And, because they were often heavily customized, users were locked into the tools, the content model, and the processes designed to support it.

Even if we attempted to standardize on the grammar to enable greater interchange, it still required high priests and wizards to make it work. The bottom line is that the cost of entry is just too high for many.  The net result is that XML authoring is a niche, specialized craft left to highly trained technical writers and the geekiest of authors.

Years ago, I read Thomas Kuhn's The Structure of Scientific Revolutions. The main premise is that we practice our crafts under the umbrella of well-accepted theory.  Over time, through repeated testing, anomalies emerge. Initially, we discard these anomalies, but as they continue to accumulate, we realize we can't ignore them anymore.  New theories emerge.  At first we reject these new ideas and argue vigorously that the old theories are still valid, until enough evidence disproves them entirely.  At that moment, a new paradigm emerges.

We are at that moment of paradigmatic shift.  No longer can XML be thought of as a universal theory of information and interchange.  Instead, we need to reshape our thinking to accept that XML solves many difficult problems and has a place in our toolbox of technology, but other technologies and ideas are emerging that are easier, cheaper, faster methods for content authoring.  For many, the answers to "intelligent content" aren't about embedding semantics within the content, but about extending it with rich metadata that lives as a wrapper around the content - metadata that can be dynamic, contextual, and mutable.

Before I'm labeled a heretic, let me be clear.  XML isn't going away, nor is it inherently a failed technology.  Quite the opposite.  Its genius is its relative simplicity and the flexibility to be used effectively across a vast number of technologies.  The difference is that we've learned we could never get enough momentum behind the idea of XML as a universal data model for content authoring, and it was too cumbersome for web browsers to manipulate.  We have other tools for that.

Tuesday, September 24, 2013

XML Schemas and the KISS Principle

I recently had the opportunity to work on an interesting XML schema.  The intent was to create an HTML5 markup grammar for creating digital content for EPUB and the web primarily, and ultimately for print.  The primary design goal was to create an authoring grammar that facilitates some level of semantic tagging and that is natively HTML5 compliant, i.e., no transformation is required to move between the authoring format and HTML5.

What is interesting about this particular schema is that it resembles design patterns used for microformats.  The markup semantics for typographic structures such as a bibliography or a figure are tagged with standard HTML elements, with additional typographic semantics expressed using the class attribute.  For example, a figure heading structure must look like the following:

<figure>
    <h2><span class="caption">Figure </span>
    <span class="caption_number">1.1 </span>Excalibur and the Lady of the Lake</h2>
</figure>

Notice the <span> tags.  From the perspective of describing our typographic semantics (figures must have captions and captions must have a number), this isn’t too bad.  However, from a schema perspective, it’s much more complex, because the underlying HTML5 grammar is quite complex at the level of <div>, <h2> and <span> elements.  In addition to the required “caption” and “caption_number” semantics applied to the <span> tag, the <h2> element also allows text and other inline flow elements, such as <strong>, <em>, and, of course, other <span> tags that apply other semantics.

To enforce the mandate that a figure heading must have a label and number as the first two nodes of the <h2> element, we can use XML Schema 1.1 assertions.  Assertions allow us to apply business rules to the markup that cannot be expressed directly in the content model sequences, using a limited subset of XPath axes and functions that return a boolean result.

Alternately, Schematron could be used independently (or in addition to assertions) as a means of enforcing the business rules in the markup. The issue here is that a Schematron rule set resides outside of the XML schema, therefore requiring additional tooling integration from the authoring environment to apply these rules.
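For illustration, a Schematron rule expressing the same figure-heading mandate might look like the following (a sketch only; it assumes the HTML5 elements are in no namespace, which a real schema and ruleset would need to handle):

<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <sch:rule context="figure/h2">
            <!-- the first node must be the caption label -->
            <sch:assert test="node()[1][self::span][@class = 'caption']">
                A figure heading must begin with a span classed "caption".
            </sch:assert>
            <!-- the second child element must be the caption number -->
            <sch:assert test="*[2][self::span][@class = 'caption_number']">
                The caption label must be followed by a span classed "caption_number".
            </sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>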

So, for our heading above, we must apply the following assertion:

<xs:assert test="child::h2/node()[1][@class='caption']/following-sibling::span[@class='caption_number']""/>

In this case, the assertion is stating that the <h2> element’s first node must have a class attribute value of “caption”, followed immediately by an element with its class attribute value of “caption_number.”  After that, any acceptable text or inline element defined by the HTML5 grammar is allowed.

This is a very simple example of how the existing HTML5 grammar alone cannot enforce the semantic structure we wish to express.  There are numerous other examples within the content model that would leverage the same design pattern.

We have done several successful projects with this approach and the value of having a single authoring/presentation grammar (HTML 5) is very appealing. However, there can be issues and difficulties with this approach. Consider:

  1. Microformats are clever applications that give semantic meaning to simple HTML formatting tags.  It’s valid HTML by virtue of tags and attributes, with additional semantics expressed through the value of certain attributes, such as the class attribute.  In general, these microformat markup documents are small, discrete documents, as they are intended to be machine readable to give the application its functionality.  From an authoring perspective, it’s relatively simple to create a form that captures the essential data that is processed by machine to generate the final microformat data (or for the markup and microformat savvy, create it by hand – but we are in the minority). Think of microformat instances as small pieces of functionality embedded as a payload within a larger document that are only accessed by applications with a strong understanding of the format. If we take the notion of microformats and use them throughout a document, we can run into tooling issues, because we’re now asking a broader range of applications (e.g., XML editors) to understand our microformat.
  2. The “concrete” structural semantics (how to model figures and captions) are specified with “abstract” formatting HTML tags. Conflating presentation and structural semantics in this way is contrary to a very common design principle in use today in many languages and programming frameworks, namely to separate the semantics/structure from the formatting of content.
  3. The schema’s maintainability is decreased by the vast number of assertions that must be enforced for each typographical structure.  Any changes to any one structure may have ripple effects to other content classes.
  4. Not all XML authoring tools are created equal.  Some don’t honor assertions. Others don’t support XML Schema 1.1 at all.  Consequently, your holistic XML strategy becomes significantly more complex to implement.  It might mean maintaining two separate schemas, and it might also mean additional programming to enforce the structural semantics that we wish to manage in the authoring tool.
  5. A corollary to the previous point, creating a usable authoring experience will require significant development overhead to ensure users can apply the right typographical structures with the correct markup.  It could be as simple as binding templates with menus or toolbars, but it could easily extend into much more.  Otherwise, the alternative is to make sure you invest in authors/editors who are trained extensively to create the appropriate markup.  Now consider point #3.  Any changes to the schema have ripple effects to the user experience also.
  6. Instead of simplifying the transformation process, tag overloading can have the reverse effect.  You end up having to create templates for each and every class value, and it’s not difficult to end up with so many permutations that an ambiguous match results in the wrong output (see the short XSLT sketch after this list).  Having gone down this road with another transformation pipeline for another client, I can tell you that unwinding this is not a trivial exercise (I’ll share this in another post).
  7. Assertion violation messages coming from the XML parser are extremely cryptic:
    cvc-assertion: Assertion evaluation ('child::node()[1]/@class='label'') for element 'summary' on schema type 'summary.class' did not succeed.

    For any non-XML savvy practitioners, this kind of message is the precursor to putting their hands up and calling tech support.  Even if you use something like Schematron on the back end to validate and provide more friendly error messages, you’ve already made the system more complex.

  8. It violates the KISS principle.   The schema, at first glance, appears to be an elegant solution.  If used correctly, it mitigates what is a big problem for publishers:  How do I faithfully render content so that it appears exactly as prescribed?  Theoretically, this schema would only require very light transformation to achieve the desired effect. Yet, it trades one seemingly intractable problem for several others that I’ve described above.
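To make point 6 concrete, here is a contrived XSLT sketch (my own illustration, not from the project described) of the kind of ambiguity that overloaded class values invite:

<!-- Both templates have the same default priority (0.5) and both match
     <span class="caption">, so the processor must either report an ambiguous
     rule match or silently pick one.  Note that the second template also
     happens to match class="caption_number". -->
<xsl:template match="span[@class = 'caption']">
    <strong><xsl:apply-templates/></strong>
</xsl:template>

<xsl:template match="span[starts-with(@class, 'caption')]">
    <em><xsl:apply-templates/></em>
</xsl:template>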

Several years ago, I recommended using microformats as an interoperability format for managing content between DITA, DocBook, and other XML markups.  The purpose of the format was specifically to be generated and read by a set of XSLT stylesheets that do the heavy lifting of converting between standards.  The real benefit is that you create a transformation once for each input and output, rather than building “one-off” transformations for each version of each standard.  Once in the converted markup, the content could leverage its transformations to produce the desired output.
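To put rough numbers on that benefit (my arithmetic, not figures from the original project): with five vocabularies converted pairwise in both directions, you would need 5 × 4 = 20 one-off transformations; converting through a single interchange format needs only 10 - one transformation into and one out of the interop per vocabulary.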

I think the key distinction is that the XML Interoperability Framework was never intended to be an authoring medium.  Users would create content in the target format, using the tools designed for that format.  This schema’s strategy is to author directly into the interop, and the unintended consequences described above make the complexity of implementing, using, and maintaining it far greater than it needs to be.  Sometimes, cutting out the middle man is not cheaper or easier.

Here’s another alternative to consider:

  1. A meaning for everything:  create a schema with clear, discrete semantics and a specific content model for each structure.  Yes, it explicitly means you have to create stylesheets with greater degrees of freedom to support the output styling you want, and perhaps it’s always a one-off effort, but overall, it’s easier to manipulate a transformation with overrides or parameters than to overload semantics.

    For example, consider the example above: if we want to mandate that a figure heading must have a caption label and a caption number, then semantically tagging them as such gives you greater freedom for your inline tagging markup like <span>. Using this principle, I could see markup like the following:

    <figure> 
        <figtitle>
            <caption_label>Figure</caption_label> 
            <caption_number>1.1</caption_number> 
            Excalibur and the Lady of the Lake 
        </figtitle> 
    </figure> 

    Which might be rendered in HTML5 as:

    <figure> 
        <h2>
            <span class="caption">Figure </span> 
            <span class="caption_number">1.1 </span> 
            Excalibur and the Lady of the Lake 
        </h2> 
    </figure>

    That also allows me to distinguish it from other types of headings that have different markup requirements. For example, a section title might not have the same caption and numbering mandate:

    <section> 
        <title>The Relationship Between Arthur and Merlin</title> 
        <subtitle>Merlin as Mentor</subtitle> 
        ... 
    </section>

    Which might be rendered in HTML5 as:

    <section> 
        <h1>The Relationship Between Arthur and Merlin</h1> 
        <h2>Merlin as Mentor</h2> 
        ... 
    </section>

    Notice that in both cases we’re not throwing all the HTML5 markup overboard (figure and section are HTML5 elements); we’re just providing more explicit semantics that model our business rules more precisely. Moreover, it’s substantially easier to encapsulate and enforce these distinctive models in the schema, without assertions or Schematron rules, unless there are specific business rules within the text or inline markup that must be enforced independently of the schema (a rough sketch of such a content model appears after this list).

    Of course, if you change the schema, you may also have to make changes in the authoring environment and/or downstream processing. However, that would be true in either case. And, irrespective of whether I use an HTML5-like or a semantically explicit schema, I still need to apply some form of transformation to content written against earlier versions of the schema to update it to the most current version. The key takeaway is that there is little in the way of development savings with the HTML5 approach.

  2. Design the system with the author as your first priority.  For example, most XML authoring tools make it easy to insert the correct tags for required markup (e.g., our figure heading), especially when each tag’s name is distinct. Many of these same tools also provide functionality to “hide” or “alias” the tags in a way that’s more intuitive to use. Doing this with an overloaded tagging approach will require a lot more development effort to provide the same ease of use. Without that effort, and left to their own devices, authors are going to struggle to create valid content, and you are almost certain to have a very difficult time with adoption.
  3. Recognize that tools change over time. The less you have to customize to make the authoring experience easy, the more likely you can take advantage of new features and functionality without substantial rework, which also means lower TCO and subsequently, higher ROI.
  4. Back end changes are invisible to authors. By all means, it’s absolutely vital to optimize your downstream processes to deliver content more efficiently and to an ever-growing number of digital formats. However, the tradeoffs for over-simplifying the back end might end up costing more.
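As a rough sketch of the first point above, the explicit figure-title model might be expressed in XSD along these lines (the element names follow the figure example; the type details are illustrative, not from an actual schema):

<!-- The label-then-number order falls out of the content model itself; no assertion needed -->
<xs:element name="figtitle">
    <xs:complexType mixed="true">
        <xs:sequence>
            <xs:element name="caption_label" type="xs:string"/>
            <xs:element name="caption_number" type="xs:string"/>
            <!-- inline elements (emphasis, etc.) could be allowed after the number -->
        </xs:sequence>
    </xs:complexType>
</xs:element>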

HTML5 will become the base format for a wide range of digital media, ranging from EPUB to mobile and the web. On the surface, it would appear that using HTML5 makes sense as both a source format and a target format. The idea has a lot of appeal particularly because of the numerous challenges that still persist today with standard or custom markup grammars that have impacted both authoring and backend processes.

Microformats’ appeal is the ability to leverage a well-known markup (HTML) to create small, discrete semantic data structures targeted for applications with a strong understanding of the format. Leveraging the simplicity of HTML5, we had hoped to create a structured markup that was easy to use for content creation, and with little to no overhead on the back end to process and deliver the content. However, I discovered that it doesn’t scale well when we try applying the same design pattern to a larger set of rich semantic structures within a schema designed for formatting semantics.

Instead, the opposite appears to be true: I see greater complexity in the schema design due to the significant overloading of the class attribute to imply semantic meaning. I also see limitations in current XML authoring tools to support a schema with that level of complexity, without incurring a great deal of technical debt to implement and support a usable authoring environment.

I also discussed how implementing an HTML5 schema with overloaded class attributes likely won’t provide development savings compared to more semantically-explicit schemas when changes occur. In fact, the HTML5 schema may incur greater costs due to its dependency on assertions or Schematron to enforce content rules.

Rather than overloading tags with different structural semantics, an alternative might be a “blended” model. Leverage HTML5 tags where they make sense: article, section, figure, paragraphs, lists, inline elements, and so on. Where there are content model variations or the need for more constrained models, use more explicit semantics. This kind of approach takes advantage of the built-in features and functionality available in today’s XML authoring tools and mitigates the level of programming or training required. Also, the underlying schema is much easier to maintain long term. Of course, there are trade-offs in that back-end processing pipelines must transform the content. However, with the right level of design, the transformations can be made flexible and extensible enough to support most output and styling scenarios. With this in mind, this type of tradeoff is acceptable if the authoring experience isn’t compromised.

Tuesday, July 24, 2012

Enumerated Constants in XQuery

I’ve been working on a little project that allows me to merge my love of baseball with my knowledge of XML technologies.  In the process of working through this project, I am creating XQuery modules that encapsulate the logic for the data.  Part of the data that I’m looking at must account for different outcomes during the June amateur draft.

It turns out that the MLB June Amateur Draft is quite interesting in that drafting prospects is a big gamble.  Draftees may or may not sign in any given year, and remain eligible for the draft in subsequent years.  If they don’t sign during that year, they could be drafted by another team in a following year.  Alternately, they could be selected again by the same team and signed.  However, even if they do sign, there’s no guarantee that they’ll make it to the big leagues.  And even if they do, they might not make it with the same team they signed with initially (in other words, they were traded before reaching the MLB).

In effect there are several scenarios, depending on how the data is aggregated or filtered.  However, these scenarios are well defined and constrained to a finite set of possibilities:
  • All draft picks
  • All signed draft picks
  • All signed draft picks who never reach the MLB (the vast majority don’t)
  • All signed draft picks who reached the MLB with the club that signed them
  • All signed draft picks who reached the MLB with another club
  • All unsigned draft picks
  • All unsigned draft picks who reached the MLB with a different club
  • All unsigned draft picks who reached the MLB with the same club, but at a later time
  • All unsigned draft picks who never reach the MLB
All of these scenarios essentially create subsets of information that I can work with, depending on whether I’m interested in analyzing a single draft year or all draft years in a range.  They’re essentially the same queries, with minor variations in filtering to meet a specific scenario.

Working with various strongly typed languages like C# or Java, I would use a construct like an enum to encapsulate these possibilities into one object.  Then I can pass this into a single method that will allow me to conditionally process the data based on the specified enum value.  Pretty straightforward.  For example, in C# or Java I would write:
public enum DraftStatus {
   ALL,  //All draft picks (signed and unsigned)
   UNSIGNED, //All unsigned draft picks
   UNSIGNED_MLB, //All unsigned picks who made it to the MLB
   SIGNED,  //All signed draft picks
   SIGNED_NO_MLB, //Signed but never reached the MLB
   SIGNED_MLB_SAME_TEAM, //signed and reached MLB with the same team
   SIGNED_MLB_DIFF_TEAM  //signed and reached with another club   
};
The important aspect of enumerations is that each item in an enumeration can be descriptive and also map to a constant integer value.  For example UNSIGNED is much more intuitive and meaningful than 1, even though they are equivalent.

Working with XQuery, I don’t have the luxury of an enumeration - well, at least not in the OOP sense.  I could write separate functions for each of the scenarios above, each performing the specific query and returning the desired subset.  But that’s just added maintenance down the road.

At first I toyed with the idea of using an XML fragment containing a list of elements that mapped the element name to an integer value:
<draftstates>
    <ALL>0</ALL>
    <UNSIGNED>1</UNSIGNED>
    <UNSIGNED_MLB>2</UNSIGNED_MLB>
    <SIGNED>3</SIGNED>
    <SIGNED_NO_MLB>4</SIGNED_NO_MLB>
    <SIGNED_MLB>5</SIGNED_MLB>
    <SIGNED_MLB_SAME_TEAM>6</SIGNED_MLB_SAME_TEAM>
    <SIGNED_MLB_DIFF_TEAM>7</SIGNED_MLB_DIFF_TEAM>
</draftstates>
And then using a variable declaration in my XQuery:
module namespace ds="http://ghotibeaun.com/mlb/draftstates";
declare variable $ds:draftstates := collection("/mlb")/draftstates;
To use it, I need to cast the element value to an integer. Using an example, let's assume that I want all signed draftees who reached the MLB with the same team:
declare function gb:getDraftPicksByState($draftstate as xs:integer, $team as xs:string) as item()* {
   let $picks := 
       if ($draftstate = 
           xs:integer($ds:draftstates/SIGNED_MLB_SAME_TEAM)) then
           let $results := 
               /drafts/pick[Signed="Yes"][G != 0][Debut_Team=$team]
           return $results
       (: more cases... :)
       else ()
   return $picks
};

(:call the function:)
let $sameteam := 
    gb:getDraftPicksByState(xs:integer($ds:draftstates/SIGNED_MLB_SAME_TEAM), 
                     "Rockies")
return $sameteam
It works, but it’s not very elegant.  Every value in the XML fragment has to be extracted through the xs:integer() function, which adds logic and makes the code less readable.   Add to that, code completion (and code hinting) in IDEs like Oxygen doesn’t work with this approach.

What does work well (at least in Oxygen, and I suspect in other XML/XQuery IDEs) is code completion for variables and functions, which led me to another idea.  Prior to Java 5, there was no enum type.  Instead, enumerated constants were created by declaring constants encapsulated in a class:
public class DraftStatus {
    public static final int ALL = 0;
    public static final int UNSIGNED = 1;
    public static final int UNSIGNED_MLB = 2;
    public static final int SIGNED = 3;
    public static final int SIGNED_NO_MLB = 4;
    public static final int SIGNED_MLB = 5;
    public static final int SIGNED_MLB_SAME_TEAM = 6;
    public static final int SIGNED_MLB_DIFF_TEAM = 7;   
}
This allowed static access to the constant values via the class, e.g., DraftStatus.SIGNED_MLB_SAME_TEAM.
The same principle can be applied to XQuery.  Although there isn’t the notion of object encapsulation by class, we do have encapsulation by namespace.  Likewise, XQuery supports code modularity by allowing little bits of XQuery to be stored in individual files, much like .java files. To access class members, you (almost always) have to import the class into the current class.  The same is true in XQuery.  You can import various modules into a current module by declaring the referenced module’s namespace and location.
Using this approach, we get the following:


mlbdrafts-draftstates.xqy
xquery version "1.0";

module namespace ds="http://ghotibeaun.com/mlb/draftstates";

declare variable $ds:ALL as xs:integer := 0;
declare variable $ds:UNSIGNED as xs:integer := 1;
declare variable $ds:UNSIGNED_MLB as xs:integer := 2;
declare variable $ds:SIGNED as xs:integer := 3;
declare variable $ds:SIGNED_NO_MLB as xs:integer := 4;
declare variable $ds:SIGNED_MLB as xs:integer := 5;
declare variable $ds:SIGNED_MLB_SAME_TEAM as xs:integer := 6;
declare variable $ds:SIGNED_MLB_DIFF_TEAM as xs:integer := 7;
Now we reference this in another module:
import module namespace ds="http://ghotibeaun.com/mlb/draftstates" at "mlbdrafts-draftstates.xqy";
Which gives us direct access to all the members, like an enumeration:
[Screenshot: code completion in Oxygen listing the $ds: constants]
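Putting the constants module to work, the earlier function and call might be rewritten along these lines (a sketch; the gb module and document layout follow the earlier fragment, and the gb namespace declaration is omitted as before):

(: import the constants module declared above :)
import module namespace ds="http://ghotibeaun.com/mlb/draftstates" at "mlbdrafts-draftstates.xqy";

declare function gb:getDraftPicksByState($draftstate as xs:integer, $team as xs:string) as item()* {
    if ($draftstate eq $ds:SIGNED_MLB_SAME_TEAM) then
        /drafts/pick[Signed = "Yes"][G != 0][Debut_Team = $team]
    (: more cases... :)
    else ()
};

(: the call site now reads much like the Java enum version :)
gb:getDraftPicksByState($ds:SIGNED_MLB_SAME_TEAM, "Rockies")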
The bottom line is that this approach has worked really well for me.  I can use descriptive constant names that map to specific values throughout my code, and it shows how you can add a little rigor to your XQuery coding.

Tuesday, January 17, 2012

A First Look at ODRL v2

With other things taking high priority over the last 6 months, this is the first opportunity I’ve had to look at the progression of ODRL Version 2.0 and to evaluate where it has improved over the earlier versions.

First things first, ODRL has migrated to the W3C as a Community Working Group.  Overall, this is a good thing.  It opens it up to the wider W3C community, gives greater credence to the effort and more importantly, more exposure.  Well done. 

On to my first impressions:

1. The model has been greatly simplified.   With ODRL 1.x, it was possible to express the same rights statement in several different ways.  The obvious implication was that it was virtually impossible to build a generalized API for processing IP rights, save running XJC on the schema, which isn't necessarily always what I want.  It wasn’t all bad news, though: the 1.x extension model was extremely flexible and enabled the model to support additional business-specific rights logic.

2. Flexible Semantic Model.  The 2.0 model has a strong RDF-like flavor to it.  Essentially, all of the entities - assets, parties (e.g., rightsholders, licensees), permissions, prohibitions, and constraints - are URI-based resource pointers that carry the semantics of each entity.  This is a vast improvement over 1.x’s tag-based semantics, which meant that you were invariably extending either the ODRL content model, the data dictionary, or both.
 
3. Needs More Extensibility.   The current normative schema, still in draft, does need some additional design.  Out-of-the-box testing with Oxygen shows that only one element is exposed (policy).  All of the other element definitions are embedded within the complexType models, which makes it difficult to extend the model with additional structural semantics.  This is extremely important on a number of fronts:
  • The current model exposes assets as explicit members of a permission or prohibition.  Each “term” (i.e., permission or prohibition) is defined by an explicit action (print, modify, sell, display).  It’s not uncommon to have a policy that covers dozens or hundreds of assets.   So for each term, I have to explicitly call out each asset.  This seems a little redundant.  The 1.x model had the notion of terms that applied to all declared assets at the beginning of the policy (or in the 1.x semantics, rights).  I’d like to see this brought back into the 2.0 model.
  • The constraint model is too flat.  The new model is effectively a tuple of constraintName, operator, and operand.  This works well for simple constraints like the following pseudo-code: “print”, “less than”, “20000”, but it doesn’t work well for situations where exceptions may occur (e.g., I have exclusive rights to use the asset in the United States until 2014, except in the UK; or I have worldwide rights to use the asset in print, except for North Korea and the Middle East).   Instead, I have to declare the same constraint twice:  once within a permission, and a second time as a prohibition.   I’d like the option to extend the constraint model to enable more complex expressions like the ones above.

    Additionally, list values within constraints are expressed as tokenized strings within the rightOperand attribute.  While it’s completely valid to store values this way, I have a nit against these types of token lists, especially if the set of values is particularly long, as it can be for things like countries using ISO 3166 codes.
I shouldn’t have to extend a whole complexType declaration in order to extend the model with my own semantics, but the current schema is structured that way.   Instead, I’d like to see each entity type exposed as an “abstract” element, bound to a type, which ensures that my extension elements would at least have to conform to the base model.
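For example, something along these lines (a rough sketch of what I have in mind, not the actual ODRL schema; the o: and my: prefixes are placeholders):

<!-- Expose an abstract head element for each entity type, bound to its base type... -->
<xs:element name="constraint" type="o:ConstraintType" abstract="true"/>

<!-- ...so an extension module can substitute its own element, whose type must derive from the base -->
<xs:element name="regionConstraint" type="my:RegionConstraintType" substitutionGroup="o:constraint"/>

<xs:complexType name="RegionConstraintType">
    <xs:complexContent>
        <xs:extension base="o:ConstraintType">
            <xs:sequence>
                <xs:element name="except" type="xs:anyURI" maxOccurs="unbounded"/>
            </xs:sequence>
        </xs:extension>
    </xs:complexContent>
</xs:complexType>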

Takeaways


I’m looking forward to using this with our Rights Management platform.  The model is simple and clean and has a robust semantics strategy modeled on an RDF-like approach, which will make it easier to use the out-of-the-box model.  That said, it’s missing some key structures that would make it easier to use and extend when I have to, but that can be addressed with a few modifications to the schema.  (I have taken a stab at refactoring the schema to test this theory - it’s pretty clean, and I’m able to add my “wish list” extensions with very little effort.)

Link: http://dl.dropbox.com/u/29013483/odrl-v2-proposed.xsd

Friday, February 13, 2009

XMetaL Reviewer Webinar

I attended a webinar yesterday hosted by Just Systems for their XMReviewer product.  The problem space is that conventional reviewing processes are cumbersome and inefficient, particularly when multiple reviewers need to review a document concurrently.  In general, most review processes rely on multiple draft copies being sent out, one to each reviewer, and it’s then up to the author to “merge” the comment feedback into the source.

With XMReviewer, the entire review process is centralized on the XM Reviewer server:  Reviewers simply access the document online, provide their comments and submit. What’s really cool is that reviewers are notified in almost real time when another reviewer has submitted their comments and can integrate their fellow reviewer’s comments into their own.

The real advantage is that authors have all reviewer comments integrated and merged into a single XML instance, and in context. Very Nice. 

There’s also a web service API that allows you to integrate XMReviewer with other systems including a CMS that can automatically deploy your content to the XMReviewer server.

There are some nice general reporting/auditing features built in as well.  However, I didn’t see anything that would allow me to customize the reports or to manipulate the data, but I wouldn’t consider that a show stopper.

For folks used to “offline” reviews, e.g., providing comments at home or on a plane, this won’t work for you, as it is a server application.  Nonetheless, having full control and context for review comments far outweighs the minor inconvenience of being online and getting access to the server (most companies these days have VPN, so it’s not a showstopper).  Though, I can envision the server downloading and installing a small-footprint application that would allow users to review the document “offline” and submit the comments back to the server when the reviewer is back online.

The only other limitation right now is that XMReviewer doesn’t support DITA map-level reviews in which you can provide comments on multiple topics within a map.  This is currently in development for a future release – stay tuned.

Overall, XMReviewer looks great and can simplify your content review process.  Check it out.

Monday, February 9, 2009

Implementing XML in a Recession

With the economic hard times, a lot of proposed projects that would allow companies to leverage the real advantages of XML are being shelved until economic conditions improve.  Obviously, in my position, I would love to see more companies pushing to use XML throughout the enterprise. We’ve all heard of the advantages of XML: reuse, repurposing, distributed authoring, personalized content, and so on. These are the underlying returns on investment for implementing an XML solution.  The old business axiom goes, “you have to spend money to make money.”  A corollary might suggest that getting the advantages of XML must mean spending lots of money.

However, here’s the reality: implementing an Enterprise-wide XML strategy doesn’t have to break the bank. In fact, with numerous XML standards that are ready to use out of the box, like DITA and DocBook for publishing and XBRL for business, the cost of entry is reduced dramatically compared to a customized grammar. 

And while no standard is ever a 100 percent perfect match for an organization’s business needs, at least one is likely to support at least 80 percent of them.  We often advise our clients to use a standard directly out of the box (or with very little customization) until they have a good “feel” for how well it works in their environment before digging into the real customization work.  Given that funding for XML projects is likely to be reduced, this is the perfect opportunity to begin integrating one of these standards into your environment, try it on for size while the economy is slow, and when the economy improves, consider how to customize your XML content to fit your environment.

Any XML architecture, even one on a budget, must encompass the ability to create content and to deliver it.  Here again, most XML authoring tools available on the market have built-in support for many of these standards; with little to no effort, you can use these authoring environments out of the box and get up to speed.

On the delivery side, these same standards, and in many cases the authoring tools have prebuilt rendering implementations that can be tweaked to deliver high-quality content, with all of the benefits that XML offers.  In this case, you might want to spend a little more to hire an expert in XSLT.  But it doesn’t have to break the bank to make it look good.

The bottom line: A recessionary economy is a golden opportunity to introduce XML into the enterprise. In the short term, keep it simple, leverage other people’s work and industry best practices and leave your options open for when you can afford to do more.  Over time when funding returns, then you can consider adding more “bells and whistles” that will allow you to more closely align your XML strategy with your business process.

Friday, April 25, 2008

Context-driven Transclusion

I recently had to implement a really interesting set of functionality for a client where the core content could include supplementary content that was edited and maintained separately. Since supplemental content could change on a regular basis, we wanted to ensure that the supplemental content was always up to date within the core content. The core DITA topic templates could be reused in different map templates that formed the basis of a final form publication, and the client wanted slightly different supplement content to be displayed based on which map template it was contained in.

Conref wouldn't work since we couldn't point to a static resource. Applying a profile also wouldn't work, since the inclusion wasn't based on a static profile type, and it had another negative side effect: by assigning the value within a topic, the context of any dynamic transclusion is limited to the currently known universe of map templates. Since this client would continue to add, remove, and change map templates, the underlying topic template would have to be touched each time a map template was changed.

The approach I devised was to assign each supplement block with one required and one optional attribute:
  • name: This would identify the supplement with an identifier that could be referenced within the topic
  • map-type: This is a "conditional" attribute that identifies the type of map this particular supplement will appear in. If the supplement didn't include this attribute, it would be considered "global" and would appear in all map types.

The supplement itself was a domain specialization that allowed me to create a specialized topic that contained nothing but supplements, and to embed the domain into my main content topic specialization. Here's a quick sample of the supplement content:

<supplements>
...
<!-- global supplement: appears in all map types -->
<supplement name="introduction">
...
</supplement>

<!-- conditional supplement: appears in specified map types -->
<supplement name="getting-started" map-type="type1">
Do steps 1, 2, 3 and 4
</supplement>
<supplement name="getting-started" map-type="type2">
Do steps 1, 3, 5 and 7
</supplement>
</supplements>

So in each of my content topics, I created an anchor using the same element name, but this time I use a third attribute I created on the supplement called sup-ref. The sup-ref attribute acted like an IDREF by referencing a supplement element with the same name. Let's assume I have a topic with a file name of "topic1.dita":

<mytopic id="topic1">
<title>Title</title>
<mytopicbody>
<supplement sup-ref="introduction"/>
<supplement sup-ref="getting-started"/>
</mytopicbody>
</mytopic>
So in this case, I have a supplement anchor that references the content for introduction, which is a global (unconditional) supplement, and a second anchor for getting-started, which is conditional and will only be included in my content topic if the topic is referenced in the context of a map-type with a matching value.

Now let's assume that I have two different map types that are defined by an attribute called map-type (I could also have created two separate map specializations with different names; depending on what you need, your mileage may vary). This attribute stores a defined map's type name.

<mymap map-type="type1">
<topicref href="topic1.dita"/>
</mymap>

This attribute is primarily used as metadata for identifying and organizing maps within a content store (CMS, XML Database, etc.), but we can also use it for driving our transclusion.

In our XSLT, we simply create a variable that stores the map's map-type value:

<xsl:variable name="map.type" select="@map-type"/>
When we process our content topic and encounter our supplement reference, we perform a two-stage selection that
  • collects all supplements with a name attribute that matches the current sup-ref attribute - we do this because we don't know yet if the supplement source is global or context-specific.
  • With this collection, we refine our search by testing whether there is more than one supplement element; if so, we filter the search by obtaining the supplement that has a map-type attribute matching our map.type variable. Otherwise, we run a simple test to see if our single supplement is intended for a specific map context or not. If not, we include the supplement. If it is tied to a different map context, we can emit an error indicating that there isn't a matching supplement.
If we have a match, we replace the anchor supplement element in our content topic with the supplement element from our external source.
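Sketched as an XSLT template, the resolution might look something like this (a simplified sketch; the supplements document location is a placeholder, and it assumes the map's map-type value is available as $map.type, per the variable above):

<!-- Resolve a supplement anchor against the external supplements document -->
<xsl:template match="supplement[@sup-ref]">
    <xsl:variable name="candidates"
        select="document('supplements.xml')//supplement[@name = current()/@sup-ref]"/>
    <xsl:variable name="contextual" select="$candidates[@map-type = $map.type]"/>
    <xsl:choose>
        <!-- prefer a supplement written for this map type -->
        <xsl:when test="$contextual">
            <xsl:apply-templates select="$contextual[1]/node()"/>
        </xsl:when>
        <!-- otherwise fall back to a global (unconditional) supplement -->
        <xsl:when test="$candidates[not(@map-type)]">
            <xsl:apply-templates select="$candidates[not(@map-type)][1]/node()"/>
        </xsl:when>
        <xsl:otherwise>
            <xsl:message>No matching supplement for <xsl:value-of select="@sup-ref"/></xsl:message>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>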

The cool part of all of this is that I can keep the supplemental material separate, so that it can be edited and updated whenever it needs to be, and I can supply different supplemental content to the content topic based on its context as a member of a map and the map's type.

While this is a specific scenario in DITA (the names and functions of the elements have been changed for client confidentiality), the same approach can also be applied to other scenarios that require similar functionality for virtually any XML grammar!

Tuesday, April 15, 2008

Review and Annotation Markup Language?

This subject seems to pop up frequently in my client engagements. Standards like DocBook and DITA both have review and annotation markup. Yet, many structured authoring tools use their own markup for reviewing and annotating XML draft content. Some use processing instructions; others use special namespaced elements. None, to my knowledge, recognize the specific elements as annotations.

For many of our clients, a clearly defined review process is critical to the overall lifecycle of the content. If vendors can't support each standard's specific annotation markup, it makes me think that a common markup language for reviewing and annotation would be extremely useful.

Sunday, April 13, 2008

Do We Need Structured Document Formats?

Eric Armstrong has posed a very interesting question about structured document markup languages. And there is a great deal of merit to his question. I want to take a look at some of his points and provide my own thoughts.

Is Markup Too Complicated?

Eric writes:


Those observations explain why structured document formats are so difficult to use: They force you to memorize the tagging structure. They require training, as a result, because it's virtually impossible for the average user to be productive without it.

The editing situation is much better with DITA (120 topic tags, plus 80 for maps) than it is with DocBook (800 tags), or even Solbook (400 tags), but it is still way more difficult than simple HTML--80 tags, many of which can nest, but few of which have to.

But even with a relatively simple format like HTML, we have manual-editing horror stories. In one instance, a title heading was created with 18 non-breaking spaces and a 21-point font. (Try putting that through your automated processor.)

If I had a nickel for every time I've heard someone tell me, "I don't care about what tag I use, I just want to write my document", I could retire right now and live off the interest. There's no doubt that transitioning from traditional unstructured desktop authoring tools to structured authoring tools often causes turmoil and cognitive dissonance. Which brings up an interesting question in my mind: are all semantic markup languages inherently problematic?

And this is where I think Eric and I have a slight difference of opinion. Eric suggests that wikis offer an alternative to the "tag jambalaya" (my term) of markup languages. Wikis are incredibly good at enabling users to quickly create content without being encumbered by a whole lot of structure or learning curve. For content like Wikipedia, where users of varying skills contribute their knowledge to a shared resource, this makes sense.

However, if I'm writing a manual (collaboratively or not - we'll touch on this later), a reasonable amount of structure is desirable. I agree that a typical user will likely never use a majority of the tags that are built in to DITA, DocBook, or even HTML - this is the price of being an open standard: content models tend to become "bloated" with markup deemed necessary by a wide range of interests. In the past, I wrote manuals for a Unix operating system using DocBook. Of the 400 or so elements in the grammar, I only used 70 or 80 of these elements. The rest didn't apply to the subject matter. I also can't recall the last time I used the samp tag in HTML. It's there, but I don't have to use it.

Even for many of our clients, we end up creating new DITA DTD shells specifically to strip out unnecessary domains and simplify the content model. I will say that it's often easier to remove what you don't need than it is to integrate something that isn't there. The new DocBook 5 schemas (developed with RelaxNG) make it very easy both to remove unwanted elements and to add new ones. The DocBook Publisher's Subcommittee schema (currently under development) removes many existing DocBook elements that aren't needed while adding a few elements that are relevant for publishers.

This also leads me to another question: which wiki markup? There are literally dozens of wiki markup languages out there, each a little different than the others. Where is the interoperability?

Standard structured markup languages like DocBook and DITA (and even XHTML) are essentially like contracts that state that if you follow the rules within the schema, the document can be rendered into any supported format, and the markup can be shared with others using the same schema. I can even leverage the content into other markup formats.

But where structured, semantic markup shines is in the case where business rules dictate that each DITA task topic must contain a context element (it doesn't now, but you could enforce such a rule in the DTD), or that all tables must contain a title. Unstructured markup like wikis will have a hard time enforcing that, as will HTML. But structured markup with a DTD or schema makes this very easy.
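For instance, a toy DTD fragment (my illustration, not the actual DITA or CALS declarations) enforces both rules simply by making the elements required in the content model:

<!-- context and title are required here because the content models say so -->
<!ELEMENT taskbody  (prereq?, context, steps, result?, postreq?)>
<!ELEMENT table     (title, tgroup+)>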

A not so ancillary point to structured semantic markup is the ability to identify content by its intended meaning - an admonition tagged as a caution or warning is much easier to find (and reuse) than a text block (or generic HTML div or table) that starts with the word "Caution" or "Warning", despite the fact that they might be rendered the same way. And if the admonition contains more than one paragraph of text, having that containment within markup to indicate the start and end of a particular structure is very useful. This is not to mention that, from a localization perspective, tagged semantic markup is the way to go.

Eric rightfully points out that tools like Open Office allow users to create content without knowing that the native format is a markup language. The same is true for many WYSIWYG HTML editors these days (and there are pretty cool web-based gadgets out there too!). Most users never have to see what the underlying HTML looks like. This is where we need to focus our attention. It isn't that markup languages themselves are difficult. Rather, it's that the tools we use to create the underlying markup are perhaps too difficult for authors to use.

And the excuse we use is that going from unstructured to structured authoring means that authors have to sacrifice some of the flexibility. There's no question that this response is wearing thin, and that most authors (both professional and casual) believe that there has to be a better way.

Conditional Metadata

Eric's point about conditional metadata filtering has had some serious discussion recently on the Yahoo DITA Users Forum. And arguably, there is merit in some of the ideas presented there. Eric's point here deserves mention:


But the fact that such a thing can be done does not mean that it is necessarily desirable to do so. Experience suggests that reuse gets tricky when your environment changes--because your metadata reflects your environment. If your environment doesn't change, then your metadata is fixed. You define it, assign it and that's the end of it. The metadata tagging adds some complexity to your information, but you can live with it, and it buys you a lot.

Metadata is only meaningful when it has context. Context in this case means that there is a relationship between the content and some known "variable" - a particular audience group, an operating platform, or another target that scopes the content's applicability. Where I see churn is in the area of "filtering" content, i.e., suppressing or rendering content based on metadata values. To me, this is an implementation problem rather than a design problem.

In the classic case of conditionality, overloading any markup with multiple filtering aspects purely for rendering or suppressing content can lead to serious problems, and requires special treatment and another discussion. However, if we look at metadata as a means of creating a relationship between the tagged content and specific target(s), the potential for more targeted search and focused, dynamic content assembly expands greatly.

Transclusion and Reuse:


So maybe a really minimal transclusion-capability is all we really need for reuse. Maybe we need to transclude boilerplate sections, and that's about all.

There's no question that transclusion can be abused to the point that a document is cobbled together like Frankenstein's Monster. However, there are cases when transcluding content does make sense, and not just for boilerplate content. We're only beginning to really see the possibilities of providing users with the right amount of information, when they want it, and targeted for that user's level of detail based on metadata (see the Flatirons Solutions whitepaper, Dynamic Content Delivery Using DITA). Essentially, content can be assembled from a wide range of content objects (topics, sections, chapters, boilerplate, etc.). I would be reluctant to suggest that "boilerplate" or standardized content is the only form of reuse we need.

Still, Eric's question is valid - what is optimal reuse? The answer is that it depends. For some applications, standard boilerplate is right; for others, the ability to transclude "approved" admonitions is necessary. And for some, transclusion of whole topics, sections, or chapters is appropriate. The point is that the information design, based on a thorough analysis of the business and its goals, along with an evaluation of the content, will dictate the right amount of reuse.

From a collaborative and distributive authoring perspective, enabling writers to focus on their own content and assemble everything together in a cohesive manner definitely makes a great deal of sense. Wikis work well if you're dealing with collaboration on the same content, but don't really solve the problem of contributing content to a larger deliverable.

Formatting and Containment

Eric's argument is that HTML pretty much got it right because it limited required nesting and containment to lists and tables. Now if I were working with ATA or S1000D all the time, I would agree wholeheartedly. Even DocBook has some odd containment structures (mediaobject comes to mind, but there are benefits for this container that I also understand). From the point of pure simplicity and pure formatting intent, he's right. But the wheels get a little wobbly if we always assume that we're working with a serial content stream solely for format.

One area where containment makes a great deal of sense is in the area of Localization. By encapsulating new and/or changed content into logical units of information, you can realize real time savings and reduced translation costs.

Containment also makes transclusion more useful and less cumbersome. Assuming that we aren't creating Frankenstein's Monster, the ability to point to only the block of content I want, without cutting and pasting, is a distinct advantage.

Conclusion

At the heart of Eric's article, I believe, is the KISS principle. Inevitably, from a content perspective, when you boil structured document formats down to their essence, you get headings, paragraphs, lists, tables, images, and inline markup (look at the Interoperability Framework white paper that Scott Hudson and I wrote to illustrate this). So why use structured markup at all when my desktop word processor can do that right now? In my view, there are numerous reasons - some I've discussed here, and others, like the potential for interoperability, that make structured document markup languages extremely flexible and highly leverageable.

There is no doubt that today's structured markup tools don't always make it easy for users to create content without the markup peeking through the cracks. That doesn't mean that structured markup is the problem. For example, one of my web browsers won't display Scalable Vector Graphics (SVG) at all. It doesn't mean that the SVG standard is a problem, it means that I need to use a web browser that supports the standard.

Eric's article is thought-provoking and well done. It raises the level of discussion that we need to have around why we use structured content (and not because it's the coolest fad), and how we create that content. Let's keep this discussion going.

Monday, April 16, 2007

Where is the ROI with XML Document Interoperability?

Up to this point, most of my posts about XML Interoperability have focused on identifying the problem space that a standards-based Interoperability Framework would attempt to solve from a technical perspective. But technical merit alone isn't a sufficient reason for an organization to adopt and embrace a technology.

Instead, typical organizations want solutions to real problems that have real implications for productivity, resources, costs, or profit. They want a certain degree of confidence that any technology investment will a) solve the problem and b) deliver a return on that investment in a reasonable amount of time.

So where is the return on investment for XML interoperability?

Significant Investments Already Made


Consider the investment organizations have made thus far to migrate their content from proprietary formats to XML. Typically, the "up front" costs include:

  • content model development: DTDs, schemas
  • XML-based production tools: rendering engines, servers, XSLT, FO, etc.
  • new editing tools
  • content management systems
  • training

This is a substantial investment for many organizations with the intent that it will remain in service at least long enough to recover these costs, and hopefully longer.

Bill Trippe wrote:


Organizations are more diverse, more likely to be sharing content between operating groups and with other organizations, and more likely to be sourcing content from a variety of partners, customers, and suppliers. Needless to say, not all of these sources of content will be using the same XML vocabulary;


With organizations already invested in their own XML infrastructure, changes to this environment to support one or more additional XML vocabularies from different partners are bound to be met with resistance.

Content Sharing Today

Despite this, partners do share content today. They convert (transform) the content from one vocabulary into another, or modify DTDs or schemas to fit the other content model. Yet these solutions make two assumptions:

  1. Each partner is a terminus in the content sharing pipeline
  2. Each partner's XML vocabulary will not change

In some cases, these assumptions are valid, and XML interoperability on the scale of one-way content conversion or schema/DTD integration is quite manageable and efficient. In that case, implementing the proposed XML interoperability framework may simply add more overhead than it returns in value. However, if either of these assumptions does not hold, then interoperability (and scalability) is a real issue, and the framework may provide a mechanism for mitigating the costs and risks of trying to interoperate between numerous or changing vocabularies.


The Shortest Distance Between Two Points isn't a Straight Line

XML itself doesn't make any claims to enabling content reusability. But standards like DITA, and to some extent DocBook, provide mechanisms that enable content fragments to be reused in many target documents. For example, a procedural topic (section) written for a user's manual could also be used in training material or support documentation. I've frequently seen cases where a Tech Pubs group is using a different XML grammar than "downstream" partners like Training or Support. There's the added cost in time and resources to convert the content to fit their DTDs. More importantly, the semantics from the original source have now gone through two different transformations. It's almost like the children's game of "Telephone," where one child whispers a phrase in the next child's ear and so on down the line until the final child hears something entirely different. By enabling a shared interchange format, you can reduce the number of semantic deviations to only one.

The Only Constant is Change

The other reality is that even when all partners are using XML standards like DITA, DocBook, ODF, or S1000D, the standards continually evolve, adding and changing content models to meet their constituents' needs. Since these standards aren't explicitly interoperable, the cost of managing changes between different standards goes up considerably. And here is another area where having an interoperability format makes the most sense: if interchange is channeled through a neutral format, it can be interpreted to and from different standards (and versions) with far fewer transformation permutations. So if one partner moves to a different version of a standard, using an interchange format reduces the costs and risks to your own toolchain and processes.
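To put rough numbers on that claim (my own back-of-the-envelope arithmetic, treating each standard-and-version pair you must support as a distinct vocabulary):

    direct pairwise interchange among n vocabularies:  n(n - 1) directed transforms
    interchange through one neutral hub format:        2n directed transforms

At n = 3 the two come out even, but at n = 5 direct interchange already needs 20 transforms against 10 through the hub, and every new version a partner adopts pushes n higher.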

Lingua Franca

One of the design principles we've proposed with the Interoperability format is to leverage existing standards like XHTML. Because of this, we minimize the learning curve required for organizations to come up to speed to enable interoperability between different XML grammars. Just as English is the lingua franca for international business, aviation and science, a standardized interchange format for XML grammars provides a vehicle to enable content sharing between XML standards.
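Purely as a hypothetical illustration (this is not the actual framework syntax), an XHTML-leaning interchange fragment might carry source semantics in class attributes so that a DocBook section and a DITA topic land in the same neutral shape:

    <!-- hypothetical interchange fragment; element and class names are illustrative only -->
    <div class="section">
      <h2 class="title">Replacing the Filter</h2>
      <p class="para">Turn off the supply valve before you begin.</p>
    </div>

Anyone who can read XHTML can read this, which is the point: the learning curve stays close to what most teams already know.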

For example, an organization that has invested in a DITA XML infrastructure will not likely have a lot of in-house expertise in DocBook, ODF, or S1000D. Now the amount of time and effort to enable content interoperability goes up significantly. Add subsequent XML grammars into the interoperability mix and the level of complexity and cost climbs even higher.

With that said, a common XML interchange format isn't intended to be a translation vehicle; direct translation is more in line with the Content and Processing interchange strategies I've described in previous posts. These strategies do have a place in the whole discourse around interoperability. They make perfect sense for linear, end-to-end interchange where both parties understand each other's "language" very well. And in reality, these strategies are likely to be more cost-effective than employing a "middle man."

Rather, a standardized XML Interoperability Framework will provide the highest ROI under the following conditions:

  • Content is leveraged/shared among many different consumers using different languages
  • A corollary to the above: Content sharing is non-linear
  • Business demands (time-to-market, lack of in-house expertise, partner relationships) make direct XML grammar translation cost-prohibitive

Monday, February 19, 2007

Types of XML Content Interoperability: Pros and Cons

In my last post, I talked about why we need XML interoperability. Now, let's talk about different strategies for implementing interoperability. We'll also discuss the pros and cons for each approach.

There is a common thread with each approach: XSLT. What makes XML remarkably flexible and resilient (and widely adopted) is its ability to be transformed into so many different formats for both human and computer consumption. It's also why XML interoperability can even be discussed.

Types of XML Interoperability

There are three basic strategies for achieving interoperability between XML document standards:

  • Content Model Interoperability
  • Processing Interoperability
  • Roundtrip Interoperability

Each of these approaches has valid use case scenarios and should not be dismissed out of hand. Yet each makes certain assumptions about business processes and environments, assumptions that hold in some circumstances but are less than optimal in others.


Content Model Interoperability

Content Model Interoperability is centered around enabling all or part of one standard's content model to be included as part of another standard. For example, DITA's specialization capabilities could be employed to create custom topic types for DocBook sections or refentries (in a DITA-like way). Conversely, DocBook's DTDs are designed to create customizations on top of the core content model.

Beyond customizing the DTDs (or schemas), there is another step to support the new content in the standard: you need to account for these custom elements in the XSLT stylesheets, for each intended output format.
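As a rough illustration of that stylesheet work, here is a minimal XSLT sketch. The custom element name and the imported stylesheet path are hypothetical; a real customization layer would do this for every custom element:

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <!-- Pull in the stock HTML stylesheet for the base vocabulary
           (the href is a placeholder for your real import). -->
      <xsl:import href="base-vocabulary-html.xsl"/>

      <!-- Add HTML output support for a hypothetical custom element
           that the base stylesheet knows nothing about. -->
      <xsl:template match="safetynote">
        <div class="safetynote">
          <xsl:apply-templates/>
        </div>
      </xsl:template>

    </xsl:stylesheet>

Multiply that by each custom element and each output format (HTML, PDF via FO, help formats), and the maintenance burden of this approach becomes easier to see.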

While on the surface this approach appears to be the most logical way to ensure that your content can interoperate with another standard, it is not for the faint of heart. Working with DTDs and schemas is doable, but it will require a thorough understanding of both standards before you begin. There are other limitations:

  1. This approach allows you to accept content from one standard, but doesn't allow you to share or leverage this content with other collaboration partners. In effect, this approach is "shoehorning" content from one standard into yours. However, if you are only receiving content from one partner (and you aren't sharing content elsewhere), this could be a viable approach. But keep in mind...
  2. You and your partner are now both bound to fixed versions of the standards being used to share content. If either of you decides to move to a later version of the respective standards, you may have to rework your customizations to support the new content models. You also run the risk that your legacy content won't validate against the new DTDs or schemas.
  3. Be aware that while content in different namespaces may provide "short-term" relief, it can also cause "long-term" headaches (much in the same way that Microsoft's COM architecture introduced us all to "DLL Hell"). It also means that your content must be in a namespace (even if it is the default one).

Processing Interoperability

In this approach, content from one standard is either transformed or pre-processed into the other using XSLT. This approach is less risky in some ways than Content Model Interoperability: you don't have to maintain a set of DTDs to enable content interoperability, and it's a whole lot easier to share the content with partners once it's transformed into a single DTD.

There is a slightly different angle you can take: you could decide not to preprocess the content into your DTD, but instead use your XSLT stylesheets to incorporate the "foreign" content into the final output. For some cases, where you may simply be "rebranding" content, this might be viable, but keep in mind that it may mean some additional investment in incorporating other tools into your tool chain. For example, DITA and DocBook content employ very different processing models (i.e., the DITA Open Toolkit vs. the DocBook XSLT stylesheets), and it may require a hefty development effort to integrate these tools properly in your environment. However, if you intend to leverage the foreign content elsewhere in your own content, this angle becomes a lot harder to implement.
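Either way, the mechanics are plain XSLT. For the pre-processing route, a stylesheet might start out like this minimal sketch, which maps a few DocBook block elements to their rough DITA counterparts (it assumes non-namespaced DocBook 4 input, and a real mapping needs many more templates plus careful attribute and ID handling):

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <!-- DocBook para becomes DITA p -->
      <xsl:template match="para">
        <p><xsl:apply-templates/></p>
      </xsl:template>

      <!-- DocBook itemizedlist/listitem become DITA ul/li -->
      <xsl:template match="itemizedlist">
        <ul><xsl:apply-templates/></ul>
      </xsl:template>

      <xsl:template match="listitem">
        <li><xsl:apply-templates/></li>
      </xsl:template>

    </xsl:stylesheet>

The one-to-one mappings are the easy part; the risks below come from the structures that don't line up so cleanly.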

For organizations sharing content back and forth, or for groups that are receiving content from one partner and are sharing it with other partners in the pipeline, this could be a reasonable approach. Still, there are potential risks here:

  1. This "uni-directional" approach is more flexible than than Content Model Interoperability, but, you still potentially have the same DTD/Schema version problem. And it only works realistically for one pair of standards, for example DocBook and DITA.
  2. If your partner begins creating content in a newer version of their DTD, you may have to upgrade your transforms to enable the content to be used by you.
  3. You still need to be well-versed in both standards to ensure each plays nicely in your environment

  4. Be prepared for dealing with validation issues. While each standard does include markup for standard content components like lists, tables, images, etc., there are structures that do not map cleanly. In these cases, you will need to make some pretty hard decisions about how they will (or will not) be marked up.


Roundtrip Interoperability

This is perhaps the most ambitious approach to creating interoperable content: it encompasses being able to transform one standard into another and round-trip that content back into the original standard. As with Processing Interoperability, you still have some very tricky issues to contend with:

  1. How do you handle round tripping between different versions of the standards? The net result is that you will need multiple stylesheets to support each version permutation.

  2. It's bi-directional, meaning that the round trip only works between the two standards (and with specific versions of those standards).

The following figures (taken from the presentation Scott Hudson and I gave at DITA 2007 West) illustrate the problem:




In this example, we're only dealing with two standards, DocBook and DITA. But as you can see, there are numerous permutations that are potential round-trip candidates. Now let's add another standard, like ODF:






You can see that this quickly becomes a very unmanageable endeavor.



Conclusion

I've gone over three different strategies for approaching XML interoperability, situations where they work well, and some of the problems you may encounter when choosing one of these strategies. In my next post, I'll look at another approach for handling XML interoperability.

Saturday, February 17, 2007

Why XML Content Interoperability Is Critical

Content re-use, sharing, and re-purposing has long been the "Holy Grail" of publishing. There have been numerous attempts at solving this problem, all with varying degrees of success. Yet as publishing has grown and evolved over time, technologies have emerged to solve one set of problems, only to find that other technical challenges have "filled the void."

Back In The Day...

In the 1980's, desktop publishing tools opened the door for a brand new set of professionals to create and produce high-quality, lower-cost publications. Tools like Frame, Quark, WordStar, WordPerfect, and Word allowed veritable novices to quickly and easily create professional-looking typeset documents.

Yet, while these tools opened the door to cheaper, faster publication cycles, they also created a very significant problem: each tool stored content in its own proprietary format, and these formats were not compatible with other tools! Sure, this wasn't a problem if all your content was in the same format and you didn't have to leverage, share, or reuse content from other tools. But the moment you needed to incorporate information from any other tool, life became very difficult.

The unintended consequence of trying to convert incompatible formats to your own was usually a "one-off," but more importantly, the incurred costs (including the psychological and physical trauma) were very high. It wasn't uncommon for a single writer/editor/DTP professional to spend several weeks converting a manual. Now what if your organization was sharing content from several partners using different formats? Then we're talking about real money!

Enter SGML...

In the late 80's and early 90's, the Standard Generalized Markup Language (SGML) emerged with the promise that it would solve the format conundrum that desktop publishing tools had wrought on us. The idea was brilliant: all content would be stored as text, but formatting and other semantics would be included using "tags" and "attributes" that could easily be interpreted.

Robert Burns could not foresee that his oft quoted verse would apply here:

"The best-laid schemes o' mice an' men
Gang aft agley,
An' lea'e us nought but grief an' pain,
For promis'd joy"

The problem with SGML was that there were few tools that supported this very complex standard. The tools that were available were very expensive and out of reach for all but the larger organizations that could afford the investment.

HTML and World Wide Web

The mid-90's introduced HTML and the World Wide Web, along with a brand new medium for sharing content. HTML is in fact an SGML application. But unlike other SGML applications, it was widely adopted because of a unique piece of software made popular by the likes of Marc Andreessen: the web browser. Another factor in its adoption was the very small, easy-to-learn tag set - in essence, it contained the basic structural components found in most publications:

  • Headings (Sections)
  • Paragraphs
  • Lists
  • Tables
  • Images
  • Inline formatting elements (bold, italic, monospace)

Still, there were several problems with HTML. The markup was primarily focused on presentation. Content semantics were lost: Was an ordered list item intended to be a procedural step or the ith item in a numbered list? Is the monospaced phrase a command or environment variable?
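A small contrast makes the loss concrete (a sketch; the structured side could just as easily be DITA steps as DocBook):

    <!-- HTML: a numbered list, and nothing more -->
    <ol>
      <li>Remove the cover.</li>
      <li>Disconnect the cable.</li>
    </ol>

    <!-- DocBook: the same content, explicitly marked up as a procedure -->
    <procedure>
      <step><para>Remove the cover.</para></step>
      <step><para>Disconnect the cable.</para></step>
    </procedure>

A downstream tool can find every procedure in the DocBook version; in the HTML version, it can only find lists.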

The second problem was that web browsers enabled the misuse and abuse of elements in such a way that even the most basic semantics were buried in the mishmash of tags.

XML!

In the late 1990's, the W3C developed the eXtensible Markup Language (XML). It promised a lightweight version of SGML suited to the web. It has lived up to its promise, and then some. Like HTML, XML is now widely adopted because of the availability of cheap (even free) tools. Along with these tools, major programming languages and platforms (Java, VB, C++, .NET) support XML through easy-to-use APIs.

SGML applications, like DocBook, soon ported their DTDs to XML. Editing tools like XMetaL and Arbortext Editor supported XML and made it easier for the "uninitiated" to create structured content.

The widespread adoption of XML also brought with it another problem: the number of XML document standards has grown significantly in the past 10 years. Today, standards like TEI, DocBook, DITA, and even ODF (Open Document Format - used by the Open Office tools) are widely used. There are also countless variants of some of these standards.

Now the problem isn't a lack of semantics or incompatible formats; it's the proliferation of XML document standards, each of which has taken a different approach toward producing content. All of these standards have valid practical applications (which I will not discuss here - suffice it to say that each takes into account different forms and functions toward producing content).

In an environment where collaboration and new types of partnerships are emerging, the conundrum now is: how do I share/re-use/re-purpose content from multiple partners using different XML standards? How do I mitigate risks to my budget and schedule by leveraging disparate content? In my experience, this is sort of like a Chinese finger puzzle: it's easy to adopt a standard (one that maps to your own processes), but it's a whole lot more complicated to work around it when other standards are introduced!

This is why content interoperability is so important. In my experience, this problem has often been a gating factor in collaboration. As I mentioned in a previous entry, organizations often develop their "language" and semantics around well-established processes. It's also true that many organizations will "reject" processes that run counter to their own. The end result is an "informational impasse": information that doesn't fit the organization's model is quickly dismissed.

Yet many organizations have been asked to collaborate with other organizations using different "languages." This also applies to XML standards. Some organizations we've consulted with are producing books and collections of articles, content that is best suited to DocBook. Yet in the future, they may partner with other organizations to produce learning content, which is not one of DocBook's strengths. Another organization has an OEM partnership in which both partners use different DITA specializations. And within that same company, there are groups using DocBook variants. All of this content is leveraged and reused in many different ways.

There are several strategies for dealing with disparate standards. And I'll discuss these in future posts.

Thursday, February 8, 2007

A "Sociology" of XML Languages

In a perfect world, XML content could easily be shared between different users and organizations because everyone would be sharing the same markup and semantics. Information interchange could be seamless; content could be repurposed and reused with minimal effort between different functional teams; XML processing tools could be optimized.

Yet, there are numerous reasons why we see so many different XML grammars used by different organizations. I'll focus on two of these briefly:
  1. Organizational Dynamics
  2. Multiple XML Standards

Organizational Dynamics

I never thought that my background in Sociology would ever be useful, but it certainly is applicable here (I'm very rusty, since I haven't consciously thought about this subject in over 10 years): Looking back at the works of Emile Durkheim, Max Weber, and Frederick Winslow Taylor, we see that organizations are structured around distinct divisions of labor to enable individuals to specialize their skills and work on discrete aspects of the "production process" (keep in mind that most of these theories emerged during the peak of the Industrial Revolution).

What's more interesting here is how groups are organized. In part, this is shaped by many different factors, including the industry vertical, the size of the organization, and its relationships with other organizations. There is plenty of literature on these subjects, so I won't delve into them here.

The key takeaway is that all these factors have a direct effect on the organization's processes, meaning that for information development groups (Tech Pubs, Training, etc.), this affects how information (content) is created, managed, and distributed.

Through an organization's processes, there is an interesting side effect on language. Organizations create their own vocabulary to express, and even rationalize, their processes (there are other implications, like "group identification," at work here too). For example, there is an often-quoted line from the movie Office Space: "Did you get the memo about the TPS Reports?" Even within the same industry where there are common terms (like "GUI" or "Menu" in software), there are distinct "dialects" that evolve over time, much in the same way that there are different Spanish dialects: a Spanish speaker from Spain could probably converse with a Spanish speaker from Argentina, but there might be words or phrases that aren't understood.

And this manifests itself in the XML syntax these organizations adopt to create content. A logical strategy for these organizations is to adopt a known XML standard like DocBook or DITA that fits their processes most closely, and then modify that standard to incorporate their own words and phrases into the markup.

Multiple XML Standards

One of the incredibly powerful aspects of XML is its ability to evolve over time to support different syntaxes. The unintended consequence, however, is that we now have several well-known XML document standards like DocBook and DITA. While they are different architecturally, and to some extent semantically, they're both targeted at virtually the same audience (information developers), produce the same kinds of output formats (PDF, HTML Help, HTML, JavaHelp), and, probably more important, contain the same kinds of structural components (paragraphs, lists, tables, images, formatting markup), albeit using different element names ("A rose by any other name would smell as sweet" - Romeo and Juliet).
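A paragraph is the simplest illustration of "a rose by any other name" (the ODF prefix is the conventional text: namespace prefix; treat this as a sketch):

    <para>Tighten the clamp.</para>        <!-- DocBook -->
    <p>Tighten the clamp.</p>              <!-- DITA (and XHTML) -->
    <text:p>Tighten the clamp.</text:p>    <!-- ODF -->

Same structural idea, three spellings, and every downstream tool has to know which one it is looking at.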

Yet having multiple standards can create an "informational impasse," where the DTDs get in the way of sharing content across organizations. And for many organizations, this is a real problem. Across all industry verticals, we're seeing new forms of collaboration and partnering across companies (and organizations), along with consolidation (mergers and acquisitions). And from an XML content perspective, the question is, "My content is in 'X' and my partners' content is in 'Y' and 'Z'. How do I reconcile these disparate document types?"

And therein lies the need for a Doc Standards Interoperability Framework, which I will describe in future posts.

Wednesday, February 7, 2007

DITA West 2007: Day 2 - Dimented and Topic-oriented

Amber Swope from Just Systems gave a great keynote presentation on localization issues with DITA content. Interestingly, it doesn't just apply to DITA; it has broad implications for any organization that does or will have L10N or I18N as part of its deliverable stream, regardless of whether it's using structured or unstructured content. With that said, DITA does have unique advantages in minimizing L10N/I18N costs in that it can encapsulate (read: mitigate) the volume of content that needs to be translated.

In the same vein, Amber gave another presentation on "controlling" your content using a CMS and other processes to provide efficiency, content integrity, and protection from liability (life, limb, and legal).

The highlight, however, was Paul Masalsky's (EMC) presentation. He spoke about the trials and tribulations of integrating DITA in an enterprise environment. The most memorable part of his presentation (and the conference thus far) was his DITA rap. It was AMAZING! I, along with the rest of the audience, was completely wowed! If I can get a hold of the lyrics, I will post them here. I don't remember all of them, but one verse rhymed "dimented" and "topic-oriented." It was fantastic! Who says geeks are one-dimensional?!!!

Scott Hudson and I are working on the last-minute preparations for our presentation tomorrow. It promises to be thought-provoking and to provide some neat demonstrations of how content from different standards like DocBook, DITA, ODF, and others can interoperate. I think this definitely has potential for enabling heterogeneous environments to solve the difficult problem of "how do I reconcile content that doesn't quite fit my model?"

Hope to see you there.

Monday, February 5, 2007

DITA West 2007: Day 1

It was nice to finally put faces to names I've worked with, in addition to meeting some new folks. So far it looks like the conference has about 80 or so attendees. It could be a result of the Super Bowl that attendance was a little low; hopefully more will show up tomorrow. Unfortunately, Michael Priestley and Don Day aren't here. I was looking forward to talking with them more about the Interoperability Framework.

Lou Iuppa from XyEnterprises gave the opening keynote. Essentially, the thesis was that DITA would benefit from the use of a CMS.

Looks like the majority of the attendees are technical writers, either new to DITA or in the process of implementing a DITA solution in their organization. I'm definitely hoping to see some more technical presos, though Yas Etessam's presentation on "Enabling Specializations in XMetaL Author" was very interesting.

A fair number of vendor booths lined the hallway outside the conference rooms, including one from Flatirons Solutions. Just Systems, PTC, and MarkLogic were among the others.