Wednesday, April 30, 2008

Interoperability Framework Mentioned in DC Lab Article

Terry Mulvihill wrote an article for Data Conversion Labs called "DocBook versus DITA: Will the Real Standard Please Stand Up?" In the article, she mentions the Interoperability Framework and includes a quote from Eric Severson.

Saturday, April 26, 2008

Metadata Interoperability

We recently started working on our Interoperability Framework again (yeah!). In the course of looking through the design, we realized that we were missing a key facet: metadata. So we started digging through the DITA and DocBook standards to determine how we could map their metadata content models to a common metadata markup. That raised the question of which metadata model we should use in our interop framework.

My belief is that we should use and leverage existing standards. The core interop framework is designed around this principle, so the metadata should be too. Based on that, the logical choice is to use Dublin Core. The task now is to map the metadata content models that are used by DITA and DocBook to the Dublin Core standard.

This is when I realized that each standard has in large part reinvented the wheel around metadata. Both standards define metadata semantics that are also defined in Dublin Core. Both also include unique metadata markup, presumably designed around the unique needs of each standard, which is probably why neither has adopted Dublin Core outright.
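For example, here's a simplified sketch of the kind of mapping involved. The DITA prolog and DocBook metadata elements come from those standards; the wrapper element around the Dublin Core terms and the mapping itself are just my own illustration:

<!-- DITA topic prolog -->
<prolog>
<author>Jane Doe</author>
<publisher>Acme Publishing</publisher>
</prolog>

<!-- DocBook book metadata -->
<bookinfo>
<author><firstname>Jane</firstname><surname>Doe</surname></author>
<publisher><publishername>Acme Publishing</publishername></publisher>
</bookinfo>

<!-- The same facts expressed once, in Dublin Core -->
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:creator>Jane Doe</dc:creator>
<dc:publisher>Acme Publishing</dc:publisher>
</metadata>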

DCMI has been in existence since 1995 and is actively developing and evolving the standard. The base element set covers core metadata like creator (author) and publisher, and for anything beyond that, Dublin Core provides an extremely extensible model for assigning whatever metadata you need. In fact, it's relatively trivial to integrate other metadata models, such as rights (DRM) metadata.

So going back to my argument about leaky abstractions, both standards have a problem here. Out of the box, both DocBook and DITA assume taxonomies that are relevant and applicable to their models. Other metadata can be incorporated through customization or specialization. This is all well and good, except that interoperability is greatly diminished when additional "non-standard" metadata markup is included within the content model.

Perhaps it's time that both standards consider integrating Dublin Core directly as the default metadata model. Right now, both standards can integrate DCMI alongside their existing metadata, but there is a certain level of redundancy. The benefit of a standard metadata model becomes increasingly valuable as more and more content is managed in a CMS or XML database, and as more content is designed for reuse.

Friday, April 25, 2008

Context-driven Transclusion

I recently had to implement a really interesting piece of functionality for a client whose core content could include supplementary content that was edited and maintained separately. Since the supplemental content could change on a regular basis, we wanted to ensure it was always up to date within the core content. The core DITA topic templates could be reused in different map templates that formed the basis of a final-form publication, and the client wanted slightly different supplemental content to be displayed depending on which map template the topic was contained in.

Conref wouldn't work since we couldn't point to a static resource. Applying a profile wouldn't work either, since the inclusion wasn't based on a static profile type, and it had another negative side effect: by assigning the value within a topic, the context of any dynamic transclusion is limited to the currently known universe of map templates. Since this client would continue to add, remove, and change map templates, the underlying topic template would have to be touched every time a map template changed.

The approach I devised was to assign each supplement block one required and one optional attribute:
  • name: identifies the supplement with an identifier that can be referenced from within the topic
  • map-type: a "conditional" attribute that identifies the type of map this particular supplement will appear in. If a supplement doesn't include this attribute, it is considered "global" and appears in all map types.

The supplement itself was a domain specialization that allowed me to create a specialized topic that contained nothing but supplements, and to embed the domain into my main content topic specialization. Here's a quick sample of the supplement content:

<supplements>
...
<!-- global supplement: appears in all map types -->
<supplement name="introduction">
...
</supplement>

<!-- conditional supplement: appears in specified map types -->
<supplement name="getting-started" map-type="type1">
Do steps 1, 2, 3 and 4
</supplement>
<supplement name="getting-started" map-type="type2">
Do steps 1, 3, 5 and 7
</supplement>
</supplements>

So in each of my content topics, I created an anchor using the same element name, but this time with a third attribute I created on the supplement, called sup-ref. The sup-ref attribute acted like an IDREF by referencing a supplement element with the same name. Let's assume I have a topic with a file name of "topic1.dita":

<mytopic id="topic1">
<title>Title</title>
<mytopicbody>
<supplement sup-ref="introduction"/>
<supplement sup-ref="getting-started"/>
</mytopicbody>
</mytopic>
So in this case, I have a reference to the introduction supplement, which is global (unconditional), and a second reference to the getting-started supplement, which is conditional and will only be included in my content topic if the topic is referenced in the context of a map whose map-type value matches.

Now let's assume that I have two different map types that are defined by an attribute called map-type (I could also have created two separate map specializations with different names; depending on what you need, your mileage may vary). This attribute stores the defined map's type name.

<mymap map-type="type1">
<topicref href="topic1.dita"/>
</mymap>

This attribute is primarily used as metadata for identifying and organizing maps within a content store (CMS, XML Database, etc.), but we can also use it for driving our transclusion.

In our XSLT, we simply create a variable that stores the map's map-type value:

<xsl:variable name="map.type" select="@map-type"/>
When we process our content topic and encounter our supplement reference, we perform a two-stage selection:
  • First, collect all supplements with a name attribute that matches the current sup-ref attribute. We do this because we don't yet know whether the supplement source is global or context-specific.
  • Then refine the selection. If the collection contains more than one supplement element, keep the one whose map-type attribute matches our map.type variable. If there is only one, test whether it is intended for a specific map context; if not, include it. If it is bound to a map context that doesn't match, emit an error indicating that there is no matching supplement.
If we have a match, we replace the anchor supplement element in our content topic with the supplement element from our external source, as sketched below.
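Here's a minimal sketch of what those two stages might look like in XSLT 2.0. The supplements file name (supplements.dita) and the map.type parameter are assumptions for illustration; in the real build, the map's map-type value is made available to topic processing, and the production stylesheet has more to it, but the selection logic has the same shape:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<!-- Assumed to be supplied from map processing (the map.type variable above) -->
<xsl:param name="map.type" select="''"/>

<!-- Copy everything else through unchanged -->
<xsl:template match="@*|node()">
<xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
</xsl:template>

<!-- Resolve a supplement anchor against the external supplement source -->
<xsl:template match="supplement[@sup-ref]">
<!-- Stage 1: collect every supplement whose name matches this reference -->
<xsl:variable name="candidates"
select="document('supplements.dita')//supplement[@name = current()/@sup-ref]"/>
<!-- Stage 2: refine to the supplement for this map context (or the single global one) -->
<xsl:variable name="match" select="
if (count($candidates) gt 1)
then $candidates[@map-type = $map.type]
else $candidates[not(@map-type) or @map-type = $map.type]"/>
<xsl:choose>
<xsl:when test="exists($match)">
<xsl:copy-of select="$match"/>
</xsl:when>
<xsl:otherwise>
<xsl:message>No supplement named '<xsl:value-of select="@sup-ref"/>' matches map type '<xsl:value-of select="$map.type"/>'</xsl:message>
</xsl:otherwise>
</xsl:choose>
</xsl:template>

</xsl:stylesheet>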

The cool part of all of this is that I can keep the supplemental material separate so that it can be edited and updated whenever it needs to be, and I can supply different supplemental content to the content topic based on its context as a member of a map and that map's type.

While this is a specific scenario in DITA (the names and functions of the elements have been changed for client confidentiality), the same approach can also be applied to other scenarios that require similar functionality for virtually any XML grammar!

Tuesday, April 15, 2008

Review and Annotation Markup Language?

This subject seems to pop up frequently in my client engagements. Standards like DocBook and DITA both have review and annotation markup. Yet many structured authoring tools use their own markup for reviewing and annotating XML draft content. Some use processing instructions; others use special namespaced elements. None, to my knowledge, recognize the standards' own elements as annotations.

For many of our clients, a clearly defined review process is critical to the overall lifecycle of the content. If vendors can't support each standard's specific annotation markup, it makes me think that a common markup language for review and annotation would be extremely useful.

Monday, April 14, 2008

Cool Stuff - Read Dick Hamilton's Article on The Content Wrangler

Dick Hamilton has written a very insightful and balanced article on The Content Wrangler (Scott Abel's site) about some considerations for choosing between DITA and DocBook. Check it out:

http://www.thecontentwrangler.com/article/choosing_an_xml_schema_docbook_or_dita/

Sunday, April 13, 2008

Do We Need Structured Document Formats?

Eric Armstrong has posed a very interesting question about structured document markup languages. And there is a great deal of merit to his question. I want to take a look at some of his points and provide my own thoughts.

Is Markup Too Complicated?

Eric writes:


Those observations explain why structured document formats are so difficult to use: They force you to memorize the tagging structure. They require training, as a result, because it's virtually impossible for the average user to be productive without it.

The editing situation is much better with DITA (120 topic tags, plus 80 for maps) than it is with DocBook (800 tags), or even Solbook (400 tags), but it is still way more difficult than simple HTML--80 tags, many of which can nest, but few of which have to.

But even with a relatively simple format like HTML, we have manual-editing horror stories. In one instance, a title heading was created with 18 non-breaking spaces and a 21-point font. (Try putting that through your automated processor.)

If I had a nickel for every time I've heard someone tell me, "I don't care about what tag I use, I just want to write my document," I could retire right now and live off the interest. There's no doubt that transitioning from traditional unstructured desktop authoring tools to structured authoring tools often causes turmoil and cognitive dissonance. Which brings up an interesting question in my mind: are all semantic markup languages inherently problematic?

And this is where I think Eric and I have a slight difference of opinion. Eric suggests that wikis offer an alternative to the "tag jambalaya" (my term) of markup languages. Wikis are incredibly good at enabling users to quickly create content without being encumbered by a whole lot of structure or learning curve. For content like Wikipedia, which enables users of varying skills to contribute their knowledge, this makes sense.

However, if I'm writing a manual (collaboratively or not - we'll touch on this later), a reasonable amount of structure is desirable. I agree that a typical user will likely never use a majority of the tags that are built in to DITA, DocBook, or even HTML - this is the price of being an open standard: content models tend to become "bloated" with markup deemed necessary by a wide range of interests. In the past, I wrote manuals for a Unix operating system using DocBook. Of the 400 or so elements in the grammar, I only used 70 or 80 of these elements. The rest didn't apply to the subject matter. I also can't recall the last time I used the samp tag in HTML. It's there, but I don't have to use it.

Even for many of our clients, we end up creating new DITA DTD shells specifically to strip out unnecessary domains and simplify the content model. I will say that it's often easier to remove what you don't need than it is to integrate something that isn't there. The new DocBook 5 schemas (developed with RelaxNG) make it very easy both to remove unwanted elements and to add new ones. The DocBook Publisher's Subcommittee schema (currently under development) removes many existing DocBook elements that aren't needed while adding a few elements that are relevant for publishers.

This also leads me to another question: which wiki markup? There are literally dozens of wiki markup languages out there, each a little different than the others. Where is the interoperability?

Standard structured markup languages like DocBook and DITA (and even XHTML) are essentially like contracts that state that if you follow the rules within the schema, the document can be rendered into any supported format, and the markup can be shared with others using the same schema. I can even leverage the content into other markup formats.

But where structured, semantic markup shines is in the case where business rules dictate that each DITA task topic must contain a context element (it doesn't now, but you could enforce such a rule in the DTD), or that all tables must contain a title. Unstructured markup like wikis will have a hard time enforcing that, as will HTML. But structured markup with a DTD or schema makes this very easy.
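As a minimal sketch of what such rules could look like in a DTD (deliberately simplified: a real DITA DTD shell expresses its content models through parameter entities and includes more elements than shown here):

<!-- Hypothetical overrides in a local DTD shell -->
<!-- Every task body must now include a context element -->
<!ELEMENT taskbody ((prereq)?, (context), (steps), (result)?, (postreq)?)>
<!-- Every table must carry a title -->
<!ELEMENT table ((title), (desc)?, (tgroup)+)>

A validating parser will then flag any task without a context, or any table without a title, at authoring time rather than in review.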

A not so ancillary point to structured semantic markup is the ability to identify content by its intended meaning - an admonition tagged as a caution or warning is much easier to find (and reuse) than a text block (or generic HTML div or table) that starts with the word "Caution" or "Warning," even though they might be rendered the same way. And if the admonition contains more than one paragraph of text, having markup that indicates the start and end of that structure is very useful. Not to mention that from a localization perspective, tagged semantic markup is the way to go.

Eric rightfully points out that tools like Open Office allow users to create content without knowing that the native format is a markup language. The same is true for many WYSIWYG HTML editors these days (and there are pretty cool web-based gadgets out there too!). Most users never have to see what the underlying HTML looks like. This is where we need to focus our attention. It isn't that markup languages themselves are difficult. Rather, it's that the tools we use to create the underlying markup are perhaps too difficult for authors to use.

And the excuse we use is that going from unstructured to structured authoring means that authors have to sacrifice some flexibility. There's no question that this response is wearing thin, and that most authors (both professional and casual) believe there has to be a better way.

Conditional Metadata

Eric's point about conditional metadata filtering has seen some serious discussion recently on the Yahoo DITA Users Forum, and arguably there is merit in some of the ideas presented there. His point here deserves mention:


But the fact that such a thing can be done does not mean that it is necessarily desirable to do so. Experience suggests that reuse gets tricky when your environment changes--because your metadata reflects your environment. If your environment doesn't change, then your metadata is fixed. You define it, assign it and that's the end of it. The metadata tagging adds some complexity to your information, but you can live with it, and it buys you a lot.

Metadata is only meaningful when it has context. Context in this case means that there is a relationship between the content and some known "variable" - a particular audience group, an operating platform, or some other target that scopes the content's applicability. Where I see churn is in the area of "filtering" content, i.e., suppressing or rendering content based on metadata values. To me, this is an implementation problem rather than a design problem.

In the classic case of conditionality, overloading any markup with multiple filtering aspects purely for rendering or suppressing content can lead to serious problems, and requires special treatment and another discussion. However, if we look at metadata as a means of creating a relationship between the tagged content and specific target(s) - the potential for more targeted search and focused, dynamic content assembly expands greatly.

Transclusion and Reuse


So maybe a really minimal transclusion-capability is all we really need for reuse. Maybe we need to transclude boilerplate sections, and that's about all.

There's no question that transclusion can be abused to the point that a document is cobbled together like Frankenstein's Monster. However, there are cases when transcluding content does make sense, and not just for boilerplate content. We're only beginning to really see the possibilities of providing users with the right amount of information, when they want it, targeted to that user's level of detail based on metadata (see the Flatirons Solutions whitepaper, Dynamic Content Delivery Using DITA). Essentially, content can be assembled from a wide range of content objects (topics, sections, chapters, boilerplate, etc.). I would be reluctant to suggest that "boilerplate" or standardized content is the only form of reuse we need.

Still, Eric's question is valid: what is optimal reuse? The answer is that it depends. For some applications, standard boilerplate is right; for others, the ability to transclude "approved" admonitions is necessary. And for some, transclusion of whole topics, sections, or chapters is appropriate. The point is that the information design, based on a thorough analysis of the business, its goals, and the content itself, will dictate the right amount of reuse.

From a collaborative and distributive authoring perspective, enabling writers to focus on their own content and assemble everything together in a cohesive manner definitely makes a great deal of sense. Wikis work well if you're dealing with collaboration on the same content, but don't really solve the problem of contributing content to a larger deliverable.

Formatting and Containment

Eric's argument is that HTML pretty much got it right because it limited required nesting and containment to lists and tables. Now if I were working with ATA or S1000D all the time, I would agree wholeheartedly. Even DocBook has some odd containment structures (mediaobject comes to mind, but there are benefits for this container that I also understand). From the point of pure simplicity and pure formatting intent, he's right. But the wheels get a little wobbly if we always assume that we're working with a serial content stream solely for format.

One area where containment makes a great deal of sense is in the area of Localization. By encapsulating new and/or changed content into logical units of information, you can realize real time savings and reduced translation costs.

Containment also makes transclusion more useful and less cumbersome. Assuming that we aren't creating Frankenstein's Monster, the ability to point to only the block of content I want, without cutting and pasting, is a distinct advantage.

Conclusion

At the heart of Eric's article, I believe, is the KISS principle. Inevitably, from a content perspective, when you boil structured document formats down to their essence, you get headings, paragraphs, lists, tables, images, and inline markup (look at the Interoperability Framework white paper that Scott Hudson and I wrote to illustrate this). So why use structured markup at all when my desktop word processor can do that right now? In my view, there are numerous reasons, some of which I've discussed here, and others, like the potential for interoperability, that make structured document markup languages extremely flexible and highly leverageable.

There is no doubt that today's structured markup tools don't always make it easy for users to create content without the markup peeking through the cracks. That doesn't mean that structured markup is the problem. For example, one of my web browsers won't display Scalable Vector Graphics (SVG) at all. It doesn't mean that the SVG standard is a problem, it means that I need to use a web browser that supports the standard.

Eric's article is thought-provoking and well done. It raises the level of discussion that we need to have around why we use structured content (and not because it's the coolest fad), and how we create that content. Let's keep this discussion going.

Saturday, April 12, 2008

DITA's Leaky Abstractions

If you haven't read Joel Spolsky's Law of Leaky Abstractions before, here's the basic premise: constructs that are designed to "simplify" our lives can sometimes fail and result in even bigger problems than the abstraction intended to solve.

In DITA, there are two potential leaky abstractions:
  • Specialization

  • References

Before you think that I'm disparaging DITA, read on.


DITA is perhaps one of the most transformative ideas to come out of XML. It has enabled users to create content for a wide range of purposes and a wide range of industries - from traditional Tech Pubs to Finance, Industrial, and Aerospace. And this is just scratching the surface. The door is only beginning to open on the possibilities for adopting DITA, and the list of vendors who've jumped on the DITA bandwagon continues to grow.


There are so many reasons for adopting DITA as an XML platform: the architecture is designed with reuse in mind. Instead of thinking of content as large monolithic documents, DITA changes the paradigm by treating content as smaller, single units of information that can be assembled into many different documents in many different ways. And with conref, you can reuse even smaller pieces of content, like product names or common terminology.
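For example, a product name kept in one shared topic can be conref'ed wherever it's used (the file name, topic id, and element id below are hypothetical):

<!-- In a shared topic, library.dita, whose topic id is "library" -->
<p>Our flagship product is <ph id="product-name">SuperWidget Pro</ph>.</p>

<!-- In any other topic -->
<p>Install <ph conref="library.dita#library/product-name"/> before you begin.</p>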


If reuse isn't a big selling point for you, there's the ability to create your own content types and semantics (specialization) that fit your processes. No need for a one-size-fits-all content model. With specialization, you can derive new topic types or new semantic elements from existing DITA elements, provided that the underlying content model for these topics or semantic elements (inline elements, AKA "domain specializations" in DITA parlance) complies with the underlying content model pattern of the "parent". This is really cool. You can create wholly new content markup that you understand, or you can refine existing content models to be tighter based on what you need.


Where's the Leak?


If you've read this far, you're probably confused. I've said that DITA has leaky abstractions, particularly with specialization and references, and I've also said that DITA's really cool because you can assemble documents from many different topics, you can conref content from other resources, and you can create specializations. So let me go back to Spolsky's Law of Leaky Abstractions. In his blog, Spolsky says:


"Abstractions fail. Sometimes a little, sometimes a lot. There's leakage. Things go wrong. It happens all over the place when you have abstractions"

The point here is that abstractions like specialization and conref aren't always problematic - in general they work well - but they can break, and when they do, they cause all kinds of problems. So now I'll explain where the leaks are in these constructs.

Leaky Abstraction #1: Specializations


Specialization allows you to create your own markup semantics that are meaningful to you. For example, you can create a specialized topic type for a memo that contains the following constructs:



  • To (who should read this memo)

  • From (who sent the memo)

  • Subject (what's the memo about)

  • Body (the contents of the memo)

And let's say that a memo's body can contain only paragraphs and lists.


No problem. Using the DITA Language Specification, I see that DITA's standard topic element has pretty much everything I need (and more), so I just need to weed out the elements I don't want and add a few that I need that aren't yet defined. I open Eliot Kimber's fantastic specialization tutorial to guide me through the details, and within an hour I have my new memo topic DTD. Specialization works.


Now let's look at where specialization is leaky. I need to create a parts list for a plane assembly that contains an optional title and some special metadata elements identifying the plane tail numbers this list is effective for. The list can also nest for sub-parts, using the same metadata elements to further refine the effectivity to a subset of the tail numbers declared in the parent list. Oh, and it can appear in a wide variety of content blocks. Oops. <ul> only allows <li> elements. <dl>? Well... maybe. I might be able to specialize <dlhead>. But it's a stretch, and there's a lot of overhead to achieve what I want. We have a leak. A small one, but a leak nonetheless.
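For what it's worth, here's a hypothetical sketch of the markup I was after - the shape that doesn't map cleanly onto <ul> or <dl> (element and attribute names are invented for illustration, not an existing DITA specialization):

<parts-list>
<title>Aft fuselage assembly</title>
<!-- effectivity metadata: the tail numbers this list applies to -->
<tail-numbers>N101 N102 N103</tail-numbers>
<part partno="A-1001">Frame, station 72</part>
<parts-list>
<!-- nested sub-parts, effective for a subset of the parent's tail numbers -->
<tail-numbers>N102</tail-numbers>
<part partno="A-1001-3">Frame doubler</part>
</parts-list>
</parts-list>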


Leaky Abstraction #2: References



Conref is a transclusion mechanism that can reference content from another source and include it in another context, provided that the conref'ed content is allowed within the current context. Cool. I can create standard warning notices and simply conref them into the right location:


warning-notices.dita



<topic id="warnings">
<body>
<note id="empty.fuel.tank.warning" type="warning">
<p>
Make sure that the aft fuel tank is completely empty before
starting this procedure.
</p>
</note>
<note id="warning2" type="warning">
...
</note>
</body>
</topic>

proctopic1.dita


<topic id="my.topic">
<body>
<p>...</p>
<note conref="warning-notices.dita#warnings/empty.fuel.tank.warning"/>
</body>
</topic>

That's OK. Straightforward and what conref was intended to do. Here's the rub: it works like a charm if you're managing the links on a local file system.

Things start getting really hairy if, for example, you have a shared resource, like the common warnings above, on, say, a Windows file server that I've mapped to my Z: drive. Now my conref must point to the physical location of that file. Here's the first potential leak: if Joe Writer maps the file server to his Y: drive and Jane Author maps the same server to her W: drive, and we all start sharing topics that each of us has written, we all could have broken conrefs. Guess what: the same holds true for topicrefs and potentially any other topic-based link. The referencing logic is heavily dependent on the physical location of the file.
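To illustrate with the warning example above (the paths are hypothetical), the same logical reference ends up authored three different ways, and each breaks as soon as the topic is opened on a machine with a different mapping:

<!-- Authored on my machine, where the share is mapped to Z: -->
<note conref="file:///Z:/shared/warning-notices.dita#warnings/empty.fuel.tank.warning"/>

<!-- The same reference authored by Joe (Y:) and by Jane (W:) -->
<note conref="file:///Y:/shared/warning-notices.dita#warnings/empty.fuel.tank.warning"/>
<note conref="file:///W:/shared/warning-notices.dita#warnings/empty.fuel.tank.warning"/>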

Introduce a CMS, many of which manage topics as individual objects with references handled by some form of relationship mechanism (e.g., a relationship table in a database with object IDs rather than physical file addresses), and the leaky abstraction can become a gaping hole.


Plugging the Leaks



While these examples fit the definition of leaky abstraction, much of what DITA offers is solid, so there's no need to abandon DITA at all. In fact, DITA works like it should most of the time. But like any abstraction, there are potential gotchas. Considering how new DITA is, the level of sophistication and stability is pretty darn good. And these aren't excruciatingly difficult problems to solve. But it will require careful thought, along with smart dialog between vendors and implementors who believe DITA has the capability to transform the paradigms of how content is created.

Wednesday, April 2, 2008

Doxsl 1.0.1 RC-1 Released!

I started a new SourceForge Project called Doxsl. Doxsl is an XSLT 2.0 processing engine that generates documentation from your XSLT stylesheets, much like javadoc, .NET help, or Perl POD.

However, one of the more interesting features is the ability to generate documentation into DITA, HTML, and, in the near future, DocBook. This means you can create documentation for your stylesheets, integrate it into other documentation, or, more simply, leverage existing standards to output into any format those standards support (for example, PDF and CHM).

You can find further details and download information at: http://doxsl.sourceforge.net