Sunday, April 13, 2008

Do We Need Structured Document Formats?

Eric Armstrong has posed a very interesting question about structured document markup languages. And there is a great deal of merit to his question. I want to take a look at some of his points and provide my own thoughts.

Is Markup Too Complicated?

Eric writes:

Those observations explain why structured document formats are so difficult to use: They force you to memorize the tagging structure. They require training, as a result, because it's virtually impossible for the average user to be productive without it.

The editing situation is much better with DITA (120 topic tags, plus 80 for maps) than it is with DocBook (800 tags), or even Solbook (400 tags), but it is still way more difficult than simple HTML--80 tags, many of which can nest, but few of which have to.

But even with a relatively simple format like HTML, we have manual-editing horror stories. In one instance, a title heading was created with 18 non-breaking spaces and a 21-point font. (Try putting that through your automated processor.)

If I had a nickel for every time I've heard someone tell me, "I don't care about what tag I use, I just want to write my document", I could retire right now and live off the interest. There's no doubt that transitioning from traditional unstructured desktop authoring tools to structured authoring tools often causes turmoil and cognitive dissonance. Which brings up an interesting question in my mind: Are all semantic markup languages inherently problematic?

And this is where I think Eric and I have a slight difference of opinion. Eric suggests that Wikis offer an alternative to the "tag jambalaya" (my term) of markup languages. Wikis are incredibly good at enabling users to quickly create content without being encumbered by a whole lot of structure or a steep learning curve. For content like Wikipedia, where users of various skill levels contribute their knowledge to a shared resource, this makes sense.

However, if I'm writing a manual (collaboratively or not - we'll touch on this later), a reasonable amount of structure is desirable. I agree that a typical user will likely never use a majority of the tags that are built in to DITA, DocBook, or even HTML - this is the price of being an open standard: content models tend to become "bloated" with markup deemed necessary by a wide range of interests. In the past, I wrote manuals for a Unix operating system using DocBook. Of the 400 or so elements in the grammar, I only used 70 or 80 of these elements. The rest didn't apply to the subject matter. I also can't recall the last time I used the samp tag in HTML. It's there, but I don't have to use it.

Even for many of our clients, we end up creating new DITA DTD shells specifically to strip out unnecessary domains and simplify the content model. I will say that it's often easier to remove what you don't need than it is to integrate something that isn't there. The new DocBook 5 schemas (developed with RELAX NG) make it very easy to both remove unwanted elements and add new ones. The DocBook Publisher's Subcommittee schema (currently under development) removes many existing DocBook elements that aren't needed while creating a few additional elements that are relevant for publishers.

This also leads me to another question: which wiki markup? There are literally dozens of wiki markup languages out there, each a little different than the others. Where is the interoperability?

Standard structured markup languages like DocBook and DITA (and even XHTML) are essentially like contracts that state that if you follow the rules within the schema, the document can be rendered into any supported format, and the markup can be shared with others using the same schema. I can even leverage the content into other markup formats.

But where structured, semantic markup really shines is in enforcing business rules: for example, that each DITA task topic must contain a context element (it isn't required now, but you could enforce such a rule in the DTD), or that all tables must contain a title. Unstructured markup like wikis will have a hard time enforcing that, as will HTML. But structured markup with a DTD or schema makes this very easy.
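To make the idea of an enforceable business rule concrete, here is a minimal sketch in Python. It is not DTD or schema validation (the standard library does not do that); it simply hand-checks the two rules mentioned above against a small sample. The sample document, the check_rules function, and the exact element layout are my own assumptions for illustration, loosely modeled on DITA.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element names loosely follow DITA,
# but this is not a complete or valid DITA topic.
SAMPLE = """
<task id="install">
  <title>Install the widget</title>
  <taskbody>
    <steps><step><cmd>Run the installer.</cmd></step></steps>
    <table><tgroup cols="1"><tbody><row><entry>x</entry></row></tbody></tgroup></table>
  </taskbody>
</task>
"""

def check_rules(xml_text):
    """Return a list of human-readable business-rule violations."""
    root = ET.fromstring(xml_text)
    problems = []
    # Rule 1: every task must contain a context element somewhere inside it.
    for task in root.iter("task"):
        if task.find(".//context") is None:
            problems.append(f"task '{task.get('id')}' has no context")
    # Rule 2: every table must have a title as a direct child.
    for table in root.iter("table"):
        if table.find("title") is None:
            problems.append("table has no title")
    return problems

for problem in check_rules(SAMPLE):
    print(problem)
```

In a real project the same rules would live in the DTD or schema itself, so that any validating editor or parser reports them without custom code. The sketch only shows that the rules are checkable at all because the structure is explicit.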

A not-so-ancillary point about structured semantic markup is the ability to identify content by its intended meaning - an admonition tagged as a caution or warning is much easier to find (and reuse) than a text block (or generic HTML div or table) that starts with the word "Caution" or "Warning", even though the two might be rendered the same way. And if the admonition contains more than one paragraph of text, having that containment within markup to indicate the start and end of a particular structure is very useful. This is not to mention that, from a Localization perspective, tagged semantic markup is the way to go.
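As a rough illustration of the difference, the following Python sketch compares finding admonitions by tag with guessing at them from the text. The sample markup and both function names are hypothetical; the element names caution and warning are borrowed from the discussion above rather than from any particular schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mixing semantic admonitions with look-alike text.
DOC = """
<section>
  <p>Caution is advised when reading this sentence.</p>
  <caution><p>Disconnect power first.</p><p>Wait 30 seconds.</p></caution>
  <div><p>Warning: hot surface.</p></div>
  <warning><p>Hot surface.</p></warning>
</section>
"""

root = ET.fromstring(DOC)

def semantic_admonitions(root):
    """Find admonitions by what they are tagged as."""
    return [el for el in root.iter() if el.tag in ("caution", "warning")]

def text_guess(root):
    """Guess at admonitions from paragraphs that start with a keyword."""
    keywords = ("Caution", "Warning")
    return [p for p in root.iter("p") if (p.text or "").startswith(keywords)]

for el in semantic_admonitions(root):
    # The container tells us exactly which paragraphs belong to the admonition.
    print(el.tag, [p.text for p in el.findall("p")])

for p in text_guess(root):
    print("guess:", p.text)
```

The text-based guess picks up an ordinary sentence that happens to begin with "Caution" and has no way of knowing that "Wait 30 seconds." belongs to the first caution. The tagged version gets both right because the start and end of each structure are explicit.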

Eric rightfully points out that tools like Open Office allow users to create content without knowing that the native format is a markup language. The same is true for many WYSIWYG HTML editors these days (and there are pretty cool web-based gadgets out there too!). Most users never have to see what the underlying HTML looks like. This is where we need to focus our attention. It isn't that markup languages themselves are difficult. Rather, it's that the tools that we use to create the underlying markup are perhaps too difficult for authors to use.

And the excuse we use is that going from unstructured to structured authoring means that authors have to sacrifice some of the flexibility. There's no question that this response is wearing thin, and that most authors (both professional and casual) believe that there has to be a better way.

Conditional Metadata

Eric's point about conditional metadata filtering has had some serious discussion recently on the Yahoo DITA Users Forum. And arguably, there is merit in some of the ideas presented there. Eric's point here deserves mention:

But the fact that such a thing can be done does not mean that it is necessarily desirable to do so. Experience suggests that reuse gets tricky when your environment changes--because your metadata reflects your environment. If your environment doesn't change, then your metadata is fixed. You define it, assign it and that's the end of it. The metadata tagging adds some complexity to your information, but you can live with it, and it buys you a lot.

Metadata is only meaningful when it has context. Context in this case means that there is a relationship between the content and some known "variable" - a particular audience group, an operating platform, or other target that scopes the content's applicability. Where I see churn is in the area of "filtering" content, i.e., suppressing or rendering content based on metadata values. To me, this is an implementation problem rather than a design problem.

In the classic case of conditionality, overloading any markup with multiple filtering aspects purely for rendering or suppressing content can lead to serious problems, and requires special treatment and another discussion. However, if we look at metadata as a means of creating a relationship between the tagged content and specific target(s) - the potential for more targeted search and focused, dynamic content assembly expands greatly.
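One way to picture metadata as a relationship rather than a rendering switch is the small Python sketch below. The attribute names audience and platform, the space-separated value convention, and the applies and select functions are all assumptions made for illustration; this is not how any particular DITA processor implements filtering.

```python
import xml.etree.ElementTree as ET

# Hypothetical topic: attributes relate each paragraph to a target.
DOC = """
<topic>
  <p>Common introduction.</p>
  <p platform="linux">Run the shell script.</p>
  <p platform="windows">Run the batch file.</p>
  <p audience="admin" platform="linux windows">Restart the service.</p>
</topic>
"""

def applies(element, target):
    """True if none of the element's metadata excludes the target.

    Untagged content applies to everyone; a tagged attribute must
    list the target's value among its space-separated values.
    """
    for name, wanted in target.items():
        values = element.get(name)
        if values is not None and wanted not in values.split():
            return False
    return True

def select(xml_text, target):
    """Return the text of every paragraph related to the target."""
    root = ET.fromstring(xml_text)
    return [p.text for p in root.iter("p") if applies(p, target)]

print(select(DOC, {"platform": "linux", "audience": "user"}))
print(select(DOC, {"platform": "windows", "audience": "admin"}))
```

Nothing is deleted from the source here. The same tagged content answers different questions depending on the target, which is the property that makes targeted search and dynamic assembly possible.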

Transclusion and Reuse

So maybe a really minimal transclusion-capability is all we really need for reuse. Maybe we need to transclude boilerplate sections, and that's about all.

There's no question that transclusion can be abused to the point that a document is cobbled together like Frankenstein's Monster. However, there are cases when transcluding content does make sense, and not just for boilerplate content. We're only beginning to really see the possibilities of providing users with the right amount of information, when they want it, and targeted for that user's level of detail based on metadata (see the Flatirons Solutions whitepaper, Dynamic Content Delivery Using DITA). Essentially, content can be assembled from a wide range of content objects (topics, sections, chapters, boilerplate, etc.). I would be reluctant to suggest that "boilerplate" or standardized content is the only form of reuse we need.

Still, Eric's question is valid - what is optimal reuse? The answer is that it depends. For some applications, standard boilerplate is right; for others, the ability to transclude "approved" admonitions is necessary. And for some, transclusion of whole topics, sections, or chapters is appropriate. The point is that the information design, based on a thorough analysis of the business and its goals along with an evaluation of the content, will dictate the right amount of reuse.

From a collaborative and distributive authoring perspective, enabling writers to focus on their own content and assemble everything together in a cohesive manner definitely makes a great deal of sense. Wikis work well if you're dealing with collaboration on the same content, but don't really solve the problem of contributing content to a larger deliverable.

Formatting and Containment

Eric's argument is that HTML pretty much got it right because it limited required nesting and containment to lists and tables. Now if I were working with ATA or S1000D all the time, I would agree wholeheartedly. Even DocBook has some odd containment structures (mediaobject comes to mind, but there are benefits for this container that I also understand). From the point of pure simplicity and pure formatting intent, he's right. But the wheels get a little wobbly if we always assume that we're working with a serial content stream solely for format.

One area where containment makes a great deal of sense is in the area of Localization. By encapsulating new and/or changed content into logical units of information, you can realize real time savings and reduced translation costs.

Containment also makes transclusion more useful and less cumbersome. Assuming that we aren't creating Frankenstein's Monster, the ability to point to only the block of content I want, without cutting and pasting, is a distinct advantage.
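Here is a minimal sketch, in Python, of what pointing at a contained block might look like. The include element, its ref attribute, the sample library, and the transclude function are all hypothetical; real mechanisms such as DITA's conref or XInclude have their own syntax and rules.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical library of approved, reusable blocks, each with an id.
LIBRARY = """
<library>
  <caution id="power"><p>Disconnect power first.</p></caution>
  <note id="backup"><p>Back up your data.</p></note>
</library>
"""

# Hypothetical manual topic that points at a block instead of pasting it.
MANUAL = """
<topic>
  <title>Replace the fan</title>
  <include ref="power"/>
  <p>Remove the cover.</p>
</topic>
"""

def transclude(doc_text, library_text):
    """Replace each include element with a copy of the block it references."""
    library = {el.get("id"): el for el in ET.fromstring(library_text) if el.get("id")}
    root = ET.fromstring(doc_text)
    # Snapshot the tree first so that replacing children is safe.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == "include":
                # An unknown ref raises KeyError, which is reasonable for a sketch.
                replacement = copy.deepcopy(library[child.get("ref")])
                replacement.tail = child.tail
                parent[index] = replacement
    return root

result = transclude(MANUAL, LIBRARY)
print(ET.tostring(result, encoding="unicode"))
```

Because the caution is a container with a clear start and end, the reference pulls in the whole admonition and nothing else. With a flat stream of paragraphs there would be no reliable boundary to point at.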


At the heart of Eric's article, I believe, is the KISS principle. Inevitably, from a content perspective, when you boil structured document formats down to their essence, you get headings, paragraphs, lists, tables, images, and inline markup (look at the Interoperability Framework white paper that Scott Hudson and I wrote to illustrate this). So why use structured markup at all when my desktop word processor can do that right now? In my view, there are numerous reasons: some I've discussed here, and others, like the potential for interoperability, make structured document markup languages extremely flexible and highly leverageable.

There is no doubt that today's structured markup tools don't always make it easy for users to create content without the markup peeking through the cracks. That doesn't mean that structured markup is the problem. For example, one of my web browsers won't display Scalable Vector Graphics (SVG) at all. It doesn't mean that the SVG standard is a problem; it means that I need to use a web browser that supports the standard.

Eric's article is thought-provoking and well done. It raises the level of discussion that we need to have around why we use structured content (and not because it's the coolest fad), and how we create that content. Let's keep this discussion going.
