Saturday, March 28, 2009

DocBook Going Modular

Scott Hudson, Dick Hamilton, Larry Rowland and I (AKA, “The Colorado DocBook Coalition”) recently drafted a proposal to support “modular” DocBook and presented it to the DocBook TC yesterday.  In general, this proposal is in response to huge demand for DITA-like capabilities for DocBook. 

Many core business factors are driving DocBook in this direction:
  • more distributed authoring: authors are responsible for specific content areas rather than whole manuals. Content could be written by many different authors, some of them in different organizations altogether.
  • content reuse: this has long been a "holy grail" of information architects: write content once, reuse it in many different contexts.
  • change management: isolate the content that has changed. This is a key driver for companies that have localization needs. By modularizing their content, they can drive down costs by sending only the changed content for translation.

Modularized content also opens up downstream opportunities:

  • dynamic content assembly:  create "publications" on the fly using an external assembly file that identifies the sequence and hierarchy of modular components rather than creating a single canonical instance.

The following excerpts from the proposal detail the preliminary features (Important: these are not yet set in stone and are subject to change).  The final version will be delivered with the 5.1 release. 

Assemblies

The principal metaphor for Modular DocBook is the “assembly”.  An assembly defines the resources, hierarchy and relationships for a collection of DocBook components.  The <assembly> element can be the structural equivalent of any DocBook component, such as a book, a chapter, or an article.  Here’s the proposed content model in RELAX NG compact syntax:

db.assembly =
  element assembly {
    db.info?, db.toc*, db.resources+, db.relationships*
  }

Resources

The <resources> element is a high-level container that holds one or more resource objects managed by the <assembly>.  An <assembly> can contain one or more <resources> containers, which lets users organize content into logical groups based on profiling attributes.

Each <resources> element must contain one or more <resource> elements.

db.resources =
  element resources {
    db.common.attributes, db.resource+
  }

Specifying Resources

The <resource> element identifies a "managed object" within the assembly.  Typically, a <resource> will point to a content file that can be identified by a valid URI.  However, a <resource> can also be a 'static' text value that behaves similarly to a text entity.

Every <resource> MUST have an ID value that is unique within the entire <assembly>.

db.resource =
  element resource {
    db.common.attributes,
    attribute fileref { text }?,
    attribute resid {text}?,
    text?
  }
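The uniqueness rule can be checked mechanically. The following Python sketch is only an illustration, not part of the proposal: the function name duplicate_resource_ids and the sample assembly are made up, and it assumes un-namespaced elements with a plain id attribute, as in the examples in this post.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical assembly with two <resources> groups that reuse an id.
SAMPLE = """<assembly>
  <resources>
    <resource id="intro" fileref="intro.xml"/>
    <resource id="logo" fileref="logo.png"/>
  </resources>
  <resources>
    <resource id="logo" fileref="logo.fr.png"/>
  </resources>
</assembly>"""

def duplicate_resource_ids(assembly_xml):
    """Return the resource ids that appear more than once in the assembly."""
    root = ET.fromstring(assembly_xml)
    # Count ids across every <resources> container, not just within one.
    counts = Counter(
        r.get("id") for r in root.iter("resource") if r.get("id") is not None
    )
    return sorted(rid for rid, n in counts.items() if n > 1)

duplicates = duplicate_resource_ids(SAMPLE)
print(duplicates)
```

An empty result would mean every id is unique across all <resources> groups in the assembly.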

Content-based resources can also be content fragments within a content file, addressed in a way similar to a URI fragment:  file.xml/#ID.

Additionally, a resource can point to another resource.  This allows users to create a "master" resource that can be referenced in the current assembly; a referencing resource then points indirectly to the underlying content that the master resource identifies.

For example:

<resource
    id="master.resource" 
    fileref="errormessages.xml"/>
<resource
   id="class.not.found"
   resid="{master.resource}/#classnotfound"/>
<resource
   id="null.pointer"
   resid="{master.resource}/#nullpointer"/>

The added benefit of indirect references is that users can easily point the master resource to a different content file, provided that the new file uses the same fragment ids internally.  Indirection could also be used to create locale-specific resources that reference the same resource id.
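To make the indirection concrete, here is a rough Python sketch of how a processor might expand the {id} placeholders from the example above. The proposal does not define a resolution algorithm; the resolve function, the dictionary layout, and the French file name are assumptions for illustration only.

```python
import re

# Hypothetical in-memory view of the three <resource> elements above:
# each id maps to its fileref or resid value.
REFS = {
    "master.resource": "errormessages.xml",
    "class.not.found": "{master.resource}/#classnotfound",
    "null.pointer": "{master.resource}/#nullpointer",
}

# A leading {id} names another resource; the rest is kept as a suffix.
PLACEHOLDER = re.compile(r"^\{([^}]+)\}(.*)$")

def resolve(resource_id, refs, seen=()):
    """Follow {id} placeholders until a plain reference is reached."""
    if resource_id in seen:
        raise ValueError(f"circular resource reference: {resource_id}")
    match = PLACEHOLDER.match(refs[resource_id])
    if match is None:
        return refs[resource_id]  # direct reference, nothing to expand
    target_id, suffix = match.groups()
    return resolve(target_id, refs, seen + (resource_id,)) + suffix

english = resolve("class.not.found", REFS)

# Repointing only the master resource redirects every dependent resource.
french = resolve(
    "class.not.found", {**REFS, "master.resource": "errormessages.fr.xml"}
)
print(english, french)
```

Only the master entry changes between the two calls, which is the repointing benefit described above.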

Text-based resources behave similarly to XML text entities.  A content-based resource can reference a text resource, provided that both the text resource and the content resource are managed by the same assembly.

assembly.xml:

...
<resource id="company.name">Acme Tech, Inc.</resource>
<resource id="company.ticker">ACMT</resource>
...

file1.xml:

<para><phrase resid="company.name"/> (<phrase resid="company.ticker"/>) is a
publicly traded company...</para>
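As a sketch of how the substitution might work at processing time, the Python below fills each <phrase resid="..."/> with the matching text resource. It is not the TC's implementation: the function names are invented, the assembly is wrapped in a minimal <assembly>/<resources> shell for the sake of a parseable example, and the elements are assumed to be un-namespaced as shown in this post.

```python
import xml.etree.ElementTree as ET

ASSEMBLY = """<assembly><resources>
  <resource id="company.name">Acme Tech, Inc.</resource>
  <resource id="company.ticker">ACMT</resource>
</resources></assembly>"""

CONTENT = """<para><phrase resid="company.name"/> (<phrase resid="company.ticker"/>) is a
publicly traded company...</para>"""

def text_resources(assembly_xml):
    """Map resource ids to their static text, skipping file-based resources."""
    root = ET.fromstring(assembly_xml)
    return {
        r.get("id"): r.text
        for r in root.iter("resource")
        if r.text and r.text.strip()
    }

def expand_phrases(content_xml, texts):
    """Fill every <phrase resid="..."/> whose id names a text resource."""
    root = ET.fromstring(content_xml)
    for phrase in root.iter("phrase"):
        resid = phrase.get("resid")
        if resid in texts:
            phrase.text = texts[resid]
    return root

para = expand_phrases(CONTENT, text_resources(ASSEMBLY))
rendered = "".join(para.itertext())
print(rendered)
```

Changing the company name in assembly.xml would then update every content file that references it, much as a text entity would.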

Organizing Resources into a Logical Hierarchy

The <toc> element defines the sequence and hierarchy of content-based resources that will be rendered in the final output.  It behaves in a similar fashion to a DITA map and topicrefs.  However, instead of each <tocentry> pointing to a URI, it points to a resource in the <resources> section of the assembly:

<toc>
    <tocentry linkend="file.1"/>
    <tocentry linkend="file.2">
        <tocentry linkend="file.3"/>
    </tocentry>
</toc>

<resources>
    <resource id="data.table" fileref="data.xml"/>
    <resource id="file.1" fileref="file1.en.xml"/>
    <resource id="file.2" fileref="file2.en.xml"/>
    <resource id="file.3" fileref="{data.table}/#table1"/>
</resources>
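The following Python sketch shows one way a processor might walk the <toc> above and turn it into an ordered outline of resolved targets. It is an assumption-laden illustration: the outline and resolve functions are hypothetical, the two fragments are wrapped in an <assembly> root so they parse as one document, and {id} placeholders are expanded wherever they appear, in either fileref or resid.

```python
import re
import xml.etree.ElementTree as ET

ASSEMBLY = """<assembly>
<toc>
    <tocentry linkend="file.1"/>
    <tocentry linkend="file.2">
        <tocentry linkend="file.3"/>
    </tocentry>
</toc>
<resources>
    <resource id="data.table" fileref="data.xml"/>
    <resource id="file.1" fileref="file1.en.xml"/>
    <resource id="file.2" fileref="file2.en.xml"/>
    <resource id="file.3" fileref="{data.table}/#table1"/>
</resources>
</assembly>"""

PLACEHOLDER = re.compile(r"^\{([^}]+)\}(.*)$")

def resolve(resource_id, refs, seen=()):
    """Expand a leading {id} placeholder by following it to its target."""
    if resource_id in seen:
        raise ValueError(f"circular resource reference: {resource_id}")
    match = PLACEHOLDER.match(refs[resource_id])
    if match is None:
        return refs[resource_id]
    target_id, suffix = match.groups()
    return resolve(target_id, refs, seen + (resource_id,)) + suffix

def outline(assembly_xml):
    """Return (depth, target) pairs in document order for the <toc>."""
    root = ET.fromstring(assembly_xml)
    refs = {
        r.get("id"): r.get("resid") or r.get("fileref")
        for r in root.iter("resource")
    }
    result = []

    def walk(parent, depth):
        # findall matches direct children only, so nesting is preserved.
        for entry in parent.findall("tocentry"):
            result.append((depth, resolve(entry.get("linkend"), refs)))
            walk(entry, depth + 1)

    walk(root.find("toc"), 0)
    return result

entries = outline(ASSEMBLY)
for depth, target in entries:
    print("  " * depth + target)
```

Because each <tocentry> names a resource id rather than a file, swapping file1.en.xml for a translated file would only require a change in the <resources> section.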

Creating Relationships Between Resources

One of the more clever aspects of DITA’s architecture is the capability to specify relationships between topics within the context of the map (and independent of the topics themselves).  The DocBook TC is currently considering several proposals that will enable resources to be related to each other within the assembly.

The Benefits of a Modular DocBook

There is a current mindset (whether it’s right or wrong is irrelevant) that DocBook markup is primarily targeted at “monolithic” manuscripts.  With this proposal, I think there are many more possibilities for information architects to create new types of content: websites, true help systems, mashups, dynamically assembled content based on personalized facets (Web 2.0/3.0 capabilities), and a simplified localization strategy like the one advocated in DITA.

What’s more, the design places no constraints on the type of content resources referenced in an assembly.  They can be of any type: sections, chapters, images, even separate books (or assemblies) to mimic DocBook’s set element.

The design takes into account existing DocBook content that currently exists as “monolithic” instances, but is flexible enough to support other applications like IMS manifests for SCORM-compliant content, making it easy to create e-Learning content.

Since this is the first draft of the proposal, I expect that there will be changes between now and the final spec.  Even so, the core of the proposal should remain relatively intact.  If you would like to get involved or have other ideas, let me know.  Stay tuned.

7 comments:

Scott Hudson said...

You beat me to it! I was thinking of posting about this too. :-)

I'm very excited about what this means for DocBook!

Jim Earley said...

The more blogs, the merrier from where I see things! I'm very pleased that the TC is moving forward. In some regards, DITA might be considered a "disruptive" technology to the extent that it truly has changed the underlying assumptions of authoring and deploying structured content. With that in mind, I think that DocBook's heritage of rich markup for virtually any type of manuscript (including the new Publisher's schema) along with this modular approach can and will elicit some very interesting discussions in the future.

Marcus said...

more power to you. assuming this did get going, would you spawn another project at sourceforge or join the existing xsl project?

Jim Earley said...

Interesting question. Given that the existing XSL project uses v1.0 stylesheets, my initial thinking is that the proposal would benefit from XSLT 2.0, along with XProc (Calabash comes to mind - http://www.xmlcalabash.com). I do think, given the installed base using DocBook XSL, that it would make sense to leverage existing code where possible. However, I can imagine several areas in the code that could be refactored to take advantage of built-in features of 2.0 (xsl:result-document and xsl:function come to mind immediately).

But to answer your question: I think it makes sense to join the existing project, but as a separate "application," much like what exists for PDF and HTML today.

Jeff said...

Jim, Calabash and XProc in general look set to scratch some itches that I usually write Python or PHP code for.

But WRT your take on modular DocBook - I'd take very much to heart the sentiment expressed in the flame war over on Calabash's project page's comment section; having a complete stack of FOSS components to serve as a reference implementation would be a Very Good Thing. Not that all of us are necessarily cheapskates. But as you're likely painfully aware, the greatest benefit of a complete FOSS toolchain in the documentation space is the guarantee that at some arbitrary future time, some competent practitioner(s) can faithfully recreate the output you intended from your document sources. For those of us who cut our writing teeth on DisplayWrite and friends, that is no small thing.

Michael Rempel said...

Did anything ever come of this? I am actively looking for best of breed current standards based solutions. DocBook 5.1 seems stalled. Is it?

Carlos said...

Jim,

While not as elaborate as the Assembly concept, you've always been able to build modular documents with XInclude and write them as separate documents with the goal of creating reusable modular content.

My other concern is adoption and whether this is really needed/useful in DocBook or whether we can modify the transclusion mechanisms we already have to make this work. I particularly find the work that went into the dita2docbook stylesheet in the DITA toolkit intriguing and worth more research.