Saturday, March 28, 2009

DocBook Going Modular

Scott Hudson, Dick Hamilton, Larry Rowland and I (AKA, “The Colorado DocBook Coalition”) recently drafted a proposal to support “modular” DocBook and presented it to the DocBook TC yesterday.  In general, this proposal is in response to huge demand for DITA-like capabilities for DocBook. 

Many core business factors are driving DocBook in this direction:
  • more distributed authoring: authors are responsible for specific content areas rather than whole manuals.  Content could be authored by many  different authors, even some in different organizations altogether.
  • content reuse: This has long been a "holy grail" of information architects:  write content once, reuse in many different contexts
  • change management:  isolate the content that has changed.  This is a key driver for companies that have localization needs.  By modularizing their content, they can drive down costs by targeting only the changed content  for translation.

Additionally, there are additional downstream opportunities for modularized content:

  • dynamic content assembly:  create "publications" on the fly using an external assembly file that identifies the sequence and hierarchy of modular components rather than creating a single canonical instance.

The following excerpts from the proposal detail the preliminary features (Important: these are not yet set in stone and are subject to change).  The final version will be delivered with the 5.1 release. 

Assemblies

The principle metaphor for Modular DocBook is the “assembly”.  An assembly defines the resources, hierarchy and relationships for a collection of DocBook components.  The <assembly> element can be the structural equivalent of any DocBook component, such as
a book, a chapter, or an article.  Here’s the proposed content model in RelaxNG Compact mode:

db.assembly =
  element assembly {
    db.info?, db.toc*, db.resources+, db.relationships*
  }

Resources

The <resources> element is high-level container that contains one or more resource objects that are managed by the <assembly>.  An <assembly> can contain 1 or more <resources> containers to allow users to organize content into logical groups based on profiling attributes.

Each <resources> element must contain 1 or more <resource> elements.

db.resources =
  
element resources {
      db.common.attributes, db.resource+
   }

Specifying Resources

The <resource> element identifies a "managed object" within the assembly. Typically, a <resource> will point to a content file that can be identified by a valid URI.  However a <resource> can also be a 'static' text value that behaves similarly to a text entity.

Every <resource> MUST have a unique ID value within the context of the entire <assembly>

db.resource =
  element resource {
    db.common.attributes,
    attribute fileref { text }?,
    attribute resid {text}?,
    text?
  }

Content-based resources can also be content fragments within a content file, similar to an URI fragment:  file.xml/#ID.

Additionally, a resource can point to another resource.  This allows users to create "master" resource that can be referenced in the current assembly, and indirectly point the underlying resource that the referenced resource identifies.

For example:

<resource
    id="master.resource" 
    fileref="errormessages.xml"/>
<resource
   id="class.not.found"
   resid="{master.resource}/#classnotfound"/>
<resource
   id="null.pointer"
   resid="{master.resource}/#nullpointer"/>

The added benefit of indirect references is that users can easily point the resource to a different content file, provided that it used the same underlying fragment ids internally.  It could also be used for creating locale-specific resources that reference the same resource id.

Text-based resources behave similarly to XML text entities.  A content-based resource can reference a resource, provided that both the text resource and the content resource are managed by the same assembly.

assembly.xml:

...
<resource id="company.name">Acme Tech, Inc.</resource>
<resource id="company.ticker">ACMT</resource>
...

file1.xml:

<para><phrase resid="company.name"/> (<phrase resid="company.ticker"/>) is a
publicly traded company...</para>

Organizing Resources into a Logical Hierarchy

The <toc> element defines the sequence and hierarchy of content-based resources that will be rendered in the final output.  It behaves in a similar fashion to a DITA map and topicrefs.  However, instead of each <tocentry> pointing to a URI, it points to a resource in the <resources> section of the assembly:

<toc>
    <tocentry linkend="file.1"/>
    <tocentry linkend="file.2">
        <tocentry linkend="file.3"/>
    </tocentry>
</toc>

<resources>
    <resource id="data.table" fileref="data.xml"/>
    <resource id="file.1" fileref="file1.en.xml"/>
    <resource id="file.2" fileref="file2.en.xml"/>
    <resource id="file.3" fileref="{data.table}/#table1"/>
</resources>

Creating Relationships Between Resources

One of the more clever aspects of DITA’s architecture is the capability to specify relationships between topics within the context of the map (and independent of the topics themselves).  The DocBook TC is currently considering several proposals that will enable resources to be related to each other within the assembly.

The Benefits of a Modular DocBook

There is a current mindset (whether it’s right or wrong is irrelevant) that DocBook markup is primarily targeted for “monolithic” manuscripts.  With this proposal, I think there many more possibilities for information architects to create new types of content: websites, true help systems, mashups, dynamically assembled content based on personalized facets (Web 2.0/3.0 capabilities), a simplified Localization strategy like that which has been advocated in DITA.

What’s more: the design makes no constraints on the type of content resources referenced in an assembly:  In fact they can be any type: sections, chapters, images, even separate books (or assemblies) to mimic DocBook’s set element.

The design takes into account existing DocBook content that currently exists as “monolithic” instances, but is flexible enough to support other applications like IMF manifests for SCORM-compliant content, making it easy to create e-Learning content.

As the first draft of the proposal, I would expect that there will be changes between now and the final spec.  Yet, the core of the proposal should remain relatively intact.  If you would like to get involved or have other ideas, let me know.  Stay tuned.

Technorati Tags: ,,
del.icio.us Tags: ,,