Tuesday, September 24, 2013

XML Schemas and the KISS Principle

I recently had the opportunity to work on an interesting XML schema.  The intent was to create an HTML5 markup grammar for producing digital content, primarily for EPUB and the web and ultimately for print.  The primary design goal was to create an authoring grammar that facilitates some level of semantic tagging and is natively HTML5 compliant, i.e., no transformation is required to move between the authoring format and HTML5.

What is interesting about this particular schema is that it resembles the design patterns used for microformats.  Typographic structures such as a bibliography or a figure are tagged with standard HTML elements, with additional typographic semantics expressed using the class attribute.  For example, a figure heading structure must look like the following:

<figure>
    <h2><span class="caption">Figure </span>
    <span class="caption_number">1.1 </span>Excalibur and the Lady of the Lake</h2>
</figure>

Notice the <span> tags.  From the perspective of describing our typographic semantics (figures must have captions and captions must have a number), this isn’t too bad.  However, from a schema perspective, it’s much more complex, because the underlying HTML5 grammar is quite complex at the level of <div>, <h2> and <span> elements.  In addition to the required “caption” and “caption_number” semantics applied to the <span> tags, the <h2> element also allows text and other inline flow elements, such as <strong>, <em>, and, of course, other <span> tags that apply other semantics.

To enforce the mandate that a figure heading must have a label and number as the first two nodes of the <h2> element, we can use XML Schema 1.1 assertions.  Assertions allow us to apply business rules to the markup that cannot be expressed directly in the content model sequences.  Each assertion is an XPath expression that must evaluate to true, and it can only see the element being validated and its descendants.

Alternatively, Schematron could be used independently (or in addition to assertions) as a means of enforcing the business rules in the markup. The issue here is that a Schematron rule set resides outside of the XML schema and therefore requires additional tooling integration in the authoring environment to apply these rules.
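
As a rough sketch of what the Schematron route might look like for the figure heading rule, consider the rule set below. It assumes the elements are in no namespace, and the pattern id and the message wording are invented for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt">
  <!-- Hypothetical pattern id; the rule fires for every <h2> inside a <figure> -->
  <sch:pattern id="figure-headings">
    <sch:rule context="figure/h2">
      <!-- The first node of the heading must be the label span -->
      <sch:assert test="node()[1][self::span][@class='caption']">
        A figure heading must begin with a label such as "Figure".
      </sch:assert>
      <!-- The next element after the label must be the number span -->
      <sch:assert test="span[@class='caption']/following-sibling::*[1][self::span][@class='caption_number']">
        The figure label must be followed by a figure number such as "1.1".
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

The appeal is that the author of the rules controls the message text. The cost is the one already noted: this file lives beside the schema, so the authoring tool has to be configured to run it.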

So, for our heading above, we must apply the following assertion:

<xs:assert test="child::h2/node()[1][@class='caption']/following-sibling::*[1][self::span][@class='caption_number']"/>

In this case, the assertion states that the <h2> element’s first node must have a class attribute value of “caption”, and that the next element after it must be a <span> with a class attribute value of “caption_number.”  After that, any text or inline element allowed by the HTML5 grammar may follow.
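
To show where such an assertion sits, here is a minimal sketch of an XSD 1.1 schema fragment. The type names (figure.class, heading.class, inline.class) are hypothetical, the elements are assumed to be in no namespace, and the inline content model is cut down to three elements. The real HTML5 grammar is far larger.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:element name="figure" type="figure.class"/>

  <!-- Hypothetical type names, for illustration only -->
  <xs:complexType name="figure.class">
    <xs:sequence>
      <xs:element name="h2" type="heading.class"/>
      <!-- the figure body (images, tables, and so on) is omitted -->
    </xs:sequence>
    <!-- XSD 1.1 only: evaluated with <figure> as the context node.
         *[1] requires the number span to be the next element after the label. -->
    <xs:assert test="child::h2/node()[1][@class='caption']/following-sibling::*[1][self::span][@class='caption_number']"/>
  </xs:complexType>

  <!-- The content model alone cannot say which span comes first -->
  <xs:complexType name="heading.class" mixed="true">
    <xs:choice minOccurs="0" maxOccurs="unbounded">
      <xs:element name="span" type="inline.class"/>
      <xs:element name="strong" type="inline.class"/>
      <xs:element name="em" type="inline.class"/>
    </xs:choice>
  </xs:complexType>

  <xs:complexType name="inline.class" mixed="true">
    <xs:choice minOccurs="0" maxOccurs="unbounded">
      <xs:element name="span" type="inline.class"/>
      <xs:element name="strong" type="inline.class"/>
      <xs:element name="em" type="inline.class"/>
    </xs:choice>
    <xs:attribute name="class" type="xs:string"/>
  </xs:complexType>

</xs:schema>
```

An XSD 1.0 processor will reject the xs:assert element outright, which is one reason tool support matters so much for this design.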

This is a very simple example of how the existing HTML5 grammar alone cannot enforce the semantic structure we wish to express.  There are numerous other examples within the content model that would leverage the same design pattern.

We have done several successful projects with this approach, and having a single authoring/presentation grammar (HTML5) is very appealing. However, there can be issues and difficulties with this approach. Consider:

  1. Microformats are clever applications that give semantic meaning to simple HTML formatting tags.  The result is valid HTML by virtue of its tags and attributes, with additional semantics expressed through the values of certain attributes such as the class attribute.  In general, these microformat markup documents are small, discrete documents, as they are intended to be machine readable to give the application its functionality.  From an authoring perspective, it’s relatively simple to create a form that captures the essential data, which is then processed by machine to generate the final microformat data (those who are markup and microformat savvy can create it by hand, but we are in the minority). Think of microformat instances as small pieces of functionality embedded as a payload within a larger document that are only accessed by applications with a strong understanding of the format. If we take the notion of microformats and use them throughout a document, we can run into tooling issues, because we’re now asking a broader range of applications (e.g. XML editors) to understand our microformat.
  2. The “concrete” structural semantics (how to model figures and captions) are specified with “abstract” formatting HTML tags. Conflating presentation and structural semantics in this way is contrary to a very common design principle in use today in many languages and programming frameworks, namely to separate the semantics/structure from the formatting of content.
  3. The schema’s maintainability is decreased by the vast number of assertions that must be enforced for each typographical structure.  Any changes to any one structure may have ripple effects to other content classes.
  4. Not all XML authoring tools are created equal.  Some don’t honor assertions. Others do not support XML Schema 1.1 at all.  Consequently, your holistic XML strategy becomes significantly more complex to implement.  It might mean maintaining two separate schemas, and it might also mean additional programming is required to enforce the structural semantics that we wish to be managed in the authoring tool.
  5. As a corollary to the previous point, creating a usable authoring experience will require significant development overhead to ensure users can apply the right typographical structures with the correct markup.  It could be as simple as binding templates with menus or toolbars, but it could easily extend into much more.  Otherwise, the alternative is to make sure you invest in authors/editors who are trained extensively to create the appropriate markup.  Now consider point #3.  Any changes to the schema also have ripple effects on the user experience.
  6. Instead of simplifying the transformation process, tag overloading can have the reverse effect.  You end up having to create templates for each and every class value, and it’s not difficult to end up with so many permutations that an ambiguous match results in the wrong output.  Having gone down this road with another transformation pipeline for another client, I can tell you that unwinding this is not a trivial exercise (I’ll share this in another post).
  7. Assertion violation messages coming from the XML parser are extremely cryptic:
    cvc-assertion: Assertion evaluation ('child::node()[1]/@class='label'') for element 'summary' on schema type 'summary.class' did not succeed.

    For practitioners who are not XML-savvy, this kind of message is the precursor to putting their hands up and calling tech support.  Even if you use something like Schematron on the back end to validate and provide more friendly error messages, you’ve already made the system more complex.

  8. It violates the KISS principle.   The schema, at first glance, appears to be an elegant solution.  If used correctly, it mitigates what is a big problem for publishers:  How do I faithfully render the content to appear as prescribed in the content?  Theoretically, this schema would only require very light transformation to achieve the desired effect. Yet, it trades one seemingly intractable problem for several others that I’ve described above.
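
Point 6 is easier to see with a small example. The XSLT sketch below is hypothetical (the output class names fig-number and fig-label are invented), but it shows how two class-based templates can collide.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Written first, for figure numbers -->
  <xsl:template match="span[@class='caption_number']">
    <span class="fig-number"><xsl:apply-templates/></span>
  </xsl:template>

  <!-- Added later, intended for labels, but 'caption_number'
       also contains the string 'caption' -->
  <xsl:template match="span[contains(@class, 'caption')]">
    <span class="fig-label"><xsl:apply-templates/></span>
  </xsl:template>

</xsl:stylesheet>
```

Both patterns carry a predicate, so both get the same default priority of 0.5. For a span with class “caption_number”, an XSLT 1.0 or 2.0 processor may either report the conflict or quietly use the template that occurs last, which here renders the number as a label. An explicit priority attribute resolves this one case, but every new class value reopens the question.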

Several years ago, I recommended using microformats as an interoperability format for managing content between DITA, DocBook, and other XML markups.  The format was specifically meant to be generated and read by a set of XSLT stylesheets that do the heavy lifting of converting between standards.  The real benefit is that you create a transformation once for each input and output, rather than building “one-off” transformations for each version of the standard.  Once in the converted markup, the content could leverage its transformations to produce the desired output.

I think the key distinction is that the XML Interoperability Framework was never intended to be an authoring medium.  Users would create content in the target format, using the tools designed for that format.  This schema’s strategy is to author directly in the interop format, and the unintended consequences described above make implementing, using, and maintaining it far more complex than it needs to be.  Sometimes, cutting out the middle man is not cheaper or easier.

Here’s another alternative to consider:

  1. A meaning for everything:  create a schema with clear, discrete semantics with specific content models for each structure.  Yes, it explicitly means you have to create stylesheets with some greater degrees of freedom to support the output styling you want, and perhaps it’s always a one-off effort, but overall, it’s easier to manipulate a transformation with overrides or parameters than trying to overload semantics.

    For instance, consider our figure example above: if we want to mandate that a figure heading must have a caption label and a caption number, then semantically tagging them as such gives you greater freedom for your inline tagging markup like <span>. Using this principle, I could see a markup like the following:

    <figure> 
        <figtitle>
            <caption_label>Figure</caption_label> 
            <caption_number>1.1</caption_number> 
            Excalibur and the Lady of the Lake 
        </figtitle> 
    </figure> 

    Which might be rendered in HTML5 as:

    <figure> 
        <h2>
            <span class="caption">Figure </span> 
            <span class="caption_number">1.1 </span> 
            Excalibur and the Lady of the Lake 
        </h2> 
    </figure>

    That also allows me to distinguish figure headings from other types of headings that have different markup requirements. For example, a section title might not have the same caption and numbering mandate:

    <section> 
        <title>The Relationship Between Arthur and Merlin</title> 
        <subtitle>Merlin as Mentor</subtitle> 
        ... 
    </section>

    Which might be rendered in HTML5 as:

    <section> 
        <h1>The Relationship Between Arthur and Merlin</h1> 
        <h2>Merlin as Mentor</h2> 
        ... 
    </section>

    Notice that in both cases we’re not throwing all the HTML5 markup overboard (figure and section are HTML5 elements), we’re just providing more explicit semantics that model our business rules more precisely. Moreover, it’s substantially easier to encapsulate and enforce these distinctive models in the schema, without assertions or Schematron rules, unless there are specific business rules within the text or inline markup that must be enforced independently from the schema.

    Of course, if you change the schema, you may also have to make changes in the authoring environment and/or downstream processing. However, that would be true in either case. And, irrespective of whether I use an HTML5-like or a semantically explicit schema, I still need to apply some form of transformation to content written against earlier versions of the schema to update it to the most current version. The key takeaway is that there is little in the way of development savings with the HTML5 approach.

  2. Design the system with the author as your first priority.  For example, most XML authoring tools make it easy by inserting the correct tags for required markup (e.g., our figure heading), especially when each tag’s name is distinct. Many of these same tools also provide functionality to “hide” or “alias” the tags in a way that’s more intuitive to use. Doing this in an overloaded tagging approach will require a lot more development effort to provide the same ease of use. Without that effort, and left to their own devices, authors are going to struggle to create valid content, and you are almost certain to have a very difficult time with adoption.
  3. Recognize that tools change over time. The less you have to customize to make the authoring experience easy, the more likely you can take advantage of new features and functionality without substantial rework, which also means lower TCO and subsequently, higher ROI.
  4. Back end changes are invisible to authors. By all means, it’s absolutely vital to optimize your downstream processes to deliver content more efficiently and to an ever-growing number of digital formats. However, the tradeoffs for over-simplifying the backend might end up costing more.
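
To make the first point concrete, here is a minimal sketch of how the explicit figure title might be declared in plain XML Schema 1.0, with no assertions. The type name figtitle.class and the two inline elements are assumptions for illustration, not part of any published schema.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:element name="figure">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="figtitle" type="figtitle.class"/>
        <!-- the figure body is omitted -->
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- Hypothetical type: label and number are required and ordered by the
       content model itself, then any mix of text and inline elements -->
  <xs:complexType name="figtitle.class" mixed="true">
    <xs:sequence>
      <xs:element name="caption_label" type="xs:string"/>
      <xs:element name="caption_number" type="xs:string"/>
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element name="em" type="xs:string"/>
        <xs:element name="strong" type="xs:string"/>
      </xs:choice>
    </xs:sequence>
  </xs:complexType>

</xs:schema>
```

Because the two elements are required and distinctly named, a schema-aware editor can typically insert them as soon as the author creates a figure title. One caveat: the title is mixed content, so XSD alone still permits stray text before or between the two required elements.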

HTML5 will become the base format for a wide range of digital media, ranging from EPUB to mobile and the web. On the surface, it would appear that using HTML5 makes sense as both a source format and a target format. The idea has a lot of appeal particularly because of the numerous challenges that still persist today with standard or custom markup grammars that have impacted both authoring and backend processes.

Microformats’ appeal is the ability to leverage a well-known markup (HTML) to create small, discrete semantic data structures targeted for applications with a strong understanding of the format. Leveraging the simplicity of HTML5, we had hoped to create a structured markup that was easy to use for content creation, and with little to no overhead on the back end to process and deliver the content. However, I discovered that it doesn’t scale well when we try applying the same design pattern to a larger set of rich semantic structures within a schema designed for formatting semantics.

Instead, the opposite appears to be true: I see greater complexity in the schema design due to the significant overloading of the class attribute to imply semantic meaning. I also see limitations in current XML authoring tools to support a schema with that level of complexity, without incurring a great deal of technical debt to implement and support a usable authoring environment.

I also discussed how implementing an HTML5 schema with overloaded class attributes likely won’t provide development savings compared to more semantically-explicit schemas when changes occur. In fact, the HTML5 schema may incur greater costs due to its dependency on assertions or Schematron to enforce content rules.

Rather than overloading tags with different structural semantics, an alternative might be the use of a “blended” model. Leverage HTML5 tags where it makes sense: article, section, figure, paragraphs, lists, inline elements, and so on. Where there are content model variations or the need for more constrained models, use more explicit semantics. This kind of approach takes advantage of built-in features and functionality available in today’s XML authoring tools, and reduces the level of programming or training required. Also, the underlying schema is much easier to maintain long term. Of course, there are trade-offs in that back-end processing pipelines must transform the content. However, with the right level of design, the transformations can be made flexible and extensible enough to support most output and styling scenarios. With this in mind, this type of tradeoff is acceptable if the authoring experience isn’t compromised.
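
As a rough illustration of a flexible transformation, the XSLT sketch below turns the explicit figure title markup into HTML5. The parameter name figure-heading is an assumption made for this example. It lets the caller choose the heading level without editing the stylesheet.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html"/>

  <!-- Hypothetical parameter: heading element used for figure titles -->
  <xsl:param name="figure-heading" select="'h2'"/>

  <xsl:template match="figure">
    <figure><xsl:apply-templates/></figure>
  </xsl:template>

  <!-- The element name is computed from the parameter -->
  <xsl:template match="figtitle">
    <xsl:element name="{$figure-heading}">
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>

  <!-- One template per semantic element; no class-value matching needed -->
  <xsl:template match="caption_label">
    <span class="caption"><xsl:apply-templates/></span>
  </xsl:template>

  <xsl:template match="caption_number">
    <span class="caption_number"><xsl:apply-templates/></span>
  </xsl:template>

</xsl:stylesheet>
```

Each template matches on an element name, so adding a new structure means adding a new template rather than another predicate on span.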

2 comments:

Unknown said...

Look at HTMLBook schema by O'Reilly Media https://github.com/oreillymedia/HTMLBook/blob/master/schema/htmlbook.xsd for a different approach that leverages the @data-type instead of @class in HTML5.

Jim Earley said...

Indeed, the O'Reilly schema imposes some level of constraint, which is interesting. I would argue that whether @data-type or @class is used, model (read: typographic structures) constraints still require enforcement via assertion or Schematron.