I recently had the opportunity to work on an interesting XML schema. The intent
was to create an HTML5 markup grammar for authoring digital content, primarily for
EPUB and the web, and ultimately for print. The primary design goal was to create an
authoring grammar that facilitates some level of semantic tagging and that is natively
HTML5 compliant, i.e., no transformation is required to move between the
authoring format and HTML5.
What is interesting about this particular schema is that it follows design patterns
similar to those used in microformats. Typographic structures such as a bibliography
or a figure are tagged with standard HTML elements, with additional typographic
semantics expressed through the class attribute. For example, a figure heading
structure must look like the following:
<figure>
<h2><span class="caption">Figure </span>
<span class="caption_number">1.1 </span>Excalibur and the Lady of the Lake</h2>
</figure>
Notice the <span> tags. From the perspective of describing our typographic
semantics (figures must have captions and captions must have a number), this isn’t too
bad. However, from a schema perspective, it is considerably more difficult, because the
underlying HTML5 grammar is quite complex at the level of the <div>, <h2>, and
<span> elements. In addition to the required “caption”
and “caption_number” semantics applied to the <span> tags, the
<h2> element also allows text and other inline flow elements, such as <strong>,
<em>, and, of course, other <span> tags that apply other semantics.
To enforce the mandate that a figure heading must have a label and number as the first
two nodes of the <h2> element, we can use XML Schema 1.1 assertions.
Assertions apply business rules to the markup that cannot be expressed directly in the
content model sequences, using a limited subset of XPath axes and functions that must
return a boolean result.
Alternately, Schematron could be used independently (or in addition to assertions) as a
means of enforcing the business rules in the markup. The issue is that a Schematron
rule set resides outside the XML schema, and therefore requires additional tooling
integration in the authoring environment to apply these rules.
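For illustration, here is what a Schematron rule for the figure heading might look like. This is a minimal sketch of my own (assuming the HTML elements are in no namespace), not part of the schema under discussion, and it mirrors the same constraint enforced by the assertion below:
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern id="figure-heading">
    <rule context="figure/h2">
      <!-- The heading must begin with the caption label -->
      <assert test="node()[1][@class = 'caption']">
        A figure heading must begin with a span whose class is "caption".
      </assert>
      <!-- The caption label must be followed by the caption number -->
      <assert test="node()[1][@class = 'caption']
                    /following-sibling::span[@class = 'caption_number']">
        The caption label must be followed by a span whose class is
        "caption_number".
      </assert>
    </rule>
  </pattern>
</schema>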
Returning to the assertion approach, for our heading above we must apply the following:
<xs:assert test="child::h2/node()[1][@class='caption']/following-sibling::span[@class='caption_number']"/>
In this case, the assertion states that the <h2> element’s first node must have
a class attribute value of “caption”, followed by a <span> element whose
class attribute value is “caption_number”. After that, any acceptable
text or inline element defined by the HTML5 grammar is allowed.
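To show where such an assertion lives in an XML Schema 1.1 schema, here is a heavily simplified sketch. The declarations are hypothetical stand-ins of mine; the real schema must of course cover the full HTML5 content model:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:vc="http://www.w3.org/2007/XMLSchema-versioning"
           vc:minVersion="1.1">

  <!-- Hypothetical, simplified declarations -->
  <xs:element name="span">
    <xs:complexType mixed="true">
      <xs:attribute name="class" type="xs:string"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="h2">
    <xs:complexType mixed="true">
      <xs:sequence>
        <xs:element ref="span" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="figure">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="h2"/>
      </xs:sequence>
      <!-- XSD 1.1 assertion: the heading's first node carries the "caption"
           semantic and is followed by the caption number -->
      <xs:assert test="child::h2/node()[1][@class = 'caption']
                       /following-sibling::span[@class = 'caption_number']"/>
    </xs:complexType>
  </xs:element>

</xs:schema>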
This is a very simple example of how the existing HTML5 grammar alone cannot enforce the
semantic structure we wish to express. There are numerous other examples within
the content model that would leverage the same design pattern.
We have completed several successful projects with this approach, and the value of
having a single authoring/presentation grammar (HTML5) is very appealing. However,
the approach also brings some real difficulties. Consider:
- Microformats are a clever application of simple HTML formatting tags to convey
semantic meaning. The markup is valid HTML by virtue of its tags and attributes, with
additional semantics expressed through the values of certain attributes, such as the
class attribute. In general, microformat documents are small, discrete documents, as
they are intended to be machine readable to give the application its functionality.
From an authoring perspective, it’s relatively simple to create a form that captures
the essential data, which is then processed by machine to generate the final
microformat data (or, for the markup and microformat savvy, to create it by hand – but
we are in the minority). Think of microformat instances as small pieces of
functionality embedded as a payload within a larger document, accessed only by
applications with a strong understanding of the format. If we take the notion of
microformats and use them throughout a document, we run into tooling issues, because
we’re now asking a broader range of applications (e.g., XML editors) to understand our
microformat.
- The “concrete” structural semantics (how to model figures and captions) are
specified with “abstract” formatting HTML tags. Conflating presentation and
structural semantics in this way is contrary to a very common design principle in
use today in many languages and programming frameworks, namely to separate the semantics/structure from the
formatting of content.
- The schema’s maintainability decreases with the sheer number of assertions that must
be enforced for each typographical structure. Any change to one structure
may have ripple effects on other content classes.
- Not all XML authoring tools are created equal. Some don’t honor assertions.
Others do not support XML Schema 1.1 at all. Consequently, your overall
XML strategy becomes significantly more complex to implement. It
might mean maintaining two separate schemas, and it might also mean
additional programming to enforce the structural semantics we want the authoring
tool to manage.
- As a corollary to the previous point, creating a usable
authoring experience will require significant development overhead to ensure
users can apply the right typographical structures with the correct markup. It
could be as simple as binding templates to menus or toolbars, but it could easily
extend into much more. The alternative is to invest in authors and editors who are
trained extensively to create the appropriate markup. Now consider point #3: any
change to the schema has ripple effects on the user experience as well.
- Instead of simplifying the transformation process, tag overloading can have the
reverse effect. You end up creating templates for each and every class
value, and it’s not difficult to end up with so many permutations that an ambiguous
match produces the wrong output (a sketch of what these class-driven templates look
like follows this list). Having gone down this road with another
transformation pipeline for another client, I can tell you that unwinding this is
not a trivial exercise (I’ll share that story in another post).
- Assertion violation messages coming from the XML parser are extremely cryptic:
cvc-assertion: Assertion evaluation ('child::node()[1]/@class='label'') for element 'summary' on schema type 'summary.class' did not succeed.
For non-XML-savvy practitioners, this kind of message is the precursor to
putting their hands up and calling tech support. Even if you use something
like Schematron on the back end to validate and provide friendlier error
messages, you’ve already made the system more complex.
- It violates the KISS principle. The schema, at first glance, appears to
be an elegant solution. If used correctly, it mitigates a big problem
for publishers: how do I faithfully render the content so that it appears as
prescribed? Theoretically, this schema would require only very light
transformation to achieve the desired effect. Yet it trades one seemingly
intractable problem for the several others I’ve described above.
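Here is the sketch promised in the transformation point above: a hypothetical fragment of my own (not any client’s stylesheet) showing how class-driven templates accumulate, and how easily two of them end up competing for the same node:
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- One template per overloaded class value -->
  <xsl:template match="span[@class = 'caption']">
    <xsl:text>[caption label] </xsl:text>
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="span[@class = 'caption_number']">
    <xsl:text>[caption number] </xsl:text>
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Ambiguity creeps in as class values multiply: this generic rule and the
       span rule above both match <span class="caption"> with the same default
       priority (0.5), so which template fires depends on the processor's
       conflict resolution, not on your design. -->
  <xsl:template match="*[@class = 'caption']">
    <xsl:apply-templates/>
  </xsl:template>

</xsl:stylesheet>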
Several years ago, I recommended using microformats as an interoperability format for
managing content between DITA, DocBook, and other XML markups. The purpose of the
format was specifically to be generated and read by a set of XSLT stylesheets that do
the heavy lifting of converting between standards. The real benefit
is that you create a transformation once for each input and output, rather than building
“one-off” transformations for each version of the standard. Once in the converted
markup, the content could leverage its transformations to produce the desired output.
I think the key distinction is that the XML Interoperability Framework was never intended
to be an authoring medium. Users would create content in the target format, using the
tools designed for that format. This schema’s strategy is to author directly into
the interop format, and the unintended consequences described above make the complexity
of implementing, using, and maintaining it far greater than it needs to be.
Sometimes, cutting out the middle man is not cheaper or easier.
Here’s another alternative to consider:
- A meaning for everything: create a schema with clear, discrete semantics and
specific content models for each structure. Yes, it means you have
to create stylesheets with greater degrees of freedom to support the output
styling you want, and perhaps that is always a one-off effort, but overall it’s easier
to manipulate a transformation with overrides or parameters than to overload
semantics.
Consider our figure heading above: if we want to mandate that a
figure heading must have a caption label and a caption number, then semantically
tagging them as such gives you greater freedom in your inline markup, such as
<span>. Using this principle, I could see markup like
the following:
<figure>
<figtitle>
<caption_label>Figure</caption_label>
<caption_number>1.1</caption_number>
Excalibur and the Lady of the Lake
</figtitle>
</figure>
Which might be rendered in HTML5 as:
<figure>
<h2>
<span class="caption">Figure </span>
<span class="caption_number">1.1 </span>
Excalibur and the Lady of the Lake
</h2>
</figure>
That also allows me to distinguish this heading from other types of headings that have
different markup requirements. For example, a section title might not have the
same caption and numbering mandate:
<section>
<title>The Relationship Between Arthur and Merlin</title>
<subtitle>Merlin as Mentor</subtitle>
...
</section>
Which might be rendered in HTML5 as:
<section>
<h1>The Relationship Between Arthur and Merlin</h1>
<h2>Merlin as Mentor</h2>
...
</section>
Notice that in both cases we’re not throwing all the HTML5 markup overboard
(figure and section are HTML5 elements); we’re just providing more explicit
semantics that model our business rules more precisely. Moreover, it’s
substantially easier to encapsulate and enforce these distinct models in the
schema without assertions or Schematron rules (a sketch of such a content model
follows this list), unless there are specific business rules within the text or
inline markup that must be enforced independently of the schema.
Of course, if you change the schema, you may also have to make changes in the
authoring environment and/or downstream processing. However, that would be true
in either case. And, irrespective of whether I use an HTML5-like or a
semantically explicit schema, I still need to apply some form of transformation
to content written against earlier versions of the schema to bring it up to the
most current version. The key takeaway is that there is little in the way of
development savings with the HTML5 approach.
- Design the system with the author as your first priority. For example, most
XML authoring tools make it easy to insert the correct tags for required markup
(e.g., our figure heading), especially when each tag’s name is distinct. Many of
these same tools also provide functionality to “hide” or “alias” the tags in a way
that’s more intuitive to use. Doing this with an overloaded tagging approach will
require a lot more development effort to provide the same ease of use. Without that
effort, and left to their own devices, authors are going to struggle to create valid
content, and you are almost certain to have a very difficult time with adoption.
- Recognize that tools change over time. The less you have to customize to make the
authoring experience easy, the more likely you can take advantage of new features
and functionality without substantial rework, which also means lower TCO and,
consequently, higher ROI.
- Back-end changes are invisible to authors. By all means, it’s absolutely vital to
optimize your downstream processes to deliver content more efficiently and to an
ever-growing number of digital formats. However, the tradeoffs made to over-simplify
the back end might end up costing more than they save.
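And here is the content-model sketch promised earlier: a simplified illustration of my own (not a real schema) of the explicit figure title model, where the business rule lives in the content model itself and no assertion is required:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:element name="figure">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="figtitle"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- The caption label and number are mandated by the sequence itself -->
  <xs:element name="figtitle">
    <xs:complexType mixed="true">
      <xs:sequence>
        <xs:element name="caption_label" type="xs:string"/>
        <xs:element name="caption_number" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>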
HTML5 will become the base format for a wide range of digital media, from EPUB to
mobile and the web. On the surface, it would appear that using HTML5 makes sense as both
a source format and a target format. The idea has a lot of appeal, particularly because
of the numerous challenges that still persist today with standard or custom markup
grammars, challenges that have affected both authoring and back-end processes.
Microformats’ appeal is the ability to leverage a well-known markup (HTML) to create
small, discrete semantic data structures targeted at applications with a strong
understanding of the format. Leveraging the simplicity of HTML5, we had hoped to create
a structured markup that was easy to use for content creation, with little to no
overhead on the back end to process and deliver the content. However, I discovered that
the approach doesn’t scale well when we apply the same design pattern to a larger set of
rich semantic structures within a schema designed around formatting semantics.
Instead, the opposite appears to be true: I see greater complexity in the schema design
due to the significant overloading of the class attribute to imply semantic meaning. I
also see limitations in current XML authoring tools to support a schema with that level
of complexity, without incurring a great deal of technical debt to implement and support
a usable authoring environment.
I also discussed how implementing an HTML5 schema with overloaded class attributes likely
won’t provide development savings compared to more semantically-explicit schemas when
changes occur. In fact, the HTML5 schema may incur greater costs due to its dependency
on assertions or Schematron to enforce content rules.
Rather than overloading tags with different structural semantics, an alternative might be
a “blended” model. Leverage HTML5 tags where they make sense: article,
section, figure, paragraphs, lists, inline elements, and so on. Where there are content
model variations or a need for more constrained models, use more explicit semantics.
This kind of approach takes advantage of the built-in features and functionality in
today’s XML authoring tools and mitigates the amount of programming or training
required. The underlying schema is also much easier to maintain long term. Of course,
there are trade-offs, in that back-end processing pipelines must
transform the content. However, with the right level of design, the transformations can
be made flexible and extensible enough to support most output and styling scenarios.
With this in mind, this type of tradeoff is acceptable if the authoring experience isn’t
compromised.