Sunday, November 29, 2009

Fun With XProc

I've been so busy with clients over the last 6 months that I haven't had much time to tinker with XProc. I took the Thanksgiving holiday week off with the hope of having a little time to dabble with the language. Up until yesterday, I hadn't opened my computer once (has to be a new record) since we were busy with other things. A side note: If you're in Denver before February 7th, I highly recommend you see the Genghis Khan exhibit at the Museum of Nature and Science.

As I often do, I had already done some preliminary reading beforehand. James Sulak's blog is a must-read. Another very useful and informative website is from EMC: "XProc: Step By Step", originally authored by Vojtech Toman. Even the W3C specification is generally helpful.

The biggest hurdle for me was to stop thinking of XProc as working the same way Ant does. While Ant does process XML content, that isn't the tool's principal focus: Ant was designed primarily as a Java implementation of make-style build tools. For that purpose, Ant has become the de facto standard. Before XProc was conceived, many of us used Ant as a way to control the sequencing of complex XML publishing pipelines. I worked on XIDI, a DocBook-based publishing system at HP, which was principally based on Ant scripts; the DITA Open Toolkit is an Ant-based build environment.

For the most part Ant works admirably, but there are limitations. The biggest limitation is the xslt task's static parameter declarations, and the indirect way parameter values are passed to an XSL Transformation through property values. It works, but it can get kludgy pretty fast for complex stylesheets. More importantly, Ant is primarily a developer's tool that acts like a Swiss Army knife, with a tool for just about every purpose. Most of these tools work very well for the common tasks they were built for, but they aren't intended for specialized work. For that, you'll need to create custom Ant tasks or use other specialized tools. XProc is one of these specialized tools, designed specifically for XML processing.
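To make the parameter limitation concrete, here is a minimal sketch of an Ant build file that runs a single transformation. The project name, file paths, and the draft.mode property are all hypothetical, chosen only for illustration; the xslt task and its nested param element are standard Ant.

```xml
<project name="publish" default="transform" basedir=".">
  <!-- Hypothetical property; it could also be supplied on the
       command line with -Ddraft.mode=yes -->
  <property name="draft.mode" value="no"/>

  <target name="transform">
    <!-- Hypothetical input, output, and stylesheet paths -->
    <xslt in="src/book.xml" out="build/book.html" style="xsl/book-html.xsl">
      <!-- Every stylesheet parameter has to be declared statically here,
           and its value arrives indirectly through an Ant property -->
      <param name="draft.mode" expression="${draft.mode}"/>
    </xslt>
  </target>
</project>
```

With one parameter this is perfectly readable. With a stylesheet that takes dozens of parameters, each one needs its own property and its own param declaration, which is where it starts to feel kludgy.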

So the biggest conceptual difference to grok in XProc (I like this…) is how steps are connected together to form a complete pipeline process. Rather than using target dependencies and explicit target calls like you do in Ant, XProc uses the concept of pipes to connect the output of one step to the input of the next step. It's very much like Unix shell or DOS command line pipelines. For example:

ps -ax | tee processes.txt | more

Since many steps (including the p:xslt step) can have more than one input and one or more outputs (think of xsl:result-document in XSLT 2.0), we need to explicitly bind uniquely named output ports to the input ports of subsequent steps. It's very analogous to plumbing, and it points to another way that XProc is different from Ant: Ant's tasks depend heavily on the file system for their inputs and outputs, whereas XProc pipelines pass in-memory input and output streams between steps until you explicitly serialize to the file system.
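As a rough sketch of that plumbing (not the pipeline from this post), the following XProc 1.0 fragment runs a stylesheet and then explicitly binds the p:xslt step's "secondary" output port, which carries any xsl:result-document output, to a loop that writes each document to disk. The stylesheet name split.xsl is a hypothetical placeholder, and the example assumes the processor supports the p:base-uri() extension function from the XProc 1.0 specification.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0" name="main">
  <p:input port="source"/>
  <p:input port="parameters" kind="parameter"/>
  <!-- The pipeline's result is bound explicitly to the transform's primary output -->
  <p:output port="result">
    <p:pipe step="transform" port="result"/>
  </p:output>

  <!-- The source and parameters ports bind implicitly to the pipeline's own ports -->
  <p:xslt name="transform">
    <p:input port="stylesheet">
      <p:document href="split.xsl"/> <!-- hypothetical stylesheet -->
    </p:input>
  </p:xslt>

  <!-- Nothing touches the file system until this point -->
  <p:for-each>
    <p:iteration-source>
      <p:pipe step="transform" port="secondary"/>
    </p:iteration-source>
    <!-- Write each secondary document to the URI the stylesheet gave it -->
    <p:store>
      <p:with-option name="href" select="p:base-uri(/*)"/>
    </p:store>
  </p:for-each>
</p:declare-step>
```

The p:pipe elements are the pipes: each one names a step and one of its ports, so a step with several outputs can feed several different consumers.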

With this I was able to create my first "real" XProc pipeline to generate Doxsl output. Here it is:

<!-- version="1.0" is required on a top-level step by the final XProc 1.0 Recommendation -->
<p:declare-step name="doxsl" type="dxp:generate-doxsl-docs"
    psvi-required="false" version="1.0"
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:dxp="urn:doxsl:xproc-pipeline:1.0">

  <!-- Inputs: the stylesheet to be documented, plus any stylesheet parameters -->
  <p:input port="source" kind="document" primary="true" sequence="false"/>
  <p:input port="parameters" kind="parameter" primary="false" sequence="true"/>

  <!-- Outputs: both are piped straight from the "transform" step -->
  <p:output port="result" primary="true" sequence="false">
    <p:pipe step="transform" port="result"/>
  </p:output>
  <p:output port="secondary" primary="false" sequence="true">
    <p:pipe step="transform" port="secondary"/>
  </p:output>

  <!-- Output format: 'dita' (the default) or 'html' -->
  <p:option name="format" select="'dita'"/>

  <!-- Step 1: load the stylesheet that matches the requested format -->
  <p:choose name="select-stylesheet">
    <p:when test="$format='dita'">
      <p:output port="result" primary="true" sequence="false">
        <p:pipe step="load-dita-stylesheet" port="result"/>
      </p:output>
      <p:load name="load-dita-stylesheet">
        <p:with-option name="href" select="'../../dita/doxsl.xsl'">
          <p:empty/>
        </p:with-option>
      </p:load>
    </p:when>
    <p:when test="$format='html'">
      <p:output port="result" primary="true" sequence="false">
        <p:pipe port="result" step="load-html-stylesheet"/>
      </p:output>
      <p:load name="load-html-stylesheet">
        <p:with-option name="href" select="'../../html/doxsl.xsl'"/>
      </p:load>
    </p:when>
  </p:choose>

  <!-- Step 2: run the selected stylesheet against the source document -->
  <p:xslt name="transform">
    <p:input port="source">
      <p:pipe step="doxsl" port="source"/>
    </p:input>
    <p:input port="stylesheet">
      <p:pipe step="select-stylesheet" port="result"/>
    </p:input>
    <p:input port="parameters">
      <p:pipe port="parameters" step="doxsl"/>
    </p:input>
    <p:with-param name="debug" select="'true'"/>
  </p:xslt>
</p:declare-step>



Here's a diagram, built with EMC's XProc Designer.  This tool is a great way to visualize and start your XProc scripts:




Essentially, I used p:declare-step so that the pipeline is declared as a custom step (dxp:generate-doxsl-docs), which allows it to be integrated into other pipelines. It has one option, format, which specifies which output format to generate (for Doxsl, 'html' and 'dita' are supported). The first step ("select-stylesheet") evaluates the format option and loads the appropriate stylesheet into the step's "result" output port. That output feeds the stylesheet port of the second step ("transform"). The transform's source port (which receives the XSLT stylesheet to be documented) is bound to the root step's source port, and its parameters port is likewise bound to the root step's parameters port. I also set the stylesheet's 'debug' parameter to 'true' to inject debugging output into the "transform" step's result port.

All of this is done in memory and not serialized to the file system. This is intentional so that other pipelines can integrate this custom step.
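Here is a minimal sketch of how another pipeline might integrate the step. It assumes the declaration above is saved as generate-doxsl-docs.xpl next to the calling pipeline; that file name and the out/index.html output path are hypothetical, not part of Doxsl.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:dxp="urn:doxsl:xproc-pipeline:1.0"
                version="1.0" name="main">
  <p:input port="source"/>
  <p:input port="parameters" kind="parameter"/>

  <!-- Hypothetical file name for the step declared above -->
  <p:import href="generate-doxsl-docs.xpl"/>

  <!-- The custom step's primary source port binds implicitly to main's source;
       its parameters port is not primary, so it is bound explicitly -->
  <dxp:generate-doxsl-docs name="docs" format="html">
    <p:input port="parameters">
      <p:pipe step="main" port="parameters"/>
    </p:input>
  </dxp:generate-doxsl-docs>

  <!-- Serialization happens here, in the caller; the output path is hypothetical -->
  <p:store href="out/index.html"/>
</p:declare-step>
```

Because the custom step never writes to disk itself, the calling pipeline is free to store the result, hand it to another transformation, or validate it first.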

I've tested this with Calabash. I still need to evaluate with Calumet.

Right now, these are baby steps. I think that XProc has a lot of potential. I think the next big task is to consider an XProc implementation for DITA XML processing.