Jim's Thoughtspot

Saturday, January 9, 2016

It's been a while

It's been a while since I've posted on this blog. A lot has happened in the intervening months (years). Mostly, I've moved forward and backward in the tech world. I've hemmed and hawed over the direction of my career. I've also been somewhat distracted by local events that required my attention. You can label it a higher calling: a change in priorities from a completely geeky world that I have embraced as my own to one that encompasses the future of my geeks-in-training.

Needless to say, I haven't abandoned the world of XML, XSLT, XPath, and XQuery entirely. I've evolved. I've had a gap year (or two) and seen the world outside the comfortable confines of angle brackets and FLWOR expressions, and it has changed me - a bit.

For those who've read some of my posts, I drank the kool-aid in the 90's, and wanted everyone else to share from the same cup. What the last few years have shown is that kool-aid is for kids. It's time to grow up. The technology and content worlds have changed, and I need to change with them.

Primarily, what has changed is my thinking about the role of XML technologies in the landscape. XML has a place, and honestly, it's a very important player in the wide-ranging landscape - just not in the way I perceived it 5, 10, or even 15 years ago. In fact, I'm not sure that Sir Tim Berners-Lee would have envisioned the path that markup languages have taken. Nonetheless, it's time to embrace these changes for what they are.

XML is here to stay. It's mature, it's lived up to its promise of extensibility, and it won't go away.
XML technologies are stable. There is little in the way of implementation variability among different providers now. Whether you are using Java, a classic 'P' (Python, PHP, Perl) language, or any one of the newer languages, they all must honor XML in order to be complete.
Incremental changes in XML technologies are principally to support scale. DOMs are nice, elegant, and easy-to-use structures, but quickly turn into boat-anchors when we attempt to embrace internet-scale data within them. Streaming XML is the new sexy.
Virtually any data model can be represented in XML for a myriad of business purposes with self-describing semantics and the capability to flex its node hierarchy based on the data. For this reason alone, XML has been, and will continue to be, a workhorse. Think about Spring, one of the web's most successful Java frameworks. XML is the underlying data for nearly every part of it.
As a data persistence layer, XML plays well with tabular, relational, and hierarchical structures. With its rich semantics and vendor-agnostic format, XML technologies are powerful, flexible, and scalable. Yes, it's also a great storage model for pernicious content models - like DITA, DocBook, and, gulp, OOXML (I'll shower later for that).
From XML, I can deliver and/or display content/data in virtually any format imaginable, even to the point of backward compatibility to legacy formats (ask me about HTML 3.2/EDGAR transformations sometime).

With all that XML has going for it, what can go wrong? Well, depending on who you ask, the answer will vary. Some criticize XML for not living up to the hype of Web 2.0. XML's initial purpose was to be the "SGML for the web." To some degree, it is, but it is far from ubiquitous. That isn't to say that we didn't try. From XML Data Islands to XMLHttpRequest objects in JavaScript, XML was given first-class status on the web. The problem was (and is) that, as a DOM, extracting data often relied on a lot of additional code to recurse through the XML content. For some, the browser's tools felt like a blunt instrument when finer-grained precision was needed. Eventually, JSON became the lingua franca for web data, and rightfully so.

Perhaps its biggest limitation or failure is the countless attempts to make XML usable for the masses. I'll admit that I was one of the biggest evangelists. I honestly believed that we could build authoring tools that were intuitive and easy to use, backed by powerful semantic markup. We would be able to enrich the web (and by proxy, the world) with content that had meaning - it could be searched intelligently, reused, repurposed, translated, and delivered anywhere. As one of my friends and mentors, Eric Severson, said, XML has the capability of making content personal, designed for a wide audience and personalized for an "audience of one."

Intrinsically, I still have some faith in the idea, but the implementation never lived up to the hype. For over twenty years, we've tried to build tools that could manage XML authoring workflows from creation to delivery. Back in the late 90's and early 2000's, I remember evangelizing for XML authoring solutions to a group of technical writers for a big technology firm. I was surprised by the resistance and pushback I got. Despite the benefits of XML authoring, the tools were still too primitive, and instead of making the writers more productive, they slowed them down. Nevertheless, I kept evangelizing like Linus in the pumpkin patch.

Eventually, the tools did improve. They did make authoring easier... for some. What we often glossed over was the level of effort required to make the tools easier to use. Instead of being tools that could be used by virtually anybody who didn't want to see angle brackets (tools for the masses), we made built-for-purpose applications. For folks like me who understood the magical incantations and sorcery behind these applications, they were fantastic. They were powerful. They also came with a hefty price tag. And, because they were often heavily customized, users were locked in to the tools, the content model, and the processes designed to support it.

Even if we attempted to standardize on the grammar to enable greater interchange, it still required high priests and wizards to make it work. The bottom line is that the cost of entry is just too high for many. The net result is that XML authoring is a niche, specialized craft left to highly trained technical writers and the geekiest of authors.

Years ago, I read Thomas Kuhn's The Structure of Scientific Revolutions. The main thesis is that we continue to practice our crafts under well-accepted theory. Over time, through the course of repeated testing, anomalies emerge. Initially, we discard these anomalies, but as they continue to accumulate, we realize that we can't ignore them anymore. New theories emerge. However, we reject these new ideas and vigorously argue that the old theories are still valid, until enough evidence disproves them entirely. At that moment, a new paradigm emerges.

We are at that moment of paradigmatic shift. No longer can XML be thought of as a universal theory of information and interchange. Instead, we need to reshape our thinking to accept that XML solves many difficult problems, and has a place in our toolbox of technology, but other technologies and ideas are emerging that are easier, cheaper, faster methods for content authoring. For many, the answers to "intelligent content" aren't about embedding semantics within the content, but rather about extending it with rich metadata that lives as a wrapper around the content and that can be dynamic, contextual, and mutable.

Before I'm labeled a heretic, let me be clear. XML isn't going away, nor is it inherently a failed technology. Quite the opposite. Its genius is in its relative simplicity and its flexibility to be used effectively in a vast number of technologies. The difference is that we've learned that we could never get enough momentum behind the idea of XML as a universal data model for content authoring, and it was too cumbersome for web browsers to manipulate. We have other tools for that.

Friday, October 25, 2013

Intellectual Property Affects K-12 Students Too

My oldest daughter was asked, through her school, to enter some photos she created in a national contest sponsored by the National Parent Teacher Association (PTA). Last night they sent home a waiver form that I had to sign. After my daughter read the waiver, she was concerned and asked me to look at it. After looking it over, I was a bit alarmed. The provision that raised red flags for me was:

I grant to PTA an irrevocable, unlimited license to display, copy, sublicense, publish, and create and sell derivative works from my work submitted for the Reflections Program.

OK. I'm under no delusion that my daughter or any other student should be paid or recompensed for submitting to a contest, nor am I contesting the PTA's right to redistribute the works or create derivatives from them. What I am contesting is that there isn't a single provision in the waiver stating that they will do so on the condition of proper attribution to the author for the original and any derivative works. Let me go on the record by saying that I don't believe that the PTA would ever act in a malicious way, nor are they trying to profit from students' creative work. In fact, the opposite is quite true - they are encouraging kids to be creative, and I applaud that heartily. Nonetheless, after working with numerous publishers on IP Rights issues, I know this is a sticky issue. My main point is that the fact that my daughter is in the K-12 school system and participating in a school function doesn't mean that any creative endeavor she pursues shouldn't be protected.

The way out, in my view, is that the PTA should seriously consider having any waiver for this activity governed by a Creative Commons Attribution license. It basically states that the author of the work grants others the rights to use, sell, and derive from the work, provided that the user includes proper attribution to the author of that work. It gives the PTA broad rights on how it can use these creative works, without the kids (including my daughter) giving up all their rights to the work entirely.

For me, this is just another indicator that IP Rights are becoming more and more important, and that we need technology (ODRL, and other platforms) to support it. We've built such technology.

Tuesday, September 24, 2013

XML Schemas and the KISS Principle

I recently had the opportunity to work on an interesting XML schema. The intent was to create an HTML5 markup grammar for creating digital content, primarily for EPUB and the web, then ultimately for print. The primary design goal is to create an authoring grammar that facilitates some level of semantic tagging and that is natively HTML5 compliant, i.e., there is no transformation required to move between the authoring format and HTML5.

What is interesting about this particular schema is that it resembles design patterns used for microformats. The markup semantics for typographic structures such as a bibliography or a figure are tagged with standard HTML elements, with additional typographic semantics expressed using the class attribute. For example, a figure heading structure must look like the following:

<figure>
    <h2><span class="caption">Figure </span>
    <span class="caption_number">1.1 </span>Excalibur and the Lady of the Lake</h2>
</figure>

Notice the <span> tags. From the perspective of describing our typographic semantics (figures must have captions and captions must have a number), this isn’t too bad. However, from a schema perspective, it’s much more complex, because the underlying HTML5 grammar is quite complex at the level of <div>, <h2>, and <span> elements. In addition to the required “caption” and “caption_number” semantics applied to the <span> tag, the <h2> element also allows text and other inline flow elements, such as <strong>, <em>, and, of course, other <span> tags that apply other semantics.

To enforce the mandate that a figure heading must have a label and number as the first two nodes of the <h2> element, we can use XML Schema 1.1 assertions. Assertions allow us to apply business rules to the markup that cannot be expressed directly in the content model sequences. Assertions allow us to use a limited subset of XPath axes and functions that return a boolean result.

Alternately, Schematron could be used independently (or in addition to assertions) as a means of enforcing the business rules in the markup. The issue here is that a Schematron rule set resides outside of the XML schema, therefore requiring additional tooling integration from the authoring environment to apply these rules.

So, for our heading above, we must apply the following assertion:

<xs:assert test="child::h2/node()[1][@class='caption']/following-sibling::span[@class='caption_number']"/>

In this case, the assertion states that the <h2> element’s first node must have a class attribute value of “caption”, and that it must be followed by a sibling <span> element with a class attribute value of “caption_number.” After that, any acceptable text or inline element defined by the HTML5 grammar is allowed.
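
To get a feel for what this test accepts and rejects, here is a minimal sketch that evaluates the same expression with the XPath engine that ships with the JDK. It is only an approximation: the JDK engine implements XPath 1.0, whereas an XML Schema 1.1 validator evaluates assertions itself, and the class and method names (FigureHeadingCheck, headingIsValid) are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of the schema tooling.
class FigureHeadingCheck {

    // Same expression as the xs:assert test, evaluated relative to <figure>.
    static final String TEST =
        "child::h2/node()[1][@class='caption']"
        + "/following-sibling::span[@class='caption_number']";

    static boolean headingIsValid(String figureXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(figureXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // A non-empty node-set converts to true, an empty one to false.
            return (Boolean) xpath.evaluate(
                TEST, doc.getDocumentElement(), XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate the test", e);
        }
    }

    public static void main(String[] args) {
        String good = "<figure><h2><span class=\"caption\">Figure </span>"
            + "<span class=\"caption_number\">1.1 </span>"
            + "Excalibur and the Lady of the Lake</h2></figure>";
        // Here the heading starts with text, so node()[1] is a text node.
        String bad = "<figure><h2>Excalibur and the Lady of the Lake "
            + "<span class=\"caption\">Figure </span>"
            + "<span class=\"caption_number\">1.1 </span></h2></figure>";
        System.out.println(headingIsValid(good)); // true
        System.out.println(headingIsValid(bad));  // false
    }
}
```

One thing the sketch makes visible: following-sibling::span matches any later sibling, so a heading with other inline markup between the two spans would still pass this particular test.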

This is a very simple example of how the existing HTML5 grammar alone cannot enforce the semantic structure we wish to express. There are numerous other examples within the content model that would leverage the same design pattern.

We have done several successful projects with this approach and the value of having a single authoring/presentation grammar (HTML 5) is very appealing. However, there can be issues and difficulties with this approach. Consider:

Microformats are clever applications that give semantic meaning to simple HTML formatting tags. It’s valid HTML by virtue of tags and attributes, with additional semantics expressed through the value of certain attributes such as the class attribute. In general, these microformat markup documents are small, discrete documents, as they are intended to be machine readable to give the application its functionality. From an authoring perspective, it’s relatively simple to create a form that captures the essential data that is processed by machine to generate the final microformat data (or for the markup and microformat savvy, create it by hand – but we are in the minority). Think of microformat instances as small pieces of functionality embedded as a payload within a larger document that are only accessed by applications with a strong understanding of the format. If we take the notion of microformats and use them throughout a document, we can run into tooling issues, because we’re now asking a broader range of applications (e.g. XML editors) to understand our microformat.
The “concrete” structural semantics (how to model figures and captions) are specified with “abstract” formatting HTML tags. Conflating presentation and structural semantics in this way is contrary to a very common design principle in use today in many languages and programming frameworks, namely to separate the semantics/structure from the formatting of content.
The schema’s maintainability is decreased by the vast number of assertions that must be enforced for each typographical structure. Any changes to any one structure may have ripple effects to other content classes.
Not all XML authoring tools are created equal. Some don’t honor assertions. Others do not support XML Schema 1.1 at all. Consequently, your holistic XML strategy becomes significantly more complex to implement. It might mean maintaining two separate schemas, and it might also mean additional programming is required to enforce the structural semantics that we wish to be managed in the authoring tool.
As a corollary to the previous point, creating a usable authoring experience will require significant development overhead to ensure users can apply the right typographical structures with the correct markup. It could be as simple as binding templates with menus or toolbars, but it could easily extend into much more. Otherwise, the alternative is to make sure you invest in authors/editors who are trained extensively to create the appropriate markup. Now consider point #3. Any changes to the schema have ripple effects to the user experience also.
Instead of simplifying the transformation process, tag overloading can have the reverse effect. You end up having to create templates for each and every class value, and it’s not difficult to end up with so many permutations that an ambiguous match results in the wrong output. Having gone down this road with another transformation pipeline for another client, I can tell you that unwinding this is not a trivial exercise (I’ll share this in another post).
Assertion violation messages coming from the XML parser are extremely cryptic:
```
cvc-assertion: Assertion evaluation ('child::node()[1]/@class='label'') for element 'summary' on schema type 'summary.class' did not succeed.
```
For any non-XML savvy practitioners, this kind of message is the precursor to putting their hands up and calling tech support. Even if you use something like Schematron on the back end to validate and provide more friendly error messages, you’ve already made the system more complex.
It violates the KISS principle. The schema, at first glance, appears to be an elegant solution. If used correctly, it mitigates what is a big problem for publishers: How do I faithfully render the content to appear as prescribed in the content? Theoretically, this schema would only require very light transformation to achieve the desired effect. Yet, it trades one seemingly intractable problem for several others that I’ve described above.

Several years ago, I recommended using microformats as an interoperability format for managing content between DITA, DocBook, and other XML markups. The format was specifically designed to be generated and read by a set of XSLT stylesheets that do the heavy lifting of converting between standards. The real benefit is that you create a transformation once for each input and output, rather than building “one-off” transformations for each version of the standard. Once in the converted markup, the content could leverage its transformations to produce the desired output.

I think the key distinction is that the XML Interoperability Framework was never intended to be an authoring medium. Users would create content in the target format, using the tools designed for that format. This schema’s strategy is to author directly into the interop format, and the unintended consequences described above only make the complexity of implementing, using, and maintaining it far greater than it needs to be. Sometimes, cutting out the middle man is not cheaper or easier.

Here’s another alternative to consider:

A meaning for everything: create a schema with clear, discrete semantics with specific content models for each structure. Yes, it explicitly means you have to create stylesheets with some greater degrees of freedom to support the output styling you want, and perhaps it’s always a one-off effort, but overall, it’s easier to manipulate a transformation with overrides or parameters than trying to overload semantics.
For example, consider our example above: if we want to mandate that a figure heading must have a caption label and a caption number, then semantically tagging them as such gives you greater freedom for your inline tagging markup like <span>. Using this principle, I could see a markup like the following:
```
<figure> 
    <figtitle>
        <caption_label>Figure</caption_label> 
        <caption_number>1.1</caption_number> 
        Excalibur and the Lady of the Lake 
    </figtitle> 
</figure> 
```
Which might be rendered in HTML5 as:
```
<figure> 
    <h2>
        <span class="caption">Figure </span> 
        <span class="caption_number">1.1 </span> 
        Excalibur and the Lady of the Lake 
    </h2> 
</figure>
```
That also allows me to distinguish figure headings from other types of headings that have different markup requirements. For example, a section title might not have the same caption and numbering mandate:
```
<section> 
    <title>The Relationship Between Arthur and Merlin</title> 
    <subtitle>Merlin as Mentor</subtitle> 
    ... 
</section>
```
Which might be rendered in HTML5 as:
```
<section> 
    <h1>The Relationship Between Arthur and Merlin</h1> 
    <h2>Merlin as Mentor</h2> 
    ... 
</section>
```
Notice that in both cases we’re not throwing all the HTML5 markup overboard (figure and section are HTML5 elements), we’re just providing more explicit semantics that model our business rules more precisely. Moreover, it’s substantially easier to encapsulate and enforce these distinctive models in the schema, without assertions or Schematron rules, unless there are specific business rules within the text or inline markup that must be enforced independently from the schema.
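
As a rough illustration of how light that downstream transformation can be, here is a minimal sketch that maps the explicit figure markup to the HTML5 rendering using a small XSLT 1.0 stylesheet run through the JDK's built-in transformer. The stylesheet, the class name FigureToHtml5, and the method toHtml5 are assumptions made for this example, not the pipeline actually used on the project.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: stylesheet and names are illustrative only.
class FigureToHtml5 {

    static final String XSLT = String.join("\n",
        "<xsl:stylesheet version=\"1.0\"",
        "    xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">",
        "  <xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>",
        "  <!-- Identity rule: markup that is already HTML5 passes through. -->",
        "  <xsl:template match=\"@*|node()\">",
        "    <xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>",
        "  </xsl:template>",
        "  <!-- One small override per explicit semantic element. -->",
        "  <xsl:template match=\"figtitle\">",
        "    <h2><xsl:apply-templates/></h2>",
        "  </xsl:template>",
        "  <xsl:template match=\"caption_label\">",
        "    <span class=\"caption\"><xsl:apply-templates/></span>",
        "  </xsl:template>",
        "  <xsl:template match=\"caption_number\">",
        "    <span class=\"caption_number\"><xsl:apply-templates/></span>",
        "  </xsl:template>",
        "</xsl:stylesheet>");

    static String toHtml5(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String source = "<figure><figtitle>"
            + "<caption_label>Figure</caption_label> "
            + "<caption_number>1.1</caption_number> "
            + "Excalibur and the Lady of the Lake</figtitle></figure>";
        System.out.println(toHtml5(source));
    }
}
```

Each explicit element gets exactly one template, so there is no class-value matching to disambiguate, and styling variations can be layered on with overrides or parameters.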

Of course, if you change the schema, you may also have to make changes in the authoring environment and/or downstream processing. However, that would be true in either case. And, irrespective of whether I use an HTML5-like or a semantically-explicit schema, I still need to apply some form of transformation on content written against earlier versions of the schema to update to the most current version. The key takeaway is that there is little in the way of development savings with the HTML5 approach.
Design the system with the author as your first priority. For example, most XML authoring tools make it easy by inserting the correct tags for required markup (e.g., our figure heading), especially when each tag’s name is distinct. Many of these same tools also provide functionality to “hide” or “alias” the tags in a way that’s more intuitive to use. Doing this in an overloaded tagging approach will require a lot more development effort to provide the same ease of use. Without that effort, and left to their own devices, authors are going to struggle to create valid content, and you are almost certain to have a very difficult time with adoption.
Recognize that tools change over time. The less you have to customize to make the authoring experience easy, the more likely you can take advantage of new features and functionality without substantial rework, which also means lower TCO and subsequently, higher ROI.
Back end changes are invisible to authors. By all means, it’s absolutely vital to optimize your downstream processes to deliver content more efficiently and to an ever-growing number of digital formats. However, the tradeoffs for over-simplifying the backend might end up costing more.

HTML5 will become the base format for a wide range of digital media, ranging from EPUB to mobile and the web. On the surface, it would appear that using HTML5 makes sense as both a source format and a target format. The idea has a lot of appeal particularly because of the numerous challenges that still persist today with standard or custom markup grammars that have impacted both authoring and backend processes.

Microformats’ appeal is the ability to leverage a well-known markup (HTML) to create small, discrete semantic data structures targeted for applications with a strong understanding of the format. Leveraging the simplicity of HTML5, we had hoped to create a structured markup that was easy to use for content creation, and with little to no overhead on the back end to process and deliver the content. However, I discovered that it doesn’t scale well when we try applying the same design pattern to a larger set of rich semantic structures within a schema designed for formatting semantics.

Instead, the opposite appears to be true: I see greater complexity in the schema design due to the significant overloading of the class attribute to imply semantic meaning. I also see limitations in current XML authoring tools to support a schema with that level of complexity, without incurring a great deal of technical debt to implement and support a usable authoring environment.

I also discussed how implementing an HTML5 schema with overloaded class attributes likely won’t provide development savings compared to more semantically-explicit schemas when changes occur. In fact, the HTML5 schema may incur greater costs due to its dependency on assertions or Schematron to enforce content rules.

Rather than overloading tags with different structural semantics, an alternative might be the use of a “blended” model. Leverage HTML5 tags where it makes sense: article, section, figure, paragraphs, lists, inline elements, and so on. Where there are content model variations or the need for more constrained models, use more explicit semantics. This kind of approach takes advantage of built-in features and functionality available in today’s XML authoring tools, and reduces the level of programming or training required. Also, the underlying schema is much easier to maintain long term. Of course, there are trade-offs in that back-end processing pipelines must transform the content. However, with the right level of design, the transformations can be made flexible and extensible enough to support most output and styling scenarios. With this in mind, this type of tradeoff is acceptable if the authoring experience isn’t compromised.

Tuesday, July 24, 2012

Enumerated Constants in XQuery

I’ve been working on a little project that allows me to merge my love of baseball with my knowledge of XML technologies. In the process of working through this project, I am creating XQuery modules that encapsulate the logic for the data. Part of the data that I’m looking at must account for different outcomes during the June amateur draft.

It turns out that the MLB June Amateur draft is quite interesting in that drafting prospects is a big gamble. Draftees may or may not sign in any given year, and unsigned players remain eligible for drafts in subsequent years. If they don’t sign during that year, they could be drafted by another team in following years. Alternately, they could be selected by the same team and signed. However, even if they do sign, there’s no guarantee that they’ll make it to the big leagues. And even if they do, they might not make it with the same team they signed with initially (in other words, they were traded before reaching the MLB).

In effect, there are several scenarios, depending on how the data is aggregated or filtered. However, these scenarios are well defined and constrained to a finite set of possibilities:

All draft picks
All signed draft picks
All signed draft picks who never reach the MLB (the vast majority don’t)
All signed draft picks who reached the MLB with the club that signed them
All signed draft picks who reached the MLB with another club
All unsigned draft picks
All unsigned draft picks who reached the MLB with a different club
All unsigned draft picks who reached the MLB with the same club, but at a later time
All unsigned draft picks who never reach the MLB

All of these scenarios essentially create subsets of information that I can work with, depending on whether I’m interested in analyzing a single draft year or all draft years in a range. They’re essentially the same queries, with minor variations to filter for a specific scenario.

Working with various strongly typed languages like C# or Java, I would use a construct like an enum to encapsulate these possibilities into one object. Then I can pass this into a single method that will allow me to conditionally process the data based on the specified enum value. Pretty straightforward. For example, in C# or Java I would write:

public enum DraftStatus {
   ALL,  //All draft picks (signed and unsigned)
   UNSIGNED, //All unsigned draft picks
   UNSIGNED_MLB, //All unsigned picks who made it to the MLB
   SIGNED,  //All signed draft picks
   SIGNED_NO_MLB, //Signed but never reached the MLB
   SIGNED_MLB_SAME_TEAM, //signed and reached MLB with the same team
   SIGNED_MLB_DIFF_TEAM  //signed and reached with another club   
};

The important aspect of enumerations is that each item in an enumeration can be descriptive and also map to a constant integer value. For example, UNSIGNED is much more intuitive and meaningful than 1, even though they are equivalent.
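
To make the "single method" idea concrete, here is a minimal Java sketch. The Pick class, its fields, and the sample players are all assumptions invented for illustration (they are not my real data model), and in Java the integer behind each name is reached through ordinal() rather than by treating the enum as an int.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: Pick and its fields are assumed for illustration.
class DraftFilterSketch {

    enum DraftStatus {
        ALL, UNSIGNED, UNSIGNED_MLB, SIGNED, SIGNED_NO_MLB,
        SIGNED_MLB_SAME_TEAM, SIGNED_MLB_DIFF_TEAM
    }

    static final class Pick {
        final String name;
        final boolean signed;
        final int games;         // MLB games played; 0 means never reached the MLB
        final String draftTeam;
        final String debutTeam;  // null when the player never debuted

        Pick(String name, boolean signed, int games,
             String draftTeam, String debutTeam) {
            this.name = name;
            this.signed = signed;
            this.games = games;
            this.draftTeam = draftTeam;
            this.debutTeam = debutTeam;
        }
    }

    // One method, one switch: the enum value selects the filter rule.
    static boolean matches(Pick p, DraftStatus status) {
        boolean reachedMlb = p.games > 0;
        boolean sameTeam = reachedMlb && p.draftTeam.equals(p.debutTeam);
        switch (status) {
            case ALL:                  return true;
            case UNSIGNED:             return !p.signed;
            case UNSIGNED_MLB:         return !p.signed && reachedMlb;
            case SIGNED:               return p.signed;
            case SIGNED_NO_MLB:        return p.signed && !reachedMlb;
            case SIGNED_MLB_SAME_TEAM: return p.signed && sameTeam;
            case SIGNED_MLB_DIFF_TEAM: return p.signed && reachedMlb && !sameTeam;
            default:
                throw new IllegalArgumentException("unknown status: " + status);
        }
    }

    static List<Pick> filter(List<Pick> picks, DraftStatus status) {
        List<Pick> result = new ArrayList<>();
        for (Pick p : picks) {
            if (matches(p, status)) {
                result.add(p);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data, not real draft results.
        List<Pick> picks = Arrays.asList(
            new Pick("Player A", true, 120, "Rockies", "Rockies"),
            new Pick("Player B", true, 0, "Rockies", null),
            new Pick("Player C", false, 45, "Rockies", "Mets"));
        System.out.println(filter(picks, DraftStatus.SIGNED_MLB_SAME_TEAM).size()); // 1
        System.out.println(DraftStatus.UNSIGNED.ordinal()); // 1
    }
}
```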

Working with XQuery, I don’t have the luxury of an enumeration. Well, at least in the OOP sense. I could write separate functions for each of the scenarios above, with each performing its specific query and returning the desired subset I need. But that’s just added maintenance down the road.

At first I toyed with the idea of using an XML fragment containing a list of elements that mapped the element name to an integer value:

<draftstates>
    <ALL>0</ALL>
    <UNSIGNED>1</UNSIGNED>
    <UNSIGNED_MLB>2</UNSIGNED_MLB>
    <SIGNED>3</SIGNED>
    <SIGNED_NO_MLB>4</SIGNED_NO_MLB>
    <SIGNED_MLB>5</SIGNED_MLB>
    <SIGNED_MLB_SAME_TEAM>6</SIGNED_MLB_SAME_TEAM>
    <SIGNED_MLB_DIFF_TEAM>7</SIGNED_MLB_DIFF_TEAM>
</draftstates>

And then using a variable declaration in my XQuery:

module namespace ds="http://ghotibeaun.com/mlb/draftstates";
declare variable $ds:draftstates := collection("/mlb")/draftstates;

To use it, I need to cast the element value to an integer. Using an example, let's assume that I want all signed draftees who reached the MLB with the same team:

declare function gb:getDraftPicksByState($draftstate as xs:integer, $team as xs:string) as item()* {
   let $picks := 
       if ($draftstate = 
           xs:integer($ds:draftstates/SIGNED_MLB_SAME_TEAM)) then
           let $results := 
               /drafts/pick[Signed="Yes"][G != 0][Debut_Team=$team]
           return $results
       (: more cases... :)
       else ()
   return $picks
};

(:call the function:)
let $sameteam := 
    gb:getDraftPicksByState(xs:integer($ds:draftstates/SIGNED_MLB_SAME_TEAM), 
                            "Rockies")
return $sameteam

It works, but it’s not very elegant. Every value in the XML fragment has to be extracted through the xs:integer() function, which adds logic and makes the code less readable. On top of that, code completion (and code hinting) in IDEs like Oxygen doesn’t work with this approach.

What does work well (at least in Oxygen, and I suspect in other XML/XQuery IDEs) is code completion for variables and functions, which led me to another idea. Prior to Java 5, there weren’t enum structures. Instead, enumerated constants were created through the declaration of constants encapsulated in a class:

public class DraftStatus {
    public static final int ALL = 0;
    public static final int UNSIGNED = 1;
    public static final int UNSIGNED_MLB = 2;
    public static final int SIGNED = 3;
    public static final int SIGNED_NO_MLB = 4;
    public static final int SIGNED_MLB = 5;
    public static final int SIGNED_MLB_SAME_TEAM = 6;
    public static final int SIGNED_MLB_DIFF_TEAM = 7;   
}

This allowed static access to the constant values via the class, e.g., DraftStatus.SIGNED_MLB_SAME_TEAM.
The same principle can be applied to XQuery. Although there isn’t the notion of object encapsulation by class, we do have encapsulation by namespace. Likewise, XQuery supports code modularity by allowing little bits of XQuery to be stored in individual files, much like .java files. To access class members, you (almost always) have to import the class into the current class. The same is true in XQuery. You can import various modules into a current module by declaring the referenced module’s namespace and location.
Using this approach, we get the following:

mlbdrafts-draftstates.xqy

xquery version "1.0";

module namespace ds="http://ghotibeaun.com/mlb/draftstates";

declare variable $ds:ALL as xs:integer := 0;
declare variable $ds:UNSIGNED as xs:integer := 1;
declare variable $ds:UNSIGNED_MLB as xs:integer := 2;
declare variable $ds:SIGNED as xs:integer := 3;
declare variable $ds:SIGNED_NO_MLB as xs:integer := 4;
declare variable $ds:SIGNED_MLB as xs:integer := 5;
declare variable $ds:SIGNED_MLB_SAME_TEAM as xs:integer := 6;
declare variable $ds:SIGNED_MLB_DIFF_TEAM as xs:integer := 7;

Now we reference this in another module:

import module namespace ds="http://ghotibeaun.com/mlb/draftstates" at "mlbdrafts-draftstates.xqy";

Which gives us direct access to all the members, much like an enumeration; for example, $ds:SIGNED_MLB_SAME_TEAM can be passed wherever the integer 6 is expected.

The bottom line is that this approach has worked really well for me. I can use descriptive constant names that map to specific values throughout my code, and it shows how you can add a little rigor to your XQuery coding.

Tuesday, January 17, 2012

A First Look at ODRL v2

With other things taking high priority over the last 6 months, this is the first opportunity I’ve had to look at the progression of ODRL Version 2.0 and evaluate where it has improved on the earlier versions.

First things first, ODRL has migrated to the W3C as a Community Group. Overall, this is a good thing. It opens it up to the wider W3C community, gives greater credence to the effort and, more importantly, gives it more exposure. Well done.

On to my first impressions:

1. The model has been greatly simplified. With ODRL 1.x, it was possible to express the same rights statement in several different ways. The obvious implication was that it was virtually impossible to build a generalized API for processing IP Rights, save running XJC on the schema, which isn't necessarily always what I want. It wasn’t all bad news, though: the 1.x extension model was extremely flexible and enabled the model to support additional business-specific rights logic.

2. Flexible Semantic Model. The 2.0 model has a strong RDF-like flavor to it. Essentially, all of the entities, assets, parties (e.g., rightsholders, licensees), permissions, prohibitions, and constraints are principally URI-based resource pointers that attach semantics to each of the entities. Compared to 1.x, this is a vast improvement over its tag-based semantics, which meant that you were invariably extending either the ODRL content model, data dictionary, or both.

3. Needs More Extensibility. The current normative schema, still in draft, does need some additional design. Out-of-the-box testing with Oxygen shows that only one element is exposed (policy). All of the other element definitions are embedded within the complexType models, which makes it difficult to extend the model with additional structural semantics. This is extremely important on a number of fronts:

The current model exposes assets as explicit members of a permission or prohibition. Each “term” (i.e., permission or prohibition) is defined by an explicit action (print, modify, sell, display). It’s not uncommon to have a policy that covers dozens or hundreds of assets. So for each term, I have to explicitly call out each asset. This seems a little redundant. The 1.x model had the notion of terms that applied to all declared assets at the beginning of the policy (or in the 1.x semantics, rights). I’d like to see this brought back into the 2.0 model.
The constraint model is too flat. The new model is effectively a tuple of: constraintName, operator, operand. This works well for simple constraints like the following pseudo-code: “print”, “less than”, “20000”, but doesn’t work well for situations where exceptions may occur (e.g., I have exclusive rights to use the asset in the United States until 2014, except in the UK; or I have worldwide rights to use the asset in print, except for North Korea and the Middle East). Instead, I have to declare the same constraint twice: once within a permission, and a second time as a prohibition. I’d like the option to extend the constraint model to enable more complex expressions like the ones above.

Additionally, list values within constraints are expressed as tokenized strings within the rightOperand attribute. While it is completely valid to store values this way, I have a nit against these types of token lists, especially if the set of values is particularly long, as it can be for things like countries using ISO 3166 codes.
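To make the flat tuple and the token list concrete, here is a minimal Java sketch. Everything in it is an assumption made for illustration: the class names (Constraint, Term, FlatConstraintDemo), the "spatial" and "eq" strings, the space-separated token format, and the rule that a matching prohibition overrides a permission. None of it comes from the ODRL schema or from any ODRL library. It shows how a carve-out such as "print in these countries, except North Korea" forces the same constraint to appear twice, once under a permission and once under a prohibition.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the (constraintName, operator, operand) tuple.
final class Constraint {
    final String name;         // e.g. "spatial" (illustrative)
    final String operator;     // e.g. "eq" (illustrative)
    final String rightOperand; // one value, or a space-separated token list

    Constraint(String name, String operator, String rightOperand) {
        this.name = name;
        this.operator = operator;
        this.rightOperand = rightOperand;
    }

    // Splits the tokenized operand into individual values.
    List<String> operandTokens() {
        return Arrays.asList(rightOperand.trim().split("\\s+"));
    }
}

// Hypothetical "term": a permission or prohibition for one action.
final class Term {
    final boolean isPermission; // false means prohibition
    final String action;
    final Constraint constraint;

    Term(boolean isPermission, String action, Constraint constraint) {
        this.isPermission = isPermission;
        this.action = action;
        this.constraint = constraint;
    }
}

class FlatConstraintDemo {
    // Assumed rule: a use is allowed if some permission matches the country
    // and no prohibition for the same action matches it.
    static boolean allowed(List<Term> terms, String action, String country) {
        boolean granted = false;
        for (Term t : terms) {
            if (!t.action.equals(action)) {
                continue;
            }
            boolean matches = t.constraint.operandTokens().contains(country);
            if (matches && !t.isPermission) {
                return false; // the exception wins
            }
            if (matches) {
                granted = true;
            }
        }
        return granted;
    }

    public static void main(String[] args) {
        // The spatial constraint has to be stated twice to express one exception.
        List<Term> terms = Arrays.asList(
            new Term(true, "print", new Constraint("spatial", "eq", "US GB KP")),
            new Term(false, "print", new Constraint("spatial", "eq", "KP")));

        System.out.println(allowed(terms, "print", "US")); // true
        System.out.println(allowed(terms, "print", "KP")); // false
        System.out.println(allowed(terms, "print", "FR")); // false
    }
}
```

A nested constraint model could carry the exception inside a single term; with the flat tuple, the consumer has to reconcile the two terms itself, as the allowed method does here.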

I shouldn’t have to extend the whole complexType declaration in order to extend the model with my own semantics. However, the current schema is structured that way. Instead, I’d like to see each entity type exposed as an “abstract” element, bound to a type, which ensures that my extension elements would have to at least conform to the base model.

Takeaways

I’m looking forward to using this with our Rights Management platform. The model is simple and clean and has a robust semantics strategy modeled on an RDF-like approach. This will make it easier to use the out-of-the-box model. That said, it’s missing some key structures that would make it easier to use and extend if I have to, but that can be addressed with a few modifications to the schema. (I have taken a stab at refactoring to test this theory; it’s pretty clean and I’m able to add my “wish list” extensions with very little effort.)

Link: http://dl.dropbox.com/u/29013483/odrl-v2-proposed.xsd

Saturday, December 31, 2011

Parallels between Punk and Anonymous

Prologue: Before starting my career in the tech world 15+ years ago, I was a graduate student in Sociology studying political movements and economies.

At any rate, what’s intriguing about technology is not only about 0s or 1s, data structures, angle brackets, optimized queries or distributed architectures (don’t get me wrong, I love elegant code and design as much as any other geek) – it’s also the intended and unintended consequences it has on society at large. As the automobile and large manufacturing re-shaped our society a hundred years ago, the internet and all of the emerging technologies are transforming our social interactions today.

2011 was a landmark year. We saw “Arab Spring” unfold before us in large part because of mobile devices and social media (granted, the other necessary ingredients – anger, resentment, disenfranchisement, chronic poverty and unemployment – have been brewing for many years). The “Occupy” movement harnessed the same political, social, economic, and technological ingredients along with a sprinkling of hyper-aggressive tactics of the NYPD and transformed a seemingly innocuous protest into a worldwide meme. WikiLeaks, rightly or not, also changed the way we view government, particularly when sensitive or embarrassing information is exposed. And to that end, this year demonstrated that the combination of mobile and social technology meant that information could spread virally, beyond the full control of any one entity. This has spurred new tensions between individuals who interact with data and entities who provide and/or control the data.

In this case, I see many interesting parallels between the Punk subculture of the 1970s and early 1980s and the nascent subculture of Anonymous that is growing today. Both emerged during periods of economic turmoil, and both share a strong anti-authoritarian sentiment and a willingness to challenge the current establishment.

I love the Sex Pistols (and the Smiths, the Cure, The Damned, Siouxsie and the Banshees, and so on, and on, etc.). I can listen to “Anarchy in the UK”, “God Save the Queen”, or “Pretty Vacant” any time. It’s loud and raucous. It’s fun. It’s… well, rebellious. Johnny Rotten’s menacing, sarcastic vocals epitomized the political, social and philosophical undertones of the Punk subculture of the mid-to-late 1970s.

From many accounts, the Punk subculture, particularly in the UK, emerged during the mid-1970s in part because of the poor economy. Disenfranchised youths with few economic prospects gravitated to a style of music and dress that was non-conformist by nature and expressed their anger and frustration against society and government.

The ethos, or ideology of Punk is well described here (source: http://www.bunnysneezes.net/page192.html):

It is passionate, preferring to encounter hostility rather than complacent indifference; working class in style and attitude if not in actual socio-economic background; defiant, unconventional, bizarre, shocking; starkly realistic, anti- euphemism, anti-hypocrisy, anti-bullshit, anti-escapist, happy to rub people's noses in realities they don't wish to acknowledge; angry, aggressive, confrontational, tough, willing to fight — yet this stance is derived from an underlying vulnerability, for the archetypal Punk is young, small, poor, and powerless, and he knows it very well; sceptical, especially of authority, romance, business, school, the mass media, promises, and the future; socially critical, politically aware, pro-outlaw, anarchistic, anti-military; expressive of feelings which polite society would censor out; anti-heroic, anti-"rock star" ("Every musician a fan and every fan in a band!"); disdainful of respectability and careerism; night-oriented; with a strong, ironic, satirical (often self-satirical), put-on-loving sense of humor, which is its saving grace; stressing intelligent thinking and deriding stupidity; frankly sexual, frequently obscene; apparently devoted to machismo, yet welcoming "tough" females as equals (and female Punks are often as defiant of the males as of anyone else) and welcoming bisexuals, gays, and sexual experimentation generally; hostile to established religions but sometimes deeply spiritual; disorganized and spontaneous, but highly energetic; above all, it is honest.

Compare this to the first two parts of Quinn Norton’s (Wired Magazine) well-done analysis of Anonymous in “Anonymous: Beyond the Mask” (Part 1 here: http://www.wired.com/threatlevel/2011/11/anonymous-101/all/1; Part 2 here: http://www.wired.com/threatlevel/2011/12/anonymous-101-part-deux/). One of the first things this series does incredibly well is to identify Anonymous for what it is – a culture, or more accurately, a counter-culture.

Quinn goes on to describe the Anonymous culture in terms that echo Punk:

The birthplace of Anonymous is a website called 4chan founded in 2003, that developed an “anything goes” random section known as the /b/ board.
…
Like Alan Moore’s character V who inspired Anonymous to adopt the Guy Fawkes mask as an icon and fashion item, you’re never quite sure if Anonymous is the hero or antihero. The trickster is attracted to change and the need for change, and that’s where Anonymous goes. But they are not your personal army – that’s Rule 44 – yes, there are rules. And when they do something, it never goes quite as planned. The internet has no neat endings.

What’s more, both are media savvy in their own ways, leveraging the media of their day for their own purposes. Obviously, in the ‘70s and ‘80s, the internet wasn’t even a twinkle in our eyes yet, so Punks relied on print and radio (typically either on small, low-band college stations or on pirate radio stations, since mainstream radio stations wouldn’t give them airplay) to get their message out. Anonymous, however, has the luxury of the internet and search engines, where information is easily accessible and available:

But to be historical, let’s start with 4chan.org, a wildly popular board for sharing images and talking about them, and in particular, 4chan’s /b/ board (Really, really, NSFW). /b/ is a web forum where posts have no author names and there are no archives and it’s explicitly about anything at all. This technological format meeting with the internet in the early 21st Century gave birth to Anonymous, and it remains the mother’s teat from which Anonymous sucks. (Rule 22)

Both follow their own rules, many of which run counter to conventionally accepted protocols and are frequently meant to shock, ridicule and otherwise laugh at mainstream society.

/b/ is the id of the internet, the collective unconscious’s version of the place from which the base drives arise. There is no sophistication in the slurs, sexuality, and destruction in the savage landscape of /b/ — it is the natural state of networked man.

In this, it has a kind of innocence and purity. Terms like ‘nigger’ and ‘faggot’ are common, but not there because of racism and bigotry – though racism and bigotry are easily found there. Their use is there to keep you out. These words are heads on pikes warning you that further in it gets much worse, and it does.

Nearly any human appetite is acceptable, nearly any flaw exploited, and probably photographed with a time stamp. But /b/ reminds us that the id is the seat of creative energy. Much of it, hell even most of it, is harmless or even sweet. People reach out for help on /b/, and they find encouragement and advice. The id and /b/ are the foxholes of those who feel powerless and disenfranchised.

And like Punk, it never intended to be overtly political. Rather, the circumstances and events of the time instigated it. “The Guns of Brixton”, released by The Clash in 1979 and often read as foreshadowing the 1981 Brixton riots, is one of many examples. For Anonymous, its forays into political protest were spurred on by a collective belief that Julian Assange and WikiLeaks were wrongfully targeted by governments and large, multinational corporations, and that fellow “compatriots” at the BitTorrent site The Pirate Bay were wrongfully attacked. In all cases, the common thread was a belief that the establishment was suppressing them.

Where they differ, however, is in their means of expression. Punk is analog. It could only reach those in proximity to a radio signal (or the occasional TV appearance), a concert venue, or a “zine”. Its effect and impact on society at large could only scale to the number of members it could congregate in any one physical location, which meant that it could remain largely contained and isolated. On the other hand, Anonymous is digital. Its reach is unbounded and its impact on society much more significant. The virtual nature of Anonymous means that they are able to challenge mainstream society more directly and with near impunity. With tools like the Low Orbit Ion Cannon for DDoS attacks, and with more talented hacker members able to break into corporate and government servers and steal sensitive information from them, governments and corporations see them as a real threat.

At its essence, the Punk subculture provided its members a means of “flipping off” mainstream culture, through its music, dress, art, literature, and language. Yet, it was easy for mainstream society to ignore early punk youth, since their access to media was relatively limited. Anonymous shares this same “f--- you” attitude along with the same antipathy toward authority, yet they have the means to express their views more dramatically, and with greater reach, particularly because the internet, social media, and mobile devices enable members of Anonymous to be anywhere, or anyone.

Punk has evolved over the decades. The music has changed; the aesthetics are different, and to some extent, what was considered shocking then is widely accepted now. Yet, the idea of Punk is still here. Anonymous is just the latest manifestation of it, and it could potentially have even greater impact on society-at-large.

Wednesday, December 14, 2011

SOPA Will Be Our Generation’s McCarthy Witch Hunt

In the late 1940s and early 1950s, Joseph McCarthy was determined to root out communism during the Red Scare by accusing numerous Americans of treason and of being communists. The era saw many actors blacklisted and produced the now infamous question put to the “Hollywood Ten” by the House Committee on Un-American Activities: “Are you now or have you ever been a member of the Communist Party?” They refused to answer the question, principally because they felt their First Amendment rights were being infringed.

In its current form, the “Stop Online Piracy Act” (SOPA) would allow the Department of Justice and copyright holders to seek injunctions against websites that are accused of enabling, facilitating or engaging in copyright infringement. It doesn’t stop there: It would force search engines to remove all indexes for that site, mandate that ISPs block access to the site, and bar 3rd party sites like PayPal from engaging or transacting with the offending website. All because the copyright holder (or DOJ) makes an accusation. The burden of proof is on the ISPs, the search engines and the 3rd party vendors to show that the “offending website” is not violating any copyright (So perhaps Congress should consult the 6th Amendment). The implications are severe even for websites that reference these infringing sites. They could be shut down too.

Let’s be clear, I’m not condoning piracy of any kind. Intellectual Property vis-à-vis copyright is the coin of the realm of many companies, even whole industries like Publishing, Media, Software, and yes, the Entertainment world, and they should protect their assets. They should derive value and profit from their IP. An author who pours their heart into a publication, or an artist whose performance I like should be paid. Likewise, content producers – studios, publishers, media companies – should be able to garner payment for their role in providing content. But they are looking at the whole piracy issue the wrong way.

Brute-force tactics to protect copyright have been epic failures. DRM approaches don’t work. In fact, they incite piracy, and worse, they harm the very companies they try to protect. In 2007, Radiohead released their album “In Rainbows” DRM-free. A year later, they had sold over 1.75 million copies and 1.2 million fans would buy tickets to their shows. Bottom line: Locking down content doesn’t protect copyright holders. Instead, DRM tactics end up frustrating consumers who legally purchase content but can’t use it or copy it to a new device, and, as a result, they diminish revenue. And at that point, the opportunity cost of future purchases with the same DRM constraints will grow higher and higher. Media, publishing and entertainment executives know that DRM has failed, and feel that their only recourse is through SOPA.

There will always be a small percentage of consumers who will use pirated content. But it needn’t be a negative sum game. In some cases, it should be written off as a business cost in order to generate more revenue: a pirated song might lead the offending consumer to purchase a ticket to a concert, or to the next movie because they can’t wait. Yet, to prevent wholesale piracy, technology exists today that can protect copyrighted content: XMP (even ODRL can be serialized into XMP) and digital fingerprinting, for starters. By using these, along with other tools that can scan the internet for matching assets, asset producers can identify and isolate pirated copies. Then they can go after the offending sites directly.

SOPA won’t stop piracy, but it will impact everyone’s access on the Internet. And in that vein, SOPA legitimizes the piracy of 1st Amendment rights simply by accusation of copyright infringement, in much the same way that McCarthyism censored free thought in the 1950s.

NOTE: The views expressed in this post and on this blog are my own. They do not reflect the views of my employer, its employees or its partners.