Jim's Thoughtspot: 2012

Tuesday, July 24, 2012

Enumerated Constants in XQuery

I’ve been working on a little project that allows me to merge my love of baseball with my knowledge of XML technologies. In the process of working through this project, I am creating XQuery modules that encapsulate the logic for the data. Part of the data that I’m looking at must account for different outcomes during the June amateur draft.

It turns out that the MLB June Amateur draft is quite interesting in that drafting prospects is a big gamble. Draftees may or may not sign in any given year, and those who don’t sign remain eligible for drafts in subsequent years: they could be drafted by another team, or selected again by the same team and signed. However, even if they do sign, there’s no guarantee that they’ll make it to the big leagues. And even if they do, they might not make it with the same team they signed with initially (in other words, they were traded before reaching the MLB).

In effect, there are several scenarios, depending on how the data is aggregated or filtered. However, these scenarios are well defined and constrained to a finite set of possibilities:

All draft picks
All signed draft picks
All signed draft picks who never reach the MLB (the vast majority don’t)
All signed draft picks who reached the MLB with the club that signed them
All signed draft picks who reached the MLB with another club
All unsigned draft picks
All unsigned draft picks who reached the MLB with a different club
All unsigned draft picks who reached the MLB with the same club, but at a later time
All unsigned draft picks who never reach the MLB

All of these scenarios essentially create subsets of information that I can work with, depending on whether I’m interested in analyzing a single draft year or a range of draft years. They’re essentially the same queries, with minor variations in the filter to match a specific scenario.

Working with various strongly typed languages like C# or Java, I would use a construct like an enum to encapsulate these possibilities into one object. Then I can pass this into a single method that will allow me to conditionally process the data based on the specified enum value. Pretty straightforward. For example, in C# or Java I would write:

public enum DraftStatus {
    ALL,                   // All draft picks (signed and unsigned)
    UNSIGNED,              // All unsigned draft picks
    UNSIGNED_MLB,          // All unsigned picks who made it to the MLB
    SIGNED,                // All signed draft picks
    SIGNED_NO_MLB,         // Signed but never reached the MLB
    SIGNED_MLB_SAME_TEAM,  // Signed and reached the MLB with the same team
    SIGNED_MLB_DIFF_TEAM   // Signed and reached the MLB with another club
}

The important aspect of enumerations is that each item in an enumeration can be descriptive and also map to a constant integer value (implicitly in C#; via ordinal() in Java). For example, UNSIGNED is much more intuitive and meaningful than 1, even though they are equivalent.
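To make the "single method" idea concrete, here is a minimal Java sketch. It reuses the enum names from above, but the Pick class, its fields, and the DraftQueries helper are hypothetical and exist only for illustration; they are not part of my actual project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Same constant names as the enum shown above.
enum DraftStatus {
    ALL, UNSIGNED, UNSIGNED_MLB, SIGNED,
    SIGNED_NO_MLB, SIGNED_MLB_SAME_TEAM, SIGNED_MLB_DIFF_TEAM
}

// Hypothetical draft pick; the fields are assumptions made for this sketch.
final class Pick {
    final boolean signed;
    final int mlbGames;      // 0 means the player never reached the MLB
    final String draftTeam;
    final String debutTeam;  // null when the player never debuted

    Pick(boolean signed, int mlbGames, String draftTeam, String debutTeam) {
        this.signed = signed;
        this.mlbGames = mlbGames;
        this.draftTeam = draftTeam;
        this.debutTeam = debutTeam;
    }

    boolean reachedMlb() {
        return mlbGames > 0;
    }

    boolean debutedWithDraftTeam() {
        return reachedMlb() && draftTeam.equals(debutTeam);
    }
}

// Hypothetical helper: one method covers every scenario by switching on the enum.
final class DraftQueries {
    static boolean matches(Pick p, DraftStatus status) {
        switch (status) {
            case ALL:                  return true;
            case UNSIGNED:             return !p.signed;
            case UNSIGNED_MLB:         return !p.signed && p.reachedMlb();
            case SIGNED:               return p.signed;
            case SIGNED_NO_MLB:        return p.signed && !p.reachedMlb();
            case SIGNED_MLB_SAME_TEAM: return p.signed && p.debutedWithDraftTeam();
            case SIGNED_MLB_DIFF_TEAM: return p.signed && p.reachedMlb() && !p.debutedWithDraftTeam();
            default: throw new IllegalArgumentException("Unknown status: " + status);
        }
    }

    static List<Pick> filter(List<Pick> picks, DraftStatus status) {
        List<Pick> out = new ArrayList<>();
        for (Pick p : picks) {
            if (matches(p, status)) {
                out.add(p);
            }
        }
        return out;
    }
}

final class DraftStatusDemo {
    public static void main(String[] args) {
        List<Pick> picks = Arrays.asList(
            new Pick(true, 120, "Rockies", "Rockies"),
            new Pick(true, 0, "Rockies", null),
            new Pick(false, 45, "Rockies", "Mets"));

        // Number of picks matching one scenario.
        System.out.println(DraftQueries.filter(picks, DraftStatus.SIGNED_MLB_SAME_TEAM).size());
        // The integer position behind the descriptive name.
        System.out.println(DraftStatus.UNSIGNED.ordinal());
    }
}
```

The caller only ever names the scenario it wants; the filtering logic for all of them lives in one place.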

Working with XQuery, I don’t have the luxury of an enumeration. Well, at least not in the OOP sense. I could write separate functions for each of the scenarios above, each performing its specific query and returning the subset I need. But that’s just added maintenance down the road.

At first I toyed with the idea of using an XML fragment containing a list of elements that mapped the element name to an integer value:

<draftstates>
    <ALL>0</ALL>
    <UNSIGNED>1</UNSIGNED>
    <UNSIGNED_MLB>2</UNSIGNED_MLB>
    <SIGNED>3</SIGNED>
    <SIGNED_NO_MLB>4</SIGNED_NO_MLB>
    <SIGNED_MLB>5</SIGNED_MLB>
    <SIGNED_MLB_SAME_TEAM>6</SIGNED_MLB_SAME_TEAM>
    <SIGNED_MLB_DIFF_TEAM>7</SIGNED_MLB_DIFF_TEAM>
</draftstates>

And then using a variable declaration in my XQuery:

module namespace ds="http://ghotibeaun.com/mlb/draftstates";
declare variable $ds:draftstates := collection("/mlb")/draftstates;

To use it, I need to cast the element value to an integer. Using an example, let's assume that I want all signed draftees who reached the MLB with the same team:

declare function gb:getDraftPicksByState($draftstate as xs:integer, $team as xs:string) as item()* {
   let $picks := 
       if ($draftstate = 
           xs:integer($ds:draftstates/SIGNED_MLB_SAME_TEAM)) then
           let $results := 
               /drafts/pick[Signed="Yes"][G != 0][Debut_Team=$team]
           return $results
       (: more cases... :)
       else ()
   return $picks
};

(: call the function :)
let $sameteam := 
    gb:getDraftPicksByState(xs:integer($ds:draftstates/SIGNED_MLB_SAME_TEAM), 
                            "Rockies")
return $sameteam

It works, but it’s not very elegant. Every value in the XML fragment has to be extracted through the xs:integer() function, which adds logic and makes the code less readable. On top of that, code completion (and code hinting) in IDEs like Oxygen doesn’t work with this approach.

What does work well (at least in Oxygen, and I suspect in other XML/XQuery IDEs) is code completion for variables and functions, which led me to another idea. Prior to Java 5, there were no enum structures. Instead, enumerated constants were created by declaring constants encapsulated in a class:

public class DraftStatus {
    public static final int ALL = 0;
    public static final int UNSIGNED = 1;
    public static final int UNSIGNED_MLB = 2;
    public static final int SIGNED = 3;
    public static final int SIGNED_NO_MLB = 4;
    public static final int SIGNED_MLB = 5;
    public static final int SIGNED_MLB_SAME_TEAM = 6;
    public static final int SIGNED_MLB_DIFF_TEAM = 7;   
}

This allowed static access to the constant values via the class, e.g., DraftStatus.SIGNED_MLB_SAME_TEAM.

The same principle can be applied to XQuery. Although there isn’t the notion of object encapsulation by class, we do have encapsulation by namespace. Likewise, XQuery supports code modularity by allowing little bits of XQuery to be stored in individual files, much like .java files. To access class members, you (almost always) have to import the class into the current class. The same is true in XQuery. You can import various modules into a current module by declaring the referenced module’s namespace and location.

Using this approach, we get the following:

mlbdrafts-draftstates.xqy

xquery version "1.0";

module namespace ds="http://ghotibeaun.com/mlb/draftstates";

declare variable $ds:ALL as xs:integer := 0;
declare variable $ds:UNSIGNED as xs:integer := 1;
declare variable $ds:UNSIGNED_MLB as xs:integer := 2;
declare variable $ds:SIGNED as xs:integer := 3;
declare variable $ds:SIGNED_NO_MLB as xs:integer := 4;
declare variable $ds:SIGNED_MLB as xs:integer := 5;
declare variable $ds:SIGNED_MLB_SAME_TEAM as xs:integer := 6;
declare variable $ds:SIGNED_MLB_DIFF_TEAM as xs:integer := 7;

Now we reference this in another module:

import module namespace ds="http://ghotibeaun.com/mlb/draftstates" at "mlbdrafts-draftstates.xqy";

Which gives us direct access to all the members, like an enumeration (for example, $ds:SIGNED_MLB_SAME_TEAM).

The bottom line is that this approach has worked really well for me. I can use descriptive constant names that map to specific values throughout my code, and it shows how you can add a little rigor to your XQuery coding.

Tuesday, January 17, 2012

A First Look at ODRL v2

With other things taking high priority over the last 6 months, this is the first opportunity I’ve had to look at the progression of ODRL Version 2.0 and evaluate where it has improved on the earlier versions.

First things first, ODRL has migrated to the W3C as a Community Group. Overall, this is a good thing. It opens it up to the wider W3C community, gives greater credence to the effort and, more importantly, more exposure. Well done.

On to my first impressions:

1. The model has been greatly simplified. With ODRL 1.x, it was possible to express the same rights statement in several different ways. The obvious implication was that it was virtually impossible to build a generalized API for processing IP rights, save running XJC on the schema, which isn't always what I want. It wasn’t all bad news, though: the 1.x extension model was extremely flexible and enabled the model to support additional business-specific rights logic.

2. Flexible Semantic Model. The 2.0 model has a strong RDF-like flavor to it. Essentially, all of the entities, assets, parties (e.g., rightsholders, licensees), permissions, prohibitions, and constraints are principally URI-based resource pointers that give each of the entities its semantics. Compared to 1.x, this is a vast improvement over its tag-based semantics, which meant that you were invariably extending either the ODRL content model, data dictionary, or both.

3. Needs More Extensibility. The current normative schema, still in draft, does need some additional design. Out-of-the-box testing with Oxygen shows that only one element is exposed (policy). All of the other element definitions are embedded within the complexType models, which makes it difficult to extend the model with additional structural semantics. This is extremely important on a number of fronts:

The current model exposes assets as explicit members of a permission or prohibition. Each “term” (i.e., permission or prohibition) is defined by an explicit action (print, modify, sell, display). It’s not uncommon to have a policy that covers dozens or hundreds of assets. So for each term, I have to explicitly call out each asset. This seems a little redundant. The 1.x model had the notion of terms that applied to all declared assets at the beginning of the policy (or in the 1.x semantics, rights). I’d like to see this brought back into the 2.0 model.
The constraint model is too flat. The new model is effectively a tuple of: constraintName, operator, operand. This works well for simple constraints like the following pseudo-code: “print”, “less than”, “20000”, but doesn’t work well for situations where exceptions may occur (e.g., I have exclusive rights to use the asset in the United States until 2014, except in the UK; or I have worldwide rights to use the asset in print, except for North Korea and the Middle East). Instead, I have to declare the same constraint twice: once within a permission, and a second time as a prohibition. I’d like the option to extend the constraint model to enable more complex expressions like the ones above.

Additionally, list values within constraints are expressed as tokenized strings within the rightOperand attribute. While it is completely valid to store values this way, I have a nit against these types of token lists, especially if the set of values is particularly long, as it can be for things like countries using ISO 3166 codes.

I shouldn’t have to extend the whole complexType declaration in order to extend the model with my own semantics. However, the current schema is structured that way. Instead, I’d like to see each entity type exposed as an “abstract” element, bound to a type, which ensures that my extension elements would have to at least conform to the base model.
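To illustrate the flat constraint tuple and the permission-plus-prohibition workaround described above, here is a rough Java sketch. Everything in it is an assumption made for illustration: the Constraint, Term, and PolicySketch classes, the "spatial" and "in" strings, and the evaluation rule are hypothetical and are not taken from the ODRL schema or any ODRL library.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the (constraintName, operator, operand) tuple.
final class Constraint {
    final String name;
    final String operator;      // stored only; this sketch does not interpret it
    final String rightOperand;  // whitespace-separated token list, e.g. country codes

    Constraint(String name, String operator, String rightOperand) {
        this.name = name;
        this.operator = operator;
        this.rightOperand = rightOperand;
    }

    // Split the tokenized rightOperand string into individual values.
    List<String> operandValues() {
        return Arrays.asList(rightOperand.trim().split("\\s+"));
    }
}

// Hypothetical term: a permission or a prohibition for one action.
final class Term {
    final boolean permission;  // false means prohibition
    final String action;
    final Constraint constraint;

    Term(boolean permission, String action, Constraint constraint) {
        this.permission = permission;
        this.action = action;
        this.constraint = constraint;
    }
}

final class PolicySketch {
    // Assumed rule: allowed only if some permission lists the country
    // and no prohibition for the same action lists it.
    static boolean allowed(List<Term> terms, String action, String country) {
        boolean permitted = false;
        for (Term t : terms) {
            if (!t.action.equals(action)) {
                continue;
            }
            boolean listed = t.constraint.operandValues().contains(country);
            if (listed && !t.permission) {
                return false;  // an exception expressed as a prohibition wins
            }
            if (listed) {
                permitted = true;
            }
        }
        return permitted;
    }

    public static void main(String[] args) {
        // The exception ("except KP") has to be a second, separate term.
        List<Term> terms = Arrays.asList(
            new Term(true, "print", new Constraint("spatial", "in", "US CA KP")),
            new Term(false, "print", new Constraint("spatial", "in", "KP")));

        System.out.println(allowed(terms, "print", "US"));
        System.out.println(allowed(terms, "print", "KP"));
    }
}
```

The sketch shows why the flat tuple feels awkward: the exception cannot live inside the permission's own constraint, so the same kind of constraint is declared twice and the consumer has to reconcile the two terms.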

Takeaways

I’m looking forward to using this with our Rights Management platform. The model is simple and clean and has a robust semantics strategy modeled on an RDF-like approach. This will make it easier to use the out-of-the-box model. That said, it’s missing some key structures that would make it easier to use and extend if I have to, but that can be addressed with a few modifications to the schema. (I have taken a stab at refactoring to test this theory; it’s pretty clean and I’m able to add my “wish list” extensions with very little effort.)

Link: http://dl.dropbox.com/u/29013483/odrl-v2-proposed.xsd