Tuesday, July 24, 2012

Enumerated Constants in XQuery

I’ve been working on a little project that allows me to merge my love of baseball with my knowledge of XML technologies.  In the process of working through this project, I am creating XQuery modules that encapsulate the logic for the data.  Part of the data that I’m looking at must account for different outcomes during the June amateur draft.

It turns out that the MLB June Amateur draft is quite interesting in that drafting prospects is a big gamble.  Draftees may or may not sign in any given year, and those who don’t sign remain eligible for the draft in subsequent years, when they could be drafted by another team.  Alternatively, they could be selected again by the same team and signed.  However, even if they do sign, there’s no guarantee that they’ll make it to the big leagues.  And even if they do, they might not make it with the team they signed with initially (in other words, they were traded before reaching the MLB).

In effect there are several scenarios, depending on how the data is aggregated or filtered.  However, these scenarios are well defined and constrained to a finite set of possibilities:
  • All draft picks
  • All signed draft picks
  • All signed draft picks who never reach the MLB (the vast majority don’t)
  • All signed draft picks who reached the MLB with the club that signed them
  • All signed draft picks who reached the MLB with another club
  • All unsigned draft picks
  • All unsigned draft picks who reached the MLB with a different club
  • All unsigned draft picks who reached the MLB with the same club, but at a later time
  • All unsigned draft picks who never reach the MLB
All of these scenarios essentially create subsets of information that I can work with, depending on whether I’m interested in analyzing a single draft year or a range of draft years.  They’re essentially the same queries, with minor variations in the filter to match a specific scenario.

Working with various strongly typed languages like C# or Java, I would use a construct like an enum to encapsulate these possibilities into one object.  Then I can pass this into a single method that will allow me to conditionally process the data based on the specified enum value.  Pretty straightforward.  For example, in C# or Java I would write:
public enum DraftStatus {
   ALL,  //All draft picks (signed and unsigned)
   UNSIGNED, //All unsigned draft picks
   UNSIGNED_MLB, //All unsigned picks who made it to the MLB
   SIGNED,  //All signed draft picks
   SIGNED_NO_MLB, //Signed but never reached the MLB
   SIGNED_MLB_SAME_TEAM, //signed and reached MLB with the same team
   SIGNED_MLB_DIFF_TEAM  //signed and reached with another club   
};
The important aspect of enumerations is that each item in an enumeration can be descriptive and also map to a constant integer value (directly in C#, and through ordinal() in Java).  For example, UNSIGNED is much more intuitive and meaningful than 1, even though they are equivalent.

Working with XQuery, I don’t have the luxury of an enumeration.  Well, at least not in the OOP sense.  I could write a separate function for each of the scenarios above, each performing its specific query and returning the subset I need.  But that’s just added maintenance down the road.

At first I toyed with the idea of using an XML fragment containing a list of elements that mapped the element name to an integer value:
<draftstates>
    <ALL>0</ALL>
    <UNSIGNED>1</UNSIGNED>
    <UNSIGNED_MLB>2</UNSIGNED_MLB>
    <SIGNED>3</SIGNED>
    <SIGNED_NO_MLB>4</SIGNED_NO_MLB>
    <SIGNED_MLB>5</SIGNED_MLB>
    <SIGNED_MLB_SAME_TEAM>6</SIGNED_MLB_SAME_TEAM>
    <SIGNED_MLB_DIFF_TEAM>7</SIGNED_MLB_DIFF_TEAM>
</draftstates>
And then using a variable declaration in my XQuery:
module namespace ds="http://ghotibeaun.com/mlb/draftstates";
declare variable $ds:draftstates := collection("/mlb")/draftstates;
To use it, I need to cast the element value to an integer. Using an example, let's assume that I want all signed draftees who reached the MLB with the same team:
declare function gb:getDraftPicksByState($draftstate as xs:integer, $team as xs:string) as item()* {
   let $picks := 
       if ($draftstate = 
           xs:integer($ds:draftstates/SIGNED_MLB_SAME_TEAM)) then
           let $results := 
               /drafts/pick[Signed="Yes"][G != 0][Debut_Team=$team]
           return $results
       (: more cases... :)
       else ()
   return $picks
};

(:call the function:)
let $sameteam := 
    gb:getDraftPicksByState(xs:integer($ds:draftstates/SIGNED_MLB_SAME_TEAM), 
                     "Rockies")
return $sameteam
It works, but it’s not very elegant.  Every value in the XML fragment has to be extracted through the xs:integer() function, which adds logic and makes the code less readable.  On top of that, code completion and code hinting in IDEs like Oxygen don’t work with this approach.

What does work well (at least in Oxygen, and I suspect in other XML/XQuery IDEs) is code completion for variables and functions, which led me to another idea.  Prior to Java 5, there weren’t enum structures.  Instead, enumerated constants were created by declaring constants encapsulated in a class:
public class DraftStatus {
    public static final int ALL = 0;
    public static final int UNSIGNED = 1;
    public static final int UNSIGNED_MLB = 2;
    public static final int SIGNED = 3;
    public static final int SIGNED_NO_MLB = 4;
    public static final int SIGNED_MLB = 5;
    public static final int SIGNED_MLB_SAME_TEAM = 6;
    public static final int SIGNED_MLB_DIFF_TEAM = 7;   
}
This allowed static access to the constant values via the class, e.g., DraftStatus.SIGNED_MLB_SAME_TEAM.
The same principle can be applied to XQuery.  Although there isn’t the notion of object encapsulation by class, we do have encapsulation by namespace.  Likewise, XQuery supports code modularity by allowing little bits of XQuery to be stored in individual files, much like .java files. To access class members, you (almost always) have to import the class into the current class.  The same is true in XQuery.  You can import various modules into a current module by declaring the referenced module’s namespace and location.
Using this approach, we get the following:


mlbdrafts-draftstates.xqy
xquery version "1.0";

module namespace ds="http://ghotibeaun.com/mlb/draftstates";

declare variable $ds:ALL as xs:integer := 0;
declare variable $ds:UNSIGNED as xs:integer := 1;
declare variable $ds:UNSIGNED_MLB as xs:integer := 2;
declare variable $ds:SIGNED as xs:integer := 3;
declare variable $ds:SIGNED_NO_MLB as xs:integer := 4;
declare variable $ds:SIGNED_MLB as xs:integer := 5;
declare variable $ds:SIGNED_MLB_SAME_TEAM as xs:integer := 6;
declare variable $ds:SIGNED_MLB_DIFF_TEAM as xs:integer := 7;
Now we reference this in another module:
import module namespace ds="http://ghotibeaun.com/mlb/draftstates" at "mlbdrafts-draftstates.xqy";
Which gives us direct access to all the members, like an enumeration:
[Screenshot: xqueryconstants-autocomplete]
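To make the payoff concrete, here is a minimal sketch of the earlier filter function rewritten against the constants module. It assumes mlbdrafts-draftstates.xqy is saved next to the query. The function name local:picks-by-state, the extra $picks parameter, and the sample <drafts> data are all made up for illustration, and only a few of the states are handled.

```xquery
xquery version "1.0";

(: Assumes the constants module from this post sits next to this query. :)
import module namespace ds = "http://ghotibeaun.com/mlb/draftstates"
    at "mlbdrafts-draftstates.xqy";

(: Hypothetical helper: filters a sequence of pick elements by draft state.
   Element names (Signed, G, Debut_Team) follow the earlier example. :)
declare function local:picks-by-state(
    $picks as element(pick)*,
    $draftstate as xs:integer,
    $team as xs:string
) as element(pick)* {
    if ($draftstate eq $ds:ALL) then
        $picks
    else if ($draftstate eq $ds:SIGNED) then
        $picks[Signed = "Yes"]
    else if ($draftstate eq $ds:UNSIGNED) then
        $picks[not(Signed = "Yes")]
    else if ($draftstate eq $ds:SIGNED_MLB_SAME_TEAM) then
        $picks[Signed = "Yes"][G != 0][Debut_Team = $team]
    else
        (: remaining states are omitted from this sketch :)
        ()
};

(: Made-up sample data, for illustration only. :)
let $drafts :=
    <drafts>
        <pick><Name>Player A</Name><Signed>Yes</Signed><G>12</G><Debut_Team>Rockies</Debut_Team></pick>
        <pick><Name>Player B</Name><Signed>No</Signed><G>0</G><Debut_Team/></pick>
    </drafts>
return
    local:picks-by-state($drafts/pick, $ds:SIGNED_MLB_SAME_TEAM, "Rockies")
```

With this sample data the call should select only the Player A pick. One side benefit of the variable-based approach: a misspelled constant name is an undeclared variable, which a processor reports as a static error, whereas a misspelled element name in the XML-fragment version simply yields an empty sequence and the comparison quietly fails.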
The bottom line is that this approach has worked really well for me.  I can use descriptive constant names that map to specific values throughout my code, and it shows how you can add a little rigor to your XQuery coding.

2 comments:

inthewoods said...

If you want to use XML standards for this you could write a schema and then use the explicit datatype(s). http://developer.marklogic.com/learn/2007-04-schema

But honestly your solution is much cleaner.

Jim Earley said...

Thanks!

Good point about using schemas. In that case, you need to include both a schema and an XML fragment that binds against it in the XQuery that represents the enumeration. I suppose the only constraint there is that changes to the enumeration set (i.e., adding/removing constants, changing constant values) have to be applied both in the schema and in the XML fragment.

The other thing I don't mention here is that I'm writing this "little" XQuery application in layers. The intent is that these are low-level functions that almost act like methods in abstract classes. On top of that are higher-level XQuery modules that hide (most of) these complexities.