I’ve been working on putting my family history together for the last few years. Most of the genealogical applications have some pretty nice features, but none seemed to have all of the features I wanted. I wanted the ability to manage all of the summary information and relationships (all applications do this), but also to cross-reference the factual data with individual biographies. And I wanted to be able to display the information in different ways and formats, not just the ones supported by any particular application.
I started looking at the format that most genealogy programs store their data in. With few exceptions, they all use GEDCOM, or GEnealogical Data COMmunication. The standard was developed by the Church of Jesus Christ of Latter-day Saints as a portable data format for expressing information about individuals, families and sources (bibliography).
GEDCOM is a line-delimited field format. Every line begins with a level number: 0 marks the start of a new record, and each level of nesting within the record adds one. For example, a first-level field line starts with the number 1, and a subfield (e.g., the given name within a person’s full name) starts with the next higher number. The following is an example record for an individual:
0 @1@ INDI
1 NAME Robert Eugene/Williams/
1 SEX M
1 BIRT
2 DATE 02 OCT 1822
2 PLAC Weston, Madison, Connecticut
2 SOUR @6@
3 PAGE Sec. 2, p. 45
3 EVEN BIRT
4 ROLE CHIL
1 DEAT
2 DATE 14 APR 1905
2 PLAC Stamford, Fairfield, CT
1 BURI
2 PLAC Spring Hill Cem., Stamford, CT
1 RESI
2 ADDR 73 North Ashley
3 CONT Spencer, Utah UT84991
2 DATE from 1900 to 1905
1 FAMS @4@
1 FAMS @9@
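To make the level numbering concrete, here is a minimal Python sketch of how those lines can be folded back into a tree. It is not how any particular genealogy tool does it; the `Node` class, the `parse_gedcom` function and the regular expression are names made up for this illustration, and the sample adds `GIVN`/`SURN` subfields under `NAME` purely to show nesting.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# One GEDCOM line: level, optional @xref@ id, tag, optional value.
LINE_RE = re.compile(r"^\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\w+)(?:\s(.*))?$")


@dataclass
class Node:
    level: int
    tag: str
    value: str = ""
    xref: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def parse_gedcom(text: str) -> List[Node]:
    """Return the level-0 records, each with its subfields nested below it."""
    records: List[Node] = []
    stack: List[Node] = []  # the chain of currently open ancestors
    for raw in text.splitlines():
        if not raw.strip():
            continue
        match = LINE_RE.match(raw)
        if match is None:
            raise ValueError(f"not a GEDCOM line: {raw!r}")
        level, xref, tag, value = match.groups()
        node = Node(int(level), tag, value or "", xref)
        # Close every open field at the same or a deeper level.
        while stack and stack[-1].level >= node.level:
            stack.pop()
        parent_level = stack[-1].level if stack else -1
        if node.level != parent_level + 1:
            raise ValueError(f"level jumps by more than one: {raw!r}")
        if stack:
            stack[-1].children.append(node)
        else:
            records.append(node)
        stack.append(node)
    return records


SAMPLE = """\
0 @1@ INDI
1 NAME Robert Eugene/Williams/
2 GIVN Robert Eugene
2 SURN Williams
1 SEX M
1 FAMS @4@
1 FAMS @9@
"""

records = parse_gedcom(SAMPLE)
person = records[0]
```

After this runs, `person.children` holds the four level-1 fields, and the `NAME` node carries the two level-2 subfields as its own children.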
Other than the line and level delimiters, the data structures are pretty free-form and parser-dependent. Even the field names, outside the common set supplied by GEDCOM, are parser-dependent. So if you use one genealogy tool, it can understand these fields, but if you try to load the file into another, it blows up. Gah! Add to that, GEDCOM just isn’t that great for handling rich content like pictures in a biography.
This sounds like a job for XML.
So the first question I had to address was how to model this. I’ve looked at some of the GEDCOM XML sites, and they suffer from the same problems as the text data structure does. Just not enough rich data.
The answer I came up with was to use DITA, which has several things going for it:
- I can easily mimic GEDCOM’s data structure with a specialized map
- I can extend the model to support other potentially valuable metadata
- I can easily model rich biographical content as a topic specialization
- DITA’s numerous linking mechanisms work well for the various types of links I would need: internal references within a map, rel-tables, cross-references, external hyperlinks to third-party websites and content.
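To give a feel for the first bullet, here is a rough Python sketch that turns GEDCOM individuals into map-style elements. The element and attribute names (`familytree`, `title`, `individual`, `id`, `keys`, `gender`) are borrowed from the sample map in this post; everything else is an assumption made for illustration, including the helper names, the zero-padded key format, and the idea that cross-reference ids are purely numeric like `@1@`.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple

# GEDCOM SEX values mapped to the gender attribute values used in the sample map.
GENDER = {"M": "male", "F": "female"}


def individual_element(xref: str, sex: str) -> ET.Element:
    """Build one <individual> element from a GEDCOM xref such as '@1@'."""
    number = int(xref.strip("@"))  # assumes numeric xrefs, as in the example record
    attrs = {"id": f"I{number}", "keys": f"I{number:06d}"}  # key format is a guess
    if sex in GENDER:
        attrs["gender"] = GENDER[sex]
    return ET.Element("individual", attrs)


def build_map(title: str, people: Iterable[Tuple[str, str]]) -> ET.Element:
    """Wrap the individuals in a <familytree> root with a <title>."""
    root = ET.Element("familytree")
    ET.SubElement(root, "title").text = title
    for xref, sex in people:
        root.append(individual_element(xref, sex))
    return root


tree = build_map("Schmoe Family Tree", [("@1@", "M"), ("@3@", "F")])
xml_text = ET.tostring(tree, encoding="unicode")
```

A real conversion would also have to carry names, events and family links across, which is where the specialized elements and DITA's linking mechanisms come in.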
The first thing I did was to model and create a map specialization that mimics the GEDCOM data. For the sake of brevity, I’ll show a sample of a specialized map. If you want more information, ping me:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE familytree SYSTEM "file:/opt/dita/1.2/dtd1.2/genealogy/dtd/familytree.dtd">
<title>Schmoe Family Tree</title>
<individual id="I1" keys="I000001" gender="male">
<placename>The Stork Factory</placename>
<individual id="I2" keys="I00002" gender="male">
<individual id="I000003" keys="I000003" gender="female">
<family id="f1" keys="F1">