GEDCOM Parser (#6)

by Jamis Buck

GEDCOM is the "GEnealogical Data COMmunication" file format. It is a plain-text electronic format used to transfer genealogical data. (The purpose of this quiz is not to debate whether it is a particularly *good* file format or not--but it is certainly more compact than the corresponding XML would be, and bandwidth was particularly important back when the standard was developed.)

The purpose of this quiz is to develop a simple parser that can convert a GEDCOM file to XML.

GEDCOM Format

The GEDCOM file format is very straightforward. Each line represents a node in a tree. It looks something like this:

0 @I1@ INDI
1 NAME Jamis Gordon /Buck/
2 SURN Buck
2 GIVN Jamis Gordon
1 SEX M
...

In general, each line is formatted thus:

LEVEL TAG-OR-ID [DATA]

The LEVEL is an integer, representing the current depth in the tree. If subsequent lines have greater levels than the current node, they are children of the current node.

TAG-OR-ID is either a tag that identifies the type of data in that node, or it is a unique identifier. Tags are 3- or 4-letter words in uppercase. The unique identifiers are always text surrounded by "@" characters (e.g., "@I54@"). If an ID is given, the DATA is the type of the subtree that is identified.

So, to take the example given above apart:

1) "0 @I1@ INDI". This starts a new subtree of type INDI (individual). The id for this individual is "@I1@".

2) "1 NAME Jamis Gordon /Buck/". This starts a NAME subtree with a value of "Jamis Gordon /Buck/".

3) "2 SURN Buck". This is a subelement of the NAME subtree, of type SURN ("surname").

4) "2 GIVN Jamis Gordon". Like SURN, this is a subelement of the NAME subtree, but it specifies the given name of the individual.

5) "1 SEX M". Creates a new subelement of the INDI element, of type "SEX" (i.e., "gender").

And so forth.

Variable whitespace is allowed between the level and the tag. Blank lines are ignored.
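The line format described above can be sketched as a small Ruby helper. This is only an illustration, not part of any submitted solution: the names GEDCOM_LINE and parse_line are hypothetical, and the sketch assumes that everything after an ID is the subtree type.

```ruby
# Hypothetical helper for illustration: split one GEDCOM line into its parts.
# Leading whitespace and variable whitespace after the level are tolerated.
GEDCOM_LINE = /\A\s*(\d+)\s+(\S+)(?:\s(.*))?\z/

def parse_line(line)
  line = line.chomp
  return nil if line.strip.empty? # blank lines are ignored

  match = GEDCOM_LINE.match(line) or
    raise ArgumentError, "invalid GEDCOM line: #{line.inspect}"

  level = match[1].to_i
  token = match[2]
  data  = match[3]

  if token.match?(/\A@.+@\z/)
    # An ID was given, so the DATA names the type of the subtree.
    { level: level, id: token, tag: data, data: nil }
  else
    { level: level, id: nil, tag: token, data: data }
  end
end

p parse_line("0 @I1@ INDI")
p parse_line("1   NAME Jamis Gordon /Buck/")
```

Returning nil for blank lines lets a caller skip them with a simple "next unless" check.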

The Challenge

The challenge, then, is to create a parser that takes a GEDCOM file as input and converts it to XML. The snippet of GEDCOM given above would become:

<gedcom>
  <indi id="@I1@">
    <name value="Jamis Gordon /Buck/">
      <surn>Buck</surn>
      <givn>Jamis Gordon</givn>
    </name>
    <sex>M</sex>
    ...
  </indi>
  ...
</gedcom>
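One way to produce output in this shape, sketched under a few assumptions: a node with children keeps its data in a "value" attribute, a leaf keeps its data as text, and tags are lowercased. The names Node, build_tree, to_xml and escape_xml are made up for this sketch and do not come from any submission.

```ruby
# Hypothetical node type for this sketch; not part of any library.
Node = Struct.new(:tag, :id, :data, :children)

XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;',
                '"' => '&quot;', "'" => '&apos;' }.freeze

def escape_xml(text)
  text.to_s.gsub(/[&<>"']/, XML_ESCAPES)
end

# Build a tree from GEDCOM lines; stack[n] is the open node one level above n.
def build_tree(lines)
  root  = Node.new('gedcom', nil, nil, [])
  stack = [root]
  lines.each do |line|
    next if line.strip.empty?

    level, token, data = line.strip.split(/\s+/, 3)
    stack.pop while stack.size > level.to_i + 1
    node = if token.start_with?('@')
             Node.new(data.downcase, token, nil, []) # DATA is the subtree type
           else
             Node.new(token.downcase, nil, data, [])
           end
    stack.last.children << node
    stack.push(node)
  end
  root
end

# Leaves carry their data as text; nodes with children use a value attribute.
def to_xml(node, depth = 0)
  pad   = '  ' * depth
  attrs = +''
  attrs << %( id="#{escape_xml(node.id)}") if node.id
  if node.children.empty?
    "#{pad}<#{node.tag}#{attrs}>#{escape_xml(node.data)}</#{node.tag}>\n"
  else
    attrs << %( value="#{escape_xml(node.data)}") if node.data
    inner = node.children.map { |child| to_xml(child, depth + 1) }.join
    "#{pad}<#{node.tag}#{attrs}>\n#{inner}#{pad}</#{node.tag}>\n"
  end
end

sample = ['0 @I1@ INDI', '1 NAME Jamis Gordon /Buck/', '2 SURN Buck',
          '2 GIVN Jamis Gordon', '1 SEX M']
puts to_xml(build_tree(sample))
```

Keeping the tree in memory before printing is what makes the attribute-versus-text decision possible, since a node cannot know it has children until the following lines have been read.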

Sample Input

There is a large GEDCOM file online containing the lineage of various European royalty. You may download it from http://search.cpan.org/src/PJCJ/Gedcom-1.11/royal.ged (yah, it's a CPAN link, but it had the highest bandwidth of any URL I found via Google). This particular file makes generous use of whitespace to increase its readability.


Quiz Summary

This quiz generated some interesting discussion on Ruby Talk. Some details of the GEDCOM format were talked about and the proper way to handle XML output was debated. I don't want to fill this summary with that conversation (visit the archives for the "[QUIZ]" thread, if you missed it), but I do want to point out the best point made about the XML format. Here's what Hans Fugal had to say about it:

I take issue with the example XML in the quiz because I am from the
"data in text, metadata in attributes" camp, and the name is not
metadata. Here is a snippet of the output that I am generating:

<INDI id='@I11@'>
<NAME>Itha /Steele/<SURN>Steele</SURN>
<GIVN>Itha</GIVN>
</NAME>
<SEX>F</SEX>
<_UID>38CC16658231D511ACB8E07C9CE21378E1AF</_UID>
<FAMS>@F12@</FAMS>
<FAMC>@F4@</FAMC>
</INDI>

I have yet to fall in love with XML like the rest of the world, but I think Hans makes a good point here.

The solution submitted by Florian Gross also supported YAML and "pretty print" output.

Formats aside, submitted solutions varied in how much they interpreted from the GEDCOM file as opposed to simple translation. The most common change between input and output was to build a single entity out of GEDCOM's CONC and CONT fields.
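As a rough illustration of that merging step, here is a sketch that folds continuation records into the record they follow. It assumes the usual GEDCOM meaning of the two tags (CONT continues the value on a new line, CONC appends to it with no separator) and a hypothetical [level, tag, data] triple for each parsed line; merge_continuations is an invented name.

```ruby
# Fold CONT/CONC records into the value of the record they continue.
# records: [level, tag, data] triples in file order (a shape assumed for
# this sketch, not taken from any submitted solution).
def merge_continuations(records)
  merged = []
  records.each do |level, tag, data|
    if %w[CONT CONC].include?(tag)
      raise ArgumentError, "#{tag} with nothing to continue" if merged.empty?

      # CONT starts a new line of text; CONC glues text on directly.
      separator = tag == 'CONT' ? "\n" : ''
      merged.last[2] = "#{merged.last[2]}#{separator}#{data}"
    else
      merged << [level, tag, data]
    end
  end
  merged
end

records = [
  [1, 'NOTE', 'Hello wor'],
  [2, 'CONC', 'ld'],
  [2, 'CONT', 'second line'],
  [1, 'SEX',  'M']
]
p merge_continuations(records)
```

The sketch relies on continuation records appearing directly after the record they extend, and it builds new arrays and strings rather than modifying its input.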

XML generation techniques also varied among submissions. Some of us built up our own Strings, others used REXML, Cedric Foll used XmlSimple and Jim Weirich used his Builder package, described here:

Builder Objects

Now let's look at a solution. Here's the code submitted by Hans Fugal:

ruby
#! /usr/bin/ruby
require 'rexml/document'

doc = REXML::Document.new "<gedcom/>"
stack = [doc.root]

ARGF.each_line do |line|
  next if line =~ /^\s*$/

  # parse line
  line =~ /^\s*(\d+)\s+(@\S+@|\S+)(\s(.*))?$/ or raise "Invalid GEDCOM"
  level = $1.to_i
  tag = $2
  data = $4

  # pop off the stack until we get the parent
  while (level+1) < stack.size
    stack.pop
  end
  parent = stack.last

  # create XML tag
  el = nil
  if tag =~ /@.+@/
    el = parent.add_element data
    el.attributes['id'] = tag
  else
    el = parent.add_element tag
    el.text = data
  end

  stack.push el
end
doc.write($stdout,0)
puts

The above starts by creating a REXML document and a stack for managing parent/child relationships. With setup out of the way, the code reads from STDIN or files specified as command-line arguments, line by line.

Processing each line is a three-stage process: parse the line, unwind the stack to the parent for this element, and finally add the element to the parent through the REXML API and push the new element onto the stack.

When it's all been read, the complete XML is dumped to STDOUT.

Obviously, Hans' solution doesn't do any special handling of the GEDCOM format. It's a simple parse and print solution.

If you aren't going to use a great library like REXML to generate XML output, remember to handle your own escaping. (See my submission for an example of code that forgot to do this! Oops.)
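For hand-built Strings, a minimal escaping sketch might look like the following. The helper names escape_xml and element are made up for illustration; the five replacements are XML's predefined entities.

```ruby
# Map each character that is special in XML to its predefined entity.
XML_ESCAPES = {
  '&' => '&amp;',
  '<' => '&lt;',
  '>' => '&gt;',
  '"' => '&quot;',
  "'" => '&apos;'
}.freeze

# Hypothetical helper: escape text for use in element content or attributes.
def escape_xml(text)
  text.to_s.gsub(/[&<>"']/, XML_ESCAPES)
end

# Hypothetical helper: wrap escaped text in a simple element.
def element(tag, text)
  "<#{tag}>#{escape_xml(text)}</#{tag}>"
end

puts element('NOTE', 'Smith & Sons <est. 1850>')
# prints: <NOTE>Smith &amp; Sons &lt;est. 1850&gt;</NOTE>
```

Doing the replacement in a single gsub pass matters: escaping "&" in a separate, later step would corrupt the entities produced by the earlier ones.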

If you are a person who deals with GEDCOM files outside of this quiz, you may want to check out these links passed to me by Jamis Buck:

If you go to http://www.familysearch.com, you can search for
ancestors/relatives and download their information in GEDCOM format.

There are also a variety of tools available for taking a GEDCOM file
and creating a website from it. In fact, if you go to
http://www.onepagegenealogy.com you can have a wall-chart sized
pedigree chart printed from a GEDCOM file for $20. (That particular
project is a research project here at BYU.)

My thanks go out to Jamis for our second contributed quiz, and a good topic at that. I've got two more contributed quizzes waiting in the wings, which is great news. Keep 'em coming!

Stay tuned folks, because it's Game Show time with tomorrow's Ruby Quiz...