XML, HTML, JSON – Choosing the Right Format for Legislative Text

I find I’m often talking about an information model and XML as if they’re the same thing. However, there is no reason to tie these two things together as one. Instead, we should look at the information model in terms of the information it represents and let the manner in which we express that information be a separate concern. In the last few weeks I have found myself discussing alternative forms of representing legislative information with three people – chatting with Eric Mill at the Sunlight Foundation about HTML microformats (look for a blog from him on this topic soon), Daniel Bennett regarding microdata, and Ari Hershowitz regarding JSON.

I thought I would try and open up a discussion on this topic by shedding some light on it. If we can strip away the discussion of the information model and instead focus on the representation, perhaps we can agree on which formats are better for which applications. Is a format a good storage format, a good transport format, a good analysis/programming format, or a good all-around format?

1) XML:

I’ll start with a simple example of a bill section using Akoma Ntoso:

<section xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD03" 
       id="{GUID}" evolvingId="s1">
    <num>§1.</num>
    <heading>Commencement </heading>
    <content> <p>This act will go into effect on 
       <date name=”effectiveDate” date="2013-01-01">January 1, 2013</date&gt;. 
    </p> </content>
</section> 

Of course, I am partial to XML. It’s a good all-around format. It’s clear, concise, and well supported. It works well as a good storage format, a good transport format, as well as being a good format of analysis and other uses. But it does bring with it a lot of complexity that is quite unnecessary for many uses.

2) HTML as Plain Text

For developers looking to parse out legislative text, plain text embedded in HTML using a <pre> element has long been the most useful format.

   <pre>
   §1. Commencement
   This act will go into effect on January 1, 2013.
   </pre>

It is a simple and flexible represenation. Even when an HTML represenation is provided that is more highly decorated, I have always invariably removed the decorations to leave behind this format.

However, in recent years, as governments open up their internal XML formats as part of their transparency intiatives, it’s becoming less necessary to write your own parsers. Still, raw text is a very useful base format.

3) HTML/HTML5 using microformats:

<div class="section" id="{GUID}" data-evolvingId="s1">
   <div>
      <span class="num">§1.</span> 
      <span class=”heading”>Commencement </span>
   </div>
   <div class="content"><p>This act will go into effect on 
   <time name="effectiveDate" datetime="2013-01-01">January 1, 2013 <time>. 
   </p></div>
</div>

As you can see, using HTML with microformats is a simple way of mapping XML into HTML. Currently, many legislative data sources that offer HTML content either offer bill text as plain text as I showed in the previous example or they decorate it in a way that masks much of the semantic meaning. This is largely because web developers are building the output to an appearance specification rather than to an information specification. The result is class names that better describe the appearance of the text than the underlying semantics. Using microformats preserves much of the semantic meaning through the use of the class attribute and other key attributes.

I personally think that using HTML with microformats is a good way to transport legislative data to consumers that don’t need the full capabilities of the XML representation and are more interested in presenting the data rather than analyzing or processing it. A simple transform could be used to take the stored XML and to then translate it into this form for delivery to a requestor seeking an easy-to-consume solution.

[Note: HTML5 now offers a <section> element as well as an <article> element. However, they’re not a perfect match to the legislative semantics of a section and an article so I prefer not to use them.]

4) HTML5 Microdata:

<div itemscope 
      itemtype="http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD03#section" 
      itemid="urn:xcential:guid:{GUID}">
   <data itemprop="evolvingId" value="s1"/>
   <div>
      <span itemprop="num">§1.</span>
      <span itemprop="heading">Commencement </span>
   </div>
   <div itemprop="content"> <p>This act will go into effect on 
      <time itemprop="effectiveDate" time="2013-01-01">January 1, 2013 </time>.
   </p> </div>
</div>

Using microdata, we see more formalization of the annotation convention than microformats offers – which brings along additional complexity and requires some sort of naming authority which I can’t say I either really understand or see how it will happen. But it’s a more formalized approach and is part of the HTML5 umbrella. I doubt that microdata is a good way to represetn a full document. Rather, I see microdata better fitting in to the role of annotating specific parts of a document with metadata. Much like microformats, microdata is a good solution as a transport format to a consumer not interested in dealing with the full XML representation. The result is a format that is rich in semantic information and is also easily rendered to the user. However, it strikes me that the effort to more robustly handle namespaces only reinvents one of XMLs more confusing aspects, namely namespaces, in just a different way.

5) JSON

{
   "type": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD03#section",
   "id": "{GUID}",
   "evolvindId": "s1",
    "num" : {
      "type": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD03#num",
      "text": "§1."
   },
   "heading":  {
      "type": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD03#heading",
      "text": "Commencement"
   },
   "content": {
      "type": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD03#content",
      "text1": "This act will go into effect on "
      "date": {
         "type": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD03#date",
         "date": "2013-01-01",
         "text": "January 1, 2013"
      }
      "text2": "."
   }
}

Quite obviously, JSON is great if you’re looking to easily load the information into your programmatic data structures and aren’t looking to present the information as-is to the user. This is a programmatic format primarily. Representing the full document in JSON might be overkill. Perhaps the role of JSON is for key parts of extracted metadata than the full document.

There are still other formats I could have brought up like RDFa, but I think my point has been made. There are many different ways of representing the same legislative model – each with its own strength and weaknesses. Different consumers have different needs. While XML is a good all-around format, it also brings with it some degree of sophistication and complexity that many information consumers simply don’t need to tackle. It should be possible, as a consumer, to specify the form of the information that most closely fits my need and have the legislative data source deliver it to me in that format.

[Note: In Akoma Ntoso, the format is called the “manifestation.” and is specified as part of the referencing specification.]

What do you think?

XML, HTML, JSON – Choosing the Right Format for Legislative Text

9 thoughts on “XML, HTML, JSON – Choosing the Right Format for Legislative Text

  1. What do I think? I think it’s all brilliant! Your idea to use XML ensures a couple of things…1, that it will be useful code for some time to come and 2, that even a layman can understand the code, or at least the context of the code without having to know every parameter of the code. Good thinking!!

  2. It is worth remembering that XML by itself isn’t a panacea, either. Working with trademark XML data, I have seen so many completely absurd structural conventions that now I can only laugh when I see a new one. Mostly these seem to be artefacts of trying to represent an earlier pure database format as faithfully as possible. And even if the structures are fine, it is almost always possible to fill them with write-only data so that it takes a human being or a sh*tload of heuristics to figure out what the present state is.

    Actually, just the other day I got a silly idea for a workshop presentation, something like a legal XML gallery of horrors. Who knows, maybe in a year or two…

  3. JSON doesn’t have good validation tools, lack of stream-parsing implementations, namespace support is non-existing/standardised. It also makes for an incredibly poorly readable format when dealing with inline elements, but especially the fact that the JSON specification does not have any hard ordering specifications for the objects means you’re going to have to rely on JSON-specific implementations – never a good idea.

    The micro-formats are only going to help a little bit, but they should be used for an improved user experience for an HTML rendering only, and not as a reliable data structure.

    I honestly don’t see any alternative that is as perfect a fit for legislative documents.

    My 2 cents,

    -Phil

  4. Something approaching programmable would be ideal like html5 or JSON. I don’t think we’ll ever convince lawyers to learn even basic html (despite my talk on reinventlawchannel.com) but with the rise of Legal Information Engineers over the next few years — using a machine-executable language would not be unreasonable. Michael Poulshock at Hammura.bi is exploring this idea — he wrote his own programming language for the law. We shouldn’t be too concerned with accessibility because lawyers won’t want to deal with it anyway.

  5. Grant, I appreciate the post and your example. I have posted my response in a google doc which includes some changes to the Microdata example to add some additional features. Also, I will link to a comparison table of the various formats. For example, only HTML is human readable, because of the presentation layer (browser) and Microdata is the best way so far to embed metadata/machine readable information into HTML. Note that the extensions to the new HTML5 data prefix attribute has some serious data collision and usability problems.

    response doc with revised HTML5 with Microdata and embedded comments:
    https://docs.google.com/document/d/1TUPUOfgrqP__Rt5dhEPkBy5f-d0BBkSzZfQp3vLNmoQ/pub

    comparison table:
    https://docs.google.com/spreadsheet/pub?key=0Al6jPr0LRFa0dGcwNW1KUFI2ajdlR0JZOWVZQ3ptUlE&output=html

    slides about Microdata:
    http://advocatehope.org/tech-tidbits/microdata-the-human-readable-data/presentation_view

  6. […] The data provided must be useful. This means that the most important characteristics of the data must be described in ways that allow it to be interpreted by a computer without too much work. For instance, important entities described by the data should be marked in ways that are easily found and characterized – preferably using broadly accepted open standards. […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s