Common Identifiers or a Common Data Format. What is more important?

I just read this excellent post by Tom Bruce et al. from the Legal Information Institute at the Cornell University Law School.

Tom’s post brought to mind something I have long wrestled with. (So long, in fact, that it was a key part of my job working on CAD systems in the aerospace industry.) I sometimes wonder if having common identifiers isn’t more important than having a common data format. The reason is that being able to unambiguously establish relationships is both very difficult and very useful. In fact, one of the reasons you want a common format is so that you can find and establish these identifiers.

I have used a number of schemes for identifiers over the past ten years. Most of the schemes I have used have involved Uniform Resource Names or URNs. A decade ago we designed a URN mechanism for use in the California Legislature. Our LegisWeb product uses a very similar URN schema based on the lessons learned on the California Project. In more recent years I have experimented with the URN:lex proposal and the URL-based proposal that is within Akoma Ntoso. Both of these proposals are based around FRBR. I can’t say I have found the ideal solution.

I favor a URN-based mechanism for a number of reasons. URNs are names – that was their intent as defined by the IETF. They are not meant to imply a location or organizational containment (well, mostly they aren’t). In theory, these identifiers can be passed around and resolved to find the most relevant representation when needed. But there is a problem. URNs have never really been accepted. While they are conceptually very valuable, their poor acceptance and lack of supporting tools tend to undermine their value.

Akoma Ntoso takes a different approach. It uses URLs instead of URNs. Rather than using URLs as locations though, they are used as if they are identifiers. It is the duty of the webserver and applications plugged into the webserver to intercept the URLs and treat them as identifiers. In doing this, the webserver provides the same resolution functions that URNs were supposed to offer. My upcoming editor implements this functionality. I have built HTTP handlers that convert URLs into repository queries which retrieve and compose the requested documents. I have it working and it works well – as much as I understand the Akoma Ntoso proposal. I’m still not totally crazy about overloading identifier semantics on top of location semantics though. At least the technology support is better in place.
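
To make the idea of a URL that is treated as an identifier more concrete, here is a minimal sketch in Python. It is not the actual Akoma Ntoso syntax and not the code in my editor; the path layout (jurisdiction/type/year/number/version), the `RepositoryQuery` type, and the `url_to_query` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class RepositoryQuery:
    """Hypothetical query record handed to a document repository."""
    jurisdiction: str
    doc_type: str
    year: str
    number: str
    version: Optional[str] = None  # None is taken to mean "latest version"


def url_to_query(url: str) -> RepositoryQuery:
    """Read the URL path as an identifier rather than a file location.

    Assumed layout: /<jurisdiction>/<doc_type>/<year>/<number>[/<version>]
    """
    parts = [p for p in urlsplit(url).path.split("/") if p]
    if len(parts) not in (4, 5):
        raise ValueError(f"unrecognized identifier path: {url!r}")
    jurisdiction, doc_type, year, number = parts[:4]
    version = parts[4] if len(parts) == 5 else None
    return RepositoryQuery(jurisdiction, doc_type, year, number, version)
```

A handler built this way never looks for a file at the requested path. It only parses the path into fields, and the repository decides which stored document (or composition of documents) answers the query.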

So what issues have I struggled with?

First of all, none of the proposals seem to adequately address how you deal with portions of documents. There are many issues in this area. The biggest of course is the inherent ambiguity within legislative documents. As Tom mentioned in his post, duplicate numbering often occurs. There are usually perfectly good and valid reasons for this – such as different operational conditions. But sometimes these are simply errors that everyone has to accommodate. Being able to specify the information necessary to resolve the ambiguity is not in any proposal I have seen. Add to that the temporal issues that come with renumbering actions. How do you refer to something that is subject to amendment and renumbering? Do you want a reference to specific wording at a specific point in time, or do you want your reference to track with amendments and renumbering?

At this point people often ask me why a hash identifier followed by a cleverly designed element id won’t work. The first thing you have to realize is that the # means something to the effect of “retrieve the document and then scroll to that Id”. The semantic I am looking for is “retrieve the document fragment located at Id”. The importance of the difference becomes obvious when you realize that the client browser holds the “#” part of the request and all the server sees is the document URL, minus the hash and identifier. When your document is a thousand pages long and all you want is a single section, that distinction is quite important. Secondly, managing ids across renumbering actions is very messy and introduces as many problems as it solves.
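
The point about the “#” can be shown with a few lines of Python using only the standard library. This is a small sketch, assuming made-up example URLs and two hypothetical helper names, `request_target` and `client_side_fragment`; it mimics what an HTTP client puts in the request line and what it keeps to itself.

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Return the path (and query) an HTTP client sends to the server.

    The fragment after '#' is never part of this value.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


def client_side_fragment(url: str) -> str:
    """Return the part after '#', which stays in the client."""
    return urlsplit(url).fragment


# Hypothetical URLs: one addresses a section with a fragment,
# the other puts the section into the path itself.
with_hash = "http://example.org/code/title-1#sec-1234"
with_path = "http://example.org/code/title-1/sec-1234"

print(request_target(with_hash))        # what the server sees for the hash form
print(client_side_fragment(with_hash))  # what only the browser knows
print(request_target(with_path))        # what the server sees for the path form
```

With the hash form the server can only return the whole of title-1, because the section id never reaches it. With the path form the section id arrives in the request, so the server has the chance to return just that fragment.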

Secondly, the referencing mechanism tends to be document oriented. Certainly, Akoma Ntoso uses virtual URL identifiers to refer to much more than simple documents, but the whole approach gets cumbersome and hard to explain. (If you want to appreciate this, try to explain XML Schema’s namespace URI/URL concept to an uninitiated developer.) What’s more, it’s not clear if a common URL mechanism does enough to establish common enough practices for the effort to be useful. For instance, what if I want to refer to the floor vote after the second reading in the Assembly? Is there a reference to that? In California there is. That’s because the results of that vote are reported as a document. But there is nothing that says this should be the case. I have had the need to interrelate a number of ancillary documents with the legislation. How to do that in a consistent way is not all that clear cut.

The third problem is user acceptance. The URN:Lex proposal, in particular, looks quite daunting. It uses lots of characters like @, $, and ;. While end users can be shielded from some of this, my experience has taught me that even software developers rebel against complexity they can’t understand or appreciate. So far, this has been a struggle.

I’m eagerly awaiting Part 2 of Tom’s post on identifiers. It’s a great subject to explore.

4 thoughts on “Common Identifiers or a Common Data Format. What is more important?”

  1. Grant: It’s interesting to read about these issues from the perspective of server-side developers.

    In building out law support for Zotero + CSL, I’ve assumed that when a target is referenced (for statutes, that will mean a section-level block), the client will capture the text together with metadata necessary to construct a human-readable citation. Thinking out loud about workflows, the item itself can (and probably will) fill the role of an initial source target for readers that want quick access to the cited provision, since it contains the target text and a link back to the original repository, and can be exposed to the Web.

    The problem unwinds back to identifiers though, doesn’t it, since a mechanism for assuring the veracity of the attached text is needed, and that in turn will depend on having an identifier that can be tied to a hash issued at the point of publication.

    For what it’s worth, I’ve adopted a scheme based on urn:lex for identifying jurisdictions in the extended CSL styles that will be used with Multilingual Zotero. That part of the schema is straightforward and uncluttered. It was also the only scheme I could find that provided for identifying rule-making bodies that are not nation-states.

    1. grantcv1 says:

      Hey Frank, nice to hear from you. The approach I adopted at http://Legix.info is also URN:Lex inspired, plus a bit of Akoma Ntoso’s ideas, plus whatever else I needed to solve my specific issues. The result is a bit of a hybrid scheme that ends up supporting both URN and URL notations at the same time. Now all I need to do is make sure I can support IRIs (Internationalized Resource Identifiers) and I have all bases covered.

  2. Some great issue-spotting here. I’m hoping that, in my own posts over at Metasausage, I’ll eventually get around to spelling out all the URI design issues — which are many. The use of hash-based fragment identifiers for dereferenceable URIs is surely not going to play well with their use for labelled subdocuments. There are, I think, a few things that it would help us to keep in mind.

    1) There is no “Highlander Rule” that says “there can be only one” when it comes to identifiers. What we have to do as SW publishers is guarantee that there is *some* unique identifier and that it is dereferenceable to the object itself. It can be one among many URIs that would lead us to the same thing.
    2) Most of those many URIs that apply to the same object would simply be accessors that we use to represent or mimic the place of our particular object in a hierarchy of collections of objects. On that view, the path information in a URI represents a series of nested collections (which in turn implies that each path component represents some identifiable collection as well).

    Once you wrap your head around that, the need for multiple URIs becomes obvious, because different use cases demand different nestings (e.g. sorted by chronology, by committee, by subject, etc., in different layerings). The idea that you would take a string that quacks like a URI but insist that it’s now an opaque unique identifier because the slashes are no longer meaningful, somehow, is asking for trouble; it flies in the face of what users have “learned in their fingers” when it comes to navigating collections (e.g. by truncating a URI in the browser bar when wanting to see the collection that contains whatever it is that you’re looking at now).

    I’ll stop there in order to avoid writing the upcoming blog post here; there are also complicated issues that have to do with what we use for a dereferencing scheme and what we do with versioning and whether or not we think that applying FRBR concepts results in any utility whatsoever (I don’t happen to think it does)….
    t.

    1. grantcv1 says:

      I have a very simple case where there is a need for multiple identifiers. California Chapter 10 of 2011 is also Version 95 of Senate Bill 78 of the 2011-2012 General Session (our versioning counts down from 99 rather than the more traditional count up from 1) as well as the most recent version. I set up the location resolution service to accept the URI specified in several ways to end up at the same location.

      But my big dream is to build on the notion of location brokers that would be built to accept identifier tokens from anywhere in the world and would return one or more locations where that law could be retrieved. I used to dabble in concepts like this with CORBA back in my CAD days. A common identifier scheme would be a key first step towards this goal.
