Uncategorized

Xcential is a Change Management Company

At Xcential, we typically describe ourselves as a legislative technology company. While that is correct, the true answer is more nuanced than that. We purposefully don’t solve problems that are mainstream and relatively easily solved by other off-the-shelf software. Instead, we say that we focus on drafting but, in saying that, we understate what we do. In practice, we focus on a very complex and high-value problem called change management — as it relates to legislation. Few people truly know how to solve this problem.

Twenty years ago, the founders of Xcential worked at an XML database company that was a subsidiary of Xerox. We started Xcential because we thought the legislation was one of the best applications for XML we had ever come across. It was the change management aspects that fascinated me, in particular. While my knowledge of legislation was based on high school civics class, I had a lot of experience in the field of change management.

At the start of my career, I was an electronics design engineer at the Boeing Company. While there, I worked on a very sophisticated form of change management — concurrent fault simulation of behavioral representations of electronic systems. Fault simulation is a deliciously complex differencing problem. In legislation, we think of changes as amendments to the text and we record them as insertions and deletions. In fault simulation, the changes aren’t textual, they are behavioral. We record those changes as observable differences from expected results in something called a fault dictionary. With this dictionary of simulated faults, you are able to backtrack to predict which likely faults are causing the problem.

While managing amendments and managing faults in an electronic system might seem a world apart, algorithmically they are surprising similar. In an amended bill, the objective is to efficiently record changes to a document as deltas (differences) recorded inline within the original text. When simulating an electronic system, the objective is to record thousands of potential failures as shadow circuits (differences) against a single good simulation executing concurrently. The shadow circuits, while a dynamic part of a simulation run, are very analogous to the changes recorded in a document. It’s a very clever techniques for efficiently simulating the behavior of thousands of things that might go without having to run thousands of individual simulations.

Getting my head around the complexities of concurrent fault simulation taught me how to think in a world of asynchronous recursion — electronic systems are inherently asynchronous. Complex recursion in legislative documents is something I must frequently wrestle with, from parsing and responding to complex requests for documents or parts of documents in the URL Resolver to managing the layers of sets of changes that exist in the U.S. Code as laws are amended.

Change management has a lot of applications — not just in managing faults in an electronic circuit or amendments in legislation. Another project at Boeing that I was not directly involved with involved allowing every airliner coming off the assembly line to have it’s own unique document configuration that would evolve through the thirty or so years the aircraft was in service. So many possibilities…

Standard
Process, technology, Uncategorized

GitHub Copilot — Is it the future?

Several months ago, I got admitted to the GitHub Copilot preview. For those of you who don’t know what Copilot is, it’s a AI-based plugin to Visual Studio Code that helps you by suggesting code for you to type. If you like, the suggestion, you hit tab, and on you go.

Join the GitHub Copilot waitlist · GitHub

It may sound like magic, and in some ways, it does seem like that. Apparently, it learns the vast base of open-source code found in the GitHub repositories. This, of course, has led to the inevitable charges that it violates fair use of that code and even that it will ultimately replace developer’s jobs much as factory automation has replaced workers. From my experience, this is more about sensationalism than anything real to worry about.

In my recent posts, I’ve covered the DIKW pyramid. It seems we’ve been stuck in the information layer for a long time, only barely touching the knowledge layer in very rudimentary ways. Yes, there are tools like Siri and Alexa which claim to be AI-based virtual assistants, but they just feel like a whole bunch or complicated programming to achieve something that is more annoying than helpful. There is Tesla Copilot for self-driving cars, but that just seems scary to me. (Full disclosure: I don’t even trust cruise control) To me, GitHub copilot is the first piece of software that truly seems to drive deep into the knowledge layer and even reach the wisdom layer. It’s truly simulating some sort of real smartness.

While the sensationalists love to make it seem that Copilot is lifting code from other people’s work and offering it up as a suggestion, I’ve seen nothing whatsoever that suggests that that is what it is doing. Instead, it truly seems to understand what I am doing. It makes suggestions that could only come from my code. It uses my naming conventions, coding standards, and even my coding style. It seems to have analyzed enough of the code base in my application to understand what local functions and libraries it could draw upon. The code it synthesizes are obviously built on templates that it’s derived by learning. But those templates aren’t just copies of other people’s work. This is how synthesis works in the CAD world I come from (actually, it’s a bit more sophisticated that the synthesis I knew in CAD many years ago) and this is a natural next step in coding technologies.

I’ve been experimenting with what Copilot can do — how far reaching its learning seems to be. It’s able to help me writing JavaScript. What it is able to suggest is remarkable. However, coding assistance is not its only trick. It even helps with writing comments — sometimes with a bit of an attitude too. Last week I was adding a TODO: comment into the loader part of LegisPro to note that it needed to be modernized. Copilot’s unsolicited suggestion for my comment was “Replace the loader with a real loader”. Thanks Copilot. As Han Solo once said, “I’m not really interested in your opinion 3PO”.

Of course, this all leads to the inevitable question. Can it be trained to write legislation? Much to my surprise, it seemingly can. How and why it knows this is completely unknown to me. It’s able to suggest basic amending language and seems to know enough that it can use fragments of quotes from Thomas Jefferson and Benjamin Franklin. I find it incredible that it can even understand the context of legislation and that I did not have to tell it what that context was.

So am I sold on this new technology? Well, yes and no.

It’s not the scary source code stealing and eavesdropping application some would make it out to be. The biggest drawback to it is the same reason I don’t even trust cruise control in my car. It’s not that I don’t trust the computer. It’s that I don’t trust myself to not become lazy and complacent and come to believe the computer is right. I’ve already come across a number of situations where I’ve accepted Copilot’s suggestion without too much thought, only to needlessly wasting hours tracking down a problem that would never have existed if I had actually taken the time to write the code.

It’s an interesting technology, and I believe it’s going to be am important part of how software development evolves in the coming years. But as with all new technologies, it must be adopted with caution.

Standard
Uncategorized

Comparing DOCX to Akoma Ntoso for Legislation

After describing what makes for good legislative XML, I feel I should bring up a favorite topic of mine — why word processors don’t make for good legislative drafting tools.

Lately, we’ve been implementing round tripping tools to allow Akoma Ntoso documents to be imported and exported from Microsoft Word. This is to facilitate migration from a largely office productivity-oriented system to an XML-based one and to allow the exchange of documents with external clients that don’t have access to the internal systems being used to draft and manage legislation. It’s been quite a difficult process. The round-tripping itself has been quite straight forward. Exporting a document is relatively easy and reimporting that exported document, unchanged, isn’t difficult. What is very problematic is trying to ingest documents drafted or extensively edited using a word processor. The DOCX markup quickly becomes a tangled mess. Even when a document looks fine visually, there can be a lot going wrong on the inside, revealing the drafter’s struggle with the word processor to get a document that at least looks right. To avoid the problematic mess, we tend to resort to interpreting the words and discarding the structure and internal metadata entirely. It’s not perfect, but it’s at least manageable.

I’m going to compare the prominent word processing format today, DOCX (well, at least the WordprocessingML part of it) to Akoma Ntoso in respect to how they stack up to each other on my list:

  • Is it semantic?
    DOCX: No, not at all. DOCX is a serialization of the inner workings of Microsoft Word. It makes no attempt to be anything else.
    Akoma Ntoso: Yes, this is the fundamental approach Akoma Ntoso takes.
  • Is the presentation separated from the semantics as much as possible?
    DOCX: No, the presentation is tied directly into the document itself, and what’s more, is very proprietary.
    Akoma Ntoso: Yes, although you can apply presentation directly inline in cases, such as tables, where necessary.
  • Is all the text (excluding any metadata section) in the natural reading order?
    DOCX: Yes, for the most part.
    Akoma Ntoso: Yes, for the most part.
  • Does it, to the fullest extent possible, avoid the use of generated text?
    DOCX: No, and this is one of the most frustrating and infuriating parts of working with DOCX.
    Akoma Ntoso: Mostly, but it doesn’t preclude practices that ensure this rule is followed.
  • Is every provision that needs data associated with it permanently identifiable?
    DOCX: Mostly.
    Akoma Ntoso: Yes, via the @wId or the @GUID attributes.
  • Is every provision that is referred to easily locatable?
    DOCX: Not without extensive customization.
    Akoma Ntoso: Yes, via a standardized locator mechanism using the @eId/@wId attributes.
  • If the XML schema is for general use, is there an extensible way to add missing constructs?
    DOCX: No, unless you regard styling as your constructs (a bad idea) or want a complex customization task.
    Akoma Ntoso: Yes, via the seven elements found in the generic model.
  • Is there an extensible metadata mechanism?
    DOCX: Yes, but it’s complicated.
    Akoma Ntoso: Yes, but it’s complicated.
  • Does it provide the facilities necessary to automate according to modern expectations?
    DOCX: No, the presentation oriented structure of DOCX does little to enable downstream automation.
    Akoma Ntoso: Yes, Akoma Ntoso encourages a hierarchical content structure that is ideal for downstream automation.

Of course, Akoma Ntoso looks a lot better for legislative documents than does DOCX files. That should be no surprise — Akoma Ntoso is purpose-built for legislation while DOCX is a general purpose document model intended for no single purpose. But it is also fundamentally very different. While Akoma Ntoso is designed to be in modern standards-based document information model for legislation, DOCX is a serialization of the archaic data structures that exist within Microsoft Word. DOCX reflects the proprietary inner workings of Microsoft Word rather than the semantic meanings to be found within a document.

Akoma Ntoso has its drawbacks too. It’s complex, a bit academic, and has to span a very broad range of legal traditions make it a good fit for most legislative traditions, but a perfect fit for none.

Standard
Akoma Ntoso, Standards, technology, Uncategorized

What is Good Legislative XML?

I’m often asked what make on XML model better than another when it comes to representing laws and regulations. Just because a document is modeled in XML does not mean that it is useful in that form — the design of the schema matters in terms of what it enables or facilitates.

We have a few rules of thumb that we apply when either designing or adopting an XML schema:

  • Is it semantic?
    Reason: In order to process the information in a document, you have to understand what it is and what it means.
  • Is the presentation separated from the semantics as much as possible?
    Reason: We have moved beyond paper and nowadays it’s important to present information in form factors that just don’t suit the legacy constraints imposed by printing paper.
  • Is all the text (excluding any metadata section) in the natural reading order?
    Reason: The simplest way to present and process the text in a document is in the reading order of the text. This is particularly important is the presentation is to be added to the XML using simple CSS styling (as opposed to HTML transformation) and when the text is subject to complex amending instructions.
  • Does it, to the fullest extent possible, avoid the use of generated text?
    Reason: Similar to the last rule, it’s important for text to be displayed or amended when that text is represented. Generating text opens up a can of worms which can require sophisticated additional processing. Also, from a historical record of the text, which is essential for enacted law, having part of the text be generated by an external algorithm requires that the algorithm itself become part of the permanent record.
  • Is every provision that needs data associated with it permanently identifiable?
    Reason: With modern automation comes the need to not only manage the text of a provision but also state information. For example, is the current status of the provision pending, effective, repealed, or spent? While some of the metadata might be stored with the XML representation of the provision itself, sometimes it is better to store that metadata in a separate part of the document or in an external database. In these cases, it’s important to be able to permanently associate this external metadata with the provision — and this usually requires an immutable (permanent) identifier.
  • Is every provision that is referred to easily locatable?
    Reason: Laws are full of references (or citations). These are to provisions within the same document or to other documents or provisions within those documents. There needs to be a way to accurately and efficiently traverse and process these references. This need usually requires a locating identifier that an unambiguously identify the provision being referred to.
  • If the XML schema is for general use, is there an extensible way to add missing constructs?
    Reason: It is easy to claim to support all the legal traditions in the word, but extremely difficult to do so. While legal traditions are remarkably similar around the world, it’s impossible to predict every single construct that will arise — especially with documents data back hundreds of years. There has to be a way to implement constructs that don’t intrinsically exist within the base XML schema.
  • Is there an extensible metadata mechanism?
    Reason: A primary objective for representing a legislative or regulatory document in XML is for the processability it enables. This invariably means a need to record extensive metadata about the provisions found within the document. As the automation possibilities are endless, there needs to be a way to model and record the metadata that is generated.
  • Does it provide the facilities necessary to automate according to modern expectations?
    Reason: Some structure facilitate automation while others do not. For instance, flat structures can simplify the drafting process, but also make the automation process more difficult. It’s usually better to implement hierarchical structures and then hide the drafting complexity that creates with richer tools.

Standard
Process, technology, Track Changes

Moving on Up to Document Synthesis

In my last blog, I discussed the DIKW pyramid and how the CAD world has advanced through the layers while the legal profession was going much slower. I mentioned that design synthesis was my boss Jerry’s favorite topic. We would spend hours at his desk in the evening while he described his vision for design synthesis — which would become the norm in just a few years.

Jerry’s definition of design (or document) synthesis was quite simple — it was the processing of the information found in one document to produce or update another document where that processing was not simple translation. In the world of electronic design, this meant writing a document that described the intended behavior of a circuit and then having a program that would create a manufacturable design using transistors, capacitors, resistors, etc. from the behavioral description. In the software world, we’ve been using this same process for years, writing software in a high-level language and then compiling that description into machine code or bytecode. For hardware design, this was a huge change — moving away from the visual representation of a schematic to a language-based representation similar to a programming language.

In the field of legal informatics, we already see a lot of processes that touch on Jerry’s definition of document synthesis. Twenty years ago, it was seeing how automatable legislation could be, but wasn’t, that convinced me that this field was ready for my skills.

So what processing do we have that meets this definition of document synthesis:

  • In-context amending is the most obvious process. Being able to process changes recorded in a marked up proposed version of a bill to extract and produce a separate amending document
  • Automated engrossing is the opposite process — taking the amending instructions found in one document to automatically update the target document.
  • Code compilation or statute consolidation is another very similar process, applying amending language found in the language of a newly enacted law to update pre-existing law.
  • Bill synthesis is a new field we’ve been exploring, allowing categorized changes to the law to be made in context and then using those changes and related metadata to produce bills shaped by the categorization metadata provided.
  • Automated production of supporting documents from legislation or regulations. This includes producing documents such as proclamations which largely reflect the information found within newly enacted laws. As sections or regulations come into effect, proclamations are automatically published enumerating those changes.

In the CAD world, the move to design synthesis required letting go of the visually rich but semantically poor schematic in favor of language-based techniques. Initially there was a lot of resistance to the idea that there would no longer be a schematic. While at University, I had worked as a draftsman and even my dad had started his career as a draftsman, so even I had a bit of a problem with that. But the benefits of having a rich semantic representation that could be processed quickly outweighed the loss of the schematic.

Now, the legislative field is wrestling with the same dilemma — separating the visual presentation of the law, whether on paper or in a PDF, from the semantic meaning found within it. Just as with CAD, it’s a necessary step. The ability to process the information automatically dramatically increases the speed, accuracy, and volume of documents that can be processed — allowing information to be produced and delivered in a timely manner. In our society where instant delivery has become the norm, this is now a requirement.

Standard
Process, Standards, Transparency, W3C

Connected Information

As a proponent of XML for legislation, I’m often asked why an XML approach is better than a more traditional approach using a word processor. The answer is simple – it’s all about connected information.

The digital end point in a legislative system can no longer be publication of PDFs. PDFs are nothing but a kludgy way to digitize paper — a way to preserve the old traditions and avoid the future. Try reading a PDF on a cell phone and you see the problem. Try clicking on a citation in a PDF and you see the problem. Try and scrape the information out of a PDF to make it computer readable and you see the problem. The only useful function that PDFs serve is as a bridge to the past.

The future is all about connected information — breaking the physical bounds of what we think of as a document and allowing the nuggets of information found within them to be connected, interrelated, and acted upon. This is the real reason why the future lies with XML and its related technologies.

In my blog last week I provided a brief glimpse into how our future amending tools will work. I explored how legislation could be managed similar to how software is managed with GitHub. This is an example of how useful connected information becomes. Rather than producing bills and amendments as paper documents, the information is stored in a way that it can be efficiently and accurately automated — and made available to the public in a computer readable way.

At Xcential, we’re building our new web-based authoring system — LegisPro. If you take a close look at it, you’ll see that it has two main components. Of course, there is a robust XML editor. However, at the system’s very heart is a linking system — something we call a resolver. It’s this resolver where the true power lies. It’s an HTTP-based system for managing all the linkages that exist in the system. It connects XML repositories, external data sources, and even SQL databases together to form a seamless universe of connected information.

We’re working hard to transform how legislation, and indeed, all government information is viewed. It’s not just about connecting laws and legislation together through simple web links. We talking about providing rich connections between all government information — tying financial data to laws and legislation, connecting regulatory information together, associating people, places, and things to government data, and on and on. We have barely started to scratch the surface, but it’s clear that the future lies with connected information.

While we today position LegisPro as a bill authoring system — it’s much more than that. It’s some of the fundamental underpinnings necessary for a system to transform government documents of today into the connected information of tomorrow.

Standard
Uncategorized

LegisPro edit will soon be ready for beta!

Our new rulemaking LegisProedit is coming along nicely. It’s a web-based XML drafting tool specifically designed for the rigors of rulemaking tasks such as legislative bill drafting. It supports both the Akoma Ntoso and the USLM legislative models and can be customized to support any other model if necessary.

This past week I gave a demonstration of it at the LEX US Summer School at George Mason University in Washington D.C. With trepidation, I allowed everyone to have a hands-on experience with it as I provided guidance. This was the first time the editor had been used by anyone outside of Xcential and the first time we had stressed server performance. While certainly not glitch free, the editor exceeded my expectations for this point in the development process and all went well. It worked!

This week we are talking about the editor at NCSL by way of a screenshot demo that I am sharing here:

The next opportunities to try the editor hands-on will be at the LEX Summer School in Ravenna, Italy next month and we will also be showing it later in the month at NALIT in Sacramento, California.

The QuickStarter beta program is still in the process of being finalized. We are currently envisioning different levels of participation, from basic beta testing to a full-fledged evaluation program for anyone looking to use it or a part of it in an upcoming project.

More information can be found at http://xcential.com/legispro or you can contact us at info@xcential.com.

Standard
Akoma Ntoso, HTML5, LegisPro Web, LEX Summer School, Standards, Transparency

Data Transparency Breakfast, LEX US Summer School 2015, First International Akoma Ntoso Conference, and LegisPro Edit reveal.

Last week was a very good week for my company, Xcential.

We started the week hosting a breakfast put on by the Data Transparency Coalition at the Booz Allen Hamilton facility in Washington D.C.. The topic was Transforming Law and Regulation. Unfortunately, an issue at home kept me away but I was able to make a brief pre-recorded presentation and my moderating role was played by Mark Stodder, our company President. Thank you, Mark!

Next up was the first U.S. edition of the LEX Summer School from Italy. I have attended this summer school every year since 2010 in Italy and it’s great to see the same opportunity for an open dialog amongst the legal informatics community finally come to the U.S. Monica Palmirani (@MonicaPalmirani), Fabio Vitali, and Luca Cervone (@lucacervone) put on the event from the University of Bologna. The teachers also included Jim Mangiafico  (@mangiafico) (the LoC data challenge winner), Veronique Parisse (@VeroParisse) from the European Union, Andrew Weber (@atweber) from the Library of Congress, Kirsten Gullickson (@GullicksonK) from the Office of the Clerk at the U.S. House of Representatives, and myself from Xcential. I flew in for an abbreviated visit covering the last two days of the Summer School where I covered how the U.S. Code is modeled in Akoma Ntoso and gave the students an opportunity to try out our new bill drafting editor — LegisProedit.

After the Summer School concluded, it was followed by the first International Akoma Ntoso Conference on Saturday, where I spoke about the architecture of our new editor as well as how the USLM schema is a derivative of the Akoma Ntoso schema. We had good turnout, from around the world, and a number of interesting speakers.

This week is NCSL in Seattle where we will be discussing our new editor with potential customers and partners. Mark Stodder from Xcential will be in attendance.

In a month, I’ll be in Ravenna once more for the European LEX Summer School — where I’ll be able to show even more progress towards the goal of a full product line of Akoma Ntoso tools. It’s interesting times for me.

The editor is coming along nicely and we’re beginning to firm up our QuickStarter beta plans. I’ve already received a number of requests and will be getting in touch with everyone as soon as we’re ready to roll out the program. If you would like to participate as a beta tester — or if you would just like more information, please contact us at info@xcential.com.

I’m really excited about how far we’ve come. Akoma Ntoso is on the verge of being certified as an official OASIS standard, our Akoma Ntoso products are coming into place, and interest around the world is growing. I can’t wait to see where we will be this time next year.

Standard
Akoma Ntoso, Standards, W3C

Automating Legal References in Legislation

This is a blog I have wanted to write for quite some time. It addresses what I believe to be the single most important issue when modeling information for legal informatics. It is also, I believe, the most urgent aspect that we need to agree upon in order to promote legal informatics as a real emerging industry. Today, most jurisdictions are simply cobbling together short term solutions without much consideration to the big picture. With something this important, we need to look at the big picture first and come up with a lasting solution.

Citations, references, or links are a very important aspect of the law. Laws are inherently a web of interconnections and interdependencies. Correctly resolving those connections allows us to correctly interpret the law. Mistakes or ambiguities in how those connections are made is completely unacceptable.

I work on projects around the world as well as my work on the OASIS LegalDocumentML technical committee. As I travel to the four corners of the Earth, I am starting to see more clearly how this problem can be solved in a clean and extensible manner.

There are, of course, already many proposals to address this. The two I have looked at the most are both from Italy:
A Uniform Resource Name (URN) Namespace for Sources of Law (LEX)
Akoma Ntoso References (in the process of being standardized by OASIS)

My thoughts derive from these two approaches, both of which I have implemented in one way or another, with varying degrees of success. My earliest ideas were quite similar to the LEX-URN proposal by being based around URNs. However, with time Fabio Vitali at the University of Bologna has convinced me that the approach he and Monica Palmirani put forth with Akoma Ntoso using URLs is more practical. While URNs have their appeal, they really have not achieved critical mass in terms of adoption to be practical. Also, the general reaction I have gotten with LEX-URN encoded references has not been positive. There is just too much special encoding going on within them for them to be readable by the uninitiated.

Requirements

Before diving into this subject too deep, let’s define some basic requirements. In order to be effective, a reference must:
• Be unambiguous.
• Be predictable.
• Be adaptable to all jurisdictions, legal systems, and all the quirks that arise.
• Be universal in application and reach.
• Be implementable with current tools and technologies.
• Be long lasting and not tied to any specific implementation
• Be understandable to mere mortals like myself.

URI/IRI

URIs (Uniform Resource Identifiers) give us a way to identify resources in a computing system. We’re all familiar with URLs that allow us to retrieve pages across the web using hierarchical locations. Less well known are URNs which allow us to identify resources using a structured name which presumably will then be located using some form of a service to map the name to a location. The problem is, a well-established locating service has never come about. As a result, URNs have languished as an idea more than a tool. Both URLs and URNs are forms of URIs.

IRIs are a generalization of URIs to allow characters outside of the ASCII character set supported by normal URIs. This is important in jurisdictions that use more complex character than ASCII supports.

Given the current state of the art in software technology, basing references on URIs/IRIs makes a lot of sense. Using the URL/IRL variant is the safer and more universally accepted approach.

FRBR

FRBR is the Functional Requirements for Bibliographical Records. It is a conceptual entity-relationship model developed by librarians for modeling bibliographic information in databases. In recent years it has received a fair amount of attention for use as the basis for legal references. In fact, both the LEX-URN and the Akoma Ntoso models are based, somewhat loosely, on the model. At times, there is some controversy as to whether this model is appropriate or not. My intent is not to debate the merits of FRBR. Instead, I simply want to acknowledge that it provides a good overall model for thinking about how a legal reference should be constructed. In FRBR, there are four main entities:
1. Work – The work is the “what”, allowing us to specify what it is that we are referring to, independent of which version or format we are interested in.
2. Expression – The expression answers the “from when” question, allowing us to specify, in some manner, which version, variant, or time frame we are interested in.
3. Manifestation – The manifestation is the “which format” part, where we specify the format that we would like the information returned as.
4. Item – The item finally allows us to specify the “from where” part, when multiple sources of the information are available, that we want the information to come from.

That’s all I want to mention about FRBR. I want to pick up the four concepts and work from them.

What do we want?

Picking up the Akoma Ntoso model for specifying a reference as a URL, and mindful of our basic requirements, a useful model to reference a resource is as a hierarchical URL, starting by specifying the jurisdiction and then working hierarchically down to the item in question.

This brings me to the biggest hurdle I have come across when working with the existing proposals. It’s not terribly clear what a reference should be like when the item being referenced is a sub-part of a resource being modeled as an XML document. For instance, how would I refer to section 500 of the California Government Code? Without putting in too much thought, the answer might be something like /us-ca/codes/gov.xml#sec500, using a URL to identify the Government Code followed by a fragment identifier specifying section 500 of the Government Code. The LEX URN proposal actually suggests using the # fragment identifier, referring to the fragment as a partition. There are two problems with this solution though. First, any browser will interpret a reference using the fragment identifier as two parts – the part before the # fragment identifier showing the resource to be retrieved from the server and the part after the fragment identifier as an “id” to the item to scroll to. Retrieving the huge Government code when all we want is the one sentence in Section 500 is a terrible solution. The second problem is that it defines, possibly for all time, how a large document might have been constructed out of sub-documents. For example, is the US Code one very large document, does it consist of documents made out of the Titles, or as it is quite often modeled, is every section a different document? It would be better if references did not capture any part of this implementation decision. A better approach is to allow the “what” part of a reference to be specified as a virtual URL all the way down to whatever is wanted, even when the “what” is found deep inside an XML document in a current implementation. For example, the reference would better be specified as /us-ca/codes/gov/sec500. We’re not exposing in the reference where the document boundaries currently exist.

On to the next issue, what happens when there is more than one possible way to reference the same item? For example, the sections in California’s codes, as is usually the case, are numbered sequentially with little regard to the heading hierarchy above the sections. So a reference specified as /us-ca/codes/gov/sec500 is clear, concise, and unambiguous. It follows the manner in which sections are cited in the text. But /us-ca/codes/gov/title1/div3/chap6/sec500 is simply another way to identify the exact same section. This happens in other places too. /us-ca/statutes/2012/chap5 is the same document as /us-ca/bills/2011/sb730. So two paths identify the same document. Do we allow two identities? Do we declare one as the canonical reference and the other as an alternate? It’s not clear to me.

What about ambiguity? Mistakes happen and odd situations arise. Take a look at both Chapter 14s that exist in Division 6 of Title 1 of the California Government Code. There are many reasons why this happens. Sometimes it’s just a mistake and sometimes it’s quite deliberate. We have to be able to support this. In California, we disambiguate by using “qualifying language” which we embed somehow into the reference. The qualifying language specifies the last statute to create or amend the item needing disambiguation.

The From When do we want it?

A hierarchical path identifies, with some disambiguation, what it is we want. But chances are that what we want has varied over time. We need a way to specify the version we’re looking for or ask for the version that was valid at a specific point in time. Both the LEX URN and the Akoma Ntoso proposals for references suggest using an “@” sign around some nomenclature which identifies a version or date. (The Akoma Ntoso proposal adds the “:” sign as well)

A problem does arise with this approach though. Sometimes we find that multiple versions exist at a particular date. These versions are all in effect, but based on some conditional logic, only one might be operational at a particular time. How one deals with operational logic can be a bit tricky at times. That’s an open issue to me still.

Which Format do we want?

I find specifying the format to be relatively uncontroversial. The question is whether we specify the format using well established prefixes such as .pdf, .odt, .docx, .xml, and .html or whether we instead try to be more precise by embedding or encoding the MIME type into the reference. Personally, I think that simple extensions, while less rigorous and subject to unfortunate variations and overlaps, offer a far more likely to be adopted approach than trying to use the MIME type somehow. Simple generally wins over rigorous but more complex solutions.

The From Where should it come?

This last part, the from where should it come part, is something that is often omitted from the discussion. However, in a world where multiple libraries offering the same resource will quite likely exist, this is really important. Let’s take a look at the primary example once more. We want section 500 of the California Government Code. The reference is encoded as /us-ca/codes/gov/sec500. Where is this information to come from? Without a domain specified, our URL is a local URL so the presumption is that it will be locally resolved – the local system will find it, somehow. What if we don’t want to rely on a local resolution function? What if there are numerous sources of this data and we want to refer to one of them in particular. When we prepend the domain, aren’t we specifying from where we want the information to come from? So if we say http: //leginfo.ca.gov/us-ca/codes/gov/sec500, aren’t we now very precisely specifying the source of the information to be the official California source? Now, say the US Library of Congress decides to extend Thomas to offer state legislation. If we want to specify that copy, we would simply construct a reference as http: //thomas.loc.gov/us-ca/codes/gov/sec500. It’s the same URL after the domain is specified. If we leave the URL as simply /us-ca/codes/gov/sec500, we have a general reference and we leave it to the local system to provide the resolution service for retrieving and formating the information. We probably want to save references in a general fashion without a domain, but we certainly will need to refer to specific copies within the tools that we build.

Resolvers

The key to making this all work is having resolvers that can interpret standardized references and find a way to provide the correct response. It is important to realize that these URLS are all virtual URLs. They do not necessarily resolve to files that exist. It is the job of the resolving service to either construct the valid response, possibly by digging into database and files, or to negotiate with other resolvers that might do all or part of the job of providing a response. For example, imagine that Cornell University offers a resolver at http: //lii.cornell.edu. It might, behind the scenes, work with the official data source at http: //leginfo.ca.gov to source California legislation. Anyone around the world could use the Cornell resolver and be unaware of the work it is doing to source information from resolvers at the official sources around the world. So the local system would be pointed to the Cornell service and when the reference /us-ca/codes/gov/sec500 arose, the local system would defer to the LII service for resolution which in turn would defer to California’s official resolver. In this way, the resolvers would bear the burden of knowing where all the official data sources around the world are located.

Examples

So to end, I would like to sum up with some examples:

[Note that the links are proposals, using a modified and simplified form of the Akoma Ntoso proposal, rather than working links at this point]

/us-ca/codes/gov/sec500
– Get section 500 of the California Government Code. It’s up to the local service to decide where and how to resolve the reference.

http: //leginfo.ca.gov/us-ca/codes/gov/sec500
– Get Section 500 of the California Government Code from the official source in California.

http: //lii.cornell.edi/us-ca/codes/gov/sec500
– Get Section 500 of the California Government Code from Cornell’s LII and have them figure where to get the data from

/us-ca/codes/gov/sec500@2012-01-01
– Get Section 500 of the California Government Code as it existed on January 1, 2012

/us-ca/codes/gov/sec500@2012-01-01.pdf
– Get Section 500 of the California Government Code as it existed on January 1, 2012, in a PDF format

/us-ca/codes/gov/title1/div3/chap6/sec500
– Get Section 500 of the California Government Code, but the fully hierarchy is specified

My blog has gotten very long and I have only just started to scratch the surface. I haven’t addressed multilingual issues, alternate character sets, and a host of other issues at all. It should already be apparent that this is all simply a natural extension of the URLs we already use, but with sophisticated services underneath resolving to items other than simple files. Imagine for a moment how the field of legal informatics could advance if we could all agree to something this simple and comprehensive soon.

What do you think? Are there any other proposals, solutions, or prototypes out there that addresses this? How does the OASIS legal document ML work factor into this?

Standard