Connected Information

As a proponent of XML for legislation, I’m often asked why an XML approach is better than a more traditional approach using a word processor. The answer is simple – it’s all about connected information.

The digital end point in a legislative system can no longer be publication of PDFs. PDFs are nothing but a kludgy way to digitize paper, a way to preserve the old traditions and avoid the future. Try reading a PDF on a cell phone and you see the problem. Try clicking on a citation in a PDF and you see the problem. Try to scrape the information out of a PDF to make it computer readable and you see the problem. The only useful function that PDFs serve is as a bridge to the past.

The future is all about connected information — breaking the physical bounds of what we think of as a document and allowing the nuggets of information found within them to be connected, interrelated, and acted upon. This is the real reason why the future lies with XML and its related technologies.

In my blog last week I provided a brief glimpse into how our future amending tools will work. I explored how legislation could be managed in much the same way that software is managed with GitHub. This is an example of how useful connected information becomes. Rather than producing bills and amendments as paper documents, the information is stored in a way that allows it to be efficiently and accurately automated, and made available to the public in a computer readable way.

At Xcential, we’re building our new web-based authoring system, LegisPro. If you take a close look at it, you’ll see that it has two main components. Of course, there is a robust XML editor. However, at the system’s very heart is a linking system, something we call a resolver. This resolver is where the true power lies. It’s an HTTP-based system for managing all the linkages that exist in the system. It connects XML repositories, external data sources, and even SQL databases together to form a seamless universe of connected information.

We’re working hard to transform how legislation, and indeed, all government information is viewed. It’s not just about connecting laws and legislation together through simple web links. We’re talking about providing rich connections between all government information: tying financial data to laws and legislation, connecting regulatory information together, associating people, places, and things with government data, and on and on. We have barely started to scratch the surface, but it’s clear that the future lies with connected information.

While we currently position LegisPro as a bill authoring system, it’s much more than that. It provides some of the fundamental underpinnings necessary for a system to transform the government documents of today into the connected information of tomorrow.

Can GitHub be used to manage legislation?

Every so often, someone suggests that GitHub would be a great way to manage legislation. Usually, we roll our eyes at the naïve suggestion and that is that.

However, there are a good many similarities that do deserve consideration. What if the amending process was supported by a tool that, while maybe not GitHub, worked on the same principles?

My company, Xcential, built the amending solution for the California Legislature, using a process we like to call Amendments in Context. With this process, a proposed revision of a bill is drafted and then the amendments necessary to produce that revision are extracted as an amendment document. That amendment document, which really becomes an enumeration of proposed changes in a report, is then submitted to the committee for approval. If approved, the revised document that was drafted earlier then becomes the next official version of the bill. This process differs from the traditional process in which an amendment document is drafted, itemizing changes to be made. When the committee approves the amendments, there is a mad rush, usually overnight, to implement (or execute) those amendments to the last version in order to produce the next version. Our Amendments in Context automated approach is more accurate and largely eliminates the overnight bottleneck of having to execute approved amendments before the start of business the following day.

Since implementing this system for California, we’ve been involved in a number of other jurisdictions and efforts that deal with the amending process. This has given us quite a good perspective on the various ways in which bill amendments get handled.

As software developers ourselves, we’ve often been struck by how similar the bill amendment process is to the software development process — the very thing that invariably leads to the suggestion that GitHub could be a great repository for legislation. With this all in mind, let’s compare and contrast the bill amending process with the software development process using GitHub.

(We’ll make suitable procedural simplifications to keep the example clear)

BILL AMENDING PROCESS vs. SOFTWARE ENHANCEMENT PROCESS

  1. Begin a proposed amendment / Begin a proposed enhancement
    • Bill: Create a copy of the last version of a bill. In the U.S. and other parts of the world that still use page and line numbers, cleverly annotated page and line number information from the last publication must be included. This copy will be modified to reflect the proposed changes.
    • Software: Create a new software branch. This branch will be modified to implement the proposed enhancement.
    • Bill: Make the proposed changes using redlining, showing the changes as insertions and deletions. Carefully craft the changes to obey the drafting rules and any political sensitivities regarding how the changes are shown.
    • Software: Make the proposed changes to the software, testing and debugging as needed.
    [Figures: redlining (bill) and source code changes (software)]
  2. Generate the amendment / Prepare to commit
    • Bill: The amendment generator examines the redlining (insertions and deletions), carefully grouping changes together to produce a minimized set of amendments. These amendments are expressed in the familiar, at least in the U.S., “on page X, line Y, strike ‘this’ and replace with ‘that’” or something along those lines. (For jurisdictions that don’t use an amendment generator, a manually written amendment document, enumerating the amendments, is the starting point.)
    • Software: A differencing engine compares the source code with the prior version, carefully grouping changes together to produce a minimized set of hunks. If you use a tool such as SourceTree by Atlassian, these hunks are shown as source code with lines to be removed and lines to be inserted.
    [Figures: amendment (bill) and hunks (software)]
    • Bill: Save the amendment document alongside the revised bill with redlining.
    • Software: Commit the changes to GitHub.
  3. Vote on the amendments / Submit for review
    • Bill: The amendment document goes to committee where it is proposed and then either adopted or rejected. The procedures here may differ, depending on the jurisdiction. In California, multiple competing amendment documents (known as instruction amendments) may be proposed at any one time, but only one can be adopted and it is adopted in whole. Other jurisdictions allow multiple amendment documents to be adopted and individual amendments within any amendment document to be adopted or rejected.
    • Software: The review board considers the proposed enhancement and decides whether or not to incorporate it into the next release. They may choose to adopt the entire enhancement or they may choose to adopt only certain aspects of it.
  4. Execute the amendment / Merge into mainline
    • Bill: In California, because only single whole amendments can be adopted, executing an adopted amendment is quite easy: the redlined version of the bill simply becomes the next version. However, in most jurisdictions, this isn’t so easy. Instead, each amendment must be applied to a new copy of the bill, destined to become the next version. Conflicts that arise must be resolved following a prescribed set of procedures.
    • Software: Incorporating an enhancement into the mainline involves a merge of the enhancement branch into the mainline. If an enhancement is not adopted in whole, then approved changes may be cherry picked. When conflicts between different sets of approved enhancements occur, GitHub requires manual intervention to resolve the issues. This process is generally a lot less formal than resolving conflicts in legislation.

So, as you can see, there are a lot of similarities between amending a bill and implementing a software enhancement. The basic process is essentially identical. However, the differences lie in the details.
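To make the parallel concrete, here is a minimal sketch of how a differencing engine's output can be reworded as amendment instructions. It uses Python's standard difflib module. The function name `generate_amendments`, the sample bill text, and the "On line N" wording are illustrative assumptions, not the behavior of any real amendment generator, and page numbers, drafting rules, and change grouping are ignored entirely.

```python
import difflib


def generate_amendments(old_lines, new_lines):
    """Reword the differences between two versions as amendment-style instructions.

    Hypothetical helper: line numbers refer to the prior version, 1-based.
    """
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    amendments = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged text produces no amendment
        struck = " ".join(old_lines[i1:i2])
        inserted = " ".join(new_lines[j1:j2])
        if tag == "replace":
            amendments.append(
                f"On line {i1 + 1}, strike '{struck}' and insert '{inserted}'"
            )
        elif tag == "delete":
            amendments.append(f"On line {i1 + 1}, strike '{struck}'")
        else:  # "insert": new text goes after prior-version line i1 (0 = the start)
            amendments.append(f"After line {i1}, insert '{inserted}'")
    return amendments


# Made-up sample text, one provision per line.
last_version = [
    "The fee is ten dollars.",
    "The fee is due annually.",
    "This act takes effect immediately.",
]
proposed_version = [
    "The fee is twenty dollars.",
    "The fee is due annually.",
]

for amendment in generate_amendments(last_version, proposed_version):
    print(amendment)
```

Each group of differences plays the role of a hunk on the software side and of a single amendment on the legislative side. The hard part in practice is everything this sketch leaves out.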

Git is designed specifically for the software development process. The legislative process has quite a different set of requirements and traditions which must be met. It simply isn’t possible to bend and distort the legislative process to fit the model prescribed by Git. However, that doesn’t mean that something like GitHub is out of the question. What if there was a GitHub for Legislation — a tool with an associated repository, modeled after Git and GitHub, specifically designed for managing legislation?

This example shows the power of adopting XML for drafting legislation. With properly designed XML, legislation becomes a vast store of machine-readable information that can meet the 21st century challenges of accuracy, efficiency, and transparency. We’re not just printing paper anymore — we’re managing digital information.

Data Transparency Breakfast, LEX US Summer School 2015, First International Akoma Ntoso Conference, and LegisPro Edit reveal.

Last week was a very good week for my company, Xcential.

We started the week hosting a breakfast put on by the Data Transparency Coalition at the Booz Allen Hamilton facility in Washington, D.C. The topic was Transforming Law and Regulation. Unfortunately, an issue at home kept me away, but I was able to make a brief pre-recorded presentation and my moderating role was played by Mark Stodder, our company President. Thank you, Mark!

Next up was the first U.S. edition of the LEX Summer School from Italy. I have attended this summer school in Italy every year since 2010, and it’s great to see the same opportunity for an open dialog amongst the legal informatics community finally come to the U.S. Monica Palmirani (@MonicaPalmirani), Fabio Vitali, and Luca Cervone (@lucacervone) put on the event from the University of Bologna. The teachers also included Jim Mangiafico (@mangiafico) (the LoC data challenge winner), Veronique Parisse (@VeroParisse) from the European Union, Andrew Weber (@atweber) from the Library of Congress, Kirsten Gullickson (@GullicksonK) from the Office of the Clerk at the U.S. House of Representatives, and myself from Xcential. I flew in for an abbreviated visit covering the last two days of the Summer School, where I explained how the U.S. Code is modeled in Akoma Ntoso and gave the students an opportunity to try out our new bill drafting editor, LegisPro Edit.

The Summer School was followed on Saturday by the first International Akoma Ntoso Conference, where I spoke about the architecture of our new editor as well as how the USLM schema is a derivative of the Akoma Ntoso schema. We had a good turnout from around the world and a number of interesting speakers.

This week is NCSL in Seattle where we will be discussing our new editor with potential customers and partners. Mark Stodder from Xcential will be in attendance.

In a month, I’ll be in Ravenna once more for the European LEX Summer School, where I’ll be able to show even more progress towards the goal of a full product line of Akoma Ntoso tools. These are interesting times for me.

The editor is coming along nicely and we’re beginning to firm up our QuickStarter beta plans. I’ve already received a number of requests and will be getting in touch with everyone as soon as we’re ready to roll out the program. If you would like to participate as a beta tester — or if you would just like more information, please contact us at info@xcential.com.

I’m really excited about how far we’ve come. Akoma Ntoso is on the verge of being certified as an official OASIS standard, our Akoma Ntoso products are coming into place, and interest around the world is growing. I can’t wait to see where we will be this time next year.

Coming soon!!! A new web-based editor for Akoma Ntoso

I’ve been working hard for a long time — building an all new web-based editor for Akoma Ntoso. We will be showing it for the first time at the upcoming Akoma Ntoso LEX Summer School in Washington D.C.

Unlike our earlier AKN/Editor, this editor is a pure XML editor designed from the ground up using the XML capabilities that modern browsers possess. This editor is much more robust, more precise, and very scalable.

Basic Features

  1. Configurable XML models — including Akoma Ntoso and USLM
  2. Edit full documents or portions of large documents
  3. Flexible selection and editing regardless of XML structure
  4. Built-in redlining (change tracking) supporting textual AND structural changes
  5. Browse document sources with drag-and-drop
  6. Full undo & redo
  7. Customizable attribute editor
  8. Search and replace
  9. Modular architecture to allow for extensive customization

Underlying Technology

  1. XML-based editing component
    • DOM 4 support
    • XPath Support
    • CSS Styling
    • Sophisticated event model
  2. HTTP-based resolver architecture for retrieving documents
    • Interpret citations
    • Dereference URLs
    • WebDAV adaptors to document repositories
    • Query repositories with XQuery or databases with SQL
  3. AngularJS-based User Interface using HTML5
    • Component modules for easy customization
  4. XML repository for storing documents
    • Integrate any XML repository
    • Built-in support for eXist-db
  5. Validation & Publishing
    • XML Schema validator
    • XSL-FO publishing

We’ll reveal a lot more at the LEX Summer School later this month! If you’re interested in our QuickStart beta program, drop me a note at grant.vergottini@xcential.com.

Automating Legal References in Legislation

This is a blog I have wanted to write for quite some time. It addresses what I believe to be the single most important issue when modeling information for legal informatics. It is also, I believe, the most urgent aspect that we need to agree upon in order to promote legal informatics as a real emerging industry. Today, most jurisdictions are simply cobbling together short-term solutions without much consideration for the big picture. With something this important, we need to look at the big picture first and come up with a lasting solution.

Citations, references, or links are a very important aspect of the law. Laws are inherently a web of interconnections and interdependencies. Correctly resolving those connections allows us to correctly interpret the law. Mistakes or ambiguities in how those connections are made are completely unacceptable.

In addition to my work on the OASIS LegalDocumentML technical committee, I work on projects around the world. As I travel to the four corners of the Earth, I am starting to see more clearly how this problem can be solved in a clean and extensible manner.

There are, of course, already many proposals to address this. The two I have looked at the most are both from Italy:
• A Uniform Resource Name (URN) Namespace for Sources of Law (LEX)
• Akoma Ntoso References (in the process of being standardized by OASIS)

My thoughts derive from these two approaches, both of which I have implemented in one way or another, with varying degrees of success. My earliest ideas were quite similar to the LEX-URN proposal by being based around URNs. However, with time Fabio Vitali at the University of Bologna has convinced me that the approach he and Monica Palmirani put forth with Akoma Ntoso using URLs is more practical. While URNs have their appeal, they really have not achieved critical mass in terms of adoption to be practical. Also, the general reaction I have gotten with LEX-URN encoded references has not been positive. There is just too much special encoding going on within them for them to be readable by the uninitiated.

Requirements

Before diving into this subject too deep, let’s define some basic requirements. In order to be effective, a reference must:
• Be unambiguous.
• Be predictable.
• Be adaptable to all jurisdictions, legal systems, and all the quirks that arise.
• Be universal in application and reach.
• Be implementable with current tools and technologies.
• Be long lasting and not tied to any specific implementation.
• Be understandable to mere mortals like myself.

URI/IRI

URIs (Uniform Resource Identifiers) give us a way to identify resources in a computing system. We’re all familiar with URLs that allow us to retrieve pages across the web using hierarchical locations. Less well known are URNs which allow us to identify resources using a structured name which presumably will then be located using some form of a service to map the name to a location. The problem is, a well-established locating service has never come about. As a result, URNs have languished as an idea more than a tool. Both URLs and URNs are forms of URIs.

IRIs are a generalization of URIs to allow characters outside of the ASCII character set supported by normal URIs. This is important in jurisdictions that use characters beyond those that ASCII supports.

Given the current state of the art in software technology, basing references on URIs/IRIs makes a lot of sense. Using the URL/IRL variant is the safer and more universally accepted approach.

FRBR

FRBR is the Functional Requirements for Bibliographic Records. It is a conceptual entity-relationship model developed by librarians for modeling bibliographic information in databases. In recent years it has received a fair amount of attention for use as the basis for legal references. In fact, both the LEX-URN and the Akoma Ntoso models are based, somewhat loosely, on the model. At times, there is some controversy as to whether this model is appropriate or not. My intent is not to debate the merits of FRBR. Instead, I simply want to acknowledge that it provides a good overall model for thinking about how a legal reference should be constructed. In FRBR, there are four main entities:
1. Work – The work is the “what”, allowing us to specify what it is that we are referring to, independent of which version or format we are interested in.
2. Expression – The expression answers the “from when” question, allowing us to specify, in some manner, which version, variant, or time frame we are interested in.
3. Manifestation – The manifestation is the “which format” part, where we specify the format that we would like the information returned as.
4. Item – The item finally allows us to specify the “from where” part, when multiple sources of the information are available, that we want the information to come from.

That’s all I want to mention about FRBR. I want to pick up the four concepts and work from them.

What do we want?

Picking up the Akoma Ntoso model for specifying a reference as a URL, and mindful of our basic requirements, a useful model to reference a resource is as a hierarchical URL, starting by specifying the jurisdiction and then working hierarchically down to the item in question.

This brings me to the biggest hurdle I have come across when working with the existing proposals. It’s not terribly clear what a reference should be like when the item being referenced is a sub-part of a resource being modeled as an XML document. For instance, how would I refer to section 500 of the California Government Code? Without putting in too much thought, the answer might be something like /us-ca/codes/gov.xml#sec500, using a URL to identify the Government Code followed by a fragment identifier specifying section 500 of the Government Code. The LEX URN proposal actually suggests using the # fragment identifier, referring to the fragment as a partition.

There are two problems with this solution though. First, any browser will interpret a reference using the fragment identifier as two parts: the part before the # fragment identifier names the resource to be retrieved from the server, and the part after the fragment identifier is an “id” of the item to scroll to. Retrieving the huge Government Code when all we want is the one sentence in Section 500 is a terrible solution. The second problem is that it defines, possibly for all time, how a large document might have been constructed out of sub-documents. For example, is the US Code one very large document, does it consist of documents made out of the Titles, or, as it is quite often modeled, is every section a different document? It would be better if references did not capture any part of this implementation decision.

A better approach is to allow the “what” part of a reference to be specified as a virtual URL all the way down to whatever is wanted, even when the “what” is found deep inside an XML document in a current implementation. For example, the reference would be better specified as /us-ca/codes/gov/sec500. That way, we’re not exposing in the reference where the document boundaries currently exist.
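The difference between the two styles is easy to see with Python's standard urllib.parse module. This is only an illustration of how a client splits a reference; the helper name `path_sent_to_server` is made up for the example.

```python
from urllib.parse import urlsplit


def path_sent_to_server(reference):
    """Return the path a client would request; the fragment never leaves the client."""
    return urlsplit(reference).path


fragment_style = "/us-ca/codes/gov.xml#sec500"
virtual_style = "/us-ca/codes/gov/sec500"

# With the fragment style, the server is asked for the whole Government Code
# and only the client knows that section 500 was wanted.
print(path_sent_to_server(fragment_style))   # /us-ca/codes/gov.xml
print(urlsplit(fragment_style).fragment)     # sec500

# With the virtual style, the server (or resolver) sees the full "what".
print(path_sent_to_server(virtual_style))    # /us-ca/codes/gov/sec500
```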

On to the next issue, what happens when there is more than one possible way to reference the same item? For example, the sections in California’s codes, as is usually the case, are numbered sequentially with little regard to the heading hierarchy above the sections. So a reference specified as /us-ca/codes/gov/sec500 is clear, concise, and unambiguous. It follows the manner in which sections are cited in the text. But /us-ca/codes/gov/title1/div3/chap6/sec500 is simply another way to identify the exact same section. This happens in other places too. /us-ca/statutes/2012/chap5 is the same document as /us-ca/bills/2011/sb730. So two paths identify the same document. Do we allow two identities? Do we declare one as the canonical reference and the other as an alternate? It’s not clear to me.
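One of the options raised above, declaring a canonical reference and treating the others as alternates, could be sketched as a simple lookup table. Which path is treated as canonical here is an arbitrary assumption made for the example, and `ALTERNATES` and `canonicalize` are hypothetical names.

```python
# Alternate path -> canonical path. The choice of canonical form is an
# assumption for illustration only; the open question remains open.
ALTERNATES = {
    "/us-ca/codes/gov/title1/div3/chap6/sec500": "/us-ca/codes/gov/sec500",
    "/us-ca/bills/2011/sb730": "/us-ca/statutes/2012/chap5",
}


def canonicalize(reference):
    """Map an alternate reference to its canonical form; pass others through."""
    return ALTERNATES.get(reference, reference)


print(canonicalize("/us-ca/codes/gov/title1/div3/chap6/sec500"))
print(canonicalize("/us-ca/codes/gov/sec500"))
```

Both calls print the same path, so two references can be compared for identity after canonicalizing them.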

What about ambiguity? Mistakes happen and odd situations arise. Take a look at both Chapter 14s that exist in Division 6 of Title 1 of the California Government Code. There are many reasons why this happens. Sometimes it’s just a mistake and sometimes it’s quite deliberate. We have to be able to support this. In California, we disambiguate by using “qualifying language” which we embed somehow into the reference. The qualifying language specifies the last statute to create or amend the item needing disambiguation.

The From When do we want it?

A hierarchical path identifies, with some disambiguation, what it is we want. But chances are that what we want has varied over time. We need a way to specify the version we’re looking for or ask for the version that was valid at a specific point in time. Both the LEX URN and the Akoma Ntoso proposals for references suggest using an “@” sign around some nomenclature which identifies a version or date. (The Akoma Ntoso proposal adds the “:” sign as well)
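Asking for the version that was valid at a specific point in time amounts to a lookup in a list of versions ordered by effective date. The sketch below assumes exactly one version is in effect on any given date, which real legislation does not always honor. The version history and the name `version_on` are invented for the example.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical history of one section: (effective date, text), sorted by date.
VERSIONS = [
    (date(2009, 1, 1), "text as enacted"),
    (date(2011, 7, 1), "text as first amended"),
    (date(2013, 1, 1), "text as second amended"),
]


def version_on(versions, when):
    """Return the text in effect on `when`, or None if nothing was yet in effect."""
    effective_dates = [effective for effective, _ in versions]
    index = bisect_right(effective_dates, when)  # count of versions effective by `when`
    return versions[index - 1][1] if index else None


# A request for "...@2012-01-01" would carry this date.
print(version_on(VERSIONS, date.fromisoformat("2012-01-01")))  # text as first amended
```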

A problem does arise with this approach though. Sometimes we find that multiple versions exist at a particular date. These versions are all in effect, but based on some conditional logic, only one might be operational at a particular time. How one deals with operational logic can be a bit tricky at times. That’s an open issue to me still.

Which Format do we want?

I find specifying the format to be relatively uncontroversial. The question is whether we specify the format using well-established extensions such as .pdf, .odt, .docx, .xml, and .html or whether we instead try to be more precise by embedding or encoding the MIME type into the reference. Personally, I think that simple extensions, while less rigorous and subject to unfortunate variations and overlaps, offer an approach that is far more likely to be adopted than trying to use the MIME type somehow. Simple generally wins over rigorous but more complex solutions.

The From Where should it come?

This last part, the “from where should it come” part, is something that is often omitted from the discussion. However, in a world where multiple libraries offering the same resource will quite likely exist, this is really important. Let’s take a look at the primary example once more. We want section 500 of the California Government Code. The reference is encoded as /us-ca/codes/gov/sec500. Where is this information to come from? Without a domain specified, our URL is a local URL, so the presumption is that it will be locally resolved: the local system will find it, somehow. What if we don’t want to rely on a local resolution function? What if there are numerous sources of this data and we want to refer to one of them in particular? When we prepend the domain, aren’t we specifying where we want the information to come from? So if we say http: //leginfo.ca.gov/us-ca/codes/gov/sec500, aren’t we now very precisely specifying the source of the information to be the official California source? Now, say the US Library of Congress decides to extend Thomas to offer state legislation. If we want to specify that copy, we would simply construct a reference as http: //thomas.loc.gov/us-ca/codes/gov/sec500. It’s the same URL after the domain is specified. If we leave the URL as simply /us-ca/codes/gov/sec500, we have a general reference and we leave it to the local system to provide the resolution service for retrieving and formatting the information. We probably want to save references in a general fashion without a domain, but we certainly will need to refer to specific copies within the tools that we build.

Resolvers

The key to making this all work is having resolvers that can interpret standardized references and find a way to provide the correct response. It is important to realize that these URLs are all virtual URLs. They do not necessarily resolve to files that exist. It is the job of the resolving service to either construct the valid response, possibly by digging into databases and files, or to negotiate with other resolvers that might do all or part of the job of providing a response. For example, imagine that Cornell University offers a resolver at http: //lii.cornell.edu. It might, behind the scenes, work with the official data source at http: //leginfo.ca.gov to source California legislation. Anyone around the world could use the Cornell resolver and be unaware of the work it is doing to source information from resolvers at the official sources around the world. So the local system would be pointed to the Cornell service and when the reference /us-ca/codes/gov/sec500 arose, the local system would defer to the LII service for resolution, which in turn would defer to California’s official resolver. In this way, the resolvers would bear the burden of knowing where all the official data sources around the world are located.
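The chain of deferrals described above can be sketched in a few lines. This is an in-memory toy under stated assumptions: no HTTP is involved, the `Resolver` class and its prefix-based routing are invented for illustration, and the section text is a placeholder.

```python
class Resolver:
    """Toy resolver: answers from its own store or defers to another resolver."""

    def __init__(self, name, documents=None, routes=None):
        self.name = name
        self.documents = documents or {}  # reference -> content held locally
        self.routes = routes or {}        # reference prefix -> Resolver to defer to

    def resolve(self, reference):
        """Return (name of the resolver that supplied the content, content)."""
        if reference in self.documents:
            return self.name, self.documents[reference]
        for prefix, resolver in self.routes.items():
            if reference.startswith(prefix):
                return resolver.resolve(reference)
        raise LookupError(f"{self.name} cannot resolve {reference}")


# The official source holds the data (placeholder content).
official = Resolver(
    "leginfo.ca.gov",
    documents={"/us-ca/codes/gov/sec500": "placeholder text of section 500"},
)
# The intermediary knows which official source handles which jurisdiction.
lii = Resolver("lii.cornell.edu", routes={"/us-ca/": official})
# The local system only knows to ask the intermediary.
local = Resolver("local", routes={"/": lii})

source, content = local.resolve("/us-ca/codes/gov/sec500")
print(source)  # leginfo.ca.gov
```

The local system never learns where California's data lives; that knowledge sits only in the intermediary's routing table.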

Examples

So to end, I would like to sum up with some examples:

[Note that the links are proposals, using a modified and simplified form of the Akoma Ntoso proposal, rather than working links at this point]

/us-ca/codes/gov/sec500
– Get section 500 of the California Government Code. It’s up to the local service to decide where and how to resolve the reference.

http: //leginfo.ca.gov/us-ca/codes/gov/sec500
– Get Section 500 of the California Government Code from the official source in California.

http: //lii.cornell.edu/us-ca/codes/gov/sec500
– Get Section 500 of the California Government Code from Cornell’s LII and have them figure out where to get the data from.

/us-ca/codes/gov/sec500@2012-01-01
– Get Section 500 of the California Government Code as it existed on January 1, 2012

/us-ca/codes/gov/sec500@2012-01-01.pdf
– Get Section 500 of the California Government Code as it existed on January 1, 2012, in a PDF format

/us-ca/codes/gov/title1/div3/chap6/sec500
– Get Section 500 of the California Government Code, with the full hierarchy specified.
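A small parser shows how the proposed references split into the four FRBR-style parts. It is a sketch of the simplified proposal in this post only, not of the Akoma Ntoso or LEX URN syntax. The `Reference` class, the `parse_reference` function, and the list of recognized extensions are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit

# Extensions treated as formats in this sketch (an assumed, incomplete list).
KNOWN_FORMATS = {"pdf", "odt", "docx", "xml", "html"}


@dataclass(frozen=True)
class Reference:
    work: str                     # the "what"
    expression: Optional[str]     # the "from when" (version or date)
    manifestation: Optional[str]  # the "which format"
    item: Optional[str]           # the "from where" (host); None = resolve locally


def parse_reference(reference):
    """Split a reference such as /us-ca/codes/gov/sec500@2012-01-01.pdf."""
    parts = urlsplit(reference)
    path = parts.path
    manifestation = None
    stem, dot, extension = path.rpartition(".")
    if dot and extension in KNOWN_FORMATS:
        path, manifestation = stem, extension
    work, _, expression = path.partition("@")
    return Reference(work, expression or None, manifestation, parts.netloc or None)


print(parse_reference("/us-ca/codes/gov/sec500@2012-01-01.pdf"))
print(parse_reference("http://leginfo.ca.gov/us-ca/codes/gov/sec500"))
```

A general reference leaves `item` empty and so defers to whatever resolver the local system is configured to use.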

My blog has gotten very long and I have only just started to scratch the surface. I haven’t addressed multilingual issues, alternate character sets, and a host of other issues at all. It should already be apparent that this is all simply a natural extension of the URLs we already use, but with sophisticated services underneath resolving to items other than simple files. Imagine for a moment how the field of legal informatics could advance if we could all agree to something this simple and comprehensive soon.

What do you think? Are there any other proposals, solutions, or prototypes out there that address this? How does the OASIS LegalDocumentML work factor into this?

And now for something completely different… Chinese!

Last week we saw how Akoma Ntoso can be applied to a very large consolidated Code – the United States Code. This week we take the challenge in a different direction – applying Akoma Ntoso to a bilingual implementation involving a totally different writing system. Our test document this week is the Hong Kong Basic Law. This document serves as the constitutional document of the Hong Kong Special Administrative Region of the People’s Republic of China. It was adopted on April 4, 1990 and went into effect on July 1, 1997, when the United Kingdom handed over the region to the People’s Republic of China.

The Hong Kong Basic Law is available in English, Traditional Chinese, and Simplified Chinese. For our exercise, we are demonstrating the document in English and in Traditional Chinese. (Thank you to Patrick for doing the conversion for me.) Fortunately, using modern technologies, supporting Chinese characters alongside Latin characters is quite straightforward. Unicode provides a Hong Kong supplementary character set to handle characters unique to Hong Kong. The biggest challenge is ensuring that all the Unicode declarations throughout the various XML and HTML files that the information must flow through are set correctly. With the number of accents we find in names in California as well as the rigorous nature of California’s publishing rules, getting Unicode right is something we have grown accustomed to.

While I hadn’t expected there to be any problems with Unicode, I was pleasantly surprised to find that the fonts used in Legix simply worked with the Traditional Chinese characters without issue as well. (Well, at least as far as I can tell without the ability to actually read Chinese.)

The only issue we encountered was Internet Explorer’s support for CSS3. Apparently, IE still does not recognize “list-style-type” with a value of “cjk-ideographic”. So instead of getting Traditional Chinese numerals, we get Arabic numerals. The other browsers handled this much better.

So what other considerations were there? A big consideration was the referencing mechanism. To me, modeling how you refer to something in an information model can be more important than the information model itself. The referencing mechanism defines how the information is organized and allows you to address a specific piece of information in a very precise and accurate way. Done right, any piece of information can be accessed very quickly and easily. Done wrong and you get chaos.

Our referencing mechanism relies on the Functional Requirements for Bibliographic Records (FRBR). This mechanism is used by both SLIM and Akoma Ntoso. Another interesting FRBR proposal for legislation can be found here.

FRBR defines an information model based on a hierarchical scheme of Work-Expression-Manifestation-Item. Think of the work as the overall document being addressed, the expression as the version desired, the manifestation as the format you want the information presented in, and finally the item as a means for addressing a specific instance of the information. Typically we’re only concerned with Work-Expression-Manifestation.

For a bilingual or multilingual system, the “expression” part of the reference is used to specify which language you wish the document to be returned in. If you check out the references at Legix.info you will see that the two references to the Hong Kong Basic Law are:

The expressions are called out as “doc;en-uk” for the English version and “doc;zh-yue” for the Chinese version. Relatively straightforward. The manifestations are not shown and the result is the default manifestation of HTML.

Check the samples out and let me know what you think.
