Akoma Ntoso, LEX Summer School, Process, Standards, Track Changes, Transparency, W3C

Achieving Five Star Open Data

A couple weeks ago, I was in Ravenna, Italy at the LEX Summer School and follow-on Developer’s Workshop. There, the topic of a semantic web came up a lot. Despite cooling in the popular press in recent years, I’m still a big believer in the idea. The problem with the semantic web is that few people actually get it. At this point, it’s such an abstract idea that people invariably jump to the closest analog available today and mistake it for that.

Tim Berners-Lee (@timberners_lee), the inventor of the web and a big proponent of linked data, has suggested a five star deployment scheme for achieving open data — and what ultimately will be a semantic web. His chart can be thought of as a roadmap for how to get there.

Take a look at today’s Data.gov website. Everybody knows the problem with it — it’s a pretty wrapper around a dumping ground of open data. There are thousands and thousands of data sets available on a wide range of interesting topics. But, there is no unifying data model behind all these data dumps. Sometimes you’re directed to another pretty website that, while well-intentioned, hides the real information behind the decorations. Sometimes you can get a simple text file. If you’re lucky, you might even find the information in some more structured format such as a spreadsheet or XML file. Without any unifying model and with much of the data intended as downloads rather than as an information service, this is really still Tim’s first star of open data — even though some of the data is provided as spreadsheets or open data formats. It’s a good start, but there’s an awful long way to go.

So let’s imagine that a better solution is desired, providing information services, but keeping it all modest by using off-the-shelf technology that everyone is familiar with. Imagine that someone with the authority to do so, takes the initiative to mandate that henceforth, all government data will be produced as Excel spreadsheets. Every memo, report, regulation, piece of legislation, form that citizens fill out, and even the U.S. Code will be kept in Excel spreadsheets. Yes, you need to suspend disbelief to imagine this — the complications that would result would be incredibly tough to solve. But, imagine that all those hurdles were magically overcome.

What would it mean if all government information was stored as spreadsheets? What would be possible if all that information was available throughout the government in predictable and permanent locations? Let’s call the system that would result the Government Information Storehouse – a giant information repository for information regularized as Excel spreadsheets. (BTW, this would be the future of government publishing once paper and PDFs have become relics of the past.)

How would this information look? Think about a piece of legislation, for instance. Each section of the bill might be modeled as a single row in the spreadsheet. Every provision in that section would be it’s own spreadsheet cell (ignoring hierarchical considerations, etc.) Citations would turn into cell references or cell range references. Amending formulas, such as “Section 1234 of Title 10 is amended by…” could be expressed as a literal formula — a spreadsheet formula. It would refer to the specific cell in the appropriate U.S. Code Title and contain programmatic instructions for how to perform the amendment. In short, lots of once complex operations could be automated very efficiently and very precisely. Having the power to turn all government information into a giant spreadsheet has a certain appeal — even if it requires quite a stretch of the imagination.

Now imagine what it would mean if selected parts of this information were available to the public as these spreadsheets – in a regularized and permanent way — say Data.gov 2.0 or perhaps, more accurately, as Info.gov. Think of all the spreadsheet applications that would be built to tease out knowledge from the information that the government is providing through their information portal. Having the ability to programmatically monitor the government without having to resort to complex measures to extract the information would truly enable transparency.

At this point, while the linkages and information services give us some of the attributes of Tim’s four and five star open data solutions, but our focus on spreadsheet technology has left us with a less than desirable two star system. Besides, we all know that having the government publish everything as Excel spreadsheets is absurd. Not everything fits conveniently into a spreadsheet table to say nothing of the scalability problems that would result. I wouldn’t even want to try putting Title 42 of the U.S. Code into an Excel spreadsheet. So how do we really go about achieving this sort of open data and the efficiencies it enables — both inside and outside of government?

In order to realize true four and five star solutions, we need to quickly move on to fulfilling all the parts of Tim’s five star chart. In his chart, a three star solution replaces Excel spreadsheets with an open data format such as a comma separated file. I don’t actually care for this ordering because it sacrifices much to achieve the goal of having neutral file formats — so lets move on to full four and five star solutions. To get there, we need to become proficient in the open standards that exist and we must strive to create ones where they’re missing. That’s why we work so hard on the OASIS efforts to develop Akoma Ntoso and citations into standards for legal documents. And when we start producing real information services, we must ensure that the linkages in the information (those links and formulas I wrote about earlier), exist to the best extent possible. It shouldn’t be up to the consumer to figure out how a provision in a bill relates to a line item in some budget somewhere else — that linkage should be established from the get-go.

We’re working on a number of core pieces of technology to enable this vision and get to full five star open data. We integrating XML repositories and SQL databases into our architectures to give us the information storehouse I mentioned earlier. We’re building resolver technology that allows us to create and manage permanent linkages. These linkages can be as simple as citation references or as complex as instructions to extract from or make modifications to other information sources. Think of our resolver technology as akin to the engine in Excel than handles cell or range references, arithmetic formulas, and database lookups. And finally, we’re building editors that will resemble word processors in usage, but will allow complex sets of information to be authored and later modified. These editors will have many of the sophisticated capabilities such as track changes that you might see in a modern word processor, but underneath you will find a complex structured model rather than the ad hoc data structures of a word processor.

Building truly open data is going to be a challenging but exciting journey. The solutions that are in place today are a very primitive first step. Many new standards and technologies still need to be developed. But, we’re well on our way.

Process, Standards, Transparency

Imagining Government Data in the 21st Century

After the 2014 Legislative Data and Transparency conference, I came away both encouraged and a little worried. I’m encouraged by the vast amount of progress we have seen in the past year, but at the same time a little concerned by how disjointed some of the initiatives seem to be. I would rather see new mandates forcing existing systems to be rethought rather than causing additional systems to be created – which can get very costly over time. But, it’s all still the Wild Wild West of computing.

What I want to do with my blog this week is try and define what I believe transparency is all about:

  1. The data must be available. First and foremost, the most important thing is that the data be provided at the very least – somehow, anyhow.
  2. The data must be provided in such a way that it is accessible and understandable by the widest possible audience. This means providing data formats that can be read by ubiquitous tools and, ensuring the coding necessary to support all types of readers including those with disabilities.
  3. The data must be provided in such a way that it should be easy for a computer to digest and analyze. This means using data formats that are easily parsed by a computer (not PDF, please!!!) and using data models that are comprehensible to widest possible audience of data analysts. Data formats that are difficult to parse or complex to understand should be discouraged. A transparent data format should not limit the visibility of the data to only those with very specialized tools or expertise.
  4. The data provided must be useful. This means that the most important characteristics of the data must be described in ways that allow it to be interpreted by a computer without too much work. For instance, important entities described by the data should be marked in ways that are easily found and characterized – preferably using broadly accepted open standards.
  5. The data must be easy to find. This means that the location at which data resides should be predictable, understandable, permanent, and reliable. It should reflect the nature of the data rather than the implementation of the website serving the data. URLs should be designed rather than simply fallout from the implementation.
  6. The data should be as raw as possible – but still comprehensible. This means that the data should have undergone as little processing as possible. The more that data is transformed, interpreted, or rearranged, the less like the original data it becomes. Processing data invariably damages its integrity – whether intentional or unintentional. There will always be some degree of healthy mistrust in data that has been over-processed.
  7. The data should be interactive. This means that it should be possible to search the data at its source – through both simple text search and more sophisticated data queries. It also means that whenever data is published, there should be an opportunity for the consumer to respond back – be it simple feedback, a formal request for change, or some other type of two way interaction.

How can this all be achieved for legislative data? This is the problem we are working to solve. We’re taking a holistic approach by designing data models that are both easy to understand and can be applied throughout the data life cycle. We’re striving to limit data transformations by designing our data models to present data in ways that are both understandable to humans and computers alike. We are defining URL schemes that are well thought out and could last for as long as URLs are how we find data in the digital era. We’re defining database solutions that allow data to not only be downloaded, but also searched and queried in place. We’re building tools that will allow the data to not only be created but also interacted with later. And finally, we’re working with standards bodies such as the LegalDocML and LegalCiteM technical committees at OASIS to ensure well thought out world wide standards such as Akoma Ntoso.

Take a look at Title 1 of the U.S. Code. If you’re using a reasonably modern web browser, you will notice that this data is very readable and understandable – its meant to be read by a human. Right click with the mouse and view the source. This is the USLM format that was released a year ago. If you’re familiar with the structure of the U.S. Code and you’re reasonably XML savvy, you should feel at ease with the data format. It’s meant to be understandable to both humans and to computer programs trying to analyze it. The objective here is to provide a single simple data model that is used from initial drafting all the way through publishing and beyond. Rather than transforming the XML into PDF and HTML forms, the XML format can be rendered into a readable form using Cascading Style Sheets (CSS). Modern XML repositories such as eXist allow documents such as this to be queried as easily as you would query a table in a relational database – using a query language called XQuery.

This is what we are doing – within the umbrella of legislative data. It’s a start, but ultimately there is a need for a broader solution. My hope is that government agencies will be able to come together under a common vision for our information should be created, published, and disseminated – in order to fulfill their evolving transparency mandates efficiently. As government agencies replace old systems with new systems, they should design around a common open framework for transparent data rather building new systems in the exact same footprint as the old systems that they demolish. The digital era and transparency mandates that have come with it demand new thinking far different than the thinking of the paper era which is now drawing to a close. If this can be achieved, then true data transparency can be achieved.

Process, Transparency

What is Transparency?

I’ve been thinking a lot about transparency lately. The disappearance of Malaysian Airline Flight 370 (MH370) provided an interesting case to look at – and some important lessons. Releasing data which requires great expertise to decipher isn’t transparency.

My boss, when I worked on process research at the Boeing Company many years ago, used to drill into me the difference between information and data. To him, data was raw – and meaningless unless you knew how to interpret it. Information, on the other hand, had the meaning applied so you could understand it – information, to him, was meaningful.

Let’s recall some of the details of the MH370 incident. The plane disappeared without a trace – for reasons that remain a mystery. The only useful information, after radar contact was lost, was a series of pings received by Inmarsat’s satellite. Using some very clever mathematics involving Doppler shifts, Inmarsat was able to use that data to plot a course for the lost plane. That course was revealed to the world and the search progressed. However, when that course failed to turn up the missing plane, there were increasingly angry calls for more transparency from Inmarsat – to reveal the raw data. Inmarsat’s response was that they had released the information, in the form of a plotted course, to the public and to the appropriate authorities, However, they chose to withhold the underlying data, claiming it wouldn’t be useful. The demands persisted, primarily from the press and the victims’ families. Eventually Inmarsat gave in and agreed to release the data. With great excitement, the press reported this as “Breaking News”. Then, a bewildered look seemed to come across everyone and the story quickly faded away. Inmarsat had provided the transparency in the form it was demanded, releasing the raw data along with a brief overview and the relevant data highlighted, but it still wasn’t particularly useful. We’re still waiting to hear if anyone will ever be able to find any new insights into whatever happened to MH370 using this data. Most likely though, that story has run its course – you simply need Inmarsat’s expertise to understand the data.

There is an important lesson to be learned – for better or worse. Raw data can be released, but without the tools and expertise necessary to intepret it, it’s meaningless. Is that transparency? Alternatively, raw data can be interpreted into meaningful information, but that opens up questions as to the honesty and accuracy of the interpretation. Is that transparency? It’s very easy to hide the facts in plain sight – by delivering it in a convoluted and indecipherable data format or by selectively interpreting it to tell an incomplete story. How do we manage transparency to achieve the objective of providing the public with an open, honest, and useful view of government activities?

Next week, I want to describe my vision for how government information should be made public. I want to tackle the conflicting needs of providing information that is both unfiltered yet comprehensible. While I don’t have the answers, I do want to start the process of clarifying what better transparency is really going to achieve.

Process, Standards, Transparency

Improving Legal References

In my blog last week, I talked a little about our efforts to improve how citations are handled. This week, I want to talk about this in some more detail. I’ve been participating on a few projects to improve how citations and references to legal citations are handled.

Let’s start by looking at the need. Have you noticed how difficult it is to lookup many citations found in legislation published on the web? Quite often, there is no link associated with the citation. You’re left to do your own legwork if you want to lookup that citation – which probably means you’ll take the author’s word for it and not bother to follow the citation. Sometimes, if you’re lucky, you will find a link (or reference) associated with the citation. It will point to a location, chosen by the author, that contains a copy of the legal text being referenced.

What’s the problem with these references?

  • If you take a look at the reference, chances are it’s a crufty URL containing all sorts of gibberish that’s either difficult or impossible to interpret. The URL reflects the current implementation of the data provider. It’s not intended to be meaningful. It follows no common conventions for how to describe a legal provision.
  • Wait a few years and try and follow that link again. Chances are, that link will now be broken. The data provider may have redesigned their site or it might not even exist anymore. You’re left with a meaningless link that points to nowhere.
  • Even if the data does exist, what’s the quality of the data at the other end of the link. Is the text official text, a copy, or even a derivative of the official text? Has the provision been amended? Has it been renumbered? Has it been repealed? What version of that provision are you looking at now? These questions are all hard to answer.
  • When you retrieve the data, what format does it come in? Is it a PDF? What if you want the underlying XML? If that is available, how do you get it?
  • The object of our efforts, both at the standards committee and within the projects we’re working on at Xcential, is to tackle this problem. The approach being taken involves properly designing meaningful URLs which are descriptive, unambiguous, and can last for a very long time – perhaps decades or longer. These URLs are independent of the current implementation – they may not reflect how the data is stored at all. The job of figuring out how to retrieve the data, using the current underlying content management system, is the job of a “resolver”. A resolver is simply an adapter that is attached to a web server. It intercepts the properly designed URL references and then transparently maps them into the crufty old URLs which the content management system requires. The data is retrieved from the content management system, formatted appropriately, and returned as if it really did exist at the property designed URL which you see. As the years go by and technologies advance, the resolver can be adapted to handle new generations of content management system. The references themselves will never need to change.

    There are many more details to go into. I’ll leave those for future blogs. Some of the problems we are tackling involve mapping popular names into real citations, working through ambiguities (including ones created in the future), handling alternate data sources, and allowing citations to be retrieved at varying degrees of granularity.

    I believe that solving the legal references problem is just about the most important progress we can make towards improving the legal informatics field. It’s an exciting time to be working in this field.

Process, Transparency

Transparent legislation should be easy to read

Legislation is difficult to read and understand. So difficult that it largely goes unread. This is something I learned when I first started building bill drafting systems over a decade ago. It was quite a let down. The people you would expect to read legislation don’t actually do that. Instead they must rely on analyses, sometimes biased, performed by others that omits many of the nuances found within the legislation itself.

Much of the problem is how legislation is written. Legislation is often written so as to concisely describe a set of changes to be made to existing law. The result is a document that is written to be executed by a law compilation team deep within the government rather than understood by law makers or the general public. This article, by Robert Potts, rather nicely sums up the problem.

Note: There is a technical error in the article by Robert Potts. The author states “These statutes are law, but since Congress has not written them directly to the Code, they are added to the Code as ‘notes,’ which are not law. So even when there is a positive law Title, because Congress has screwed it up, amendments must still be written to individual statutes.” This is not accurate. Statutory notes are law. This is explained in Part IV (E) of the DETAILED GUIDE TO THE CODE CONTENT AND FEATURES.

So how can legislation be made more readable and hence more transparent? The change must come in how amendments are written – with an intent to communicate the changes rather than just to describe them. Let’s start by looking at a few different ways that amendments can be written:

1) Cut-and-Bite Amendments

Many jurisdiction around the world use the cut-and-bite approach to amending, also known as amendments by reference. This includes Congress here in the U.S., but it is also common to most of the other jurisdictions I work with. Let’s take a look at a hypothetical cut-and-bite amendment:

SECTION 1. Section 1234 of the Labor Code is amended by repealing “$7.50” and substituting “$8.50”.

There is no context to this amendment. In order to understand this amendment, someone is going to have to go look up Section 1234 of the Labor Code and manually make apply the change to see what it is all about. While this contrived example is simple, it already involves a fair amount of work. When you extrapolate this problem to a real bill and the sometimes convoluted state of the law, the effort to understand a piece of legislation quickly becomes mind-boggling. For a real bill, few people are going to have either the time or the resources to adequately research all the amendments to truly understand how they will affect the law.

2) Amendments Set Out in Full

I’ve come to appreciate the way the California Legislature handles this problem. The cut-and-bite style of amending, as described above, is simply disallowed. Instead, all amendments must be set out in full – by re-enacting the section in full as amended. This is mandated by Article 4, section 9 of the California Constitution. What this means is that the amendment above must instead be written as:

Section 1. Section 1234 of the Labor Code is amended to read:

1234. Notwithstanding any other provision of this part, the minimum wage for all industries shall be not less than $8.50 per hour.

This is somewhat better. Now we can see that we’re affecting the minimum wage – we have the context. The wording of the section, as amended, is set out in full. It’s clear and much more transparent.

However, it’s still not perfect. While we can see how the amended law will read when enacted, we don’t actually know what changed. Actually, in California, if you paid attention to the bill redlining through its various stages, you could have tracked the changes through the various versions to arrive at the net effect of the amendment. (See note on redlining) Unfortunately, the redlining rules are a bit convoluted and not nearly as apparent as they might seem to be – they’re misleading to the uninitiated. What’s more, the resulting statute at the end of the process has no redlining so the effect of the change is totally hidden in the enacted result.

Setting out amendments in full has been adopted by many states in addition to California. It is both more transparent and greatly eases the codification process. The codification process becomes simple because the new sections, set out in full, are essentially prefabricated blocks awaiting insertion into the law at enactment time. Any problems which may result from conflicting amendments are, by necessity, resolved earlier rather than later. (although this does bring along its own challenges)

3) Amendments in Context

There is an even better approach – which is adopted to varying degrees by a few legislatures. It is to build on the approach of setting out sections in full, but adds a visible indication of what has changed using strike and insert notation. I’ll refer to this as Amendments in Context.

This problem is partially addressed, at the federal level, by the Ramseyer Rule which requires that a separate document be published which essentially does shows all amendments in context. The problem is that this second document isn’t generally available – and it’s yet another separate document.

Why not just write the legislation showing the amendments in context to begin with? I can think of no reason other than tradition why the law, as proposed and enacted, shouldn’t show all amendments in context. Let’s take a look at this approach:

Section 1. Section 1234 of the Labor Code is amended to read:

1234. Notwithstanding any other provision of this part, the minimum wage for all industries shall be not less than $7.50 $8.50 per hour.

Isn’t this much clearer? At a glance we can see that the minimum wage is being raised a dollar. It’s obvious – and much more transparent.

At Xcential, we address this problem in California by providing an amendments in context view for all state legislation within our LegisWeb bill tracking service. We call this feature As Amends the LawTM and it is computed on-the-fly.

Governments are spending a lot of time, energy, and money on legislative transparency. The progress we see today is in making the data more accessible to computer analysis. Amendments in context would make the legislation not only more accessible to computer analysis – but also more readable and understandable to people.

Redlining Note: If redlining is a new term to you, it is similar to, but subtly different, to track changes in a word processor.


Computerize vs. Automate

There are two words that have long been important to me: computerize and automate. The dictionary defines these words as follows:


tr.v., -ized, -iz·ing, -iz·es. 
   1. To furnish with a computer or computer system.
   2. To enter, process, or store (information) in a 
  computer or system of computers

v., -mat·ed, -mat·ing, -mates. 
   1. To convert to automatic operation: automate a factory.
   2. To control or operate by automation.

We often make the mistake of confusing these two concepts as the same thing. They are very different. Doing one does not imply the other. Using a computer does not mean you have automated and automating does not imply the need for a computer. I have found the confusion between computerization and automation to be at the very heart of the disappointment many have with XML solutions. Just because you’re using XML does not mean you are reaping the benefits that XML can provide.

Let’s take a step back and see where we are in history. We are living in a very important era. We are witnessing the transition from paper documents to digital information. This is the sort of transition that only happens every few hundred years, rivaling the advent of the Gutenberg printing press in the 15th century. The benefits of digital information are all around us. Just think of how efficient many businesses have become. As I write this, I am waiting on a parcel that was shipped from Shanghai just 4 days ago. I have tracked that parcel throughout its journey and I know with certainty that is will be delivered in the next couple of hours. That is a benefit of automation.

In my experience, governments don’t see the same benefits of automation that the private sector does. Why is this? Governments, like private industry, have readily computerized their operations. But when it comes to automating, governments tends to balk. There are many reasons for this – the perceived loss of jobs, the need to retrain, the lack of competitive pressures. But to me it seems that the overriding reason is tradition. Things are done the way they are because that is the way they have always been done. When it comes time to rethink tradition, it is sometimes hard to identify who you need to get permission from.

Whatever the reasons, the slowness to automate slows innovation when it comes to legislative information. Sure, the information is now online. Great! But what has been put online is most often just digital paper – like PDFs or unstructured HTML. That’s a half step into the future whilst looking to the past. Rather than taking advantage of the new medium and exploiting what now can be done through automation, we’re clinging to the centuries old models for how to manage and publish paper.

Why is this important? What does it matter? Well, for starters, let’s consider accuracy. For as long as I have involved in this field, the importance of accurately representing the law has been drilled into me. Yet whenever I start writing software to analyze laws, from anywhere, I am surprised at how easy it is to find errors. I’m talking about citations to sections of laws that don’t exist anymore or have more recently been renumbered to be somewhere else. I am talking about duplicate numbering or misnumbering. I am talking about common typos. These are all things that could be rectified with proper automation.

A pet project for me is point-in-time law. It is a subject that has fascinated me for a decade. It is very hard to do. Why is that? Because deciding which law is effective or operational at any point in time is really hard to do. And deciphering references between documents is riddled with ambiguity. This is because, whilst we live in an era where information around the world is stitched together at lightning speeds by computers, we still write that information somewhere in the text of a bill to be read by a person alone. Sometimes I find that quite ironic as I am constantly surprised at how few people actually read the bills – despite having strong opinions about them.

Isn’t it time we started treating legislation as digital information rather than as paper? Isn’t it time we went beyond computerization and looked towards real automation of legislation?

Akoma Ntoso, Process, Standards

Legislative Information Modeling

Last week I brought up the subject of semantic webs for legal documents. This week I want to expand the subject by discussing the technologies that I have encountered recently that point the way to a semantic web. Of course, there are the usual general purpose semantic web technologies like RDF, OWL, and SPARQL. Try as I might, I have been unable to get much practical interest out of anyone in these technologies. Part of the reason is that the abstraction they demand is just beyond most people’s grasp at this point in time. In academic circles it becomes easy to discuss these topics, but step into the “real world” and interest evaporates immediately.

Rather than pressing ahead with those technologies, I have chosen in recent years to step away and focus more on less abstract and more direct information modeling approaches. As I mentioned last week, I see two key areas of information modeling – the documents and the relationships between them. In some respects, there are three areas – distinguishing the metadata about the documents from the documents themselves. Typically I lump the documents with their metadata because much of the metadata gets included with the document text blurring the distinction and calling for a more uniform integrated model.

The projects I have worked on over the past decade have resulted in several legislative information models. With each project I have learned and evolved to result in the SLIM model found at the Legix.info demonstration website that exists today. Over time, a few key aspects have emerged as most important:

  • First and foremost has been the need for simplicity. It is very easy to get all caught up with the information model, discovering all the variations out there and finding clever solutions to each and every situation. However, it easily becomes possible to end up with a large and complex information model that you cannot teach to anyone that does not share your passion and experiences in information modeling. Your efforts to satisfy everyone result in a model that satisfies no one due to the resulting complexity of trying to please too many masters.
  • Secondly, you need to provide a way to build familiarity into your information model. While there are many consistently used terms in legislation, at the same time, traditions around the world do vary and sometimes very similar words have quite different meanings to different organizations. Trying to change long standing traditions to arrive at more consistent or abstract terminology always seems to be an uphill battle.
  • Thirdly, you have to consider the usage model. Is the model intended for downstream reporting and analysis or does the model need to work in an editing environment? An editing model could be quite different from a model intended only for downstream processing. The reason for this is that the manner in which the model will interact with the editor must be given important consideration. Two important aspects require consideration. First, the model must be robust yet flexible enough to work with all the intermediate states that a document will exist at whilst being edited. Second, change tracking is a very important consideration during the amendment process and how that function will be implemented in the document editor must be considered.

While I have developed SLIM and its associated reference scheme over the past few years, in the last year I have started experimenting with a few alternate models in the hopes of finding a more perfect model to solve the problem of legislative information modeling. Most recently I have started experimenting with Akoma Ntoso developed by Fabio Vitali and Monica Palmirani at the University of Bologna. This project is supported by Africa i-Parliaments, a project sponsored by United Nations Department of Economic and Social Affairs. I very much like this model as it follows many of the same ideals in terms of good information modeling that I try to conform to. In fact, it is quite similar to SLIM in many respects. The legix.info site has many examples of Akoma Ntoso documents, created by translating SLIM into Akoma Ntoso via an XSL Transform.

While I very much like Akoma Ntoso, I have yet to master it. It is a far more ambitious effort than SLIM, has many more tags, and covers a broader range of document types. Like SLIM, it covers both the metadata and the document text in a uniform model. I have yet to convince myself as to its viability as an editing schema. Adapting it to work with the editors I have worked with in the past is a project I just haven’t had the time for yet.

The other important aspect of a semantic web, as I wrote about last week is the referencing scheme. Akoma Ntoso uses a notation based on coded URLs to implement referencing. It is partly based on a conceptually similar model URN:LEX model based around URNs developed by Enrico Francesconi and Pierluigi Spinosa at the ITTIG/CNG in Florence, Italy. Both schemes build upon the Functional Requirement for Bibliographic Records (FRBR) model. I have tried adopting both models but have run into snags with the models either not covering enough types of relationships, scaring people away with too many special characters with encoded meaning, or resulting in too complex a location resolution model for my needs. At this point I have cherry picked the best features of both to try and arrive at a compromise that works for my cases. Hopefully I will be able to evolve towards a more consistent implementation as those efforts mature.

My next effort is to start taking a closer look at MetaLex, an open XML-based interchange format for legislation. It has been developed in Europe and defines a set of conventions for metadata, naming, cross references, and compound documents. Many projects in Europe including Akoma Ntoso comply with the Metalex framework. It will be interesting for me to see how easily I can adapt SLIM to Metalex. Hopefully the changes required will amount mostly to deriving from the Metalex schema and adapting to its attribute names. We shall see…

Process, Standards, W3C

What is a Semantic Web?

Tim Berners-Lee, inventor of the World Wide Web, defines a semantic web quite simply as “a web of data that can be processed directly and indirectly by machines“. In my experience, that simple definition quickly becomes confusing as people add their own wants and desires to the definition. There are technologies like RDF, OWL, and SPARQL that are considered key components of semantic web technology. It seems though that these technologies add so much confusion through abstraction that non-academic people quickly steer as far away from the notion of a semantic web as they can get.

So let’s stick to the simple definition from Tim Berners-Lee. We will simply distinguish the semantic web from our existing web by saying that a semantic web is designed to be meaningful to machines as well as to people. So what does it mean for a web of information to be meaningful to machines? A simple answer is to say that there are two primary things that a machine needs to understand about a web. First of all, what the pages are all about, and secondly what the relationships that connect the pages together are all about.

It turns out that making a machine capable of understanding even the most rudimentary aspects of pages and the links that connect them is quite challenging. Generally, you have to resort to fragile custom-built parsers or sophisticated algorithms that analyze the document pages and the references between them. Going from pages with lots of words connected somehow to other pages to a meaningful information model is quite a chore.

What we need to improve the situation are agreed upon information formats and referencing schemes in a semantic web that can more readily be interpreted by machines. Defining what those formats and schemes are is where the subject of semantic webs starts getting thorny. Before trying to tackle all of this, let’s first consider how this all applies to us.

What could benefit more from a semantic web than legal publishing? Understanding the law is a very complex subject which requires extensive analysis and know-how. This problem could be simplified substantially using a semantic web. Legal documents are an ideal fit to the notion of a semantic web. First of all, the documents are quite structured. Even though each jurisdiction might have their own presentation styles and procedural traditions, the underlying models are all quite similar around the world. Secondly, legal documents are rich with relationships or citations to other documents. Understanding these relationships and what they mean is quite important to understanding the meaning of the documents.

So let’s consider the current state of legal publishing – and from my perspective – legislative publishing. The good news is that the information is almost universally available online in a free and easily accessed format. We are, after all, subject to the law and providing access to that law is the duty of the people that make the laws. However, providing readable access to the documents is often the only objective and any which way of accomplishing that objective is simply the requirement. Documents are often published as PDFs which are nice to read, but really difficult for computers to understand. There is no uniformity between jurisdictions, minimal analysis capability (typically word search), and links connecting references and citations between documents are most often missing. This is a less than ideal situation.

We live in an era where our legal institutions are expected to provide more transparency into their functions. At the same time, we expect more from computers than merely allowing us to read documents online. It is becoming more and more important to have machines interpret and analyze the information within documents – and without error. Today, if you want to provide useful access to legal information by providing value-added analysis capabilities, you must first tackle the task of interpreting all the variations in which laws are published online. This is a monumental task which then subjects you to a barrage of changes as the manner in which the documents are released to the public evolves.

So what if there was a uniform semantic web for legal documents? What standards would be required? What services would be required? Would we need to have uniform standards or could existing fragmented standards be accommodated? Would it all need to come from a single provider, from a group of cooperating providers, or would there be a looser way to federate all the documents being provided by all the sources of law around the world? Should the legal entities that are sources of law assume responsibility for publishing legal documents or should this be left to third party providers? In my coming posts I want to explore these questions.

Process, Standards

Welcome to my new blog on Legal Informatics

Imagine that all the world’s laws are published electronically in an open and consistent manner. Imagine that you or your business can easily research the laws to which you are subject. Imagine an industry that caters to the needs of the legal profession based on open worldwide standards.

Of course, there are many reasons why this is just not possible. Every legislature or parliament has their own way of doing things. Every country has their own unique legal system.  Every jurisdiction has their own unique traditions. It simply isn’t possible that all these unique requirements could be harmonized to achieve that vision. Of course not… But it will happen. It might take 50 years, but eventually it will happen. We can debate endlessly why it won’t. We can argue over nuances that get in the way forever. That’s not why I am writing this blog.

I want to open the discussion to how it might happen. What steps can we can start taking right now that will lead us towards our eventual goal? We live in an era where there is widespread dissatisfaction with the way our governments pass laws. There are constant calls for better transparency into the workings of the legislative process. The dissatisfaction we all feel has created an opportunity for entrepreneurial startups. Their goals are most often to affect change in government. For those of us with existing experience in this field, how can we harness our knowledge and work with these emerging efforts to achieve a greater good for us all?

I’ve spent the past ten years in this field, working as a consultant and developer primarily to the State of California. See my About for more about me. Now, with that experience to draw upon, I am hoping to make this blog a useful tool to others that might learn from my past. I’m going to make this blog a regular part of my life – posting regularly, maybe weekly. With each post I want to raise a number of questions and open up thoughtful discussions. Some of the topics I have in mind:

  • How do we balance openness and transparency with business opportunity?
  • Do we need open standards? If not now, when?
  • When it comes to openness and transparency, what is the government’s responsibility?
  • Are there technologies we need to focus on?
  • Isn’t this a Semantic Web for law? What does that mean anyway?
  • And from time to time I will share some of the questions I get each week about how to model legislation in XML. I’ll try not to get bogged down in technical minutiae.

What else? Please leave me a comment with your suggestions. Rather than just being a blog, I would like to see this grow into more of a conversation about how legal informatics can be applied to achieve a truly beneficial semantic web for law.

What could your role be in all this? Are you a government agency, a not-for-profit, a fledgling startup, a publishing company, or even a technology supplier or consultant like myself? Regardless of who you are, I am asking for your participation in this blog. Together we can shape the future of how legal information is shared around the world.

So let’s get started… My next post will start with a question I have been wrestling with lately – How can we heed the call for better open source data without hindering the for-profit motive that will foster an industry?