An HTML5-Based XML Editor for Legislation!

UPDATE: (May 17, 2012) For all the people that asked for more editing capabilities, I have updated the editor to provide rudimentary cut/copy/paste capabilities via the normal shortcut keys. More to follow as I get the cycles to refine the capabilities.

I’ve just released my mini-tutorial for the HTML5-based XML editor I am developing for drafting legislation (actually it’s best for tagging existing legislation at this point).


Please keep in mind that this editor is still very much in development – so it’s not yet fully functional or bug-free. But I do believe in being open and sharing what I am working on. We will be using this editor at our upcoming International Legislation Unhackathons (http://legalhacks.org) this coming weekend. The editor is available to experiment with at the legalhacks.org site.

There are three reasons I think that this approach to building editors is important:

  1. The editor uses an open standard for storing legislative data. This is a huge development. The first couple of generations of legislative systems were built upon totally proprietary data formats. That meant that all the data was locked into fully custom tools that were built years ago and could only be understood by those systems. Those systems were very closed. The last decade saw the development of the XML era of legislative tools. This made it possible to use off-the-shelf editors, repositories, and publishing tools. But the XML schemas that everyone used were still largely proprietary, and that meant everyone still had to invest millions of dollars in semi-custom tools to produce a workable system. The cost and risk of this type of development still put the effort out of reach of many smaller legislative bodies.

    So now we’re moving to a new era, tools based on a common open standard. This makes it possible for an industry of plug-and-play tools to emerge, reducing the cost and risks for everyone. The editor I am showing uses Akoma Ntoso for its information model. While not yet a standard, it’s on a standards track at the OASIS Standards Consortium and has the best chance of emerging as the standard for legal documents.

  2. The editor is built upon open web standards. Today you have several choices when you build a legislative editor. First, you can build a full custom editor. That’s a bad idea in this day and age when there are so many existing editors to build upon. So that leaves you with the choice of building your editor atop a customizable XML editor or customizing the heck out of a word processor. Most XML editors are built with this type of customization in mind. They intrinsically understand XML and are very customizable. But they’re not the easiest tools to master – for either the developer or the end user. Another approach is to use a word processor and bend and distort it into being an XML editor. This is something well beyond the original intent of the word processor and dealing with the mismatch in mission statements for a word processor and a legislative drafting tool leaves open lots of room for issues in the resulting legislation.

    There is another problem as well with this approach. When you choose to customize an off-the-shelf application, you have to buy into the API that the tool vendor supplies. Chances are that API is proprietary and you have no guarantee that they won’t change it on a whim. So you end up with a large investment in software built on an application API that could become almost unrecognizable with the next major release. So while you hope that your investment should be good for 10-12 years, you might be in for a nasty surprise at a most inopportune time well before that.

    The editor I have built has taken a different approach. It is building upon W3C standards that are being built around HTML5. These APIs are standards, so they won’t change on a whim – they will be very long lived. If you don’t like a vendor and want to change, doing so is trivial. I’m not just saying this. The proof is in the pudding. This editor works on all four major browsers today! This isn’t just something I am planning to support in the future; it is something I already support. Even while the standards are still being refined, this editor already works with all the major browsers. (Opera is lagging behind in support for some of the application APIs I am using.) Can you do that with an application built on top of Microsoft Office? If you want to switch to Open Office and keep the application you built, you’re going to have to rewrite it.

  3. Cloud-based computing is the future. Sure, this trend has been obvious for years, but the W3C finally recognizes the web-based application as being more than just a sophisticated website. That recognition is going to change computing forever. Whether your cloud is public or private, the future lies in web-based applications. Add to that the looming demands for more transparent government and open systems which facilitate real public participation and it becomes obvious that the era of the desktop application is over. The editor I am building anticipates this future.

I’ve been giving a lot of thought to where this editor can go. As the standards mature, I learn to tame the APIs, and the browsers finish the work remaining for them, it seems that legislative drafting is only the tip of the iceberg for such an approach to XML-based editing. Other XML models such as DITA and XBRL might well be other areas worth exploring.

What do you think? Let me know what ideas you have in this area.

Imagine All 50 States Legislation in a Common Format

Last week I expressed disappointment over NCSL’s opposition to the DATA Act (H.R. 2146). Their reasoning is that the burden this might create on the states’ systems will not be affordable. Contrast this with the topic of the international workshop held in Brussels last week – “Identifying benefits deriving from the adoption of XML-based chains for drafting legislation”. The push toward more transparent government need not be unaffordable.

With that in mind, stop for a while and imagine having the text of all 50 states’ legislation published in a common XML format. Seems like an impossibly difficult and expensive undertaking, doesn’t it? With all the requirements gathering, getting systems to cooperate, and getting buy-in throughout the country, this could be another super-expensive project that in the end would fail. What would such a project cost? Millions and millions?

Well, remember again Henry Ford’s quote “If you think you can do a thing or think you can’t do a thing, you’re right”. Would you believe that a system to gather and publish legislation from all 50 states has recently been developed, in months rather than years, and on a shoe-string budget? That system is BillTrack50.com. It’s a 50-state bill tracking service. Check it out! We, at Xcential, helped them to do this herculean task by providing a simple and neutral XML format and the software to do much of the processing. The press release is here. The format is SLIM, the same format that underlies my legix.info prototype. It’s a simple, easy-to-adopt XML format built on our past decade’s experience in legislative systems. Karen Sahuka at BillTrack50 recently gave a presentation on her product at the Non-profit Technology Conference in San Francisco.

SLIM is not as ambitious as Akoma Ntoso. If you take a gander at my legix.info site, you will see that it’s very easy to go from SLIM to Akoma Ntoso. In fact, going between any two formats is not all that difficult with modern transformation technology. It’s how we built the publishing system for the State of California as well. My point is that with the right attitude, a little innovation, and the right tools, achieving the modern requirements for accountability and transparency need not be out of reach.

And now for something completely different… Chinese!

Last week we saw how Akoma Ntoso can be applied to a very large consolidated Code – the United States Code. This week we take the challenge in a different direction – applying Akoma Ntoso to a bilingual implementation involving a totally different writing system. Our test document this week is the Hong Kong Basic Law. This document serves as the constitutional document of the Hong Kong Special Administrative Region of the People’s Republic of China. It was adopted on April 4, 1990 and went into effect on July 1, 1997 when the United Kingdom handed over the region to the People’s Republic of China.

The Hong Kong Basic Law is available in English, Traditional Chinese, and Simplified Chinese. For our exercise, we are demonstrating the document in English and in Traditional Chinese. (Thank you to Patrick for doing the conversion for me.) Fortunately, using modern technologies, supporting Chinese characters alongside Latin characters is quite straightforward. Unicode includes the characters of the Hong Kong Supplementary Character Set, which covers characters unique to Hong Kong. The biggest challenge is ensuring that the Unicode encoding declarations are set correctly in all the various XML and HTML files that the information must flow through. With the number of accents we find in names in California as well as the rigorous nature of California’s publishing rules, getting Unicode right is something we have grown accustomed to.
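
To give a feel for what getting the declarations right involves, here is a minimal Python sketch that writes a fragment containing both English and Traditional Chinese text with an explicit UTF-8 declaration and then reads it back. The element and attribute names are hypothetical; they are not taken from SLIM or Akoma Ntoso, and this is not the code behind Legix.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names, used only for illustration.
article = ET.Element("article")
ET.SubElement(article, "heading", {"lang": "en"}).text = "Chapter I: General Principles"
ET.SubElement(article, "heading", {"lang": "zh"}).text = "第一章 總則"

# Serialize to bytes with an explicit encoding declaration so that every
# downstream tool knows how to decode the file.
data = ET.tostring(article, encoding="utf-8", xml_declaration=True)

# Parsing the bytes honors the declaration and restores the original text.
restored = ET.fromstring(data)
headings = [heading.text for heading in restored.findall("heading")]
```

If any step in the chain drops or contradicts that declaration, the Chinese text is usually the first thing to turn into gibberish, which makes it a handy test case.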

While I hadn’t expected there to be any problems with Unicode, I was pleasantly surprised to find that the fonts used in Legix simply worked with the Traditional Chinese characters without issue as well. (Well, at least as far as I can tell without the ability to actually read Chinese.)

The only issue we encountered was Internet Explorer’s support for CSS3. Apparently, IE still does not recognize “list-style-type” with a value of “cjk-ideographic”. So instead of getting Traditional Chinese numerals, we get Arabic numerals. The other browsers handled this much better.
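
One possible workaround, if you control the server side, is to generate the numerals yourself rather than rely on the browser. The Python sketch below is only an assumption about how that might be done, not something Legix does today. The function name is my own, it covers 1 to 999 only, and it uses the common informal style, which may differ in detail from what a given browser renders for “cjk-ideographic”.

```python
DIGITS = "零一二三四五六七八九"

def to_chinese_numeral(n: int) -> str:
    """Convert 1..999 to informal Chinese numerals (illustrative sketch)."""
    if not 1 <= n <= 999:
        raise ValueError("only 1 to 999 is supported in this sketch")
    hundreds, rest = divmod(n, 100)
    tens, ones = divmod(rest, 10)
    parts = []
    if hundreds:
        parts.append(DIGITS[hundreds] + "百")
    if tens:
        if tens == 1 and not hundreds:
            parts.append("十")  # 10 to 19 drop the leading "one"
        else:
            parts.append(DIGITS[tens] + "十")
    elif hundreds and ones:
        parts.append("零")  # placeholder for the empty tens position
    if ones:
        parts.append(DIGITS[ones])
    return "".join(parts)

# Numerals for the first three items of a list.
labels = [to_chinese_numeral(i) for i in range(1, 4)]
```

The generated labels could then be written into the HTML as ordinary text, with the list markers turned off in the stylesheet.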

So what other considerations were there? A big consideration was the referencing mechanism. To me, modeling how you refer to something in an information model can be more important than the information model itself. The referencing mechanism defines how the information is organized and allows you to address a specific piece of information in a very precise and accurate way. Done right, any piece of information can be accessed very quickly and easily. Done wrong and you get chaos.

Our referencing mechanism relies on the Functional Requirements for Bibliographic Records (FRBR). This mechanism is used by both SLIM and Akoma Ntoso. Another interesting FRBR proposal for legislation can be found here.

FRBR defines an information model based on a hierarchical scheme of Work-Expression-Manifestation-Item. Think of the work as the overall document being addressed, the expression as the version desired, the manifestation as the format you want the information presented in, and finally the item as a means for addressing a specific instance of the information. Typically we’re only concerned with Work-Expression-Manifestation.

For a bilingual or multilingual system, the “expression” part of the reference is used to specify which language you wish the document to be returned in. If you check out the references at Legix.info you will see that the two references to the Hong Kong Basic Law are:

The expressions are called out as “doc;en-uk” for the English version and “doc;zh-yue” for the Chinese version. Relatively straightforward. The manifestations are not shown and the result is the default manifestation of HTML.
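
To make the Work-Expression-Manifestation idea a little more concrete, here is a rough Python sketch of how the expression part might be pulled apart. It assumes the expression is a name followed by a semicolon and a language tag, as in the two examples above, and that HTML is the default manifestation. The class, the function, and the “basic-law” work identifier are placeholders of my own, not the actual Legix reference syntax.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reference:
    work: str                    # the document as a whole
    part: str                    # text before the semicolon, e.g. "doc"
    language: str                # text after the semicolon, e.g. "en-uk"
    manifestation: str = "html"  # assumed default when none is given

def parse_expression(work: str, expression: str) -> Reference:
    """Split an expression such as 'doc;en-uk' into its two pieces."""
    part, separator, language = expression.partition(";")
    if not separator or not part or not language:
        raise ValueError(f"expected 'name;language', got {expression!r}")
    return Reference(work, part, language)

# "basic-law" is a placeholder, not the real work identifier.
english = parse_expression("basic-law", "doc;en-uk")
chinese = parse_expression("basic-law", "doc;zh-yue")
```

Both references point at the same work; only the expression differs, which is exactly what lets a resolver hand back the same document in a different language.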

Check the samples out and let me know what you think.

Computerize vs. Automate

There are two words that have long been important to me: computerize and automate. The dictionary defines these words as follows:

Computerize
(kəm-pyū'tə-rīz') 
tr.v., -ized, -iz·ing, -iz·es. 
   1. To furnish with a computer or computer system.
   2. To enter, process, or store (information) in a computer or system of computers.

Automate
(ô'tə-māt') 
v., -mat·ed, -mat·ing, -mates. 
v.tr. 
   1. To convert to automatic operation: automate a factory.
   2. To control or operate by automation.

We often make the mistake of treating these two concepts as the same thing. They are very different. Doing one does not imply the other. Using a computer does not mean you have automated and automating does not imply the need for a computer. I have found the confusion between computerization and automation to be at the very heart of the disappointment many have with XML solutions. Just because you’re using XML does not mean you are reaping the benefits that XML can provide.

Let’s take a step back and see where we are in history. We are living in a very important era. We are witnessing the transition from paper documents to digital information. This is the sort of transition that only happens every few hundred years, rivaling the advent of the Gutenberg printing press in the 15th century. The benefits of digital information are all around us. Just think of how efficient many businesses have become. As I write this, I am waiting on a parcel that was shipped from Shanghai just 4 days ago. I have tracked that parcel throughout its journey and I know with certainty that it will be delivered in the next couple of hours. That is a benefit of automation.

In my experience, governments don’t see the same benefits of automation that the private sector does. Why is this? Governments, like private industry, have readily computerized their operations. But when it comes to automating, governments tend to balk. There are many reasons for this – the perceived loss of jobs, the need to retrain, the lack of competitive pressures. But to me it seems that the overriding reason is tradition. Things are done the way they are because that is the way they have always been done. When it comes time to rethink tradition, it is sometimes hard to identify who you need to get permission from.

Whatever the reasons, the reluctance to automate slows innovation when it comes to legislative information. Sure, the information is now online. Great! But what has been put online is most often just digital paper – like PDFs or unstructured HTML. That’s a half step into the future whilst looking to the past. Rather than taking advantage of the new medium and exploiting what now can be done through automation, we’re clinging to the centuries-old models for how to manage and publish paper.

Why is this important? What does it matter? Well, for starters, let’s consider accuracy. For as long as I have been involved in this field, the importance of accurately representing the law has been drilled into me. Yet whenever I start writing software to analyze laws, from anywhere, I am surprised at how easy it is to find errors. I’m talking about citations to sections of laws that don’t exist anymore or have more recently been renumbered to be somewhere else. I am talking about duplicate numbering or misnumbering. I am talking about common typos. These are all things that could be rectified with proper automation.
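
Checks like these are not hard to write once the text is structured. As a rough Python sketch, assume each section has already been reduced to its number and the list of section numbers it cites (extracting those from real text is the hard part and is not shown). The function name and the sample data are invented for illustration.

```python
from collections import Counter

def check_sections(sections):
    """Report duplicate section numbers and citations to missing sections.

    `sections` is a list of (number, cited_numbers) tuples in document order.
    """
    numbers = [number for number, _ in sections]
    duplicates = sorted(n for n, count in Counter(numbers).items() if count > 1)
    known = set(numbers)
    dangling = [
        (number, cited)
        for number, cited_list in sections
        for cited in cited_list
        if cited not in known
    ]
    return duplicates, dangling

# Made-up data for illustration only.
sample = [
    ("100", ["101"]),
    ("101", ["105"]),  # cites a section that does not exist
    ("102", []),
    ("102", ["100"]),  # same number used twice
]
duplicates, dangling = check_sections(sample)
```

Run as part of the drafting or publishing process, a report like this would catch the broken citation and the repeated number before either one is published.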

A pet project for me is point-in-time law. It is a subject that has fascinated me for a decade. It is very hard to do. Why is that? Because deciding which law is effective or operational at any point in time is really hard to do. And deciphering references between documents is riddled with ambiguity. This is because, whilst we live in an era where information around the world is stitched together at lightning speeds by computers, we still write that information somewhere in the text of a bill to be read by a person alone. Sometimes I find that quite ironic as I am constantly surprised at how few people actually read the bills – despite having strong opinions about them.
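
The lookup itself is the easy part. The Python sketch below assumes each version of a section carries a single, known effective date, with no repeals, sunset clauses, or contingent enactments. In practice that assumption is precisely what is so hard to satisfy. The function name and the dates are invented for illustration.

```python
from bisect import bisect_right
from datetime import date

def version_in_effect(versions, on):
    """Return the text in effect on the date `on`, or None if there is none.

    `versions` is a list of (effective_date, text) sorted by effective_date.
    """
    dates = [effective for effective, _ in versions]
    index = bisect_right(dates, on)
    if index == 0:
        return None  # nothing had taken effect yet
    return versions[index - 1][1]

# Invented history of a single section, for illustration only.
history = [
    (date(2005, 1, 1), "original text"),
    (date(2009, 7, 1), "first amendment"),
    (date(2011, 1, 1), "second amendment"),
]
answer = version_in_effect(history, date(2010, 3, 15))
```

All the difficulty lies in filling in that list of dates reliably, because today they have to be dug out of prose that was written for a human reader.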

Isn’t it time we started treating legislation as digital information rather than as paper? Isn’t it time we went beyond computerization and looked towards real automation of legislation?

Legislative Information Modeling

Last week I brought up the subject of semantic webs for legal documents. This week I want to expand the subject by discussing the technologies that I have encountered recently that point the way to a semantic web. Of course, there are the usual general purpose semantic web technologies like RDF, OWL, and SPARQL. Try as I might, I have been unable to get much practical interest out of anyone in these technologies. Part of the reason is that the abstraction they demand is just beyond most people’s grasp at this point in time. In academic circles it becomes easy to discuss these topics, but step into the “real world” and interest evaporates immediately.

Rather than pressing ahead with those technologies, I have chosen in recent years to step away and focus more on less abstract and more direct information modeling approaches. As I mentioned last week, I see two key areas of information modeling – the documents and the relationships between them. In some respects, there are three areas – distinguishing the metadata about the documents from the documents themselves. Typically I lump the documents with their metadata because much of the metadata gets included with the document text blurring the distinction and calling for a more uniform integrated model.

The projects I have worked on over the past decade have resulted in several legislative information models. With each project I have learned and evolved to result in the SLIM model found at the Legix.info demonstration website that exists today. Over time, a few key aspects have emerged as most important:

  • First and foremost has been the need for simplicity. It is very easy to get all caught up with the information model, discovering all the variations out there and finding clever solutions to each and every situation. However, it easily becomes possible to end up with a large and complex information model that you cannot teach to anyone that does not share your passion and experiences in information modeling. Your efforts to satisfy everyone result in a model that satisfies no one due to the resulting complexity of trying to please too many masters.
  • Secondly, you need to provide a way to build familiarity into your information model. While there are many consistently used terms in legislation, at the same time, traditions around the world do vary and sometimes very similar words have quite different meanings to different organizations. Trying to change long standing traditions to arrive at more consistent or abstract terminology always seems to be an uphill battle.
  • Thirdly, you have to consider the usage model. Is the model intended for downstream reporting and analysis or does the model need to work in an editing environment? An editing model could be quite different from a model intended only for downstream processing. The reason for this is that the manner in which the model will interact with the editor must be given important consideration. Two important aspects require consideration. First, the model must be robust yet flexible enough to work with all the intermediate states that a document will exist at whilst being edited. Second, change tracking is a very important consideration during the amendment process and how that function will be implemented in the document editor must be considered.

While I have developed SLIM and its associated reference scheme over the past few years, in the last year I have started experimenting with a few alternate models in the hopes of finding a more perfect model to solve the problem of legislative information modeling. Most recently I have started experimenting with Akoma Ntoso, developed by Fabio Vitali and Monica Palmirani at the University of Bologna. This project is supported by Africa i-Parliaments, a project sponsored by the United Nations Department of Economic and Social Affairs. I very much like this model as it follows many of the same ideals in terms of good information modeling that I try to conform to. In fact, it is quite similar to SLIM in many respects. The legix.info site has many examples of Akoma Ntoso documents, created by translating SLIM into Akoma Ntoso via an XSL Transform.

While I very much like Akoma Ntoso, I have yet to master it. It is a far more ambitious effort than SLIM, has many more tags, and covers a broader range of document types. Like SLIM, it covers both the metadata and the document text in a uniform model. I have yet to convince myself as to its viability as an editing schema. Adapting it to work with the editors I have worked with in the past is a project I just haven’t had the time for yet.

The other important aspect of a semantic web, as I wrote about last week, is the referencing scheme. Akoma Ntoso uses a notation based on coded URLs to implement referencing. It is partly based on the conceptually similar URN:LEX model, which is built around URNs and was developed by Enrico Francesconi and Pierluigi Spinosa at the ITTIG/CNR in Florence, Italy. Both schemes build upon the Functional Requirements for Bibliographic Records (FRBR) model. I have tried adopting both models but have run into snags with the models either not covering enough types of relationships, scaring people away with too many special characters with encoded meaning, or resulting in too complex a location resolution model for my needs. At this point I have cherry picked the best features of both to try and arrive at a compromise that works for my cases. Hopefully I will be able to evolve towards a more consistent implementation as those efforts mature.

My next effort is to start taking a closer look at MetaLex, an open XML-based interchange format for legislation. It has been developed in Europe and defines a set of conventions for metadata, naming, cross references, and compound documents. Many projects in Europe, including Akoma Ntoso, comply with the MetaLex framework. It will be interesting for me to see how easily I can adapt SLIM to MetaLex. Hopefully the changes required will amount mostly to deriving from the MetaLex schema and adapting to its attribute names. We shall see…

What is a Semantic Web?

Tim Berners-Lee, inventor of the World Wide Web, defines a semantic web quite simply as “a web of data that can be processed directly and indirectly by machines”. In my experience, that simple definition quickly becomes confusing as people add their own wants and desires to the definition. There are technologies like RDF, OWL, and SPARQL that are considered key components of semantic web technology. It seems though that these technologies add so much confusion through abstraction that non-academic people quickly steer as far away from the notion of a semantic web as they can get.

So let’s stick to the simple definition from Tim Berners-Lee. We will simply distinguish the semantic web from our existing web by saying that a semantic web is designed to be meaningful to machines as well as to people. So what does it mean for a web of information to be meaningful to machines? A simple answer is to say that there are two primary things that a machine needs to understand about a web. First of all, what the pages are all about, and secondly what the relationships that connect the pages together are all about.

It turns out that making a machine capable of understanding even the most rudimentary aspects of pages and the links that connect them is quite challenging. Generally, you have to resort to fragile custom-built parsers or sophisticated algorithms that analyze the document pages and the references between them. Going from pages with lots of words connected somehow to other pages to a meaningful information model is quite a chore.

What we need to improve the situation are agreed upon information formats and referencing schemes in a semantic web that can more readily be interpreted by machines. Defining what those formats and schemes are is where the subject of semantic webs starts getting thorny. Before trying to tackle all of this, let’s first consider how this all applies to us.

What could benefit more from a semantic web than legal publishing? Understanding the law is a very complex subject which requires extensive analysis and know-how. This problem could be simplified substantially using a semantic web. Legal documents are an ideal fit to the notion of a semantic web. First of all, the documents are quite structured. Even though each jurisdiction might have their own presentation styles and procedural traditions, the underlying models are all quite similar around the world. Secondly, legal documents are rich with relationships or citations to other documents. Understanding these relationships and what they mean is quite important to understanding the meaning of the documents.

So let’s consider the current state of legal publishing – and from my perspective – legislative publishing. The good news is that the information is almost universally available online in a free and easily accessed format. We are, after all, subject to the law and providing access to that law is the duty of the people that make the laws. However, providing readable access to the documents is often the only objective, and any means of accomplishing that objective is considered good enough. Documents are often published as PDFs which are nice to read, but really difficult for computers to understand. There is no uniformity between jurisdictions, minimal analysis capability (typically word search), and links connecting references and citations between documents are most often missing. This is a less than ideal situation.

We live in an era where our legal institutions are expected to provide more transparency into their functions. At the same time, we expect more from computers than merely allowing us to read documents online. It is becoming more and more important to have machines interpret and analyze the information within documents – and without error. Today, if you want to provide useful access to legal information by providing value-added analysis capabilities, you must first tackle the task of interpreting all the variations in which laws are published online. This is a monumental task which then subjects you to a barrage of changes as the manner in which the documents are released to the public evolves.

So what if there was a uniform semantic web for legal documents? What standards would be required? What services would be required? Would we need to have uniform standards or could existing fragmented standards be accommodated? Would it all need to come from a single provider, from a group of cooperating providers, or would there be a looser way to federate all the documents being provided by all the sources of law around the world? Should the legal entities that are sources of law assume responsibility for publishing legal documents or should this be left to third party providers? In my coming posts I want to explore these questions.

To go Open Source or Not?

It is my dream to establish a legal informatics industry. Today, legal informatics is conducted either as an internal function or by consulting firms that specialize in long multi-year projects to build custom solutions. The few commercial products that exist are in the form of proprietary products or web services. Compared to many other forms of informatics, legal informatics has evolved very slowly. Part of the reason for this, of course, is the specialized nature of this field. This is particularly the case with legislative information where each legislature or parliament has long established traditions that are difficult to change.

As with every other informatics field, an industry will be established eventually. The costs of custom-built software are simply economically impractical in many cases, demanding a re-think about how solutions are created. For me, a key part of establishing that industry is the creation of standards. Whether there are official standards or de facto standards, standards will spur on the creation of an industry by creating a common model upon which to build. I have seen this happen in other industries that I have participated in and I don’t see why legal informatics should be any different. Yes, legal informatics is tardy in this regard, but that slowness should not discourage us from making it happen now.

So the question is quite simple. Can an open source solution form the basis of a de facto standard for legal informatics? And if so, what does that solution need to consist of? That is the question we have been wrestling with. There are two sides to this argument. While we might want to promote the establishment of an industry, at the same time we need to provide an economic incentive that will encourage businesses to participate. Could providing too much of an open source solution merely enable existing players to be more efficient, yet continue to work in relative isolation? It seems that a better outcome would be to promote the creation of interoperable products that can be mixed or matched to solve the multitude of needs in this field.

To this end, we’re trying a two-pronged approach. First, we are fully supporting the establishment of official standards through bodies such as OASIS. Secondly, understanding that the official standard route is going to be a slow and perhaps arduous process, we’re pushing for the establishment of de facto standards. To this second goal, we have open-sourced our own SLIM model for legislation. It’s a very simple XML model based on 10 years of experience building these types of models. While it isn’t a be-all and end-all solution, it is consistent with the current XML thinking and is quite easy to adopt. I have spent a fair amount of time recently wrestling with how to release it as an open source package. There are two questions I have:

  1. Which licensing model? There are so many to choose from: Creative Commons, GNU, BSD, etc. Which model is permissive enough without discouraging commercial entities from adopting it?
  2. What aspects should be open source? I think it is quite clear that any and all XML information models should be open. That is in the spirit of XML, is consistent with the public domain nature of legislative information, and will allow the data to be accurately interpreted long after any particular software application or company has run its course. But would providing foundational software packages that are also open source further encourage the adoption of the model? And if that is the case, what foundation software would be beneficial?

At this point we have answered the first question and the first part of the second. We have chosen to release the SLIM XML schema as open source by using a Creative Commons Attribution-ShareAlike license. The rest of the second question remains open. What else should we provide with an open source license? Certainly it cannot be our full software suite. We are a commercial business and we need to make a living. But parts of our packages could be released to promote the adoption of SLIM as a de facto standard of sorts. What do you think?

Welcome to my new blog on Legal Informatics

Imagine that all the world’s laws are published electronically in an open and consistent manner. Imagine that you or your business can easily research the laws to which you are subject. Imagine an industry that caters to the needs of the legal profession based on open worldwide standards.

Of course, there are many reasons why this is just not possible. Every legislature or parliament has its own way of doing things. Every country has its own unique legal system. Every jurisdiction has its own unique traditions. It simply isn’t possible that all these unique requirements could be harmonized to achieve that vision. Of course not… But it will happen. It might take 50 years, but eventually it will happen. We can debate endlessly why it won’t. We can argue over nuances that get in the way forever. That’s not why I am writing this blog.

I want to open the discussion to how it might happen. What steps can we start taking right now that will lead us towards our eventual goal? We live in an era where there is widespread dissatisfaction with the way our governments pass laws. There are constant calls for better transparency into the workings of the legislative process. The dissatisfaction we all feel has created an opportunity for entrepreneurial startups. Their goals are most often to effect change in government. For those of us with existing experience in this field, how can we harness our knowledge and work with these emerging efforts to achieve a greater good for us all?

I’ve spent the past ten years in this field, working as a consultant and developer primarily to the State of California. See my About for more about me. Now, with that experience to draw upon, I am hoping to make this blog a useful tool to others that might learn from my past. I’m going to make this blog a regular part of my life – posting regularly, maybe weekly. With each post I want to raise a number of questions and open up thoughtful discussions. Some of the topics I have in mind:

  • How do we balance openness and transparency with business opportunity?
  • Do we need open standards? If not now, when?
  • When it comes to openness and transparency, what is the government’s responsibility?
  • Are there technologies we need to focus on?
  • Isn’t this a Semantic Web for law? What does that mean anyway?
  • And from time to time I will share some of the questions I get each week about how to model legislation in XML. I’ll try not to get bogged down in technical minutiae.

What else? Please leave me a comment with your suggestions. Rather than just being a blog, I would like to see this grow into more of a conversation about how legal informatics can be applied to achieve a truly beneficial semantic web for law.

What could your role be in all this? Are you a government agency, a not-for-profit, a fledgling startup, a publishing company, or even a technology supplier or consultant like myself? Regardless of who you are, I am asking for your participation in this blog. Together we can shape the future of how legal information is shared around the world.

So let’s get started… My next post will start with a question I have been wrestling with lately – How can we heed the call for better open source data without hindering the for-profit motive that will foster an industry?
