A Brief Overview of Corpus Legis

Michael Genesereth
Computer Science Department
Stanford University

Abstract: Corpus Legis is a library of governmental regulations encoded in computable form. While various projects in the past have experimented with encoding laws in computable form in specific areas and specific jurisdictions, Corpus Legis is an attempt to assemble a comprehensive collection of rules across multiple application areas and multiple jurisdictions - federal, state, and local - with an initial focus on the United States. In order to facilitate the use of its rules, the site offers tools for browsing regulations in the library, scenario-based search, evaluating hypothetical scenarios, and evaluating hypothetical changes to the regulations. It also provides downloadable open-source software for incorporating rules in custom applications. This article is a brief description of Corpus Legis and the services it offers.

1. Computational Law

Computational Law is the branch of Legal Informatics concerned with the mechanization of legal analysis (whether done by humans or by computers). From a pragmatic perspective, Computational Law is important as the basis for computer systems capable of doing legal calculations, such as compliance checking, legal planning, regulatory analysis, and so forth.

Various systems of this sort already exist. TurboTax is a classic example. Based on values supplied by its user, it automatically computes the user's tax obligations, fills in appropriate tax forms, and files those forms with the appropriate authority. If asked, it can supply explanations for its results in the form of references to the relevant portions of the tax code.

Systems like TurboTax have value in administering many government programs - in dealing with privacy and security matters, in intellectual property rights management, in reporting, in assessing compliance of plans with building codes (affected by local, county, state, and federal safety requirements), in electronic commerce (e.g. import/export restrictions on technology, drugs, and so forth), in labor law (e.g. occupational safety regulations and health care benefits, notably cases where state regulations interact with federal provisions), and so forth.

And the potential for deployment of such applications is substantial due to technological developments like the Internet, mobile systems (such as smart phones and smart watches), and the emergence of autonomous systems (such as self-driving cars and robots).

2. Rule-Based Systems

The question is how to build such systems. In traditional programming, systems take the form of computer programs written in languages like C and Java. These programs are typically regulation-specific and task-specific - each is designed to accomplish a specific task and embody a specific set of regulations. Our approach is to factor the regulations out of these programs and to supply them to regulation-independent programs as data.

One advantage of separating representation and processing in this way is that a single general legal system can be used multiple times, for different jurisdictions and for different combinations of jurisdictions. The dual of this is also true. Once a set of regulations is encoded formally, it can be supplied as input to different legal reasoning engines for different purposes, e.g. to check compliance, to plan for compliance, to detect inconsistencies or redundancies, and so forth.

The key to our approach is Logic Programming. In Dynamic Logic Programming, specifying the behavior of an application requires just three components: (1) a data model for the application, (2) definitions of key concepts expressed in a logic-like language, and (3) a specification of behavior in terms of transition rules.
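To make the three components concrete, here is a minimal sketch in Python rather than in an actual logic programming language. The relation names (age, adult), the action name (set_age), and the age threshold are all hypothetical, chosen only to illustrate the division of labor between a data model, definitions, and transition rules.

```python
# Hypothetical illustration: relation names and the threshold are invented.
# (1) Data model: a state is a set of facts, each a tuple (relation, arg, ...).

def definitions(facts):
    """(2) Definitions: derive the 'adult' relation from base 'age' facts."""
    derived = set(facts)
    for f in facts:
        if f[0] == "age" and f[2] >= 18:
            derived.add(("adult", f[1]))
    return derived

def transition(facts, action):
    """(3) Transition rules: how an action updates the base facts."""
    kind, person, value = action
    if kind == "set_age":
        # Drop any previous age fact for this person, then add the new one.
        kept = {f for f in facts if not (f[0] == "age" and f[1] == person)}
        return frozenset(kept | {("age", person, value)})
    return frozenset(facts)

state = frozenset({("age", "ann", 17)})
state = transition(state, ("set_age", "ann", 18))
print(("adult", "ann") in definitions(state))
```

Note that the derived fact is never stored; it is recomputed from the base facts after each transition, which is what lets the definitions be changed without touching the stored data.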

Given such a specification, automated reasoning can be used to compute the legality of real and hypothetical scenarios, to plan for compliance, and to assess the completeness and consistency of regulations.

Admittedly, specifying these things requires some work, just as writing the formulas in a spreadsheet requires some work. However, our experience has shown that the work required to create and maintain logic programs is less than that required for systems built with traditional programming technologies. Moreover, it is far easier for people to learn to build systems this way than with previous technologies.

Today, there are multiple vendors of legal technology who use Computational Logic in one form or another. For example, there is Neota Logic, a software platform that enables non-programmers to build and deploy rule-based dialog systems for professional services in general and legal services in particular.

And there is a web-based service, called Worksheets, which enables users to create, publish, and manage online worksheets. The main feature of the service is its do-it-yourself nature. Administrators using the service can create and manage worksheets on their own, without the help of professional programmers. Just as it is possible for users without programming expertise to create traditional spreadsheets, it is possible for users without programming expertise to create and manage online worksheets, workbooks, and workspaces.

3. Corpus Legis

Today, different developers encode regulations independently of each other and expend effort to keep them updated. Thus, much work is done redundantly, and there is increased potential for mistakes - missing regulations or erroneous regulations.

One way of decreasing these problems is to have a store of regulations from which different platforms can draw, a utility similar to a power company or a water company, where regulations are published just once and provided for everyone to use, ideally with input by the regulatory agencies themselves.

Corpus Legis is a comprehensive library of governmental regulations encoded in computable form. While various projects in the past have experimented with encoding laws in computable form in specific areas and specific jurisdictions, Corpus Legis is an attempt to assemble a comprehensive collection of rules across multiple branches of government and multiple jurisdictions - federal, state, and local. In this regard, it is similar in scope to various previous projects, e.g. the Hammurabi Project.

Note that Corpus Legis can in principle also be used as a repository for non-governmental regulations - business rules and contracts, especially those of large organizations like insurance companies. We have also talked about including the rules of games. This may seem frivolous, but it is actually a good way for people to learn how to read and write rules. A repository of this sort (Gamemaster) has been around for the last ten years or so and has shown the adequacy of the language.

4. Organization

In Corpus Legis, regulations are grouped together into rulesets, each of which is authored by a single individual or closely interacting group of individuals.

There is no constraint on how many regulations can be included in a given ruleset. The author of a ruleset typically decides on the coverage of the ruleset. That said, the ideal is non-redundancy and completeness. It is desirable for each regulation to appear in just one ruleset, and ideally, all regulations within a jurisdiction should (eventually) appear in at least one ruleset.

In Corpus Legis, each ruleset is annotated with a few pieces of metadata - a title, a short description (in English), the source document for the regulations encoded in that ruleset, the author of the ruleset, and the authority / jurisdiction of the ruleset.
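As a rough illustration, the metadata listed above could be modeled as a simple record. The field names and the example values below are assumptions made for this sketch; they are not the actual Corpus Legis schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RulesetMetadata:
    """Hypothetical record mirroring the metadata items named in the text."""
    title: str
    description: str   # short description in English
    source: str        # source document for the encoded regulations
    author: str        # individual or group that authored the ruleset
    jurisdiction: str  # authority / jurisdiction of the ruleset

# Placeholder values for illustration only.
example = RulesetMetadata(
    title="Example leash ordinance",
    description="Hypothetical rules about dog leashes in public parks.",
    source="(placeholder citation)",
    author="(placeholder author)",
    jurisdiction="Palo Alto",
)
print(example.title, "-", example.jurisdiction)
```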

The rulesets in Corpus Legis are grouped according to the authority or jurisdiction associated with the ruleset. For example, all rules in the United States code of regulations are grouped together; all California state regulations are grouped together; all Palo Alto regulations are grouped together; and so forth.

Authorities / jurisdictions are grouped together by containment. For example, all states and territories of the USA are listed as parts of the USA. Counties are listed with each state. Cities are listed with each county.
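One simple way to represent containment is a parent table, sketched below. The table and the function name are hypothetical; the point is only that, given a location, the chain of enclosing jurisdictions (and hence the groups of rulesets that might apply) can be computed by walking upward.

```python
# Hypothetical containment table: each jurisdiction maps to the one containing it.
PARENT = {
    "California": "USA",
    "Santa Clara County": "California",
    "Palo Alto": "Santa Clara County",
}

def enclosing(jurisdiction):
    """Return the jurisdiction followed by every jurisdiction that contains it."""
    chain = [jurisdiction]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

print(enclosing("Palo Alto"))
# ['Palo Alto', 'Santa Clara County', 'California', 'USA']
```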

5. Search

Corpus Legis provides several ways for users to find rulesets - browsing, keyword search, and semantic search.

Browsing is the most straightforward method for finding regulations. If a user is interested in regulations from a specific jurisdiction, he can navigate the organizational hierarchy to find the jurisdiction of interest and can then examine the hierarchy of rulesets associated with that jurisdiction.

Keyword search allows the user to find applicable regulations by entering keywords into a search box. Corpus Legis uses the keywords specified by the user to produce a list of all rulesets in which those keywords appear in the title, short description, authority, or the original text of the associated regulations.
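A minimal sketch of this kind of search follows. The sample library, the field names, and the choice to require that every keyword appear (rather than any one of them) are assumptions of the sketch, not a description of the site's implementation.

```python
# Hypothetical library entries; titles and texts are invented for illustration.
LIBRARY = [
    {"title": "Example leash ordinance", "description": "Dogs in parks",
     "authority": "Palo Alto", "text": "Dogs must be leashed in public parks."},
    {"title": "Example overtime rules", "description": "Overtime pay",
     "authority": "California", "text": "Hours beyond a threshold earn overtime."},
]

def keyword_search(rulesets, keywords):
    """Return titles of rulesets whose searchable fields contain every keyword."""
    fields = ("title", "description", "authority", "text")
    hits = []
    for r in rulesets:
        # Search all four fields at once, ignoring case.
        haystack = " ".join(r[f] for f in fields).lower()
        if all(k.lower() in haystack for k in keywords):
            hits.append(r["title"])
    return hits

print(keyword_search(LIBRARY, ["dogs", "parks"]))
```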

In semantic search, the user specifies a collection of facts about a specific case and specifies a question of interest. The system uses the specified facts to find rulesets that may be relevant to answering the specified question. The more specific the case, the narrower the set of results.

The downside of semantic search is that it requires more work on the part of the user than the other search methods. He must specify the details of the case. However, it is an extremely powerful search technique. It can be used for finding rulesets that span multiple jurisdictions, and it can be used to identify interactions between rulesets, either within or across jurisdictions.
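The following sketch shows one way scenario-based search could behave, under strong simplifying assumptions: each ruleset declares the relations it can answer (outputs) and the conditions under which it applies (applies_when), and a ruleset is excluded only when the case states a conflicting fact. These names and the sample rulesets are hypothetical.

```python
# Hypothetical rulesets; "applies_when" and "outputs" are invented fields.
RULESETS = [
    {"title": "Federal overtime (example)", "applies_when": {},
     "outputs": {"overtime_due"}},
    {"title": "California overtime (example)", "applies_when": {"state": "CA"},
     "outputs": {"overtime_due"}},
    {"title": "Texas overtime (example)", "applies_when": {"state": "TX"},
     "outputs": {"overtime_due"}},
]

def semantic_search(rulesets, case, question):
    """Return titles of rulesets that can answer the question and do not conflict with the case."""
    hits = []
    for r in rulesets:
        if question not in r["outputs"]:
            continue
        # Unknown facts never exclude a ruleset; only stated, conflicting facts do.
        conflict = any(k in case and case[k] != v
                       for k, v in r["applies_when"].items())
        if not conflict:
            hits.append(r["title"])
    return hits

print(semantic_search(RULESETS, {}, "overtime_due"))
print(semantic_search(RULESETS, {"state": "CA"}, "overtime_due"))
```

With no facts, all three example rulesets are candidates; adding the state narrows the result to the federal and California rulesets, which also illustrates how a single case can surface rulesets from more than one jurisdiction.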

6. Application

Once a user has identified a ruleset of interest, he can read the original text of the underlying regulations, the rules themselves, and any recorded commentary on the ruleset. Corpus Legis also provides its users a simple way to try out the rules on the details of specific cases.

The schema associated with a ruleset partitions relations mentioned in the ruleset into three types - input relations, intermediate relations, and output relations. An input dataset is a collection of data involving only the input relations, and an output dataset is a collection of data involving the output relations. A ruleset is effectively a mapping from input datasets to output datasets; for each input dataset, there is one and only one corresponding output dataset.
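The idea of a ruleset as a function from input datasets to output datasets can be sketched as follows. The relations (hours_worked, overtime_owed) and the 40-hour threshold are invented for illustration and do not encode any actual regulation.

```python
def evaluate(inputs):
    """Map an input dataset to its single corresponding output dataset."""
    # Intermediate relation: computed from the inputs, not part of the output.
    overtime_hours = {w: max(0, h - 40) for (w, h) in inputs["hours_worked"]}
    # Output relation: workers with a positive number of overtime hours.
    owed = {(w,) for w, extra in overtime_hours.items() if extra > 0}
    return {"overtime_owed": owed}

case = {"hours_worked": {("ann", 45), ("bob", 38)}}
print(evaluate(case))
# {'overtime_owed': {('ann',)}}
```

Because evaluate has no hidden state, the same input dataset always yields the same output dataset, which is the "one and only one" property described above.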

Corpus Legis's application tool allows its users to explore this mapping by providing different input datasets and viewing the resulting output dataset. When the tool is invoked with a given ruleset, the system provides the user a generic "worksheet" that shows the input and output relations. The user can enter new values for the input relations, and the system computes and shows the user the corresponding output relations.

Once a dataset is entered, the user can store the dataset. And, once datasets are stored, they can be retrieved for later use. This allows the user to explore the ruleset across multiple cases.

As we discuss below, this also allows authors of rulesets to evaluate changes to rulesets to determine whether those changes lead to changes in the results on any previously stored datasets, either for quality assurance purposes in encoding regulations or to evaluate the effects of hypothetical changes to regulations.
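This kind of check amounts to running two versions of a ruleset over the stored datasets and reporting where the outputs differ. The sketch below assumes each ruleset version is available as a Python function from a dataset to its outputs; the eligibility rules and the stored cases are hypothetical.

```python
def changed_cases(old_rules, new_rules, stored):
    """Return the names of stored datasets whose outputs differ between versions."""
    return [name for name, data in stored.items()
            if old_rules(data) != new_rules(data)]

# Hypothetical rule versions: the change raises an age threshold from 18 to 21.
def old_version(data):
    return {"eligible": data["age"] >= 18}

def new_version(data):
    return {"eligible": data["age"] >= 21}

stored = {"case_a": {"age": 17}, "case_b": {"age": 19}, "case_c": {"age": 30}}
print(changed_cases(old_version, new_version, stored))
# ['case_b']
```

An empty result suggests a re-encoding preserved behavior on the stored cases; a non-empty result identifies exactly which cases a proposed change to the regulations would affect.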

7. Distributed Development

One feature that differentiates Corpus Legis from other projects in Computational Law is its scope. It is intended to be a comprehensive library of governmental regulations encoded in computable form. Creating such a library is likely to require a large amount of work.

In order to meet this need, the roadmap for Corpus Legis identifies several sources of contribution - (1) research projects within CodeX and other centers specializing in Computational Law, (2) law school clinics, and (3) other individuals and organizations in the Computational Law community (following the Wikipedia model).

Of course, the best way to deal with this problem, and the long term ideal for Corpus Legis, is for regulatory bodies to author regulations in computable form. To their credit, several agencies have expressed interest in making contributions of this sort. However, this approach is not likely to be practical until a substantial fraction of Corpus Legis has been built and its value has been demonstrated.

The main difficulty with distributed development is ensuring that contributions from different individuals or organizations can be combined effectively; and the main problem in this regard stems from differences in the vocabulary and schemas used by different contributors.

To some extent, these differences can be mitigated by providing standard vocabularies and naming conventions for entities of various sorts (e.g. countries, states, cities), information about these entities (e.g. organizational relationships and the roles of individuals), and definitions of standard concepts (e.g. the notion of compound interest).

Unfortunately, even with such standards and conventions, there are likely to be differences that make it impossible to compare or combine regulations and apply them to given data. For example, each state might prefer to represent information about its citizens in a different way.

One way of dealing with this problem is to use data integration technology. This requires the existence of a master vocabulary and a master schema. However, it does not require anyone to use that vocabulary and schema in encoding their data and writing their definitions; they can use whatever vocabulary and schema is most convenient. The catch is that they must separately write mapping rules defining their concepts in terms of the concepts in the master vocabulary and schema. Although this requires additional work, it allows different contributors to use their own vocabularies and schemas; and those mapping rules allow systems like Corpus Legis to combine datasets and rulesets written using different vocabularies and schemas.
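As a small illustration of mapping rules, suppose two contributors record residents differently and each supplies a mapping into a shared master schema. The source names, field names, and status code below are all hypothetical.

```python
# Hypothetical mapping rules: each source schema is translated into the
# master schema with the fields "name" and "resident".
MAPPINGS = {
    "state_a": lambda r: {"name": r["full_name"], "resident": r["is_resident"]},
    "state_b": lambda r: {"name": f'{r["first"]} {r["last"]}',
                          "resident": r["status"] == "R"},
}

def integrate(datasets):
    """Combine records from several sources into one list in the master schema."""
    return [MAPPINGS[src](rec)
            for src, recs in datasets.items()
            for rec in recs]

combined = integrate({
    "state_a": [{"full_name": "Ann Lee", "is_resident": True}],
    "state_b": [{"first": "Bob", "last": "Cruz", "status": "N"}],
})
print(combined)
```

Each contributor keeps its own schema and writes only its own mapping; rules written against the master schema can then be applied to the combined data.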

8. Conclusion

In a sense, Computational Law is the natural next step in a progression that began millennia ago. Around 1750 BC, Hammurabi had the laws of the land encoded in written form (literally etched in stone) so that citizens could know what was expected of them and what would happen if they violated those expectations. Since then, it has been the norm to encode rules in written form and disseminate them, first via books and more recently via the Internet. However, with the proliferation of rules and regulations, just writing things down is not enough when the laws are voluminous and difficult to understand. In a way, Computational Law is the first revolutionary bit of progress in this regard since the days of Hammurabi, and Corpus Legis is something like the Code of Hammurabi in computer form.