Keshav Soni and Rónán Kennedy *

This article explores the integration of Large Language Models (LLMs) with Rules as Code (RaC) systems to address scalability challenges in coding the law. While RaC offers potential solutions to the “open texture” problem in law by encoding legislation into computer-readable formats, the complexity and volume of modern legal systems make manual encoding resource-intensive. The article examines early experiments using LLMs to automate the conversion of legal text into code, highlighting the fundamental tension between the deductive reasoning of RaC systems and the inductive reasoning of LLMs. It identifies the “black box” problem in LLMs as a key obstacle and proposes potential solutions, including Explainable AI and Chain-of-Thought prompting, to reconcile accuracy with explainability. The article demonstrates how Chain-of-Thought prompting improves both accuracy and explainability in legal reasoning tasks, suggesting a promising direction for scaling RaC systems while maintaining interpretability.
HLA Hart, in his book ‘The Concept of Law’, presents us with a positivist account of rules and legal systems. In Chapter 7, he introduces the problem of the open texture of law. He argues that using natural languages such as English to draft legislation necessarily leads to the problem of open texture: when laws are made in natural language, there will be a set of plain meanings called the core and a set of unsettled meanings called the penumbra. It is important to note that he deliberately attaches the problem of open texture to natural languages and attaches no such problem to symbolic languages or computer code.
The Rules as Code (“RaC”) movement provides an exciting opportunity to resolve the problem of open texture by coding the law. It holds the potential to revolutionise our legal systems and make them more accessible to people. However, a problem that arises when we seek to encode our legislation is scalability. The laws of the 21st century look nothing like the ‘No Vehicles in the Park’ rule that Hart presents to us; they are far more complex, with many intersecting applications. If we are to encode such a complex system of law, we must also try to make this process efficient. In this context, Large Language Models (“LLMs”) may be of some assistance, although detailed testing will be necessary before large-scale adoption.
This article deals with the use of LLMs in scaling up RaC systems, the challenges surrounding it, and how those challenges might be solved. Part I highlights the problems associated with developing RaC systems for any legal system – their rigidity and their error-prone nature – and suggests the use of LLMs as a potential solution to scale up RaC systems. Part II explores past experiments that sought to use LLMs to convert legal texts directly into code, highlighting the takeaways from, and limitations of, employing LLMs to automatically extract legal representations from legal text. Part III deals with the challenges of utilizing LLMs in RaC systems: the ‘black box’ problem and the difference in reasoning between RaC systems and LLMs, which creates a trade-off between explainability and accuracy. Part IV explores potential solutions to these problems and suggests the use of Explainable AI and Chain-of-Thought prompting. Part V concludes.
Scalability and Rules as Code
Rules as Code systems may provide us with an opportunity to deliver better policy outcomes and increase transparency and efficiency. However, while there are numerous benefits, it is also important to adopt a balanced outlook to ensure a realistic approach. As Kennedy points out, such systems have often failed to live up to their promise. Rigid and unchangeable systems could be harmful if a course correction is needed, and the rigidity of computer code means that RaC systems are slower to develop. Further, these systems are error-prone, as legal rules can get lost in translation when legislation is encoded. This presents a significant challenge in developing RaC systems for any legal system.
We will explore one potential solution to these problems – using LLMs to scale up RaC systems. LLMs could help extract formal representations from legislation, converting text directly into a structured legal representation. This is an attractive solution not only because LLMs such as generative pre-trained transformers (GPTs) can potentially improve productivity when prompted in natural language, but also because representations generated by an LLM can sometimes outperform manually created ones. In the next section, we explore instances where LLMs have been used to develop RaC systems and highlight the lessons learned from these experiments.
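To make this concrete, the sketch below (in Python, assuming the OpenAI client library) shows one way an LLM might be prompted to draft a structured representation of a statutory provision. The provision, the JSON keys and the prompt wording are our own illustrations, not drawn from any deployed RaC system, and the output would still need human review.

```python
# A minimal sketch: asking an LLM to draft a structured representation of a
# statutory rule. Assumes the OpenAI Python client; the provision, keys and
# prompt wording are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROVISION = (
    "No person shall drive a vehicle in a public park, except for emergency "
    "vehicles responding to an emergency."
)

PROMPT = f"""Convert the following statutory provision into a JSON object with
the keys "subject", "prohibited_act", "location" and "exceptions" (a list).
Return only the JSON.

Provision: {PROVISION}"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": PROMPT}],
)

# The result is a draft only; as the experiments discussed below show, it must
# be checked by a human before being treated as an encoding of the law.
print(response.choices[0].message.content)
```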
LLMs and Rules as Code: Exploratory First Steps
The possibility of expanding Rules as Code systems through LLMs has led some within the RaC community to experiment with them. While LLMs have not yet been successfully integrated to scale up Rules as Code systems, these early attempts reveal the limitations of employing LLMs in the context of RaC. In this section, we analyze three such experiments and the takeaways from each.
In 2023, Janatian et al. employed LLMs to automatically extract legal representations from texts to create a rule-based expert system called JusticeBot, designed to help laypersons understand how legislation applies to them. The legal representations created by the LLM and by humans were then rated in a blind comparison. The comparison demonstrated a negative correlation between the accuracy of the legal representation created by the LLM and the complexity of the legal text. For simple rules, the representation created by the LLM was preferred by 78% of the test participants, a figure which decreased to 50% and 37.5% for normal and hard legal rules respectively. For complex rules, the model produced incorrect output, missing important elements or making assumptions that were not part of the text. This experiment highlighted the limitations of employing LLMs to automatically extract legal representations from legal text.

Figure 1 – Working of the JusticeBot to create a rules-based expert system by employing LLMs
Additionally, in 2023, Jason Morris, a key figure in the field, tried using ChatGPT (GPT-4) to answer various legal questions and generate code to use with his BlawX programming tool. In this experiment, he tested GPT-4’s capability in three situations: first, accuracy in providing legal advice; second, accuracy in collecting and encoding fact scenarios and summarizing symbolic explanations; and third, accuracy in generating code for law. Morris concludes that while GPT-4 may be significantly better than its predecessors at interpreting the law, its flaws make it unsuitable for providing legal advice. While the model was successful in summarizing legal text into symbolic explanations, it failed to produce code that was correct syntactically (there were errors in following the rules of the programming language) or semantically (the code did not correctly capture the legal logic). Thus, the use of LLMs to scale up the development of RaC systems could suffer from logical inconsistency and errors in the generated code. If one wishes to employ LLMs in RaC development, one must first tackle these issues.
Additionally, in September 2024, teams at Georgetown University tested whether generative AI tools could help make policy implementation more efficient by converting policies into plain-language logic models and software code under a RaC approach. The important and relevant takeaway from this experiment was that results from LLMs could be improved by incorporating human evaluation and by providing the model with a ‘template’ of the expected output, i.e. by prompt engineering.
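As an illustration of this template approach, the hypothetical prompt below constrains the model to a fixed skeleton rather than free-form output. It is a sketch in the spirit of the experiment, not the Georgetown team’s actual prompt, and the section headings are invented for illustration.

```python
# A sketch of 'template' prompting: the model is given a fixed skeleton to
# fill in rather than free rein. The template is hypothetical and is not the
# prompt used in the Georgetown experiment.
TEMPLATE_PROMPT = """You will be given a policy extract. Fill in this template
exactly, and do not add any other sections:

ELIGIBILITY CRITERIA:
- <one bullet per criterion, quoting the policy text it comes from>
REQUIRED EVIDENCE:
- <documents the applicant must provide>
DECISION RULE:
- <a single IF/THEN sentence combining the criteria>

Policy extract:
{policy_text}
"""

# Usage: TEMPLATE_PROMPT.format(policy_text=...) produces the final prompt,
# whose output a human reviewer then checks against the source policy.
```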
These experiments highlight the problem of flawed reasoning when LLMs are used to scale up RaC systems. In the next section, we take a closer look at the challenges of encoding law through LLMs by focusing on the drawbacks of these models. We argue that there is an inherent inconsistency in employing LLMs in RaC systems due to the difference in reasoning between the two, and that anyone seeking to employ LLMs in RaC development must first reconcile this difference.
Challenges for Rules as Code and LLMs
The use of LLMs to extract information from legal texts in order to generate code is not a new concept; it has been employed in various scenarios by many scholars. However, LLMs have significant limitations when they are used in a RaC context, where accuracy, analysis and completeness are crucial. There have been some attempts at tackling this problem, with some limited success. A team at Stony Brook University used an LLM for knowledge extraction and the Prolog programming language for reasoning, achieving 100% accuracy. The Communications Research Centre in Ottawa developed prompts for an LLM that can generate artefacts such as a knowledge graph, which can serve as input to further development work. They also developed a retrieval-augmented generation system for regulatory documentation that ‘has shown great promise’. Price and Bertl have developed a method for automatically extracting rules from legal texts which could be applied using LLMs for greater efficiency.
However, the problem in building RaC systems through LLMs persists because there is a fundamental difference in reasoning between the two. RaC systems are based on encoding fixed legal rules into computer-readable code in order to increase efficiency in the legal system. This is a classic expert system approach, a type of AI which employs deductive reasoning over the encoded laws to produce a legal output that matches the correct legal reasoning based on the statute. LLMs, on the other hand, are based on unsupervised machine learning and employ inductive reasoning, where the output is determined at random (or ‘stochastically’) by mass correlations: the model is a prediction algorithm that generates a string of words which are statistically likely to be found in sequence. Furthermore, such systems rely on deep learning techniques to analyze and interpret complex data, which lack explainability and give rise to the ‘black box’ problem.
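The deductive side of this contrast can be made concrete with a minimal sketch: a hand-encoded, deterministic version of Hart’s park rule (the emergency-vehicle exception is purely illustrative). Given the same facts it always returns the same answer, and that answer can be traced to a specific condition, whereas an LLM samples a statistically likely continuation and may answer the same question differently on different runs.

```python
# A minimal, illustrative encoding of a 'No Vehicles in the Park' rule in the
# deductive, rules-as-code style. The exception is invented for illustration.
def vehicle_permitted_in_park(is_vehicle: bool, is_emergency_vehicle: bool) -> bool:
    """Deterministic rule: same facts in, same answer out, traceable to a condition."""
    if not is_vehicle:
        return True   # the rule only restricts vehicles
    if is_emergency_vehicle:
        return True   # illustrative exception for emergency vehicles
    return False      # the general prohibition applies


# An ordinary car is caught by the prohibition; an ambulance is not.
assert vehicle_permitted_in_park(is_vehicle=True, is_emergency_vehicle=False) is False
assert vehicle_permitted_in_park(is_vehicle=True, is_emergency_vehicle=True) is True
```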
The black box problem in LLMs creates a divide between explainability and accuracy, in which models with higher transparency tend to score lower on accuracy. This makes them unsuitable for scaling up rules-based expert models, owing to concerns about the lack of transparency and explainability. Thus, if we want to employ LLMs to scale up RaC systems, we must solve the ‘black box’ problem. In the next section, we outline some potential solutions for the use of LLMs in scaling up RaC systems. While these may not completely resolve the problem, they can serve as a starting point for incorporating LLMs into RaC systems.
Potential Solutions for the Use of LLMs in Rules as Code
The problem of the black box plagues LLMs, as it is widely believed that these models are inherently uninterpretable. The question that arises is whether scalability and explainability in LLMs are antithetical to each other, or whether they can be reconciled so that LLMs can be used in situations where explainability is paramount. In their article, Rudin and Radin argue that it is a false assumption that we must forego accuracy for interpretability. They also make the important point that the lack of interpretability in black box models can itself undermine accuracy. Take the example of a human driver versus a self-driving car based on a black box model. One may prefer the human driver for their ability to reason and explain their actions, but such a preference assumes that accuracy must be traded off against explainability. This assumption has been disproved in several studies in the criminal justice context, where simple, explainable models were as accurate as black box models. Moreover, in some scenarios, using a black box model can lead to fatal mistakes: non-explainable models can mask errors in the dataset, problems in data collection, and a host of other issues. The balance between explainability and accuracy can be better maintained if scientists understand the models they build. This can be achieved by building a larger model that is decomposable into different interpretable mini-models.
The use of interpretable mini-models may solve the problem of explainability, but how do we address the flaws in the reasoning of LLMs? One answer may be Chain-of-Thought prompting (‘CoT’). This is a multi-step, few-shot learning approach in which a larger problem is broken down into smaller intermediate steps that are solved before arriving at the final solution. Applied to legal reasoning, CoT involves breaking a complex legal question down into smaller steps that incorporate relevant legal considerations, such as court judgments, the repeal of legislation and other factors. This approach not only improves accuracy but also provides an interpretable window into the reasoning of the LLM, allowing us to analyze how it may have arrived at a particular conclusion.
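A minimal sketch of what such a prompt might look like is given below. The list of steps is our own illustration of how a legal question could be decomposed; it is not a prescribed methodology, nor the exact prompt used later in this article.

```python
# An illustrative Chain-of-Thought style prompt for a legal question: the
# model is asked to work through named intermediate steps before answering.
COT_LEGAL_PROMPT = """Answer the question by reasoning through these steps,
showing your work for each step before giving a final answer:

Step 1: Identify the legislation (if any) that governs the subject matter.
Step 2: Check whether that legislation has been amended or repealed, and by what.
Step 3: Identify any court judgments that affect how it applies.
Step 4: State the current legal position, noting any gaps in regulation.

Question: {question}
"""
```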
A simple example of CoT in legal reasoning is asking GPT-4o what the law on the regulation of groundwater in Tamil Nadu is. This is an interesting example because the subject was earlier regulated by the Tamil Nadu Groundwater (Development and Management) Act, 2003 (‘2003 Act’). However, the 2003 Act was repealed by the Tamil Nadu Groundwater (Development and Management) Repeal Act, 2013, after which Tamil Nadu lacked comprehensive state-wide legislation regulating groundwater. The question that arises is whether GPT-4o is able to give a correct answer without CoT. Our findings suggest that, without additional prompting, it is not.

Figure 2 – Answer Provided by GPT-4o without Chain of Thought Prompting
In Figure 2, GPT-4o incorrectly states that the 2003 Act was never brought into force, without explaining the reasoning behind this conclusion. Further, it fails to highlight the current lack of state-wide regulation of groundwater in Tamil Nadu. Thus, GPT-4o fails to answer the problem accurately and does not sufficiently explain the reasoning behind its answer. This reflects the inductive form of reasoning in LLMs, in which the model formulates the string of words that is most statistically likely to be found together, resulting in a lack of reasoning and an element of randomness – i.e. the ‘black box’ problem.
The question that arises is whether CoT would improve GPT-4o’s accuracy and explainability. Our findings suggest that the answer is in the affirmative. By using CoT to guide the LLM, we can prompt it on how to analyze the law governing a subject matter in a state through a multi-step process: identify the governing legislation, check whether it has been repealed and, if so, determine the current position in the legal system.
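As a sketch of how such a multi-step prompt might be issued to GPT-4o (again assuming the OpenAI client library), the snippet below shows the structure we have in mind; the wording is illustrative and is not the exact prompt used to produce Figure 3.

```python
# Illustrative only: a multi-step (Chain-of-Thought style) prompt for the
# Tamil Nadu groundwater question, sent to GPT-4o via the OpenAI client.
from openai import OpenAI

client = OpenAI()

cot_prompt = """Work through the following steps, showing your reasoning,
before giving a final answer:
1. Identify the legislation governing groundwater in Tamil Nadu.
2. Check whether that legislation has been repealed, and by which Act.
3. State the current legal position, noting any regulatory gap.

Question: What is the law regulating groundwater in Tamil Nadu today?"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": cot_prompt}],
)
print(response.choices[0].message.content)
```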

Figure 3 – Answer Provided by GPT-4o with Chain of Thought Prompting
In Figure 3, GPT-4o accurately answers the question, correctly identifying that the 2003 Act has been repealed and that Tamil Nadu lacks comprehensive state-wide legislation regulating groundwater. By providing a multi-level reasoning process to GPT-4o via CoT, we receive output with increased accuracy and explainability. Of course, a single test in a specific domain does not conclusively demonstrate that this approach is universally useful; whether CoT provides a means to scale up RaC by using LLMs as a support tool requires rigorous testing across a range of problems and legal systems.
Conclusion
The concept of ‘Rules as Code’ presents us with an opportunity to solve the problem of the ‘open texture’ of law. By encoding the law, we might be able to bring greater certainty and efficiency to our legal system. However, a major problem in encoding our legal system is volume: the laws of the 21st century are too substantial and complex to encode manually without the allocation of significant resources. In this context, this article presents one potential solution to this problem – the use of LLMs to scale up the development of RaC systems. It explores the feasibility of adopting such an approach and the challenges surrounding it. In conclusion, it suggests CoT as one potential solution to these challenges, although rigorous testing across a range of problems and legal systems will be needed before it can be relied upon at scale.
*Keshav Soni is a law student at the National Law School of India University (NLSIU), Bengaluru. He is interested in tech law, constitutional law, and criminal law.
Dr Rónán Kennedy is an Associate Professor in the School of Law, University of Galway. He has written on environmental law, information technology law, and other topics, and co-authored two textbooks. He spent much of the 1990s working in the IT industry. He was Executive Legal Officer to the Chief Justice of Ireland, Mr Justice Ronan Keane, from 2000 to 2004. In 2020, he was a Science Foundation Ireland Public Service Fellow in the Oireachtas Library and Research Service, writing a report on ‘Algorithms, Big Data and Artificial Intelligence in the Irish Legal Services Market’. In January 2025, he was appointed to the Judicial Appointments Commission.