Is AI actually useful in legal practice, or just a hotbed of errors?
Reconciling a pair of recent empirical studies
One of my favorite anecdotes about AI and legal practice comes from a talk I gave last spring to a group of lawyers. In the Q&A after the talk, one law firm partner raised his hand and told me about his week. Early in the week, his firm had received a notice from its malpractice insurance provider announcing that the firm’s policy would not cover any task on which generative AI was used. Fair enough: with legal news coverage replete with stories of fake cases and other hallucinations, it seems reasonable that an insurer might want to be cautious about the technology. But then, later in the week, the firm got a letter from one of its biggest clients, a large technology company. The client said that it would no longer pay for associates’ time on routine tasks that could be handled by generative AI; in other words, it was asking the firm to start using generative AI instead of associates on a number of tasks, to save money.
That firm’s story captures the split reactions to generative AI that have been common in legal circles in recent years: some people love it, some people hate it, and many people (like the partners at that firm) are just trying to figure out what to do about it. Much of the conversation around AI is still future-oriented, with different flavors of hype, skepticism, utopianism, and pessimism all focused on what to make of a set of trend lines. But many people and institutions also have to make decisions today about how to deal with the technology as it currently exists.
The dichotomous nature of reactions to AI was on my mind recently as I read two empirical studies published in the area of AI and legal practice. As I’ve mentioned, this is an area where I think we could use much more in the way of systematic empirical study, so I really enjoyed both of these papers. Interestingly, they point in somewhat different directions. The papers are nuanced and not written as advocacy pieces, but I think it’s fair to say that one comes across as more positive about the use of AI in legal practice and the other as more negative. Yet they both made a lot of sense to me, and in some ways were unsurprising. So I thought I would read the two papers in light of each other and try to reconcile the worlds of generative AI that they describe.
The first study is by Daniel Schwarcz et al. on “AI-Powered Lawyering.” This team of researchers has conducted one of the very few real, empirical, comparative assessments of legal work done with and without AI assistance. They put together a randomized controlled trial in which law students were assigned six different legal tasks (writing a memo, drafting a motion, etc.). The students were randomly sorted into groups that completed the tasks either with AI or without it. The work product was then graded by evaluators who were blinded to whether AI had been used.
They found that for five of the six tasks, using AI resulted in large improvements in the speed with which the students completed the assignments. And in terms of quality, using an AI tool either produced measurably higher-quality work or made no meaningful difference. They tested two AI tools: the legal-industry-specific Vincent AI and the generalist reasoning model o1-preview from OpenAI. They found that o1-preview hallucinated more cases than Vincent, but that the overall rate of hallucinations was relatively low, with 18 hallucinations occurring across the 768 completed tasks. So this paper is, I think, relatively bullish in its implications for using AI in legal practice: the results suggest that, at least in the student context (more on that in a bit), AI saves time and either improves quality or doesn’t impose a quality cost.
The second study is by Lisa Larrimore Ouellette et al., entitled “Can AI Hold Office Hours?” This study tested the ability of mainstream AI tools to answer legal questions with reference to a specific legal text. Notably, grounding generative AI answers in specific reference texts is one approach taken to reduce hallucinations. But as the authors note, the study’s “results were not encouraging.” They uploaded a patent law casebook to each of three generative AI tools (GPT-4o, Claude 3.5 Sonnet, and NotebookLM), then asked the tools 185 patent-law questions related to the casebook. The questions were derived from real student questions the authors had received, questions directly posed in the casebook, and questions that the authors themselves wrote based on their experience as teachers. They then graded the answers systematically, noting when the responses contained complete and accurate answers, answers with the right outcome but misinformation, and so on.
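For readers curious what “grounding” looks like mechanically, the basic idea is to hand the model the reference text (or relevant excerpts of it) along with each question, rather than relying only on whatever the model absorbed in training. The sketch below is purely illustrative and is not the authors’ setup: it assumes the OpenAI Python client, a hypothetical plain-text excerpt file, and an example question of my own, whereas the study uploaded the full casebook to each tool.

```python
# Illustrative sketch of grounding an answer in a reference text: include the
# source material in the prompt and instruct the model to rely on it.
# Assumes the OpenAI Python client (pip install openai), an API key in the
# environment, and a hypothetical file "casebook_excerpt.txt"; this does not
# reproduce the study's actual setup.
from openai import OpenAI

client = OpenAI()

with open("casebook_excerpt.txt") as f:
    excerpt = f.read()

question = "Does an inventor's own public use more than a year before filing bar patentability?"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "Answer the question using only the casebook excerpt provided. "
                "If the excerpt does not address the question, say so instead of guessing."
            ),
        },
        {
            "role": "user",
            "content": f"Casebook excerpt:\n{excerpt}\n\nQuestion: {question}",
        },
    ],
)

print(response.choices[0].message.content)
```

Even with this kind of grounding, the model can still miss or misstate what the reference text says, which is exactly the failure mode the study measures.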
Their results show significant failings across the board. The models gave “good” answers, meaning complete and accurate ones, in only around half of cases or fewer (56% for Claude, 49% for GPT-4o, and 37% for NotebookLM). And there were high rates of “unacceptable” answers: 14% for Claude, 26% for GPT-4o, and 31% for NotebookLM. These “unacceptable” answers contained significant errors (the authors also had a grade of “acceptable” for answers that were not as complete as the “good” answers and could include small errors). The authors are focused on AI tools’ usefulness for teaching and learning, and they conclude (very reasonably, to me) that these high rates of errors “fall short of the standards required for reliable use in legal education.” I think these results also counsel extreme hesitation about using these tools this way in legal practice.
Before getting to the question of how to square these two papers, I want to note their limitations. These papers are very useful: they are detailed, methodical assessments of an important new technology in ways that plausibly map onto real-world use cases. They were clearly time- and resource-intensive to produce, and they are real public goods. They are worth a look for anyone trying to come to an informed opinion in this area.
But as with any empirical study, there are limits on how broadly you can generalize from these papers to other contexts. Two limitations in particular come to mind. First, as to the Schwarcz et al. study, law students are not trained lawyers. More experienced lawyers already have a variety of ways to save time on tasks, such as copying from existing work. And, indeed, the study included one assignment in which participants were given an example template for the task, and for that task AI tools did not lead to an improvement in quality or speed. This provides at least some reason to think that where experienced lawyers can draw on past work to complete new tasks, which is common, the returns to AI use may be lower. The paper suggests that this will be more true in transactional work than in litigation. Maybe so, but in my experience litigation also often involves drawing heavily on previous work to accomplish a new task.
Second, as to the Ouellette et al. study, there are limits on generalizing from the authors’ approach to professionally developed, specialized AI tools designed for legal research. What the authors did (uploading a document with legal information to a general AI tool and asking questions about it) is certainly something that students and lawyers are currently doing, and need to be careful about. But it doesn’t represent the current frontier of AI capability for legal research. Specialized tools can be trained differently, can have built-in checks and redundancies, and are optimized around different tradeoffs than general retail AI. I would be curious to see, for instance, what the results would be if the authors asked their questions of Vincent AI. I would expect some bad answers, but I would also guess that the rate of problems would be much lower than what the authors found with their existing method.
Even with these caveats, though, it still seems like the papers are pointing in pretty different directions. On the one hand, we have evidence that AI saves time and increases the quality of lawyers’ (or at least law students’) work product. On the other hand, we have evidence that AI is pretty bad at answering legal questions, regularly getting things wrong or omitting important information. So how do we reconcile them?
The answer, I think, lies in the difference between using AI tools to manage text and using them to answer questions. By “managing text,” I mean the creation and manipulation of words at an author’s direction, where the author already roughly knows what the general content of the words should be. By “answering questions,” in contrast, I mean the production of information (whether accurate or not) to fill a gap in an author’s knowledge.
This is obviously an imprecise dichotomy. After all, in a written context, questions are answered using text—it’s not like there’s a clean divide between the work a lawyer does writing and reading text and the work a lawyer does answering questions.
And yet, in conversations I’ve had with practicing lawyers who are using AI, this distinction often feels intuitively workable. It’s often the case, for instance, that a partner may tell an associate something like “for this part of the brief, the main arguments should be X, Y, and Z.” And the associate goes off and writes the X, Y, and Z arguments, pinning down the details, filling in the gaps, organizing the paragraphs, and so on. In communicating to the associate, the partner might have used 500 words; the associate’s final work product might be 5,000 words. There is a way in which the associate is not really breaking new ground, but is instead implementing the partner’s ideas by making them more formally presentable, coherent, and organized. It doesn’t always work this way; sometimes the act of writing generates new questions that have to be answered. But the production of legal documents often involves taking knowledge that is in someone’s mind, or in existing documents, and moving it into the appropriate documentary form for a new context.
The point is that generative AI tools can be very good at that kind of text management even while they are bad at directly and accurately answering legal questions. Some of what it takes to write a legal memo is figuring out the answers to the legal questions involved. But another big part of the task is producing and organizing the words to convey those answers. And tools that are good at the production and organization of words can help with that latter task even where they are bad at the former.
The tricky part, of course, is not letting errors and problems from inaccurate question-answering get in the way of productive text management. The difficulty of discovering and removing such errors is real, and it is an important part of the widespread concern and hesitation around using generative AI tools in law.
But the difficulty of managing errors is not equally distributed across all legal tasks. A couple of areas where it might be easier: (1) where the author of the document already knows the relevant law and facts, and so can check AI-generated text easily through the act of reading and revising; and (2) where the legal questions involved are relatively straightforward and can be answered quickly by authoritative sources, so the author can check the accuracy of AI-generated text through traditional legal research. (My guess, though it is only speculation, is that many of the assignments in the Schwarcz et al. paper fell into category (2).)
In contrast, there are areas where errors are likely harder to manage, especially where the relevant errors are errors of omission in an area the author is unfamiliar with and so is unlikely to spot or understand what is missing. In those contexts, just checking what shows up in the text isn’t enough, because the problem is what is not showing up. The mere existence of these areas, and the fact that it might not always be clear to someone that they are working in this kind of area, should make lawyers very cautious about overreliance on AI tools.
All of this helps explain why it is unsurprising to see studies suggesting both that generative AI tools often get legal questions wrong and that they improve the speed and quality with which people can draft legal documents. The errors that generative AI tools make are hugely important in the legal context, and they understandably get a lot of attention. But that attention can make it easier to miss the other side of the productivity calculus for lawyers, in which these tools can be useful despite their serious (and hazardous) limitations. So despite the studies’ somewhat different implications, both strike me as consistent and realistic demonstrations of important features, positive and negative, of the generative AI landscape for lawyers.