Is AI actually useful in legal practice, or just a hotbed of errors?
One of my favorite anecdotes about AI and legal practice comes from a talk I gave last spring to a group of lawyers. In the Q&A after the talk, one law firm partner raised his hand and told me about his week. Early in the week, his firm had received a notice from its malpractice insurance provider: the policy would not cover any task on which generative AI was used. Fair enough. With legal news coverage replete with stories of fake cases and other hallucinations, it seems reasonable that an insurer might want to be cautious about the technology. But later in the week, his firm got a letter from one of its biggest clients, a large technology company. The client said it would no longer pay for associates’ time on routine tasks that could be handled by generative AI. In other words, it was asking the firm to start using generative AI instead of associates on a range of tasks, to save money.
That firm’s story captures the split reactions to generative AI that have been common in legal circles in recent years—some people love it, some people hate it, and many people (like the partners at that law firm) are just trying to figure out what to do about it. Lots of the conversation around AI is still future-oriented, with many different flavors of hype, skepticism, utopianism, and pessimism all focused on what to make of a bunch of trend lines. But there are also a lot of decisions that have to be made by many people and institutions today, about how to deal with the technology as it currently exists.
The dichotomous nature of reactions to AI was on my mind as I read two recent empirical studies in the area of AI and legal practice. As I’ve mentioned, this is an area where I think we could use much more systematic empirical study, so I really enjoyed both of these papers. Interestingly, they point in somewhat different directions. The papers are nuanced and not written as advocacy pieces, but I think it’s fair to say that one comes across as more positive about the use of AI in legal practice, and the other as more negative. Yet both made a lot of sense to me, and in some ways were unsurprising. So I thought I would read the two papers in light of each other and try to reconcile the worlds of generative AI that they describe.

The first study is by Daniel Schwarcz et al. on “AI-Powered Lawyering.” This team of researchers has conducted one of the very few real, empirical, comparative assessments of legal work done with AI assistance versus without it. They put together a randomized controlled trial in which law students were assigned six different legal tasks (writing a memo, drafting a motion, etc.). The students were randomly sorted into groups that completed the tasks either with or without AI, and the work product was then graded by graders who were blinded to whether AI had been used.
They found that for five of the six tasks, using AI produced large improvements in the speed with which the students completed the assignment. In terms of quality, using an AI tool either produced measurably higher-quality work or made no meaningful difference. They tested two AI tools: the legal-industry-specific Vincent AI and the generalist reasoning model o1-preview from OpenAI. They found that o1-preview hallucinated more cases than Vincent, but that the overall rate of hallucinations was relatively low, with 18 hallucinations occurring across the 768 completed tasks. So this paper is, I think, relatively bullish in its implications for using AI in legal practice: the results suggest that, at least in the student context (more on that in a bit), AI saves time and either improves quality or doesn’t impose a quality cost.
The second study, by Lisa Larrimore Ouellette et al., is entitled “Can AI Hold Office Hours?” It tested the ability of mainstream AI tools to answer legal questions with reference to a specific legal text. Notably, grounding generative AI answers in specific reference texts is one approach taken to reduce hallucinations. But as the authors note, this study’s “results were not encouraging.” They uploaded a patent law casebook to each of three generative AI tools (GPT-4o, Claude 3.5 Sonnet, and NotebookLM), then asked the tools 185 patent law questions relating to the casebook. The questions were drawn from real student questions the authors had received, questions directly posed in the casebook, and questions the authors themselves wrote based on their experience as teachers. They then graded the answers systematically, noting when the responses contained complete and accurate answers, answers with the right outcome but some misinformation, and so on.
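To make the “grounding” idea concrete, here is a minimal sketch of what it can look like in code. This is not the authors’ actual setup (they uploaded the full casebook through each tool’s own interface); it simply places a reference text in the prompt and instructs the model to answer only from that text, using the OpenAI chat API. The file name, model choice, and sample question are all illustrative assumptions, not details from the study.

```python
# Minimal sketch of grounding an answer in a reference text (illustrative, not the study's pipeline).
# Assumes: a local file "casebook_excerpt.txt" and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

# Load the reference text that the answer should be grounded in.
with open("casebook_excerpt.txt", "r", encoding="utf-8") as f:
    casebook_text = f.read()

# An illustrative question of the kind a student might ask.
question = "What is the on-sale bar, and when does it apply?"

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[
        {
            "role": "system",
            "content": (
                "Answer the user's question using ONLY the reference text below. "
                "If the text does not answer the question, say so instead of guessing.\n\n"
                "REFERENCE TEXT:\n" + casebook_text
            ),
        },
        {"role": "user", "content": question},
    ],
)

print(response.choices[0].message.content)
```

The point of the instruction to answer only from the supplied text is to discourage the model from drawing on (and potentially hallucinating) material outside the casebook, which is the same basic motivation the authors describe.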
Their results show significant failings across the board. The models gave “good” answers, complete and accurate, in at best just over half of cases (56% for Claude, 49% for GPT-4o, and 37% for NotebookLM). And there were high rates of “unacceptable” answers: 14% (Claude), 26% (GPT-4o), and 31% (NotebookLM). “Unacceptable” answers contained significant errors; the grading scheme also included an intermediate “acceptable” grade for answers that were less complete than “good” ones or contained small errors. The authors focus on AI tools’ usefulness for teaching and learning, and they conclude (very reasonably, to me) that these error rates “fall short of the standards required for reliable use in legal education.” I think the results also counsel extreme hesitation about using these tools this way in legal practice.