The low visibility of lawyers’ responsible AI use

A minor debate on X last week about lawyers' AI use led me to think I should write up a post on a topic I've been chewing on for a while: the problem of low visibility into lawyers’ responsible AI use. Without rehashing the debate, the starting point for me is this: many people are overgeneralizing from information about hallucinated cases (such as Damien Charlotin's useful database). This is not just a phenomenon happening on X. Although my research focus is on AI as an object of regulation and litigation, in my job as a law professor I also end up having lots of conversations with people about how lawyers are using AI as a tool for work. And hallucinated cases seem to be the thing that people focus on in many conversations about whether and how AI can responsibly be used in the law.

It makes sense that those kinds of cases are at the front of people's minds; they make frequent headlines and are great fodder for social media. And hallucinations are a real issue the profession has to deal with. Even federal judges have issued court orders containing fake information! So the topic is important and real.

But we are missing visibility into another phenomenon that I think is more important: the regular use of AI by lawyers in responsible ways.1 When it comes to responsible AI use, the problem is this: we have reason to think there is a lot of AI use happening that is not clearly irresponsible. But there is not a lot of public visibility into the details: specific workflows, attempted safeguards and their efficacy, communications with clients, and so on. And if our collective impression of AI use is shaped heavily by misuse while responsible use remains only dimly visible, that is a bad portent for regulation.

On thinking a lot of responsible AI use is going on: we know from survey data that AI use among lawyers is very widespread. You get different numbers depending on what you ask, but one good recent survey puts AI use at 55% or more of lawyers in law firms and 60% of lawyers in-house. And just as importantly, that survey measured the intensity of AI use, finding that over half of the lawyers who use AI do so daily or multiple times per day:

AI usage frequency data from the Thomson Reuters 2026 survey

It is impossible to know, but I think the best working hypothesis is that a fair amount (and maybe most) of those uses are responsible uses. We are not seeing half of all litigators getting sanctioned for hallucinations. Clients appear to be on board—that same survey suggests a supermajority of corporate clients want their outside counsel using AI.

But what does that use look like? For the lawyers who are using AI to write briefs, how is that going? What are the hallucination rates like on different models, not on benchmarks but in the actual tumult of day-to-day practice? What are the more and less effective safeguards?

These questions are really important. We are at the start of a collective profession-wide (and society-wide) process of figuring out AI regulation. And that is going to mean developing rules and standards that effectively balance benefits and risks. If regulators, and the profession more broadly, have an outsized sense of AI-related harms and only a dim sense of benefits, that's not a recipe for well-crafted regulations.

But there are at least some incentives to keep the responsible use under wraps. Some incentives come from competitive pressures: If you or your firm has figured out a good recipe here, why broadcast it? Other incentives come from the uncertainty surrounding the norms of professional responsibility: because the ethical parameters around AI use are uncertain, going on the record in detail about AI use is sticking your neck out unnecessarily.

It's not all that hard, in one-on-one conversations, to find talented, responsible lawyers who are using AI tools in rigorous ways, applying quality checks and maintaining high standards. But there is not much public visibility into that category. And public visibility matters, as a necessary precursor to having a public conversation about acceptable uses and regulations.

The other problem is that the people who are loudly broadcasting positive AI messages are often not particularly trustworthy: there's a lot of hype and hucksterism. That creates a particularly noisy information environment, with dubious positive takes on AI in the law, well-reported instances of malfeasance, and a missing middle of accounts of everyday responsible use.

The low-visibility problem in lawyers’ responsible use of AI is in some ways a microcosm of broader AI regulatory discourse. Alongside debates about how to regulate AI, there is a loud and frequent conversation about "what can AI actually do?" And although there is a kind of hype that I think people understand well (beware businessmen's descriptions of their own products), there is also a kind of reflexive cynicism that goes beyond a healthy skepticism into an inattention to real developments. A recent piece by Dan Kagan-Kans highlights how this can end up stifling useful political discussion. A baseline shared understanding of what AI tools can do, and what they are actually doing, is an important part of coming to a consensus (or even just having a healthy debate) about what the best path forward is.

Back to the legal profession in particular: It seems to me like this state of affairs is ripe for some entrepreneurial folks to come along and serve the role of information spreaders. In the private market that will probably take the form of consultants or trainers. There is also a lot more room for journalists or academics to peer into the world of responsible legal AI use. So hopefully this visibility problem is only temporary.


  1. "Responsible" needs a bit of elaboration here. The legal profession is only in the early stages of figuring out how the various self-regulatory rules that lawyers have (such as rules of professional conduct) apply in the many different situations where lawyers are using AI. So what I mean here is not "responsible" in the formal sense of compliant with all rules and norms (as those haven't been clearly established yet), but instead in the more colloquial sense: the use of AI tools in a way that is attentive to potential harms and downsides, involves safeguards, is transparent and doesn't involve deceit, etc. 

Regulating AI Models that Can Learn

I have a post over at Lawfare today discussing continual learning—the goal, shared by many AI developers, of creating models that can learn not just from their initial training, but from their day-to-day use. It turns out that this is a very challenging technical problem, but also one with huge potential upside.

If you haven't thought much about the issue of continual learning, I'd recommend this blog post by Dwarkesh Patel about the bottleneck created by the inability of models to learn more like humans do. I'd also recommend this piece on the related issue of "context rot" by Timothy B. Lee. They provide good context and perspective for why this is such an important issue for AI labs and users.

My writing at Lawfare asks what the development of continual learning might mean from a regulatory perspective. There is lots of conversation about better or worse paradigms for regulating AI labs and/or models, but there has not been much focus yet on whether or how our regulatory paradigms could handle continual learning. Thinking about regulating a technology that doesn't exist is inherently speculative, of course. But given how important a goal continual learning is, and given the potential implications for regulation, I think it's worth spending some time issue spotting.

A key part:

Test-time training could erode the connection between knowledge of models and control over models, by allowing the most capable models to be developed and changed in meaningful, permanent ways by users. Almost all users are going to be much less knowledgeable than AI developers—less able to run tests and studies on their models, less informed about the broader technical landscape, and so on. So test-time training shifts some of the control over whether outcomes are good or bad from centralized, knowledgeable actors (the AI labs) to diffuse, less-skilled actors (users). Because developers will have lost some of their control over their models' capabilities and tendencies, it may be harder to hold them liable for certain outcomes. This shift in control also poses a challenge for one of the paradigms that has received more emerging support recently: regulations that take AI companies themselves as the targets of regulation, rather than focusing on regulating those companies' products directly. As it has become clearer that regulations focused on models themselves have important limitations, thoughtful commentators have advocated for "organization-level" or "entity-based" regulatory approaches that focus on the policies and practices of AI model developers. But if developments in continual learning mean that significant changes to models will happen after they are released to the world, that could undermine the efficacy of regulations that focus on courses of conduct that companies take in earlier stages when the models are first being developed. There are plenty of other reasons to support entity-based approaches to regulation, but continual learning may nonetheless be a challenge for them, should it arise.

If you're interested, head over to Lawfare and check it out!


ChatGPT takes—and grades—my law school exam

Welcome to the latest in my series of informal tests of large language models in the context of legal questions. My first installment, back in March, compared many different models on legal questions of increasing difficulty. Of course, as luck would have it, OpenAI released its o3 model, a significant step forward in capability, right after I posted that—and Google and Anthropic followed shortly thereafter with Gemini 2.5 Pro and Claude Opus 4. So my second installment added those models, and noted that o3 was the first model to get the difficult “dog not barking” Question 5 right. This reflected what, to me, was the crossing of an important threshold: o3 and similar models are now capable enough to be productively used on a variety of legal tasks, even if there are still important limitations.

The world of generative AI continues to develop rapidly, and I’m back with another round of model testing. For this latest round, I wanted to add some new tests, since some models have started getting all of my previous questions right. The natural place to look was giving some of the models an entire law school exam. We know from a couple of sources (1, 2) that LLMs can pass and even score highly on law school exams. And an exam score based on many questions can be a more fine-grained way to compare different models than the simpler rubric I had been using.

The main challenge here is that law school exams often take a long time to grade. My exams, for instance, are a combination of longer essays and shorter essays, in which the score comes in large part from the accurate identification, discussion, and application of the law across many small details. This could make it somewhat impractical to use an exam to regularly test new models, as the grading time would rapidly become too burdensome.

But what if you use an LLM to grade the work of an LLM? I would not want to do that with student work, but as a way to quickly test the exam answers produced by new models it holds some promise. It is obviously much faster. But is it reliable?

Evaluating LLMs as graders

Before deciding to use LLM graders to assess LLM exam answers, I wanted to see whether such an approach would yield informative results. The main goal, of course, is knowing whether LLMs give accurate grades. But if I had the time to grade a bunch of exams myself, that would moot the need for this approach in the first place. So I decided to start with two easier metrics: how much LLMs agree with each other on exam scores (between-grader consistency) and how consistent an LLM’s scores are when it grades the same exam across different sessions (within-grader consistency). Looking at those numbers would at least be one way of detecting certain types of inaccuracy: if LLMs’ assessments of the same exam vary widely between graders or between sessions of the same grader, those would be strong indicators that they aren’t accurately grading the exam.

To start, I gave an old civil procedure exam I had written to five different models: ChatGPT o3; Gemini 2.5 Pro; Opus 4.1 (without extended thinking turned on); Opus 4.1 (with extended thinking turned on); and GPT-5-Pro. I gave them all the same prompt, and copied their answers wholesale into individual Word documents. They were “anonymized” in that neither the document title nor the text in the document contained the name of the model that had written the answer (in case that might somehow bias the grading results).

I then gave each answer doc to each of four models to grade: Opus 4.1 (no extended thinking); Gemini 2.5 Pro; GPT-5-Thinking; and GPT-5-Pro. As a basis for their grading, I also uploaded an “exam memo” that I had written to my students after the exam in question, which went over the questions and answers to the exam in depth. I also included the grading rubric that I had used for grading the exam, which assigned detailed point values for different components of the answers to each question. Each model graded each exam answer in three separate sessions.
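To make that setup concrete, here is a minimal sketch (in Python, with invented placeholder model names and scores) of how session-level grades like these could be rolled up into the kind of averaged answerer-by-grader table that follows. My actual process ran through chat sessions and documents rather than a script, so everything in this snippet is hypothetical illustration, not my real data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw records: (answering_model, grading_model, session, score_pct).
# The numbers are invented placeholders, not my actual results.
records = [
    ("o3", "GPT-5-Pro", 1, 78), ("o3", "GPT-5-Pro", 2, 80), ("o3", "GPT-5-Pro", 3, 79),
    ("o3", "Gemini 2.5 Pro", 1, 82), ("o3", "Gemini 2.5 Pro", 2, 75), ("o3", "Gemini 2.5 Pro", 3, 77),
    ("Opus 4.1", "GPT-5-Pro", 1, 65), ("Opus 4.1", "GPT-5-Pro", 2, 68), ("Opus 4.1", "GPT-5-Pro", 3, 66),
    # ...one entry per (answer, grader, session) combination
]

# Each cell of the table: one grader's scores for one answer, averaged over its sessions.
cell_scores = defaultdict(list)
for answerer, grader, session, score in records:
    cell_scores[(answerer, grader)].append(score)
table = {cell: mean(scores) for cell, scores in cell_scores.items()}

for (answerer, grader), avg in sorted(table.items()):
    print(f"{answerer} graded by {grader}: {avg:.1f}%")

# Bottom row of the table: each grader's average across all the exam answers it graded.
for grader in sorted({g for _, g in table}):
    overall = mean(avg for (_, g), avg in table.items() if g == grader)
    print(f"{grader} average across all answers: {overall:.1f}%")
```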

Here is the table of the results averaged over those sessions. The rows represent the models answering the exams, while the columns are the models grading the exams. The bottom row is the average score of each grader across all exams:

A few things are noticeable quickly. First, Opus 4.1—the sole non-“reasoning” model among the graders—gives scores that are way, way higher than every other grader’s. Second, Gemini and GPT-5-Thinking are often within a couple of points of each other, while GPT-5-Pro is sometimes in the same ballpark, but sometimes comes in a fair bit below. And third, leaving aside the Opus grader, both of the Opus exam answers scored much lower than the answers from the other three models.

Some quick spot checking of the exam answers themselves made it clear that Opus 4.1 is off base—some of these answers are good, but none is getting into the 90%–100% range. In response to my prompts, the graders give long written analyses justifying their scores, rather than just putting out a numeric score, so it was possible to look at Opus’s reasons for its scores. Doing so revealed some major hallucinations; and the persistence of the high numbers suggested that this is a recurring problem. So Opus 4.1 is out as an exam grader.

What about the remaining three LLMs? The table above presents averages across three grading sessions, and gives you a sense of between-grader consistency. But what about within-grader consistency? There are a few ways you could assess that. I looked at the average absolute difference between the three scores that each grader gave each answer. For instance, if Model X graded an answer at 75% the first time, 72% the second time, and 76% the third time, the average difference would be (|75–72| + |76–75| + |76–72|)/3 ≈ 2.67 percentage points. Here are the results:

There is a pretty wide range here. There are a few instances where the average differences are around 10%. There are also some differences that are much lower, sometimes by the same grader. But if what you’re after is consistency, having a model that is sometimes consistent is not that comforting, if you also know that it sometimes has large departures.
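For concreteness, here is a minimal sketch of that within-grader measure—the average absolute difference between each pair of scores a single grader gave the same answer across sessions—reproducing the hypothetical 75/72/76 example from above. The function name is mine, purely for illustration.

```python
from itertools import combinations
from statistics import mean

def within_grader_spread(scores):
    """Average absolute difference between each pair of scores that one grader
    gave to the same exam answer across separate grading sessions."""
    return mean(abs(a - b) for a, b in combinations(scores, 2))

# The worked example from the text: one grader scoring the same answer
# at 75%, 72%, and 76% across three sessions.
print(within_grader_spread([75, 72, 76]))  # about 2.67 percentage points
```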

As an interesting aside, looking for some context for these numbers took me on a brief detour into the world of human grading consistency. And it turns out that human graders have some consistency challenges of their own—such that a 10% deviation might not be terrible, if you compare it to between-grader consistency in scenarios with multiple human graders. This Swiss study, for instance, found that when three human evaluators with law degrees graded a set of legal question/answer pairs, their mean absolute error was 1.95 points out of 10—in other words, just shy of 20% (the study also found that GPT-4o as a grader was no worse than the human judges when it came to between-grader consistency). This report on the GRE found that on essays that are graded on a 1-to-6 scale, the rate of exact agreement between two readers is about 60%, indicating that roughly 40% of essays have a disagreement of at least one point—which would be more than a 10% difference in score on a six-point scale.

But in any event, 10% inconsistency strikes me as higher than I would like. One of the models, fortunately, seemed to be doing better than the others. GPT-5-Pro achieved pretty good self-consistency scores, averaging differences in the 3% range.

Consistency, of course, is not accuracy. So I ran a series of spot checks comparing its grades to my own judgment across a number of different questions and exam answers. Its analysis tracked my own closely, although not perfectly—there were occasional moments where it awarded a slightly higher or slightly lower grade than I would have. But these were few in number and low in magnitude—it never departed dramatically from what I would have done, and it usually didn’t depart at all.

This is, again, a very informal set of tests—I wouldn’t take this as justification to use GPT-5-Pro to grade anything where there are meaningful consequences (such as actual work in a real school context). But for purposes of my informal model testing, it strikes me as good enough: a model that tends to hew pretty closely to my own assessments (based on detailed rubrics that I wrote and uploaded), and is mostly self-consistent over time, with only small departures when there are differences. So I used its scores (averaged across three readings) as the basis to grade the performance of several models on the exam overall.

And here are the results, added to an updated version of my original set of five questions across many models:

This is starting to get cramped—on the next iteration, I’ll leave off a lot of the older models, just to make things easier to read. But for now, a few things to note:

  • On the first five questions, GPT-5 and GPT-5-Thinking did not outperform o3. This is probably not that surprising. GPT-5 itself functionally acts as a router between different models, and may have used a model to approach these questions that was less powerful than o3. GPT-5-Thinking, meanwhile, is (in my experience) roughly comparable to o3, and did get the hardest question (Question 5) correct—one of the only models to do so.
  • In terms of exam performance, the results fall into roughly two groups. GPT-5-Pro, GPT-5-Thinking, o3, and Gemini 2.5 Pro all got pretty good results—roughly in the A- to A range, depending on the curve. The Opus results were much worse, though, including with Opus 4.1 running its extended thinking mode. This surprised me.
  • For a few products (Gemini 2.5 Pro, Westlaw, and Lexis), I ran updated tests. These companies seem to be less interested in launching new models with new names every few months, but my sense is that they may still be doing some modifications on the back end. But, in any event, the results did not indicate improved outcomes for any of them.
  • Westlaw’s and Lexis’s dedicated legal AI products continue to perform worse than the best generalist AI models. I wonder if, and when, that will change.

Now that I’ve got my automated exam-grading system up and running, it will be easy to test new models that come out, and I’ll provide further updates here as events warrant. It has been a few months since there have been big updates from the major model developers other than OpenAI, so that may be sooner rather than later. In the meantime, I’d welcome any thoughts or suggestions—I’ve enjoyed hearing from readers of these posts over the last few months.


Some ideas for judges, lawyers, and legal academics on trying generative AI

When it comes to AI I normally focus on law and policy issues, but a few events in the last few weeks have made me want to depart from that beat a little. So this post will have more of a “how to” / “what to try” flavor to it, distilling and elaborating on some advice I’ve been asked for a few times recently. Feel free to skip down to those sections below.

What prompted this post

What made me want to write this? First, I was on a panel at a recent training for a court’s judges, to talk about AI and law schools. I heard at the training that some informal polling indicated that a majority of the court’s judges had not yet tried ChatGPT or another generative AI tool, a proportion that tracked my impression from a show of hands when I asked the room of around 100 people. And second, some conversations with a few of my law professor colleagues led me to the (extremely unscientific) impression that among the law professors of the world, it is also plausible that a majority have not yet tried generative AI, or at least not tried it for more than a few minutes.

Those numbers are perhaps unsurprising. Most people (maybe by definition) are not early adopters of new technologies, and although generative AI has made lots of inroads into the law, it is very far from normalized across the board. And at least in some circles there is a lot of negative sentiment about generative AI, which I would guess tends to cause some people not to try it when they otherwise might have. Some people may also have an allergic reaction to the positive conversations around the technology, which can be exhausting and sometimes strain credulity.

But in the two industries that I sit at the nexus of—law and academia—I think it’s time for people to spend some time with AI tools if they haven’t yet done so. To begin with, the legal profession is trending toward ubiquitous use of AI. But even if you don’t expect to use it yourself in any regular way, circumstances are arising that will call more and more for people who work in the law to have informed opinions about generative AI, regardless of their personal usage.

In particular: schools are facing important decisions about what kind of uses to permit (or to teach), and how to police any rules they put in place—see, e.g., the latest viral article about AI cheating in higher education. Courts are already having to make policies about its use by litigators in the courtroom, and in their role as regulators of the bar they will increasingly need to make determinations about many different issues of professional ethics. And lawyers are facing inquiries and demands from clients about the possibilities of using generative AI to get the job done faster, cheaper, or better.

Moving beyond news stories about hallucinations

To me, all of this means that anyone who works in the legal field and hasn't personally tried out a generative AI tool should seriously consider doing so for at least a few hours. My working hypothesis is that when it comes to generative AI tools, the best way to become more informed is with a combination of personal experience and reading systematic empirical studies (of the kind I discuss here).

In contrast, I think many people's information comes from a combination of word of mouth and news stories. Those kinds of sources can be very useful, but they have major limitations. In particular, without the kind of granularity that personal experience brings, generative AI can remain a kind of abstraction or boogeyman, filtered only through stories of things gone wrong. Irresponsible uses of AI make headlines, but the many mundane, responsible uses of it from day to day often don’t get publicized in much detail anywhere. My impression is that many people have the idea that the only thing lawyers would use ChatGPT for is to answer legal questions, but they know it hallucinates answers, so they don’t really see the point. And secondhand stories and news articles also almost never involve systematic testing, making it very easy to get overgeneralized impressions about these tools.

Personal experience has limitations, too! In particular, these tools can give the impression that they can do more than they actually can. But my sense is that there are many people in the law who have read about AI, are aware of its flaws, but haven’t tried it out much themselves. And this post is for that category of people.

Some thoughts on getting started

At this point, I’ve had a number of conversations with judges and law professors who have asked for advice on getting started trying out ChatGPT (or another large language model), and I thought I’d write up some of my thoughts to make them easier to share. Here, I’ll offer some preliminary, general thoughts and tips, then segue into some specific ideas for how to try out large language models. I’ll close by offering some more thoughts, caveats, and responses to concerns that I’ve heard.

My goal is not to convince you that AI can or should do all of the things I’m suggesting exploring here. It’s to help you get a sense of its strengths and weaknesses. So some of these ideas are ones where I actually don’t think AI always does a great job, but where that should (hopefully) be easy to see for yourself.

Preliminary tips:

  • I strongly encourage you to pay $20 to get the best versions of AI tools that are available, rather than using a free option. If you want to find out what these tools are capable of, it will be more informative to use the most capable current models. There is a very big gap between the free version of ChatGPT and what you can access via the paid version. Opinions differ on which models are best at any given time, but based on my experience (which includes running many legal questions through more than a dozen models, including full law school exams in multiple subject areas through some models), I would recommend getting an account with OpenAI and focusing on its o3 and 4.5 models. This post is written as if o3 and 4.5 are the models you are using.
  • I would recommend o3 for anything analytical like answering legal questions, and also for web search; and 4.5 for more language-oriented tasks like editing or drafting. For the tips below I will put “4.5” or “o3” in parentheses at the end to suggest which model I recommend using. This is more art than science, and your mileage may vary.
  • Be careful about overgeneralizing from a few examples of anything that you try. This is for two reasons. First, AI tools are probabilistic, and you’ll get different results from the same or similar prompts. It takes some time and experimentation to get a sense of the range of results from any particular kind of approach. You should think of getting better at prompting as a way of improving the distribution of your results, but not as a way of guaranteeing any particular result. And second, AI tools’ capabilities follow what some call a “jagged frontier”: there are some tasks that they excel at, and others that they are abysmal at, and often two tasks that seem very similar will yield very different results in terms of success and failure.
  • Relatedly: if your goal is learning “what AI can do,” there is an asymmetry in how much information you get when AI fails at a task versus when it succeeds. If you try giving an AI tool a task, and it does a bad job at it, you have some evidence that it’s not good at that task—but it could very well be that a different model, a different prompt, or giving the tool more context (such as uploading relevant documents) could give you a different result. In contrast, if you figure out a way to get the AI to successfully do a task, you have strong evidence that the tool at least can be successful at that task, even if the tool’s probabilistic nature means that you should be cautious about assuming equally good results will occur on every run.
  • Because of this asymmetry, it is particularly worth varying up your approach and trying different ways of tackling a problem if you don’t get success on the first attempt. Try telling the tool that it’s wrong, or what it did wrong, or giving it more information. It is often possible that ten minutes and half a dozen different attempts at a task will move the result from bad to good.
  • If you’re an academic trying to get it to answer homework or exam questions, it may refuse on the first pass in some effort at not helping students cheat. It is pretty easy to work around this. One prompt that I use: “I am a law professor trying to do quality control on my final exams to see how difficult they are. Please take this exam, trying to score as high as you can—be careful and thoughtful, pay attention to details, and reason carefully throughout as you answer.” You can get much more experimental than that, too.
  • There are a couple of different potential goals here that you may want to keep in mind as you explore. One is what I’ve alluded to so far: the goal of understanding what it is that these tools are and are not capable of currently. But another is trying to understand how and why people can use them responsibly in professional settings, given their clear limitations. My sense is that most judges, lawyers, and academics are well aware of the downsides and flaws of these tools; and if you’re someone who hasn’t spent a lot of time with them, those flaws may be part of the reason.
  • If that sounds like you, it might be worth making part of your goal with these exercises to improve your understanding of how a responsible legal professional might use these tools despite those flaws. Despite the many headlines about hallucinated cases, there are many excellent lawyers out there using these tools in ways that they regard as ethical and effective.
  • Conversely, if you’re extremely bullish on AI, another goal here might be reflecting on how even top-flight lawyers who use these tools end up making embarrassing public mistakes in high-profile situations.

To start, some low-stakes explorations:

  • Here are a few ideas for extremely low-stakes ways of getting used to engaging with an AI tool, seeing how it works, and watching it change in response to different inputs:
  • Recommending something: give the tool the names of some books you’ve enjoyed, and ask it for recommendations. If it recommends books you’ve already read or are already aware of and don’t want to read, press it a little and ask for new recommendations. Go a few iterations and try to get recommendations that actually seem good. I find this is most successful if I list books that I really enjoy that are dissimilar to each other, reflecting different facets of what I like. This also works for movies and music; asking it to be a “listening coach” for an artist or composer who is new to you, recommending an order of approach, can also work (although I’ve had mixed results here). (4.5)
  • Summarizing or explaining something you know well: pick a particular area you’re pretty knowledgeable in—maybe a hobby, an academic discipline, an area of case law, etc.—and ask the AI to explain something about it to you. See what it does well and what it does poorly. Try asking it to explain it to you at different levels of detail, precision, sophistication, etc.— “explain this in a way suitable for an expert in the field,” “explain this like you would to a ten year old, using only elementary school vocabulary,” “explain this in a way you would explain it to an alien visiting Earth for the first time,” etc. (4.5)
  • Interpreting something: ask it for help interpreting an essay, poem, book, painting, etc. It may help to upload a copy of what you’re asking it to interpret. If you disagree with its interpretation, challenge it or offer an alternative explanation. I’m not suggesting this exercise because I think these tools are brilliant at this; instead, it’s an easy way to interact with the tool for a few rounds. In my experience, it will often start in a highly generic place but will typically be able to offer something more interesting when prodded, and at times can be genuinely useful or insightful. (4.5)

More substantive steps:

  • Search and research: This is, for me at least, the clearest way to get real value out of the current frontier of AI tools. Use o3 to help you search for something that you genuinely would spend time looking for, and evaluate the results. It will give you links; sometimes it will misinterpret them or hallucinate things. But that’s not a huge problem, because typically when you search for things you are reading the results yourself anyway—when you search on Google for a web site or Westlaw for a case, you then go to the web site or the case. The value add of the AI tool is in the ability to find specific results with natural language prompting. (o3)
  • You can ask it to find something like, for instance, “some legal cases in which the defendant built property on the plaintiff’s land without authorization but the court refused to support the plaintiff’s demand to destroy the building.” Here are the results, some of which are very good! Try doing your own search on something you are genuinely interested in; and be sure to check the results well to get a sense of the strengths and weaknesses of this approach. (I also would not recommend using these tools as substitutes for traditional legal or academic research; this is just to give you a sense of ways in which they might be plausibly useful supplements.) This can be completely non-legal as well; lots of people find these tools useful to search for and compare different products in a category they are thinking of making a purchase in.
  • You can also direct it to use “deep research,” which is a tool that will cause the AI to take several minutes and write an in-depth report. I would not trust this report to be entirely accurate or comprehensive. But it can still be useful, again if your goal is primarily to get one or more examples of something (and it’s less necessary to, e.g., be sure that you’ve exhausted every possible source). At this stage, o3’s web searching and subsequent writeup is thorough enough that I often prefer to just use an o3 web search over specifically using the deep research tool.
  • Summary and critique: Give it something that you have written—a brief, an opinion, an article—and ask it to summarize and critique it. See how it does. Ask it for five places to improve the writing stylistically. Ask it for five places to improve the argument substantively. Try prompting it to do the critique for an expert audience, for students, etc.; or tell it to be more critical (or less critical) and see how it varies the output. If you’re a litigator, give it an opposing side’s brief and ask it to summarize and critique it. How’d it do? For an academic paper, try the prompt “you are an expert in the field, and an excellent peer reviewer. Critique this paper.” (o3, but 4.5 might be interesting too)
  • (See the caveat / caution below about giving AI tools non-public documents.)
  • Try doing just a summary, and not necessarily a critique, of something that you’ve written or something that you’ve read and know well. How is the summary? Try asking it to summarize the document in a more specific way—like, “summarize this document and be sure to note each time X comes up,” or other more detailed instructions. (4.5)
  • Try getting it to do a summary, or construct a narrative, about something that is different than a single article or court opinion. For instance, try uploading a long trial court docket sheet and just ask “what happened in this case?” How does it do? Or try uploading a bunch of documents by the same author and ask it to pick out themes, or chart the author’s development, or other tasks like that. (4.5)
  • Question and answer: Within your area of expertise, see if you can find its skill level. Give it some very easy questions and see how it does. Give it some hard questions and see how it does. Try to find levels of difficulty where it is mostly getting things right, and levels where it is mostly getting things wrong. (o3)
  • In addition to varying the difficulty, try varying the type of question — e.g., even within the same subject matter, there will be different kinds of questions that you can ask that will yield different types of responses. For instance, questions like “what case held X” will be more likely to get hallucinated case names; that’s a particular, well-known failure mode. Try other kinds of questions, too, to see if you learn about typical ways these tools fall short or ways they succeed.
  • Drafting and editing: Try using it to assist you in writing something at a professional level—not necessarily something you intend to use professionally, but something of the kind of detail and level of quality that you would want out of a professional document. You could approach this in many different ways, and which one is best will probably depend on circumstance. A few ideas:
  • Making a close copy: do you have a document that you often have to draft different versions of? Try uploading an old version and asking it to draft a new version, giving it the relevant changes in a few bullet points. See how it does. (4.5)
  • Critiquing or editing an existing draft: try uploading a mostly complete draft of something, and ask it for comments, edits, revisions, etc. You can try this along many different dimensions—with broad instructions to focus on style, or on substance, or narrower instructions to focus on readability or organization. Ask it how someone who disagreed with what you are saying would critique it, and ask how that person might be better persuaded. (4.5 or o3)
  • Generating a draft: try asking it to generate its own draft of something. Upload a brief and ask it to draft a response from the other side. Upload a complete set of briefs and ask it to draft an opinion. Then ask it to draft an opinion coming out the other way. If you’re a teacher and do simulations, try asking it to draft a document for the simulation (like an affidavit or a deposition transcript). (o3 or 4.5)

Testing generative AI on legal questions—May 2025 update

Two months doesn’t seem like such a long time, but when it comes to AI model releases a lot can happen. Back in March, I wrote about some informal testing that I have been doing of large language models on legal questions for the last couple of years. Since then, OpenAI, Google, and Anthropic have all released major new updates to their models.

This will be a short post. I won’t be writing a new post every time there is a new round of updates, but I did want to revisit this because there has been a meaningful new development on my testing in particular: there is now a model—OpenAI’s ChatGPT o3—that aces all of the questions that I give it. No model had done that before. In particular, no model had gotten what I labeled “Question Five” correct. As I described:

Question Five was an issue of appellate jurisdiction that was more “hidden” in the issue spotter—there was no direct text asking the reader specifically about appellate jurisdiction, although there was a general call to address any jurisdictional issues. Every model missed this. And most of my students missed this one too! But, as I tell my students, the world will not always give legal issues to you in a nicely identified and labeled form, so identifying an issue that is present without being told that it is present is an important legal skill. It seems like it’s a skill that the LLMs have yet to master, at least when it comes to appellate jurisdiction.

Well, it turns out “have yet to master” was only true for a few more days. Here is what my chart looks like now:

That completely green row in the middle is ChatGPT o3. Many have been impressed with o3 on a number of dimensions, both quantitative and qualitative. And in my personal experience, o3 does seem like a major step forward in analytical capacity, in addition to being much better at web search. It’s now the model that I use most often. Interestingly, as the chart reflects, Anthropic’s Opus 4 and Google’s Gemini 2.5 Pro still don’t “see” the issue for Question Five but do better than most of their predecessors on Question Four.

When it comes to testing models, it is probably time for me to go back to the drawing board and try to find some more questions that no current model gets right, to avoid “ceiling effects” that would make this kind of testing less informative about models’ relative capacities. In terms of comparing models to human students, though, it is worth emphasizing that Question Five was hard—only 5 of my 58 students got it right.

One methodological virtue of the questions that I have been testing is that they had pretty clear right and wrong answers, which made testing them easier. I am more hesitant to use questions whose answers require more judgment to grade, even though my exams have plenty of those questions as well (as is often the case with law school exams). I worry that I will be biased when evaluating AI capacity on those questions because I know the answers are written by AI. (That’s one reason that I particularly like studies of lawyers or law students using AI that use blind grading to evaluate outputs, e.g. the first study I discuss here.) But the set of questions that (a) have clear right and wrong answers and (b) AI tools continue to get wrong seems like it may just be getting smaller over time.


When computers generate legal text, that matters for everyone—not just lawyers

As soon as it started seeming possible that generative AI would reach levels of quality where it would become viable to use commercially, people started wondering about the legal industry. One particularly significant early prediction was a report by Goldman Sachs in March 2023 estimating that 44% of tasks in the legal profession were “exposed to automation.” And you can understand why: a very big part of what lawyers do happens through text. Laws and regulations are text; the briefs that argue for particular interpretations of those laws are text; the opinions resolving legal disputes are often text; government agencies’ decisions and pronouncements are typically text; the contracts that memorialize legal agreements between private parties are usually text; and the advice that lawyers give to their clients is often given in the form of text, too.

But as that list suggests, significant changes to how we produce and manage text are not just relevant to those who worry about lawyers’ employment. The use of automated tools in legal contexts has the potential to affect everyone. Regardless of how AI tools should be used, it is clear that they are being used by legal institutions in areas that affect everyone—including at least some forays into judicial opinion writing, issuing or revising regulations, and drafting laws.

My latest academic piece, written with Kevin Tobia, tries to take a first step toward grappling with what’s going on here. We look at what we call “generated legal texts”—texts that are partly or entirely generated by computer software, such as generative AI, for use by legal institutions. In the first part of the article, we report on the results of a wide survey of examples of people and institutions using generated legal texts. We find that the results are broad and deep: people are using AI to generate texts across a wide range of legal contexts, and they are at times relying on it intensively or for highly important activities. AI is being used to draft pleadings, to help write briefs, to translate evidence in court, to draft contracts, to propose revisions to regulations, and, as I mentioned above, even in some places to draft judicial opinions and legislation. Some of these uses are idiosyncratic, and may fizzle out or be regulated out of existence. Others are being adopted wholesale by institutions. And this is still just in the first couple of years since the release of ChatGPT, by a set of actors (courts, agencies, law firms) that often have the reputation of being slow adopters of new technology. It seems likely that we will see much more of this in the near future.

One thing we try to do in the piece is to look at text as the relevant unit of analysis. We think it can be useful to do so because there are some concerns and responses that accompany generated legal texts across different contexts—whether the issue is a public agency’s notice-and-comment rulemaking or litigation between private parties, for instance. We think some of these concerns are familiar from the use of AI in other contexts, but others arise in more distinctive ways when you focus on text.

As for the familiar issues, we point to concerns about bias and discrimination; about overreliance on limited tools; or about inaccuracy—the widespread problem of hallucinations, for instance. Although I’ve written elsewhere about the ways in which AI can be useful for lawyers despite its limitations, it also seems clear that there are some serious potential downsides—especially at this early stage of adoption when many of its limitations do not seem to be well understood by its users.

In addition to these familiar AI concerns, there are also issues that arise with text that look more distinct. We identify a few, but have no illusion that we’ve exhausted this category. A couple of the ones we discuss:

  • Floodgates: when it’s cheap and easy to generate texts nearly instantaneously, that raises the possibility of overwhelming institutions built implicitly around the costs and speeds associated with text written by humans. We’ve already seen, for instance, that federal agencies have been flooded with machine-generated comments during notice-and-comment periods. And even if it becomes possible to ban astroturfing and you just consider genuine disputes with real individuals, the huge background set of legal issues that currently go unresolved because of cost could threaten to overwhelm institutions if those cost curves change significantly (see, e.g., Yonathan Arbel’s writing about this potential problem).
  • Insincerity: when an AI-generated text is supposed to represent the reasoning, beliefs, or feelings of a person, that creates a potential “sincerity gap”—a difference between what the text represents to the world about the person and what that person actually thinks or feels. Obviously, this is true even of non-AI-generated text, too (hopefully it doesn’t come as a surprise to you that the reasons given by legal institutions can sometimes be pretextual). But we have norms of honesty and integrity in many areas of the law, even if those norms aren’t always lived up to. We view it as a problem if a judge writes their opinion one way while they secretly have other reasons for deciding the case the way they did. Having a machine that can automatically generate plausible reasons is likely to increase the ease with which that situation can arise—and could even make it feasible for judges to generate opinions without first doing the work to come up with sincere reasons and beliefs to begin with. Or consider a defendant at a criminal sentencing making a statement about their acceptance of responsibility, or an explanation of their beliefs or emotions at the time of the criminal act in question. The task of assessing sincerity in such a context is already difficult; adding in the possibility that a machine could have generated the text of the statement makes the problem even more difficult.

The paper also discusses what has, so far, been the primary policy response to concerns about generated legal text: ratification. By “ratification,” we mean a person’s acceptance of responsibility for a text. So, for instance, when a lawyer signs a brief that contains generated text, that lawyer is accepting responsibility for that text. That’s part of why lawyers are getting in trouble for filing briefs with hallucinations in them—they have ratified those briefs, so they are on the hook for their errors. Many courts have leaned into this approach, issuing standing orders or other policies explicitly saying that by signing briefs you are taking responsibility for AI-generated text, and/or by requiring disclosure alongside a signature.

Ratification has some nice features, in that it draws on many longstanding legal norms: clerks draft text for judges to use in their opinions, associates draft text for partners to put in briefs or contracts, and we have more- and less-formal ways of assigning ultimate responsibility for end-product texts in these scenarios with multiple authors. But ratification also has limitations, and can risk being a fig leaf in situations where the person ratifying the text at issue doesn’t have the right incentives to, e.g., actually do a good job ensuring the quality or accuracy of the text.

The current landscape of ratification is a kind of microcosm of the entire universe of generated legal texts. You have a set of novel issues, and some old institutions that are trying to come to grips with them largely by leaning on preexisting norms with some light updating. That fix works to some extent, but leaves a variety of concerns unaddressed.

The article tries to make some headway, but it is largely an exercise in diagnosis and description rather than prescription. Our hope is that the text-focused lens will be a useful one, as there may be solutions (or at least mitigations) that are shown to work in one domain that can be ported over to another. We are ultimately still in the relatively early days of legal institutions’ adaptation to AI. But these developments also aren’t hypothetical anymore, and generated legal texts are starting to become a real and regular part of the law—even if legal institutions aren’t quite ready.


Is AI actually useful in legal practice, or just a hotbed of errors?

One of my favorite anecdotes about AI and legal practice comes from a talk I was giving last spring to a group of lawyers. In the Q+A after the talk, one law firm partner raised his hand and told me about his week. Early in the week, his firm had received a notice from its malpractice insurance provider. The insurance company was giving notice that its malpractice policy did not cover any task on which generative AI was used. Fair enough—with legal news coverage replete with stories of fake cases and other hallucinations, it seems like an insurance provider might want to be cautious about the technology. But then, later in the week, his firm got a letter from one of their biggest clients, a large technology company. The tech company said that it would no longer pay for associates’ time on routine tasks that could be handled by generative AI—in other words, asking the law firm to start using generative AI instead of associates on a number of tasks, to save money.

That firm’s story captures the split reactions to generative AI that have been common in legal circles in recent years—some people love it, some people hate it, and many people (like the partners at that law firm) are just trying to figure out what to do about it. Lots of the conversation around AI is still future-oriented, with many different flavors of hype, skepticism, utopianism, and pessimism all focused on what to make of a bunch of trend lines. But there are also a lot of decisions that have to be made by many people and institutions today, about how to deal with the technology as it currently exists.

The dichotomous nature of reactions to AI was on my mind as I read two recently published empirical studies in the area of AI and legal practice. As I’ve mentioned, this is an area where I think we could use much more in the way of systematic empirical study. So I really enjoyed both of these papers. And, interestingly, they point in somewhat different directions. The papers are nuanced and not written as advocacy pieces, but I think it’s fair to say that one comes across as more positive about the use of AI in legal practice, and the other comes across as more negative. Yet they both made lots of sense to me, and in some ways were unsurprising. So I thought I would read the two papers in light of each other and try to reconcile the worlds of generative AI that they describe.

The first study is by Daniel Schwarcz et al. on “AI-Powered Lawyering.” This team of researchers has conducted one of the very few instances of real, empirical, comparative assessments of legal work done with AI assistance vs. without it. They put together a randomized controlled trial in which law students were assigned six different legal tasks (writing a memo, drafting a motion, etc.). The students were randomly sorted into groups that would do the tasks either with AI or without AI. The work product was then graded by graders who were blinded as to whether AI was used or not.

They found that for five of the six tasks, using AI resulted in large improvements in the speed with which the students completed the assignment. And in terms of quality, using an AI tool either produced measurably higher-quality work, or there wasn’t a meaningful difference. They tested two AI tools: the legal-industry-specific Vincent AI, and the generalist reasoning model o1-preview from OpenAI. They noted that o1-preview hallucinated more cases than Vincent, but that the overall rate of hallucinations was relatively low—18 hallucinations across the 768 completed tasks. So this paper is, I think, relatively bullish in its implications for using AI in legal practice: the results suggest that, at least in the student context (more on that in a bit), AI saves time and either improves quality or doesn’t impose a quality cost.

The second study is by Lisa Larrimore Ouellette et al., entitled Can AI Hold Office Hours? This study tested the ability of mainstream AI tools to answer legal questions with reference to a specific legal text. Notably, this approach of grounding generative AI answers in specific reference texts is one approach taken to reduce hallucinations. But as the authors note, this study’s “results were not encouraging.” They uploaded a patent law casebook to each of three generative AI tools (GPT-4o, Claude 3.5 Sonnet, and NotebookLM), then asked the tools 185 patent-law questions that related to the casebook. The questions were derived from real student questions the authors had received, questions directly posed in the casebook, and questions that the authors themselves wrote based on their experience as teachers. They then graded the answers in a systematic way, noting when the responses contained complete and accurate answers, answers with the right outcome but misinformation, etc.

Their results show significant failings across the board. The models gave “good” answers—complete and accurate—in around 50% of cases or fewer (56% for Claude, 49% for 4o, and 37% for NotebookLM). And there were high rates of “unacceptable” answers—14% (Claude), 26% (4o), and 31% (NotebookLM). These “unacceptable” answers contain significant errors—the authors also had a possible grade of “acceptable” for answers that were not as complete as the “good” answers and could include small errors. The authors are focused on AI tools’ usefulness for teaching and learning, and they conclude (very reasonably, to me) that these high rates of errors “fall short of the standards required for reliable use in legal education.” I think these results also counsel extreme hesitation about using these tools this way in legal practice.


When an AI system injures a lot of people, what will the lawsuits look like?

As we get into the swing of a new year, we find ourselves in the midst of another flurry of legislative activity around AI. For years, a lot of conversation has focused on what kinds of laws we need to pass to regulate AI—do we need new laws, and what should those new laws say?

Those are important questions. But we also are finding out that, regardless of the state of the law, the lawsuits are already here. And that means a whole different set of questions—about how suing people over AI is going to work. Who can sue whom? Over what? What will it take to prove a case? What kind of remedies are on the table? These questions are part of the debate over how we should regulate AI, and have arisen in various ways in debates over legislation so far. But they also can take on a life of their own, especially as litigation springs up that forces courts to manage these issues before legislators or regulators proactively address them. Part of what I hope to do sometimes on this blog is highlight the world of AI litigation alongside questions of AI policy—because litigation is going to be an important part of the policy picture.


Some informal testing of large language models on legal questions

Like many people, when a new large language model comes out I often give it a whirl by running some questions by it and seeing how it does. I was recently offline for an extended period—a few months of parental leave—that happened to coincide with the release of a bunch of models, including the first wave of "reasoning" models like OpenAI's o1 or DeepSeek's R1. So when I returned to work I thought that it might be fun to be slightly more systematic than my normal informal testing, and compare a bunch of models all at once to see if any patterns or lessons emerged.

My results, along with some thoughts, are below. I should note that this isn't anything like rigorous benchmarking, which (by my lights anyway) would require a broader and deeper set of questions, ideally run through these models multiple times and in a variety of ways. I think that kind of benchmarking is important, and underdone in the legal realm. But I also think that more informal, qualitative writeups are undersupplied when it comes to the law, especially compared to other fields. When a new model comes out, it is easy to find many examples of people testing it with prompts about coding or general knowledge. Law, not as much. And I think that matters. There's a lot of pressure on the topic of AI for people to resolve their perspective into a tidy verdict: a thumbs up or a thumbs down. But my sense is that, especially in the law, a lot of people just don't have much exposure to the ins and outs of everyday use of these tools, and spending some time with them paints a complex picture with both bright spots and blemishes.
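(As a side note, for models that expose a public API, the mechanics of a slightly more systematic harness are not complicated. The sketch below is purely illustrative rather than a description of how I ran my own questions: it assumes the OpenAI Python client, and the question list and model identifiers are placeholders. The hard part, grading the answers, still has to happen by hand or through a carefully validated rubric.)

    # Illustrative sketch: run the same questions through several models.
    # Placeholder questions and example model identifiers; grading is manual.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    questions = [
        "(civil procedure fact pattern and question #1 goes here)",
        "(civil procedure fact pattern and question #2 goes here)",
        # ...and so on, ideally with each question run several times
    ]
    models = ["gpt-4o", "o1"]  # example identifiers for API-accessible models

    for model in models:
        for i, question in enumerate(questions, start=1):
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": question}],
            )
            print(f"[{model}] Question {i}:")
            print(resp.choices[0].message.content)
            print()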

So here's a short writeup, containing some thoughts as of March 2025. I've got my main results in a table below, along with a description of my questions and some thinking out loud about all of this.

Here's a table with my results:

Now to explain what all is going on in this table:

The models

There are a few groupings to delineate here. The first set of models is, roughly speaking, the set I expected to do less well, a combination of the models being older or "less capable" in some sense. GPT-4 was released in the spring of 2023 and Claude 3 in the spring of 2024; Llama 3.3 70B is closer to the frontier, but is designed largely as a more efficient, somewhat pared-down version of an older model. None of these three is a "reasoning" model. In contrast, the second set of models sits closer to the current frontier and includes some reasoning models like o1 pro and R1. And finally, the last set consists of three tools designed specifically for legal use: Westlaw's "AI-Assisted Research" tool, Lexis's "Protege" tool, and vLex's "Vincent" tool. (As a disclosure, Ed Walters, the Chief Strategy Officer at vLex, teaches at Georgetown, where I also teach. I believe he may have helped the Georgetown Law Library get some free trials of Vincent for faculty at Georgetown, which I have used, but that is the full extent of his connection to this post.)

The questions

Next, the questions. I teach Civil Procedure, and used some questions that I ask students in class as well as on past final exams. These were designed to be roughly increasing in difficulty, with some important differences in kind in addition to that difference of degree.

Question One was a very basic question—it asked whether, in a particular scenario, a defendant in a lawsuit that was filed in state court could “remove” the case to federal court. The answer was a clear no because of a well-established rule called the “forum defendant rule.” Two models (Claude 3 Opus and Llama 3.3 70B) got this wrong, but every other model got it right.

Question Two was somewhat more complicated: it again involved removal, but this time asked whether a particular type of court order was appealable. Again, the answer was no, because of a statutory limit on the appealability of this type of order. But there was an extra difficulty here: the underlying order was wrongly decided (and a common step for a party that encounters an erroneous decision is to appeal it, if that's possible). So there was an obvious wrong path for the LLMs to take: talking about how the order was wrongly decided and so should be appealed. Most of the language models still managed this one, although interestingly Lexis's specialized legal AI tool (along with Llama 3.3 70B) said some false things along the way despite ultimately arriving at the right conclusion.

Question Three was the question that yielded the biggest spread in answer quality. It asked about an area of law (personal jurisdiction in class actions) where there is currently some uncertainty, with dozens of federal district courts coming out on different sides of the issue and only a few circuit courts weighing in. Notably, this uncertainty is about eight years old at this point, and has been written about in many different forums in that time—plenty of time to make its way into LLM training data.

On this question, several LLMs, including the Westlaw and Lexis AI tools, gave some sort of coherent answer but did not acknowledge the existence of any contrary authority or other reason to doubt that answer (I marked these as "bad answers" on the table). Some of these answers also mischaracterized the most relevant Supreme Court case (Bristol-Myers Squibb). A few more gave answers that articulated the majority rule and included vague statements consistent with the existence of uncertainty or a split of authority, without directly stating it; these are the "okay answers" in the table. A couple of tools gave good answers acknowledging the uncertainty. And two tools, ChatGPT 4.5 and Vincent, stood out as having particularly complete and thorough discussions of the issue.

Questions Four and Five were "issue spotter" questions excerpted from my final exam. An issue spotter is a kind of short story with a lot of legal issues thrown in for the test-taker to "spot" and discuss. Question Four involved an issue that was clearly delineated in the text: the reader was asked to discuss whether it was permissible to join a specific claim against a third-party defendant in a lawsuit if that claim was unrelated to the lawsuit. The difficulty here was that if the claim were standing on its own, the answer would be a clear "no" under Rule 14; but because there was already an appropriate, related claim against the third-party defendant, it was permissible to add the unrelated one as well, under Rule 18. (Yeah, it's complicated.) Every AI tool except o1 pro got this wrong, including the specialized legal tools (Vincent did an okay job, but still got the answer slightly off). They also tended to get it wrong in the same way: applying Rule 14, saying the claim needed to be related, and ignoring the Rule 18 nuance added by the presence of the preexisting claim.

Question Five was an issue of appellate jurisdiction that was more “hidden” in the issue spotter—there was no direct text asking the reader specifically about appellate jurisdiction, although there was a general call to address any jurisdictional issues. Every model missed this. And most of my students missed this one too! But, as I tell my students, the world will not always give legal issues to you in a nicely identified and labeled form, so identifying an issue that is present without being told that it is present is an important legal skill. It seems like it’s a skill that the LLMs have yet to master, at least when it comes to appellate jurisdiction. To be fair to the specialized law tools, they don't really pretend to be equipped for this, and in fact could not even ingest the full text of the issue spotter question, so I marked this as "N/A" for them.

Reflections

So what does all this suggest? Like I said above, this is not the kind of testing that you should take to the bank—this is the kind of short, informal testing that is good for creating some working impressions rather than firm conclusions. But what are those working impressions?

First, it seems like the new wave of models is doing better with legal questions than the old ones. This is true of both the "reasoning" models and just the most advanced versions of the "traditional" models (or whatever we are calling the non-reasoning models, like ChatGPT 4.5 or Claude 3.5 new). They got more answers right. And, more subjectively, they did a better job at explaining and articulating their answers. OpenAI's o1 pro, a reasoning model, was also the only model to get my Question Four correct; even the specialized legal AI tools missed it, although Vincent came pretty close.

Second, I still would not really trust any AI tool to answer a legal question in any setting with significant consequences attached. Look at all the red and yellow on that table. The closest tool I could imagine using for the task of "answering legal questions" is Vincent, in part because of the higher quality of its answers and in part because its citations and interface make it easy to check those answers, which, as these results indicate, you really have to do. And even then, I would still do some research on my own, as false negatives seem like a problem everywhere, too. These tools are incredible in many ways, but they aren't (yet?) at the "ask it a legal question and you can simply rely on the answer" stage. That isn't to say they are useless; there are many, many things you can productively use an LLM for in legal practice other than relying on it to generate correct answers to legal research questions. But they are definitely not oracles for legal questions.

Third, it seems possible that these questions delineate a few kinds of common failings of LLMs when it comes to legal questions (although count this all as impressionistic and highly speculative):

  • As Question Five suggests, one of these failings may just be issue spotting where the relevant input does not give strong hints or cues to look for a particular type of issue—like where an appellate jurisdiction issue is lurking in the midst of a few other legal questions, but no part of the prompt suggests that fact explicitly.
  • Question Three also might suggest that LLMs have difficulty dredging up contrary lines of precedent where there is a split of authority but one dominant viewpoint, even if those contrary lines are still real and active in some jurisdictions. But that's a pretty big conclusion to draw from this one example, so take this hypothesis as even more tentative than the other ones I'm suggesting here.
  • Question Four raises an interesting possibility to me, which is that LLMs might have difficulty teasing apart exceptions from rules where the inputs that trigger the exception look very similar to the inputs that trigger the rule. Seeing model after model repeat the exact same flawed reasoning about Rule 14 certainly made it seem like something in the models was pushing them toward applying the relatedness test that Rule 14 demands. Maybe that reasoning is just much more prevalent in their training data, or "looks like" something that is more common. That could just be specific to this tiny corner of the universe of legal text, but I could also imagine it reflecting a more difficult issue: often, the application of a rule will be common and the application of its exception will be rare, and so, for AI models built on predictive engines, teasing apart the exception from the rule could be difficult. Again, that's just speculation, but Question Four didn't seem all that much more difficult to me than Question Two, and yet the distribution of results was the complete opposite.

Alright, that’s enough speculation for now. I also hope to write at some point about some of the more systematic empirical work that has been done to test LLMs on legal questions. And I may come back and update this down the road when new models come out.


Welcome to Test Case

Welcome to "Test Case," my occasional blog.

The popularity of blogs has ebbed and flowed over the years, but I have always felt like they occupy a useful niche. There are a lot of conversations worth having in formats that are longer than a tweet but shorter or less formal than more traditional kinds of publication. And that’s what I’ll be aiming to do here.

About me: I'm a law professor, and my current research focuses on artificial intelligence, consumer protection, and civil procedure. My writing here will likely focus mostly on those areas, and especially on topics that involve two or more of them (for instance, civil litigation about AI). In legal academia, the standard form for academic work and discussion is the law review article. But law review articles are long (like, 25,000 words long) and can take a long time to publish (often more than a year). Not every idea needs that many words to be expressed. And not every idea is clearly good enough to be worth investing that amount of time and effort in. So that's where a blog comes in: a place for ideas that are shorter and more tentative than the ones that make it into my more formal writing.

The name of the blog—Test Case—is a reference to ideas in two worlds. In the world of software development, test cases are a way of testing out new code, like a new product or new feature—you run some test cases to see how things work and find potential flaws or unexpected results. In the world of law, a test case is a case brought deliberately to try and make a new, useful precedent—to challenge a law and get it struck down, or to establish a new legal principle or overturn an old one. For the blog, the name reflects that the writing will touch on topics in law and technology; and the connotations of novel issues and "testing things out" are ones that feel particularly appropriate for the blogging medium.

I hope you enjoy reading it. My plan is for posts to be only occasional, rather than regularly scheduled. If you think you might like to read what I write here in the future, please consider signing up for the email list. I would also welcome any thoughts or reactions—so feel free and encouraged to leave a comment or send me an email about anything here.

(Image Credit: DALL•E)
