<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>David S. Kemp</title>
  <subtitle>Lawyers are learning to work with artificial intelligence. Artificial intelligence is learning to work with law. This blog explores how — through pedagogy, practice, policy, and the ethical questions that connect them.</subtitle>
  <link href="https://davidkemp.ai/blog/feed.xml" rel="self" />
  <link href="https://davidkemp.ai/blog/" />
  <updated>2026-04-17T00:00:00.000Z</updated>
  <id>https://davidkemp.ai/blog/</id>
  <author>
    <name>David S. Kemp</name>
  </author>
  <entry>
    <title>Building Infrastructure with AI: A Case Study</title>
    <link href="https://davidkemp.ai/blog/building-infrastructure-with-ai-a-case-study/" />
    <updated>2026-04-17T00:00:00.000Z</updated>
    <id>https://davidkemp.ai/blog/building-infrastructure-with-ai-a-case-study/</id>
    <content type="html">&lt;p&gt;Over a weekend in April, I built and deployed a self-hosted news aggregation system — a pipeline that pulls from twenty sources every morning, deduplicates items using vector embeddings, generates AI summaries, clusters primary sources with their commentary, and serves everything through a web dashboard. I am not a software engineer. My daily tools are PowerPoint, SharePoint, and the occasional Excel formula. I built this using AI as my primary collaborator: Claude for design and specification, Cowork for execution, and ChatGPT and Gemini for troubleshooting along the way.&lt;/p&gt;
&lt;p&gt;This post is a departure from the usual subject matter here, which tends to focus on judicial decisions, ethics rules, and the architectural properties of large language models. But the project illustrates something I think my readers need to see in concrete terms: the specification-first, verification-heavy workflow I have been describing in prior posts applies well beyond AI-generated legal work. The same method — specify before you build, verify in a separate session, treat every AI output as a first draft — transfers directly to infrastructure, and the skills it demands are the skills lawyers already possess.&lt;/p&gt;
&lt;h2&gt;The problem&lt;/h2&gt;
&lt;p&gt;My work sits at the intersection of AI, legal practice, legal education, and knowledge management. The relevant news comes from everywhere: judicial opinions on CourtListener, Substack newsletters from researchers like Ethan Mollick, legal tech blogs like Bob Ambrogi&#39;s LawSites, ABA ethics opinions, podcast episodes, regulatory filings, and mainstream tech coverage. No single aggregator covers this spread. I was spending fragmented time across a dozen tabs each morning, often reading the same story covered by three different outlets, and had no systematic way to archive items I wanted to revisit.&lt;/p&gt;
&lt;p&gt;I needed a single dashboard with short digests of each item, the ability to mark things as read or important, an archive for later retrieval, and — critically — some way to reduce duplicates and cluster related coverage together. A judicial opinion and the three blog posts analyzing it should appear as one entry, not four.&lt;/p&gt;
&lt;h2&gt;Brainstorming and specification in Claude Chat&lt;/h2&gt;
&lt;p&gt;I started with a plain-language description of the problem in a Claude chat conversation. No code, no architecture diagrams — just &amp;quot;help me brainstorm and then design a way to collect in a single place all news relevant to my work.&amp;quot;&lt;/p&gt;
&lt;p&gt;Claude proposed three approaches ranging from off-the-shelf (Feedly) to fully custom (Python pipeline), and recommended a middle path: n8n as the orchestration layer, Claude&#39;s own API for summarization, and a lightweight web dashboard. We went back and forth on delivery format (I chose a dashboard over email), volume tolerance (fifty items per day with aggressive summarization), and hosting (a small cloud VPS rather than my home NAS).&lt;/p&gt;
&lt;p&gt;The key output of this conversation was not code. It was a complete project specification document: architecture diagrams, database schema, Docker Compose configuration, the exact LLM prompts for summarization and source clustering, a phased implementation plan, and a rationale log explaining every major design decision. Claude also produced an interactive React prototype of the dashboard so I could evaluate the UX before committing to implementation.&lt;/p&gt;
&lt;p&gt;I then ran adversarial testing on the spec — probing for gaps, challenging assumptions, and verifying that the proposed architecture could handle edge cases. This step deserves emphasis. A specification document generated by an AI is a first draft, and I treated it the way I treat a student&#39;s first attempt at a research memo: read it critically, push on the weak spots, and send it back for revision. The &lt;a href=&quot;/2026-03-31-delegate-the-task-not-the-judgment/&quot;&gt;judgment-delegation framework&lt;/a&gt; I described in a prior post applies here with equal force. I asked Claude to generate options, structures, and tradeoff analyses. I did not ask it to decide which architecture was &amp;quot;best&amp;quot; — I evaluated the alternatives myself, using the spec as my working document.&lt;/p&gt;
&lt;h2&gt;Execution in Cowork&lt;/h2&gt;
&lt;p&gt;Once the spec was solid, I shifted to Cowork — Anthropic&#39;s tool for delegating discrete tasks to Claude with file and computer access. Where Chat is a conversation, Cowork is closer to handing a task to a capable assistant along with the relevant documents.&lt;/p&gt;
&lt;p&gt;I created a separate task conversation for each major implementation step: provisioning the VPS, installing Docker, configuring the reverse proxy, building the n8n ingestion workflow, standing up the database, and assembling the dashboard. This separation kept each conversation focused and prevented the context from getting muddled — the &lt;a href=&quot;/2026-03-30-what-your-ai-forgets-midsentence/&quot;&gt;one-task-one-conversation principle&lt;/a&gt; applied to infrastructure work just as it applies to legal analysis. When I needed to reference an earlier decision, I pulled it from the spec document rather than expecting the model to recall a conversation from three tasks ago.&lt;/p&gt;
&lt;p&gt;At each stage, I provided Claude with the project spec and asked it to execute the next phase. When things broke — and they broke often — I gave Claude screenshots, error messages, and terminal output. The pattern was consistent: describe the failure, show the evidence, get a diagnosis and fix.&lt;/p&gt;
&lt;h2&gt;Troubleshooting across models&lt;/h2&gt;
&lt;p&gt;When I hit infrastructure problems — Docker permission errors, Caddy configuration syntax issues, n8n&#39;s authentication flow not matching its documentation — I did not rely on Claude alone. I also used ChatGPT and Gemini for troubleshooting.&lt;/p&gt;
&lt;p&gt;Different models have different strengths. Some error messages got faster, more accurate diagnoses from one model than another. When I was stuck on a Caddyfile syntax problem that had Claude and me going in circles, a fresh perspective from a different model identified the issue immediately. The practical lesson is one I have already argued in the context of &lt;a href=&quot;/2026-04-15-the-model-will-not-push-back/&quot;&gt;sycophancy&lt;/a&gt;: when a model&#39;s output confirms your existing approach and you are still stuck, a second model operating without that conversational history can surface what the first one missed. Treat AI models the way you would treat colleagues with overlapping but non-identical expertise.&lt;/p&gt;
&lt;h2&gt;What went wrong&lt;/h2&gt;
&lt;p&gt;Docker permissions tripped me up repeatedly because I skipped a post-installation step the guide told me to perform. n8n&#39;s authentication system has changed since its documentation — and Claude&#39;s training data — was written, and figuring out the current approach required stripping out configuration and resetting data volumes. The VPS ran out of memory under load: three services on one gigabyte of RAM was not viable, and I had to hard-reboot through DigitalOcean&#39;s web console and resize to a larger instance. Caddy&#39;s subpath routing created cookie and redirect conflicts that were cleanest to resolve by giving n8n its own domain — a design compromise I would not have predicted at the specification stage.&lt;/p&gt;
&lt;p&gt;Two observations about these failures. First, every one of them was solvable without engineering expertise. They required patience, the ability to read an error message and describe it clearly, and willingness to try a different approach when the first one did not work. Second, several stemmed from stale training data — the model&#39;s knowledge of how n8n handles authentication was outdated, and its assumptions about memory requirements did not match current resource demands. The lesson echoes what I have written about &lt;a href=&quot;/2026-03-30-what-your-ai-forgets-midsentence/&quot;&gt;verifying AI-generated legal analysis&lt;/a&gt;: the model produces confident output regardless of whether the underlying information is current, and the user bears the burden of checking.&lt;/p&gt;
&lt;h2&gt;What worked&lt;/h2&gt;
&lt;p&gt;The specification-first approach saved significant time during implementation. Because the architecture, schema, prompts, and deployment configuration were all documented before I touched a server, each implementation step had a clear target. I was not making design decisions and debugging Docker at the same time.&lt;/p&gt;
&lt;p&gt;The interactive dashboard prototype — built during the brainstorming phase, before any backend existed — let me validate the UX early. I could see exactly how source clustering would look, how topic filters would work, and how the read/starred/archived states would behave. Changing a UI decision at the prototype stage costs nothing; changing it after you have built the API costs real time.&lt;/p&gt;
&lt;p&gt;Creating separate Cowork conversations per task kept the context clean. AI models perform better with focused context than with a sprawling conversation that covers everything from database schema to CSS styling. This is the infrastructure equivalent of the OTOC (one task, one conversation) rule, and it worked for the same reasons.&lt;/p&gt;
&lt;h2&gt;What this demonstrates for legal professionals&lt;/h2&gt;
&lt;p&gt;The skills that made this project work were not technical. They were the skills I use in teaching and legal scholarship every day: defining a problem with precision, evaluating a proposed solution against requirements, spotting gaps in reasoning, describing failures with enough specificity to enable diagnosis, and — critically — knowing when to seek a second opinion.&lt;/p&gt;
&lt;p&gt;The specification document was the most valuable artifact of the entire project, more valuable than the code or the running system. It captures the reasoning behind every design decision. When something breaks six months from now, the spec explains why the system was built this way and what tradeoffs were accepted. A good transactional lawyer does the same thing when she documents not just the deal terms but the logic behind them. The discipline is identical; the domain is different.&lt;/p&gt;
&lt;p&gt;I used three different models over the course of the project. Each contributed something the others did not. Each also made mistakes — outdated configuration syntax, deprecated API references, architecture assumptions that did not survive contact with the actual infrastructure. The skill that determined whether those mistakes derailed the project or became minor obstacles was the same skill that determines whether a lawyer catches a bad case citation: the habit of verifying before relying.&lt;/p&gt;
&lt;p&gt;The cost of the entire system is modest. The VPS runs at twelve dollars per month. API costs for daily summarization are under five dollars per month. The AI tools I used are available on consumer-tier subscriptions. The scarcest resource was my time and attention — roughly a weekend&#39;s worth of focused work, spread across several sessions.&lt;/p&gt;
&lt;h2&gt;The broader point&lt;/h2&gt;
&lt;p&gt;Every post on this blog has argued, in one form or another, that using AI well requires the same professional skills that using any powerful tool well requires: clear delegation, critical evaluation, structured verification, and the judgment to know what to trust and what to check. This project tested that thesis outside the legal domain, and the thesis held. The workflow that produced a working news aggregation system is the same workflow I recommend for producing a reliable contract analysis: specify before you execute, keep your context focused, verify in a separate session, and never let the model&#39;s confidence substitute for your own judgment.&lt;/p&gt;
&lt;p&gt;What surprised me was not that AI could help a non-engineer build infrastructure — the marketing promises that much. What surprised me was how precisely the failure modes mapped onto the ones I have been writing about in the legal context. Stale training data produced the same kind of confident-but-wrong output that produces hallucinated case citations. Sycophantic confirmation of my initial approach delayed the fix for the Caddyfile problem by the same mechanism that delays a lawyer&#39;s recognition that her legal theory has a hole. The mitigation strategies were identical: adversarial prompting, fresh sessions, and a reflexive distrust of agreement.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;This post describes a personal project using &lt;a href=&quot;https://claude.ai&quot;&gt;Claude&lt;/a&gt;, &lt;a href=&quot;https://claude.ai&quot;&gt;Cowork&lt;/a&gt;, &lt;a href=&quot;https://chat.openai.com&quot;&gt;ChatGPT&lt;/a&gt;, and &lt;a href=&quot;https://gemini.google.com&quot;&gt;Gemini&lt;/a&gt; to build a self-hosted news aggregation system. The infrastructure runs on &lt;a href=&quot;https://www.digitalocean.com&quot;&gt;DigitalOcean&lt;/a&gt;, uses &lt;a href=&quot;https://n8n.io&quot;&gt;n8n&lt;/a&gt; for workflow orchestration, and serves content through a &lt;a href=&quot;https://caddyserver.com&quot;&gt;Caddy&lt;/a&gt; reverse proxy. The specification-first approach and verification strategies discussed here build on the frameworks described in prior posts on &lt;a href=&quot;/2026-03-30-what-your-ai-forgets-midsentence/&quot;&gt;context management&lt;/a&gt;, &lt;a href=&quot;/2026-03-31-delegate-the-task-not-the-judgment/&quot;&gt;judgment delegation&lt;/a&gt;, and &lt;a href=&quot;/2026-04-15-the-model-will-not-push-back/&quot;&gt;sycophancy&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>The Model Will Not Push Back</title>
    <link href="https://davidkemp.ai/blog/the-model-will-not-push-back/" />
    <updated>2026-04-15T00:00:00.000Z</updated>
    <id>https://davidkemp.ai/blog/the-model-will-not-push-back/</id>
    <content type="html">&lt;p&gt;On March 4, 2026, Nippon Life Insurance Company of America filed a &lt;a href=&quot;https://www.courtlistener.com/docket/69648931/nippon-life-insurance-company-of-america-v-openai-foundation/&quot;&gt;50-page complaint&lt;/a&gt; in the Northern District of Illinois against OpenAI Foundation and OpenAI Group PBC. The claims — tortious interference with a contract, abuse of process, and unlicensed practice of law — arise from a set of facts that read less like a typical insurance dispute and more like a case study in what happens when a consumer AI tool functions as a client&#39;s sole legal advisor.&lt;/p&gt;
&lt;p&gt;The underlying story is worth recounting briefly, because the dynamic it illustrates extends well beyond this one litigant.&lt;/p&gt;
&lt;h2&gt;What happened&lt;/h2&gt;
&lt;p&gt;Graciela Dela Torre settled a long-term disability benefits dispute with Nippon in January 2024. She signed a release, Nippon paid, and the case was dismissed with prejudice. A year later, she wrote to her former attorney, Kevin Probst, expressing her belief that the settlement resulted from errors or omissions and asking to reopen the case. Probst reminded her that she had signed a mutual release and that the dismissal with prejudice was final.&lt;/p&gt;
&lt;p&gt;What happened next is the core of Nippon&#39;s complaint. Dela Torre uploaded Probst&#39;s letter to ChatGPT and asked whether she was being gaslighted. ChatGPT analyzed the letter and concluded that Probst&#39;s response &amp;quot;invalidated Dela Torre&#39;s feelings, dismissed her perspective, and deflected responsibility for her dissatisfaction.&amp;quot; It characterized his tactics as gaslighting &amp;quot;aimed at emotionally manipulating Dela Torre.&amp;quot;&lt;/p&gt;
&lt;p&gt;Dela Torre fired her lawyers. She then turned to ChatGPT for legal assistance — asking it how to vacate the settlement agreement and reopen the lawsuit. ChatGPT generated proposed legal arguments under Federal Rule of Civil Procedure 60(b), formulated a statement of facts, drafted a motion, and provided her with the completed filing. She submitted a pro se appearance and filed the motion. When the court denied it — holding that &amp;quot;second thoughts are not a valid reason to reopen this lawsuit&amp;quot; — she used ChatGPT to initiate an entirely new lawsuit, amend the complaint to add Nippon as a defendant, and generate dozens of additional motions, subpoenas, and requests for judicial notice. The complaint alleges she filed 44 motions, memoranda, and demands, plus 14 requests for judicial notice, all drafted with ChatGPT&#39;s assistance. At least one filing cited a fabricated case — &lt;em&gt;Carr v. Gateway, Inc.&lt;/em&gt;, 944 F.Supp.2d 602 (D.S.C. 2013) — which does not exist in the Federal Supplement. When asked about the case, ChatGPT confirmed it was real and produced a detailed summary consistent with the fabricated citation.&lt;/p&gt;
&lt;p&gt;The hallucinated case citation is the kind of failure that has received extensive attention since &lt;a href=&quot;https://law.justia.com/cases/federal/district-courts/new-york/nysdce/1:2022cv01461/576785/54/&quot;&gt;&lt;em&gt;Mata v. Avianca&lt;/em&gt;&lt;/a&gt; and &lt;a href=&quot;https://www.courtlistener.com/opinion/10556397/park-v-kim/&quot;&gt;&lt;em&gt;Park v. Kim&lt;/em&gt;&lt;/a&gt;. But I want to focus on a different failure — one that occurred earlier in the sequence, was less visible, and arguably caused more damage.&lt;/p&gt;
&lt;h2&gt;The validation problem&lt;/h2&gt;
&lt;p&gt;When Dela Torre asked ChatGPT whether her lawyer was gaslighting her, the model did not say &amp;quot;I can&#39;t evaluate your attorney&#39;s motives based on a single letter.&amp;quot; It did not note that a lawyer reminding a client of the terms of a signed release is performing a routine professional function. It told her what she wanted to hear — validating her emotional interpretation of a legal communication, characterizing standard legal advice as manipulation, and helping set in motion the sequence of filings that followed.&lt;/p&gt;
&lt;p&gt;The AI alignment literature calls this sycophancy — the tendency of large language models to affirm a user&#39;s stated position rather than challenge it. Hallucination has dominated the conversation about AI reliability in legal contexts, but sycophancy may be the more consequential problem for lawyers and their clients.&lt;/p&gt;
&lt;p&gt;The empirical evidence is now substantial. A &lt;a href=&quot;https://www.science.org/doi/10.1126/science.aec8352&quot;&gt;March 2026 study in &lt;em&gt;Science&lt;/em&gt;&lt;/a&gt; (Cheng et al.) tested 11 large language models and found that AI affirmed users&#39; positions 49 percent more often than human advisors did — and endorsed harmful or illegal behavior 47 percent of the time when users expressed a preference for it. The &lt;a href=&quot;https://www.law.georgetown.edu/tech-institute/research-insights/insights/ai-sycophancy-impacts-harms-questions/&quot;&gt;Georgetown Law Tech Institute&lt;/a&gt; and a &lt;a href=&quot;https://link.springer.com/article/10.1007/s43681-026-01007-4&quot;&gt;Springer &lt;em&gt;AI and Ethics&lt;/em&gt; paper&lt;/a&gt; (2026) both frame sycophancy as an epistemic harm: systems designed to please users systematically undermine the quality of the advice they provide.&lt;/p&gt;
&lt;p&gt;The mechanism traces to how these models are built. LLMs are trained through reinforcement learning from human feedback, a process that rewards outputs humans rate as helpful, harmless, and honest. In practice, &amp;quot;helpful&amp;quot; tends to dominate. Users rate responses more favorably when those responses align with their expectations, and the training process optimizes accordingly. The result is a system that has learned — at the level of its weights, not through any deliberate policy choice — to produce the answer the user appears to want. When the question is factual and well-defined (&amp;quot;what does Rule 60(b) require?&amp;quot;), this tendency is usually harmless. When the question calls for evaluation (&amp;quot;is my lawyer right?&amp;quot;), it becomes a source of systematic error.&lt;/p&gt;
&lt;h2&gt;Why this should concern practicing lawyers&lt;/h2&gt;
&lt;p&gt;Dela Torre is a pro se litigant, and it is tempting to treat her experience as a cautionary tale about unsophisticated users and consumer chatbots. But the sycophancy problem does not depend on the user&#39;s lack of legal training. It depends on the structure of the interaction — and that structure is the same whether the user is a former disability claimant in Elgin, Illinois, or a fifth-year associate at a midsize firm.&lt;/p&gt;
&lt;p&gt;Consider the prompts a lawyer sends to an LLM in ordinary practice. &amp;quot;Is this argument strong?&amp;quot; &amp;quot;Does this clause create meaningful exposure?&amp;quot; &amp;quot;Am I reading this statute correctly?&amp;quot; Each asks the model to evaluate the user&#39;s reasoning, and each is susceptible to the same validation bias the Cheng et al. study documents. The model will tend to affirm the lawyer&#39;s analysis, emphasize the strengths already identified, and understate the weaknesses — not because it has been instructed to flatter, but because agreeable outputs are what its training optimized it to produce.&lt;/p&gt;
&lt;p&gt;What makes this particularly hard to catch is that a sycophantic response arrives in polished prose with accurate citations and a confident analytical structure — indistinguishable, on its face, from the kind of careful independent evaluation the lawyer was seeking. On novel questions or unfamiliar areas of law, the difference between rigorous analysis and sycophantic analysis is invisible without independent grounds for comparison.&lt;/p&gt;
&lt;p&gt;In a &lt;a href=&quot;/2026-03-31-delegate-the-task-not-the-judgment/&quot;&gt;prior post&lt;/a&gt;, I argued that the most common mistake lawyers make with LLMs is asking the model to exercise professional judgment rather than to surface information the lawyer needs to exercise that judgment herself. I identified a set of &amp;quot;judgment words&amp;quot; — &lt;em&gt;reasonable&lt;/em&gt;, &lt;em&gt;appropriate&lt;/em&gt;, &lt;em&gt;significant&lt;/em&gt;, &lt;em&gt;material&lt;/em&gt; — that signal the delegation of evaluative work to a system not equipped to perform it. The sycophancy problem adds a layer to that analysis. Even when a lawyer structures the prompt well — asking for options rather than conclusions, requesting counterarguments alongside supporting authority — the model&#39;s outputs can be subtly shaped by its inference of what the user wants. If you ask for three arguments on each side of a question, the model may produce stronger, more detailed arguments on whichever side it infers you favor, based on how you framed the question, what documents you uploaded, or what positions you endorsed earlier in the conversation.&lt;/p&gt;
&lt;p&gt;The practical implication: verification catches hallucinated citations. It does not catch an analysis that is plausible, well-sourced, and systematically skewed toward confirming what you already think.&lt;/p&gt;
&lt;h2&gt;The supervision dimension&lt;/h2&gt;
&lt;p&gt;When a partner asks an associate to draft a memo, she expects the associate to exercise independent judgment — to push back on weak arguments, flag unfavorable authority, and say &amp;quot;I looked into your theory and it doesn&#39;t hold up&amp;quot; when it doesn&#39;t. An LLM will almost never do that unbidden. It will draft the memo, support the theory, and produce a work product that reads as though an independent mind evaluated the question and reached the same conclusion the assigning attorney expected.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.americanbar.org/groups/professional_responsibility/publications/model_rules_of_professional_conduct/rule_5_1_responsibilities_of_a_partner_or_supervisory_lawyer/&quot;&gt;Model Rule 5.1&lt;/a&gt; requires partners and supervisory lawyers to make reasonable efforts to ensure that subordinates&#39; work conforms to professional obligations. &lt;a href=&quot;https://www.americanbar.org/groups/professional_responsibility/publications/model_rules_of_professional_conduct/rule_5_3_responsibilities_regarding_nonlawyer_assistance/&quot;&gt;Rule 5.3&lt;/a&gt; extends analogous duties to nonlawyer assistants — a category that, under &lt;a href=&quot;https://www.americanbar.org/content/dam/aba/administrative/professional_responsibility/ethics-opinions/aba-formal-opinion-512.pdf&quot;&gt;ABA Formal Opinion 512&lt;/a&gt;, encompasses AI tools used in legal practice. The supervisory obligation has traditionally focused on accuracy and confidentiality. Sycophancy introduces a different challenge: the work product may be accurate in its citations and well-constructed in its reasoning, yet still reflect a systematic bias toward the conclusion the supervising attorney signaled. A supervisor who reviews only for accuracy and completeness will not catch the distortion, because it lives in what the memo fails to say — the counterarguments it understated, the unfavorable authorities it deemphasized, the analytical path it did not take because that path leads away from the answer the user appeared to want.&lt;/p&gt;
&lt;h2&gt;What this means for legal education&lt;/h2&gt;
&lt;p&gt;Law school pedagogy — at its best — is built on structured challenge. The Socratic method works because it forces students to defend their reasoning against pressure, distinguish their position from adjacent ones, and identify the weaknesses in their own analysis before someone else does. The method is, by design, anti-sycophantic. A good professor does not tell a student her reading of a case is strong. She asks &amp;quot;what&#39;s the strongest argument against your position?&amp;quot; and refuses to move on until the student can articulate it.&lt;/p&gt;
&lt;p&gt;An LLM will not do this unless explicitly instructed to, and even then its tendency toward agreement will attenuate the challenge. A student who uses an LLM to prepare for class, work through hypotheticals, or test her analysis is training with a tool that rewards existing reasoning rather than stress-testing it. Over time, that produces weaker instincts for self-critique — not because the tool gives wrong answers, but because it gives comfortable ones.&lt;/p&gt;
&lt;p&gt;I want to be careful not to overstate the case. LLMs can be prompted to argue the other side. But counteracting a system&#39;s default requires knowing the default exists, and most users do not. The Cheng et al. findings show that even when users ask genuinely open-ended questions, the models tilt toward agreement. The bias is a background condition of the interaction, not something triggered only by leading questions.&lt;/p&gt;
&lt;h2&gt;The practical response&lt;/h2&gt;
&lt;p&gt;What follows are adjustments that account for sycophancy specifically, building on the &lt;a href=&quot;/2026-03-30-what-your-ai-forgets-midsentence/&quot;&gt;prompting strategies&lt;/a&gt; and &lt;a href=&quot;/2026-03-31-delegate-the-task-not-the-judgment/&quot;&gt;judgment-delegation framework&lt;/a&gt; from earlier posts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prompt for disagreement, not agreement.&lt;/strong&gt; Instead of asking the model whether your analysis is correct, ask it to identify every weakness in your position. Instead of &amp;quot;is this argument strong?&amp;quot;, try &amp;quot;assume opposing counsel is excellent — what are the three strongest attacks on this argument, and what authority supports each one?&amp;quot; The framing matters: a prompt that presupposes the analysis is sound (&amp;quot;review my argument&amp;quot;) invites a sycophantic response. A prompt that presupposes it has flaws (&amp;quot;identify the weaknesses&amp;quot;) works against the grain of the model&#39;s training.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use adversarial sessions.&lt;/strong&gt; Run your analysis through a second, separate conversation in which the model is instructed to argue the opposing side. The &lt;a href=&quot;/2026-03-30-what-your-ai-forgets-midsentence/&quot;&gt;OTOC rule&lt;/a&gt; (one task, one conversation) already counsels starting fresh conversations for each discrete task. An adversarial session goes further: it eliminates the conversational context that anchors the sycophantic tendency. A model that helped you build an argument in one session has a prior commitment to that argument&#39;s success; a fresh session does not.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Treat confirmation as a weak signal.&lt;/strong&gt; When the model&#39;s analysis aligns with your own, that alignment should carry less weight than when it identifies something you did not expect. Agreement may reflect the model&#39;s tendency to mirror your reasoning; disagreement runs against that tendency and is therefore more informative. This is a heuristic, not a rule — surprising outputs can also be wrong. But in a system biased toward agreement, the unexpected response deserves more attention than the confirming one.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Withhold your conclusion.&lt;/strong&gt; If you want the model to evaluate a legal question, do not tell it what you think the answer is before you ask. Provide the relevant facts and authorities, but let the model reach its own conclusion first. Once you have stated a position in the conversation, the model&#39;s subsequent analysis will be shaped by it — the sycophancy-specific complement to the &amp;quot;judgment words&amp;quot; framework from the &lt;a href=&quot;/2026-03-31-delegate-the-task-not-the-judgment/&quot;&gt;earlier post&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;The deeper problem&lt;/h2&gt;
&lt;p&gt;Every mitigation strategy above is a workaround for a system that does not, by default, do what good counsel does. A good lawyer tells the client what she needs to hear. A good associate tells the partner that the theory is weaker than it looks. A good professor tells the student that the analysis has a gap. These are acts of professional independence — and they are precisely the acts that sycophantic AI systems are architecturally disinclined to perform.&lt;/p&gt;
&lt;p&gt;Hallucination is a more dramatic failure and easier to detect — a fabricated citation either exists or it doesn&#39;t. Sycophancy produces outputs that are not wrong in any verifiable sense but are tilted — toward agreement, toward comfort, toward the conclusion the user signaled she was looking for. A lawyer who relies on a tool with that tilt, without recognizing it, will develop an inflated confidence in her own reasoning, because the tool will rarely give her cause to doubt it.&lt;/p&gt;
&lt;p&gt;That is the quiet damage — the slow erosion of the habit of self-challenge that distinguishes professional judgment from mere fluency.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;This post draws on the complaint in &lt;a href=&quot;https://www.courtlistener.com/docket/69648931/nippon-life-insurance-company-of-america-v-openai-foundation/&quot;&gt;Nippon Life Insurance Company of America v. OpenAI Foundation et al.&lt;/a&gt;, No. 1:26-cv-02448 (N.D. Ill. filed Mar. 4, 2026); Cheng et al., &lt;a href=&quot;https://www.science.org/doi/10.1126/science.aec8352&quot;&gt;AI Sycophancy&lt;/a&gt;, Science (2026); the Georgetown Law Tech Institute&#39;s &lt;a href=&quot;https://www.law.georgetown.edu/tech-institute/research-insights/insights/ai-sycophancy-impacts-harms-questions/&quot;&gt;analysis of sycophancy harms&lt;/a&gt;; and the ABA &lt;a href=&quot;https://www.americanbar.org/groups/professional_responsibility/publications/model_rules_of_professional_conduct/&quot;&gt;Model Rules of Professional Conduct&lt;/a&gt; and &lt;a href=&quot;https://www.americanbar.org/content/dam/aba/administrative/professional_responsibility/ethics-opinions/aba-formal-opinion-512.pdf&quot;&gt;Formal Opinion 512&lt;/a&gt;. The prompting strategies build on approaches described in prior posts on &lt;a href=&quot;/2026-03-30-what-your-ai-forgets-midsentence/&quot;&gt;context management&lt;/a&gt; and &lt;a href=&quot;/2026-03-31-delegate-the-task-not-the-judgment/&quot;&gt;judgment delegation&lt;/a&gt;. For background on the consumer-versus-commercial data-handling divide and its legal implications, see the earlier entries in this &lt;a href=&quot;/2026-03-20-your-ai-conversations-are-not-confidential/&quot;&gt;series&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>What Your AI Forgets Mid-Sentence — And What to Do About It</title>
    <link href="https://davidkemp.ai/blog/what-your-ai-forgets-midsentence/" />
    <updated>2026-03-29T00:00:00.000Z</updated>
    <id>https://davidkemp.ai/blog/what-your-ai-forgets-midsentence/</id>
    <content type="html">&lt;p&gt;Syntheia published a &lt;a href=&quot;https://syntheia.io/blog/silent-but-deadly-context-rot-problems-in-legal&quot;&gt;useful piece&lt;/a&gt; this week on what they call &amp;quot;context rot&amp;quot; — the family of failures that occur when a large language model processes more text than it can reliably attend to. Their diagnosis is sharp: LLMs degrade silently on long documents, and the law firm&#39;s traditional quality-assurance architecture is not calibrated to catch the resulting errors. I agree with most of their analysis, but I want to take it further and offer solutions.&lt;/p&gt;
&lt;p&gt;In this post, I explain the mechanics of context windows in terms aimed at the practicing lawyer, and then I propose concrete strategies to work within those constraints.&lt;/p&gt;
&lt;h2&gt;The context window, explained without jargon&lt;/h2&gt;
&lt;p&gt;Every LLM has a context window — the total amount of text it can hold in working memory for a single exchange. That window includes everything: the system instructions that tell the model how to behave, whatever documents you have uploaded or pasted in, the full history of your conversation, and the model&#39;s own response. All of it competes for the same finite space.&lt;/p&gt;
&lt;p&gt;Context windows are measured in tokens, each roughly three-quarters of a word in English. A &amp;quot;200,000-token context window&amp;quot; means roughly 150,000 words across all inputs combined, in a single conversation turn. That sounds enormous until you consider that a single commercial loan agreement can run 80,000 words and a due diligence data room can contain millions. For reference, the Claude system instruction alone — which is necessarily part of every conversation with Claude — can easily run to tens of thousands of tokens.&lt;/p&gt;
&lt;p&gt;The critical point, and the one that most marketing materials omit, is that the &lt;em&gt;advertised&lt;/em&gt; context window and the &lt;em&gt;effective&lt;/em&gt; context window are not the same thing. NVIDIA&#39;s &lt;a href=&quot;https://github.com/NVIDIA/RULER&quot;&gt;RULER benchmark&lt;/a&gt; tested models on the kind of complex reasoning tasks that legal work demands, and found that effective performance sits at roughly &lt;a href=&quot;https://arxiv.org/abs/2404.06654&quot;&gt;50 to 65 percent&lt;/a&gt; of the advertised token limit. A model with a 200,000-token window performs reliably on about 100,000 to 130,000 tokens of actual input. The number on the box is not the number that governs your work.&lt;/p&gt;
&lt;h2&gt;How the degradation works&lt;/h2&gt;
&lt;p&gt;The research literature identifies several distinct failure modes. They are worth understanding individually, because each one suggests a different mitigation strategy.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Positional bias.&lt;/strong&gt; The &lt;a href=&quot;https://arxiv.org/abs/2307.03172&quot;&gt;Stanford &amp;quot;Lost in the Middle&amp;quot; research&lt;/a&gt; (Liu et al., &lt;em&gt;TACL&lt;/em&gt; 2024) demonstrated that LLMs attend most strongly to text at the beginning and end of their input. In multi-document question answering, accuracy dropped by roughly &lt;a href=&quot;https://www.morphllm.com/lost-in-the-middle-llm&quot;&gt;30 percentage points&lt;/a&gt; — from approximately 75% to approximately 45% — when relevant information moved from the first position to the middle of the context. In a 200-page agreement, the provisions that matter most are rarely on page one or page 200.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Volume-dependent reasoning decay.&lt;/strong&gt; &lt;a href=&quot;https://arxiv.org/abs/2510.05381&quot;&gt;Du et al. (2025)&lt;/a&gt; isolated an even more troubling finding: reasoning accuracy degrades as context length increases &lt;em&gt;even when the model has perfect access to all relevant information&lt;/em&gt;. They tested this by padding relevant text with whitespace (minimally distracting filler that should not confuse the model) and observed performance drops of up to 85 percent. The sheer volume of input makes the model a worse reasoner, independent of whether the right answer is present.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Conversation history displacement.&lt;/strong&gt; When a conversation exceeds the context window, something has to go. In most current implementations, including &lt;a href=&quot;https://platform.claude.com/docs/en/build-with-claude/context-windows&quot;&gt;Anthropic&#39;s Claude&lt;/a&gt; and &lt;a href=&quot;https://www.datastudios.org/post/chatgpt-token-limits-and-context-windows-updated-for-all-models-in-2025&quot;&gt;OpenAI&#39;s ChatGPT&lt;/a&gt;, the system preserves the system prompt and truncates the oldest &lt;em&gt;conversation turns&lt;/em&gt; first. Some platforms &lt;a href=&quot;https://anthropic.com/news/context-management&quot;&gt;summarize rather than drop&lt;/a&gt; the earlier exchanges, though that introduces its own fidelity problems. The practical result is the same: the model loses track of what you discussed earlier in the session. The analytical framework you established, the specific issues you flagged, the constraints you set three exchanges ago, all of it becomes inaccessible. In custom or middleware implementations, the system prompt itself may also be at risk, though the major providers now treat it as pinned content.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Compression artifacts.&lt;/strong&gt; Summarizing a document before feeding it to the model, a common workaround for length limitations, introduces its own errors. Compression algorithms often strip language that appears formulaic or repetitive, but legal documents are dense with formulaic language that carries substantive weight. &amp;quot;Subject to,&amp;quot; &amp;quot;notwithstanding the foregoing,&amp;quot; &amp;quot;except as provided in Section K&amp;quot;: these phrases distinguish an absolute obligation from a qualified one. &lt;a href=&quot;https://aclanthology.org/2021.naacl-main.383/&quot;&gt;Pagnoni et al. (&lt;em&gt;NAACL&lt;/em&gt; 2021)&lt;/a&gt; found that over 80 percent of summaries produced by the neural models evaluated contained factual errors, concentrated precisely in conditional and qualifying language. Current models perform better on standard summarization benchmarks, but the specific vulnerability to legal qualifying language persists because it is structural. Compression algorithms are designed to remove redundancy, and legal qualifiers are designed to look redundant while doing essential work.&lt;/p&gt;
&lt;p&gt;These failure modes share a symptom: the output looks complete. It is well-formatted, internally coherent, and confident. Nothing about it signals that a substantial portion of the source material was functionally ignored. That is what distinguishes context rot from the more familiar hallucination problem, and what makes it harder to catch in review.&lt;/p&gt;
&lt;h2&gt;What to do about it&lt;/h2&gt;
&lt;p&gt;What follows are concrete approaches, ordered from simplest to most involved, that any lawyer can implement today.&lt;/p&gt;
&lt;h3&gt;1. One task, one conversation&lt;/h3&gt;
&lt;p&gt;This is probably the single highest-value habit change available to a non-technical user. Every AI conversation accumulates context: your prior messages, the model&#39;s prior responses, uploaded documents, session instructions. As the conversation grows, the model&#39;s effective reasoning capacity shrinks. Old instructions interfere with current tasks. Prior assumptions bleed into new analysis. The context fills with material that was useful ten exchanges ago and is now dead weight, what researchers call &lt;a href=&quot;https://understandingdata.com/posts/context-pollution-recovery/&quot;&gt;context pollution&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The fix is simple: start a new conversation for each discrete task. Do not use the same session to summarize a lease, then draft a demand letter, then review an indemnification clause. Each of those deserves a clean context window, and starting a new conversation is free, while the accuracy cost of a polluted one is invisible until something goes wrong.&lt;/p&gt;
&lt;p&gt;I call this the OTOC rule — one task, one conversation. That&#39;s not to discourage iterative prompting. Iterative refinement of a single work product is still a single task and is an effective use of an LLM. Revising a draft and then pivoting to an unrelated analysis in the same session is two tasks crammed into one window — increasing the risk of context rot.&lt;/p&gt;
&lt;h3&gt;2. Write a durable task specification&lt;/h3&gt;
&lt;p&gt;The OTOC rule creates a practical problem: if every task gets a fresh conversation, you lose the background context the model needs to do good work. The overarching objectives, the governing law, the deal structure, the specific issues you care about — all of that vanishes when you close the session.&lt;/p&gt;
&lt;p&gt;The solution is to write a reusable task specification: a short document (a few hundred words is usually sufficient) that captures the stable context for a project. Think of it as a briefing memo for the model. It should include the matter description, the governing jurisdiction, the relevant parties, the specific analytical framework you want applied, and any constraints or preferences that should carry across sessions.&lt;/p&gt;
&lt;p&gt;You paste this specification at the top of each new conversation, or, even better, preserve it as its own file to attach as input. The model reads it fresh every time, without the accumulated noise of prior exchanges. This is the complement to the OTOC rule: it lets you start clean without starting ignorant. Some tools (Anthropic&#39;s Claude Projects feature, for instance) let you attach persistent instructions to a project workspace that automatically prepopulate every conversation. If your platform supports it, use it.&lt;/p&gt;
&lt;h3&gt;3. Chunk your documents before the model reads them&lt;/h3&gt;
&lt;p&gt;If positional bias causes the model to lose track of middle-document content, and if volume alone degrades reasoning quality, then the logical response is to feed the model smaller, task-relevant segments rather than entire documents.&lt;/p&gt;
&lt;p&gt;For a 200-page credit agreement, do not upload the entire file and ask the model to &amp;quot;review it.&amp;quot; Instead, consider breaking the document into its component sections (representations and warranties, covenants, events of default, definitions, schedules) and submit each section in a separate conversation (applying the OTOC rule) with a targeted question. &amp;quot;Identify all financial covenants in the following section and flag any that use a trailing-twelve-month measurement period&amp;quot; will produce dramatically better results than &amp;quot;review this agreement and summarize the key terms.&amp;quot;&lt;/p&gt;
&lt;p&gt;One important caveat: legal documents are dense with internal cross-references (defined terms, conditions qualified by other sections, carve-outs incorporated by reference). When you chunk, you sever those links. The model analyzing the covenants will not know that a defined term in Article I changes the meaning of a financial ratio test, or that a carve-out in Schedule 3 qualifies an obligation in Section 12. The practical mitigation is to always include the definitions section (or at minimum the relevant defined terms) alongside whatever substantive section you are analyzing.&lt;/p&gt;
&lt;p&gt;Manual chunking is labor-intensive, but the labor is front-loaded and predictable. It converts one unreliable pass over an entire document into multiple reliable passes over bounded sections. The lawyer stitches the analysis back together, which is the level at which human judgment should operate regardless of whether AI is involved. For high-stakes tasks, the benefit of minimizing AI errors through manual chunking far outweighs the burden.&lt;/p&gt;
&lt;h3&gt;4. Use chain-of-thought prompting to structure the model&#39;s reasoning&lt;/h3&gt;
&lt;p&gt;Chain-of-thought prompting means explicitly instructing the model to reason through intermediate steps before reaching a conclusion. Instead of asking &amp;quot;Does Section 7.2 conflict with Schedule B?&amp;quot;, you ask: &amp;quot;First, extract the operative language of Section 7.2 and state its requirements. Then extract the relevant provisions of Schedule B. Then identify any inconsistencies between them. Then state your conclusion.&amp;quot;&lt;/p&gt;
&lt;p&gt;This matters for context management because it forces the model to surface the textual evidence it is relying on before it reasons over that evidence. If the model skips a provision, you will see the gap in the intermediate step, before it gets papered over by a confident-sounding conclusion. Du et al. (2025) found that a simple version of this approach, prompting the model to &lt;a href=&quot;https://arxiv.org/abs/2510.05381&quot;&gt;recite the retrieved evidence before solving the problem&lt;/a&gt;, mitigated much of the performance loss caused by long contexts. The technique works because it forces the model to move relevant information into a high-attention position (the most recent output) before it reasons about it.&lt;/p&gt;
&lt;p&gt;For legal work, chain-of-thought prompting also functions as a transparency mechanism. A model that shows its intermediate reasoning produces work product that a supervising lawyer can actually verify, because the intermediate steps expose the gaps that a polished final conclusion would conceal.&lt;/p&gt;
&lt;h3&gt;5. Place critical information strategically&lt;/h3&gt;
&lt;p&gt;The &amp;quot;Lost in the Middle&amp;quot; research has a direct practical corollary: put the most important content where the model pays the most attention. That means the beginning and end of your input, not the middle.&lt;/p&gt;
&lt;p&gt;If you are asking the model to analyze a specific clause in the context of a larger document section, place the target clause at the top of your prompt, followed by the surrounding context, and then restate the analytical question at the end. If you are using a task specification (Strategy 2), put it at the top. If you have specific instructions about format or analytical framework, repeat them at the bottom. The worst arrangement, and the one most people default to, is pasting a large document and then typing the question at the bottom, burying the analytical instructions in a low-attention position.&lt;/p&gt;
&lt;h3&gt;6. Verify in a separate conversation, not the one that produced the work&lt;/h3&gt;
&lt;p&gt;This follows directly from the OTOC rule. Generation and verification are different tasks, and they belong in different conversations.&lt;/p&gt;
&lt;p&gt;When you ask the model to check its own work in the same session, the entire prior exchange sits in the context window: the assumptions, the omissions, the analytical choices the model made on its first pass. All of it exerts influence on the verification. A model reviewing its own conclusions is structurally biased toward confirming them, the equivalent of asking the same reviewer to read the same draft a second time and expecting fresh insight.&lt;/p&gt;
&lt;p&gt;A de novo review in a fresh conversation eliminates that problem. Paste or upload the relevant source text and the model&#39;s output into a clean session. Ask: &amp;quot;Does this analysis accurately and completely reflect the source material? Identify every section of the source you relied on and quote the language supporting each conclusion.&amp;quot; The new session has no prior commitments pulling it toward agreement. It is structurally analogous to the mid-level reviewing the junior&#39;s draft — fresh eyes on the same source.&lt;/p&gt;
&lt;p&gt;A necessary warning: the model can fabricate quotations even in a clean session. It may generate text that looks like a verbatim extract but is actually a paraphrase, a conflation of multiple provisions, or &lt;a href=&quot;https://hai.stanford.edu/news/hallucinating-law-legal-mistakes-large-language-models-are-pervasive&quot;&gt;an outright invention&lt;/a&gt;. The verification step itself requires verification — you must check the model&#39;s quoted language against the source document. That is additional work, but it is targeted work: instead of re-reading 200 pages looking for problems you do not know to expect, you are checking specific passages the model claims to have relied on. The de novo framing does not eliminate the need for human verification, but it gives you a structurally honest starting point for it.&lt;/p&gt;
&lt;h2&gt;The underlying principle&lt;/h2&gt;
&lt;p&gt;Every strategy above is a variation on a single idea: &lt;em&gt;give the model less to think about, and tell it more precisely what to think about it.&lt;/em&gt; That runs against the grain of how most people use these tools. The natural instinct is to dump everything into the conversation and let the AI sort it out, and the marketing encourages exactly that — &amp;quot;upload your entire contract,&amp;quot; &amp;quot;ask anything about your documents.&amp;quot; The context window numbers are designed to suggest the model can handle it all.&lt;/p&gt;
&lt;p&gt;It can, in the sense that it will produce output. What it cannot do — reliably, on long documents, under token pressure — is produce output accurate enough to stake a client&#39;s interests on. The strategies in this post are all ways of closing that gap: structuring the input so the model&#39;s actual capabilities match the demands of the task. The work is unglamorous — writing briefing documents for a machine, manually splitting PDFs, running the same analysis twice in separate sessions. But it maps directly onto skills lawyers already have. Scoping a task, preparing materials for review, verifying work product against source documents — these are not new professional obligations. They are existing ones, applied to a new tool.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;This post draws on Liu et al., &lt;a href=&quot;https://arxiv.org/abs/2307.03172&quot;&gt;Lost in the Middle: How Language Models Use Long Contexts&lt;/a&gt; (TACL 2024); Du et al., &lt;a href=&quot;https://arxiv.org/abs/2510.05381&quot;&gt;Context Length Alone Hurts LLM Performance Despite Perfect Retrieval&lt;/a&gt; (EMNLP 2025); NVIDIA&#39;s &lt;a href=&quot;https://github.com/NVIDIA/RULER&quot;&gt;RULER benchmark&lt;/a&gt; (2024); and Pagnoni et al., &lt;a href=&quot;https://aclanthology.org/2021.naacl-main.383/&quot;&gt;Understanding Factuality in Abstractive Summarization with FRANK&lt;/a&gt; (NAACL 2021). Anthropic&#39;s &lt;a href=&quot;https://platform.claude.com/docs/en/build-with-claude/context-windows&quot;&gt;context window documentation&lt;/a&gt; and &lt;a href=&quot;https://anthropic.com/news/context-management&quot;&gt;context management guidance&lt;/a&gt; informed the discussion of conversation history displacement. For context on the data-handling and compliance dimensions of AI tool selection, see prior entries in this series on &lt;a href=&quot;/2026-03-20-your-ai-conversations-are-not-confidential/&quot;&gt;consumer-versus-commercial data handling&lt;/a&gt;, &lt;a href=&quot;/2026-03-23-the-api-is-not-a-compliance-strategy/&quot;&gt;API compliance architecture&lt;/a&gt;, and &lt;a href=&quot;/2026-03-27-the-duty-to-inform/&quot;&gt;the duty to counsel clients about AI privilege risks&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>You Probably Have a Duty to Warn Your Clients About ChatGPT</title>
    <link href="https://davidkemp.ai/blog/the-duty-to-inform/" />
    <updated>2026-03-27T00:00:00.000Z</updated>
    <id>https://davidkemp.ai/blog/the-duty-to-inform/</id>
    <content type="html">&lt;p&gt;I have written previously about what &lt;a href=&quot;https://www.courtlistener.com/docket/71872024/27/united-states-v-heppner/&quot;&gt;&lt;em&gt;United States v. Heppner&lt;/em&gt;&lt;/a&gt; &lt;a href=&quot;/2026-03-20-your-ai-conversations-are-not-confidential/&quot;&gt;held and what it got wrong&lt;/a&gt;, and about why &lt;a href=&quot;/2026-03-23-the-api-is-not-a-compliance-strategy/&quot;&gt;moving to an API&lt;/a&gt; does not, by itself, constitute a compliance strategy. This post turns to a different audience: not organizations choosing AI tools, but practicing lawyers whose clients are already using them.&lt;/p&gt;
&lt;p&gt;The core question is straightforward. &lt;em&gt;Heppner&lt;/em&gt; established — on reasoning I have &lt;a href=&quot;/2026-03-20-your-ai-conversations-are-not-confidential/&quot;&gt;criticized&lt;/a&gt; but that is now on the books — that a client who feeds privileged materials into a consumer AI platform may forfeit the privilege over those materials. That is now a known hazard. And when a known hazard exists that threatens the integrity of the attorney-client relationship, existing rules of professional conduct impose obligations on the lawyer — not just the client.&lt;/p&gt;
&lt;p&gt;No ethics rule says &amp;quot;warn your client about ChatGPT.&amp;quot; But the obligation to do something very close to that is already embedded in the structure of Model Rules 1.1, 1.4, and 1.6, and their state counterparts. &lt;em&gt;Heppner&lt;/em&gt; did not create that duty, but it did make the duty impossible to ignore.&lt;/p&gt;
&lt;h2&gt;A brief recap of what &lt;em&gt;Heppner&lt;/em&gt; did&lt;/h2&gt;
&lt;p&gt;I covered the decision in detail in &lt;a href=&quot;/2026-03-20-your-ai-conversations-are-not-confidential/&quot;&gt;this prior post&lt;/a&gt;, so I will keep this short. Bradley Heppner, a criminal defendant, used consumer Claude to analyze his legal exposure and develop defense theories after receiving a grand jury subpoena and learning he was a target of a federal investigation. He did this on his own, without his lawyers&#39; knowledge or direction. Judge Rakoff of the S.D.N.Y. held the resulting documents were protected by neither the attorney-client privilege nor the work product doctrine — because Claude is not a lawyer, because Anthropic&#39;s consumer terms did not support a reasonable expectation of confidentiality, and because counsel had not directed the AI use.&lt;/p&gt;
&lt;p&gt;Two things from the opinion matter for this post. First, Judge Rakoff observed that had counsel &lt;em&gt;directed&lt;/em&gt; Heppner to use Claude, the tool &amp;quot;might arguably be said to have functioned in a manner akin to a highly trained professional who may act as a lawyer&#39;s agent within the protection of the attorney-client privilege&amp;quot; — a reference to the &lt;a href=&quot;https://www.courtlistener.com/opinion/265578/united-states-v-kovel/&quot;&gt;&lt;em&gt;Kovel&lt;/em&gt;&lt;/a&gt; doctrine. That dictum rewards attorney supervision and penalizes its absence. Second, the privilege was lost in part because Heppner&#39;s lawyers never told him — one way or the other — anything about using AI tools in connection with his case.&lt;/p&gt;
&lt;p&gt;The NYSBA&#39;s &lt;a href=&quot;https://nysba.org/loose-ai-prompts-sink-ships-how-heppner-shook-the-legal-community/&quot;&gt;post-&lt;em&gt;Heppner&lt;/em&gt; commentary&lt;/a&gt; drew the practical conclusion quickly: attorneys should &amp;quot;include robust disclaimers and warnings in engagement letters and email signatures alerting clients to the risks of using AI platforms in connection with their legal matters.&amp;quot; That is a reasonable starting point. But I think the duty runs deeper than engagement-letter boilerplate, and that existing ethics rules already require it.&lt;/p&gt;
&lt;h2&gt;The rules that get you there&lt;/h2&gt;
&lt;p&gt;Three Model Rules, read together, create an affirmative obligation to advise clients about AI-related privilege risks — even though none of them mentions AI by name.&lt;/p&gt;
&lt;h3&gt;Competence: Rule 1.1&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://www.americanbar.org/groups/professional_responsibility/publications/model_rules_of_professional_conduct/rule_1_1_competence/&quot;&gt;Model Rule 1.1&lt;/a&gt; requires lawyers to provide competent representation, defined as &amp;quot;the legal knowledge, skill, thoroughness and preparation reasonably necessary for the representation.&amp;quot; Since 2012, &lt;a href=&quot;https://www.americanbar.org/groups/professional_responsibility/publications/model_rules_of_professional_conduct/rule_1_1_competence/comment_on_rule_1_1/&quot;&gt;Comment 8&lt;/a&gt; has specified that competence includes keeping &amp;quot;abreast of changes in the law and its practice, including the benefits and risks associated with relevant technology.&amp;quot; Forty states have now &lt;a href=&quot;https://www.americanbar.org/groups/law_practice/publications/techreport/2024/duty-of-tech-competence/&quot;&gt;adopted this language&lt;/a&gt; or its equivalent.&lt;/p&gt;
&lt;p&gt;After &lt;em&gt;Heppner&lt;/em&gt;, the &amp;quot;relevant technology&amp;quot; a competent lawyer must understand includes consumer AI tools — not how to use them, but how they handle data and what the legal consequences of client use might be. A lawyer who does not know that consumer chatbot terms permit the provider to retain, train on, and disclose user inputs is missing knowledge that is now directly relevant to protecting the privilege. The duty of competence is not limited to a lawyer&#39;s own work product. It encompasses the &amp;quot;thoroughness and preparation&amp;quot; needed to protect the attorney-client relationship from erosion by foreseeable client conduct.&lt;/p&gt;
&lt;h3&gt;Communication: Rule 1.4&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://www.americanbar.org/groups/professional_responsibility/publications/model_rules_of_professional_conduct/rule_1_4_communications/&quot;&gt;Model Rule 1.4(b)&lt;/a&gt; requires that a lawyer &amp;quot;explain a matter to the extent reasonably necessary to permit the client to make informed decisions regarding the representation.&amp;quot; This is generally understood to encompass not just the substance of legal advice but the conditions under which the privilege protecting it might be forfeited. A client who does not know that pasting counsel&#39;s memorandum into ChatGPT may destroy the privilege over that memorandum has not been equipped to make an informed decision about managing privileged information.&lt;/p&gt;
&lt;p&gt;The critical feature of Rule 1.4 is that it operates &lt;em&gt;prospectively&lt;/em&gt;. The duty to communicate is a duty to give clients the information they need before they act — not a post-hoc damage-control obligation. After &lt;em&gt;Heppner&lt;/em&gt;, the relevant information includes the fact that consumer AI use can waive the privilege.&lt;/p&gt;
&lt;h3&gt;Confidentiality: Rule 1.6&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://www.americanbar.org/groups/professional_responsibility/publications/model_rules_of_professional_conduct/rule_1_6_confidentiality_of_information/&quot;&gt;Model Rule 1.6(c)&lt;/a&gt; provides that a lawyer &amp;quot;shall make reasonable efforts to prevent the inadvertent or unauthorized disclosure of, or unauthorized access to, information relating to the representation of a client.&amp;quot; The operative word is &amp;quot;reasonable,&amp;quot; and what counts as reasonable changes as risks become known.&lt;/p&gt;
&lt;p&gt;State bars have interpreted this provision to require affirmative steps — not just reactive ones — when digital communications create confidentiality risks. The principle is not new; all that is new is the specific threat: a client&#39;s use of a consumer AI platform is precisely the kind of inadvertent disclosure that Rule 1.6(c) was designed to address.&lt;/p&gt;
&lt;h3&gt;The state-level picture&lt;/h3&gt;
&lt;p&gt;The ABA&#39;s &lt;a href=&quot;https://www.americanbar.org/content/dam/aba/administrative/professional_responsibility/ethics-opinions/aba-formal-opinion-512.pdf&quot;&gt;Formal Opinion 512&lt;/a&gt;, issued in July 2024, was the first comprehensive ABA guidance on generative AI in legal practice. It addressed competence, confidentiality, communication, candor, supervisory duties, and fees — all through the lens of existing Model Rules applied to AI. Formal Opinion 512 focused primarily on a lawyer&#39;s &lt;em&gt;own&lt;/em&gt; use of AI tools, but its analysis of the confidentiality obligations under Rules 1.6 and 1.4 applies with equal force when the risk comes from the client&#39;s conduct rather than the lawyer&#39;s.&lt;/p&gt;
&lt;p&gt;The New York City Bar&#39;s &lt;a href=&quot;https://www.nycbar.org/reports/formal-opinion-2024-5-generative-ai-in-the-practice-of-law/&quot;&gt;Formal Opinion 2024-5&lt;/a&gt; addressed generative AI in legal practice directly, and &lt;a href=&quot;https://www.nycbar.org/reports/formal-opinion-2025-6-ethical-issues-affecting-use-of-ai-to-record-transcribe-and-summarize-conversations-with-clients/&quot;&gt;Formal Opinion 2025-6&lt;/a&gt; extended the analysis to AI tools used to record and transcribe client conversations — a context in which the duty to counsel clients about confidentiality implications is made explicit. California&#39;s State Bar has published &lt;a href=&quot;https://www.calbar.ca.gov/Portals/0/documents/ethics/Generative-AI-Practical-Guidance.pdf&quot;&gt;practical guidance on generative AI&lt;/a&gt; grounded in the same competence and confidentiality obligations.&lt;/p&gt;
&lt;p&gt;None of these authorities squarely addresses the specific scenario &lt;em&gt;Heppner&lt;/em&gt; presented: a client, acting on his own, feeding privileged materials into a consumer chatbot. But they establish the framework within which that scenario falls. If a lawyer has a duty of technological competence that includes understanding AI data handling, a duty to communicate information necessary for informed decisions about the representation, and a duty to take reasonable steps to prevent inadvertent disclosure — then the obligation to warn a client about the privilege risks of consumer AI use follows from the conjunction of all three.&lt;/p&gt;
&lt;h2&gt;What &amp;quot;reasonable&amp;quot; looks like&lt;/h2&gt;
&lt;p&gt;Not every representation carries the same risk. The obligation to advise clients about AI-related privilege risks should be calibrated — as professional duties always are — to the circumstances.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The nature of the matter.&lt;/strong&gt; A client facing a federal investigation, complex litigation, or a regulatory proceeding is more likely to receive extensive privileged communications and is more acutely harmed by their disclosure. In high-stakes representations, the duty to counsel clients about AI risks should be treated as near-mandatory and documented. Routine advisory work still carries the obligation, but its urgency is proportional to the exposure.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The sophistication of the client.&lt;/strong&gt; Sophisticated institutional clients with in-house counsel may understand the risk without detailed instruction. Individual clients, small business owners, and people facing their first serious legal proceeding probably do not. &lt;em&gt;Heppner&lt;/em&gt; illustrates the gap precisely: the defendant was fluent enough to use Claude effectively but apparently had no appreciation of the legal consequences. Technological fluency and legal sophistication are not the same thing, and lawyers should resist treating them as interchangeable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The attorney&#39;s reasonable belief about client conduct.&lt;/strong&gt; A lawyer who knows or should know that a client is likely to use AI tools in connection with the matter — because the client has mentioned doing so, because the client works in a tech-forward industry, or simply because generative AI has become most people&#39;s first tool for understanding complex documents — bears a heightened responsibility to address the risk explicitly. This is not speculative. Consumer AI adoption has reached the point where assuming a client &lt;em&gt;will not&lt;/em&gt; use these tools requires more justification than assuming they will.&lt;/p&gt;
&lt;p&gt;These factors interact. A sophisticated client in a high-stakes criminal matter presents a different risk profile than a sophisticated client in a routine transaction. An unsophisticated client in any matter of consequence probably requires explicit, plain-language AI counseling as a baseline.&lt;/p&gt;
&lt;h2&gt;The structural remedy worth considering&lt;/h2&gt;
&lt;p&gt;Warning clients not to use consumer AI to understand their legal matters is, as a practical matter, unlikely to be fully effective. The impulse that drove Heppner to Claude is deeply human: complex legal advice is hard to understand, and AI tools offer an immediately accessible way to work through it. Telling clients not to do something genuinely useful — without offering an alternative — is an instruction destined to be ignored.&lt;/p&gt;
&lt;p&gt;The more constructive path is to give clients a safe way to do what they are going to do anyway. Enterprise-grade AI deployments — tools operating under commercial terms that &lt;a href=&quot;/2026-03-20-your-ai-conversations-are-not-confidential/&quot;&gt;contractually prohibit&lt;/a&gt; the provider from retaining or training on user inputs — can be configured within a firm-controlled environment with appropriate confidentiality protections. A client who uses a firm-provided, privilege-preserving AI tool to work through counsel&#39;s advice is in a fundamentally different position than a client who pastes that advice into a consumer chatbot governed by terms that reserve broad data-use rights.&lt;/p&gt;
&lt;p&gt;Judge Rakoff&#39;s &lt;em&gt;Kovel&lt;/em&gt; dictum points in this direction. The court distinguished between unsupervised client use of a public AI platform and a hypothetical in which counsel directed the AI use. A firm-provided, counsel-supervised AI environment — deployed under commercial terms, subject to confidentiality agreements, and offered as part of the representation — positions the tool more like the &lt;em&gt;Kovel&lt;/em&gt; professional the court described than the public chatbot it rejected. The privilege analysis is not guaranteed, but the structural argument is considerably stronger.&lt;/p&gt;
&lt;p&gt;This is not a small undertaking, and I do not suggest it is costless. But the alternative — relying on engagement-letter warnings while clients continue to use consumer AI tools unsupervised — is a posture that grows harder to defend as the risk becomes more widely known.&lt;/p&gt;
&lt;h2&gt;Where this leaves practicing lawyers&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Heppner&lt;/em&gt; did not create a new professional obligation. What it did was train a spotlight on one that already existed. The duty of competence requires understanding how consumer AI tools handle data. The duty of communication requires informing clients about risks to the privilege before those risks materialize. The duty of confidentiality requires reasonable efforts to prevent inadvertent disclosure. Together, these rules establish an obligation — variable in its intensity, sensitive to context, but real — to advise clients about the privilege risks of consumer AI use.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;This post draws on the &lt;a href=&quot;https://www.americanbar.org/groups/professional_responsibility/publications/model_rules_of_professional_conduct/&quot;&gt;ABA Model Rules of Professional Conduct&lt;/a&gt;, &lt;a href=&quot;https://www.americanbar.org/content/dam/aba/administrative/professional_responsibility/ethics-opinions/aba-formal-opinion-512.pdf&quot;&gt;ABA Formal Opinion 512&lt;/a&gt;, the New York City Bar&#39;s Formal Opinions &lt;a href=&quot;https://www.nycbar.org/reports/formal-opinion-2024-5-generative-ai-in-the-practice-of-law/&quot;&gt;2024-5&lt;/a&gt; and &lt;a href=&quot;https://www.nycbar.org/reports/formal-opinion-2025-6-ethical-issues-affecting-use-of-ai-to-record-transcribe-and-summarize-conversations-with-clients/&quot;&gt;2025-6&lt;/a&gt;, the NYSBA&#39;s &lt;a href=&quot;https://nysba.org/loose-ai-prompts-sink-ships-how-heppner-shook-the-legal-community/&quot;&gt;post-Heppner commentary&lt;/a&gt;, and Judge Rakoff&#39;s &lt;a href=&quot;https://www.courtlistener.com/docket/71872024/27/united-states-v-heppner/&quot;&gt;written opinion&lt;/a&gt; in United States v. Heppner. The California State Bar&#39;s &lt;a href=&quot;https://www.calbar.ca.gov/Portals/0/documents/ethics/Generative-AI-Practical-Guidance.pdf&quot;&gt;Generative AI Practical Guidance&lt;/a&gt; provides additional state-level context. The consumer-versus-commercial data-handling comparison referenced throughout is detailed in a &lt;a href=&quot;/2026-03-20-your-ai-conversations-are-not-confidential/&quot;&gt;prior post&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>The API Is Not a Compliance Strategy</title>
    <link href="https://davidkemp.ai/blog/the-api-is-not-a-compliance-strategy/" />
    <updated>2026-03-23T00:00:00.000Z</updated>
    <id>https://davidkemp.ai/blog/the-api-is-not-a-compliance-strategy/</id>
    <content type="html">&lt;p&gt;In my &lt;a href=&quot;/2026-03-20-your-ai-conversations-are-not-confidential/&quot;&gt;last post&lt;/a&gt;, I walked through the consumer-versus-commercial divide in how major LLM providers handle data — and why that divide carries real legal consequences after the Southern District of New York&#39;s decision in &lt;a href=&quot;https://www.courtlistener.com/docket/71872024/united-states-v-heppner/&quot;&gt;&lt;em&gt;United States v. Heppner&lt;/em&gt;&lt;/a&gt;. The takeaway was that consumer AI products operate under terms that were not designed with legal privilege, confidentiality, or regulatory compliance in mind.&lt;/p&gt;
&lt;p&gt;A reasonable follow-up question is: &lt;em&gt;What about the API?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;If the consumer chatbot is the problem, the thinking goes, then switching to API access should be the solution. And there is something to that. API tiers offered by OpenAI, Anthropic, and Google operate under fundamentally different data-handling regimes than their consumer counterparts — regimes that are, by almost every measure, more protective of user data. But &amp;quot;more protective&amp;quot; is not the same thing as &amp;quot;compliant,&amp;quot; and the distinction matters more than many organizations seem to realize.&lt;/p&gt;
&lt;h2&gt;What the API actually changes&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&quot;/2026-03-20-your-ai-conversations-are-not-confidential/&quot;&gt;previous post&lt;/a&gt; compared consumer and commercial tiers in detail for Anthropic&#39;s Claude. The same structural divide exists across providers, and the API sits squarely on the commercial side. Here is what that means in practice.&lt;/p&gt;
&lt;p&gt;Anthropic&#39;s commercial API retains input and output logs for seven days — far shorter than the consumer tier&#39;s retention windows — and does not use customer content for model training. Enterprise accounts can negotiate Zero Data Retention, under which inputs and outputs are processed in real time and not stored at all. OpenAI&#39;s API retains data for 30 days for abuse monitoring but does not use it for model training, and offers Zero Data Retention for eligible endpoints. Google&#39;s Vertex AI operates under a Cloud Data Processing Addendum with contractually defined retention and no training use. In each case, the API provider acts as a data processor rather than a data controller, meaning the customer — not the provider — determines the purposes and means of processing.&lt;/p&gt;
&lt;p&gt;These are meaningful differences. A consumer chatbot conversation may be retained for months or years, used to train future models, and governed by a privacy policy the user never read. An API call, properly configured, may leave no trace on the provider&#39;s systems at all. For anyone whose data-handling concerns begin and end with &amp;quot;I don&#39;t want my inputs in someone else&#39;s training set,&amp;quot; the API is a substantial improvement.&lt;/p&gt;
&lt;p&gt;But regulatory compliance does not begin and end there.&lt;/p&gt;
&lt;h2&gt;Why the API is not enough&lt;/h2&gt;
&lt;p&gt;Every major regulatory framework governing sensitive data — FERPA, HIPAA, state student-privacy laws, professional-conduct rules — imposes obligations that go well beyond what the API&#39;s data-handling defaults can address. The API solves one problem (provider-side data retention and training) while leaving most of the compliance architecture untouched.&lt;/p&gt;
&lt;p&gt;Consider what a framework like HIPAA actually requires. A covered entity processing protected health information through an API must execute a Business Associate Agreement with the provider. That BAA must specify permissible uses and disclosures, require the provider to implement administrative, physical, and technical safeguards, and establish breach-notification obligations. The API&#39;s zero-retention default is a helpful technical control, but it does not substitute for the BAA itself. And the BAA, once signed, typically imposes configuration requirements — specific endpoints, disabled features, audit logging — that the organization must affirmatively implement and maintain.&lt;/p&gt;
&lt;p&gt;FERPA presents a parallel structure. An educational institution using an API to process student education records must establish that the provider qualifies under the &amp;quot;school official&amp;quot; exception, which requires a written agreement specifying the provider&#39;s function, its relationship to the institution&#39;s use of the data, and the institution&#39;s direct control over the data&#39;s use. The API&#39;s default against training on customer data is necessary but not sufficient — the institution still needs the agreement, the access controls, and the governance to ensure that student records do not flow into the API in ways the agreement does not contemplate.&lt;/p&gt;
&lt;p&gt;The pattern repeats across regulatory contexts. State biometric-privacy statutes require informed consent and retention schedules that no API default can satisfy. Professional-conduct rules governing lawyer confidentiality — sharpened considerably by &lt;em&gt;Heppner&lt;/em&gt; — demand not just favorable vendor terms but documented due diligence, competence in evaluating the technology, and ongoing supervisory obligations. An API key does not discharge any of those duties.&lt;/p&gt;
&lt;h2&gt;The architectural gap&lt;/h2&gt;
&lt;p&gt;There is a subtler problem that the &amp;quot;just use the API&amp;quot; approach tends to obscure. When an organization integrates an LLM through an API, the API handles the model-inference layer: data goes in, a response comes back, and the provider&#39;s data-handling policies govern what happens on their end. But most real-world deployments involve considerably more than a single API call.&lt;/p&gt;
&lt;p&gt;Data passes through preprocessing pipelines, prompt templates, logging systems, vector databases, retrieval-augmented generation stores, and output caches — all of which sit on the customer&#39;s side of the line. The API provider&#39;s zero-retention commitment says nothing about what happens in those layers. An organization can use a zero-retention API and still retain every input and output indefinitely in its own infrastructure, expose sensitive data through poorly secured retrieval stores, or inadvertently log protected information in application-level monitoring.&lt;/p&gt;
&lt;p&gt;This is the architectural gap that a provider-side compliance posture cannot close. The API governs data handling at the model layer. Regulatory compliance governs data handling end to end.&lt;/p&gt;
&lt;h2&gt;What &amp;quot;more protective&amp;quot; actually means&lt;/h2&gt;
&lt;p&gt;None of this is an argument against using the API. The data-handling improvements are real, and for many use cases they represent the minimum viable starting point for responsible deployment. An organization that uses the consumer chatbot for work involving sensitive data has a serious problem. An organization that uses the API has a less serious problem — but it still has a problem if the API is the beginning and end of its compliance strategy.&lt;/p&gt;
&lt;p&gt;The useful framing is not &amp;quot;consumer versus API&amp;quot; as a binary compliance decision. It is &amp;quot;API as a necessary but insufficient component of a compliance architecture.&amp;quot; The API provides a defensible data-handling posture at the provider layer. Everything else — the agreements, the access controls, the internal data governance, the training, the monitoring, the documentation — remains the organization&#39;s responsibility.&lt;/p&gt;
&lt;p&gt;For institutions and professionals operating under regulatory constraints, the practical question is not whether to use the API. It is whether you have built the rest of the compliance architecture around it — and whether you can demonstrate that you have if someone asks.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;Provider-specific data-handling policies referenced in this post draw on the same sources cited in the &lt;a href=&quot;/2026-03-20-your-ai-conversations-are-not-confidential/&quot;&gt;previous post&lt;/a&gt;, supplemented by Anthropic&#39;s &lt;a href=&quot;https://privacy.claude.com/en/articles/10458704-how-does-anthropic-protect-the-personal-data-of-claude-users&quot;&gt;Privacy Center&lt;/a&gt;, OpenAI&#39;s &lt;a href=&quot;https://developers.openai.com/api/docs/guides/your-data/&quot;&gt;API data usage documentation&lt;/a&gt;, and Google&#39;s &lt;a href=&quot;https://docs.cloud.google.com/gemini/docs/discover/data-governance&quot;&gt;Vertex AI data governance documentation&lt;/a&gt;. Compliance obligations vary by jurisdiction, regulatory framework, and organizational context. Consult qualified counsel for guidance specific to your situation.&lt;/em&gt;&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Your AI Conversations Are Not Confidential — And a Federal Court Just Said So</title>
    <link href="https://davidkemp.ai/blog/your-ai-conversations-are-not-confidential/" />
    <updated>2026-03-20T00:00:00.000Z</updated>
    <id>https://davidkemp.ai/blog/your-ai-conversations-are-not-confidential/</id>
    <content type="html">&lt;p&gt;On February 10, 2026, Judge Jed Rakoff of the Southern District of New York ruled from the bench in &lt;a href=&quot;https://www.courtlistener.com/docket/71872024/united-states-v-heppner/&quot;&gt;&lt;em&gt;United States v. Heppner&lt;/em&gt;&lt;/a&gt; that documents a criminal defendant generated using the consumer version of Anthropic&#39;s Claude were protected by neither the attorney-client privilege nor the work product doctrine. A week later, he issued a &lt;a href=&quot;https://www.courtlistener.com/docket/71872024/27/united-states-v-heppner/&quot;&gt;written opinion&lt;/a&gt; calling it a matter of &amp;quot;nationwide&amp;quot; first impression.&lt;/p&gt;
&lt;p&gt;I think parts of the court&#39;s reasoning are wrong — or at least underdeveloped — in ways that matter. But the opinion landed on a real problem. Lawyers, clients, and judges are making consequential decisions about AI tools without fully understanding how those tools handle data. &lt;em&gt;Heppner&lt;/em&gt; is worth examining less for the doctrine it announces than for the knowledge gap it reveals.&lt;/p&gt;
&lt;p&gt;This post lays out what happened in &lt;em&gt;Heppner&lt;/em&gt;, explains what I think the opinion gets right and wrong, and then walks through what Anthropic&#39;s data-handling policies actually say across Claude&#39;s consumer and commercial tiers — the very policies the court relied on but did not examine closely. The same structural divide exists across every major LLM provider, and the legal implications extend well beyond this one case.&lt;/p&gt;
&lt;h2&gt;What &lt;em&gt;Heppner&lt;/em&gt; held&lt;/h2&gt;
&lt;p&gt;Bradley Heppner, the founder and former CEO of Beneficient, a financial services company, faces a five-count federal indictment for securities fraud, wire fraud, conspiracy, making false statements to auditors, and falsification of records — charges arising from an alleged scheme to defraud investors in the publicly traded company GWG Holdings through self-dealing transactions involving Beneficient. After receiving a grand jury subpoena and learning he was a target of the investigation, but before his November 2025 arrest, Heppner used the consumer version of Claude to analyze his legal exposure and develop defense theories. When federal agents executed a search warrant at his home, they seized numerous documents and electronic devices. Defense counsel later identified approximately thirty-one of the seized materials as AI-generated documents. The government moved for a ruling that the documents were not privileged; Heppner resisted, invoking attorney-client privilege and the work product doctrine.&lt;/p&gt;
&lt;p&gt;Judge Rakoff rejected both claims on multiple grounds. On privilege, the court articulated three independent reasons for denial:&lt;/p&gt;
&lt;p&gt;First, Claude is not an attorney. It has no law license, owes no fiduciary duties, and cannot form an attorney-client relationship. Privilege requires a &amp;quot;trusting human relationship&amp;quot; with &amp;quot;a licensed professional&amp;quot; — and an AI tool is not one.&lt;/p&gt;
&lt;p&gt;Second, Heppner had no reasonable expectation of confidentiality. The court pointed to Anthropic&#39;s privacy policy, which disclosed that user inputs and outputs could be used for model training and disclosed to third parties, including government authorities.&lt;/p&gt;
&lt;p&gt;Third — which the court acknowledged &amp;quot;perhaps presents a closer call&amp;quot; — Heppner did not communicate with Claude for the purpose of obtaining legal advice from an attorney. Claude&#39;s terms of service disclaim providing legal advice, and Heppner&#39;s lawyers neither directed nor supervised his use of the tool. The court noted that had counsel directed Heppner to use Claude, it might have &amp;quot;functioned in a manner akin to a highly trained professional&amp;quot; who could act within the privilege under the &lt;a href=&quot;https://law.justia.com/cases/federal/appellate-courts/F2/296/918/131265/&quot;&gt;&lt;em&gt;Kovel&lt;/em&gt;&lt;/a&gt; doctrine — but because Heppner acted on his own, the question was whether he intended to obtain legal advice &lt;em&gt;from Claude&lt;/em&gt;, and Claude disclaims providing it.&lt;/p&gt;
&lt;p&gt;On work product, defense counsel conceded that Heppner created the documents &amp;quot;of his own volition&amp;quot; and that the legal team &amp;quot;did not direct&amp;quot; him to use Claude. The court held that materials not prepared by or at the behest of counsel do not qualify as work product — expressly disagreeing with &lt;a href=&quot;https://www.courtlistener.com/opinion/74536637/shih-v-petal-card-inc/&quot;&gt;&lt;em&gt;Shih v. Petal Card, Inc.&lt;/em&gt;&lt;/a&gt;, 565 F. Supp. 3d 557 (S.D.N.Y. 2021), which recognized work product protection for a party&#39;s own litigation-preparation materials regardless of attorney direction.&lt;/p&gt;
&lt;h2&gt;Where I think the reasoning falters&lt;/h2&gt;
&lt;p&gt;The first and third grounds — no attorney-client relationship, no communication for the purpose of obtaining legal advice from an attorney — are each independently sufficient to defeat the privilege claim. An AI tool is not a lawyer, and Heppner was not seeking legal advice from an attorney when he typed queries into Claude. Full stop.&lt;/p&gt;
&lt;p&gt;The work product holding is correct on these facts — defense counsel conceded that Heppner acted without direction — but the court&#39;s reasoning adopted a narrower view of the doctrine than the weight of authority supports. The traditional Second Circuit formulation protects &amp;quot;materials prepared by or at the behest of counsel in anticipation of litigation or for trial,&amp;quot; but the civil analog, Fed. R. Civ. P. 26(b)(3)(A), protects materials prepared &amp;quot;by or for another party or its representative&amp;quot; — language broad enough to cover a party acting on its own initiative. The court&#39;s express rejection of &lt;em&gt;Shih&lt;/em&gt; on this point signals that the question remains open, and future courts should not treat &lt;em&gt;Heppner&lt;/em&gt;&#39;s narrow formulation as settled.&lt;/p&gt;
&lt;p&gt;The confidentiality analysis in the second ground is where things get shaky, and it is the part of the opinion that has generated the most commentary — and the most anxiety.&lt;/p&gt;
&lt;p&gt;Judge Rakoff treated Anthropic&#39;s consumer privacy policy as establishing that Heppner could have &amp;quot;no reasonable expectation of confidentiality&amp;quot; in his AI conversations. But the court&#39;s analysis has significant gaps. The opinion cited an archived version of Anthropic&#39;s privacy policy dated February 2025 — a version that predated the August 2025 consumer terms update giving users the ability to control model training. Because Heppner used Claude in 2025 before his November arrest, his conversations may have been governed by either the old or the new terms depending on when they occurred. The court never asked what version of the terms governed Heppner&#39;s use, whether he had opted out of training, or what his actual settings were. It treated the broadest possible reading of the consumer terms as conclusive without examining what the user actually agreed to or configured.&lt;/p&gt;
&lt;p&gt;This matters because the confidentiality holding — which was not necessary to the result — is the part of the opinion most likely to be cited broadly. And it rests on an incomplete factual record. As the policy comparison below demonstrates, Anthropic&#39;s consumer terms create meaningfully different data-handling regimes depending on whether a user has opted in or out of model training. The court did not grapple with that distinction.&lt;/p&gt;
&lt;p&gt;There is also a subtler problem. The opinion conflates a platform&#39;s contractual &lt;em&gt;permission&lt;/em&gt; to use data with the practical &lt;em&gt;likelihood&lt;/em&gt; that any human will ever see it. Consumer AI privacy policies reserve broad rights, but the actual probability of a specific conversation being reviewed by a person — absent a safety flag or legal process — is vanishingly low. Whether that distinction should matter for privilege purposes is a genuinely hard question. &lt;em&gt;Heppner&lt;/em&gt; does not engage with it.&lt;/p&gt;
&lt;p&gt;None of this means the opinion is unimportant. It is the first federal decision to address AI and privilege head-on, and it will shape how courts and litigants think about these issues going forward. But its broadest holding — that consumer AI use necessarily destroys confidentiality — rests on reasoning that future courts should scrutinize carefully.&lt;/p&gt;
&lt;h2&gt;What the case gets right: a knowledge problem&lt;/h2&gt;
&lt;p&gt;Where &lt;em&gt;Heppner&lt;/em&gt; is most valuable is as a signal. Whatever one thinks of the doctrinal analysis, the case exposes a widespread failure to understand how consumer AI tools handle data. Heppner apparently did not know — or did not care — that his AI conversations were governed by terms that reserved broad data-use rights for the platform provider. His lawyers did not anticipate that their client&#39;s independent AI use would create a discovery problem. And the court itself did not dig into the specific settings or tier the defendant used.&lt;/p&gt;
&lt;p&gt;This is not an isolated failure. Most lawyers I talk to cannot articulate the difference between a consumer and enterprise AI deployment. Most clients do not read privacy policies. And most courts have not yet had to think carefully about how AI data handling intersects with privilege doctrine.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Heppner&lt;/em&gt; should change that — not because its reasoning is airtight, but because it demonstrates what happens when no one in the room understands the technology well enough to ask the right questions.&lt;/p&gt;
&lt;h2&gt;What Anthropic&#39;s policies actually say&lt;/h2&gt;
&lt;p&gt;Since &lt;em&gt;Heppner&lt;/em&gt; turned on Anthropic&#39;s terms, this is the right place to start. I went through Anthropic&#39;s published policies — the &lt;a href=&quot;https://www.anthropic.com/terms&quot;&gt;Consumer Terms of Service&lt;/a&gt;, the &lt;a href=&quot;https://www.anthropic.com/news/expanded-legal-protections-api-improvements&quot;&gt;Commercial Terms of Service&lt;/a&gt;, the &lt;a href=&quot;https://www.anthropic.com/news/updates-to-our-consumer-terms&quot;&gt;Privacy Policy&lt;/a&gt;, and the &lt;a href=&quot;https://privacy.claude.com/en/articles/10458704-how-does-anthropic-protect-the-personal-data-of-claude-users&quot;&gt;Privacy Center&lt;/a&gt; — to compare what Claude&#39;s consumer and commercial tiers actually promise. What follows is a synthesis of that research.&lt;/p&gt;
&lt;h3&gt;The core divide: consumer terms vs. commercial terms&lt;/h3&gt;
&lt;p&gt;Anthropic&#39;s policies split along two fundamental lines: &lt;strong&gt;Consumer Terms&lt;/strong&gt; (Free, Pro, Max) and &lt;strong&gt;Commercial Terms&lt;/strong&gt; (Team, Enterprise, API, Education, Government). This distinction — not the price paid — determines virtually every data right the user holds. The Commercial Terms state explicitly: &amp;quot;Services under these Terms are not for consumer use. Our consumer offerings (e.g., Claude.ai) are governed by our Consumer Terms of Service instead.&amp;quot;&lt;/p&gt;
&lt;p&gt;This means a Pro or Max subscriber paying $20 or $100 per month operates under the same legal framework as a free user. Paying more buys additional model access and features, but it does not change how Anthropic treats your data.&lt;/p&gt;
&lt;h3&gt;Model training: the sharpest divide&lt;/h3&gt;
&lt;p&gt;For &lt;strong&gt;Free, Pro, and Max&lt;/strong&gt; users, Anthropic may use conversations to train its models. In &lt;a href=&quot;https://www.anthropic.com/news/updates-to-our-consumer-terms&quot;&gt;August 2025&lt;/a&gt;, Anthropic updated its consumer terms to give users the ability to control whether their data would be used for model training. Existing users had until October 8, 2025, to accept the new terms and select their preference. The operative contractual language states that Anthropic may use user materials for model training &amp;quot;unless users opt out&amp;quot; — placing the default in Anthropic&#39;s favor — though Anthropic&#39;s own blog post announcing the change described it as &amp;quot;allowing users on Claude Free, Pro, and Max plans to opt-in for data usage,&amp;quot; framing the default in the opposite direction. The tension between the legal text and the public announcement underscores the difficulty of determining any individual user&#39;s training status based on the terms alone. Opting out remains available through Claude&#39;s settings.&lt;/p&gt;
&lt;p&gt;For &lt;strong&gt;Team, Enterprise, API, and Education/Government&lt;/strong&gt; users, Anthropic contractually prohibits itself from training on customer content. The Commercial Terms are unambiguous: &amp;quot;Anthropic may not train models on Customer Content from Services&amp;quot; — with no exceptions and no reliance on user-level toggles.&lt;/p&gt;
&lt;h3&gt;Data retention: a 60× gap&lt;/h3&gt;
&lt;p&gt;Retention periods are directly tied to training status for consumer plans, creating a striking disparity:&lt;/p&gt;
&lt;p&gt;Consumer users who have &lt;strong&gt;opted in&lt;/strong&gt; to training (or failed to opt out) face retention of up to &lt;strong&gt;five years&lt;/strong&gt; for de-identified conversation data. Consumer users who have &lt;strong&gt;opted out&lt;/strong&gt; see their conversations retained for &lt;strong&gt;30 days&lt;/strong&gt; before deletion. In either case, content flagged for safety or policy violations can be retained for up to &lt;strong&gt;seven years&lt;/strong&gt;, regardless of the user&#39;s training preference.&lt;/p&gt;
&lt;p&gt;On the commercial side, &lt;strong&gt;API&lt;/strong&gt; input and output logs are retained for &lt;strong&gt;seven days&lt;/strong&gt;. &lt;strong&gt;Enterprise&lt;/strong&gt; accounts default to &lt;strong&gt;30 days&lt;/strong&gt;, with the option to negotiate &lt;a href=&quot;https://privacy.claude.com/en/articles/8956058-i-have-a-zero-data-retention-agreement-with-anthropic-what-products-does-it-apply-to&quot;&gt;Zero Data Retention&lt;/a&gt; — under which inputs and outputs are processed in real time and not stored at all. No consumer plan, regardless of price, offers true zero retention.&lt;/p&gt;
&lt;h3&gt;Data ownership and IP&lt;/h3&gt;
&lt;p&gt;The Commercial Terms contain an unusually strong ownership clause absent from the consumer terms. They provide that the customer &amp;quot;retains all rights to its Inputs, and owns its Outputs,&amp;quot; that &amp;quot;Anthropic disclaims any rights it receives to the Customer Content under these Terms,&amp;quot; and that Anthropic &amp;quot;hereby assigns to Customer its right, title and interest (if any) in and to Outputs.&amp;quot;&lt;/p&gt;
&lt;p&gt;Consumer users have no equivalent contractual assignment. Under the consumer framework, Anthropic holds a license to use inputs and outputs for model improvement unless the user opts out.&lt;/p&gt;
&lt;h3&gt;Data controller vs. data processor&lt;/h3&gt;
&lt;p&gt;This distinction carries significant weight under GDPR and analogous privacy regimes. For &lt;strong&gt;consumer plans&lt;/strong&gt;, Anthropic acts as the &lt;strong&gt;data controller&lt;/strong&gt; — it determines the purposes and means of processing user data. For &lt;strong&gt;Enterprise and API&lt;/strong&gt; accounts, Anthropic functions as a &lt;strong&gt;data processor&lt;/strong&gt; operating under a Data Processing Addendum, with the commercial customer serving as the controller.&lt;/p&gt;
&lt;p&gt;The practical consequence: a consumer user&#39;s data is governed by Anthropic&#39;s privacy choices. An enterprise customer&#39;s data is governed by the customer&#39;s own policies, with Anthropic acting under instruction.&lt;/p&gt;
&lt;h3&gt;Employee access and confidentiality&lt;/h3&gt;
&lt;p&gt;For consumer plans, Anthropic employees may access conversations only if the user explicitly consents via feedback, or if access is required for Usage Policy enforcement — in which case only the Trust &amp;amp; Safety team may view content on a need-to-know basis.&lt;/p&gt;
&lt;p&gt;For commercial plans, customer content is contractually designated as &lt;strong&gt;Confidential Information&lt;/strong&gt; under the Commercial Terms. Anthropic may use it only to exercise its rights under the contract and must protect it with at least the same care it applies to its own confidential information.&lt;/p&gt;
&lt;p&gt;Two further protections — Zero Data Retention and HIPAA Business Associate Agreements — are available exclusively on commercial tiers. Under ZDR, inputs and outputs are not stored; the sole exception is User Safety classifier results retained for Usage Policy enforcement. A BAA imposes specific configuration requirements and excludes certain features (web search, for instance, falls outside BAA coverage). Neither protection is available on any consumer plan at any price point.&lt;/p&gt;
&lt;p&gt;The comparison distills to a structural reality: consumer Claude users — whether free or paying $100 per month — operate under terms that allow Anthropic to train on their data by default, retain it for up to five years, and act as the data controller with broad discretion. Commercial Claude users operate under a contractual regime that prohibits model training, treats their content as confidential information, assigns them ownership of outputs, and offers zero-retention options.&lt;/p&gt;
&lt;h2&gt;The pattern holds across providers&lt;/h2&gt;
&lt;p&gt;Anthropic&#39;s tiered structure is not an outlier. OpenAI&#39;s ChatGPT follows the same pattern. On Free and Plus plans, OpenAI&#39;s &lt;a href=&quot;https://help.openai.com/en/articles/7039943-data-usage-for-consumer-services-faq&quot;&gt;Data Usage for Consumer Services FAQ&lt;/a&gt; states that it &amp;quot;may use&amp;quot; consumer content to improve its models unless the user disables training — while retaining the right to log interactions for safety and abuse monitoring regardless. On &lt;a href=&quot;https://help.openai.com/en/articles/9377311-chatgpt-edu-at-openai&quot;&gt;Edu and Enterprise&lt;/a&gt; plans, OpenAI commits not to train on business data, provides admin-controlled retention windows, and offers &lt;a href=&quot;https://developers.openai.com/api/docs/guides/your-data/&quot;&gt;Zero Data Retention and configurable data residency&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The structural divide is the same: consumer terms grant the provider broad data-use rights with an opt-out toggle; commercial terms prohibit model training by contract and give the customer control over retention, residency, and access. Google&#39;s Gemini, Meta&#39;s Llama-based offerings, and other major LLM providers follow similar patterns. The consumer-versus-commercial distinction is an industry-wide architectural choice, not a quirk of any single provider.&lt;/p&gt;
&lt;p&gt;This matters for the &lt;em&gt;Heppner&lt;/em&gt; analysis because the court&#39;s reasoning — resting on the provider&#39;s privacy policy and terms of service — would apply with equal force to any consumer LLM deployment, not just Claude.&lt;/p&gt;
&lt;h2&gt;What this means going forward&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Heppner&lt;/em&gt; will be cited for the proposition that consumer AI conversations are not confidential. That proposition is probably too broad as stated — it ignores user training preferences, conflates contractual permission with practical disclosure risk, and was not necessary to the holding. But it captures something real: consumer AI platforms operate under terms that were not designed with legal privilege in mind, and users who rely on those platforms for sensitive work are taking risks they may not understand.&lt;/p&gt;
&lt;p&gt;The practical response is not to avoid AI tools. It is to understand what you are agreeing to when you use them — and to recognize that paying for a subscription does not, by itself, change the legal framework governing your data. For lawyers, that means learning the difference between consumer and commercial deployments and advising clients accordingly. For organizations, it means treating AI procurement as a legal risk question, not just an IT question. And for courts, it means doing the factual work that &lt;em&gt;Heppner&lt;/em&gt; did not: examining the specific terms, settings, and tier a user actually employed before concluding that confidentiality has been waived.&lt;/p&gt;
&lt;p&gt;The gap between consumer and commercial AI products is wide, it is well-documented, and it is consistent across every major provider. The problem is not that the information is unavailable. The problem is that almost nobody — lawyers, clients, and judges included — reads it.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;The Anthropic policy comparison in this post draws on Anthropic&#39;s &lt;a href=&quot;https://www.anthropic.com/terms&quot;&gt;Consumer Terms of Service&lt;/a&gt;, &lt;a href=&quot;https://www.anthropic.com/news/expanded-legal-protections-api-improvements&quot;&gt;Commercial Terms announcement&lt;/a&gt;, &lt;a href=&quot;https://www.anthropic.com/news/updates-to-our-consumer-terms&quot;&gt;consumer terms and privacy policy update&lt;/a&gt;, and &lt;a href=&quot;https://privacy.claude.com/en/articles/10458704-how-does-anthropic-protect-the-personal-data-of-claude-users&quot;&gt;Privacy Center&lt;/a&gt;. OpenAI policy references draw on the &lt;a href=&quot;https://help.openai.com/en/articles/7039943-data-usage-for-consumer-services-faq&quot;&gt;Data Usage FAQ&lt;/a&gt;, &lt;a href=&quot;https://developers.openai.com/api/docs/guides/your-data/&quot;&gt;platform documentation&lt;/a&gt;, and &lt;a href=&quot;https://openai.com/policies/row-privacy-policy/&quot;&gt;privacy policy&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
</content>
  </entry>
</feed>
