Dispatch № 43 · AI security & governance · 8 min read

Adversarial Distillation Explained.

How AI models get cloned through their own API, and the AI governance that stops it.

By Nabeel K. · 25 Jun 2026

Fig. 1 · Adversarial distillation, drawn as a game. The defender never loses the vault. The attacker reconstructs the reasoning one query at a time.

On June 10, 2026, Anthropic sent a letter to Senators Tim Scott and Elizabeth Warren and to White House officials. It accused operators affiliated with Alibaba and its Qwen AI lab of running the largest known adversarial distillation campaign on record against its Claude models. Between April 22 and June 5, nearly 25,000 fraudulent accounts generated more than 28.8 million interactions, all aimed at Claude's strengths in software engineering and agentic reasoning.

Most of the coverage treated this as a corporate espionage story. I want to treat it as a teaching case, because adversarial distillation is the AI model security risk most teams do not understand, and the defense against it is a governance problem that most teams have not built for. Let us fix both, in order.

The misunderstood risk

What is adversarial distillation?

Adversarial distillation is a technique for copying the behavior of a large, expensive AI model into a smaller, cheaper one, without ever touching the original model's weights or source code. The attacker does not break into the vault. The attacker asks the model enough carefully designed questions that its reasoning can be reconstructed from the outside and used to train a rival.

Here is the common misconception worth correcting. When people hear that a model was copied, they assume the weights were stolen, the way you might steal a file. That is not what happened in the Anthropic case, and that is exactly why it matters. The weights were never touched. The intelligence was reconstructed through ordinary API access, one answer at a time.

The mechanism

How adversarial distillation works, step by step.

Think of it this way. Imagine a master craftsman who will never hand over his blueprints. So you hire him for thousands of small jobs, and with each one you record which tool he reaches for, how he corrects a mistake, and the order in which he works. After enough jobs you do not have his blueprints. You have his judgment, captured in writing, which is enough to train an apprentice to imitate him at a fraction of the cost.

In machine learning terms, that craftsman is the teacher model and the apprentice is the student model. The process runs in four stages.

Stage 1

Query design. Prompts engineered to make the target reveal not just answers but the path to them: reason step by step, show your work, critique your own output, handle the hard edge cases.

Reasoning exposed

Stage 2

Reasoning capture. Every response is stored. The valuable signal is not the final answer; it is the chain of thought, the way the model decomposes a problem and recovers from errors.

Signal harvested

Stage 3

Dataset curation. The captured interactions are cleaned and structured into a training set. Black-box distillation needs nothing but an API key and patience, which is what makes it so hard to stop.

Curriculum built

Stage 4

Student training. The smaller model is fine-tuned on that curated set until it imitates the teacher's behavior, reasoning like the original at a fraction of the development cost.

Clone produced

Fig. 2 · The four stages of black-box adversarial distillation. None of them require touching the weights. All of them ride on ordinary API access.
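To make the four stages concrete, here is a deliberately tiny sketch. It is a toy under loud assumptions: the "teacher" is a hidden two-variable rule rather than a language model, the "student" is a perceptron, and every name (`teacher`, `harvest`, `train_student`, `agreement`) is hypothetical. The only point it illustrates is the black-box property: the student is trained on nothing but query and response pairs.

```python
import random

def teacher(x: float, y: float) -> int:
    """Stand-in for the target model. Callers see only the label, never the rule."""
    return 1 if 2.0 * x - 1.0 * y + 0.5 > 0 else 0

def harvest(n: int, rng: random.Random) -> list[tuple[float, float, int]]:
    """Stages 1 to 3: issue queries, capture each response, keep them as a dataset."""
    rows = []
    for _ in range(n):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        rows.append((x, y, teacher(x, y)))
    return rows

def train_student(rows, epochs: int = 50, lr: float = 0.1) -> tuple[float, float, float]:
    """Stage 4: fit a perceptron to the captured labels only."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        for x, y, label in rows:
            pred = 1 if w1 * x + w2 * y + b > 0 else 0
            err = label - pred  # 0 when the student already agrees with the teacher
            w1 += lr * err * x
            w2 += lr * err * y
            b += lr * err
    return w1, w2, b

def agreement(student, n: int, rng: random.Random) -> float:
    """Share of fresh inputs on which student and teacher give the same label."""
    w1, w2, b = student
    hits = 0
    for _ in range(n):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        pred = 1 if w1 * x + w2 * y + b > 0 else 0
        hits += pred == teacher(x, y)
    return hits / n

if __name__ == "__main__":
    rng = random.Random(0)
    student = train_student(harvest(500, rng))
    print(f"agreement on unseen inputs: {agreement(student, 2000, rng):.3f}")
```

The student never reads the coefficients inside `teacher`, yet it ends up agreeing with it on most unseen inputs. A real campaign differs in every detail of scale and technique, but the shape is the same: queries in, responses out, imitation trained on the pairs.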

There is one more step that turns a competitive problem into a security problem. Safety alignment, the guardrails that stop a model from writing malware or weaponizable instructions, is not a natural property of intelligence. It is deliberate, expensive engineering layered on top of capability. Distillation copies the capability. It does not copy the conscience. A clone trained on stolen reasoning can arrive with the reasoning intact and the restraint stripped out. An unaligned copy of a frontier model is not a cheaper product. It is a capable system with the brakes removed.

The scale of it

Distillation at industrial scale.

The Anthropic case shows what this looks like when it is run as an operation rather than an experiment. Anthropic's own words were blunt. These attacks, it wrote, are "carried out illicitly, systematically, and at an industrial scale to harvest U.S. AI capabilities across frontier labs and repackage them as their own without incurring the training and R&D costs required to train U.S. frontier models."

It is also not isolated. In February, Anthropic flagged DeepSeek, MiniMax, and Moonshot AI for similar campaigns that together produced more than 16 million interactions through roughly 24,000 fraudulent accounts. The Alibaba-linked operation reportedly exceeded all three combined. A behavior that repeats at industrial scale is not a run of bad luck. It is a strategy, and the economics drive it. Training a frontier model is the most expensive bet in technology. Distillation lets a competitor skip the bet and absorb the result through an API.
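It helps to turn the reported totals into a per-account load. The sketch below uses only the figures quoted in this article and treats them as approximate, since the coverage says "nearly" 25,000 accounts and "more than" 28.8 million interactions. The function name `per_account_load` is my own, and the calculation assumes an even spread across accounts and days, which the reporting does not establish.

```python
from datetime import date

def per_account_load(interactions: int, accounts: int, start: date, end: date) -> dict:
    """Average interactions per account, overall and per day, over an inclusive window."""
    days = (end - start).days + 1  # count both the first and the last day
    per_account = interactions / accounts
    return {
        "days": days,
        "per_account": per_account,
        "per_account_per_day": per_account / days,
    }

if __name__ == "__main__":
    # Reported figures for the April 22 to June 5, 2026 campaign (approximate).
    load = per_account_load(28_800_000, 25_000, date(2026, 4, 22), date(2026, 6, 5))
    print(load)  # 45 days, 1152.0 per account, 25.6 per account per day
```

On those assumptions the average account made about 1,152 requests over 45 days, roughly 26 a day. That is a number most per-account thresholds would wave through. The signal sits in the aggregate across accounts, not in any single one.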

The reframing

Why this is a governance problem, not a model problem.

Here is the reframing that changes what you do on Monday morning. For most of the last decade, AI governance was written as if the danger lived inside the model: bias, hallucination, and misuse by the person typing the prompt. The frameworks we still cite carry that assumption. The EU AI Act concerns itself with how high-risk systems behave. The United States, having moved past SR 11-7 toward guidance built for generative and agentic systems, still largely regulates the model and its outputs.

The Anthropic case is evidence that the perimeter has moved. Nearly twenty-five thousand fraudulent accounts is not a model-safety failure. It is an access-control failure. The defenses that would have mattered were identity, rate limiting, geographic policy, and anomaly detection on the API. The model behaved exactly as designed. The boundary around it did not hold. AI model governance is not a property of the model. It is a property of the runtime around it.

The control plane

How to govern an AI model against distillation.

If you serve a model through an API, the controls that protect it are the same controls that make any high-value system defensible. This is the practical core, and it is the framework I implement with clients.

Identity and know-your-customer on access. Anonymous accounts at scale are the precondition for every distillation campaign. Verified identity on API access turns 25,000 throwaway accounts from ordinary traffic into a reportable anomaly.
Rate limiting and volume anomaly detection. Distillation requires volume. Per-account and per-entity limits, plus alerting when query volume or diversity spikes, catch the harvesting pattern before millions of interactions accumulate.
Extraction-pattern detection. Distillation traffic does not look like normal usage. It is unusually broad, unusually systematic, and unusually focused on reasoning exposure. Behavioral detection that flags this signature is the difference between noticing in week one and learning about it in a letter to your board.
Geographic and jurisdictional policy, enforced at the boundary. A geographic restriction that can be bypassed with fraudulent accounts is a policy on paper, not a control. Enforcement has to live in the access layer.
Output governance. Limiting how much verbose internal reasoning a model exposes by default, and watermarking outputs where feasible, reduces the quality of the signal an attacker can capture.
An audit trail that can answer the only question that matters. When the regulator or the board asks, you have to be able to prove who used the model, how much, from where, and whether the pattern made sense. Logging that cannot answer that question is not governance. It is record-keeping.

The test

How to know whether your AI governance would actually hold.

Most organizations have an AI governance policy. Very few have tested whether it survives contact with a determined adversary. Those are different things. A policy describes intended behavior. A test reveals actual behavior. The gap between them is where the 28.8 million interactions happened.
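One way to close that gap is to replay synthetic traffic through a detector and check what it does. The sketch below is a toy harness under stated assumptions: the markers come from the Stage 1 phrasing in this article, the thresholds are placeholders, the `topic` field is assumed to come from some upstream classifier, and every name here is hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Request:
    prompt: str
    topic: str  # assumed label from an upstream classifier (hypothetical)

# Reasoning-elicitation phrases taken from the Stage 1 description.
MARKERS = ("step by step", "show your work", "critique your own output")

def looks_like_extraction(requests: list[Request], min_volume: int = 100,
                          min_marker_rate: float = 0.5, min_topics: int = 20) -> bool:
    """Flag traffic that is high-volume, broad, and focused on exposing reasoning."""
    if len(requests) < min_volume:
        return False
    marked = sum(any(m in r.prompt.lower() for m in MARKERS) for r in requests)
    topics = {r.topic for r in requests}
    return marked / len(requests) >= min_marker_rate and len(topics) >= min_topics

def synthetic_harvest(n: int = 200, n_topics: int = 40) -> list[Request]:
    """Broad, systematic traffic in which every prompt asks for the reasoning."""
    return [Request(f"Solve task {i}. Reason step by step and show your work.",
                    f"topic-{i % n_topics}") for i in range(n)]

def synthetic_benign(n: int = 200) -> list[Request]:
    """Narrow traffic in which only one prompt in twenty asks for reasoning."""
    topics = ("billing", "sql", "email")
    return [Request(("Explain step by step: " if i % 20 == 0 else "") + f"Help with item {i}",
                    topics[i % 3]) for i in range(n)]

if __name__ == "__main__":
    print("harvest flagged:", looks_like_extraction(synthetic_harvest()))
    print("benign flagged:", looks_like_extraction(synthetic_benign()))
```

A keyword check like this is trivially evaded by rephrasing, so treat it as a harness shape rather than a detector. The useful habit is the pair of assertions: the control fires on the pattern you fear and stays quiet on the traffic you want.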

In practice

This is the work I do. I assess the governance around an AI model the way an attacker would approach it, then I help implement the control plane that closes the gaps. In practice that means two engagements. The first is an AI governance assessment, a structured red-team of your model's access controls, monitoring, and audit posture, ending in a prioritized findings report. The second is implementation, where we build the identity, rate-limiting, detection, and audit layer that makes distillation visible and accountable rather than silent and free. If you are shipping a model, that assessment is worth running before someone else runs your model for you.

What comes next

The future of AI governance.

So what does this mean for the next two years of AI regulation? Governance is migrating from the data science team to the security team, and the rules will follow the same path. Expect export controls to extend from chips to model access. Expect know-your-customer obligations on frontier APIs, so that anonymous accounts from a restricted region become a reportable event rather than ordinary traffic. Expect the audit question of the next cycle to change from "is your model safe" to "can you prove who used it, how much, and whether the pattern made sense." The institutions that win the next decade of AI will not be the ones with the most capable model. They will be the ones who can account for every interaction with it.
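That audit question is answerable only if the log carries the right fields. As a minimal sketch, assuming each logged interaction records a verified entity, an account, a country, and a day (the `AuditRecord` schema and `usage_by_entity` are illustrative names of mine, not a standard), the summary an auditor would ask for looks like this:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class AuditRecord:
    entity_id: str   # verified customer behind the account (assumed to exist)
    account_id: str
    country: str
    day: str         # ISO date, for example "2026-05-01"

def usage_by_entity(records: list[AuditRecord]) -> dict[str, dict]:
    """Summarize who used the model, how much, from where, and over how many days."""
    acc = defaultdict(lambda: {"interactions": 0, "accounts": set(),
                               "countries": set(), "days": set()})
    for r in records:
        row = acc[r.entity_id]
        row["interactions"] += 1
        row["accounts"].add(r.account_id)
        row["countries"].add(r.country)
        row["days"].add(r.day)
    return {
        entity: {
            "interactions": row["interactions"],
            "accounts": len(row["accounts"]),
            "countries": sorted(row["countries"]),
            "active_days": len(row["days"]),
        }
        for entity, row in acc.items()
    }

if __name__ == "__main__":
    log = [
        AuditRecord("e1", "a1", "US", "2026-05-01"),
        AuditRecord("e1", "a2", "DE", "2026-05-01"),
        AuditRecord("e1", "a1", "US", "2026-05-02"),
        AuditRecord("e2", "b1", "US", "2026-05-01"),
    ]
    for entity, summary in usage_by_entity(log).items():
        print(entity, summary)
```

Whether a given pattern "made sense" is still a human judgment. The summary only makes that judgment possible, and a log that lacks the entity field cannot produce even this much.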

The lesson of the Anthropic case is simple and uncomfortable. In the age of adversarial distillation, building a trustworthy model is no longer enough. You have to be able to defend it, account for every interaction with it, and prove that the intelligence you released is still serving the purpose you gave it. The model was never the whole product. The governed boundary around it always was.

If you operate an AI model, the question is no longer only whether it is aligned. It is whether you would notice 28.8 million interactions quietly mapping its mind. If you are not certain of the answer, that is precisely the gap worth testing. Message me, and let us find out whether your AI governance would hold.

Sources

Reporting on Anthropic's June 10, 2026 letter to Senators Tim Scott and Elizabeth Warren and to White House officials: Crypto Briefing, CybersecurityNews, Global Banking & Finance, Stocktwits, AI Weekly, and Devdiscourse. Verified 2026-06-25. Figures cited (nearly 25,000 fraudulent accounts; more than 28.8 million interactions between April 22 and June 5; the February campaigns attributed to DeepSeek, MiniMax, and Moonshot AI) are drawn from that coverage.

Written by

Nabeel K.

Enterprise AI architect and governance advisor. Founder of Simplification and Director, Solutions Architect at iSystematic, advising regulated enterprises on governed production AI: model security, access governance, LLMOps, and AI governance.

© 2026 Nabeel K. protected. The article "Adversarial Distillation Explained: How AI Models Get Cloned, and the AI Governance That Stops It" is the intellectual property of the author and is licensed under CC BY-NC-ND 4.0, Attribution-NonCommercial-NoDerivatives. You may share it with credit to the author; you may not sell it or distribute modified versions. The frameworks and methods described here are the intellectual property of Nabeel A. Khan.