LLM and Agent “Leaks” Are Not Edge Cases

Christina Richmond
Apr 8
4 min read

Updated: Apr 9

They Are Design Signals

Over the past year, a series of so-called “leaks” involving large language models (LLMs) and emerging agentic systems have captured industry attention. The most cited example is the exposure of system prompts and behavioral scaffolding behind models like Claude from Anthropic, alongside similar disclosures affecting models from OpenAI.

These events have often been framed as isolated incidents or, alternatively, dismissed as overblown artifacts of jailbreak culture. Both interpretations miss the point.

What we are seeing is neither a traditional breach nor a trivial curiosity. It is something more foundational:

The control mechanisms governing LLMs and agents are inherently observable, influenceable, and probabilistic.

This is not a bug. It is a property of the system.

What Actually “Leaked” and Why It Matters

In the case of Anthropic’s Claude models, a 59.8MB JavaScript source map file in its Claude Code v2 1.88 exposed 1,906 files of unobfuscated code, allowing developers to see the complete agent harness and workflow. The leak provided a roadmap of Anthropic's proprietary tools to competitors and created potential vulnerabilities for exploitation.

Additionally, researchers and users were able to extract elements of the system prompt, including:

Constitutional AI principles guiding responses
Safety and refusal logic
Tone, persona, and escalation instructions

This was not achieved through infrastructure compromise, but through interaction.

Carefully constructed prompts, recursive queries, and tool-mediated workflows surfaced what was intended to remain hidden.

Similar dynamics have been observed across other leading models. System prompts, tool policies, and behavioral constraints are not sealed. They can be inferred, reconstructed, and in some cases directly elicited.

The implication is straightforward but profound:

System prompts are not a security boundary.

They are part of the attack surface.

From Models to Agents: Expanding the Exposure Layer

If prompt exposure were the full story, the industry could treat this as a manageable transparency issue. The real shift emerges as organizations move from standalone models to agentic systems.

Agents introduce:

Tool access (APIs, databases, SaaS platforms)
Memory (persistent or session-based)
Autonomy (multi-step reasoning and execution)

In this context, prompt injection is no longer just a way to manipulate output. It becomes a mechanism for action and exfiltration.

We are already seeing credible demonstrations of:

Malicious content embedded in documents or web pages altering agent behavior
Retrieval-augmented generation (RAG) pipelines surfacing sensitive data that can then be extracted
Agents being induced to call external tools with unintended parameters or data

The critical issue is that the model does not reliably distinguish between:

Trusted system instruction
Retrieved content
Adversarial input

To the model, these are all tokens in a sequence. The burden of separation falls on architecture, not the model itself.

The Deeper Pattern: A Soft Control Plane

Across Anthropic, OpenAI, and the broader ecosystem of agent frameworks, a consistent pattern is emerging.

Control is Textual, Not Enforced

System behavior is governed by instructions written in natural language. These instructions are interpreted, not enforced.

Boundaries Are Probabilistic

Safety, policy, and task constraints are applied with high likelihood, not certainty. Under pressure, they can degrade.

Context Is Ambiguous

The model does not possess a native mechanism to assign trust levels to different inputs within its context window.

Taken together, this creates what can be described as a soft control plane. It is effective under normal conditions, but susceptible to manipulation under adversarial ones.

Why the Anthropic Case Resonates

Anthropic’s approach is particularly instructive because of its emphasis on Constitutional AI (see "What is Constitutional AI?" blog). By encoding principles and behavioral guidelines directly into the system prompt, the company has made its alignment strategy more explicit than most.

When those prompts are exposed, what becomes visible is not just implementation detail, but philosophy in action.

This has two effects:

It demystifies how alignment is operationalized
It reveals the limits of that approach under adversarial interaction

In this sense, the “leak” functions less as a failure of secrecy and more as a window into the current state of AI control systems.

Implications for Security and Risk Leaders

For CISOs and security leaders, the takeaway is not that LLMs are unsafe. It is that they must be understood on their own terms.

LLMs and Agents Are Influenceable Systems

They can be guided, steered, and in some cases manipulated through input alone. This places them closer to human operators than to deterministic software components.

Traditional Boundaries Do Not Apply

You cannot rely on hidden prompts, internal policies, or model alignment as hard controls. These are advisory layers, not enforcement mechanisms.

Agentic Architectures Increase Blast Radius

Once a model can take action, access data, or invoke tools, the consequences of manipulation expand from “incorrect answer” to “material impact.”

Aligning to a Probabilistic Security Model

These dynamics reinforce a broader shift already underway in security:

From deterministic control to probabilistic risk management.

In practical terms, this means:

Assuming that prompt injection will occur
Designing systems that constrain what an agent can do, not just what it is told
Implementing verification layers around high-impact actions
Maintaining human oversight where context and judgment are required

This is not a temporary phase. It is the operating model for AI-enabled systems.

The Path Forward: From Trust to Governance

The industry often frames AI adoption in terms of trust. Trust in the model, the vendor, or the outputs.

A more useful framing is governance.

What can the system access?
What actions can it take?
How are those actions monitored and validated?
Where does human authority intervene?

As organizations move toward agentic AI, these questions become central.

The Gist

The recent wave of LLM and agent “leaks” is not a series of isolated events. It is a signal.

The mechanisms we use to control AI systems are visible, influenceable, and inherently probabilistic.

As a result:

Security strategies must shift from concealment to containment
From preventing manipulation to managing its impact
From trusting the model to governing the system

The organizations that internalize this distinction early will be better positioned to harness AI’s capabilities without inheriting disproportionate risk.

LLM and Agent “Leaks” Are Not Edge Cases

They Are Design Signals

What Actually “Leaked” and Why It Matters

From Models to Agents: Expanding the Exposure Layer

The Deeper Pattern: A Soft Control Plane

Control is Textual, Not Enforced

Boundaries Are Probabilistic

Context Is Ambiguous

Why the Anthropic Case Resonates

Implications for Security and Risk Leaders

LLMs and Agents Are Influenceable Systems

Traditional Boundaries Do Not Apply

Agentic Architectures Increase Blast Radius

Aligning to a Probabilistic Security Model

The Path Forward: From Trust to Governance

The Gist

Comments

Contact

Links