LLM and Agent “Leaks” Are Not Edge Cases
- Christina Richmond
- 2 days ago
- 4 min read
Updated: 2 days ago
They Are Design Signals
Over the past year, a series of so-called “leaks” involving large language models (LLMs) and emerging agentic systems have captured industry attention. The most cited example is the exposure of system prompts and behavioral scaffolding behind models like Claude from Anthropic, alongside similar disclosures affecting models from OpenAI.
These events have often been framed as isolated incidents or, alternatively, dismissed as overblown artifacts of jailbreak culture. Both interpretations miss the point.
What we are seeing is neither a traditional breach nor a trivial curiosity. It is something more foundational:
The control mechanisms governing LLMs and agents are inherently observable, influenceable, and probabilistic.
This is not a bug. It is a property of the system.
What Actually “Leaked” and Why It Matters
In the case of Anthropic’s Claude models, a 59.8MB JavaScript source map file in its Claude Code v2 1.88 exposed 1,906 files of unobfuscated code, allowing developers to see the complete agent harness and workflow. The leak provided a roadmap of Anthropic's proprietary tools to competitors and created potential vulnerabilities for exploitation.
Additionally, researchers and users were able to extract elements of the system prompt, including:
Constitutional AI principles guiding responses
Safety and refusal logic
Tone, persona, and escalation instructions
This was not achieved through infrastructure compromise, but through interaction.
Carefully constructed prompts, recursive queries, and tool-mediated workflows surfaced what was intended to remain hidden.
Similar dynamics have been observed across other leading models. System prompts, tool policies, and behavioral constraints are not sealed. They can be inferred, reconstructed, and in some cases directly elicited.
The implication is straightforward but profound:
System prompts are not a security boundary.
They are part of the attack surface.
From Models to Agents: Expanding the Exposure Layer
If prompt exposure were the full story, the industry could treat this as a manageable transparency issue. The real shift emerges as organizations move from standalone models to agentic systems.
Agents introduce:
Tool access (APIs, databases, SaaS platforms)
Memory (persistent or session-based)
Autonomy (multi-step reasoning and execution)
In this context, prompt injection is no longer just a way to manipulate output. It becomes a mechanism for action and exfiltration.
We are already seeing credible demonstrations of:
Malicious content embedded in documents or web pages altering agent behavior
Retrieval-augmented generation (RAG) pipelines surfacing sensitive data that can then be extracted
Agents being induced to call external tools with unintended parameters or data
The critical issue is that the model does not reliably distinguish between:
Trusted system instruction
Retrieved content
Adversarial input
To the model, these are all tokens in a sequence. The burden of separation falls on architecture, not the model itself.
The Deeper Pattern: A Soft Control Plane

Across Anthropic, OpenAI, and the broader ecosystem of agent frameworks, a consistent pattern is emerging.
Control is Textual, Not Enforced
System behavior is governed by instructions written in natural language. These instructions are interpreted, not enforced.
Boundaries Are Probabilistic
Safety, policy, and task constraints are applied with high likelihood, not certainty. Under pressure, they can degrade.
Context Is Ambiguous
The model does not possess a native mechanism to assign trust levels to different inputs within its context window.
Taken together, this creates what can be described as a soft control plane. It is effective under normal conditions, but susceptible to manipulation under adversarial ones.
Why the Anthropic Case Resonates
Anthropic’s approach is particularly instructive because of its emphasis on Constitutional AI (see "What is Constitutional AI?" blog). By encoding principles and behavioral guidelines directly into the system prompt, the company has made its alignment strategy more explicit than most.
When those prompts are exposed, what becomes visible is not just implementation detail, but philosophy in action.
This has two effects:
It demystifies how alignment is operationalized
It reveals the limits of that approach under adversarial interaction
In this sense, the “leak” functions less as a failure of secrecy and more as a window into the current state of AI control systems.
Implications for Security and Risk Leaders
For CISOs and security leaders, the takeaway is not that LLMs are unsafe. It is that they must be understood on their own terms.
LLMs and Agents Are Influenceable Systems
They can be guided, steered, and in some cases manipulated through input alone. This places them closer to human operators than to deterministic software components.
Traditional Boundaries Do Not Apply
You cannot rely on hidden prompts, internal policies, or model alignment as hard controls. These are advisory layers, not enforcement mechanisms.
Agentic Architectures Increase Blast Radius
Once a model can take action, access data, or invoke tools, the consequences of manipulation expand from “incorrect answer” to “material impact.”
Aligning to a Probabilistic Security Model
These dynamics reinforce a broader shift already underway in security:
From deterministic control to probabilistic risk management.
In practical terms, this means:
Assuming that prompt injection will occur
Designing systems that constrain what an agent can do, not just what it is told
Implementing verification layers around high-impact actions
Maintaining human oversight where context and judgment are required
This is not a temporary phase. It is the operating model for AI-enabled systems.
The Path Forward: From Trust to Governance
The industry often frames AI adoption in terms of trust. Trust in the model, the vendor, or the outputs.
A more useful framing is governance.
What can the system access?
What actions can it take?
How are those actions monitored and validated?
Where does human authority intervene?
As organizations move toward agentic AI, these questions become central.
The Gist
The recent wave of LLM and agent “leaks” is not a series of isolated events. It is a signal.
The mechanisms we use to control AI systems are visible, influenceable, and inherently probabilistic.
As a result:
Security strategies must shift from concealment to containment
From preventing manipulation to managing its impact
From trusting the model to governing the system
The organizations that internalize this distinction early will be better positioned to harness AI’s capabilities without inheriting disproportionate risk.
