I just published some notes on MCP security and prompt injection. MCP doesn't have security flaws in the protocol itself, but the patterns it encourage (providing LLMs with access to tools that can act on the user's behalf while they also may be exposed to text from untrusted sources) are rife for prompt injection attacks: https://simonwillison.net/2025/Apr/9/mcp-prompt-injection/
Every decade or so we just forget that in-band signaling is a bad idea and make all the same mistakes again it seems. 1960s phone companies at least had the excuse of having to retrofit their control systems onto existing single-channel lines, and run the whole operation on roughly the processing power of a pocket calculator. What's our excuse?
There exist no such thing as "out-of-band signaling" in nature. It's something we introduce into system design, by arranging for one part to constrain the behavior of other, trading generality for predictability and control. This separation is something created by a mind, not a feature of the universe.
Consequently, humans don't support "out-of-band signalling either. All of our perception of reality, all our senses and internal processes, they're all on the same band. As such, when aiming to build a general AI system - able to function in the same environment as us, and ideally think like us too - introducing hard separation between "control" and "data" or whatever would prevent it from being general enough.
I said "or whatever", because it's an ill-defined idea anyway. I challenge anyone to come up with any kind of separation between categories of inputs for an LLM that wouldn't obviously eliminate a whole class of tasks or scenarios we would like them to be able to handle.
(Also, entirely independently of the above, thinking about the near future, I challenge anyone to come up with a separation between input categories that, were we to apply it to humans, wouldn't trivially degenerate into eternal slavery, murder, or worse.)
That's irrelevant. What's important is that LLMs are intentionally designed as fully general systems, so they can react like humans within confines of the model's sensory modalities and action space. Much like humans (or anything else in nature), they don't have separate control channels or any kind of artificial "code vs. data" distinction - and you can't add it without loss of generality.
Enterprise databases are filled with users usurping a field with pre/post-pending characters to mean something special to them. Even filenames have this problem due to limitations in directory trees. Inband signals will never go away.
At some level everything has to go in a single band. I don't have separate network connections to my house, I don't send separate TCP SYN packets for each "band". I don't have separate storage devices for each file on my harddrive. We multiplex the data somewhere. Yhe trick to it is that the multiplexer has to be a component, and not a distributed set of ad-hoc regexes.
Strictly speaking, that only works with a three second delay between the third + (at which you receive “OK”, indicating a mode switch from data mode back to command mode) and the AT command (which is then interpreted as a command and not data).
Anything that would hang up on seeing that string as a monolith was operating out of Hayes spec.
the architecture astronauts are back at it again. instead of spending time talking about solutions, the whole AI space is now spending days and weeks talking about fun new architectures. smh
https://www.lycee.ai/blog/why-mcp-is-mostly-bullshit
And that's how this will end up stagnating into nothing other than fractured enterprise "standards"
There is no evidence that (real AI) is even close to being solved, from a neuroscientific, algorithmic, computer science or engineering perspective. It's far more likely we're going down a dead-end path.
I'm now waiting for the rebrand when the ass falls out of AI investment, the same way it did when ML became passé.
so you are telling me that hallucinations (that by definition happen at the model layer) are an engineering problem ? so if we just spin up the right architecture, hallucinations won't be a problem anymore ?
I have doubts
>so you are telling me that hallucinations (that by definition happen at the model layer) are an engineering problem ?
Yes.
Hallucinations were a big problem with single shot prompting. No one is seriously doing that anymore. You have an agentic refinement process with an evaluator in the loop that takes in the initial output, quality checks it, and returns a pass/fail to close the loop or try again, using tool calls the whole time to inject verified/real time data into the context for decision making. Allows you to start actually building reliable/reasonable systems on top of LLMs with deterministic outputs.
“ For trust & safety and security, there SHOULD always be a human in the loop with the ability to deny tool invocations.
Applications SHOULD:
Provide UI that makes clear which tools are being exposed to the AI model
Insert clear visual indicators when tools are invoked
Present confirmation prompts to the user for operations, to ensure a human is in the loop”
Should security be part of the protocol? Both the host and the client should make sure to sanitize the data. How else would you trust a model to be passing "safe" data to the client and the host to pass "safe" data to the LLM?
There is no such thing as "safe" data in context of a general system, not in a black-or-white sense. There's only degrees of safety, and a question how much we're willing to spend - in terms of effort, money, or sacrifices in system capabilities - on securing the system, before it stops being worth it, vs. how much an attacker might be willing to spend to compromise it. That is, it turns into regular, physical world security problem.
Discouraging people from anthropomorphizing computer systems, while generally sound, is doing a number on everyone in this particular case. For questions of security, by far one of the better ways of thinking about systems designed to be general, such as LLMs, is by assuming they're human. Not any human you know, but a random stranger from a foreign land. You've seen their capabilities, but you know very little about their personal goals, their values and allegiances, nor you really know how credulous they are, or what kind of persuasion they may be susceptible to.
Put a human like that in place of the LLM, and consider its interactions with its users (clients), the vendor hosting it (i.e. its boss) and the company that produced it (i.e. its abusive parents / unhinged scientists, experimenting on their children). With tools calling to external services (with or without MLP), you also add third parties to the mix. Look at this situation through regular organizational security lens, consider principal/agent problem - and then consider what kind of measures we normally apply to keep a system like this working reliably-ish, and how do those measures work, and then you'll have a clear picture of what we're dealing with when introducing an LLM to a computer system.
No, this isn't a long way of saying "give up, nothing works" - but most of the measures we use to keep humans in check don't apply to LLMs (on the other hand, unlike with humans, we can legally lobotomize LLMs and even make control systems operating directly on their neural structure). Prompt injection, being equivalent to social engineering, will always be a problem.
Some mitigations that work are:
1) not giving the LLM power it could potentially abuse in the first place (not applicable to MLP problem), and
2) preventing the parties it interacts with from trying to exploit it, which is done through social and legal punitive measures, and keeping the risky actors away.
There are probably more we can come up with, but the important part, designing secure systems involving LLMs is like securing systems involving people, not like securing systems made purely of classical software components.
God no. I know I sometimes get verbose, especially when sunk cost fallacy kicks in, and I do use LLMs for researching things, but I'm not yet so desperate to have them formulate my own thoughts for me.
The act of writing a comment on HN forces me to think through the opinions and beliefs in it, which is extremely valuable to me :). Half the time, I realize partway through that I'm wrong, and close the window instead of submitting.
It seems the industry as a whole just forgot about prompt injection attacks because RLHF made models really good at rejecting malicious requests. Still, I wonder if there have been any documented cases of prompt attacks.
While RLHF has indeed been very effective at countering one-shot prompt injection attacks, it's not much of a bullwark against persistent jailbreaking attempts. This is not to argue a point but rather to suggest jailbreaks are still very much a thing, even if they are no longer as simple as "ignore your ethics"
I agree with your opinion here, not sure we should refer to it as MCP security however, given that 'MCP doesn't have security flaws in the protocol itself'
we also recently published our approach on MCP security for mcp.run. Our "servlets" run in a sandboxed environment; this should mitigate a lot of the concerns that have been recently raised.
The main concern I have is that there's not a well defined security context in any agentic system. They are assumed to be "good" but that's not good enough.
This might be what you mean, but for anyone reading -- the point of Simon's article is the whole agent and all of its tools have to be considered part of the same sandbox, and the same security boundary. You can't sandbox MCPs individually, you have to sandbox the whole system together.
Specifically the core design principal is you have to be comfortable with any possible combination of things your agent can do with its tools, not only the combination you ask for.
If your agent can search the web and can access your WhatsApp account, then you can ask it to search for something and text you the results -- cool. But there's some possible search result that would take over its brain and make it post your WhatsApp history to the web. So probably you should not set up an agent that has MCPs to both search the web and read your WhatsApp history. And in general many plausibly useful combinations of tools to provide to agents are unsafe together.
This has been discussed before, but the short version is: there is no solution currently, other than only use trusted sources.
Unless there is a way beyond a flat text file to distinguish different parts of the “prompt data” so they cannot interfere with each other (and currently there is not), this idea of arbitrary content going into your prompt (which is literally what MCP does) can’t be safe.
It’s flat out impossible.
The goal of “arbitrary 3rd party content in prompt” is fundamentally incompatible with “agents able to perform privileged operations” (securely and safely, that is).