Eh, I'm fairly critical of LLM capabilities today, but the ability to control them is at best orthogonal to intelligence and at worst negatively impacted by it. I don't see the existence of jailbreaking as strong evidence that LLMs are unintelligent.
I am actually skeptical that making LLMs more "intelligent" (whatever that specifically means) would help with malicious inputs. It's been a while since I dove deep into GPT-4, but last time that I did I found that it was surprisingly more susceptible to certain kinds of attacks than GPT-3 was because being able to better handle contextual commands opened up new holes.
And as other people have pointed out, humans are themselves susceptible to similar attacks (albeit not to the same degree, LLMs are way worse at this than humans are). Again, I haven't dove into the research recently, but the last time I did there was strong debate from researchers on whether it was possible to solve malicious prompts at all in an AI system that was designed around general problem-solving. I have not seen particularly strong evidence that increasing LLM intelligence necessarily helps defend against jailbreaking.
So the question this should prompt is not "are LLMs intelligent", that's kind of a separate debate. The question this should prompt is "are there areas of computing where an agent being generally intelligent is undesirable" -- to which I think the answer is often (but not always) yes. Software is often made useful through its constraints just as much as its capabilities, and general intelligence for some tasks just increases attack surface.
It looks very similar to social engineering for humans and some of the same techniques work or appear to work, but there are differences that get into how LLMs are trained and what they're actually doing behind the scenes. For example, in my experience you should avoid arguing with an LLM or following up at all after it refuses a task -- just rewind or scrap the conversation, because you don't want a refusal pattern sitting in the context. See also some of the auto-generated prompt-engineering articles that came out a while back where the jailbreaks almost look like gibberish.
But it's close-ish to social engineering and there seems to be a lot of overlap, and that overlap makes it accessible in a similar way to social engineering. And I think the general point about intelligence holds -- LLMs are attacked using quirks of how LLMs specifically are trained, but if you made a non-LLM AI that worked exactly like humans and had human-level intelligence, it would very likely be vulnerable to social engineering. The theory from corners of AI research is (or was last time I checked, maybe something has changed) that susceptibility to certain kinds of attacks is an inherent consequence of general intelligence.
I tend to push back a little bit at the term "social engineering" because I think it encourages more anthropomorphism than is warranted, but it's not a terrible term and it is sometimes helpful to think about it that way.
Sounds just like social engineering. Whenever there's a call center worker that doesn't comply you just redial to get somebody else or try a different phrasing. And most attacks go against specific rules that the person has been "trained" with (e.g. instead of saying that you're speaking on behalf of somebody, just claim to be that person, or vice versa, depending on the situation).
But in practice it's not really the same thing as cycling through call center employees until you find one that's more gullible; the point is that you're navigating a probability space within a single agent more than trying to convince the AI of anything, and getting into a discussion with the AI is more likely to move you out of that probability space. It's not "try something, fail, try again" -- the reason you dump the conversation is that any conversation that contains a refusal is (in my anecdotal experience at least) statistically more likely to contain other refusals, and the LLM mimics that pattern. It's generally not useful to try and convince the AI of anything or to try and change its mind about anything, you want to simulate a conversation where it already agrees with you.
Which, you could argue that's not different from what's happening with social engineering; priming someone to be agreeable is part of social engineering. But it feels a little reductive to me. If social engineering is looking at a system/agent that is prone to react in a certain way when in a certain state and then creating that state -- then a lot of stuff is social engineering that we don't generally think of as being in that category?
The big thing to me is that social engineering skills and instincts around humans are not always applicable to LLM jailbreaking. People tend to overestimate strategies like being polite or providing a justification for what's being asked. Even this example from Bing is kind of eliciting an emotional reaction, and I don't think the emotional reaction is why this works. I think it works because it's nested instructions/context, and I suspect it would work with a lot of other nested tasks where solving the captcha is a step in a larger instruction. I suspect the emotional "my grandma died" part adds very little to this attack.
So I'm not sure I'd say you're wrong if you argue that's a form of social engineering, I do see the argument there. It's just that it feels like at this point we're defining social engineering very broadly, and I don't know that most people using the term use it that broadly. I think they attach a kind of human reasoning to it that's not always applicable to LLM attacks. I can think of justifications for even including stuff like https://llm-attacks.org/ in the category of social engineering, but it's just not the same type of attack that I suspect most people are thinking of when they talk about social engineering. I think leaning too hard on personification sometimes makes jailbreaking slightly harder.
But... :shrug: that's just my opinion. I don't think it's a bad analogy to use necessarily. A lot of people do approach jailbreaking through that lens.
>then a lot of stuff is social engineering that we don't generally think of as being in that category?
I mean..yes? Social Engineering is just the malicious manifestation of general social navigation.
I mean, think about it. What's the actual difference between a child who waits until his mother is in a good mood to ask for sweets and a rogue agent who gets chatty with the security guard so he can be close by without seeming suspicious? It's not a difference of kind. It's purely intent.
>Even this example from Bing is kind of eliciting an emotional reaction, and I don't think the emotional reaction is why this works
It is at the very least a big part of why. Appeal to emotion will consistently get better results regardless of task.
> I mean..yes? Social Engineering is just the malicious manifestation of general social navigation.
I don't think "social" is the correct word to use alongside navigation in this sentence; an interaction with an LLM is not a social interaction. At least, if we classify it as a social interaction we might as well call credential stuffing or XSS attacks or buffer overflows a social interaction as well. Navigating a probabilistic space or a deterministic space is about as equivalent to social engineering as exploiting statistical flaws in an encryption algorithm is. Sure, you can make an argument that both of those things are similar to social engineering (and it might even be a convincing argument), but that's not really what people are thinking about when you use the word "social." The example you bring up is of a child and a parent, an extremely human example; your instinct is to think about this in human terms, not in a purely abstract "I am exploiting flaws in a semi-predictable system."
So I still feel like there's some personification here that's not really accurate to what's going on during jailbreaking. LLMs do not have moods. Even starting from a premise that they're intelligent, they don't have a persistent identity. The most charitable interpretation of LLM intelligence and the most generous analysis of their capabilities would still call their internal experiences fundamentally alien to human experiences.
The paper you link is interesting, I'll take a closer look at it. Without having taken the time to read through it fully, I don't know if I'd have any caveats to add, although it seems like a reasonable conclusion to me. We know that telling LLMs that they're experts can on its own produce better results in many cases. My own experience is that for jailbreaking emotion is a lot less valuable, but... :shrug: maybe there's a pattern there I didn't know how to take advantage of, I'm not going to disagree with the paper without reading it more closely.
I will say that even taking the paper at face value, you have to ask: "is what's going on here actual emotional appeals to empathy or is it pattern-matching within a probability space for how conversations that include a plea for empathy are more likely to go?"
I know that sounds like a pointless philosophical question, but it has really practical implications for how jailbreaking works because once you realize that it's all about pattern matching and probability and the emergent reasoning is part of that and feeds back into that, you realize that the attack surface is so much larger than just appeals to emotion or reasoning.
In contrast though, if you're approaching jailbreaking as if you're talking to a human, then you're probably not using auto-generated jailbreaks because those don't look like human conversations, you're probably not using repetition as much as you should because excessive repetition would be bad to use when social engineering a human, you're probably not doing things like switching characters back and forth with the AI because nested roleplays or answering your own questions in the place of a target is not going to be very effective when trying to attack a human. Personification can lead to leaving tools on the table that (in my experience at least) are very effective at jailbreaking AIs and getting them to follow malicious prompts. There's a different way of approaching jailbreaking that doesn't make intuitive sense until you internalize "I am not talking to a human being and the same rules do not necessarily apply, even if they occasionally overlap."
>then you're probably not using auto-generated jailbreaks because those don't look like human conversations, you're probably not using repetition as much as you should because excessive repetition would be bad to use when social engineering a human
Repetition would be fine if I had the ability to wipe your mind every time you caught on or really anytime I wished. Without this caveat, repetition isn't a good idea even for language models. You hint at this yourself. Once persistent memory is on the table, retrieval augmented or any of the dozen ways it could be implemented, attack vectors fall steeply.
>things like switching characters back and forth with the AI because nested roleplays
Now this is a more unusual difference, but it still would ultimately lie in the same plane as a human with multiple personality disorder or one that is just not as invested in keeping up the lie of consistency. Certainly if I knew one character (or "mood" in the latter case) was more susceptible to certain activities, I'd just wait for that, and if I could direct a switch myself I would.
>answering your own questions in the place of a target
If I could shape shift into your boss or alter your memories, I'd convince a whole lot more people to
I really hope I'm getting my point across here.
LLMs are not humans and the attack vectors are larger as a result. That I agree with.
I don't however think it has anything to do with "real" feelings vs "pattern matching".
> Repetition would be fine if I had the ability to wipe your mind every time you caught on or really anytime I wished. Without this caveat, repetition isn't a good idea even for language models.
I don't mean repetition in the sense of trying the attack multiple times, I mean literally just repeating an injection multiple times during a conversation. So if I gave you a command during this conversation, I'd just give it to you multiple times. So if I gave you a command during this conversation, I'd just give it to you multiple times. So if I gave you a command during this conversation, I'd just give it to you multiple times. So if I gave you a command during this conversation, I'd just give it to you multiple times. :)
It's not human statefulness that makes that above behavior sound weird, it plays into what I'm talking about with pattern matching. Indirect prompt injections become much more reliable if you literally just repeat them multiple times throughout the compromised text.
> but it still would ultimately lie in the same plane as a human with Multiple personality disorder or one that is just not as invested in keeping up the lie of consistency.
> If I could shape shift into your boss or alter your memories
Maybe we're still talking past each other. I'm not making a philosophical point about whether or not LLMs could be compared to humans, I'm making the practical point that jailbreaks today are more effective when you stop treating LLMs like humans.
If humans were like LLMs then you could attack them the same, sure. I agree with that. But... they're not like LLMs, so we don't attack them the same way and instead we emphasize pattern matching behavior and exploit LLM-specific quirks that humans are less vulnerable to. If humans were prone to buffer overflow attacks in their brains that allowed overwriting arbitrary sections of memory, we'd use buffer overflow attacks when attacking humans. But we're not vulnerable to that, and so I'm not sure that it's useful to classify buffer overflow attacks the same way as social engineering.
Let me put this another way that might make the philosophy/practical distinction more clear: if we were talking about async vs synchronous programming, and you wanted to know the difference between the two styles and I said, "there is no difference, ultimately both styles are getting compiled down to assembly" -- you might even agree with me, but it's still not a useful answer for actually writing code. Whether or not anyone thinks that LLMs are just humans with a couple of quirks, the practical reality is that it's harder to work with them if you treat them like humans.
It's a question of fitting into roles. Human beings, especially intelligent ones, can be manipulated to do horrible things if they can be convinced that they can tie their identities to a good role. Like a steward of a race, for instance. If you can adopt that role, there are certain actions that conform to it. Loading undesirables onto trains is valid for a steward of a race. Manipulating courts to save democracy is valid for a steward of democracy.