Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> You actually can implement LLM guardrails by "just asking" it to not do X in the prompt.

Except it keeps being proven that with current LLMs, guardrails implemented that way are both quite weak and make the performance of the system worse for things that aren't intended to be excluded.

Further, because of the way LLMs scale, an instruction that fails to a hostile customer request of a particular form will do so every time, while one intern that is subject to a particular exploit won’t imply every similarly situated intern having the same vulnerability, so an exploit which works once won't be easily and reliable repeatable.



As discussed in the sibling thread, the point I'm making isn't about whether prompt-based guardrails are effective enough for production systems. All I am saying is that it's possible to implement guardrails at the prompt level and they do have some limited, non-zero effectiveness, thus indicating that LLMs are capable of processing such instructions, just like humans.

> an instruction that fails to a hostile customer request of a particular form will do so every time, while one intern that id subject to a particular exploit won’t imply every similarly situated intern having the same vulnerability

Give me a perfect clone of the first intern programmed to believe they've had an identical upbringing and experience and I'll bet you such subjects fall victim in the same way to the same attack every time. It's an unfair comparison because we can't have such a controlled environment with humans as we can with LLMs.


> Give me a perfect clone of the first intern programmed to believe they've had an identical upbringing and experience and I'll bet you such subjects fall victim in the same way to the same attack every time.

Sure, but that's not a realistic situation.

> It's an unfair comparison

It's a perfectly fair comparison in response to the claim upthread that LLM instruction-following issues are basically the same as in humans: on an individual request basis, maybe, but at scale, the pragmatics are hugely different.


I don't think the question was about the practical aspects but rather whether or not LLMs are theoretically equally capable in a technical, qualitative sense. We've had tens of thousands of years to work on the practical aspects of human systems, so of course LLMs are not going to be at that level of refinement yet.




Consider applying for YC's Summer 2026 batch! Applications are open till May 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: