There were a few interesting papers - the Anthropic one about alignment faking https://www.anthropic.com/news/alignment-faking and the OpenAI o1 system card https://simonwillison.net/2024/Dec/5/openai-o1-system-card/ - and OpenAI continued to push their "instruction hierarchy" idea. Any other big moments?
I'll be honest, I don't follow that side of things very closely (outside of complaining that prompt injection still isn't fixed yet).