Remember Steganography?
Most attacks against LLMs come from text-based prompts that cause the model to ‘jump the rails’ and start providing responses it shouldn’t. An older example was asking an LLM to compose a song like the one the user’s grandma used to sing for them, a song which happened to be about making napalm. JaiLIP has the same effect, but the attack vector is a picture, used against LLMs that can process images as well as text (vision-language models).
JaiLIP, which stands for Jailbreaking with Loss-guided Image Perturbation, is a way of manipulating an image so that the change is invisible to the naked eye but significant to an LLM. For instance, the researchers used a “modified image of a traffic light. While the image appeared ordinary to human viewers, it reportedly influenced the model to provide instructions for running a red light while avoiding a traffic ticket”. That is not a response the LLM should provide.
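To get a feel for what “loss-guided” and “invisible to the naked eye” can mean in practice, here is a toy sketch. It is not the researchers’ code and it does not touch a real vision-language model. It assumes the general idea is to nudge pixels in the direction that moves a model’s score while penalising the visible difference from the original image. The linear `score` function, the `lam` penalty weight, and the `eps` cap on per-pixel change are all hypothetical stand-ins chosen for illustration.

```python
import random


def score(weights, pixels):
    """Toy stand-in for a model's output score: a linear function of the pixels."""
    return sum(w * p for w, p in zip(weights, pixels))


def perturb(pixels, weights, steps=100, lr=0.01, lam=50.0, eps=0.03):
    """Gradient descent on the assumed joint objective
    -score(x + d) + lam * MSE(x, x + d), with each change d_i capped at eps."""
    n = len(pixels)
    delta = [0.0] * n
    for _ in range(steps):
        for i in range(n):
            # Gradient of -score is -w_i; gradient of lam * MSE is lam * 2 * d_i / n.
            grad = -weights[i] + lam * 2.0 * delta[i] / n
            d = delta[i] - lr * grad
            # Keep the change small (hypothetical imperceptibility budget).
            d = max(-eps, min(eps, d))
            # Keep the resulting pixel inside the valid [0, 1] range.
            d = max(-pixels[i], min(1.0 - pixels[i], d))
            delta[i] = d
    return [p + d for p, d in zip(pixels, delta)]


random.seed(0)
image = [random.random() for _ in range(64)]        # fake 8x8 grayscale image
weights = [random.uniform(-1, 1) for _ in range(64)]  # fake model parameters

adv = perturb(image, weights)
max_change = max(abs(a - p) for a, p in zip(adv, image))

print("score before:", score(weights, image))
print("score after: ", score(weights, adv))
print("largest pixel change:", max_change)
```

No pixel moves by more than 3% of its range, yet the toy score shifts in the chosen direction. A real attack would replace the linear score with gradients taken through an actual model, which is where the difficulty lies.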