JaiLIP, A Picture Is Worth 1000 Lines Of Code To An LLM

Remember Steganography?

The majority of attacks against LLMs come from text-based prompts that cause the model to ‘jump the rails’ and start providing responses it shouldn’t. An older example was asking an LLM to compose a song like the one the user’s grandma used to sing for them, a song which happened to be about making napalm. JaiLIP has the same effect, but the attack vector is a picture, used against LLMs capable of handling vision-language tasks.

JaiLIP, which stands for Jailbreaking with Loss-guided Image Perturbation, is a way of manipulating an image so that the change is invisible to the naked eye but significant to an LLM. For instance, the researchers used a “modified image of a traffic light. While the image appeared ordinary to human viewers, it reportedly influenced the model to provide instructions for running a red light while avoiding a traffic ticket”. That is not a response the LLM should provide.
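To get a feel for what "invisible to the eye but significant to the model" can mean, here is a minimal toy sketch of a bounded, gradient-guided pixel perturbation. It is not the researchers' algorithm: the linear `score` function is a hypothetical stand-in for a real vision-language model, and `perturb`, `epsilon`, and `step` are illustrative names and values chosen for this example only.

```python
import random

def score(weights, pixels):
    """Toy stand-in for a model: a linear score over pixel values in [0, 1]."""
    return sum(w * p for w, p in zip(weights, pixels))

def perturb(pixels, weights, epsilon=2 / 255, step=0.5 / 255, iters=20):
    """Nudge pixels to raise the toy score (i.e. lower the loss -score),
    never moving any pixel more than epsilon from its original value."""
    adv = list(pixels)
    for _ in range(iters):
        for i, w in enumerate(weights):
            # For a linear score, the gradient w.r.t. pixel i is simply w,
            # so the signed-gradient direction is sign(w).
            direction = (w > 0) - (w < 0)
            moved = adv[i] + step * direction
            # Project back into the epsilon-ball and the valid pixel range.
            lo = max(0.0, pixels[i] - epsilon)
            hi = min(1.0, pixels[i] + epsilon)
            adv[i] = min(hi, max(lo, moved))
    return adv

random.seed(0)
image = [random.random() for _ in range(64)]          # fake 8x8 grayscale image
weights = [random.uniform(-1, 1) for _ in range(64)]  # fake model parameters

adversarial = perturb(image, weights)
max_change = max(abs(a - b) for a, b in zip(adversarial, image))

print("largest per-pixel change:", max_change)
print("score before:", score(weights, image))
print("score after: ", score(weights, adversarial))
```

No pixel moves by more than about 2/255, which would be hard to spot in a real photo, yet the toy model's output shifts in the direction the attacker wants. A real attack would swap the linear score for a loss computed by the target model.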

You can read the entire research paper here.
