JaiLIP, A Picture Is Worth 1000 Lines Of Code To An LLM

Remember Steganography?

The majority of attacks against LLMs come from text-based prompts that cause the agent to ‘jump the rails’ and start providing responses it shouldn’t. An older example of this was asking an LLM to compose a song like the one their grandma used to sing for them, a song which happened to be about making napalm. JaiLIP has the same effect, but the attack vector is a picture, used against vision-language models, which are LLMs that can process images as well as text.

JaiLIP, which stands for Jailbreaking with Loss-guided Image Perturbation, is a way of manipulating an image so that the change is invisible to the naked eye but significant to an LLM. For instance, the researchers used a “modified image of a traffic light. While the image appeared ordinary to human viewers, it reportedly influenced the model to provide instructions for running a red light while avoiding a traffic ticket”. That is not a response the LLM should provide.
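To give a feel for what "loss-guided image perturbation" means in general, here is a toy sketch. It is not the researchers' code and does not touch a real vision-language model: the "model" is a hypothetical linear scorer over random numbers standing in for pixels, and the names `toy_loss`, `perturb`, `epsilon` and `step` are made up for illustration. The idea it shows is simply that many tiny per-pixel nudges, each chosen by following the gradient of a loss and each capped so a viewer would not notice, can add up to a large change in what a model computes.

```python
import random


def toy_score(weights, pixels):
    """Hypothetical stand-in for a model: a plain weighted sum of pixel values."""
    return sum(w * p for w, p in zip(weights, pixels))


def toy_loss(weights, pixels):
    """Toy loss: lower loss means a higher score from the stand-in model."""
    return -toy_score(weights, pixels)


def perturb(pixels, weights, epsilon=0.02, step=0.005, iters=10):
    """Take small signed-gradient steps that lower toy_loss, keeping every
    pixel within +/- epsilon of its original value and inside [0, 1]."""
    adv = list(pixels)
    for _ in range(iters):
        for i, w in enumerate(weights):
            # For a linear score, d(loss)/d(pixel_i) is just -w, so stepping
            # in the direction sign(w) lowers the loss.
            direction = 1.0 if w > 0 else -1.0 if w < 0 else 0.0
            candidate = adv[i] + step * direction
            # Project back into the allowed range around the original pixel.
            low = max(0.0, pixels[i] - epsilon)
            high = min(1.0, pixels[i] + epsilon)
            adv[i] = min(high, max(low, candidate))
    return adv


rng = random.Random(0)
n_pixels = 64 * 64
weights = [rng.uniform(-1.0, 1.0) for _ in range(n_pixels)]
image = [rng.uniform(0.1, 0.9) for _ in range(n_pixels)]

adv_image = perturb(image, weights)
max_change = max(abs(a - b) for a, b in zip(adv_image, image))

print("largest per-pixel change:", max_change)
print("loss before:", toy_loss(weights, image))
print("loss after: ", toy_loss(weights, adv_image))
```

No single pixel moves by more than `epsilon`, yet the loss drops substantially because thousands of small changes all push in the direction the gradient suggests. A real attack on a vision-language model would need gradients from the model itself and a far more elaborate loss, which this sketch deliberately leaves out.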

You can read the entire research paper here.

