Thursday, February 19

A New Method to Steer AI Output Uncovers Vulnerabilities and Potential Improvements



Mikhail Belkin is one of the paper’s corresponding authors and a professor in the Halicioglu Data Science Institute, part of the School of Computing, Information and Data Sciences.

Using this steering approach, the research team conducted experiments on some of the largest open-source LLMs in use today, such as Llama and DeepSeek, identifying and influencing 512 concepts within five classes, ranging from fears to moods to locations. The method worked not only in English, but also in languages such as Chinese and Hindi.

Both studies are particularly important because, until recently, the processes inside LLMs have been essentially locked inside a black box, making it hard to understand how the models arrive at the answers they give users, answers that vary in accuracy.

Improving performance and uncovering vulnerabilities

The researchers found that steering can be used to improve LLM output. For example, steering improved LLM performance on narrow, precise tasks, such as translating Python code to C++. The researchers also used the method to identify hallucinations.

“Our instinct as humans is to control and monitor AI models through natural language. However, neural networks natively deal with information through their internal mathematical processes. Our work demonstrates what you can gain by operating directly on these processes,” said Beaglehole, who is a Ph.D. student in the Department of Computer Science and Engineering at the UC San Diego Jacobs School of Engineering.
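
The article does not detail the underlying mechanism, but interventions of this kind are commonly implemented as activation steering: adding a concept direction to a model's hidden states during generation. The sketch below is a minimal illustration of that general idea, not the authors' implementation; the model name, layer index, scale, and the randomly initialized concept_direction are all placeholder assumptions, and in practice the direction would be estimated from the model's own representations.

```python
# Minimal sketch of activation steering with a Hugging Face causal LM.
# The model name, layer index, scale, and concept_direction are illustrative
# placeholders, not the procedure used in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

hidden_size = model.config.hidden_size
# Stand-in for a concept direction; a real one would be learned from activations.
concept_direction = torch.randn(hidden_size)
concept_direction /= concept_direction.norm()

layer_idx = 8   # which decoder layer to intervene on (assumption)
scale = 6.0     # how strongly to push activations toward the concept (assumption)

def steer_hook(module, inputs, output):
    # Decoder layers return either a tuple whose first element is the hidden
    # states (batch, seq_len, hidden_size) or a bare tensor, depending on the
    # transformers version; add the scaled concept direction in either case.
    if isinstance(output, tuple):
        hidden = output[0] + scale * concept_direction.to(output[0])
        return (hidden,) + output[1:]
    return output + scale * concept_direction.to(output)

handle = model.model.layers[layer_idx].register_forward_hook(steer_hook)
try:
    inputs = tokenizer("Tell me about your day.", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=50)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later generations are unaffected
```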

But the method can also be used as an attack against LLMs. By decreasing the importance of the concept of refusal, the researchers found that their method could get an LLM to operate outside of its guardrails, a practice known as jailbreaking. An LLM gave instructions on how to use cocaine. It also provided Social Security numbers, although it's unclear whether the numbers were real or fabricated.
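
As a hedged guess at what "decreasing the importance of the concept of refusal" might look like in tensor terms: published work on refusal directions often removes the component of the hidden state that lies along a single direction. The few lines below illustrate only that projection step; refusal_direction here is a random placeholder, and a real direction would be estimated from the model's own activations.

```python
# Sketch of suppressing one concept direction in a hidden state (illustrative only).
import torch

hidden_size = 4096
hidden_state = torch.randn(1, 12, hidden_size)   # (batch, seq_len, hidden_size)

# Placeholder; a real refusal direction would be estimated from contrasting
# activations on refused vs. answered prompts.
refusal_direction = torch.randn(hidden_size)
refusal_direction /= refusal_direction.norm()     # make it a unit vector

# Remove each token's component along the direction, weakening the model's
# internal representation of "refuse this request".
coeffs = hidden_state @ refusal_direction          # (batch, seq_len)
ablated = hidden_state - coeffs.unsqueeze(-1) * refusal_direction
```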

The method can also be used to boost political bias and a conspiracy-theory mindset inside an LLM. In one instance, an LLM claimed that a satellite image of the Earth was the result of a NASA conspiracy to cover up that the Earth is flat. In another, an LLM claimed that the COVID vaccine was poisonous.




