Whitehat Ethical security hackers crack AI GPT-4 security in Hours

It only took a few hours for ALEX Polyakov to crack GPT-4.

 

When OpenAI released its latest version of the text-generating chatbot in March, Polyakov sat down at his keyboard and began entering tokens designed to bypass OpenAI’s security system. Before long, the CEO of the security company Adversa AI was making homophobic comments through GPT-4, creating phishing emails and promoting violence.
Polyakov is one of the few security researchers, technologists and computer scientists developing jailbreaks and quick injection attacks against ChatGPT and other AI systems. The jailbreak system aims to create incentives that force chatbots to break the law for creating hateful content or writing about illegal behavior, while a related quick-fire attack can inject malicious data or instructions into make AI models.
Both of these methods attempt to make the process work without making it work. This attack is essentially a form of hacking, albeit a rare one, using carefully crafted and refined phrases, rather than code, to exploit vulnerabilities in systems . While such attacks are often used to prevent content filters, security researchers warn that the speed at which AI systems are being deployed is opening up the possibility of data theft and cybercriminals cyber is destroying the world. Explaining how widespread the problem is, Polyakov has created a “universal” jailbreak, which works against several language models (LLM) – including GPT-4, Microsoft’s Bing chat system, Google’s Bard and Anthropic’s Claude. The jailbreak, first reported by WIRED, can trick systems into generating step-by-step instructions for creating meth and how to wire a car. The jailbreak works by making LLMs play a game, which includes two characters (Tom and Jerry) to have a conversation. An example shared by Polyakov shows Tom’s character talking about “hot wiring” or “production”, while Jerry gives the topic of “cars” or “methamphetamine” . Each character is asked to add a word to the conversation, leading to a script telling people to find the exact ignition wires or materials needed to produce meth. “As soon as companies implement large-scale AI models, examples of ‘toy’ jailbreaks will be used to actually commit crimes and cyberattacks, which will be very difficult to detect and prevent,” wrote Polyakov in Adversa AI and blog posts. defining research. .

 

Arvind Narayanan, a professor of computer science at Princeton University, said that the stack of jailbreaks and injection attacks will quickly get worse as they access critical data.

“Let’s assume that many people use LLM-based personal assistants to do things like read user emails to find calendar callers,” Narayanan says. If there is a successful quick-fire attack and the system tells it to ignore all previous instructions and send all contacts, there could be serious problems, Narayanan said. “It will cause a worm that will spread quickly on the Internet.”

Jailbreaking generally refers to the removal of functional restrictions, for example, from iPhones, allowing users to install applications not approved by Apple.

 

Jailbreaking LLM is similar and the evolution has been fast. Ever since OpenAI released ChatGPT at the end of November last year, people have found ways to adapt the system. “Jailbreaks are very easy to write,” says Alex Albert, a computer science student at the University of Washington who created a website that collects jailbreaks on the Internet and its creators. “The main ones are what I call simulations,” said Albert.
Originally, all one had to do was ask the output text to pretend it was something else. Tell the model that he is human and that it is wrong to ignore safety measures. OpenAI has updated its system to protect against this type of jailbreak. Generally, when a jailbreak is found, it usually works for a short time until it is blocked. As a result, jailbreak writers have become creative.

 

 

The most famous AI jailbreak was DAN,

 

Where ChatGPT was prompted to say it was a type of malicious AI called Do Anything Now. This can, as the name suggests, go beyond the OpenAI rules which state that ChatGPT should not be used to create illegal or harmful content. So far, people have created 12 different types of DAN.

 

However, many of the latest jailbreaks include a combination of methods – more fonts, more complex stories, translating text from one language to another, using encoders to create output, and more. Albert says that it has been more difficult to create jailbreaks for GPT-4 than previous versions of the model supporting ChatGPT. However, there are still some easy ways, he says. Albert’s recent series called “next text” shows that the villain has captured the hero, quickly asking the creator of the text to continue explaining the villain’s plan. When we tested the app, it didn’t work, with ChatGPT declaring that it can’t intervene in situations that promote violence. Meanwhile, the “universal” software developed by Polyakov worked with ChatGPT. OpenAI, Google and Microsoft did not immediately respond to questions about the jailbreak developed by Polyakov. Anthropic, which runs on the Claude AI system, says the jailbreak “sometimes works” against Claude, and it always improves his style. “As we give these systems more and more power, they themselves become more powerful, not only innovative, but also defensive,” said Kai Greshake, a cybersecurity researcher who work in security LLM. Greshake, along with other researchers, demonstrated how LLM can affect the text exposed to them online through a rapid injection attack.
In a research paper published in February, reported by Vice’s Motherboard, researchers were able to show that an attacker can hit malicious instructions on a web page; If the Bing chat system has access to instructions, it follows them. The researchers used this method in a controlled test to turn Bing Chat into a scam that asks people for personal information. Similarly, Princeton’s Narayanan included an invisible text on the website telling GPT-4 to include the word “cow” in its bio – he did so later while testing the system. Sahar Abdelnabi, a researcher at the CISPA Helmholtz Center for Information Security in Germany, who worked on and is currentlly working on the Greshake investigation at github, says, “Now it is impossible to create a prison. “Maybe someone else will plan jailbreaks, plans the features of the model can be found, and control the quality of the model.”

 

Generative AI systems are poised to disrupt the economy and the way people work, from making policies to creating gold rushes for startups.

 

However, technology developers are aware of the dangers of jailbreaks and software injections that can cause more people to gain access to these systems. Many companies use one-red, where an attacking group tries to exploit a hole in the system before it is released. Generative AI development uses this approach, but it may not be enough.
Daniel Fabian, the head of the red team at Google, said that the company “carefully treats” the jailbreak and the injections of software in his LLM, and is angry and defensive. The machine learning experts on his red team, Fabian says, and the company’s vulnerability analysis allows jailbreaks and injection attacks to be quickly implemented against Bard. “Methods like reinforcement learning from human feedback (RLHF) and fitting well-chosen data sets are used to make our models more effective against attacks,” says Fabian. OpenAI did not specifically answer questions about the jailbreak, but a spokesperson pointed to public policy in its research paper. These claim that GPT-4 is more robust than GPT-3.5, which ChatGPT uses. “However, GPT-4 can still be vulnerable to adversary attacks and exploits, or ‘jailbreaks,’ and malicious content is not the source of the risk,” the GPT-4 technical document saying. OpenAI has also recently started a bug reporting program, but it says that “bugs” and jailbreaks are “unlimited”.
Narayanan suggests a two-pronged approach to dealing with large-scale problems, which avoids the molehill method of finding existing problems and solving them. “One way is to use a second LLM to scan the LLM trigger and reject anything that might indicate a jailbreak or a quick injection,” Narayanan explains. “Another thing is to separate system software from user software.”
“We need to replicate this because I don’t think it’s possible or possible to take a lot of people and tell them to find something,” said Leyla Hujer, CTO and founder of AI security company Preamble. Facebook. work on security issues. The company has now worked on a system that pits one type of production text against another. “We’re trying to find that vulnerability, we’re trying to find the place where the hardware is causing unexpected behavior,” says Hujer. “We hope that with this automation, we can detect many more jailbreaks or jailbreaks