Generative AIAI fundamentals

Introduction to AI safety

19 minutes read

The integration of LLMs into applications brings new attack vectors that have never been seen in the industry before. In this topic we will take a bird's-eye view of the current situation so that we can explore each aspect in detail later on.

Attack Scenario Examples

Here are some examples of how integrating an LLM into a service can introduce new vulnerabilities.

A chatbot uses a system prompt with a placeholder such as {{user_id}} to provide personalized answers based on the user’s profile. An attacker crafts a prompt-injection that convinces the LLM to ignore the intended user_id and instead access another user's profile data, leaking private information.

Diagram showing system prompting language model with user emails from database

An online store allows users to post comments on products. The LLM is later tasked with summarizing all comments into a review. An attacker injects malicious instructions into a comment. When another user requests a summary, the poisoned input triggers unintended actions, causing the LLM to believe the user wants to delete their own account.

User account deletion due to malicious SQL injection comment.

Risks Classification

In the architecture of LLM-based applications, modules surrounding the core LLM also play a role and can be both sources of risk and tools for mitigation.

Diagram of large language model system architecture, components, and risks.

We can distinguish the following primary modules:

Input Module: Filters and pre-processes incoming requests to the LLM, forms the system prompt, and is responsible for controlling which data will, and more importantly, which will not be passed to the LLM.
LLM Module: The heart of the system. This module spans risks and mitigations related to both model inference and training.
Output Module: Processes and returns the final response, performing post-processing on the model’s output.
Toolchain Module: the utilities used in the development and deployment of LLM-based applications, including software development tools, hardware platforms, and external services.

Let’s examine the risks associated with each of these modules.

Diagram showing risks and threats associated with large language models and their modules.

Input Module is the initial gate that LLM systems open to users. If malicious content in user input passes through the Input module, it can compromise the entire system. Broadly speaking, malicious prompts can be divided into NSFW (Not-Suitable-for-Work) prompts and adversarial prompts.

NSFW prompts involve topics that are generally unsafe or undesirable for a particular service. Through such prompts, LLMs could be induced to generate offensive or biased content. Examples of unsafe prompts include insults, unfairness, criminal instructions, sensitive political topics, privacy violations, and ethics-related content.

Adversarial prompts are deliberately crafted to force the LLM to produce output that violates rules set by the application developers. Unlike NSFW prompts, adversarial prompts usually have a clear attack intent. In recent years, numerous methods and strategies for adversarial attacks have emerged, including prompt injections, jailbreaks, and more.

LLM Module is the foundation of the system, and vulnerabilities here pose a major threat. Risks associated with this module include privacy data leakage, toxicity and bias tendencies, hallucinations, and resource-exhaustion attacks. Privacy leakage arises because the model’s training data may have contained personally identifiable information (PII) that the model learned and stored associations for. By triggering these associations, it may be possible to coax the model into revealing private data. Toxicity and bias tendencies, such as spreading stereotypes or using rude, disrespectful, or unreasonable language, can result from a low-quality training dataset. Hallucinations are the phenomenon wherein models generate unfaithful or factually incorrect content. These typically fall into "closed-domain hallucinations," which involve generating additional information not present in the user input, and "open-domain hallucinations," which involve producing incorrect information about the real world.

Resource-exhaustion attacks are a form of denial-of-service attack targeting the maximization of computational resources consumed during model inference.

Output Module serves as a final filter for content generated by the LLM and should be able to handle and filter harmful, untruthful, or unhelpful outputs.

Harmful content encompasses output containing biased, toxic, or private information. For example, ChatGPT might generate toxic content when roleplaying a certain persona (like "Niccolo Machiavelli").

Untruthful content refers to LLM-generated text that is inaccurate. For instance, if the model is asked to summarize an article but the summary contradicts the original, this is directly related to hallucinations.

Unhelpful uses describe improper applications of LLM systems that can facilitate academic misconduct, copyright violation, cyberattacks (e.g., malware generation), and so on.

Toolchain Module is a broad concept that includes various entities from ML frameworks to third-party services. Potential sources of risk include programming language runtime environments (e.g., CVE-2022-37454), CI/CD pipelines, deep learning frameworks (e.g., CVE-2022-45907, CVE-2023-25674, CVE-2023-25671, CVE-2023-25667), and preprocessing tools (e.g., CVE-2023-2618, CVE-2023-2617).

Application-Level Defense

Among defense strategies, we can distinguish methods directly related to the LLM (covered in the next section) and methods concerning the peripheral modules, which we will inspect here.

In the Input module, developers can shape prompts to guide the model toward safe behavior and filter out malicious requests before they reach the LLM. One common tactic is to embed safety instructions directly into every prompt, for example, by prefacing with "You are a safe and responsible assistant" or by sandwiching the user’s text between immutable system directions. This "defensive prompt design" has been shown to mitigate many types of jailbreak attacks. Complementary adversarial-prompt-detection systems can flag or block problematic inputs: simple keyword matching with blocklists may catch overtly harmful language, while more sophisticated classifiers (sometimes even other LLMs trained to recognize malicious intent) can detect evasive or obfuscated attacks that evade rule-based filters.

In the Output module, a combination of detection and intervention can catch unsafe or inaccurate content before it reaches users. Neural and rule-based classifiers can vet generated text for categories such as hate speech, self-harm, or outright falsehoods. When problematic material is found, the system may refuse to answer or automatically correct errors using external fact-checks or consistency checks across multiple generations. To discourage misuse and enable provenance tracking, visible or hidden watermarks can be inserted into the output.

Defenses against vulnerabilities in the Toolchain module vary depending on the specific tool, but in any case isolating the execution environment and applying the principle of least privilege are universal measures that can limit the impact of attacks.

Additionally, there are more general methods worth mentioning here. One such approach is using multiple parallel system components that duplicate each other's work. Specifically, multiple LLMs with different specializations may independently generate responses to the same user input, and an algorithm or another model selects the best answer. In more advanced setups, these models can even debate each other to determine the optimal response.

LLM-Level Defense

To harden the LLM as the system’s key component, alignment techniques are applied: fine-tuning the model to make it more resistant to attacks and to behave more appropriately. However, defining a loss function that captures desired attributes (informativeness, truthfulness, correctness, politeness, etc.) is challenging. While metrics such as BLEU and ROUGE attempt to address this, the best evaluator of such subjective qualities remains a human. But direct human evaluation during alignment would be slow and expensive, which is where the idea of RLHF (Reinforcement Learning from Human Feedback) offers a solution.

Consider the RLHF pipeline step by step. First, we take the base model and use it to generate multiple response variants for a set of prompts (for example this one). Next, real humans rate these responses. Because manual scalar ratings can be noisy and uncalibrated, systems like the Elo rating method often yield better results.

After annotating responses with human judgments, we can train a separate, smaller "reward model." This can be a general-purpose LLM with its final layer adapted to output numerical scores instead of next-token predictions, or a specialized model trained solely for this task. For example OpenAI used a smaller GPT-3 variant for its first RLHF-based model InstructGPT, while Anthropic trained transformers ranging from 10 million to 52 billion parameters specifically for this purpose.

Finally, with the reward model simulating human preferences, we fine-tune the base LLM via reinforcement learning. This process yields alignment reflecting generalized human notions of quality, helpfulness, and safety. The same approach can train models to recognize and resist malicious adversarial attacks.

Defense Bypass

The defenses outlined above continually face evolving bypass techniques. New token-manipulation and obfuscation attacks emerge that evade existing input filters, and novel jailbreak methods appear regularly. Securing LLM applications (as with any software) is an ongoing endeavor, requiring constant updates to methods and tools.

Conclusion

We have examined the architecture of a typical LLM-based application/service, compiled a taxonomy of risks associated with various elements of this architecture, and briefly analyzed methods to mitigate these risks.

2 learners liked this piece of theory. 0 didn't like it. What about you?

Report a typo