To improve accuracy and relevance, LLMs undergo regular updates and refinements. This process includes fine-tuning, retraining with new data, and optimizing algorithms to enhance their performance and reliability.
In this topic, we will learn about one of the processes used to verify that LLM updates work as expected — regression testing.
The LLM Regression Testing Process
When a company updates its customer support chatbot, the goal is to improve the relevance and comprehensiveness of its responses. However, while a new version may enhance certain aspects, it can also introduce inaccuracies that were not present in the previous version.
Regression testing is the process of re-evaluating a system after modifications to ensure that existing functionalities still work as expected and that no new errors have been introduced. In the context of LLMs, regression testing is used when adding new features, optimizing performance, or fixing bugs.
When performing regression testing for LLMs, you need to ensure that modifications work correctly and integrate well with the previous version. To do this, follow these steps:
Create the test cases
Run the tests on the existing model
Implement the model updates
Re-run the test cases and compare results
Integrate the new version if no issues are found
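To make the flow concrete, the five steps above can be sketched as a single loop. The model and evaluator callables here are stand-ins invented for illustration; a real run would call the chatbot and a similarity metric.

```python
# Sketch of the five-step regression-testing loop.
# `old_model` / `new_model` are any callables prompt -> response;
# `evaluate` returns a pass/fail bool for (response, expected).

def run_regression_cycle(test_cases, old_model, new_model, evaluate):
    # Step 2: run tests on the existing model to establish a baseline.
    baseline = {tc["prompt"]: evaluate(old_model(tc["prompt"]), tc["expected"])
                for tc in test_cases}
    # Steps 3-4: after the update, re-run the same cases and compare.
    updated = {tc["prompt"]: evaluate(new_model(tc["prompt"]), tc["expected"])
               for tc in test_cases}
    # A regression is any case that passed before but fails now.
    regressions = [p for p in baseline if baseline[p] and not updated[p]]
    # Step 5: integrate only if nothing regressed.
    return {"regressions": regressions, "safe_to_integrate": not regressions}

# Toy demonstration with stand-in "models":
cases = [{"prompt": "refund?", "expected": "within 30 days"}]
old = lambda p: "within 30 days"
new = lambda p: "contact support"  # the update broke the refund answer
result = run_regression_cycle(cases, old, new, lambda resp, exp: resp == exp)
print(result)
```

The evaluator is deliberately pluggable: the exact-match lambda above can be swapped for the semantic-similarity check introduced next without changing the loop.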
First, you need to establish an evaluation criterion to determine whether a tested response passes or fails. This helps you identify if regression has occurred by comparing the chatbot's new responses to its previous ones. One way to do this is by using semantic similarity.
Semantic similarity can be used here because chatbots may not always generate identical word-for-word responses, but their meaning should remain the same. Semantic similarity measures the meaning similarity between two texts and can be utilized to evaluate chatbot responses. With this method, you can assess whether a chatbot’s response remains stable. A high similarity score means the chatbot’s response is consistent with the expected response, so it passes the test. On the other hand, a low similarity score shows a big change from the expected response, meaning it fails the test.
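As a minimal sketch, the pass/fail decision reduces to a threshold check. The 0.8 cutoff is the one used later in this topic; other projects may tune it.

```python
def passes(similarity_score, threshold=0.8):
    """A response passes when its semantic similarity to the
    expected response meets the threshold (0.8 assumed here)."""
    return similarity_score >= threshold

print(passes(0.93))  # True: meaning is consistent with the expected response
print(passes(0.41))  # False: meaning has drifted, so the case fails
```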
To begin the LLM regression testing process, the first step is to create well-defined test cases that verify the model functions correctly before and after an update. For a chatbot in an online store, regression testing involves asking specific questions and checking that the chatbot's responses remain correct and consistent after updates or modifications. A quick way to get started is with around five significant use cases, such as common questions customers might ask. In this topic, the prompts are created manually rather than pulled from real user interactions; in practice, they can be identified by analyzing FAQ sections, customer support logs, or industry-specific needs to ensure they reflect real-world interactions.
Suppose your previous chatbot could not assist people in changing their passwords, and you included this feature in the latest version. You would compare the chatbot's behavior before and after implementing this functionality to verify that it behaves as expected.
In this case, here are some example test cases you may want to consider.
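As one possible starting point, here is a sketch of five test prompts for an online-store chatbot. The prompts and expected answers are made up for illustration, and the column names are assumed to match the CSV the script later in this topic reads; the Actual column is left empty because it is filled in by running each prompt through the chatbot.

```python
import pandas as pd

# Hypothetical test cases; real ones should come from FAQs,
# support logs, or industry-specific needs.
test_cases = pd.DataFrame([
    {"Prompt": "How can I track my order?",
     "Expected": "Go to 'My Orders' and select the order to see its tracking status."},
    {"Prompt": "What is your refund policy?",
     "Expected": "Items can be returned for a full refund within 30 days of delivery."},
    {"Prompt": "Do you ship internationally?",
     "Expected": "Yes, we ship to most countries; fees are shown at checkout."},
    {"Prompt": "How do I apply a discount code?",
     "Expected": "Enter the code in the 'Promo code' field on the checkout page."},
    {"Prompt": "How do I contact customer support?",
     "Expected": "Use the live chat or email support@example-store.com."},
])
# Filled in later by running each prompt through the chatbot.
test_cases["Actual"] = ""
test_cases.to_csv("chatbot_responses.csv", index=False)
```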
A code example
The code uses cosine similarity, which pairs well with sentence embeddings because it measures how similar two sentences are based on the angle between their vectors rather than their magnitudes.
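The Hugging Face model computes this score for you, but as a sketch, the cosine measure itself is only a few lines:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors:
    dot(a, b) / (|a| * |b|). Vector length cancels out, so only
    direction (which encodes meaning) matters."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score ~1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```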
To start, download the CSV file used for the testing.
Let's install the required libraries using:
pip install requests pandas

Next, you can begin calculating the semantic similarity using the following code:
Note: Make sure to replace "your_api_key_here" with your actual Hugging Face API key. Each user has a unique key, so be sure to use yours.
import requests
import pandas as pd

# Hugging Face Inference API endpoint for the sentence-similarity model
API_URL = "https://api-inference.huggingface.co/models/sentence-transformers/all-MiniLM-L6-v2"
headers = {"Authorization": "Bearer your_api_key_here"}

# Load the chatbot responses created earlier
csv_file = "chatbot_responses.csv"
df = pd.read_csv(csv_file)

similarity_scores = []

for index, row in df.iterrows():
    expected_response = row['Expected']
    actual_response = row['Actual']

    # Compare the expected response against the actual response
    data = {"inputs": {"source_sentence": expected_response, "sentences": [actual_response]}}
    response = requests.post(API_URL, headers=headers, json=data)

    if response.status_code == 200:
        similarity_score = response.json()[0]
        similarity_scores.append(similarity_score)
    else:
        similarity_scores.append(None)
        print(f"Error at index {index}: {response.status_code}, {response.json()}")

# Save the scores alongside the original data
df['similarity_score'] = similarity_scores
df.to_csv("Semantic_Similarity_Result.csv", index=False)
print("Similarity scores saved to Semantic_Similarity_Result.csv")

Now, let's break down how this code works step by step.
First, import the requests and pandas libraries: requests is used to send the API requests, while pandas handles the CSV files.
API_URL defines the Hugging Face model used to compute semantic similarity. The headers contain the authorization key, which is required to access the model.
The variable csv_file holds the filename of the CSV file containing the chatbot responses we created earlier, and pd.read_csv() reads it into a DataFrame. An empty list, similarity_scores, is created to store the results. The code then loops through the rows of the DataFrame: for each row, expected_response holds the correct chatbot response, while actual_response holds the chatbot's output.
A request is sent to the Hugging Face API with the expected response as the source sentence and the actual response as the sentence to compare against. If the API call succeeds, the similarity score is extracted from the response and appended to the similarity_scores list; if it fails, None is appended and the error is printed. Finally, the similarity scores are added as a new column in the DataFrame, the updated data is saved to a new CSV file called "Semantic_Similarity_Result.csv", and a confirmation message is printed.
The output file generated after running the script will display the semantic similarity scores. Below is a table summarizing the semantic similarity test results for all prompts. Since all the scores were above 0.8, all test cases are considered to have passed.
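That pass/fail judgment can be sketched as follows, assuming the similarity_score column produced by the script above; made-up scores stand in for the real file, which in practice would be loaded with pd.read_csv("Semantic_Similarity_Result.csv").

```python
import pandas as pd

# Made-up scores standing in for Semantic_Similarity_Result.csv
df = pd.DataFrame({"similarity_score": [0.91, 0.87, 0.95, 0.83, 0.89]})

# A case passes when its score is at least 0.8.
df["result"] = df["similarity_score"].apply(
    lambda s: "PASS" if s is not None and s >= 0.8 else "FAIL")

print(df)
print("All passed:", (df["result"] == "PASS").all())
```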
As seen in the table, since all test cases have passed, we can now move on to the next step, which is implementing model updates.
The next step is to integrate the update: add the new feature that lets the chatbot assist users with changing their passwords. After integrating it, re-run all previously tested cases, along with a case for the new feature, using the semantic similarity test to ensure that no errors were introduced and the chatbot still produces accurate replies. To check for regressions, test the updated model with a new CSV file containing the updated actual responses while keeping the expected responses the same. The similarity scores after integrating the new feature are shown in the table below.
From the table, you can see that the semantic similarity score for the refund-related question has dropped, indicating that the chatbot no longer responds correctly. Previously, it provided a valid response, but after the update its answers are inconsistent with the pre-update version, meaning a regression has occurred. To fix this, first resolve the bug and ensure that the refund query passes the test. Once the issue is fixed, repeat regression testing from the beginning to confirm that all functionalities, including the new feature, work correctly without introducing new errors.
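A regression check like this can be sketched by joining the before-update and after-update score tables and flagging cases that fell below the threshold. The prompt labels and scores here are made up for illustration; a real run would load the two result CSVs.

```python
import pandas as pd

# Hypothetical before/after scores for the same five prompts
before = pd.DataFrame({
    "Prompt": ["track order", "refund policy", "shipping",
               "discount code", "change password"],
    "similarity_score": [0.91, 0.87, 0.95, 0.83, 0.89],
})
after = before.copy()
# The update broke the refund answer in this example
after.loc[after["Prompt"] == "refund policy", "similarity_score"] = 0.42

merged = before.merge(after, on="Prompt", suffixes=("_before", "_after"))
threshold = 0.8
# A regression: passed before the update, fails after it
merged["regressed"] = ((merged["similarity_score_before"] >= threshold)
                       & (merged["similarity_score_after"] < threshold))
print(merged[merged["regressed"]]["Prompt"].tolist())  # ['refund policy']
```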
If all the test cases pass, you can integrate the update into the final system.
Challenges in regression testing
Regression testing can be challenging when dealing with an evolving AI system like ChatGPT. You’ve likely noticed how its features and capabilities have expanded since you first used it. From image processing to fine-tuning language fluency, each improvement adds to the complexity of comprehensive regression testing. However, with these advancements come challenges. Here are some major challenges:
The method of evaluation may vary across test cases. A chatbot may include features designed to handle edge cases or guarantee factual correctness, and while semantic similarity works well for common user queries, it is less effective for edge cases, adversarial prompts, and factual checks, which require alternative testing approaches. In particular, similarity scores can miss factual errors involving dates, times, or numbers, where even a slight discrepancy leads to an incorrect result.
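One possible alternative for such factual fields, sketched below, is to extract the numbers from both responses and compare them exactly rather than semantically. This is an illustration, not a complete fact checker.

```python
import re

def numbers_match(expected, actual):
    """Semantic similarity can score two answers as 'close' even when
    a number differs; for factual fields, compare the extracted
    numbers exactly instead."""
    pattern = r"\d+(?:\.\d+)?"
    return re.findall(pattern, expected) == re.findall(pattern, actual)

print(numbers_match("Refunds take 30 days.", "Refunds take 30 days."))  # True
print(numbers_match("Refunds take 30 days.", "Refunds take 14 days."))  # False
```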
Taking ChatGPT as an example, regression testing becomes a massive task due to its extensive capabilities. The test cases grow significantly because of the broad domains, such as language fluency, reasoning, and factual accuracy. When a new feature is introduced, such as improved contextual memory, the test cases must expand to cover these changes, and it will take longer to complete the process.
LLMs are improved over time by training on large and diverse datasets. However, during this process, the model can inherit biases from the data it learns from. Identifying these biases during regression testing can be challenging. To address this, continuous monitoring and testing are required.
Practical uses of regression testing for LLMs
LLM regression testing is used in the question-answering domain to ensure the model continues to perform well after modifications. This process helps identify any issues or errors that may arise due to updates. For example, customer support chatbots use LLMs to assist customers with inquiries by providing relevant and accurate answers. Since businesses rely heavily on these chatbots to deliver precise information, performing regression testing is crucial. It ensures the model provides up-to-date and correct responses after each update without introducing errors or misleading information.
Conclusion
LLM regression testing has its benefits and challenges, but it plays a crucial role in ensuring that model updates remain reliable. It helps maintain accuracy so that people continue to receive trustworthy results as the model evolves. Throughout this topic, you’ve explored the LLM regression testing process and the challenges involved, giving you insight into how regression testing keeps LLMs dependable over time.