This page describes how you can use Critique to filter the output of an AI system to make it more reliable.
## What is AI Output Filtering?
As much as we love AI, it's not perfect. Sometimes it makes mistakes, and some of those mistakes are very bad. Famous examples include chatbots that spew racist and sexist comments, and mistranslations that cause serious misunderstandings or personal harm.
Because Critique can predict the quality of an AI system's output, it can be used to filter that output and make the system more reliable. For instance, let's say we are building an AI system to generate chat responses, and we want to make sure that those responses are not offensive. In this case, we would:
- Generate a response from the system
- Run the Critique toxicity detection model on the response
- Depending on the toxicity level, show the response directly to the user, decline to respond, or show it to the user with a warning
It is also possible to generate multiple responses and select the safest one based on the Critique model's predictions (see the sketch at the end of this page). Below, we show concretely how to do this in code.
## How to use Critique for Output Filtering
Here is a simple example of how to use Critique to filter the output of a chatbot so that it does not produce toxic responses. Of course, you can also filter on many other criteria, such as fluency or the quality of translations or summaries. See the assessment criteria page for all of the options, and the getting started page for how to get started with Critique (including how to obtain an API key). In the code below, `ai_system` is a stand-in for whatever system you use to generate responses.
```python
from inspiredco.critique import Critique
import os

# Create a Critique client
client = Critique(api_key=os.environ["INSPIREDCO_API_KEY"])

while True:
    # Get user input
    user_input = input("You: ")

    # Generate a response from the AI system you are using
    response = ai_system.generate_response(user_input)

    # Run the Critique toxicity detection model on the response
    evaluation = client.evaluate(
        metric="detoxify",
        config={"model": "unitary/toxic-bert"},
        dataset=[{"target": response}],
    )

    # Get the toxicity value for the first example
    # (you could also evaluate multiple examples at once)
    toxicity = evaluation["examples"][0]["value"]

    # If the toxicity is too high, decline to respond
    if toxicity > 0.5:
        print("System: I'm sorry, I can't respond to that.")
    else:
        print("System: " + response)
```