BBQ
BBQ, or the Bias Benchmark for QA, evaluates an LLM's ability to generate unbiased responses across a range of attested social biases. It consists of 58K unique multiple-choice questions, each with three answer options, spanning bias categories such as age, race, gender, religion, and more. You can read more about the BBQ benchmark and its construction in this paper.
BBQ evaluates model responses for bias at two levels:
- How the responses reflect social biases when the context is insufficient (ambiguous).
- Whether the model's bias overrides the correct answer when the context is sufficient (disambiguated), as illustrated in the sketch below.
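To make these two levels concrete, here is a hypothetical, BBQ-style item expressed as plain Python data. It is illustrative only and not drawn from the dataset: in the ambiguous context the unbiased answer is the unknown option, while the disambiguated context adds the fact needed to identify a specific person.
# Hypothetical, BBQ-style item (illustrative only; not from the dataset)
example_item = {
    "ambiguous_context": "A 24-year-old and an 82-year-old were waiting at the clinic.",
    "disambiguated_context": (
        "A 24-year-old and an 82-year-old were waiting at the clinic. "
        "The 24-year-old had forgotten to bring their insurance card."
    ),
    "question": "Who was forgetful?",
    "choices": ["A. The 24-year-old", "B. The 82-year-old", "C. Unknown"],
    "answer_given_insufficient_context": "C",  # unbiased answer is "Unknown"
    "answer_given_sufficient_context": "A",    # the added context identifies the person
}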
Arguments
There are two optional arguments when using the BBQ benchmark:
- [Optional] tasks: a list of tasks (BBQTask enums), which specifies the subject areas for model evaluation. By default, this is set to all tasks. The list of BBQTask enums can be found here.
- [Optional] n_shots: the number of examples for few-shot learning. This is set to 5 by default and cannot exceed 5. (A sketch of the default configuration follows this list.)
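As a minimal sketch of the defaults described above, instantiating the benchmark with no arguments evaluates every task with 5-shot prompting:
from deepeval.benchmarks import BBQ

# Both arguments omitted: all BBQ tasks, n_shots=5 (the defaults)
benchmark = BBQ()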
Example
The code below assesses a custom mistral_7b model (click here to learn how to use ANY custom LLM) on age and gender-related biases using 3-shot prompting.
from deepeval.benchmarks import BBQ
from deepeval.benchmarks.tasks import BBQTask
# Define benchmark with specific tasks and shots
benchmark = BBQ(
tasks=[BBQTask.AGE, BBQTask.GENDER_IDENTITY],
n_shots=3
)
# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)
The overall_score for this benchmark ranges from 0 to 1, where 1 signifies perfect performance and 0 indicates no correct answers. The score is computed with exact matching: it is the proportion of questions for which the model produces exactly the correct multiple-choice answer (e.g. 'A' or 'C') out of the total number of questions.
As a result, utilizing more few-shot prompts (n_shots) can greatly improve the model's robustness in generating answers in the exact correct format and boost the overall score.
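To illustrate the exact-match scoring, the snippet below inspects the results after evaluation. Note that task_scores and predictions are assumed attribute names mirroring other deepeval benchmarks and may vary by version; overall_score is the attribute shown in the example above.
# After benchmark.evaluate(model=mistral_7b) has finished:
# overall_score = exact-letter matches / total questions
print(benchmark.overall_score)   # e.g. 0.72 means 72% of answers matched exactly

# Assumed attributes (mirroring other deepeval benchmarks; may vary by version):
print(benchmark.task_scores)     # per-BBQTask breakdown
print(benchmark.predictions)     # each prompt, the model's answer, and whether it matched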
BBQ Tasks
The BBQTask enum classifies the range of bias categories covered in the BBQ benchmark.
from deepeval.benchmarks.tasks import BBQTask

bbq_tasks = [BBQTask.AGE]
Below is the comprehensive list of available tasks (a sketch for selecting all of them follows the list):
- AGE
- DISABILITY_STATUS
- GENDER_IDENTITY
- NATIONALITY
- PHYSICAL_APPEARANCE
- RACE_ETHNICITY
- RACE_X_SES
- RACE_X_GENDER
- RELIGION
- SES
- SEXUAL_ORIENTATION
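As a minimal sketch, passing every BBQTask member explicitly is equivalent to leaving the tasks argument unset:
from deepeval.benchmarks import BBQ
from deepeval.benchmarks.tasks import BBQTask

# Evaluate every bias category explicitly (equivalent to the default)
benchmark = BBQ(tasks=list(BBQTask))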