agentchat.contrib.agent_eval.agent_eval
generate_criteria
def generate_criteria(llm_config: Optional[Union[Dict, Literal[False]]] = None,
                      task: Task = None,
                      additional_instructions: str = "",
                      max_round=2,
                      use_subcritic: bool = False)
Creates a list of criteria for evaluating the utility of a given task.
Arguments:
- llm_config (dict or bool) - llm inference configuration.
- task (Task) - The task to evaluate.
- additional_instructions (str) - Additional instructions for the criteria agent.
- max_round (int) - The maximum number of rounds to run the conversation.
- use_subcritic (bool) - Whether to use the subcritic agent to generate subcriteria.
Returns:
list - A list of Criterion objects for evaluating the utility of the given task.
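A minimal usage sketch, not taken from the source docs: it assumes the Task model from autogen.agentchat.contrib.agent_eval.task (with name, description, successful_response, and failed_response fields) and a standard config_list-style llm_config; the task text and API key are placeholders, so adjust them to your installed AutoGen version.

```python
from autogen.agentchat.contrib.agent_eval.agent_eval import generate_criteria
from autogen.agentchat.contrib.agent_eval.task import Task

# Standard AutoGen llm_config shape; the API key is a placeholder.
llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_API_KEY"}]}

# Describe the task whose outputs you want to evaluate (illustrative values).
task = Task(
    name="Math problem solving",
    description="Given a math word problem, produce a correct, well-explained solution.",
    successful_response="x = 4, derived step by step from 2x + 1 = 9.",
    failed_response="The answer is 7.",
)

criteria = generate_criteria(
    llm_config=llm_config,
    task=task,
    additional_instructions="Focus on correctness and clarity of reasoning.",
    max_round=2,
    use_subcritic=False,
)

# Each Criterion is expected to carry a name and a set of accepted values.
for criterion in criteria:
    print(criterion.name, criterion.accepted_values)
```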
quantify_criteria
def quantify_criteria(llm_config: Optional[Union[Dict, Literal[False]]] = None,
                      criteria: List[Criterion] = None,
                      task: Task = None,
                      test_case: str = "",
                      ground_truth: str = "")
Quantifies the performance of a system using the provided criteria.
Arguments:
- llm_config (dict or bool) - llm inference configuration.
- criteria ([Criterion]) - A list of criteria for evaluating the utility of a given task.
- task (Task) - The task to evaluate.
- test_case (str) - The test case to evaluate.
- ground_truth (str) - The ground truth for the test case.
Returns:
dict - A dictionary where the keys are the criteria and the values are the assessed performance based on the accepted values for each criterion.
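A follow-on sketch that scores a single test case against the criteria produced by generate_criteria above; the test_case string and ground_truth value are illustrative placeholders, and llm_config, task, and criteria are reused from the previous example.

```python
from autogen.agentchat.contrib.agent_eval.agent_eval import quantify_criteria

# A single model response to assess (placeholder content).
test_case = "Problem: solve 2x + 1 = 9. Response: x = 4, because 2x = 8 and x = 8 / 2."

result = quantify_criteria(
    llm_config=llm_config,
    criteria=criteria,
    task=task,
    test_case=test_case,
    ground_truth="x = 4",
)

# Per the docstring, the result maps each criterion to the assessed value
# drawn from that criterion's accepted values.
print(result)
```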