Compare Audio Responses based on Africa Medical Quality

A bulk evaluator workflow that compares AI-generated answers (copilot responses) against a set of golden reference answers. The input data must contain two columns: "input_prompt" (the question or task) and "reference_answer" (the ideal response). Custom evaluation prompts compare each output to its reference, scoring it for accuracy and penalizing hallucinations. The per-row scores are then aggregated into an overall performance metric for your AI answers.
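The score-then-aggregate flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the workflow's actual implementation: the workflow scores answers with LLM evaluation prompts, whereas the sketch substitutes a simple token-overlap F1 as a stand-in scorer. The "copilot_response" column name, the function names, and the sample rows are all hypothetical; only "input_prompt" and "reference_answer" come from the description.

```python
from statistics import mean


def overlap_score(answer: str, reference: str) -> float:
    """Stand-in scorer (assumption): token-level F1 between answer and reference.

    The real workflow uses a custom LLM evaluation prompt here instead.
    """
    ans = set(answer.lower().split())
    ref = set(reference.lower().split())
    common = ans & ref
    if not common:
        return 0.0
    precision = len(common) / len(ans)  # drops when the answer adds unsupported tokens
    recall = len(common) / len(ref)     # drops when the answer misses reference content
    return 2 * precision * recall / (precision + recall)


def evaluate(rows, scorer=overlap_score):
    """Score every row against its reference, then aggregate with the mean."""
    # "copilot_response" is a hypothetical column name for the AI-generated answer.
    required = {"input_prompt", "reference_answer", "copilot_response"}
    scores = []
    for i, row in enumerate(rows):
        missing = required - row.keys()
        if missing:
            raise ValueError(f"row {i} is missing columns: {sorted(missing)}")
        scores.append(scorer(row["copilot_response"], row["reference_answer"]))
    return {"scores": scores, "mean": mean(scores) if scores else None}


# Made-up sample rows, for illustration only.
rows = [
    {
        "input_prompt": "How should hands be cleaned?",
        "reference_answer": "wash hands with soap and water",
        "copilot_response": "wash hands with soap and water",
    },
    {
        "input_prompt": "How should hands be cleaned?",
        "reference_answer": "wash hands with soap and water",
        "copilot_response": "no cleaning needed",
    },
]

result = evaluate(rows)
print(result["scores"], result["mean"])
```

Passing the scorer as a parameter keeps the aggregation step independent of how each row is judged, so an LLM-based judge could be swapped in for the overlap heuristic without touching the rest of the sketch.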

Input Data Spreadsheet
Input Data Preview


Evaluation Prompts

Aggregations

Run cost = 30 credits

With each run, you agree to Gooey.AI's terms & privacy policy.




Compare Medical Answer Quality Aggregate: Mean


Compare Medical Answer Quality

Related Workflows