2 results

This recipe is used with https://gooey.ai/bulk to evaluate the latest private and open-source speech recognition models (from Google, Meta, OpenAI and others). It takes a CSV file of golden (i.e. human-provided) translations, compares them against a set of AI-generated translations, and scores each comparison from 0 to 1. It then takes the mean of each model's scores to determine which model performed best.

⚖️

212 runs
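
The scoring flow described for the speech recognition recipe above can be sketched roughly as follows. This is only an illustration: the recipe's actual metric is not stated here, so `difflib.SequenceMatcher` stands in for it, and the column names (`golden`, `model_a`, `model_b`) and the sample rows are hypothetical.

```python
import csv
import io
from difflib import SequenceMatcher
from statistics import mean


def similarity(reference: str, candidate: str) -> float:
    """Stand-in score in [0, 1] based on word-level overlap (an assumption,
    not the recipe's real metric)."""
    ref_words = reference.lower().split()
    cand_words = candidate.lower().split()
    return SequenceMatcher(None, ref_words, cand_words).ratio()


def rank_models(csv_text: str, golden_col: str = "golden") -> dict[str, float]:
    """Return the mean score per model column, compared against the golden column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    # Every column other than the golden one is treated as one model's output.
    model_cols = [col for col in rows[0] if col != golden_col]
    return {
        model: mean(similarity(row[golden_col], row[model]) for row in rows)
        for model in model_cols
    }


# Hypothetical sample data: one golden translation and two model outputs per row.
SAMPLE = """golden,model_a,model_b
the cat sat on the mat,the cat sat on the mat,a cat sits on a mat
good morning,good morning,good evening
"""

scores = rank_models(SAMPLE)
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

With this sample, `model_a` matches every golden translation exactly, so it gets the higher mean and is picked as the best model.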

A bulk evaluator workflow that compares AI-generated answers (copilot responses) to a set of golden reference answers. It requires two input data columns: "input_prompt" (the question or task) and "reference_answer" (the ideal response). The workflow uses custom evaluation prompts to compare each output with its reference, scoring it for accuracy and penalizing hallucinations. It then aggregates the results into an overall performance metric for your AI answers.

⚖️

88 runs
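
A minimal sketch of how the bulk evaluator above might be structured, under several assumptions. Only the "input_prompt" and "reference_answer" columns come from the description; the `ai_answer` column, the prompt wording, and `toy_judge` are hypothetical. In a real run the evaluation prompt would go to a language model, whereas here a simple word-overlap function stands in for the judge so the example can run offline.

```python
from statistics import mean
from typing import Callable

REQUIRED_COLUMNS = ("input_prompt", "reference_answer")

# Hypothetical evaluation prompt; the workflow's real prompts are user-defined.
EVAL_PROMPT = (
    "Question: {input_prompt}\n"
    "Reference answer: {reference_answer}\n"
    "AI answer: {ai_answer}\n"
    "Rate the AI answer from 0 to 1 for accuracy. "
    "Lower the score for claims the reference does not support."
)


def build_prompt(row: dict[str, str]) -> str:
    """Fill the evaluation prompt for one row (what an LLM judge would receive)."""
    return EVAL_PROMPT.format(**row)


def toy_judge(row: dict[str, str]) -> float:
    """Offline stand-in for an LLM judge: reward reference words that appear in
    the answer, and scale the score down for words the reference lacks."""
    ref = set(row["reference_answer"].lower().split())
    ans = set(row["ai_answer"].lower().split())
    if not ref or not ans:
        return 0.0
    recall = len(ref & ans) / len(ref)
    unsupported = len(ans - ref) / len(ans)
    return recall * (1.0 - unsupported)


def evaluate(rows: list[dict[str, str]], judge: Callable[[dict[str, str]], float]) -> float:
    """Check the required columns, score each row, and return the mean score."""
    for row in rows:
        missing = [col for col in REQUIRED_COLUMNS if col not in row]
        if missing:
            raise ValueError(f"missing required columns: {missing}")
    return mean(judge(row) for row in rows)


# Hypothetical sample rows.
rows = [
    {"input_prompt": "What is the capital of France?",
     "reference_answer": "Paris", "ai_answer": "Paris"},
    {"input_prompt": "What is 2 + 2?",
     "reference_answer": "4", "ai_answer": "4 or maybe 5"},
]

overall = evaluate(rows, toy_judge)
print(build_prompt(rows[0]))
print("overall score:", overall)
```

Passing the judge in as a callable keeps the aggregation step separate from the scoring step, so the stand-in can be swapped for a model-backed judge without changing `evaluate`.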