8 results

Evaluate any Gooey.AI Workflow output against a dataset of inputs and "golden" (expert-created) desired answers. Score every row of any CSV, Google Sheet, or Excel file with any LLM-as-Judge instruction prompt, then average the scores in any column to generate automated evaluations.
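As a rough illustration of the "average the scores in a column" step, here is a minimal Python sketch. It is not Gooey.AI's implementation: the function name `average_score_columns` and the column names `score_model_a` and `score_model_b` are hypothetical, and the per-row scores are assumed to have already been produced by a judge prompt.

```python
import csv
import io
from statistics import mean


def average_score_columns(csv_text, score_columns):
    """Return the mean of each named score column, skipping blank cells."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    averages = {}
    for column in score_columns:
        # Keep only rows that actually have a value in this column.
        values = [
            float(row[column])
            for row in rows
            if (row.get(column) or "").strip()
        ]
        averages[column] = mean(values) if values else None
    return averages


# Hypothetical sheet: one row per input, one score column per model.
sample = """input,golden_answer,score_model_a,score_model_b
q1,a1,0.9,0.6
q2,a2,0.7,0.8
"""

averages = average_score_columns(sample, ["score_model_a", "score_model_b"])
print(averages)
```

Columns with no scored rows come back as `None` rather than 0, so an unscored model is not mistaken for a poorly scoring one.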

⚖️

Updated to Include Gemini3, GPT5.1, LLaMA, Deepseek3.1

967 runs

Our general bulk evaluator for comparing AI-generated copilot answers against a collection of golden answers.

⚖️

416 runs

This recipe is used with https://gooey.ai/bulk to evaluate the latest private and open-source speech recognition models (from Google, Meta, OpenAI, and others). It takes a CSV file of golden (i.e. human-provided) translations and compares them against a set of AI-created translations to generate scores from 0 to 1. It then takes the mean of the scores to determine which model performed best.
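The compare-then-average idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the recipe itself scores with an LLM prompt, whereas the sketch swaps in `difflib.SequenceMatcher` word overlap as a stand-in scorer that also yields values from 0 to 1. The model names and sample sentences are made up.

```python
from difflib import SequenceMatcher
from statistics import mean


def similarity(golden, candidate):
    """Stand-in scorer: word-level overlap ratio between 0 and 1."""
    return SequenceMatcher(
        None, golden.lower().split(), candidate.lower().split()
    ).ratio()


def rank_models(golden, outputs_by_model):
    """Average the per-row scores for each model and pick the highest mean."""
    means = {
        model: mean(similarity(g, c) for g, c in zip(golden, outputs))
        for model, outputs in outputs_by_model.items()
    }
    best = max(means, key=means.get)
    return best, means


# Hypothetical golden translations and two hypothetical models' outputs.
golden = ["the cat sat on the mat", "good morning everyone"]
outputs_by_model = {
    "model_x": ["the cat sat on the mat", "good morning everyone"],
    "model_y": ["the cat sat on a mat", "good evening everyone"],
}

best, means = rank_models(golden, outputs_by_model)
print(best, means)
```

Any scorer that returns a number per row can replace `similarity`; the averaging and ranking steps stay the same.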

⚖️

212 runs

A bulk evaluator workflow that compares AI-generated answers (copilot responses) to a set of golden reference answers. It requires two input data columns: "input_prompt" (the question or task) and "reference_answer" (the ideal response). The workflow uses custom evaluation prompts to compare outputs, scoring them for accuracy and penalizing hallucinations, and aggregates the results into an overall performance metric for your AI answers.

⚖️

88 runs

Here we compare the top 5 ASR models on a set of Telugu samples. The speech output was created with https://gooey.ai/bulk/?example_id=nrkx2u17

⚖️

308 runs

Here we compare the top 3 ASR models on a set of Kannada samples. The speech output was created with https://gooey.ai/bulk/?example_id=m8c3mb98

⚖️

Here we compare the top 6 ASR models on a set of Hindi samples. The speech translations were created with https://gooey.ai/bulk/?example_id=ueki9up0.

⚖️