Overview - Fetch Hive

Use Experiments when you want to compare prompts or agents against the same set of inputs. An experiment lets you:

upload or select a dataset
add dashboard prompt, deployed prompt, and agent candidates
run every dataset row against every candidate
review outputs, usage, cost, and failures in one place

Experiments are useful when you want to test quality before publishing a change, or when you want to compare models, prompt versions, deployed prompt variants, and agent configurations.

What you’ll find here

Datasets — Upload CSV files, map columns, and understand dataset format
Add candidates — Add dashboard prompts, deployed prompts, and agents
Build an experiment — Create an experiment and prepare it for a run
Run an experiment — Start, track, and cancel experiment runs
Review results — Compare outputs, open request details, and inspect failures
Run analytics — Compare run cost, tokens, latency, and success rate
Evaluators — Understand current evaluator status and planned evaluator types

How experiments work

An experiment combines a dataset with one or more candidates.

A dataset is a set of rows. Each row contains input values, optional expected output, and optional metadata.

A candidate is the prompt or agent you want to test. Fetch Hive captures a snapshot when you add the candidate so later edits to the source do not change that candidate inside the experiment.

A run executes the dataset against the candidates. If you have 100 dataset rows and three candidates, the run has 300 result cells.

Your current plan limits how many result cells a new run can create. Existing experiments and past runs remain available if your plan changes, but new runs must fit your current plan.

Each result cell stores the candidate output, status, duration, usage, cost, and links to request details when available.
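The result-cell arithmetic above can be sketched in a few lines of Python. This is only an illustration for sizing a run before you start it, not Fetch Hive code: the function names and the `plan_cell_limit` value are hypothetical, and the real limit depends on your plan.

```python
def result_cell_count(row_count: int, candidate_count: int) -> int:
    """Every dataset row runs against every candidate, so cells = rows * candidates."""
    if row_count < 0 or candidate_count < 0:
        raise ValueError("counts must be non-negative")
    return row_count * candidate_count


def fits_plan(row_count: int, candidate_count: int, plan_cell_limit: int) -> bool:
    """Return True if a new run would stay within the plan's result-cell limit.

    plan_cell_limit is an assumed placeholder; check your plan for the real value.
    """
    return result_cell_count(row_count, candidate_count) <= plan_cell_limit


# The example from the text: 100 dataset rows and three candidates.
cells = result_cell_count(100, 3)
print(cells)  # 300

# With an assumed limit of 250 cells, this run would not fit.
print(fits_plan(100, 3, plan_cell_limit=250))  # False
```

Because the count is a product, removing one candidate or trimming the dataset both reduce the run size proportionally, which is the simplest way to bring a run back under a limit.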

Current scope

Experiments currently run dashboard prompt drafts, prompt versions, deployed prompt versions, and agent snapshots.

Agent snapshots run as isolated single-shot calls. They do not write to the source agent’s dashboard chat history, and each dataset row starts without memory from other rows.

Workflow candidates, evaluator execution, experiment-local model overrides, and custom evaluator code are planned future additions.

Existing unsupported legacy candidates may still appear in older experiments for history; archive them before starting a new run.

See also: Prompts, Publishing and versioning, and Log history.
