A run is a single execution of a workflow. Each time a workflow fires — on its schedule or triggered manually — Armature dispatches a tester agent, records every tool call and trace event, and then hands the transcript to a judge model that evaluates each of your criteria. The result is a structured run record you can inspect, filter, and compare against earlier baselines to spot exactly when and how behavior changed.
Run statuses
Every run ends in one of four terminal statuses:
Pass — all required criteria satisfied
Every required criterion received a passing verdict from the judge model. The run is green. If some optional criteria failed, the run still shows as pass — only required criteria gate the overall status.
Partial — some criteria failed
At least one required criterion received a failing verdict, but not all of them. The run completed without an infrastructure error. Partial runs warrant review: open the run detail to see which specific criteria did not pass and what evidence the judge cited.
Fail — required criteria failed
One or more required criteria failed and the overall verdict is a failure. The agent completed its turn but the outcomes did not meet your rubric. Check the per-criterion verdict and the tool-call trace to understand what the agent did versus what was expected.
Error — infrastructure or transport failure
The run could not complete because of a transport error, an unreachable MCP server, an authentication failure, or another infrastructure problem. The agent did not produce a final answer. Error runs do not count against your pass rate but do appear in the run timeline so you can correlate them with server health events.
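Because error runs are excluded from the pass rate, the denominator is only the runs that reached a scored verdict. A minimal sketch of that arithmetic (illustrative only, not Armature's implementation):

```python
from collections import Counter

def pass_rate(statuses):
    """Pass rate over a list of terminal statuses.

    Error runs are left out of the denominator entirely: they signal
    infrastructure trouble, not agent behavior.
    """
    counts = Counter(statuses)
    scored = counts["pass"] + counts["partial"] + counts["fail"]
    if scored == 0:
        return None  # nothing but errors (or no runs at all)
    return counts["pass"] / scored

# Two passes, one fail, one error: the error run is ignored,
# so the rate is 2/3 rather than 2/4.
print(round(pass_rate(["pass", "pass", "fail", "error"]), 3))  # prints 0.667
```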
What you see in a run
Opening a run from the Run history page or the History tab of a workflow shows the full run detail. The detail view contains:
- Prompt — the exact tester prompt used for this run, from the workflow version that was active at dispatch time.
- Criteria list — every criterion with its individual pass / fail / partial verdict and the judge’s explanation.
- Tool-call trace — an ordered list of every tool the agent called, including arguments, responses, and timing.
- Trace events — lifecycle events emitted during the run (e.g. agent turn start, tool dispatch, evaluation start).
- Model — the tester model and judge model used.
- Duration — wall-clock time from dispatch to terminal status.
- Cost — estimated token cost for the tester and judge calls combined.
The criteria list uses the criterion text from the workflow version active at run time. If you have edited the workflow since this run, the criteria shown in the run detail may differ from the current version’s criteria.
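Taken together, the fields above form one structured record per run. A rough sketch of that shape, with all field names chosen for illustration (the real run record's schema may differ):

```python
from dataclasses import dataclass

@dataclass
class CriterionVerdict:
    text: str          # criterion text from the version active at run time
    required: bool     # only required criteria gate the overall status
    verdict: str       # "pass" | "fail" | "partial"
    explanation: str   # the judge's explanation and cited evidence

@dataclass
class ToolCall:
    name: str
    arguments: dict
    response: str
    duration_ms: int

@dataclass
class RunRecord:
    status: str                        # pass | partial | fail | error
    prompt: str                        # tester prompt at dispatch time
    criteria: list[CriterionVerdict]
    tool_calls: list[ToolCall]
    tester_model: str
    judge_model: str
    duration_s: float                  # dispatch to terminal status
    cost_usd: float                    # tester + judge token cost combined
```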
The Monitoring dashboard
The Monitoring dashboard is your at-a-glance view of the health of all your workflows. It shows:
- Summary stats — total runs, overall success rate, failed run count, and median run duration for the selected time range.
- Activity timeline — a bar chart of runs over time, color-coded by status (successful, failed, running, pending). You can drag to select a sub-range on the chart and zoom in.
- Top failing workflows — a ranked list of the workflows with the most failures in the selected time range, with a link to filtered failed runs for each.
- Recent runs — the eight most recent runs across all workflows, with status, duration, and relative timestamp.
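How the summary-stat numbers relate to each other can be sketched as follows; the status labels and `(status, duration)` pair shape are assumptions for illustration, not the product's code:

```python
import statistics

def summary_stats(runs):
    """runs: list of (status, duration_s) pairs within the selected time range."""
    failed = sum(1 for s, _ in runs if s == "failed")
    succeeded = sum(1 for s, _ in runs if s == "successful")
    finished = succeeded + failed  # running/pending runs don't affect the rate
    return {
        "total_runs": len(runs),
        "success_rate": succeeded / finished if finished else None,
        "failed_runs": failed,
        "median_duration_s": statistics.median(d for _, d in runs) if runs else None,
    }

stats = summary_stats([("successful", 12.0), ("failed", 40.0), ("successful", 20.0)])
print(stats["failed_runs"], stats["median_duration_s"])  # prints: 1 20.0
```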
Time range filtering
Use the time picker in the top-right of the Monitoring page to scope all dashboard panels to a specific window. Available relative ranges are:

| Option | Window |
|---|---|
| Past 15 minutes | 15m |
| Past 30 minutes | 30m |
| Past 1 hour | 1h |
| Past 4 hours | 4h |
| Past 12 hours | 12h |
| Past 24 hours | 24h |
| Past 48 hours | 48h (default) |
| Past 7 days | 7d |
| Past 30 days | 30d |
Filtering runs in the run history
The Run history page lets you filter runs across all workflows with four independent filters:
- Workflow — scope to a single workflow.
- Model — scope to runs from a specific tester model.
- Status — filter by success, failed, running, or pending.
- Time range — last hour, last 24 hours, last 7 days, last 30 days, or all time.
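Because the four filters are independent, they compose as separate predicates: a run is shown only if it satisfies every filter you have set. A sketch, with the run-record keys assumed for illustration:

```python
def filter_runs(runs, workflow=None, model=None, status=None, since=None):
    """Apply any subset of the four Run-history filters.

    None means "no filter on this axis". The keys on each run dict
    (workflow, model, status, started_at) are illustrative names.
    """
    def keep(r):
        return ((workflow is None or r["workflow"] == workflow)
                and (model is None or r["model"] == model)
                and (status is None or r["status"] == status)
                and (since is None or r["started_at"] >= since))
    return [r for r in runs if keep(r)]

runs = [
    {"workflow": "checkout", "model": "m1", "status": "failed", "started_at": 100},
    {"workflow": "checkout", "model": "m2", "status": "successful", "started_at": 200},
    {"workflow": "search", "model": "m1", "status": "failed", "started_at": 300},
]
print(len(filter_runs(runs, workflow="checkout", status="failed")))  # prints 1
```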
Inspecting a run
Open any run from the run history table or from a direct link to see the full detail view. The detail view is organized into tabs:
Criteria — per-criterion verdicts
Each criterion is listed with its text, the judge’s verdict (pass / fail / partial), and the evaluator’s explanation. The explanation cites the specific tool call or trace event that informed the verdict. Use this to understand not just whether a criterion failed but why.
Trace — tool calls and events
The trace tab shows every tool call in order: the tool name, arguments, response body, and duration. Trace events are interspersed in timeline order. This is the raw transcript of what the agent did — the ground truth the judge model evaluates against.
Overview — cost, duration, model
The overview tab shows the tester model, judge model, total cost, and wall-clock duration. Cost is broken down by tester and evaluator token usage so you can identify expensive runs.
Comparing runs
Comparing two runs is the most efficient way to identify what changed between a passing baseline and a failing run. Armature aligns criteria by id for runs on the same workflow version, by normalized text across versions, and by position when text is unavailable. Tool-call deltas highlight which tools appeared, disappeared, or changed status between the two runs.

From the dashboard UI: open the run detail for the newer run and use the Compare action to select a baseline run from the same workflow. Via the MCP API: use the compare_runs tool with the new run id and a baseline run id:
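As a sketch, an MCP tools/call request for compare_runs could be built like this. The argument names (run_id, baseline_run_id) and the placeholder ids are assumptions for illustration — consult the MCP API: runs page for the authoritative schema:

```python
import json

# Hypothetical JSON-RPC payload for the compare_runs tool call.
# Argument names and ids are illustrative, not confirmed by the docs.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "compare_runs",
        "arguments": {
            "run_id": "run_new_123",            # the newer (failing) run
            "baseline_run_id": "run_base_456",  # the passing baseline
        },
    },
}
print(json.dumps(request, indent=2))
```

The response would carry the aligned per-criterion verdicts and the tool-call delta described above, ready to render or diff programmatically.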
Related pages
- Workflow overview — how workflows are structured and what the list view shows.
- Authoring workflows — write criteria that produce clear, actionable run verdicts.
- MCP API: runs — trigger, inspect, and compare runs programmatically.
- MCP API: repair tools — propose and apply workflow patches in response to failing runs.