A run is a single execution of a workflow. Each time a workflow fires — on its schedule or triggered manually — Armature dispatches a tester agent, records every tool call and trace event, and then hands the transcript to a judge model that evaluates each of your criteria. The result is a structured run record you can inspect, filter, and compare against earlier baselines to spot exactly when and how behavior changed.
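The later examples on this page work against a hypothetical shape for that run record. The sketch below is an illustration only: the field names are assumptions chosen to match the concepts described on this page, not Armature's actual schema.

// Hypothetical shape of a run record, for illustration only.
// Field names are assumptions, not Armature's schema.
type CriterionVerdict = "pass" | "fail" | "partial";

interface RunRecord {
  id: string;
  workflowId: string;
  workflowName: string;
  serverName: string;        // MCP server the workflow exercises
  model: string;             // tester model
  status: string;            // e.g. "pass", "partial", "fail", "error" (see Run statuses)
  criteria: {
    text: string;            // criterion text from the workflow version active at run time
    required: boolean;
    verdict: CriterionVerdict;
    explanation: string;     // judge's explanation and cited evidence
  }[];
  toolCalls: {
    name: string;
    arguments: unknown;
    response: unknown;
    durationMs: number;
  }[];
  startedAt: string;         // ISO 8601 dispatch time
  durationMs: number;        // wall-clock time from dispatch to terminal status
  costUsd: number;           // estimated tester + judge token cost
}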

Run statuses

Every run ends in one of four terminal statuses:
  • Pass — every required criterion received a passing verdict from the judge model. The run is green. If some optional criteria failed, the run still shows as pass; only required criteria gate the overall status.
  • Partial — at least one required criterion received a failing verdict, but not all of them. The run completed without an infrastructure error. Partial runs warrant review: open the run detail to see which specific criteria did not pass and what evidence the judge cited.
  • Fail — one or more required criteria failed and the overall verdict is a failure. The agent completed its turn, but the outcomes did not meet your rubric. Check the per-criterion verdicts and the tool-call trace to understand what the agent did versus what was expected.
  • Error — the run could not complete because of a transport error, an unreachable MCP server, an authentication failure, or another infrastructure problem. The agent did not produce a final answer. Error runs do not count against your pass rate but do appear in the run timeline so you can correlate them with server health events (see the sketch after this list).
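Because error runs are excluded from the pass rate, a success-rate calculation over raw run records needs to drop them from the denominator. Below is a minimal sketch of that exclusion, using the hypothetical RunRecord shape above; it illustrates the documented rule, not how Armature computes its dashboard stats.

// Error runs are excluded from the pass-rate denominator, per the rule above.
// Illustrative only; assumes the hypothetical RunRecord shape sketched earlier.
function passRate(runs: RunRecord[]): number {
  const scored = runs.filter((r) => r.status !== "error"); // errors don't count
  if (scored.length === 0) return 0;
  // Whether partial runs should count as passes is your call; here they do not.
  const passed = scored.filter((r) => r.status === "pass").length;
  return passed / scored.length;
}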

What you see in a run

Opening a run from the Run history page or the History tab of a workflow shows the full run detail. The detail view contains:
  • Prompt — the exact tester prompt used for this run, from the workflow version that was active at dispatch time.
  • Criteria list — every criterion with its individual pass / fail / partial verdict and the judge’s explanation.
  • Tool-call trace — an ordered list of every tool the agent called, including arguments, responses, and timing.
  • Trace events — lifecycle events emitted during the run (e.g. agent turn start, tool dispatch, evaluation start).
  • Model — the tester model and judge model used.
  • Duration — wall-clock time from dispatch to terminal status.
  • Cost — estimated token cost for the tester and judge calls combined.
The criteria list uses the criterion text from the workflow version active at run time. If you have edited the workflow since this run, the criteria shown in the run detail may differ from the current version’s criteria.

The Monitoring dashboard

The Monitoring dashboard is your at-a-glance view of the health of all your workflows. It shows:
  • Summary stats — total runs, overall success rate, failed run count, and median run duration for the selected time range.
  • Activity timeline — a bar chart of runs over time, color-coded by status (successful, failed, running, pending). You can drag to select a sub-range on the chart and zoom in.
  • Top failing workflows — a ranked list of the workflows with the most failures in the selected time range, with a link to filtered failed runs for each (see the sketch after this list).
  • Recent runs — the eight most recent runs across all workflows, with status, duration, and relative timestamp.
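The summary stats and the top-failing-workflows panel boil down to simple aggregations over run records. The sketch below illustrates the kind of computation involved, again over the hypothetical RunRecord shape; it is not how Armature's dashboard is implemented.

// Illustrative aggregations: median run duration and a ranked list of the
// workflows with the most failed runs. Assumes the hypothetical RunRecord shape.
function medianDurationMs(runs: RunRecord[]): number {
  const sorted = runs.map((r) => r.durationMs).sort((a, b) => a - b);
  if (sorted.length === 0) return 0;
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function topFailingWorkflows(runs: RunRecord[], limit = 5): [string, number][] {
  const failures = new Map<string, number>();
  for (const r of runs) {
    if (r.status === "fail") {
      failures.set(r.workflowName, (failures.get(r.workflowName) ?? 0) + 1);
    }
  }
  return [...failures.entries()]
    .sort((a, b) => b[1] - a[1]) // most failures first
    .slice(0, limit);
}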

Time range filtering

Use the time picker in the top-right of the Monitoring page to scope all dashboard panels to a specific window. Available relative ranges are:
Option             Window
Past 15 minutes    15m
Past 30 minutes    30m
Past 1 hour        1h
Past 4 hours       4h
Past 12 hours      12h
Past 24 hours      24h
Past 48 hours      48h (default)
Past 7 days        7d
Past 30 days       30d
You can also switch to Absolute time mode and enter a custom start and end time. The dashboard refreshes automatically on short relative ranges: every minute for ranges under 30 minutes, every 5 minutes for ranges up to 3 hours, every 30 minutes for ranges up to 24 hours, and hourly for longer ranges.
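The refresh cadence maps directly from the length of the selected window. The sketch below restates those documented thresholds as a small helper; it is illustrative rather than part of Armature's API.

// Restates the documented auto-refresh thresholds; illustrative only.
function refreshIntervalMs(rangeMinutes: number): number {
  if (rangeMinutes < 30) return 60_000;            // under 30 minutes: every minute
  if (rangeMinutes <= 3 * 60) return 5 * 60_000;   // up to 3 hours: every 5 minutes
  if (rangeMinutes <= 24 * 60) return 30 * 60_000; // up to 24 hours: every 30 minutes
  return 60 * 60_000;                              // longer ranges: hourly
}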

Filtering runs in the run history

The Run history page lets you filter runs across all workflows with four independent filters:
  • Workflow — scope to a single workflow.
  • Model — scope to runs from a specific tester model.
  • Status — filter by success, failed, running, or pending.
  • Time range — last hour, last 24 hours, last 7 days, last 30 days, or all time.
You can also search by run id, workflow name, or server name using the search box. All filters combine, so you can find, for example, all failed runs from a specific model on a specific workflow in the past 7 days.
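Because the filters combine with a logical AND, the result set is simply the runs that satisfy every active filter at once. The sketch below expresses that combination over the hypothetical RunRecord shape from earlier; the field and filter names are assumptions, not the Run history page's query API.

// Illustrative: all active filters must match (logical AND), plus a free-text
// search over run id, workflow name, and server name. Hypothetical shapes only.
interface RunFilters {
  workflowId?: string;
  model?: string;          // tester model
  status?: string;
  since?: Date;            // start of the selected time range
  search?: string;
}

function filterRuns(runs: RunRecord[], f: RunFilters): RunRecord[] {
  const q = f.search?.toLowerCase();
  return runs.filter((r) =>
    (!f.workflowId || r.workflowId === f.workflowId) &&
    (!f.model || r.model === f.model) &&
    (!f.status || r.status === f.status) &&
    (!f.since || new Date(r.startedAt).getTime() >= f.since.getTime()) &&
    (!q || [r.id, r.workflowName, r.serverName].some((v) => v.toLowerCase().includes(q)))
  );
}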

Inspecting a run

Open any run from the run history table or from a direct link to see the full detail view. The detail view is organized into tabs:
The criteria tab lists each criterion with its text, the judge’s verdict (pass / fail / partial), and the evaluator’s explanation. The explanation cites the specific tool call or trace event that informed the verdict. Use this to understand not just whether a criterion failed but why.
The trace tab shows every tool call in order: the tool name, arguments, response body, and duration. Trace events are interspersed in timeline order. This is the raw transcript of what the agent did — the ground truth the judge model evaluates against.
The overview tab shows the tester model, judge model, total cost, and wall-clock duration. Cost is broken down by tester and evaluator token usage so you can identify expensive runs.
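A common triage pattern is to pull the failing required criteria and the trace together in one view. The sketch below shows that walk over the hypothetical RunRecord shape used throughout this page; it is an illustration, not an Armature API.

// Illustrative triage: print failing required criteria with the judge's
// explanation, then the ordered tool-call trace. Hypothetical RunRecord shape.
function printFailures(run: RunRecord): void {
  for (const c of run.criteria) {
    if (c.required && c.verdict !== "pass") {
      console.log(`[${c.verdict}] ${c.text}`);
      console.log(`  judge: ${c.explanation}`);
    }
  }
  for (const call of run.toolCalls) {
    console.log(`${call.name} (${call.durationMs} ms)`);
  }
}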

Comparing runs

Comparing two runs is the most efficient way to identify what changed between a passing baseline and a failing run. Armature aligns criteria by id for runs on the same workflow version, by normalized text across versions, and by position when text is unavailable. Tool-call deltas highlight which tools appeared, disappeared, or changed status between the two runs.
  • From the dashboard UI — open the run detail for the newer run and use the Compare action to select a baseline run from the same workflow.
  • Via the MCP API — use the compare_runs tool with the new run id and a baseline run id:
compare_runs(runId: "<new-run-id>", baselineRunId: "<baseline-run-id>")
The response includes a criterion-level diff — which criteria changed from pass to fail or vice versa — and a tool-call delta summarizing which tools had changed outcomes. This is the recommended approach in automated repair loops where you want to confirm a patch actually resolved the regression.
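If you are scripting comparisons from your own MCP client, the call can look like the sketch below. The compare_runs tool name and its runId / baselineRunId parameters are the ones documented above; the client wiring uses the MCP TypeScript SDK and a placeholder endpoint URL, both of which you would swap for your own connection details.

// Calls the documented compare_runs tool from a generic MCP client.
// The transport setup and endpoint URL are placeholders, not Armature specifics.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StreamableHTTPClientTransport } from "@modelcontextprotocol/sdk/client/streamableHttp.js";

async function compareAgainstBaseline(newRunId: string, baselineRunId: string) {
  const client = new Client({ name: "run-diff-example", version: "1.0.0" });
  const transport = new StreamableHTTPClientTransport(
    new URL("https://example.invalid/mcp") // placeholder: your Armature MCP endpoint
  );
  await client.connect(transport);
  try {
    const result = await client.callTool({
      name: "compare_runs",
      arguments: { runId: newRunId, baselineRunId },
    });
    return result; // includes the criterion-level diff and tool-call delta
  } finally {
    await client.close();
  }
}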
When using the MCP repair flow (repair_failing_workflow prompt), compare_runs is called automatically after run_workflow_now so you get the before-and-after diff without extra steps.