# Understanding Armature workflow run results

> Interpret pass, partial, fail, and error statuses for workflow runs. Inspect tool-call traces, filter by time range, and compare runs to spot regressions.

A run is a single execution of a workflow. Each time a workflow fires — on its schedule or triggered manually — Armature dispatches a tester agent, records every tool call and trace event, and then hands the transcript to a judge model that evaluates each of your criteria. The result is a structured run record you can inspect, filter, and compare against earlier baselines to spot exactly when and how behavior changed.

## Run statuses

Every run ends in one of four terminal statuses:

<AccordionGroup>
  <Accordion title="Pass — all required criteria satisfied">
    Every required criterion received a passing verdict from the judge model. The run is green. If some optional criteria failed, the run still shows as pass — only required criteria gate the overall status.
  </Accordion>

  <Accordion title="Partial — some criteria failed">
    At least one required criterion received a failing verdict, but not all of them. The run completed without an infrastructure error. Partial runs warrant review: open the run detail to see which specific criteria did not pass and what evidence the judge cited.
  </Accordion>

  <Accordion title="Fail — required criteria failed">
    One or more required criteria failed and the overall verdict is a failure. The agent completed its turn but the outcomes did not meet your rubric. Check the per-criterion verdict and the tool-call trace to understand what the agent did versus what was expected.
  </Accordion>

  <Accordion title="Error — infrastructure or transport failure">
    The run could not complete because of a transport error, an unreachable MCP server, an authentication failure, or another infrastructure problem. The agent did not produce a final answer. Error runs do not count against your pass rate but do appear in the run timeline so you can correlate them with server health events.

    In-flight runs (queued, running, or evaluating) are also excluded from pass-rate calculations until they reach a terminal status, so a workflow with two passing runs and one still-running run reports 100%, not 67%.
  </Accordion>
</AccordionGroup>
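The pass-rate rule described under **Error** can be sketched in a few lines of Python. This is an illustration only, not Armature's implementation: the status strings and the `pass_rate` helper are hypothetical, and treating a partial run as a non-passing run in the denominator is an assumption that the text above does not confirm.

```python
from typing import Iterable, Optional

# Assumption: only these terminal statuses count toward the pass rate.
# Error runs and in-flight runs (queued, running, evaluating) are skipped.
COUNTED_STATUSES = {"pass", "partial", "fail"}


def pass_rate(statuses: Iterable[str]) -> Optional[float]:
    """Return the fraction of counted runs that passed, or None if none count."""
    counted = [s for s in statuses if s in COUNTED_STATUSES]
    if not counted:
        return None  # nothing has reached a countable terminal status yet
    passed = sum(1 for s in counted if s == "pass")
    return passed / len(counted)


# Two passing runs plus one still-running run: the running run is ignored.
rate = pass_rate(["pass", "pass", "running"])
```

Returning `None` instead of `0.0` when no run counts keeps "no data yet" distinct from "everything failed".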

## What you see in a run

Opening a run from the **Run history** page or the **History** tab of a workflow shows the full run detail. The detail view contains:

* **Prompt** — the exact tester prompt used for this run, from the workflow version that was active at dispatch time.
* **Criteria list** — every criterion with its individual pass / fail / partial verdict and the judge's explanation.
* **Tool-call trace** — an ordered list of every tool the agent called, including arguments, responses, and timing.
* **Trace events** — lifecycle events emitted during the run (e.g. agent turn start, tool dispatch, evaluation start).
* **Model** — the tester model and judge model used.
* **Duration** — wall-clock time from dispatch to terminal status.
* **Cost** — estimated token cost for the tester and judge calls combined.

<Note>
  The criteria list uses the criterion text from the workflow version active at run time. If you have edited the workflow since this run, the criteria shown in the run detail may differ from the current version's criteria.
</Note>

## The Monitoring dashboard

The **Monitoring** dashboard is your at-a-glance view of the health of all your workflows. It shows:

* **Summary stats** — total runs, overall success rate, failed run count, and median run duration for the selected time range.
* **Activity timeline** — a bar chart of runs over time, color-coded by status (successful, failed, running, pending). You can drag to select a sub-range on the chart and zoom in.
* **Top failing workflows** — a ranked list of the workflows with the most failures in the selected time range, with a link to filtered failed runs for each.
* **Recent runs** — the eight most recent runs across all workflows, with status, duration, and relative timestamp.

### Time range filtering

Use the time picker in the top-right of the Monitoring page to scope all dashboard panels to a specific window. Available relative ranges are:

| Option          | Window          |
| --------------- | --------------- |
| Past 15 minutes | 15m             |
| Past 30 minutes | 30m             |
| Past 1 hour     | 1h              |
| Past 4 hours    | 4h              |
| Past 12 hours   | 12h             |
| Past 24 hours   | 24h             |
| Past 48 hours   | 48h *(default)* |
| Past 7 days     | 7d              |
| Past 30 days    | 30d             |

You can also switch to **Absolute time** mode and enter a custom start and end time. In relative mode, the dashboard refreshes automatically, more often for shorter ranges: every minute for ranges under 30 minutes, every 5 minutes for ranges up to 3 hours, every 30 minutes for ranges up to 24 hours, and hourly for longer ranges.
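The refresh schedule above maps onto a small lookup. The sketch below is a hypothetical helper, not part of any Armature SDK. It parses the window shorthand from the table (`15m`, `48h`, `7d`) and assumes that "up to" is inclusive, so a 30-minute window falls into the 5-minute tier.

```python
import re

_UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}


def window_seconds(window: str) -> int:
    """Convert shorthand such as '15m', '48h', or '7d' into seconds."""
    match = re.fullmatch(r"(\d+)([mhd])", window)
    if match is None:
        raise ValueError(f"unrecognized window: {window!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def refresh_interval_seconds(window: str) -> int:
    """Return the auto-refresh interval for a relative time range."""
    span = window_seconds(window)
    if span < 30 * 60:        # under 30 minutes
        return 60             # every minute
    if span <= 3 * 3600:      # up to 3 hours (assumed inclusive)
        return 5 * 60         # every 5 minutes
    if span <= 24 * 3600:     # up to 24 hours (assumed inclusive)
        return 30 * 60        # every 30 minutes
    return 3600               # hourly for longer ranges
```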

## Filtering runs in the run history

The **Run history** page lets you filter runs across all workflows with four independent filters:

* **Workflow** — scope to a single workflow.
* **Model** — scope to runs from a specific tester model.
* **Status** — filter by success, failed, running, or pending.
* **Time range** — last hour, last 24 hours, last 7 days, last 30 days, or all time.

You can also search by run id, workflow name, or server name using the search box. All filters combine, so you can find, for example, all failed runs from a specific model on a specific workflow in the past 7 days.
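Combining filters means each active filter narrows the result and an unset filter matches everything. The sketch below shows that logic over an in-memory list. The `Run` record, its field names, and `filter_runs` are assumptions made for illustration and do not reflect Armature's data model or API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Run:
    """Hypothetical run record; the field names are illustrative only."""
    run_id: str
    workflow: str
    model: str
    status: str
    started_at: datetime


def filter_runs(
    runs: List[Run],
    workflow: Optional[str] = None,
    model: Optional[str] = None,
    status: Optional[str] = None,
    since: Optional[datetime] = None,
) -> List[Run]:
    """Keep runs that satisfy every filter that is set (logical AND)."""
    return [
        r for r in runs
        if (workflow is None or r.workflow == workflow)
        and (model is None or r.model == model)
        and (status is None or r.status == status)
        and (since is None or r.started_at >= since)
    ]
```

Calling `filter_runs(runs, workflow="checkout", model="model-a", status="failed", since=seven_days_ago)` would reproduce the example in the paragraph above, with `checkout` and `model-a` standing in for real names.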

## Archiving and restoring runs

Use the **Archive** action on a run row to hide it from the default **Run history** view without deleting the underlying trace. Archived runs stay in the database, remain linked from their workflow, and continue to count toward historical pass-rate calculations — they are just collapsed out of the active list.

To review or restore archived runs, toggle **Show archived** above the run table. Archived rows display an **Archived** badge and expose an **Unarchive** action that returns the run to the default view.

<Tip>
  Archiving is useful for one-off debug runs, runs from a deprecated workflow version, or any run you want to keep on file but stop seeing in day-to-day triage.
</Tip>

## Inspecting a run

Open any run from the run history table or from a direct link to see the full detail view. The detail view is organized into tabs:

<AccordionGroup>
  <Accordion title="Criteria — per-criterion verdicts">
    Each criterion is listed with its text, the judge's verdict (pass / fail / partial), and the judge's explanation. The explanation cites the specific tool call or trace event that informed the verdict. Use this to understand not just whether a criterion failed but why.
  </Accordion>

  <Accordion title="Trace — tool calls and events">
    The trace tab shows every tool call in order: the tool name, arguments, response body, and duration. Trace events are interspersed in timeline order. This is the raw transcript of what the agent did — the ground truth the judge model evaluates against.
  </Accordion>

  <Accordion title="Overview — cost, duration, model">
    The overview tab shows the tester model, judge model, total cost, and wall-clock duration. Cost is broken down by tester and judge token usage so you can identify expensive runs.
  </Accordion>
</AccordionGroup>

## Comparing runs

Comparing two runs is the most efficient way to identify what changed between a passing baseline and a failing run. Armature aligns criteria by id for runs on the same workflow version, by normalized text across versions, and by position when text is unavailable. Tool-call deltas highlight which tools appeared, disappeared, or changed status between the two runs.
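The alignment order described above (id, then normalized text, then position) can be pictured as a fallback chain. The sketch below is one possible reading, not Armature's algorithm. The criterion dictionaries, the `same_version` flag, and the normalization rule (lowercasing and collapsing whitespace) are all assumptions.

```python
from typing import Any, Dict, List, Optional, Tuple

Criterion = Dict[str, Any]


def _normalize(text: str) -> str:
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def alignment_key(criterion: Criterion, index: int, same_version: bool) -> Tuple[str, Any]:
    """Pick the strongest key available for matching a criterion."""
    if same_version and criterion.get("id"):
        return ("id", criterion["id"])             # same workflow version: by id
    text = criterion.get("text")
    if text:
        return ("text", _normalize(text))          # across versions: by text
    return ("position", index)                     # no text: by position


def align_criteria(
    baseline: List[Criterion], current: List[Criterion], same_version: bool
) -> List[Tuple[Optional[Criterion], Criterion]]:
    """Pair each current criterion with its baseline match, or None."""
    baseline_by_key = {
        alignment_key(c, i, same_version): c for i, c in enumerate(baseline)
    }
    return [
        (baseline_by_key.get(alignment_key(c, i, same_version)), c)
        for i, c in enumerate(current)
    ]
```

A criterion with no baseline match is paired with `None`, which a caller could report as newly added.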

**From the dashboard UI:** open the run detail for the newer run and use the **Compare** action to select a baseline run from the same workflow.

**Via the MCP API:** use the `compare_runs` tool with the new run id and a baseline run id:

```text
compare_runs(runId: "<new-run-id>", baselineRunId: "<baseline-run-id>")
```

The response includes a criterion-level diff — which criteria changed from pass to fail or vice versa — and a tool-call delta summarizing which tools had changed outcomes. This is the recommended approach in automated repair loops where you want to confirm a patch actually resolved the regression.
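To make the two parts of that diff concrete, the sketch below computes them locally from plain dictionaries. It does not reproduce the actual `compare_runs` response shape, which is not documented here. The function names, the input mappings (criterion id to verdict, tool name to status), and the output keys are all hypothetical.

```python
from typing import Dict, List


def verdict_changes(baseline: Dict[str, str], current: Dict[str, str]) -> Dict[str, List[str]]:
    """List criteria whose verdict moved from pass to non-pass, or back."""
    shared = sorted(set(baseline) & set(current))
    return {
        "regressed": [c for c in shared if baseline[c] == "pass" and current[c] != "pass"],
        "fixed": [c for c in shared if baseline[c] != "pass" and current[c] == "pass"],
    }


def tool_call_delta(baseline: Dict[str, str], current: Dict[str, str]) -> Dict[str, List[str]]:
    """Summarize tools that appeared, disappeared, or changed status."""
    return {
        "appeared": sorted(set(current) - set(baseline)),
        "disappeared": sorted(set(baseline) - set(current)),
        "changed": sorted(
            t for t in set(baseline) & set(current) if baseline[t] != current[t]
        ),
    }
```

In a repair loop, an empty `regressed` list together with the previously failing criterion showing up under `fixed` is the signal that a patch resolved the regression.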

<Tip>
  When using the MCP repair flow (`repair_failing_workflow` prompt), `compare_runs` is called automatically after `run_workflow_now` so you get the before-and-after diff without extra steps.
</Tip>

## Related pages

* [Workflow overview](/workflows/overview) — how workflows are structured and what the list view shows.
* [Authoring workflows](/workflows/authoring) — write criteria that produce clear, actionable run verdicts.
* [MCP API: runs](/mcp-api/tools/runs) — trigger, inspect, and compare runs programmatically.
* [MCP API: repair tools](/mcp-api/tools/repair) — propose and apply workflow patches in response to failing runs.
