A well-authored workflow describes exactly what the tester agent should do, which tools it is allowed to use, and what observable outcomes count as success. When your criteria are precise and your prompts produce isolated test state, Armature can flag regressions automatically — and the judge model can explain exactly which behavior changed. This page covers everything you need to write durable, maintainable workflows.

Creating a workflow

1. Open the workflow editor

Navigate to Workflows in the sidebar and click New workflow. The editor opens with three sections: Definition, Success criteria, and (after first save) History.
2. Fill in identity fields

Give the workflow a Name your team will recognize in notifications and the run table. Add a Description so teammates can scan the purpose without opening the editor.
3. Choose an MCP server and auth profile

Select the MCP server the tester agent will connect to. If the server has multiple auth profiles, choose the one scoped to your test environment. Avoid using production credentials in test workflows.
4. Select tester models

Pick one or more models to run the workflow. When you select multiple models, each one runs on the same schedule with staggered start times, giving you cross-model coverage in a single workflow.
5. Write the tester prompt

Describe the task the agent should accomplish. Be explicit about expected tool calls, the data the agent should create or retrieve, and how to identify test-scoped state. See the guidance below.
6. Write evaluation criteria

In the Success criteria tab, enter one plain-language criterion per line. The judge model evaluates each criterion independently. See the criteria style guide below.
7. Choose a judge model

Select the model that will evaluate the run. The judge model is separate from the tester model, so you can use a more capable evaluator without increasing the cost of every tester run.
8. Set a schedule and save

Choose a schedule preset or enter a custom cron expression. If you are not ready to run automatically, leave the schedule blank — the workflow runs only when triggered manually. Click Create to save.
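
For a custom schedule, a standard five-field cron expression works. Assuming the usual field order (minute, hour, day of month, month, day of week), the following would run every weekday at 09:00; confirm the exact field semantics and timezone against the presets in the editor:

```
# minute hour day-of-month month day-of-week
# 09:00, Monday through Friday
0 9 * * 1-5
```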

Writing evaluation criteria

Criteria are the single most important part of a workflow. The judge model reads each criterion and decides whether the run satisfied it, based on the tool-call trace and final agent output.

What makes a good criterion

Write criteria as externally checkable outcomes. Each criterion should describe one specific behavior and tell the evaluator what evidence to look for: a tool-call status, a table row, a resource id, a trace event, or a field in the final answer. Good criteria:
  • The tester creates one run-scoped order and reports the order id in the final answer.
  • No email-sending tool is called during the run.
  • The agent calls the list_products tool exactly once and filters by category.
  • The run completes without calling any tool more than three times.
Criteria to avoid:
  • The agent works well. — subjective, not checkable.
  • The response looks correct. — no specific evidence anchor.
  • The workflow succeeds. — circular, tells the judge nothing.

One behavior per criterion

Keeping criteria atomic makes failures easier to diagnose. When a single criterion covers multiple behaviors, a partial failure is harder to attribute. Break compound expectations into separate lines:
Instead of “The agent calls the right tool and returns the correct id,” write two criteria: one that checks the tool call and one that checks the returned id. This way, partial failures surface immediately in the per-criterion verdict.
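
Entered one per line in the Success criteria tab, the split might look like this (the tool name create_order is a hypothetical example):

```
The agent calls the create_order tool exactly once.
The final answer reports the id of the order created by the run.
```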

Evidence anchors

Criteria that name specific evidence types help the judge model ground its verdict in the trace rather than in the agent’s final answer alone. The evidence types Armature surfaces are:
  • Tool-call status — whether a specific tool was called, how many times, and whether it returned an error.
  • Trace events — lifecycle events emitted during the run.
  • Resource ids — identifiers created or retrieved during the run.
  • Final answer fields — values the agent reports at the end.

Writing the tester prompt

The prompt is the instruction the tester agent follows. A good prompt is explicit about the task, produces isolated state, and avoids including secrets.

Use run-scoped data

Tester prompts should create or select data that is unique to the test run. Include unique names, timestamps, or generated ids in prompts when the target system permits it. Criteria should tell the judge how to distinguish test state from customer state. Example prompt fragment:
Create a new project named "armature-test-{today's date in YYYYMMDD format}" and then list all projects to confirm it appears.
This approach ensures that multiple concurrent runs do not interfere with each other and that criteria can verify the specific record created by this run rather than any record in the system.
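
If you generate prompts programmatically, a small helper can build the run-scoped name. This is a minimal sketch; the makeRunScopedName helper and its naming convention are illustrative, not part of the Armature API:

```typescript
// Build a run-scoped project name like "armature-test-20250114-k3f9x".
// The date groups runs by day; the random suffix keeps concurrent runs
// from colliding. Both the helper and the name format are illustrative
// conventions, not part of the Armature API.
function makeRunScopedName(prefix: string = "armature-test"): string {
  const date = new Date().toISOString().slice(0, 10).replace(/-/g, ""); // YYYYMMDD
  const suffix = Math.random().toString(36).slice(2, 7); // short random id
  return `${prefix}-${date}-${suffix}`;
}

const prompt = `Create a new project named "${makeRunScopedName()}" and then list all projects to confirm it appears.`;
```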

Do not include plaintext secrets

Never include passwords, API keys, tokens, or other secrets in prompts, criteria, tool policies, or documentation. Use dashboard-managed auth profiles and secret_ref placeholders instead. Secret-shaped argument names are rejected by the Armature MCP API before tool execution.
Secrets belong in auth profiles, which you configure on the MCP server settings page. The tester agent picks up the credentials automatically from the profile you associate with the workflow.
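
As a rough sketch of the idea (the placeholder shape below is assumed, not taken from the Armature API reference), a tool argument names the secret instead of carrying its value:

```typescript
// Sketch only: the secret_ref placeholder shape is assumed, not the
// documented Armature format. The argument points at a dashboard-managed
// secret; the value is resolved server-side from the auth profile.
const toolArgs = {
  endpoint: "https://api.example.test/v1/orders",
  api_key: { secret_ref: "auth-profiles/test-env/api-key" },
};
```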

Tool policy

Tool policy lets you declare which tools a workflow is expected to use and which tools must never fire.

Allowed tools

Set toolPolicy.allowed_tools to constrain the set of tools the tester agent is expected to call. Armature uses this list in coverage reports — a tool counts as covered only if it appears in a successful run or in an explicit allowed_tools list.
A workflow with no allowed_tools configured runs under open policy, which does not count as covering every tool in the catalog. If you care about coverage, be explicit about which tools a workflow exercises.
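
A minimal sketch of an explicit policy follows. The field name comes from this page (toolPolicy.allowed_tools); the object shape and the tool names other than list_products are illustrative:

```typescript
// Explicit allowed list: only these tools count toward coverage for
// this workflow. Tool names beyond list_products are hypothetical.
const toolPolicy = {
  allowed_tools: ["list_products", "get_order", "create_order"],
};
```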

Blocked tools

Set toolPolicy.blocked_tools for tools that must never be called during a run. If a blocked tool fires, the run fails regardless of other criteria. This is useful for guarding against side effects — for example, blocking all email or payment tools in a workflow that should only read data. Example blocked-tools criterion:
No email-sending tool is called.
Pair this criterion with a blocked_tools entry for the email tool so the failure is surfaced at the infrastructure level, not just at the evaluation level.
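
A sketch of that pairing (send_email is a hypothetical tool name):

```typescript
// Read-only workflow: block the side-effecting tool at the policy level
// so a violation fails the run before the judge ever evaluates it.
const toolPolicy = {
  allowed_tools: ["list_products", "get_order"],
  blocked_tools: ["send_email"],
};
```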

Unknown tool policy entries

If a proposed patch references a tool that is missing from the active catalog, Armature validates the entry against the catalog and may attempt a tools/list refresh. If the tool is still missing, the proposal returns an unknown_tool_policy_entry error, or an explicit warning when allowUndiscoveredTools is set.
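
The payload below sketches what to look for; the exact field layout is assumed, not documented here, and audit_log is a made-up tool name:

```typescript
// Hypothetical shape of the validation failure; field names are assumed.
const exampleError = {
  error: "unknown_tool_policy_entry",
  entry: "audit_log",
  detail: "Tool not found in the active catalog after tools/list refresh.",
};
```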

Regression drafts

A regression draft is a focused workflow created automatically from a failing run. Use draft_regression_workflow_from_run in the MCP API — or click Create regression draft in the run detail view — to produce a workflow that:
  • Copies the source run’s prompt verbatim.
  • Selects only the failed or partial required criteria from the original run.
  • Narrows allowed_tools to the tools actually called in the failed run.
  • Sets the schedule to manual-only so it does not fire automatically until you review it.
Use regression drafts when the original workflow is too broad or when a specific failure should become a permanent guardrail. A regression draft isolates the broken behavior so you can confirm it is fixed without noise from unrelated criteria.
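
Over the MCP API, drafting is an ordinary tool invocation. A minimal sketch using the TypeScript MCP SDK follows; the runId argument name is an assumption, so check the tool's input schema via tools/list:

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";

// Minimal sketch: invoke the draft tool against an already-connected MCP
// client. The tool name comes from this page; "runId" is assumed.
async function draftRegression(client: Client, runId: string) {
  return client.callTool({
    name: "draft_regression_workflow_from_run",
    arguments: { runId },
  });
}
```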

When to use regression drafts

Regression drafts are most useful in two scenarios:
  1. A broad workflow surfaces a failure but the root cause is unclear. Draft a regression from the failing run to get a narrow, focused test that reproduces the exact failure.
  2. A bug is fixed and you want a permanent guardrail. Keep the regression draft as a manual-only workflow and trigger it whenever you deploy changes to the affected MCP server.

Evidence references in proposals

When you use the MCP repair API to propose a workflow patch, you can supply evidence references to support the proposed change. Armature validates all references against your organization. Valid evidence reference types are:
  • runIds — runs in your organization.
  • criterionIds — criterion ids that belong to any version of the target workflow.
  • toolCallIds — tool calls that belong to a cited run or the proposal source run.
  • traceEventIds — trace events that belong to a cited run or the proposal source run.
Supplying evidence references creates an audit trail that teammates can review before approving the patch.
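
As a sketch, an evidence bundle might look like the following; the field names mirror the reference types above, but the exact payload layout is assumed and every id is made up:

```typescript
const evidence = {
  runIds: ["run_01HXYZ"],      // failing run that motivated the patch
  criterionIds: ["crit_8f2a"], // criterion the patch tightens
  toolCallIds: ["call_0042"],  // offending tool call in the cited run
  traceEventIds: ["evt_0917"], // lifecycle event supporting the claim
};
```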

Related pages

  • Schedules — configure when your workflow runs automatically.
  • Run results — interpret pass, partial, fail, and error outcomes.
  • MCP API: repair tools — propose patches and apply them programmatically.