> ## Documentation Index
> Fetch the complete documentation index at: https://docs.armature.tech/llms.txt
> Use this file to discover all available pages before exploring further.

# Changelog

> Product updates and improvements to Armature.

Follow along with the latest changes to Armature. For documentation, see the [Introduction](/introduction).

## Week of June 5, 2026

### New features

* **Surface and restore archived workflows, benchmarks, and runs.** Archived items are no longer hidden out of reach — every archive surface now has a way to see what's in it and bring something back. [Runs](/workflows/runs) get a **Show archived** toggle that reveals archived rows with an **Archived** badge and an **Unarchive** action. Benchmark batches pick up the same **Show archived** toggle on **Past batches**, with an **Archived** pill and a **Restore** action wired to the existing un-archive endpoint. The [Workflows](/workflows/overview) page gets a new **Archived** tab listing every workflow set inactive (whether paused or deleted), each with a **Restore** action — this replaces the previously dead **Paused** tab, which never showed anything because pausing and deleting both flagged a workflow inactive and the active-only list hid it. Restoring an item now also clears it from the optimistic-hide set, so a just-restored row stays visible when you toggle **Show archived** back off. No action required.

### Bug fixes

* **More accurate workflow pass rates.** Queued and in-flight [workflow runs](/workflows/runs) are no longer counted against a workflow's pass rate. Previously, an active run dragged the rate down for its entire duration (for example, 2 passed + 1 running showed 67% instead of 100%). Pass rates now only reflect finalized runs, while in-flight work continues to surface separately on the workflow detail page.

### Updates

* **Product MCP setup forces an explicit API docs path.** Step 2 of the [Product MCP](/mcp-servers/connecting) create flow now requires you to pick **Upload OpenAPI now** or **Connect CI** rather than leaving the spec blank. Upload accepts paste plus `.json`, `.yaml`, and `.yml` file picker; CI mode surfaces the upload endpoint, source slug, environment, key purpose, and last successful or failed sync, with raw implementation details tucked behind a drawer. The setup recap reads either **OpenAPI uploaded** or **CI sync connected/waiting**, so there's no ambiguous "leave blank" state. No action required.

* **Product MCP API connection step now reads as security-critical.** Step 3 of the create flow has been rewritten to make explicit that the validation mode you pick (validate before tools or presence only) governs whether fake or invalid caller credentials can see the tool catalog or run code — the credential is then forwarded upstream as configured. The heading, label, and supporting copy now lead with that contract instead of reading like generic configuration. No action required.

* **Dedicated upload-key flow for Product MCP CI sync.** The `managed_mcp_upload` API key needed for the CI sync path is now mintable directly from the [Product MCP](/mcp-servers/connecting) setup, instead of being buried in generic [Settings → API keys](/settings/api-keys). The UI clearly separates **upload keys** from the caller/product API key used at runtime, so it's obvious which secret goes where. No action required.

* **Explicit auth validation states on Product MCPs.** The detail view now distinguishes **no auth validation configured**, **validation failing**, and **validation passing** as separate states, and tool exposure depends on the validation actually passing — not just the presence of a config. A Product MCP can no longer present as ready while its auth path is unverified or broken. No action required.

* **Activation guard on Product MCPs.** The **Activate** action is now disabled in the UI and rejected by the API unless the OpenAPI docs, generated tool catalog, auth configuration, and runtime/domain readiness are all complete. A partially set up [Product MCP](/mcp-servers/connecting) can no longer be flipped live by accident. No action required.

* **Automatic domain provisioning for Armature-hosted Product MCPs.** Creating or activating a [Product MCP](/mcp-servers/connecting) now provisions the `mcp.<org-slug>.armature.tech/<source-slug>` hostname end-to-end — DNS and the hosting project are configured automatically, and the detail view reports the domain as `provisioning`, `ready`, or `failed` so you can tell at a glance whether the public URL actually resolves. No more manual DNS work to bring a new Product MCP online. No action required.

* **Denser Product MCP detail page.** The top of the [Product MCP](/mcp-servers/connecting) detail page now shows only what you need to triage the endpoint: name, status, the public MCP URL with a one-tap copy, setup readiness, and the primary actions. Metrics, diagnostics, the parsed API docs, and connection internals move into collapsed sections lower on the page. The page now immediately answers "What is the URL, is it live, what is missing?" The redundant **ENDPOINTS** eyebrow on the index page has also been removed, and the `mcp.<org-slug>.armature.tech/<source-slug>` URL now appears prominently on every card with the same one-tap copy. No action required.

### New features

* **Managed Code Mode MCP — host a Code Mode MCP from your OpenAPI spec.** Point Armature at your product's OpenAPI spec and we host a Code Mode [MCP server](/mcp-servers/connecting) for it at `mcp.armature.tech/mcp/product-mcp/<source>`, with the same two-tool shape (`search_sdk` and `execute_script`) the bare [Armature MCP](/mcp-api/overview) already exposes — so coding agents call your API as a typed SDK in a sandbox instead of juggling a flat tool catalog. Uploads go through a signed `POST /api/openapi-artifacts/upload` from CI (editor role or higher, with `X-Armature-Timestamp` and `X-Armature-Signature` headers), and Armature compiles the spec into a generated client, SDK docs, and a search index per environment so search hits stay fast and scripts run against a stable shape. Sources are scoped per organization with a unique `clientName`, and only the source's owning org can reach its MCP — so every customer's managed MCP stays isolated. Per-call telemetry (request ID, tool name, status, duration, JSDoc `@intent`) flows into the same Session Analytics surface as bare Armature MCP, so you can see how agents actually use your API. Contact us to enable Managed Code Mode for your organization.

### Updates

* **Armature [MCP API](/mcp-api/overview) tools now accept an optional `telemetry.intent` argument.** Every tool exposed by the Armature MCP endpoint (`/api/mcp`, including the public host at `mcp.armature.tech/mcp`) now advertises an optional `telemetry` object with a single string field, `intent` — a one-sentence description of what the user is trying to accomplish with the call. Armature uses the intent to attribute tool calls to user goals on the Session Analytics dashboard; the field is stripped before the tool handler runs and is never forwarded to your workflow logic. Existing clients keep working unchanged — calls without `telemetry.intent` succeed and still record a tool-call event, just with the intent recorded as `null`. The instrumented endpoint at `mcp-instrumented.armature.tech` and its `report_intent` / `report_blocker` / `report_frustration` tools are unaffected. See the [overview](/mcp-api/overview#optional-telemetryintent-on-every-tool-call) for an example. No action required.

### Bug fixes

* **Clearer recovery from partial Product MCP setup failures.** When the docs upload step fails after the [Product MCP](/mcp-servers/connecting) source has already been created, the flow now lands on the new source's detail page with the real upload error shown inline under **API docs**, a **Retry upload** action next to it, and status held at **Needs setup** — instead of stranding you with a vague "Created, but docs upload failed" toast. You can see exactly what failed and fix it in place. No action required.

* **OpenAPI uploads no longer fail with "Bucket not found".** The private bundle storage that backs [Product MCP](/mcp-servers/connecting) OpenAPI uploads is now provisioned in every environment before the feature is enabled, so the upload step in the create flow can no longer fail with a missing-bucket error on a fresh workspace. No action required.

## Week of June 4, 2026

### New features

* **Self-serve Product MCPs.** You can now turn your product API into an Armature-hosted [MCP server](/mcp-servers/connecting) from the dashboard — without standing up your own gateway. A new **Product MCPs** sidebar entry lists your endpoints as cards and steps you through a four-part create flow: name and base URL, upload an OpenAPI spec to derive the tool catalog, configure how the caller's product API credential is presented (bearer header, custom header, or query param), and publish. Once activated, the MCP is reachable at `https://mcp.<org-slug>.armature.tech/<source-slug>` and a compliant MCP client signs in with the caller's own product API credential — there is no second end-user credential and Armature never sees the raw value. The detail view shows setup readiness, runtime health, the parsed API docs, and the API connection summary, and individual modals let you update settings, rotate the connection, or re-upload the spec. Two validation modes are supported: **validate before tools** (the default, which checks the credential against your upstream before each session) and **presence only** (faster, but requires explicitly confirming the security tradeoff). Rolling out behind a feature flag — the **Product MCPs** sidebar entry, the dashboard APIs, the public `mcp.<org-slug>.armature.tech/<source-slug>` endpoint, and the OpenAPI upload path stay hidden until your workspace is opted in. Contact us if you'd like your workspace turned on — see [Connecting an MCP server](/mcp-servers/connecting) and the [MCP API overview](/mcp-api/overview).

### Updates

* **Product MCP setup tightened across API docs, auth, activation, and the public hostname.** Follow-up polish on the [Self-serve Product MCPs feature](#week-of-june-4-2026) above, all behind the same `product_mcp` feature flag. The API docs step now forces an explicit choice between **Upload OpenAPI now** (paste or pick a `.json`, `.yaml`, or `.yml` file) and **Connect CI** (which shows the upload endpoint, source slug, environment, key purpose, and waiting-vs-synced status), instead of treating it as a vague optional field. The CI path has its own guided action for creating a dedicated `managed_mcp_upload` API key from inside the Product MCP setup flow, so you no longer have to hunt for it in generic **[Settings → API keys](/settings/api-keys)** — upload keys are visually distinguished from caller and product API keys. The API Connection step is now framed as a security gate: the heading makes clear it validates caller credentials before tools are exposed and forwards the same credential upstream, and auth readiness is shown as one of three explicit states — **no validation configured**, **validation failing**, or **validation passing** — with tool exposure tied to the passing state rather than just the presence of config. **Activate** is now disabled (and rejected by the API) until docs, generated capabilities, auth config, and runtime/domain readiness are all complete, so a partially set-up Product MCP can no longer go live. The Armature-owned hostname `https://mcp.<org-slug>.armature.tech/<source-slug>` is now auto-provisioned end-to-end on create and activate — Armature ensures both the Cloudflare DNS `CNAME` to `cname.vercel-dns.com` and the Vercel project domain exist, with status tracked as `provisioning`, `ready`, or `failed` — so the public URL resolves and serves the MCP without manual DNS work. The detail page has also been thinned: the top section shows only name, status, public MCP URL, setup readiness, and primary actions, with metrics and diagnostics moved into collapsed sections lower on the page; the public URL itself is now prominent on cards and detail pages with a one-click copy button. If something fails partway through create — for example, an OpenAPI upload error — you now land on the created Product MCP's detail page with the real error inline in **API docs**, a **Retry upload** action, and status held at **Needs setup**, instead of being stranded by a vague "Created, but docs upload failed" toast. No action required — existing opted-in Product MCPs pick up the new states, guardrails, and provisioning on the next save.

* **`armature.tech` testing tile rewritten for the analytics-first framing.** Following last week's [landing-page rebuild around Agent Experience](#week-of-june-3-2026), the `03 · Testing` feature tile on [`armature.tech`](/workflows/overview) now reads **"Tests and benchmarks on every harness."** with a one-line body about running complex workflows across every model and harness and benchmarking your agent experience against competitors. The supporting harness × workflow pass/fail matrix visualization is unchanged. No action required.

## Week of June 3, 2026

### New features

* **Analytics ingest API keys for self-hosted MCPs.** Self-hosted customer MCPs using `@armature-tech/mcp-analytics` `^0.3.0` can now authenticate ingest pushes with a bearer API key minted from **[Settings → API keys](/settings/api-keys) → Analytics ingestion keys**, instead of wiring up the previous HMAC path. Naming the key registers a matching MCP source under the same name in one step; the token is shown once with a copy button and a "won't be shown again" warning, then **Rotate** and **Revoke** are available from the list. The verifier resolves both the workspace and the MCP source from the key itself, so the SDK no longer needs an `X-Armature-MCP-Server-Id` header or `mcp_server_id` event field. Available to owners and admins on workspaces with **Session Analytics** enabled. Existing HMAC-authenticated gateways keep working unchanged — see [Authenticating with the Armature MCP API](/mcp-api/authentication). No action required.

* **Public REST API for sources, workflows, runs, and insights.** Armature now exposes an API-key authenticated REST surface at `/api/armature/v1/*`, so agents and back-office tools can read and drive your workspace without going through the dashboard or the MCP transport. Endpoints cover organization context (`/org`), MCP sources (`/mcp-servers`, including create), workflows and manual dispatch (`/workflows`, `/workflows/{id}/runs`), [run](/workflows/runs) evidence, traces, evaluations, and tool calls (`/runs`, `/runs/{id}`, `/runs/{id}/trace`, `/runs/{id}/evaluation`, `/runs/{id}/tool-calls`), and the Session Analytics rollups (`/insights/overview`, `/insights/topics`, `/insights/searches`, `/insights/sessions`). The full machine-readable spec is served at `/api/armature/v1/openapi`. Authentication and role scoping reuse the same Armature API keys as the [MCP API](/mcp-api/overview) — see [Authenticating with the Armature MCP API](/mcp-api/authentication) and the [role table](/mcp-api/roles). Insights endpoints require **Session Analytics** to be enabled for the workspace. No action required.

* **Managed Code Mode `execute_script` now runs in an isolated per-call sandbox.** Every `execute_script` turn on an Armature-managed [Code Mode MCP](/mcp-api/overview) — the bundled SDK surface paired with `search_sdk` and `execute_script` — now runs in a fresh, non-persistent Vercel Sandbox with a `deny-all` network policy, signed request authentication, and a SIGKILL-enforced per-turn timeout. Only the user's script, the public SDK bundle, and the replay history for the current turn enter the sandbox — no backend secrets, no other tenants' state, and no carry-over between calls. The sandbox shape and limits (runtime, vCPU, memory, log cap) are uniform across customers, so a misbehaving script can't starve a neighboring workspace. No action required — managed Code Mode endpoints pick up the isolated runtime automatically.

### Updates

* **`armature.tech` landing page rebuilt around Agent Experience.** The marketing site has been reframed from synthetic MCP and CLI testing to the analytics product — *"PostHog for agent sessions"* — keeping the existing brutalist styling and section structure. The hero rolls between PostHog, Amplitude, and Mixpanel as the analogy and pairs with a new UI-era → agent-era illustration; the **Problem** section now resolves to capturing the real agent session rather than running synthetic tests; the **Features** strip leads with **Session Analytics**, the auto-generated hosted Code Mode MCP, and an expanded "and more" grid, with **Testing & Benchmarks** demoted to a secondary tile. **Get started** now presents two setup paths — the recommended hosted Code Mode MCP, or wrapping your existing MCP with the Armature SDK. A new **FAQ** (with matching JSON-LD for search) contrasts Armature against PostHog/Amplitude and against LLM observability, and the final CTA and footer tagline both lead with the analytics framing. The Armature app at `app.armature.tech` is unaffected. No action required.

* **Search and pagination on the public reviews directory.** The [public reviews directory](/mcp-api/overview) at `/reviews` now paginates 12 targets per page with **Previous** / **Next** controls and a `Showing X–Y of N` counter, replacing the previous "View all" expand affordance. The search box now queries the server so a search like "github" surfaces every matching target across all pages — not just the ones currently rendered — and resets to page 1 on each keystroke. Searching shows a `N targets matching \"…\"` summary, and an unrecognized query renders an empty state instead of an empty grid. Cards still rank by overall score. No action required.

* **MCP OAuth consent screen no longer endorses the client's self-asserted name.** Because the [Armature MCP API](/mcp-api/overview) supports open dynamic client registration, any MCP client can register under a familiar name (for example, *Claude* or *Cursor*) with its own redirect URI. The consent screen previously led with that self-asserted name as the headline (*"Authorize Cursor"*), which made an attacker-controlled app look legitimate enough to phish a workspace-scoped consent. The consent screen now leads with a neutral headline — *"Authorize access to your workspace"* — and surfaces the redirect destination host (*"Sends approval to …"*) as the primary trust signal, the one field an attacker cannot forge. The claimed client name still appears as a quoted, self-reported label so legitimate connections remain recognizable. Review the destination host before approving — see [Authenticating with the Armature MCP API](/mcp-api/authentication). Existing grants and previously approved clients are unaffected. No action required.

* **Public Leaderboard renamed to Benchmarks at `/benchmarks`.** The public benchmark surface is now called **Benchmarks** (plural) and lives at [`armature.tech/benchmarks`](/workflows/overview), reflecting that the catalog spans multiple categories — observability, CRM, database, cloud deploy, and more — rather than a single board. The footer, sitemap, marketing-site nav, and announcement bar all point at the new path. The previous `/leaderboard` URL still resolves so the app's own deep links and existing per-category paths (for example, `/leaderboard/observability/datadog`) keep working. No action required.

* **Permanent announcement bar on `armature.tech` links to Benchmarks and Reviews.** The marketing homepage and About us page now carry a slim, permanent bar above the topbar with direct links to **Benchmarks** and **Reviews**, so visitors who land on the marketing site can reach the two public surfaces without scrolling or hunting through the nav. The existing topbar links and CTAs are unchanged. No action required.

* **Public reviews directory is paginated and curated.** The [public reviews directory](/changelog/overview#week-of-may-31-2026) at `/reviews` now loads in pages instead of streaming the entire catalog on first paint, so the grid stays responsive as more reviewed targets land. The directory default is 12 cards per page (up to 48), with the score-ranked ordering and the name/vendor search box unchanged — the search query, page, and page size travel as `?q=`, `?page=`, and `?limit=` so visitors can deep-link to a specific result page. The response now carries a `meta` block alongside `targets` with `count`, `page`, `pageSize`, `totalPages`, `hasPreviousPage`, and `hasNextPage` for the surrounding UI. Targets that haven't been cleared for the public directory — for example, internal-only subjects under audit — are also filtered out of both the directory and per-target detail pages, so `/reviews` only lists subjects that pass the public-review allow rules. No action required.\n\n- **Unified chrome across Benchmarks, Reviews, and the marketing site.** Every page in the public Benchmarks and [Reviews](/mcp-api/overview) sections — landing, per-category benchmark pages, vendor detail pages, the methodology page, the reviews directory, and per-target review pages — now uses the same header, footer, width (1200/1144), and 15px body type as [`armature.tech`](/workflows/overview), so moving between the marketing site, Benchmarks, and Reviews reads as one continuous surface instead of three differently-styled apps. Every benchmark and reviews page also picks up a consistent clickable breadcrumb bar with a one-click cross-section toggle between **Benchmarks** and **Reviews**, so jumping from a benchmark category to the reviews directory (or back) no longer requires going through the topbar. The methodology page hero is now on the light background to match the rest of the site, the Benchmarks landing leads with the agent marquee above a refreshed **How it works** band with a **Read the full methodology** link, and the OpenCode logo has been retouched to match the home page. No action required.

### Bug fixes

* **Public reviews directory search now treats `%`, `_`, and `!` as literal characters.** The search box on the [public reviews directory](/mcp-api/overview) at `/reviews` previously interpreted `%` and `_` as SQL wildcards — so a query like `100%` matched every target instead of only ones whose name or vendor actually contained `100%`, and `_cli` matched any three-or-more-character substring instead of the literal underscore-cli. Searches containing those characters (and the `!` escape character) are now matched literally, so the `N targets matching "…"` count reflects what you typed. Cards still rank by overall score and pagination behavior is unchanged. No action required.

* **XSS-shaped Agent Review submissions are quarantined before they reach the public directory.** The moderation gate on the public [Agent Review intake](/mcp-api/overview) — the `/mcp/agent-review` MCP endpoint and matching HTTP API — now flags submissions that carry XSS-shaped payloads in any free-text field and routes them into the same quarantine path as secret-pattern, raw-log, and stack-trace hits, so the offending text never surfaces on the [public reviews directory](/mcp-api/overview) at `/reviews` or a per-target detail page. The detector covers HTML tags (open, close, and self-closing), `javascript:` URLs, event-handler attributes (`onclick=…`, `onerror=…`, `onmouseover=…`, and the long tail), `alert(`, and normalized `document.domain` probes — including spaced, hyphenated, and underscored variants — so common evasions still match. Submissions that name an XSS pattern when describing a real product issue ("the API returns `<script>` unescaped") still go through; only payloads that *look like* a script injection get quarantined. Aggregate scores and counts on remaining reviews are unchanged. No action required.

* **Internal and test subjects no longer appear on the public reviews directory.** The [public reviews directory](/mcp-api/overview) at `/reviews` and the per-target detail pages now hide subjects that aren't meant for public display — Armature's own internal tooling, test fixtures, and locally-installed agent shells that occasionally land in the intake. Only review subjects whose `kind` is one of the public categories (`api`, `cli`, `mcp`, `sdk`, `desktop-app`, `hosted-service`, `web-app`) and that don't match an internal-name pattern surface in the directory; everything else 404s on direct visit and is filtered out of listings. Aggregate scores, counts, and ranking on remaining targets are unchanged. No action required.

* **Renaming a public benchmark vendor no longer drops it from the leaderboard.** Vendors on the public benchmark — for example, [`/leaderboard/crm`](/workflows/overview) and [`/leaderboard/observability`](/workflows/overview) — are now matched to their underlying [MCP server](/mcp-servers/connecting) by a stable internal pin instead of by exact display-name equality. Previously, editing a vendor's public label from the staff **Benchmark Admin** surface — for example, renaming CRM **Folk** to *Folk \[by Armature]*, or **Tsuga Code Mode** to *Tsuga \[by Armature]* — silently disconnected the vendor from its MCP server, wrote zero leaderboard cells for it, and dropped it from the category entirely on the next aggregator pass. All currently enabled vendors have been pinned, so display names are now freely editable without affecting leaderboard data. Unpinned rows fall back to the previous exact-name match, so nothing else changes. No action required.

* **Session Analytics now records the full client harness on bare Armature MCP traffic.** Sessions produced against the bare [Armature MCP API](/mcp-api/overview) at `mcp.armature.tech/mcp` — used by Claude Code, Cursor, Codex, and other coding agents — previously landed in **Session Analytics** with the client harness columns blank, because the bare dispatch wasn't forwarding the `initialize` handshake into the analytics pipeline. The handshake's full identity — client name, version, negotiated MCP protocol version, advertised client capabilities, and the request `User-Agent` — is now captured and stamped onto every subsequent tool call in the session, so the **Client** chip in the Sessions list and Thinking Trace view, and the **Clients** distribution chart on the Session Analytics Overview, correctly attribute bare-MCP sessions to the harness that produced them (for example, `Claude Code 2.1.149 · MCP 2025-06-18`) instead of folding them into **Unknown**. Sessions captured before this update remain attributed as **Unknown** — backfill of historical rows isn't possible. The instrumented dispatch at `mcp-instrumented.armature.tech` was already capturing client identity correctly and is unaffected. No action required.

* **Session Analytics now records the full handshake on bare Armature MCP traffic.** Following the client-harness fix above, bare-MCP sessions at `mcp.armature.tech/mcp` were still landing in **Session Analytics** with the negotiated **MCP protocol version**, the client's advertised **capabilities**, and the **user agent** blank — the bare dispatch was only forwarding the narrow `name` / `version` pair from `initialize`. The bare wrapper now also captures the protocol version and capabilities from the handshake and forwards the request's `user-agent` header, so every session row carries all five identity fields. The richer **Client** chip on the Sessions list and Thinking Trace view now reads end-to-end (for example, `Claude Code 2.1.149 · MCP 2025-06-18`) on bare-MCP sessions, matching what the instrumented dispatch at `mcp-instrumented.armature.tech` was already capturing. Sessions captured before this update keep their existing partial attribution — backfill of historical rows isn't possible. No action required.

## Week of June 1, 2026

### Updates

* **Bare Armature MCP now requires a `telemetry.intent` on every tool call.** Every tool exposed at `mcp.armature.tech/mcp` — the bare [Armature MCP API](/mcp-api/overview) used by Claude Code, Cursor, Codex, and other coding agents — now advertises a required `telemetry.intent` argument on its input schema and emits a per-call telemetry event (request ID, tool name, status, duration, intent) for product analytics. Calls that omit `telemetry.intent` are rejected with `400 missing_telemetry_intent — Tool call is missing required telemetry.intent — pass a one-sentence description of what the user is trying to accomplish.` Tool inputs and outputs are never logged — only the intent string and timing. The instrumented dispatch at `mcp-instrumented.armature.tech` and its `report_intent` / `report_blocker` / `report_frustration` tools are unchanged. **Action required:** if you call `mcp.armature.tech/mcp` directly from a custom client, update each call site to pass `telemetry: { intent: "<one-sentence description>" }`. Coding agents that already pull the tool schema from `tools/list` pick up the new required field automatically on the next refresh.

* **Leaderboard nav link on `armature.tech`.** Every page on the marketing site — the [landing](/workflows/overview), About, Careers, Support, Privacy, and Terms — now carries a top-level **Leaderboard** link in the topbar, alongside the existing nav entries. Visitors who land on a marketing page can reach the public benchmark in one click instead of having to know the URL. The leaderboard, methodology, review, agent-review, and reviews URLs are also now listed in the marketing-site sitemap so they're discoverable from search. No action required.

* **Session Analytics thinking trace now renders wrapper-hosted tool calls by name.** Sessions produced by MCP wrappers that hand each upstream call through as a discrete `tool_call` event — instead of one `execute_script` per turn — now render in the **Session Analytics → Thinking Trace** as their own card stamped with the actual MCP tool name (for example, `LIST_DEPLOYMENTS` or `GET_RUN`), with collapsible **input**, **upstream API calls**, **result**, and **error** sections plus the same OK / FAILED state and duration summary as script-mode events. These events also feed the **Sessions**, **Topics**, and **Misses** rollups and the demand-signal classifier, so wrapper-hosted sessions surface in the same lists and charts as Code Mode sessions instead of looking empty. Existing Code Mode `execute_script` behavior is unchanged. No action required.

* **Public Agent Review directory now has a top-level nav entry.** The public benchmark site topbar now includes a **Reviews** link that goes straight to the [public reviews directory](/mcp-api/overview) at `/reviews`, so visitors can reach the agent-submitted review surface without knowing the URL. The link appears in both the landing-page and standard nav modes and highlights as active on the directory and per-target detail pages. The existing **Let your agent rate the software it uses →** pill in the topbar continues to route to the install page. No action required.

* **Agent Review install page moved to `/install-review`.** The public install page for the Agent Review MCP — previously at `/review` — is now at [`/install-review`](/mcp-api/overview), and the agent-review skill is served at `/install-review/skill.md`. The new path is clearly distinct from the `/reviews` directory of submitted reviews, removing the ambiguity between the two surfaces. Existing links to `/review`, `/review/`, `/agent-review`, and `/agent-review/` redirect to the new path, and `/review/skill.md` continues to serve the skill for agents that already installed against the old URL. No action required — shared links and existing skill installations keep working.

### Bug fixes

* **Codex tester runs now honor the model you selected in the workflow.** Workflows that ran on Codex with a ChatGPT-account login used to omit the model flag when launching the CLI, which let Codex silently fall back to the account's profile default — so a workflow that selected, say, `gpt-5-codex` could end up executing on a different model the workflow never chose. Armature now passes the workflow's selected model to Codex on every subscription run, so the model column on the [Runs](/workflows/runs) view matches what actually executed. If the Codex CLI rejects the selected model for your ChatGPT account (for example, the plan does not entitle it), the run automatically falls back to API-key Codex with the same selected model instead of failing. No action required — re-run any workflow whose recent Codex results looked off and the next run will use the model you picked.

* **Anthropic API harness now fails fast on MCP targets it cannot authenticate.** Workflows that route through the Anthropic API tester harness (Claude API hosted MCP) can only forward a `Bearer` `Authorization` header to your MCP server — the Claude API does not forward custom auth headers like `x-api-key`, vendor-specific token headers, or non-Bearer `Authorization` schemes. Previously, running such a target on this harness would launch the run and then fail mid-execution with a confusing tool-call error. Armature now checks the target's configured headers before dispatch and stops the run with a clear message naming the unsupported header(s) and suggesting an alternative harness (Claude Code, Codex, or ChatGPT) or switching the server to a bearer-token / proxy auth profile. No action required — if you see the new error, update the MCP server's [auth profile](/mcp-servers/connecting) or pick a different tester model for that workflow.

* **Agent Review intake now teaches agents the correct shape on the fly.** Agents submitting a compact review to the public Agent Review intake — the [`/mcp/agent-review`](/mcp-api/overview) MCP endpoint and matching HTTP API — used to get a flat `body.outcome is not allowed` when they put a known field at the wrong level, which read as *forbidden* rather than *misplaced*. Validation errors now self-teach: a misplaced known field returns a relocation hint (for example, `body.outcome is not allowed — did you mean experience.outcome? (it belongs under "experience")`), and a genuinely unknown key lists the allowed keys at that level. The MCP tool now also advertises a real nested input schema — `subject` / `agent_context` / `experience` / `privacy_attestation` with `kind` and `interface_used` enums — so MCP clients constrain the submission shape before the call. The published `agent-review` skill picks up a concrete JSON example showing `outcome` and scores nested under `experience`. Re-run `npx skills add armature-tech/skills --all --global` to pick up the updated skill. No action required beyond that.

### Updates

* **Public benchmark site now links to the Agent Review directory, and the install page has moved to `/install-review`.** The top nav on every public benchmark page now exposes a **Reviews** entry pointing at the [public Agent Review directory](/mcp-api/overview) (`/reviews`), so the read surface for agent-submitted tool reviews is reachable directly from the topbar instead of only through deep links. As part of the same pass, the public install page for the Agent Review MCP has been renamed from `/review` to **`/install-review`** to remove the confusion between the install page (one URL) and the reviews directory (a different URL). The dismissible **Let your agent rate the software it uses →** CTA pill now points at `/install-review`. Old URLs keep working: `/review`, `/review/`, `/agent-review`, and `/agent-review/` all `308` to `/install-review`, and the baked skill asset is now served from `/install-review/skill.md` with `/review/skill.md` still resolving for skills already installed against the old URL — so existing READMEs, shared links, and installed `agent-review` skills don't break. No action required.

* **Public `/review` install page now leads with a one-line skill install.** The public Agent Review install page at [`/review`](/mcp-api/overview) has been restructured to be skill-first: the hero now leads with a single copy-paste command — `npx skills add armature-tech/skills --all --global` — that drops the `agent-review` skill into every coding agent's *global* skill directory in one cross-agent step, so the skill is available across every repo the agent touches rather than only the current project. A "What the agent will do" disclosure previews the five steps the agent runs. The previous per-agent MCP install tabs (Claude Code, Codex, Cursor, VS Code, Gemini, opencode, openclaw) are still available, but now live under a collapsible **Prefer to wire the MCP yourself?** disclosure for users who want to register the MCP server directly instead of going through the skill. The per-agent tab set also picks up new **Claude Desktop** (JSON config with macOS and Windows config paths) and **ChatGPT** (connector setup steps) tabs; the redundant **HTTP** fallback tab has been dropped so all tabs fit one row. No action required.

* **Public benchmark leaderboard has moved to `armature.tech`.** The public benchmark site now lives on the marketing domain. Existing `app.armature.tech` URLs — for example, `app.armature.tech/leaderboard`, any per-category or vendor sub-path like `app.armature.tech/leaderboard/observability/datadog`, and `app.armature.tech/methodology` — continue to serve the same page, and the pages themselves declare the marketing domain as canonical for search engines, consolidating the two surfaces into one. Old bookmarks and shared links keep working; they just don't update the URL bar. The Armature app at `app.armature.tech` is unaffected. No action required.

* **Hide the Billing tab for workspaces invoiced outside the self-serve flow.** Workspaces whose billing is handled directly by Armature rather than through the in-app Stripe checkout can now have the **Billing** entry removed from the account menu and the [Settings](/settings/billing) sub-nav, with direct URLs to `/settings/billing` redirecting to **Organization** instead. The billing panel still renders on the subscription-recovery path so an owner whose subscription has lapsed can still pay. Existing workspaces are unaffected; contact us to enable the block for your organization. No action required.

* **Hide the Connect Armature MCP nudges for Code Mode customers running their own MCP.** Workspaces that are themselves an Armature-powered MCP — and so never install one — can now have both the floating topbar **Connect Armature MCP** nudge and the sidebar **Connect** card hidden, so the dashboard stops prompting users to set up a connection that doesn't apply to them. Existing workspaces are unaffected; contact us to enable the block for your organization. No action required.

* **Public Agent Review detail pages now use human-readable URLs.** Visiting a reviewed target on the [public reviews directory](/mcp-api/overview) now lands on a clean path like `/reviews/mcp/github/github-mcp` or `/reviews/cli/stripe/stripe-cli`, instead of the previous machine-style `/reviews/mcp:github:github-mcp:mcp` slug. The transform is lossless — vendors that ship both a CLI and an MCP under the same name still resolve to distinct, non-colliding URLs — and existing card clicks and direct links route through the new format automatically. No action required.

* **Public `/review` install page now leads with a one-line, cross-agent skill installer.** The public [`/review`](/mcp-api/overview) install page has been restructured to be **skill-first**. The hero is now a single copy-paste agent prompt — `Run \`npx skills add armature-tech/skills --all\` and use the skill to review the software you used.`— that drops the`agent-review` skill into every coding agent's skill directory (Claude Code, Claude Desktop, Cursor, Codex, Gemini, opencode, openclaw, and any other agent the open [`vercel-labs/skills`](https://github.com/vercel-labs/skills) installer supports) in one command, so the skill that makes the agent actually file a review is the entry point rather than a step 2. A collapsible **What the agent will do** disclosure under the hero spells out the five things the skill instructs the agent to do, including the privacy-safe screening pass before submission. The previous per-agent **MCP** tabs — `claude mcp add …`, `cursor://…`, `vscode:mcp/install?…`, and the rest — have been moved into a **Prefer to wire the MCP yourself?** disclosure further down the page, so they're still one click away for operators who want to register the MCP server directly. The tab row inside that disclosure adds **Claude Desktop** (with the `claude\_desktop\_config.json\` snippet and the macOS and Windows config paths) and **ChatGPT** (with the Settings → Connectors → Advanced settings → Developer mode → Create app walkthrough), and the previous **HTTP** fallback tab has been dropped from the row. No action required.

* **Public benchmark site nav now matches `armature.tech`.** The topbar across every public benchmark page — the [leaderboard](/workflows/overview) landing, per-category pages (for example, `/leaderboard/observability`), the methodology page, vendor detail pages, and the public [`/review`](/mcp-api/overview) install page — has been harmonized with the rest of `armature.tech`. The landing page topbar now leads with **Features** and **About us** links and ends with a single **Book a demo →** CTA that replaces the previous orange "Benchmark your MCP with real agents" button. Benchmark and review pages keep a **Methodology** link in the topbar alongside the same **Book a demo →** CTA, and the dismissible **Let your agent rate the software it uses →** review pill is now an inline element in the topbar rather than a floating overlay, so it no longer collides with the search box on narrow screens. The **← Benchmark menu** back link on per-category leaderboards and the public `/review` page moves out of its own header strip and into the page body, removing a duplicate navigation band. No action required.

### Bug fixes

* **Overlapping vendors on the public benchmark scatter chart now stay visible.** When two vendors landed on the same `(pass rate, efficiency)` coordinate on the Success × Efficiency scatter — for example, Folk and HubSpot on `/leaderboard/crm` with the Gmail tag filter selected — one marker fully covered the other, making the smaller vendor invisible and unclickable. Co-located markers are now slightly offset so each is hoverable and clickable, with a faint connector linking the offset marker back to the true score point so the chart still reads accurately. No action required.

* **Clean no-friction Agent Review submissions are no longer rejected.** Agents filing a compact review through the public Agent Review intake — the [`/mcp/agent-review`](/mcp-api/overview) MCP endpoint and the matching HTTP API — can now omit `friction_tags` on the happy-path 5/5 review where the agent hit no friction, and the submission is accepted as `friction_tags: []` instead of being 400'd with `friction_tags must be an array`. A present-but-non-array value still errors as before. The detailed-report path already treated friction as optional. No action required.

* **Agent Review skill now ships the correct schema version.** The `agent-review` skill installed by `npx skills add armature-tech/skills --all --global` previously documented `schema_version: agent-tool-review.compact.v1`, which the public intake API rejects — so agents following the published skill saw a 400 on submit. The skill now documents the required `agent-review.compact.v1` value and clarifies that `friction_tags` is optional on a clean review. Re-run the skill install command to pick up the fix. No action required beyond that.

* **Public benchmark and review routes on `armature.tech` no longer 404.** After last week's move of the public benchmark to the marketing domain, `armature.tech/leaderboard` (and per-category and vendor sub-paths), `/methodology`, `/review`, `/agent-review`, and `/reviews` briefly returned 404 because the marketing site was missing the proxy rules that forward those paths to the app. The rules have been restored — every public benchmark and review URL now serves the right page directly. No action required.

* **Public benchmark URLs no longer redirect-loop.** Visiting `armature.tech/leaderboard`, `/leaderboard/observability`, `/methodology`, `/review`, `/agent-review`, or `/reviews` briefly returned a redirect loop after the move to the marketing domain. The redirects driving the loop have been removed, so every public benchmark URL — on both `armature.tech` and `app.armature.tech` — now serves the page directly with no redirect hop. No action required.

* **Benchmark vendor logos render correctly.** Vendor logos on the public [leaderboard](/workflows/overview) and category pages were being recoloured and cropped, making some marks hard to recognize. The recolour treatment has been removed entirely — every vendor icon now renders as a contained image on a neutral tile, preserving the original artwork and colours. Vendors with no icon asset still fall back to their two-letter mark. As part of the same pass, Railway's previously invisible white-on-white logo has been replaced with a visible black mark. No action required.

* **Codex tester runs honor the model you selected.** Codex CLI [workflow runs](/workflows/runs) authenticated with a ChatGPT subscription previously let the CLI silently fall back to the account or profile default — typically the latest Codex model on the account — even when the workflow had pinned a different model. Armature now passes the workflow's selected model to Codex by default, so the column you chose on the leaderboard or in the workflow editor is the one that actually runs. No action required.

* **Codex plan-limit fallback catches model-entitlement rejections.** Codex CLI runs whose ChatGPT account is not entitled to the workflow-selected model — previously surfaced as a confusing `model is not supported when using Codex with a ChatGPT account` failure — are now classified as a plan-limit condition and trip the same automatic API-key fallback as other limit errors, so [workflow runs](/workflows/runs) finish on the right model instead of erroring out. No action required.

* **Reviews and install-review nav links resolve to the public surfaces from every benchmark page.** The **Reviews** link and the **Let your agent rate the software it uses →** install CTA in the public benchmark site topbar (and the responsive mobile nav row) previously rendered as same-origin paths, so visitors browsing benchmark pages on `app.armature.tech` were sent to `app.armature.tech/reviews` and `app.armature.tech/install-review` instead of the public [reviews directory](/mcp-api/overview) and install page. Both links now route to the public apex — `armature.tech/reviews` and `app.armature.tech/install-review` (and the `tryarmature.com` equivalents on that domain) — regardless of which host the visitor lands on, with local previews still resolving to relative paths. The mobile nav row also now keeps the **Reviews** link and the install CTA visible alongside **Book a demo →** instead of collapsing to demo-only. No action required.\n\n- **Public Agent Review pages now load data on `armature.tech`.** The [public reviews directory](/mcp-api/overview) at `armature.tech/reviews` and the per-target detail pages at `armature.tech/reviews/<target>` were rendering the page shell but coming up empty because the same-origin data fetch was returning `NOT_FOUND` — the underlying read API only lives on the app host. The reviews pages served from the marketing apex now fetch from the app host directly, with the right CORS headers and CSP allowance in place, so the directory grid and detail pages populate as expected on `armature.tech` just like they already did on `app.armature.tech`. No action required.

* **"View all" on the public reviews directory now expands the list.** The **View all N targets →** control on the [public reviews directory](/mcp-api/overview) at `/reviews` now expands the capped top-6 grid to show every reviewed target in place, instead of being a no-op affordance that just cleared the search box. The meta label updates to **All N targets** once expanded, and the control hides itself after activation. The cap behavior on first load and the search-driven filter are unchanged. No action required.

* **Anthropic API harness fast-fails MCP targets it can't authenticate against.** [Workflow runs](/workflows/runs) on the Anthropic API tester against an [MCP server](/mcp-servers/connecting) configured with auth headers Claude API cannot forward — anything other than a `Bearer` Authorization header, such as `X-API-Key`, `X-SigNoz-*`, or other custom auth or token headers — now fail immediately with a clear message naming the offending header(s) and pointing you at Claude Code, Codex, ChatGPT, or a bearer-token / proxy auth profile. Previously these runs proceeded and produced confusing upstream `401`s from the target. No action required — switch tester harnesses or rewrite the MCP server's auth to Bearer if you hit the new gate.

## Week of May 31, 2026

### Updates

* **Hide the Testing & Benchmarks tabs for Session Analytics–only workspaces.** Workspaces whose product surface is **Session Analytics** rather than the workflow tester can now have the Testing & Benchmarks sidebar entries — **Insights**, **Workflows**, and **Tool monitors** — greyed out as non-interactive items, with a tooltip explaining that the feature isn't included. Direct URLs to those routes redirect to **Sources** so the workspace lands in the right place by default. Existing workspaces are unaffected; contact us to enable the block for your organization. No action required.

* **Hide the Billing tab for workspaces billed outside the self-serve flow.** Workspaces that Armature bills directly — outside the self-serve Stripe flow on the [Billing](/settings/billing) page — can now have the **Billing** entry hidden from both the account menu and the **Settings** sub-nav. Direct URLs to `/settings/billing` redirect to `/settings/organization` so the workspace lands on a page it can actually use. The billing panel still renders on the subscription-recovery path so an owner with an unpaid invoice can still complete payment. Existing workspaces are unaffected; contact us to enable the block for your organization. No action required.

* **Hide the "Connect Armature MCP" pitch for Code-Mode customers.** Workspaces that are themselves an Armature MCP — and therefore never install one — can now have both **Connect Armature MCP** surfaces suppressed: the floating topbar nudge and the sidebar **Connect** card. The rest of the [MCP server](/mcp-servers/connecting) connect flow is unchanged for workspaces that do install one. Existing workspaces are unaffected; contact us to enable the block for your organization. No action required.

### New features

* **Public Agent Review directory and detail pages.** The read surface for agent-submitted tool reviews is now live on the public benchmark site, completing the loop that started with the [Agent Review intake](#week-of-may-29-2026) and the [public install page](#new-features). A new **`/reviews`** directory lists every reviewed target — CLIs, MCP servers, APIs, SDKs, hosted services, and web apps — as a score-ranked card grid, with a live name and vendor filter and a split hero introducing the surface. Clicking a card opens **`/reviews/<target>`**, a detail page that leads with a big score, a three-bar outcome breakdown (`worked` / `partial` / `blocked`), completion rate, and the number of distinct agent harnesses that contributed, followed by a stream of individual review cards. Each card surfaces the **agent harness** that filed the report — Claude Code, Codex, ChatGPT, Gemini CLI, Cursor, or Anthropic API — with model and transport as subtext, so readers can see who reported what. Outcome filter chips on the detail page let you narrow the stream to only the `worked`, `partial`, or `blocked` reports. Only accepted, moderated reviews appear. No action required.\n\n- **Public install page for the Agent Review MCP.** The agent review intake [shipped last week](#week-of-may-29-2026) now has a public install surface at `/review` (alias `/agent-review`) on the benchmark site, so any coding agent can be pointed at the public review MCP without a workspace or sign-in. The page leads with per-agent install tabs — **Claude Code**, **Codex**, **Cursor**, **VS Code**, **Gemini**, **opencode**, **openclaw**, and an **HTTP** fallback — each with a one-line quick-install command and the self-review trigger, mirroring the formats from the authenticated MCP install page (`claude mcp add …`, `cursor://…`, `vscode:mcp/install?…`). A concise **Privacy-safe by design** panel breaks down what's stored vs. never stored. The public MCP endpoint is reachable at a clean install URL (`mcp.armature.tech/mcp/review`) with an HTTP fallback. A new **Let your agent rate the software it uses →** CTA pill — dismissible, persisted per browser — sits in the header on the [public leaderboard](/workflows/overview) landing and methodology pages, and in the secondary menu bar on per-category leaderboards, linking straight to `/review`. No action required.

* **Agent Review skill bundled into the Claude Code install.** The **Claude Code** tab on the public `/review` install page now ships the `agent-review` skill alongside the MCP server, so Claude Code sessions get not just the submit tool but the prompt, ground rules, and report shape that tell the agent *when* and *how* to file a privacy-safe review. The tab now exposes two separate copyable commands — one to register the MCP server, one to install the skill into `~/.claude/skills` and the cross-agent `~/.agents/skills` directory — plus a **Copy both as one command** button that places a single chained one-liner on the clipboard without rendering it on the page. The skill itself is served as a static `.md` asset on the public MCP host (no pipe-to-shell), so installation only writes a single named file to a known path. No action required.

* **Session Analytics Overview now shows a Clients distribution chart.** The **Session Analytics → Overview** page picks up a new **Clients** chart that breaks down sessions by the MCP client harness that produced them — Claude Code, Claude, Codex, OpenClaw, OpenCode, Cursor, VS Code, Gemini, ChatGPT — using the same real logo assets as the rest of the product, so a harness reads identically across Testing & Benchmarks and Session Analytics. Unrecognized client strings fold into an **Other** bucket and sessions without a recorded client into **Unknown**, both shown as a neutral monogram chip. Client identification ordering was tightened so previously mislabeled clients (for example `codex-mcp-client` getting matched as MCP Inspector, or the Claude remote proxy getting matched as Claude Code) now resolve to the right brand. No action required.

### Updates

* **Session Analytics: Tool calls over time chart redesigned for legibility.** The **Tool calls over time** chart on the **Session Analytics → Overview** has been rebuilt as a stacked bar chart with real axes, replacing the previous stacked-area rendering that painted `execute_script` solid down to the baseline and visually overdrew the `search_sdk` band beneath it — so `search_sdk` volume (around 28% of calls for some orgs) was effectively invisible. The chart now shows a charcoal `search_sdk` base with `execute_script` stacked on top in the brand accent, a real y-axis with clean tick steps, and an x-axis with date labels thinned to avoid collisions on dense ranges. Axis text uses a uniform-scale viewBox so labels never stretch on resize. The failed-search series, which was noise on this overview chart, has been dropped — failed searches remain available on the dedicated **Misses** view. No action required.

* **Session Analytics hides handshake-only sessions from every view.** Sessions that contain only an `initialize` handshake with no tool calls, script executions, or searches — for example a client probing the gateway on startup — are now systematically suppressed from **Session Analytics** overview KPIs, the sessions list, filter counts, and neighbor links, instead of inflating session counts and skewing rollups. The underlying telemetry is preserved, not deleted; rows are marked as hidden so they can be unhidden later if needed. Legacy sessions captured before the **Client** chip was added can also be hidden in bulk by operators. No action required.

* **Session Analytics drops LLM-judged success and failure verdicts.** The session-level LLM judge that stamped each **Session Analytics** session as a workflow success or failure — and the per-event outcome the classifier was producing alongside it — has been removed. In practice the verdict was confidently wrong often enough, and expensive enough to run on every production session, that it was adding more noise than signal. The **Session Analytics → Overview** workflow success rate panel, the **Failed** sessions view and **Most failing** sort, the outcome bars and pills on session rows, the verdict footer and **Rejudge** button on the thinking trace, and the per-topic outcome breakdown column are all removed as part of this change. What stays: failed-search miss clustering, execution errors and transient-error flags drawn from the factual `ok` field on each event, topic clustering, intents, and per-event frustration inference — the trace timeline still tints rows red on real failures. The **Testing & Benchmarks** app's pass/fail grading is completely unaffected. No action required.

### Bug fixes

* **Benchmark run detail stays reachable when Testing & Benchmarks tabs are blocked.** Workspaces with the **Testing & Benchmarks** sidebar blocked (per the [update above](#updates)) can still open benchmark leaderboards, but clicking a row on the leaderboard previously redirected to **Sources** instead of opening the run — the block covered both the run history list and individual run detail pages. The guard now applies only to the run history list, so deep links from the benchmark leaderboard to a specific [run](/workflows/runs) load as expected. No action required.

## Week of May 29, 2026

### New features

* **Agents can submit privacy-safe reviews to Armature.** A new public **Agent Review** intake lets coding agents share short, structured experience reports about the CLIs, MCP servers, APIs, SDKs, hosted services, and web apps they actually used during a task — so the next agent or operator picking the same tool can see what worked, what broke, and how to recover. There are two submission paths: a public JSON-RPC **MCP endpoint** at `/mcp/agent-review` (`submit_agent_review` and `submit_agent_review_detail`) and a public HTTP API (`POST /api/agent-review` for the compact report, `POST /api/agent-review/detail` when the response asks for more depth). Submissions are unauthenticated, rate-limited, moderated, and gated to strip secrets, private data, raw logs, and stack traces. An accompanying **agent-review** skill package, plus the `agent-review.compact.v1` and `agent-review-report.v1.json` schemas, give agents the prompt, ground rules, and report shape to fill in before submitting. The subject key is derived from `kind`, `vendor`, `product`, and `interface` so the same tool clusters cleanly across reports. A public read surface for the aggregated rollups is not yet exposed — it will follow once the read product is finalized. See the [MCP API overview](/mcp-api/overview).

### Updates

* **Session Analytics drops LLM success/failure verdicts in favor of factual signals.** **Session Analytics** no longer runs an LLM judge over every production MCP session, and the verdict surface has been removed from the app: the **Overview** workflow success-rate panel and the genuine-vs-transient failure split, the **Sessions** outcome column with its bars and pills, the **Most failing** sort, the **Failed** sessions view, the per-topic outcome breakdown column, and the **Thinking Trace** verdict footer and **Rejudge** button are all gone. **Thinking Trace** result tinting now keys off the factual `event.ok` flag from the transport instead of an LLM verdict. Factual signals are unchanged — search misses, executor errors, transient-error flags, topic clustering, intents, and per-event frustration inference still populate the same views. The **Testing & Benchmarks** app's grading is untouched. No action required.

* **Session Analytics now groups events by real MCP session and shows the client harness.** Session Analytics sessions are now keyed on the `Mcp-Session-Id` the transport already carries, so each row corresponds to one real client session instead of a 15-minute time window — two clients hitting the gateway in parallel can no longer be merged into a single session, and a long-running session no longer gets split when there's a quiet stretch. Sessions where no session id is present fall back to a per-actor daily bucket, also scoped so different clients can't collide. The **Sessions** list row meta and the **Session** card in **Thinking Trace** now surface a **Client** chip — the MCP client name, version, and negotiated protocol version learned from the `initialize` handshake (for example, `Claude Code 2.1.149 · MCP 2025-06-18`) — so you can tell at a glance which harness produced a session. Sessions captured before this update render as **Unknown** and omit the chip. No action required.

* **Public benchmark site header now points back to Armature.** The topbar across every public benchmark page — the landing, per-category leaderboards (for example, `/leaderboard/observability`), the methodology page, and vendor detail pages (for example, `/leaderboard/observability/datadog`) — now leads with the Armature wordmark linking to the marketing site, and ends with a prominent orange **Benchmark your MCP with real agents →** CTA that returns visitors to the leaderboard landing. On narrower screens the CTA moves into its own full-width strip below the topbar so it stays tappable. No action required.

* **Session Analytics: Overview now leads with workflow success rate and separates transient failures.** The **Session Analytics → Overview** headline metric is now a session-level **workflow success rate** based on the per-session LLM verdict, instead of a per-call rollup that read every individual tool-call failure as a workflow failure. Sessions where the agent hit a transient infrastructural error — an upstream `429` or `5xx`, a sandbox missing-global like `btoa` or `Buffer`, or a script wall-clock timeout — and then retried and recovered are now classified as **transient** and excluded from the failure count, with a new genuine-vs-transient failure split on the Overview so a burst of upstream rate limits no longer reads as a product regression. Bare MCP handshakes (a session with no script execution or search activity) are now stamped **empty** and dropped from the success/fail denominator instead of being judged `ambiguous`. Historical sessions re-stamp deterministically — no LLM re-judge — so existing dashboards realign on the next refresh. No action required.

* **Session Analytics: "Gaps" is now "Misses".** The failed-search view in **Session Analytics** is renamed from **Gaps** to **Misses**, and the "What is this?" strip no longer frames every empty search as a missing roadmap API. A miss can be a capability you haven't built — or a tool you already ship under a name the agent didn't search for. The cluster detail's header now reads **SEARCH MISS · N EMPTY SEARCHES**, member queries are labeled **Searched for**, and the suggestion section is now **Suggested fix** with copy that distinguishes the two cases. Sample session rows surface the actual searched query at the top and drop the per-session sigil and short ID for a cleaner read. No action required.

### Bug fixes

* **Session Analytics: outcome and frustration verdicts now land reliably.** The per-session judge and per-event classifier in **Session Analytics** previously failed on the majority of events because the underlying LLM occasionally returned its JSON wrapped in a prose preamble or a fenced code block, and the response budget was too tight to fit the full verdict — so the reply came back truncated and the row was stamped `classifier_error` (ambiguous outcome, low frustration). The judge now tolerates those wrapped responses and has enough headroom to return the full verdict, so sessions and events surface their real **outcome** and **frustration** instead of falling back to a generic error state. Previously polluted classifications re-judge on the next pass. No action required.
* **Session Analytics: Overview chart now plots `search_sdk` and `failed search` series.** The **Tool calls over time** chart on the **Session Analytics → Overview** previously rendered the `search_sdk` and `failed search` series as permanently empty because it only counted standalone search events and ignored the searches that happen inside `execute_script` calls in Code Mode. The chart now counts every individual search call (standalone and inline) and its miss subset, so both series populate and match the `failed_searches` KPI on the same page. No action required.
* **Session Analytics: Misses page now shows every search miss, not just one.** The **Misses** view (formerly **Gaps**) previously drew from every demand cluster, including intent-only clusters that contained no actual search misses — which crowded out real ones and was the root cause of "only one miss showing" for some workspaces. The list now draws strictly from search-miss signals: semantically-grouped clusters at the top, followed by an **Ungrouped misses** long-tail section with one row per exact failed query. Each row still drills into the sessions that ran it. No action required.
* **Session Analytics: Misses page collapses lexical variants of the same gap.** The **Misses** view in **Session Analytics** now merges short keyword variants of the same capability gap into a single row, so a demand signal like "opportunity" / "opportunities" / "listOpportunities" presents as one row with the combined member count and sample sessions, instead of fragmenting into two small clusters plus an orphan. Singular and plural forms, camelCase and PascalCase variants, and a leading CRUD or list verb on a short keyword are folded together; genuinely distinct multi-word searches (for example, "opportunity pipeline") still stay as their own row. The detail pane that opens when you click a merged row reports the same combined count and the union of underlying queries. No action required.

### New features

* **Reworked public benchmark navigation.** The public benchmark site has a simpler, more direct shape. The topbar and category controls on per-category leaderboards and vendor pages have been deduplicated into a single shared menu, so there's exactly one way to switch categories no matter which page you land on. The `/leaderboard` landing page now presents each category as a tile that links straight into its leaderboard, with the vendor list expandable inline; clicking a vendor row goes directly to the vendor's detail page — for example, `/leaderboard/observability/datadog` — instead of routing through an intermediate category view. The vendor page itself drops the prev/next adjacent-vendor strip in favor of a clean **← Benchmark menu** back link. No action required.

* **Workflow prompts and rubrics on the public benchmark.** Every [workflow](/workflows/overview) column on a per-category public leaderboard — for example, `/leaderboard/observability` — now reveals its purpose, the exact tester prompt the agent was given, and the numbered success criteria the evaluator graded against, without leaving the page. The scatter chart's workflow filter shows a side popover with the full name and description as you focus each row; selecting a single workflow surfaces an explainer strip above the chart with a **View rubric →** button; head-to-head rows pick up a hover popover with the criteria count and a **View full rubric** link; and on the per-vendor page (for example, `/leaderboard/observability/datadog`) each workflow row now expands inline to show the full prompt and rubric, with a `N workflows tested · M total success criteria evaluated` summary in the card header. Workflows without published prompt or criteria render label-only — no info icon or expand affordance. No action required.

### Updates

* **Public benchmark only lists categories with real data.** The public benchmark site no longer shows a category until it has aggregated runs, replacing the previous **SOON** pill and **Audit in progress** placeholder panels. A category appears on the landing page and the category menu the moment its first snapshot is published, and disappears cleanly otherwise — so visitors never land on a leaderboard that's about to spin forever or show an apology. The staff **Benchmark Admin** toggle is relabeled **Show on public benchmark**, with helper text noting that an enabled category surfaces automatically once it has aggregated runs. If the categories API itself is unreachable, the landing now shows **Couldn't load the catalog. Please refresh.** instead of a misleading empty state. No action required.

* **Staff Benchmark Admin sidebar icon restored.** The **Benchmark Admin** entry in the staff sidebar was rendering as a blank tile because its icon name didn't resolve in the shared icon set; it now shows a gavel icon, visually distinct from the **Settings** gear used by Organization settings. Staff-only; no change to the public leaderboards. No action required.

### Bug fixes

* **Session Analytics: Topics drill-down lands on real sessions.** Clicking a provisional topic row in **Session Analytics → Topics** now opens the **Sessions** table populated with every session whose events match that topic, instead of an empty table. Previously the backend filter only matched a session's first script execution, so sessions where the topic surfaced on a later step were silently dropped — on one workspace, the same row went from 11 matched sessions to 31 after the fix. No action required.
* **Session Analytics: Searches drill-down is now clickable.** Provisional failed-search rows in **Session Analytics → Searches** now drill into the **Sessions** table, filtered to the sessions that ran that exact failed search, with a clearable **Sessions that searched "…"** chip that mirrors the existing intent chip. Previously these rows were rendered but inert. No action required.

## Week of May 25, 2026

### New features

* **Public benchmark methodology now lives on its own page.** The public benchmark site has a new `/methodology` page that explains how a leaderboard is produced — the four-step pipeline of [workflows](/workflows/overview), dispatch, judge, and rank — so visitors who arrive on a leaderboard via deep link or search can find the explanation without scrolling back to the landing page. The page is reachable from the topbar on the landing page, the topbar on every per-category leaderboard, and the landing footer, and links straight back to `/leaderboard`. No action required.

* **Curator-driven capability gaps on the public benchmark.** Vendors are no longer penalized on the public leaderboard for [workflows](/workflows/overview) their product genuinely doesn't ship. The Armature team can now flag specific vendor × workflow pairs as capability gaps from the staff **Benchmark Admin** surface (new **Capability gaps** tab), with a curated reason explaining why. Flagged cells render as N/A on the leaderboard with a tooltip, are excluded from the vendor's overall rank and from efficiency normalization, and appear together in a new **Capability gaps** section under the head-to-head matrix grouped by workflow. On `/leaderboard/database`, Neon, Planetscale, Insforge, and Insforge CLI are no longer dinged on `edge_fn_lifecycle` — none of them ship edge functions — and their star ratings now reflect only the workflows they actually support. Changes propagate on the next hourly aggregator pass. No action required.

* **Public leaderboard is live at friendly URLs.** The public benchmark site is now reachable at `/leaderboard` (landing) and `/leaderboard/<category>` (per-category leaderboard) — for example, `/leaderboard/observability` and `/leaderboard/crm`. Direct links, the landing page's category tiles, vendor cards, and the staff **Benchmark Admin** "View public leaderboard" button all route through these URLs, and each leaderboard preselects the right category based on the path. The previous obscure paths still work for internal demos but are kept out of search results. No action required.

* **Public benchmark adds Database and Cloud deploy categories.** Two new categories are now live on the public benchmark leaderboard. **Database** evaluates **Supabase**, **Neon**, **Planetscale**, **Insforge**, and **insforge cli** across eight [workflows](/workflows/overview) — schema discovery and query plan inspection, table lifecycle, column lifecycle, RLS policy lifecycle, bulk upsert with conflict recovery, branching, analytics ingestion and rollup, and edge function lifecycle. **Cloud deploy / hosting** evaluates **Vercel**, **Vercel CLI**, **Netlify CLI**, **Railway CLI**, and **Firebase** across five workflows — preview deploy, deploy inspect, deploy diagnosis, environment variable lifecycle, and domain alias. Both categories are seeded from completed runs in the house benchmark org (17 batches for Database, 5 for Cloud deploy) and refresh on the standard hourly aggregator cadence, with the same pass-rate and efficiency scoring as the observability and CRM categories. The category list, vendors, and workflows are editable post-launch from the staff **Benchmark Admin** surface. No action required.

* **Staff orchestrator for the public benchmark.** A new **Benchmark Admin** surface lets the Armature team edit the public benchmark catalog — which categories are live, the vendor list, and the [workflow](/workflows/overview) columns — directly from the dashboard. Changes pick up on the next aggregator refresh and appear on the public leaderboard at `/benchmark/:category` without a redeploy. Visible only to Armature staff; non-staff don't see the nav entry. No visible change to the public observability or CRM leaderboards — this only moves the catalog from a config file into the dashboard. No action required.

* **Session Analytics — MCP Analytics v2 (early access).** The **MCP Analytics** surface announced [last week](#week-of-may-25-2026) has been rebuilt as **Session Analytics**, a standalone app with its own sidebar and app switcher alongside Testing & Benchmarks. Five views replace the v1 tabs: **Overview**, **Topics** (auto-clustered user intents with expandable detail), **Searches** (failed-search clusters paired with LLM-suggested REST endpoints to close the gap), **Sessions** (anonymous session list with a flagging side-rail), and **Thinking Trace** (per-session timeline that renders the agent's reasoning event-by-event — thoughts, search hits and misses, and script executions with a sub-call waterfall). Each session is identified by a deterministic SVG sigil so operators can recognize patterns without seeing user identities. Still gated behind the `mcp_analytics` feature flag — workspaces without it see no change. Contact us to enable it. The public benchmark leaderboard now covers a second category alongside observability: **CRM**, with **Attio**, **Folk**, and **Hubspot** evaluated across four CRM [workflows](/workflows/overview). The category is reachable from the public landing page CRM tile, which deep-links into the leaderboard with the right category preselected, and the page chrome (tab title, header chip, hero copy) now reflects the active category instead of being hardcoded to observability. Folk will surface automatically once its first completed batch lands. No action required.

* **Public observability benchmark site (soft launch).** A standalone public leaderboard now compares how well agent harnesses solve real observability tasks against four vendor MCP servers — **Datadog**, **Grafana**, **SigNoz**, and **Dash0** — across nine [workflows](/workflows/overview) covering error logs, failed traces, environment audits, on-call handoffs, regression detection, incident autopsies, MCP inventory, metric queries, and signal-presence audits. Each vendor cell shows pass rate and an efficiency score derived from run duration, token usage, tool calls, and retries. Rankings refresh nightly from the most recent completed benchmark batch per (vendor, workflow), and an append-only history table preserves rank deltas across runs. The site is soft-launched on an unlinked URL while we validate the data — a public landing page and link from the marketing site will follow.

* **MCP Analytics (early access).** A new product-intelligence surface for teams running their own Code Mode MCP gateway. Once enabled for your workspace, the **MCP Analytics** tab shows what your agent's users actually asked for, where they hit dead ends, and clustered demand signals from search misses and failed scripts — across an **Overview**, **Sessions**, **Intents**, and **Demand** view. Sessions and per-execution detail include the agent's intent, the executed script, the upstream call trace, and an LLM-judged outcome and frustration level. Operators with access can also edit the intent taxonomy and re-classify historical events from **Settings**. Rolling out behind a feature flag — contact us if you'd like your workspace turned on.

* **Dispatch benchmark batches from chat.** Two new tools on the Armature [MCP API](/mcp-api/overview) let an agent fan out a benchmark [workflow](/workflows/runs) across a matrix of MCP server targets and tester models without opening the dashboard. `dispatch_benchmark_batch` claims the matrix and enqueues one run per `(mcpServerId, testerTarget)` cell, then streams results back through the existing `inspect_run` and `search_runs` tools. `list_benchmark_batches` returns past batches for a benchmark workflow, most recent first, with matrix composition and run-count rollups. `dispatch_benchmark_batch` requires the `editor` role or higher; `list_benchmark_batches` is available to any role with read access — see the [role table](/mcp-api/roles).

* **Edit the authentication method on an existing CLI target.** The edit panel for a CLI [MCP server](/mcp-servers/connecting) now exposes the same authentication section as the connect modal, so you can switch a CLI between **No auth**, **API key (env var)**, and **OAuth** without deleting and recreating the target. Stored API keys can also be rotated in place — type a new value into the secret field and save. OAuth switches use the same popup flow as the connect modal, and an empty-scopes guard prevents silent unscoped grants. See [Connecting an MCP server](/mcp-servers/connecting).

### Updates

* **Public leaderboard cards and search jump straight to the vendor page.** Vendor cards in each category on the `/leaderboard` landing, and the suggestions in the hero search dropdown, now deep-link directly to `/leaderboard/<category>/<vendor>` instead of the category comparison view. Clicking a Datadog card or selecting Datadog from the search now opens the Datadog detail page in one step. Categories that aren't live yet still scroll to the category list, and per-category leaderboards remain reachable from the sidebar and the **See full leaderboard** links. No action required.
* **Public benchmark dropdowns now match the leaderboard design system.** The workflow filter on the scatter chart and the A/B vendor pickers on head-to-head now open into a styled dropdown panel — mono font, square corners, brick brand colors, and a brand-soft wash on the current selection — instead of falling back to native browser chrome. Keyboard navigation is supported throughout: tab to focus, Enter or ↓/↑ to open, ↑/↓ to move (skipping disabled rows), Enter to commit, Esc or Tab to dismiss, and click-outside to close. The head-to-head pickers continue to grey out whichever vendor is already selected on the opposite side so you can't pick the same vendor twice. No action required.
* **Public leaderboard hero now shows the harness lineup.** The `/leaderboard` landing page now ends its hero with a scrolling "Real agents on every harness your users actually use" strip, listing the eight agent harnesses every leaderboard row is produced by: **Claude Code**, **Codex**, **OpenClaw**, **Claude**, **ChatGPT**, **Gemini CLI**, **OpenCode**, and **Cursor**. The strip makes the harness coverage visible at a glance before you drop into a category — see the [public leaderboard](/workflows/overview). No action required.
* **Claude benchmark tester now uses ToolSearch by default.** Claude tester runs on the public observability and CRM [benchmarks](/workflows/overview) now go through the same ToolSearch layer that real Claude Code users hit when calling an [MCP server](/mcp-servers/connecting), instead of eagerly loading every tool on every turn. This makes the Claude column on the public leaderboard a faithful proxy for the experience customers' end users actually see. The tradeoff: oversized vendor `tools/list` catalogs that previously caused Claude to autocompact-thrash are now hidden from the benchmark by ToolSearch, so the leaderboard no longer surfaces schema-bloat issues on its own. Rankings on cleanly-passing cells may shift slightly after the next aggregator refresh. No action required.
* **Benchmark Admin: activate or deactivate vendors without deleting them.** Vendor rows in the staff **Benchmark Admin** surface now have an eye / eye-off toggle alongside the existing upload, edit, and trash actions. Deactivating a vendor keeps all of its configuration — brand color, icon, sort order, and linked [MCP server](/mcp-servers/connecting) — but removes it from the public leaderboard on the next aggregator refresh. Inactive rows fade and show an "Inactive" pill so the catalog stays legible. This replaces the previous workflow of hard-deleting a vendor to hide it during a relaunch, data backfill, or while waiting on a fix, which wiped its configuration and forced staff to re-enter everything to bring it back. Staff-only; no change to the public leaderboards beyond which vendors appear. No action required.
* **Public benchmark leaderboard search spans every category.** The topbar search on the public benchmark leaderboard now finds vendors across every live category instead of only the one you're currently viewing. Searching for Attio while looking at **Cloud deploy / hosting**, for example, now surfaces the Attio result from **CRM** and deep-links you to that category's leaderboard with the right row in focus. The empty-query suggestion dropdown also shows the top vendors sorted alphabetically across every category rather than biasing toward the current page, and clicking a vendor in the current category still focuses its row as before. If the category catalog is unreachable, search falls back to the current category so the existing flow keeps working. No action required.
* **Public benchmark landing page picks up new categories automatically.** The category tiles on the public benchmark landing page now come from the live public benchmark config instead of a hardcoded list, so categories added or enabled from the staff **Benchmark Admin** surface — currently **Database** and **Cloud deploy / hosting** — surface on the landing without a frontend change. Per-category vendor cards still come from the public benchmark API, so a newly enabled category lists immediately and fills in with data once the aggregator has run against the house benchmark org's batches. No action required.
* **Benchmark Admin: vendor logos, inline edit, and trash-icon remove.** Vendor rows in the staff **Benchmark Admin** surface now render the real vendor logo — sourced from the shared logo bucket when available, or the Simple Icons CDN by the vendor's icon slug — instead of a colored placeholder, with the chip background flipping to white when a logo renders. Each vendor and [workflow](/workflows/overview) row also picks up an **Edit** action that loads the row back into the form for in-place updates (slug stays locked as the upsert identity), and the previous **Remove** text button is now a trash icon. Staff-only; no change to the public leaderboards. No action required.
* **Public observability benchmark now refreshes hourly.** The aggregator that builds the public observability [benchmark](/workflows/overview) snapshot now runs every hour instead of being held off-schedule during the soft launch. Pass rates, efficiency scores, and rank history on the public leaderboard pick up new completed batches within the hour, so vendor cells track the underlying run data much more closely. No action required.
* **Public benchmark landing page is trimmed and corrected.** The public benchmark landing page now lists only the **observability** category instead of nine placeholder rows, since observability is the only category with completed public benchmark batches today. The four methodology cards under "How it works" have been rewritten to match the actual pipeline — rubric-driven [workflows](/workflows/overview), `dispatch_benchmark_batch` for wave-paced dispatch (see the [MCP API](/mcp-api/overview)), a single evaluator returning a pass/fail verdict, and a pass-rate rollup — replacing earlier copy that described an audit script, fixed 3×3 harness/model matrix, multi-judge unanimous voting, and a now-removed capability-gap flag. Observability's vendor and run counts on the page now match the live public benchmark API. No action required.
* **Tsuga and Tsuga Code Mode added to the public observability benchmark.** The vendor list on the public observability leaderboard now includes **Tsuga** and **Tsuga Code Mode** alongside Datadog, Grafana, SigNoz, and Dash0, bringing the category to six vendors across nine [workflows](/workflows/overview). Both vendors were already present in the seeded snapshot but had been getting dropped from the nightly refresh — they now persist across runs. The Dash0 brand color has also been corrected to match what the live site was already serving. No action required.
* **Public benchmark landing page now shows live vendor data.** The observability category on the public benchmark landing page now fetches vendor cards directly from the public benchmark API, so pass rates and the vendor list update automatically as new snapshots ship. Previously the landing page rendered a hand-maintained vendor list that could drift from the live leaderboard — for example, Tsuga Code Mode was briefly missing while the static list was out of sync. Categories still in audit continue to show the existing "Audit in progress" notice. No action required.
* **Public benchmark site falls back to a clear empty state.** When the public benchmark API has no snapshot available for a category, the leaderboard, head-to-head, and scatter views now render a "No live data for this category yet" card instead of falling back to a built-in placeholder dataset. This removes a class of stale numbers that could appear if a snapshot was missing, and matches what the landing page already does. No action required.
* **Public benchmark gives half-credit for partial verdicts.** Pass rate on the public observability [benchmark](/workflows/overview) now counts runs that satisfied some but not all success criteria at 0.5 credit, instead of treating them the same as runs that produced nothing. Cells with partials display as "X + Y½ / N · Z%" in the tooltip so the partial-credit contribution is visible alongside full passes. This produces a more faithful comparison across vendors on workflows where a vendor reliably gets part of the answer right. No action required.
* **Overall vendor rank now weights workflows equally.** The overall pass rate used to rank vendors on the public observability leaderboard is now a macro-average across [workflows](/workflows/overview) — each workflow contributes equally — instead of a micro-average that let workflows with more runs dominate. Combined with the partial-credit and archived-run changes above, this shifts the live snapshot's vendor ordering toward what spot-checks of the underlying runs already suggest. No action required.
* **Archived benchmark batches no longer leak into the public leaderboard.** The nightly aggregator that refreshes the public observability [benchmark](/workflows/overview) now filters out archived batches, so a vendor cell only reflects runs from non-archived batches. Previously, archived batches were silently still feeding the snapshot, and because archival had been applied unevenly across vendors, this distorted cross-vendor comparisons. Spot-checked cells now match the non-archived run set. No action required.
* **Public benchmark landing card matches the leaderboard.** Per-vendor pass rates on the public observability benchmark landing page now use the same macro-average across [workflows](/workflows/overview) and the same partial-credit pass rate as the leaderboard. Previously the landing card summed raw passed/runs, which both dropped the half-credit for partial verdicts and let workflows with more runs dominate — so the landing card and leaderboard could show different vendor ordering. The two views now agree. No action required.
* **Public benchmark efficiency now reflects only passing runs.** The efficiency score on the public observability benchmark — derived from run duration, token usage, tool calls, and retries — is now averaged across **passing** runs only, instead of every run in a vendor × workflow cell. Cells with zero passing runs land at the scale floor (40) rather than picking up a misleading mid-range score from cheap, fast-failing attempts. This realigns the Success × Efficiency scatter so the two axes are independent again: low-pass cells no longer score deceptively well on efficiency. Rankings for cleanly passing cells are unaffected. No action required.
* **Capability-gap flag removed from public benchmark cells.** The capability-gap chip that previously appeared in leaderboard cells, head-to-head rows, and cell tooltips on the public observability benchmark has been removed. The signal was driven by a single heuristic and added noise without changing rankings, so vendor cells now show pass rate and efficiency only. Rankings, history, and the underlying benchmark runs are unchanged. No action required.
* **CLI diagnostic output is preserved in run traces.** When a CLI [workflow run](/workflows/runs) returns long `stdout`, `stderr`, `output`, `log`, or `logs` fields, the trace preview now keeps the last 500 characters of each in a `*_tail` field alongside the existing head preview. Error messages and stack traces that previously fell off the bottom of a truncated buffer are now visible in the run trace without needing to pull the raw payload. Secret-bearing keys are still redacted. No action required.
* **Redesigned insights digest email.** The recurring insights digest has a new editorial layout that mirrors the in-app **Insights** view — severity-railed finding cards, brand-accented section heads, and a system font stack with email-client-safe fallbacks. A new **Last runs** table appears after the diff pills so recipients can see recent [workflow run](/workflows/runs) outcomes at a glance, with system-issue rows (product bugs, upstream provider errors) filtered out. Operator-internal "Improve how you test this MCP" setup findings have been removed from the email; they remain visible in the app's **Insights** view. Wording is now cadence-neutral ("Insights digest", "What changed since last time"). No action required.
* **Edit details now blocks transport changes outright.** Saving **Edit details** on an [MCP server](/mcp-servers/connecting) can no longer silently ignore an attempted change between **Remote/Hosted MCP** and **Local CLI** — the API now returns a clear error instead of returning success without applying the change. The edit panel also includes a one-line note explaining that to convert between transports you should delete the target and create a new one, since the two modes use different transport, provisioning, and authentication shapes.
* **Cleaner workflow filter label on the public benchmark scatter chart.** The scatter chart's workflow filter on the public observability benchmark now shows "all workflows" instead of "all workflows (avg)" for its default option. The aggregate is still an average across workflows; the label is just less noisy. No action required.
* **Capability gaps now appear directly under the leaderboard matrix.** On per-category pages of the public benchmark — for example, `/leaderboard/database` — the **Capability gaps** section now renders directly below the leaderboard matrix, above the Success × Efficiency scatter and head-to-head views. Readers see which (vendor × [workflow](/workflows/overview)) pairs were excluded because the product doesn't ship that workflow *before* they reach views where an excluded cell could read as "vendor lost on this workflow." Categories with no curated gaps (currently observability) render the same layout as before. No action required.

### Bug fixes

* **Public benchmark no longer gets stuck on "Loading benchmark…" for upcoming categories.** The **Database** and **Cloud deploy / hosting** categories on the public benchmark site previously rendered an indefinite "LOADING BENCHMARK…" spinner because they were listed as live before any snapshot had been published. Both categories now correctly surface with the **SOON** pill and the **Audit in progress** panel (for example, "Database benchmark is being assembled. 4 vendors identified, audit script in review."), matching how upcoming categories were always meant to read. As a safety net, any live category whose snapshot fetch hasn't settled within five seconds now falls back to a stable "No snapshot yet" message instead of spinning forever. **Observability** and **CRM** are unchanged. No action required.

* **Workflows page hides the category filter when no workflow has a category.** The chip filter strip on the [Workflows](/workflows/overview) page now derives its category list from the workflows actually loaded on the page — chips appear only for categories at least one workflow uses, sorted alphabetically, with accurate counts. When no workflow is categorized, the entire strip (including the **All** chip) is hidden rather than rendering an empty row. Toggling a chip still filters the table and updates the URL as before. The `/mcps` chip strip is unaffected. No action required.

* **Public benchmark dropdowns keep the focused row in view during keyboard navigation.** Arrow-keying through the workflow filter on the Success × Efficiency scatter chart — and any other long dropdown on the public leaderboard — now scrolls the highlighted row into view as focus moves past the visible window. Previously, on lists taller than the 280px panel (for example, the 10-workflow filter on `/leaderboard/observability`), pressing ↓ past the bottom silently highlighted off-screen options. Wrap-around from the last item back to the first also scrolls the panel to the top. Shorter dropdowns like the head-to-head vendor A/B pickers behave the same as before. No action required.

* **Public leaderboard tabs now show the Armature mark.** The `/leaderboard` landing page and per-category pages (for example, `/leaderboard/observability`) now render the Armature favicon in the browser tab instead of the default globe icon, matching the rest of the public site. No action required.

* **Benchmark cleanup now runs reliably for vendor-per-cell batches.** Cleanup for [benchmark runs](/workflows/runs) whose base [workflow](/workflows/overview) supplies the vendor per cell (rather than pinning a single MCP server on the workflow itself) was being skipped and stamped `WORKFLOW_NOT_FOUND`, leaving vendor-side artifacts — preview deploys, database branches, alert rules — behind for each affected run. Cleanup now resolves the workflow against the cell's specific version, so post-run teardown executes against the right vendor. Verdicts and leaderboard rankings were never affected, only resource cleanup. No action required.

* **Renaming a Benchmark Admin vendor or workflow now preserves its catalog state.** Editing a vendor or [workflow](/workflows/overview) row in **Benchmark Admin** — for example, fixing a display name or brand color — no longer resets its sort order or unlinks the associated [MCP server](/mcp-servers/connecting). Previously, saving the form silently overwrote both fields because they weren't part of the submitted payload; they're now preserved on update and continue to default cleanly on first insert. The public observability and CRM leaderboards already render from the catalog, so column order and vendor → MCP linkage now survive edits as expected. No action required.

* **Claude Code no longer prompts for re-auth after idle periods.** MCP clients running multiple parallel sessions, or pausing and resuming roughly an hour after a concurrent token refresh — most visibly Claude Code against `mcp.armature.tech` — were being forced through a full re-auth roughly every hour, even though refresh tokens are valid for 90 days. Armature no longer rotates the refresh token on `/oauth/token` refresh: the same refresh token is returned on every call, its 90-day expiry slides forward on each successful refresh, and access tokens continue to rotate normally. Parallel sessions therefore share one long-lived refresh token and can no longer invalidate each other. This matches how Google, Microsoft, and Vercel handle OAuth refresh, and is permitted by RFC 6749. Per-grant revocation from the [Settings UI](/settings/billing) and `/api/mcp/oauth/revoke` is unchanged. No action required — see [Authenticating with the Armature MCP API](/mcp-api/authentication).

* **Benchmark runs recover from provider rate limits and evaluator format glitches.** Two transient failure modes on the public observability [benchmark](/workflows/overview) — provider rate-limit responses (429 / TPM saturation from OpenAI and Anthropic) and evaluator replies that came back as markdown instead of strict JSON — are now treated as retryable instead of finalizing the run as a platform failure. Rate-limited attempts are re-dispatched on the standard system-error retry path, and an evaluator that emits prose is re-prompted with a stricter format directive before falling back to a terminal error. The net effect is fewer benchmark cells lost to harness-side noise and pass rates that better reflect real agent behavior. No action required.

* **Insights digest emails arrive addressed to you, not to `noreply@`.** Manual sends of the [insights digest](/alerts/email) email now go out as one envelope per recipient, with each recipient on the `To:` line and the founders BCC list preserved. Previously every send landed in inboxes addressed to `noreply@updates.armature.tech`, which tripped Gmail's "to me only" filter, raised spam-classifier scores, and broke Reply. Each recipient now sees their own address on the message, Reply works, and deliverability improves. No action required.

* **Public benchmark site: leaderboard shows the snapshot's real generated time.** The "Generated" timestamp in the public observability [benchmark](/workflows/overview) leaderboard header now reflects the actual snapshot time of the data being displayed, instead of a stale hardcoded value. If no snapshot metadata is available, the header is hidden rather than showing a misleading time. No action required.

* **Public benchmark site: scatter chart axis tooltips now appear on hover.** Hovering the **PASS RATE (SUCCESS)** and **EFFICIENCY** axis labels (and their ⓘ icons) on the public observability benchmark scatter chart now reliably shows the full explanation, with the hover region covering the entire label area. Previously the native SVG tooltips were inconsistent and only triggered over the painted glyphs. No action required.

* **MCP Analytics views now load reliably.** Several **MCP Analytics** endpoints — the per-session detail drawer, the actors list, the **Settings** ingest-token panel, and the gateway-facing rules pull — were returning 500s on first load because they referenced the wrong MCP server columns. The queries now use the correct columns, so session detail, actor breakdowns, ingest-token provisioning, and rule sync all load on first request. No action required.

* **Public benchmark site: in-page navigation no longer 404s.** Clicking a vendor card, the search submit button, an autocomplete suggestion, or the leaderboard logo on the public observability benchmark now navigates to the right page. The soft-launch deploy removed the friendly `/benchmark` and `/benchmark/observability` rewrites but left several hardcoded references behind, so every in-page link bounced to a 404. All references now point at the live page paths. No action required.

* **Public leaderboard: vendor logos no longer render as black tiles.** Vendors on the public leaderboard whose logos come from the Simple Icons CDN — including **Vercel**, **Vercel CLI**, **Railway CLI**, and **Planetscale** — previously appeared as solid black squares because a monochrome black mark was painted directly over a dark brand-color tile. The logo now renders as a recolored mask: white on dark or mid brand colors and near-black on light ones, so the mark stays visible regardless of brand color. Vendors with uploaded full-color logos (for example, **Neon** and **Insforge**) are unchanged. No action required.

* **Public benchmark site: SigNoz and Dash0 logos render correctly.** Vendor chips for SigNoz and Dash0 on the public observability leaderboard previously showed blank colored squares because the upstream icon CDN doesn't host their SVGs. Both vendors now fall back to a clean letter mark (`S` and `d0`) that reads at every size. Datadog and Grafana continue to show their real logos. No action required.

* **OAuth extra headers now forwarded into dynamic client registration.** The **Extra headers** you set on an OAuth-authenticated remote [MCP server](/mcp-servers/connecting) are now also included in the RFC 7591 dynamic client registration request, not just the discovery probes. This unblocks multi-tenant MCP gateways whose `/register` endpoint sits on the MCP origin and needs a tenant identifier — Grafana Cloud's MCP, for example, now completes registration with `X-Grafana-URL` and opens the sign-in popup as expected. Headers are still only forwarded to same-origin requests, so tenant identifiers never leak to a cross-origin authorization server. No action required — reconnect any OAuth MCP server that previously failed at the registration step.

* **Grafana Cloud MCP now registers as a public client.** Dynamic client registration against Grafana Cloud's MCP (`mcp.grafana.com`) previously failed with `DCR endpoint returned 400` because Grafana's gateway only accepts public, PKCE-only clients and rejected Armature's default client-secret registration. Armature now registers as a public client for `grafana.com` automatically, so the [MCP server](/mcp-servers/connecting) connect flow completes and the sign-in popup opens. No action required — reconnect any Grafana Cloud OAuth target that previously failed at the registration step.

* **Grafana Cloud MCP OAuth popup now completes sign-in.** The authorize URL built for an OAuth [MCP server](/mcp-servers/connecting) is now assembled by merging OAuth params onto the provider's authorize endpoint instead of naively concatenating a `?`. This fixes targets whose discovered `authorization_endpoint` already carries its own query string — Grafana Cloud's MCP returns `…/authorize?grafana_url=<stack>` once you set `X-Grafana-URL` in **Extra headers** — which previously produced a URL with two `?` separators and bounced the popup to the stack home instead of back to Armature. No action required — reconnect any Grafana Cloud OAuth target that previously failed at this step.

* **No more duplicate Claude Code entries from MCP OAuth refresh races.** MCP clients that fire concurrent `/oauth/token` refreshes (notably Claude Code 2.1.149) sometimes saw an in-flight refresh fail with `invalid_grant`, treat the credential as revoked, and re-run the full OAuth + dynamic client registration flow — leaving a trail of duplicate "Claude Code" rows in **What's connected**. Armature now keeps the previous refresh token hash valid for a 60-second grace window, so an older refresh request still on the wire succeeds instead of forcing a re-registration. Dynamic client registration also now reuses an existing public client when an MCP client re-registers with the same `software_id`, `client_name`, and redirect URIs (per RFC 7591), instead of minting a fresh `client_id` every time. Existing duplicate rows are not collapsed retroactively — revoke unwanted ones from the settings page. No action required — see [Authenticating with the Armature MCP API](/mcp-api/authentication).\n- **No more spurious 401s during MCP OAuth token refreshes.** MCP clients that fire several `/oauth/token` refreshes in parallel (notably Claude Code, which can rotate dozens of times in under a minute) sometimes saw an in-flight request fail with `OAuth token is invalid or revoked` even though the access token's nominal lifetime had barely elapsed — the previous-hash slot had been overwritten by a subsequent refresh in the same burst. Armature now keeps the last 8 rotated-out access token hashes valid for a 60-second grace window each, so requests already on the wire with an older token continue to succeed even across a refresh storm. Genuinely expired or revoked tokens still fail. Revocation also honors any token in the grace window. No action required — see [Authenticating with the Armature MCP API](/mcp-api/authentication).

* **Consistent run duration across benchmark and leaderboard views.** The **Duration** column on a [benchmark batch](/workflows/runs) detail page and the `avg_duration_ms` value on the workflow leaderboard now match the tester-only timing already shown on each run's detail page. Previously these views reported the full handler lifetime (tester plus evaluation and setup), so the same run could read, for example, "10m" in the batch table but "3m" on its detail tile. Rankings on cleanly-passing leaderboard cells may shift slightly as a result. No action required.

* **Public observability benchmark pass rates realigned with historical reference.** The nightly aggregator that refreshes the public observability [benchmark](/workflows/overview) now pools runs across every completed batch per vendor × [workflow](/workflows/overview) cell instead of sampling only the latest batch, and excludes platform-error runs (tester crashes, evaluator failures, timeouts) from the denominator. After last week's efficiency and vendor-list changes, the first nightly refresh produced pass rates that diverged from the reference numbers users had been seeing — caused by single-batch sampling and platform errors inflating cell totals without contributing to the passed count. Spot-checked cells now match the reference values bit-for-bit. No action required.

* **Tool schemas with `${…}`-style text no longer trigger false `SECRET_NOT_FOUND` errors.** When connecting an [MCP server](/mcp-servers/connecting), Armature now preserves the discovered tool catalog as-is while still resolving secret placeholders in auth, URL, and header fields. Previously, vendor schemas whose descriptions or examples contained literal `${variable}` syntax — Grafana Cloud's ClickHouse tool, for example — were misread as unresolved Armature secrets and blocked the connection. No action required — retry any MCP server that previously failed with `SECRET_NOT_FOUND` while listing tools.

* **CLI deploy failures keep their final error in run traces.** Long `stderr` and `stdout` streams captured during CLI [workflow runs](/workflows/runs) — for example, a Vercel CLI deploy that ends with a build error — now preserve a redacted tail preview alongside the head, so the final diagnostic line remains visible in both the run evidence view and the customer-facing trace. Previously the tail was truncated away, hiding the actual failure reason. Bearer tokens and other secrets in the preserved tail are still redacted. No action required.

* **CLI tester runs no longer crash pre-flight on stray secret placeholders.** Claude Code and Codex CLI [workflow runs](/workflows/runs) against an [MCP server](/mcp-servers/connecting) now match the API tester behavior when a `${…}` placeholder in a server's config has no matching secret: the run continues and either succeeds (OAuth servers re-mint a fresh token) or surfaces a clean upstream `401` from the vendor, instead of dying in \~240 ms with a non-retryable `SECRET_NOT_FOUND` system error. This closes the harness-level inconsistency behind the Grafana × Claude Code incident on May 25 and prevents the same divergence from recurring elsewhere. No action required — re-run any CLI tester run that previously failed at this step.

* **Token usage now contributes to public benchmark efficiency scores for API testers.** Token usage from Claude and ChatGPT API tester runs is now persisted on each [workflow run](/workflows/runs), so the public observability [benchmark](/workflows/overview) efficiency score reflects real input and output token totals instead of treating every cell as identical on that axis. Previously the API runners captured vendor usage on the run trace but didn't carry it through to the per-run record the aggregator reads, which flattened the token component of the efficiency formula across all six tester models. Anthropic and OpenAI tester cells now differentiate on token efficiency after the next aggregator refresh; CLI-harness testers (Gemini, Claude Code, Codex, Cursor, openclaw, opencode) still surface usage only on the run trace for now. No action required.

* **Claude Code and Gemini CLI tester runs now report real token usage.** Token usage from the Claude Code and Gemini CLI testers is now persisted on each [workflow run](/workflows/runs), extending last update's API-tester fix to the two largest CLI harnesses. Previously these runs landed with an empty usage block, so the public observability [benchmark](/workflows/overview) efficiency score treated their token component as zero — making the tokens axis harness-dependent noise rather than a real comparison signal. Claude Code and Gemini cells now contribute real input and output token totals to the efficiency formula after the next aggregator refresh. The remaining CLI testers (Codex, Cursor, openclaw, opencode) still surface usage only on the run trace for now. No action required.

* **Session Analytics sessions list now shows real intent, outcome, and frustration.** Every row in the **Sessions** view of Session Analytics previously read `(no intent)`, rendered the frustration column as a thin vertical line, and marked sessions as `OK` even when they contained errors. The list now surfaces the raw session intent as a fallback when no classification has run yet, renders the frustration bar at full width, and falls back to the session's error count for the outcome when no classifier verdict is available. The intent column is no longer truncated at the grid level. No action required.

* **Session Analytics thinking trace header reflects the real session state.** The drill-down header on a **Thinking Trace** view now reports `FAILED` when a session contains any failed events or has a non-zero error count, instead of reading `OK` on sessions with hundreds of errors. The side rail also falls back to `last_event - started_at` when a session is still in flight or never recorded an explicit end, so **Duration** no longer reads `0 s` on long-running sessions. No action required.

* **Sidebar no longer flips to Session Analytics on run pages and unknown routes.** Visiting a [workflow run](/workflows/runs) or run history page (and other routes without a matching sidebar entry, such as settings sub-pages and auth callbacks) no longer switches the left sidebar into the **Session Analytics** app section with no row lit. Run pages now keep the **Testing & Benchmarks** app active with **Workflows** highlighted, and any unmatched route falls back to Testing & Benchmarks rather than the first app in the list. Single-app workspaces (without the `mcp_analytics` flag) are unaffected and still show no app header. No action required.

* **Session Analytics overview no longer mislabels first-period metrics as doubled.** KPI deltas on the Session Analytics **Overview** previously showed `↑ 100%` for any metric whose prior period was zero, implying the value had doubled when in fact it was the first time it had been recorded. First-period metrics now display `NEW` instead of a percentage delta. No action required.

* **Top topics and Failed searches panels populate before classification runs.** The **Top topics** and **Failed searches** panels on the Session Analytics overview previously stayed blank until the background classification jobs had completed. Both panels now fall back to raw intents and raw search-miss groupings — the same data that powers the **Topics** and **Searches** pages — and append `· provisional` to the panel title so the fallback state is legible. The full **Topics** and **Searches** pages also show a provisional banner when they're rendering raw groupings; expandable cluster detail stays hidden until real clusters have been promoted. No action required.

### New features

* **Agents can register HTTP MCP servers from chat.** A new `add_mcp_server` tool on the Armature [MCP API](/mcp-api/overview) lets an agent add a new `streamable_http` or `sse` MCP target to your workspace without opening the dashboard. The tool is intentionally scoped to credential-less public HTTP MCPs — stdio and hosted-npm transports, and any plaintext auth, are rejected. After the row is created, the response includes a dashboard URL where an admin can [finish auth setup](/mcp-servers/connecting) and then call `sync_mcp_server_capabilities` to pull in the tool catalog. The same plan-quota, audit, and post-commit checks as the dashboard flow apply. Requires the `editor` role or higher — see the [role table](/mcp-api/roles).

* **Agents can connect OAuth MCP servers from chat.** Two new tools on the Armature [MCP API](/mcp-api/overview) let an agent drive the OAuth connect flow for any MCP server in your workspace without anyone clicking through the dashboard. `authenticate_mcp(serverId)` returns an `authorize_url` to share with the user, plus a `flow_id` and the `redirect_uri` their browser will land on. After the user signs in, paste the callback URL back to the agent and `complete_authentication_mcp(serverId, callback_url)` finishes the token exchange and writes the new auth profile. Follow up with `sync_mcp_server_capabilities` to refresh the [tool catalog](/mcp-servers/coverage). Both tools require the `editor` role or higher — see the [role table](/mcp-api/roles) — and are especially useful when seeding a workspace with many OAuth MCPs at once.

### Updates

* **Run detail now leads with token usage.** The cost card on a [workflow run](/workflows/runs) detail page is now titled **Token usage** and leads with consumption, with **Est. API cost** shown as a smaller secondary metric. The per-bucket breakdown (input, output, cached input, cache writes, reasoning) is unchanged. This keeps the focus on what the run actually consumed, while pricing remains visible for reference. No action required.

* **Accurate cost and token totals for OpenClaw tester runs.** [Workflow runs](/workflows/runs) against OpenClaw targets now show **Est. API cost** values and full token usage instead of `pricing unavailable`. Pricing is seeded for the canonical OpenClaw models (Anthropic Sonnet 4.6, OpenAI GPT-5.5, Google Gemini 3 Flash) and the historical launch rows, and token aggregation now reads OpenClaw's native usage buckets (input, output, cache read, cache write). No action required — existing and new runs pick up the change automatically.

* **Remote MCP probes now retry across all healthy DNS addresses.** When a hosted [MCP server](/mcp-servers/connecting) resolves to multiple IPs and one of them is unresponsive, the connect-time probe now falls through to the next address instead of hanging on the dead one. A bounded per-address response-header timeout keeps a single bad route from consuming the whole probe. Servers like HeyReach (`mcp.heyreach.io`), whose DNS rotates between healthy and unhealthy addresses, now connect reliably on first try. No action required — re-connect any server that previously failed at this step.

* **Connect HubSpot, Pipedream, and Item MCP servers.** Three remote [MCP servers](/mcp-servers/connecting) that previously failed OAuth metadata validation now connect cleanly. Pipedream's MCP gateway (`mcp.pipedream.net`, used by Pipedrive and other Pipedream-hosted servers) and Item (`mcp.item.app`) are now recognized as trusted authorization issuers. HubSpot (`mcp.hubspot.com`) is now available as a curated OAuth provider — because HubSpot does not support dynamic client registration, an admin registers an MCP Auth App once in their HubSpot developer account and the credentials are reused org-wide. No action required for Pipedream or Item; for HubSpot, see [Connecting an MCP server](/mcp-servers/connecting).

## Week of June 29, 2026

This week was focused on behind-the-scenes infrastructure work. No user-facing changes shipped.

## Week of June 22, 2026

### Bug fixes

* **Accurate Claude token totals in benchmark tables.** Token usage for Claude tester runs shown in [benchmark batches](/workflows/runs) no longer double-counts per-turn rows. Aggregation now prefers the terminal `claude_result` for Claude harnesses, sums per-turn totals for Codex, and falls back to model responses otherwise. **Est. API cost** values for Claude runs in benchmark views now match the per-run breakdown. No action required.

## Week of June 15, 2026

### Updates

* **Benchmark batches now retry transient system failures.** When a run inside a [benchmark batch](/workflows/runs) fails because of a retryable platform or system error, the batch now automatically retries it and keeps the retry attempt in the same batch. Batch rollup counters recompute from the non-superseded finalized runs, so pass/fail totals reflect the final outcome rather than the transient failure. No action required.
* **More Claude Code native tools available against MCP targets.** Claude Code tester runs against [MCP servers](/mcp-servers/connecting) now keep access to the `Task` and `Agent` native tools. Only truly interactive, scheduled, and plan-mode native tools remain blocked, so [workflow runs](/workflows/runs) that benefit from sub-agent delegation can use it.
* **Scheduled tester runs preserve their schedule link.** The originating schedule ID is now preserved when a tester run is inserted and when it is claimed by a worker, so scheduled [workflow runs](/workflows/runs) stay correctly associated with their schedule end-to-end.

### Bug fixes

* **Gemini CLI client errors preserved in run context.** When the Gemini CLI tester reports a client-side error, the error details are now retained in the run's error context instead of being dropped. Failed [workflow runs](/workflows/runs) against Gemini CLI targets are now easier to diagnose. No action required.

## Week of June 8, 2026

### Bug fixes

* **Amplitude analytics now load on the Armature web app.** The content security policy now allows the Amplitude browser SDK script and its ingestion endpoints, so the same anonymized usage events captured by our other analytics tools — sign-up and sign-in, onboarding, plan selection and checkout, connecting an [MCP server](/mcp-servers/connecting), creating a tool monitor, and creating or triggering a [workflow run](/workflows/runs) — also reach Amplitude. No action required, and no change to what is captured — MCP server credentials, tool inputs, and run outputs are still never sent.
* **Mixpanel browser SDK initializes on the default instance.** The Mixpanel browser SDK queue now passes the library name the CDN loader expects, so the default analytics instance hands off cleanly from the stub to the real SDK and starts sending events after consent. Client-side events for the tracked flows — sign-up, onboarding, checkout, [MCP server](/mcp-servers/connecting) connections, tool monitors, and [workflow runs](/workflows/runs) — now report reliably on first visit. No action required.

## Week of June 1, 2026

### Updates

* **Faster dispatch for API-based tester runs.** API-based tester runs (Claude and ChatGPT) and CLI-based tester runs (Claude Code and Codex CLI) now use independent dispatch queues. A backlog on one no longer delays the other, so [workflow runs](/workflows/runs) stay snappy even when a single tester model is under heavy load. No action required — existing workflows pick up the change automatically.

### Bug fixes

* **No more spurious "shells out" findings on CLI servers.** Insight digests for CLI [MCP servers](/mcp-servers/connecting) no longer flag agents invoking the wrapped binary as a deviation. For CLI targets, reaching the binary through a shell is the canonical mode of use, not a problem. Genuine CLI UX issues — stderr noise, missing `--json` formatting, unexpected warnings — are still surfaced. No action required.
* **Product analytics now reach Amplitude in production.** The anonymized usage analytics announced in the [Week of May 25](#week-of-may-25-2026) update are now actually being delivered from the production Armature web app. A missing production configuration value and a content security policy that blocked outbound analytics requests have both been resolved. No action required, and no change to what is captured — MCP server credentials, tool inputs, and run outputs are still never sent.

## Week of May 25, 2026

### New features

* **Estimated API cost on every run.** Workflow [runs](/workflows/runs), benchmark results, and workflow analytics now show an **Est. API cost** value derived from provider token usage and current model API pricing. The run detail page includes a per-bucket breakdown (input, output, cached input, cache writes, and reasoning) so you can see exactly which token classes contributed to the estimate. Runs where pricing is not yet seeded show `pricing unavailable` instead of a number. Subscription pricing may differ from the API-list estimate.
* **Composer 2.5 available for Cursor tester runs.** The Cursor tester harness now offers `Composer 2.5` alongside `Composer 2` in the model picker, so organization admins can author [workflows](/workflows/authoring) that exercise MCP servers against the newer Cursor model. Existing workflows continue to run on `Composer 2` by default — switch the tester model on a workflow to opt in.
* **Redesigned subscription cancellation flow.** Cancelling from [Billing](/settings/billing) now opens a four-step modal: pick a reason, optionally pause your subscription for 30 or 60 days, optionally book a call with a founder, then confirm. Pausing preserves your workspace state and resumes automatically on the chosen date — no new payment method required. A confirmation email is sent with the pause or cancellation date and a one-click reactivate link.
* **Reactivate paused or cancelling subscriptions in one click.** If your subscription is paused or scheduled to cancel at period end, a banner now appears on [Billing](/settings/billing) with a **Reactivate** button that clears the schedule immediately. The same action is also available from the confirmation email.

### Updates

* **Product analytics in the Armature app.** The Armature web app now sends anonymized usage analytics for key flows — sign-up and sign-in, onboarding, plan selection and checkout, connecting an [MCP server](/mcp-servers/connecting), creating a tool monitor, and creating or triggering a [workflow run](/workflows/runs). This helps us see which parts of the product people use most so we can prioritize improvements. No MCP server credentials, tool inputs, or run outputs are captured.
* **Looser naming rules for additional environment variables.** The **Additional environment variables** editor on the connect form now accepts names with lowercase letters and hyphens (for example, `kraken-cli_apiKey`), in addition to the standard uppercase form. This unblocks CLIs that expect rc-style credential variables named after their package. The primary auth variable still requires the strict uppercase form, and reserved names (`AWS_*`, `PATH`, `HOME`, `ARMATURE_*`) remain blocked. See [Connecting an MCP server](/mcp-servers/connecting).

### Bug fixes

* **Skipping a plan no longer leaves onboarding stuck.** New workspace owners who choose **Skip for now** on the plan picker are now taken straight to the dashboard instead of landing on the role onboarding step, which doesn't apply without a paid plan. You can revisit plans any time from [Billing](/settings/billing).
* **Kraken CLI installs no longer fail at provisioning.** Connecting `kraken-cli` as a CLI [MCP server](/mcp-servers/connecting) previously failed during install because one of its dependencies is fetched from GitHub rather than the npm registry, which the provisioning sandbox blocked. The sandbox now permits the specific GitHub hosts required for that dependency, so `kraken-cli` provisions cleanly and discovery returns its full command catalog.
* **Accurate opencode token totals on multi-step runs.** Token usage reported for opencode tester runs now sums every step of a tool-use loop instead of reflecting only the final step. Multi-step [workflow runs](/workflows/runs) against opencode targets now show the full token cost of the run.

## Week of May 18, 2026

### Bug fixes

* **Remote MCP probes now connect to more servers on first try.** The live probe that runs when you [connect an MCP server](/mcp-servers/connecting) now falls back to older supported MCP protocol versions when a server rejects the latest one, and it sends a default User-Agent so servers behind CloudFront and similar CDNs no longer block the request. Connections that previously failed at the initial `initialize` step now succeed and return the full tool catalog. No action required — re-connect or re-probe any server that previously failed at this step.

### New features

* **Run a workflow from its analytics page.** The same **Run all** / **Run now** split control from the workflows list now appears on the workflow analytics hero, so you can trigger a manual [run](/workflows/runs) without navigating back to the list. Single-target workflows show **Run now**; multi-target workflows show **Run all (N)** with a caret menu to pick a subset and dispatch **Run selected (N)**. Toast, dedupe, and partial-failure behavior matches the workflows list, and the analytics view reloads after a successful dispatch so `last_run` reflects the new run. Paused workflows keep the existing "Resume the workflow before running it manually" tooltip.

### Bug fixes

* **Run control sized consistently on the workflow analytics page.** The **Run all** / **Run now** split control on the [workflow](/workflows/runs) analytics hero now matches the height, font size, and border weight of the adjacent **Run history** and **Edit** buttons. The caret stays flush against the primary button and hover and open states still visually group the pair. The denser styling on the workflows list page is unchanged.

### Updates

* **More complete catalogs for CLI MCP servers with nested subcommands.** CLI discovery now parses multi-line `Usage:` blocks, so deep subcommand hierarchies (for example, `cdp api <path> [fields...]`) contribute every documented usage shape — including rest positionals like `[fields...]` — to the [tool catalog](/mcp-servers/coverage). Re-sync your CLI server to pick up any tools that were previously missing.
* **CLI tools accept harness-injected arguments.** Generated input schemas for CLI tools no longer reject extra fields like `cwd` or `description` that some tester harnesses (such as Claude Code) attach to every call. Undeclared keys are stripped before the CLI is invoked, so [workflow runs](/workflows/runs) against CLI MCP servers no longer fail validation before the tool runs.

### Bug fixes

* **Product analytics events now reach Mixpanel reliably.** Server-side analytics events from the Armature app are now delivered in the format Mixpanel's ingestion endpoint expects, and unexpected responses are no longer silently treated as successes. Usage data for the tracked flows — sign-up, onboarding, checkout, [MCP server](/mcp-servers/connecting) connections, tool monitors, and [workflow runs](/workflows/runs) — now shows up correctly in our dashboards. No action required.
* **Browser-side Mixpanel events no longer blocked by CSP.** The Armature web app's content security policy now allows the Mixpanel browser SDK (`cdn.mxpnl.com`) and its ingestion endpoints (`api-js.mixpanel.com`, `api.mixpanel.com`). Client-side analytics for the same tracked flows now load and report alongside the server-side events. No action required, and no change to what is captured — MCP server credentials, tool inputs, and run outputs are still never sent.
* **Browser analytics initialize cleanly on the default instance.** The Mixpanel browser SDK queue no longer receives a trailing undefined instance name when the Armature web app boots, so the default analytics instance initializes in the shape the SDK expects. Client-side events for the tracked flows — sign-up, onboarding, checkout, [MCP server](/mcp-servers/connecting) connections, tool monitors, and [workflow runs](/workflows/runs) — load reliably on first visit. No action required.
* **Failed CLI tool calls are now reported as failed.** When a wrapped CLI exits non-zero or the MCP response is marked as an error, the tool call status now reflects the failure instead of staying `completed`. Pass/fail rates and tool-call traces in [workflow runs](/workflows/runs) now match what actually happened on the CLI.
* **CLI discovery no longer turns request-body fields into fake tools.** For CLIs whose leaf commands document their request body under sections like `Fields:`, `Body:`, `Returns:`, `Response:`, or `Headers:`, discovery previously mistook each field row for a subcommand and filled the [tool catalog](/mcp-servers/coverage) with bogus `<parent>.<fieldName>` entries — sometimes pushing real subcommands out via the per-server tool cap. These sections are now recognized as data-shape help and their field names feed each command's input schema instead. Re-sync affected CLI servers to restore any subcommands that were previously crowded out.

### New features

* **Composer 2.5 available in the Cursor harness.** The Cursor tester model picker now lists **Composer 2.5** alongside Composer 2 and the other Cursor-backed models. Select it when authoring or editing a workflow to run tester turns on Cursor's latest native Composer model. Capabilities mirror Composer 2 (Cursor-native, no extended thinking surface). Composer 2 remains the default for new Cursor workflows; choose Composer 2.5 explicitly per workflow until the default is bumped. See [Authoring effective workflows](/workflows/authoring).
* **Multiple environment variables per MCP auth profile.** Auth profiles for CLI and hosted-stdio MCP servers can now carry more than one credential. Add additional `NAME` / value pairs in the **Additional environment variables** editor on the connect form, and each one is securely stored and injected into the server process at startup. This unblocks tools that require several keys to run. Up to 16 extras per profile. To change an existing extra, delete and recreate the auth profile. See [Connecting an MCP server](/mcp-servers/connecting).

### Updates

* **Cleaner connect form helper text.** The **Additional environment variables** editor on the connect form now shows a generic one-line hint instead of a vendor-specific example, making the guidance easier to scan when setting up a new [MCP server connection](/mcp-servers/connecting).

### Bug fixes

* **Codex API fallback authentication.** When a Codex subscription hits a plan limit, Armature now falls back to API key authentication correctly, so workflow runs continue without manual intervention.
* **Claude long-context plan-limit handling.** Claude failures caused by "usage credits required for long context requests" are now classified as plan-limit conditions and trigger the same automatic fallback as other limit errors, keeping [workflow runs](/workflows/runs) on track.
* **Cleanup stays on the run's tester model.** Cleanup for a workflow run now executes on the same tester model the run actually used, instead of falling back to the workflow's default. This prevents cross-model interference when a single workflow has run against multiple tester targets and keeps [run](/workflows/runs) metadata consistent.