Follow along with the latest changes to Armature. For documentation, see the Introduction.

Week of June 5, 2026

New features

  • Surface and restore archived workflows, benchmarks, and runs. Archived items are no longer hidden out of reach — every archive surface now has a way to see what’s in it and bring something back. Runs get a Show archived toggle that reveals archived rows with an Archived badge and an Unarchive action. Benchmark batches pick up the same Show archived toggle on Past batches, with an Archived pill and a Restore action wired to the existing un-archive endpoint. The Workflows page gets a new Archived tab listing every workflow set inactive (whether paused or deleted), each with a Restore action — this replaces the previously dead Paused tab, which never showed anything because pausing and deleting both flagged a workflow inactive and the active-only list hid it. Restoring an item now also clears it from the optimistic-hide set, so a just-restored row stays visible when you toggle Show archived back off. No action required.

Bug fixes

  • More accurate workflow pass rates. Queued and in-flight workflow runs are no longer counted against a workflow’s pass rate. Previously, an active run dragged the rate down for its entire duration (for example, 2 passed + 1 running showed 67% instead of 100%). Pass rates now only reflect finalized runs, while in-flight work continues to surface separately on the workflow detail page.
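
A minimal sketch of the pass-rate change described above, assuming hypothetical status names (`passed`, `failed`, `running`, `queued`); the changelog only distinguishes finalized runs from queued and in-flight ones.

```python
from typing import Iterable, Optional

# Hypothetical status names, used only for illustration.
FINALIZED = {"passed", "failed"}


def pass_rate(statuses: Iterable[str]) -> Optional[float]:
    """Pass rate over finalized runs only; None when nothing has finalized."""
    finalized = [s for s in statuses if s in FINALIZED]
    if not finalized:
        return None
    return sum(s == "passed" for s in finalized) / len(finalized)


def legacy_pass_rate(statuses: Iterable[str]) -> Optional[float]:
    """Previous behavior: active runs also count in the denominator."""
    runs = list(statuses)
    if not runs:
        return None
    return sum(s == "passed" for s in runs) / len(runs)


runs = ["passed", "passed", "running"]
print(round(legacy_pass_rate(runs) * 100))  # 67
print(round(pass_rate(runs) * 100))  # 100
```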

Updates

  • Product MCP setup forces an explicit API docs path. Step 2 of the Product MCP create flow now requires you to pick Upload OpenAPI now or Connect CI rather than leaving the spec blank. Upload accepts pasted content or a .json, .yaml, or .yml file from the file picker; CI mode surfaces the upload endpoint, source slug, environment, key purpose, and last successful or failed sync, with raw implementation details tucked behind a drawer. The setup recap reads either OpenAPI uploaded or CI sync connected/waiting, so there’s no ambiguous “leave blank” state. No action required.
  • Product MCP API connection step now reads as security-critical. Step 3 of the create flow has been rewritten to make explicit that the validation mode you pick (validate before tools or presence only) governs whether fake or invalid caller credentials can see the tool catalog or run code — the credential is then forwarded upstream as configured. The heading, label, and supporting copy now lead with that contract instead of reading like generic configuration. No action required.
  • Dedicated upload-key flow for Product MCP CI sync. The managed_mcp_upload API key needed for the CI sync path is now mintable directly from the Product MCP setup, instead of being buried in generic Settings → API keys. The UI clearly separates upload keys from the caller/product API key used at runtime, so it’s obvious which secret goes where. No action required.
  • Explicit auth validation states on Product MCPs. The detail view now distinguishes no auth validation configured, validation failing, and validation passing as separate states, and tool exposure depends on the validation actually passing — not just the presence of a config. A Product MCP can no longer present as ready while its auth path is unverified or broken. No action required.
  • Activation guard on Product MCPs. The Activate action is now disabled in the UI and rejected by the API unless the OpenAPI docs, generated tool catalog, auth configuration, and runtime/domain readiness are all complete. A partially set up Product MCP can no longer be flipped live by accident. No action required.
  • Automatic domain provisioning for Armature-hosted Product MCPs. Creating or activating a Product MCP now provisions the mcp.<org-slug>.armature.tech/<source-slug> hostname end-to-end — DNS and the hosting project are configured automatically, and the detail view reports the domain as provisioning, ready, or failed so you can tell at a glance whether the public URL actually resolves. No more manual DNS work to bring a new Product MCP online. No action required.
  • Denser Product MCP detail page. The top of the Product MCP detail page now shows only what you need to triage the endpoint: name, status, the public MCP URL with a one-tap copy, setup readiness, and the primary actions. Metrics, diagnostics, the parsed API docs, and connection internals move into collapsed sections lower on the page. The page now immediately answers “What is the URL, is it live, what is missing?” The redundant ENDPOINTS eyebrow on the index page has also been removed, and the mcp.<org-slug>.armature.tech/<source-slug> URL now appears prominently on every card with the same one-tap copy. No action required.
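
The activation guard above can be pictured as an all-or-nothing readiness check. The sketch below is illustrative only: the `Readiness` class and its field names are hypothetical and simply mirror the four requirements the entry lists.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Readiness:
    # Hypothetical field names mirroring the four checks named in the entry.
    openapi_docs: bool
    tool_catalog: bool
    auth_config: bool
    runtime_domain: bool


def missing_requirements(r: Readiness) -> List[str]:
    """List every requirement that still blocks activation."""
    checks = {
        "OpenAPI docs": r.openapi_docs,
        "generated tool catalog": r.tool_catalog,
        "auth configuration": r.auth_config,
        "runtime/domain readiness": r.runtime_domain,
    }
    return [name for name, ok in checks.items() if not ok]


def can_activate(r: Readiness) -> bool:
    """Activation is allowed only when nothing is missing."""
    return not missing_requirements(r)


partial = Readiness(openapi_docs=True, tool_catalog=True,
                    auth_config=False, runtime_domain=False)
print(can_activate(partial))  # False
print(missing_requirements(partial))
```

Returning the list of missing items, rather than a bare boolean, is one way a UI could explain why the Activate action is disabled.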

New features

  • Managed Code Mode MCP — host a Code Mode MCP from your OpenAPI spec. Point Armature at your product’s OpenAPI spec and we host a Code Mode MCP server for it at mcp.armature.tech/mcp/product-mcp/<source>, with the same two-tool shape (search_sdk and execute_script) the bare Armature MCP already exposes — so coding agents call your API as a typed SDK in a sandbox instead of juggling a flat tool catalog. Uploads go through a signed POST /api/openapi-artifacts/upload from CI (editor role or higher, with X-Armature-Timestamp and X-Armature-Signature headers), and Armature compiles the spec into a generated client, SDK docs, and a search index per environment so search hits stay fast and scripts run against a stable shape. Sources are scoped per organization with a unique clientName, and only the source’s owning org can reach its MCP — so every customer’s managed MCP stays isolated. Per-call telemetry (request ID, tool name, status, duration, JSDoc @intent) flows into the same Session Analytics surface as bare Armature MCP, so you can see how agents actually use your API. Contact us to enable Managed Code Mode for your organization.

Updates

  • Armature MCP API tools now accept an optional telemetry.intent argument. Every tool exposed by the Armature MCP endpoint (/api/mcp, including the public host at mcp.armature.tech/mcp) now advertises an optional telemetry object with a single string field, intent — a one-sentence description of what the user is trying to accomplish with the call. Armature uses the intent to attribute tool calls to user goals on the Session Analytics dashboard; the field is stripped before the tool handler runs and is never forwarded to your workflow logic. Existing clients keep working unchanged — calls without telemetry.intent succeed and still record a tool-call event, just with the intent recorded as null. The instrumented endpoint at mcp-instrumented.armature.tech and its report_intent / report_blocker / report_frustration tools are unaffected. See the overview for an example. No action required.
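
As a rough illustration of the behavior described above (the intent is recorded, then stripped before the handler runs), here is a sketch of how a server could separate the optional `telemetry` object from the handler's arguments. The function name and the `workflow_id` argument are hypothetical, not part of Armature's API.

```python
from typing import Any, Dict, Optional, Tuple


def split_telemetry(arguments: Dict[str, Any]) -> Tuple[Dict[str, Any], Optional[str]]:
    """Return (arguments without telemetry, intent or None)."""
    handler_args = {k: v for k, v in arguments.items() if k != "telemetry"}
    telemetry = arguments.get("telemetry")
    intent = telemetry.get("intent") if isinstance(telemetry, dict) else None
    if not isinstance(intent, str):
        intent = None  # calls without an intent still succeed
    return handler_args, intent


# With an intent: the handler never sees the telemetry object.
args, intent = split_telemetry({
    "workflow_id": "wf_123",  # hypothetical tool argument
    "telemetry": {"intent": "Find out why the nightly run failed"},
})
print(args)    # {'workflow_id': 'wf_123'}
print(intent)  # Find out why the nightly run failed

# Without an intent: same arguments, intent recorded as None.
print(split_telemetry({"workflow_id": "wf_123"}))
```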

Bug fixes

  • Clearer recovery from partial Product MCP setup failures. When the docs upload step fails after the Product MCP source has already been created, the flow now lands on the new source’s detail page with the real upload error shown inline under API docs, a Retry upload action next to it, and status held at Needs setup — instead of stranding you with a vague “Created, but docs upload failed” toast. You can see exactly what failed and fix it in place. No action required.
  • OpenAPI uploads no longer fail with “Bucket not found”. The private bundle storage that backs Product MCP OpenAPI uploads is now provisioned in every environment before the feature is enabled, so the upload step in the create flow can no longer fail with a missing-bucket error on a fresh workspace. No action required.

Week of June 4, 2026

New features

  • Self-serve Product MCPs. You can now turn your product API into an Armature-hosted MCP server from the dashboard — without standing up your own gateway. A new Product MCPs sidebar entry lists your endpoints as cards and steps you through a four-part create flow: name and base URL, upload an OpenAPI spec to derive the tool catalog, configure how the caller’s product API credential is presented (bearer header, custom header, or query param), and publish. Once activated, the MCP is reachable at https://mcp.<org-slug>.armature.tech/<source-slug> and a compliant MCP client signs in with the caller’s own product API credential — there is no second end-user credential and Armature never sees the raw value. The detail view shows setup readiness, runtime health, the parsed API docs, and the API connection summary, and individual modals let you update settings, rotate the connection, or re-upload the spec. Two validation modes are supported: validate before tools (the default, which checks the credential against your upstream before each session) and presence only (faster, but requires explicitly confirming the security tradeoff). Rolling out behind a feature flag — the Product MCPs sidebar entry, the dashboard APIs, the public mcp.<org-slug>.armature.tech/<source-slug> endpoint, and the OpenAPI upload path stay hidden until your workspace is opted in. Contact us if you’d like your workspace turned on — see Connecting an MCP server and the MCP API overview.
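
The three credential presentation options named above (bearer header, custom header, or query param) could be modeled roughly as follows. This is a sketch under assumptions: the mode strings and the function are hypothetical and do not reflect Armature's actual configuration schema.

```python
from typing import Dict, Tuple


def present_credential(mode: str, credential: str,
                       name: str = "") -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (headers, query_params) to attach to the upstream request."""
    if mode == "bearer":
        return {"Authorization": f"Bearer {credential}"}, {}
    if mode in ("custom_header", "query_param"):
        if not name:
            raise ValueError(f"{mode} needs a header or parameter name")
        if mode == "custom_header":
            return {name: credential}, {}
        return {}, {name: credential}
    raise ValueError(f"unknown mode: {mode}")


print(present_credential("bearer", "secret"))
print(present_credential("custom_header", "secret", "X-Api-Key"))
print(present_credential("query_param", "secret", "api_key"))
```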

Updates

  • Product MCP setup tightened across API docs, auth, activation, and the public hostname. Follow-up polish on the Self-serve Product MCPs feature above, all behind the same product_mcp feature flag. The API docs step now forces an explicit choice between Upload OpenAPI now (paste or pick a .json, .yaml, or .yml file) and Connect CI (which shows the upload endpoint, source slug, environment, key purpose, and waiting-vs-synced status), instead of treating it as a vague optional field. The CI path has its own guided action for creating a dedicated managed_mcp_upload API key from inside the Product MCP setup flow, so you no longer have to hunt for it in generic Settings → API keys — upload keys are visually distinguished from caller and product API keys. The API Connection step is now framed as a security gate: the heading makes clear it validates caller credentials before tools are exposed and forwards the same credential upstream, and auth readiness is shown as one of three explicit states — no validation configured, validation failing, or validation passing — with tool exposure tied to the passing state rather than just the presence of config. Activate is now disabled (and rejected by the API) until docs, generated capabilities, auth config, and runtime/domain readiness are all complete, so a partially set-up Product MCP can no longer go live. The Armature-owned hostname https://mcp.<org-slug>.armature.tech/<source-slug> is now auto-provisioned end-to-end on create and activate — Armature ensures both the Cloudflare DNS CNAME to cname.vercel-dns.com and the Vercel project domain exist, with status tracked as provisioning, ready, or failed — so the public URL resolves and serves the MCP without manual DNS work. 
    The detail page has also been thinned: the top section shows only name, status, public MCP URL, setup readiness, and primary actions, with metrics and diagnostics moved into collapsed sections lower on the page; the public URL itself is now prominent on cards and detail pages with a one-click copy button. If something fails partway through create (for example, an OpenAPI upload error), you now land on the created Product MCP’s detail page with the real error inline in API docs, a Retry upload action, and status held at Needs setup, instead of being stranded by a vague “Created, but docs upload failed” toast. No action required: existing opted-in Product MCPs pick up the new states, guardrails, and provisioning on the next save.
  • armature.tech testing tile rewritten for the analytics-first framing. Following last week’s landing-page rebuild around Agent Experience, the 03 · Testing feature tile on armature.tech now reads “Tests and benchmarks on every harness.” with a one-line body about running complex workflows across every model and harness and benchmarking your agent experience against competitors. The supporting harness × workflow pass/fail matrix visualization is unchanged. No action required.

Week of June 3, 2026

New features

  • Analytics ingest API keys for self-hosted MCPs. Self-hosted customer MCPs using @armature-tech/mcp-analytics ^0.3.0 can now authenticate ingest pushes with a bearer API key minted from Settings → API keys → Analytics ingestion keys, instead of wiring up the previous HMAC path. Naming the key registers a matching MCP source under the same name in one step; the token is shown once with a copy button and a “won’t be shown again” warning, then Rotate and Revoke are available from the list. The verifier resolves both the workspace and the MCP source from the key itself, so the SDK no longer needs an X-Armature-MCP-Server-Id header or mcp_server_id event field. Available to owners and admins on workspaces with Session Analytics enabled. Existing HMAC-authenticated gateways keep working unchanged — see Authenticating with the Armature MCP API. No action required.
  • Public REST API for sources, workflows, runs, and insights. Armature now exposes an API-key authenticated REST surface at /api/armature/v1/*, so agents and back-office tools can read and drive your workspace without going through the dashboard or the MCP transport. Endpoints cover organization context (/org), MCP sources (/mcp-servers, including create), workflows and manual dispatch (/workflows, /workflows/{id}/runs), run evidence, traces, evaluations, and tool calls (/runs, /runs/{id}, /runs/{id}/trace, /runs/{id}/evaluation, /runs/{id}/tool-calls), and the Session Analytics rollups (/insights/overview, /insights/topics, /insights/searches, /insights/sessions). The full machine-readable spec is served at /api/armature/v1/openapi. Authentication and role scoping reuse the same Armature API keys as the MCP API — see Authenticating with the Armature MCP API and the role table. Insights endpoints require Session Analytics to be enabled for the workspace. No action required.
  • Managed Code Mode execute_script now runs in an isolated per-call sandbox. Every execute_script turn on an Armature-managed Code Mode MCP — the bundled SDK surface paired with search_sdk and execute_script — now runs in a fresh, non-persistent Vercel Sandbox with a deny-all network policy, signed request authentication, and a SIGKILL-enforced per-turn timeout. Only the user’s script, the public SDK bundle, and the replay history for the current turn enter the sandbox — no backend secrets, no other tenants’ state, and no carry-over between calls. The sandbox shape and limits (runtime, vCPU, memory, log cap) are uniform across customers, so a misbehaving script can’t starve a neighboring workspace. No action required — managed Code Mode endpoints pick up the isolated runtime automatically.
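
For the public REST API entry above, the sketch below builds (but does not send) a request for one of the listed paths. Two things are assumptions, not facts from this changelog: the host `app.armature.tech` as the API base, and sending the API key as a bearer token. Check the authentication docs before relying on either. The key and run ID are placeholders.

```python
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "https://app.armature.tech"  # assumption: API host is not stated here
API_PREFIX = "/api/armature/v1"         # from the changelog entry


def api_url(*segments: str) -> str:
    """Join path segments under the v1 prefix, percent-encoding each one."""
    path = "/".join(quote(s, safe="") for s in segments)
    return f"{BASE_URL}{API_PREFIX}/{path}"


def build_request(api_key: str, *segments: str) -> Request:
    # Assumption: the Armature API key travels as a bearer token.
    return Request(api_url(*segments),
                   headers={"Authorization": f"Bearer {api_key}"})


req = build_request("ak_example", "runs", "run_123", "trace")
print(req.full_url)  # https://app.armature.tech/api/armature/v1/runs/run_123/trace
```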

Updates

  • armature.tech landing page rebuilt around Agent Experience. The marketing site has been reframed from synthetic MCP and CLI testing to the analytics product — “PostHog for agent sessions” — keeping the existing brutalist styling and section structure. The hero rolls between PostHog, Amplitude, and Mixpanel as the analogy and pairs with a new UI-era → agent-era illustration; the Problem section now resolves to capturing the real agent session rather than running synthetic tests; the Features strip leads with Session Analytics, the auto-generated hosted Code Mode MCP, and an expanded “and more” grid, with Testing & Benchmarks demoted to a secondary tile. Get started now presents two setup paths — the recommended hosted Code Mode MCP, or wrapping your existing MCP with the Armature SDK. A new FAQ (with matching JSON-LD for search) contrasts Armature against PostHog/Amplitude and against LLM observability, and the final CTA and footer tagline both lead with the analytics framing. The Armature app at app.armature.tech is unaffected. No action required.
  • Search and pagination on the public reviews directory. The public reviews directory at /reviews now paginates 12 targets per page with Previous / Next controls and a Showing X–Y of N counter, replacing the previous “View all” expand affordance. The search box now queries the server, so a search like “github” surfaces every matching target across all pages (not just the ones currently rendered) and resets to page 1 on each keystroke. Searching shows an N targets matching "…" summary, and an unrecognized query renders an empty state instead of an empty grid. Cards still rank by overall score. No action required.
  • MCP OAuth consent screen no longer endorses the client’s self-asserted name. Because the Armature MCP API supports open dynamic client registration, any MCP client can register under a familiar name (for example, Claude or Cursor) with its own redirect URI. The consent screen previously led with that self-asserted name as the headline (“Authorize Cursor”), which made an attacker-controlled app look legitimate enough to phish a workspace-scoped consent. The consent screen now leads with a neutral headline — “Authorize access to your workspace” — and surfaces the redirect destination host (“Sends approval to …”) as the primary trust signal, the one field an attacker cannot forge. The claimed client name still appears as a quoted, self-reported label so legitimate connections remain recognizable. Review the destination host before approving — see Authenticating with the Armature MCP API. Existing grants and previously approved clients are unaffected. No action required.
  • Public Leaderboard renamed to Benchmarks at /benchmarks. The public benchmark surface is now called Benchmarks (plural) and lives at armature.tech/benchmarks, reflecting that the catalog spans multiple categories — observability, CRM, database, cloud deploy, and more — rather than a single board. The footer, sitemap, marketing-site nav, and announcement bar all point at the new path. The previous /leaderboard URL still resolves so the app’s own deep links and existing per-category paths (for example, /leaderboard/observability/datadog) keep working. No action required.
  • Permanent announcement bar on armature.tech links to Benchmarks and Reviews. The marketing homepage and About us page now carry a slim, permanent bar above the topbar with direct links to Benchmarks and Reviews, so visitors who land on the marketing site can reach the two public surfaces without scrolling or hunting through the nav. The existing topbar links and CTAs are unchanged. No action required.
  • Public reviews directory is paginated and curated. The public reviews directory at /reviews now loads in pages instead of streaming the entire catalog on first paint, so the grid stays responsive as more reviewed targets land. The directory default is 12 cards per page (up to 48), with the score-ranked ordering and the name/vendor search box unchanged. The search query, page, and page size travel as ?q=, ?page=, and ?limit= so visitors can deep-link to a specific result page. The response now carries a meta block alongside targets with count, page, pageSize, totalPages, hasPreviousPage, and hasNextPage for the surrounding UI. Targets that haven’t been cleared for the public directory (for example, internal-only subjects under audit) are also filtered out of both the directory and per-target detail pages, so /reviews only lists subjects that pass the public-review allow rules. No action required.
  • Unified chrome across Benchmarks, Reviews, and the marketing site. Every page in the public Benchmarks and Reviews sections (landing, per-category benchmark pages, vendor detail pages, the methodology page, the reviews directory, and per-target review pages) now uses the same header, footer, width (1200/1144), and 15px body type as armature.tech, so moving between the marketing site, Benchmarks, and Reviews reads as one continuous surface instead of three differently-styled apps. Every benchmark and reviews page also picks up a consistent clickable breadcrumb bar with a one-click cross-section toggle between Benchmarks and Reviews, so jumping from a benchmark category to the reviews directory (or back) no longer requires going through the topbar. The methodology page hero is now on the light background to match the rest of the site, the Benchmarks landing leads with the agent marquee above a refreshed How it works band with a Read the full methodology link, and the OpenCode logo has been retouched to match the home page. No action required.
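
To make the paginated reviews response described above concrete, here is a sketch that derives the `meta` block from a list of targets. The field names and the 12/48 limits come from the entry; the clamping of out-of-range `page` and `limit` values, and reading `count` as the total number of matching targets, are assumptions.

```python
import math
from typing import Any, Dict, List

DEFAULT_LIMIT = 12  # default cards per page
MAX_LIMIT = 48      # upper bound on ?limit=


def paginate(targets: List[Any], page: int = 1,
             limit: int = DEFAULT_LIMIT) -> Dict[str, Any]:
    """Slice one page of targets and compute the accompanying meta block."""
    page_size = max(1, min(limit, MAX_LIMIT))  # assumption: clamp, don't reject
    count = len(targets)                       # assumption: total matches
    total_pages = max(1, math.ceil(count / page_size))
    page = max(1, min(page, total_pages))      # assumption: clamp page too
    start = (page - 1) * page_size
    return {
        "targets": targets[start:start + page_size],
        "meta": {
            "count": count,
            "page": page,
            "pageSize": page_size,
            "totalPages": total_pages,
            "hasPreviousPage": page > 1,
            "hasNextPage": page < total_pages,
        },
    }


result = paginate(list(range(30)), page=2)
print(result["meta"])
```

With 30 targets and the default page size, page 2 holds items 12 through 23 and reports three total pages with both neighbors available.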

Bug fixes

  • Public reviews directory search now treats %, _, and ! as literal characters. The search box on the public reviews directory at /reviews previously interpreted % and _ as SQL wildcards, so a query like 100% matched every target containing 100 instead of only ones whose name or vendor actually contained 100%, and _cli matched any single character followed by cli instead of the literal underscore-cli. Searches containing those characters (and the ! escape character) are now matched literally, so the N targets matching "…" count reflects what you typed. Cards still rank by overall score and pagination behavior is unchanged. No action required.
  • XSS-shaped Agent Review submissions are quarantined before they reach the public directory. The moderation gate on the public Agent Review intake — the /mcp/agent-review MCP endpoint and matching HTTP API — now flags submissions that carry XSS-shaped payloads in any free-text field and routes them into the same quarantine path as secret-pattern, raw-log, and stack-trace hits, so the offending text never surfaces on the public reviews directory at /reviews or a per-target detail page. The detector covers HTML tags (open, close, and self-closing), javascript: URLs, event-handler attributes (onclick=…, onerror=…, onmouseover=…, and the long tail), alert(, and normalized document.domain probes — including spaced, hyphenated, and underscored variants — so common evasions still match. Submissions that name an XSS pattern when describing a real product issue (“the API returns <script> unescaped”) still go through; only payloads that look like a script injection get quarantined. Aggregate scores and counts on remaining reviews are unchanged. No action required.
  • Internal and test subjects no longer appear on the public reviews directory. The public reviews directory at /reviews and the per-target detail pages now hide subjects that aren’t meant for public display — Armature’s own internal tooling, test fixtures, and locally-installed agent shells that occasionally land in the intake. Only review subjects whose kind is one of the public categories (api, cli, mcp, sdk, desktop-app, hosted-service, web-app) and that don’t match an internal-name pattern surface in the directory; everything else 404s on direct visit and is filtered out of listings. Aggregate scores, counts, and ranking on remaining targets are unchanged. No action required.
  • Renaming a public benchmark vendor no longer drops it from the leaderboard. Vendors on the public benchmark — for example, /leaderboard/crm and /leaderboard/observability — are now matched to their underlying MCP server by a stable internal pin instead of by exact display-name equality. Previously, editing a vendor’s public label from the staff Benchmark Admin surface — for example, renaming CRM Folk to Folk [by Armature], or Tsuga Code Mode to Tsuga [by Armature] — silently disconnected the vendor from its MCP server, wrote zero leaderboard cells for it, and dropped it from the category entirely on the next aggregator pass. All currently enabled vendors have been pinned, so display names are now freely editable without affecting leaderboard data. Unpinned rows fall back to the previous exact-name match, so nothing else changes. No action required.
  • Session Analytics now records the full client harness on bare Armature MCP traffic. Sessions produced against the bare Armature MCP API at mcp.armature.tech/mcp — used by Claude Code, Cursor, Codex, and other coding agents — previously landed in Session Analytics with the client harness columns blank, because the bare dispatch wasn’t forwarding the initialize handshake into the analytics pipeline. The handshake’s full identity — client name, version, negotiated MCP protocol version, advertised client capabilities, and the request User-Agent — is now captured and stamped onto every subsequent tool call in the session, so the Client chip in the Sessions list and Thinking Trace view, and the Clients distribution chart on the Session Analytics Overview, correctly attribute bare-MCP sessions to the harness that produced them (for example, Claude Code 2.1.149 · MCP 2025-06-18) instead of folding them into Unknown. Sessions captured before this update remain attributed as Unknown — backfill of historical rows isn’t possible. The instrumented dispatch at mcp-instrumented.armature.tech was already capturing client identity correctly and is unaffected. No action required.
  • Session Analytics now records the full handshake on bare Armature MCP traffic. Following the client-harness fix above, bare-MCP sessions at mcp.armature.tech/mcp were still landing in Session Analytics with the negotiated MCP protocol version, the client’s advertised capabilities, and the user agent blank — the bare dispatch was only forwarding the narrow name / version pair from initialize. The bare wrapper now also captures the protocol version and capabilities from the handshake and forwards the request’s user-agent header, so every session row carries all five identity fields. The richer Client chip on the Sessions list and Thinking Trace view now reads end-to-end (for example, Claude Code 2.1.149 · MCP 2025-06-18) on bare-MCP sessions, matching what the instrumented dispatch at mcp-instrumented.armature.tech was already capturing. Sessions captured before this update keep their existing partial attribution — backfill of historical rows isn’t possible. No action required.
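
The literal-match fix for %, _, and ! in the reviews search entry above follows a standard SQL pattern: escape the LIKE metacharacters and declare the escape character in the query. The sketch below demonstrates it with Python's built-in SQLite as a stand-in; the changelog does not say which database or query builder Armature uses, and the table and rows are made up.

```python
import sqlite3


def escape_like(term: str, escape: str = "!") -> str:
    """Escape LIKE metacharacters so the term is matched literally."""
    # Escape the escape character first so later replacements are not doubled.
    out = term.replace(escape, escape + escape)
    return out.replace("%", escape + "%").replace("_", escape + "_")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE targets (name TEXT)")
conn.executemany(
    "INSERT INTO targets VALUES (?)",
    [("100% uptime",), ("100 days",), ("acme_cli",), ("xcli tool",)],
)


def search(term: str) -> list:
    pattern = f"%{escape_like(term)}%"
    rows = conn.execute(
        "SELECT name FROM targets WHERE name LIKE ? ESCAPE '!' ORDER BY name",
        (pattern,),
    )
    return [row[0] for row in rows]


print(search("100%"))  # ['100% uptime']
print(search("_cli"))  # ['acme_cli']
```

Without the escaping step, the same patterns would also match "100 days" and "xcli tool", because % and _ would act as wildcards.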

Week of June 1, 2026

Updates

  • Bare Armature MCP now requires a telemetry.intent on every tool call. Every tool exposed at mcp.armature.tech/mcp — the bare Armature MCP API used by Claude Code, Cursor, Codex, and other coding agents — now advertises a required telemetry.intent argument on its input schema and emits a per-call telemetry event (request ID, tool name, status, duration, intent) for product analytics. Calls that omit telemetry.intent are rejected with 400 missing_telemetry_intent — Tool call is missing required telemetry.intent — pass a one-sentence description of what the user is trying to accomplish. Tool inputs and outputs are never logged — only the intent string and timing. The instrumented dispatch at mcp-instrumented.armature.tech and its report_intent / report_blocker / report_frustration tools are unchanged. Action required: if you call mcp.armature.tech/mcp directly from a custom client, update each call site to pass telemetry: { intent: "<one-sentence description>" }. Coding agents that already pull the tool schema from tools/list pick up the new required field automatically on the next refresh.
  • Leaderboard nav link on armature.tech. Every page on the marketing site — the landing, About, Careers, Support, Privacy, and Terms — now carries a top-level Leaderboard link in the topbar, alongside the existing nav entries. Visitors who land on a marketing page can reach the public benchmark in one click instead of having to know the URL. The leaderboard, methodology, review, agent-review, and reviews URLs are also now listed in the marketing-site sitemap so they’re discoverable from search. No action required.
  • Session Analytics thinking trace now renders wrapper-hosted tool calls by name. Sessions produced by MCP wrappers that hand each upstream call through as a discrete tool_call event — instead of one execute_script per turn — now render in the Session Analytics → Thinking Trace as their own card stamped with the actual MCP tool name (for example, LIST_DEPLOYMENTS or GET_RUN), with collapsible input, upstream API calls, result, and error sections plus the same OK / FAILED state and duration summary as script-mode events. These events also feed the Sessions, Topics, and Misses rollups and the demand-signal classifier, so wrapper-hosted sessions surface in the same lists and charts as Code Mode sessions instead of looking empty. Existing Code Mode execute_script behavior is unchanged. No action required.
  • Public Agent Review directory now has a top-level nav entry. The public benchmark site topbar now includes a Reviews link that goes straight to the public reviews directory at /reviews, so visitors can reach the agent-submitted review surface without knowing the URL. The link appears in both the landing-page and standard nav modes and highlights as active on the directory and per-target detail pages. The existing Let your agent rate the software it uses → pill in the topbar continues to route to the install page. No action required.
  • Agent Review install page moved to /install-review. The public install page for the Agent Review MCP — previously at /review — is now at /install-review, and the agent-review skill is served at /install-review/skill.md. The new path is clearly distinct from the /reviews directory of submitted reviews, removing the ambiguity between the two surfaces. Existing links to /review, /review/, /agent-review, and /agent-review/ redirect to the new path, and /review/skill.md continues to serve the skill for agents that already installed against the old URL. No action required — shared links and existing skill installations keep working.
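
For custom clients affected by the required telemetry.intent entry above, a call site might be updated along these lines. The JSON-RPC envelope follows the MCP `tools/call` request shape; the tool name `list_workflows` and the helper itself are hypothetical, and the local empty-string check is an added convenience rather than documented server behavior.

```python
import json
from typing import Any, Dict


def tool_call(request_id: int, name: str,
              arguments: Dict[str, Any], intent: str) -> str:
    """Build a tools/call request body that always carries telemetry.intent."""
    if not intent.strip():
        # Fail locally instead of waiting for a 400 missing_telemetry_intent.
        raise ValueError("telemetry.intent must be a non-empty sentence")
    body = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": name,
            "arguments": {**arguments, "telemetry": {"intent": intent}},
        },
    }
    return json.dumps(body)


payload = tool_call(1, "list_workflows", {},
                    "See which workflows exist before adding a new one")
print(payload)
```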

Bug fixes

  • Codex tester runs now honor the model you selected in the workflow. Workflows that ran on Codex with a ChatGPT-account login used to omit the model flag when launching the CLI, which let Codex silently fall back to the account’s profile default — so a workflow that selected, say, gpt-5-codex could end up executing on a different model the workflow never chose. Armature now passes the workflow’s selected model to Codex on every subscription run, so the model column on the Runs view matches what actually executed. If the Codex CLI rejects the selected model for your ChatGPT account (for example, the plan does not entitle it), the run automatically falls back to API-key Codex with the same selected model instead of failing. No action required — re-run any workflow whose recent Codex results looked off and the next run will use the model you picked.
  • Anthropic API harness now fails fast on MCP targets it cannot authenticate. Workflows that route through the Anthropic API tester harness (Claude API hosted MCP) can only forward a Bearer Authorization header to your MCP server — the Claude API does not forward custom auth headers like x-api-key, vendor-specific token headers, or non-Bearer Authorization schemes. Previously, running such a target on this harness would launch the run and then fail mid-execution with a confusing tool-call error. Armature now checks the target’s configured headers before dispatch and stops the run with a clear message naming the unsupported header(s) and suggesting an alternative harness (Claude Code, Codex, or ChatGPT) or switching the server to a bearer-token / proxy auth profile. No action required — if you see the new error, update the MCP server’s auth profile or pick a different tester model for that workflow.
  • Agent Review intake now teaches agents the correct shape on the fly. Agents submitting a compact review to the public Agent Review intake — the /mcp/agent-review MCP endpoint and matching HTTP API — used to get a flat body.outcome is not allowed when they put a known field at the wrong level, which read as forbidden rather than misplaced. Validation errors now self-teach: a misplaced known field returns a relocation hint (for example, body.outcome is not allowed — did you mean experience.outcome? (it belongs under "experience")), and a genuinely unknown key lists the allowed keys at that level. The MCP tool now also advertises a real nested input schema — subject / agent_context / experience / privacy_attestation with kind and interface_used enums — so MCP clients constrain the submission shape before the call. The published agent-review skill picks up a concrete JSON example showing outcome and scores nested under experience. Re-run npx skills add armature-tech/skills --all --global to pick up the updated skill. No action required beyond that.
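
The self-teaching validation in the Agent Review intake entry above can be sketched as a lookup from a misplaced key to the level where it belongs. The schema here is hypothetical and abbreviated (the four top-level sections plus the two experience fields the entry names), and the message wording only approximates the real errors.

```python
from typing import Dict, List, Set

# Hypothetical, abbreviated schema for illustration only.
ALLOWED: Dict[str, Set[str]] = {
    "body": {"subject", "agent_context", "experience", "privacy_attestation"},
    "experience": {"outcome", "scores"},
}


def validate_level(level: str, payload: dict) -> List[str]:
    """Return one error per disallowed key, with a hint where possible."""
    errors = []
    for key in payload:
        if key in ALLOWED[level]:
            continue
        # A known key at the wrong level gets a relocation hint.
        homes = [parent for parent, keys in ALLOWED.items()
                 if key in keys and parent != level]
        if homes:
            errors.append(
                f'{level}.{key} is not allowed; did you mean '
                f'{homes[0]}.{key}? (it belongs under "{homes[0]}")'
            )
        else:
            # A genuinely unknown key lists what is allowed at this level.
            allowed = ", ".join(sorted(ALLOWED[level]))
            errors.append(f"{level}.{key} is not allowed; allowed keys: {allowed}")
    return errors


print(validate_level("body", {"outcome": "success"}))
print(validate_level("body", {"rating": 5}))
```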

Updates

  • Public benchmark site now links to the Agent Review directory, and the install page has moved to /install-review. The top nav on every public benchmark page now exposes a Reviews entry pointing at the public Agent Review directory (/reviews), so the read surface for agent-submitted tool reviews is reachable directly from the topbar instead of only through deep links. As part of the same pass, the public install page for the Agent Review MCP has been renamed from /review to /install-review to remove the confusion between the install page (one URL) and the reviews directory (a different URL). The dismissible Let your agent rate the software it uses → CTA pill now points at /install-review. Old URLs keep working: /review, /review/, /agent-review, and /agent-review/ all 308 to /install-review, and the baked skill asset is now served from /install-review/skill.md with /review/skill.md still resolving for skills already installed against the old URL — so existing READMEs, shared links, and installed agent-review skills don’t break. No action required.
  • Public /review install page now leads with a one-line skill install. The public Agent Review install page at /review has been restructured to be skill-first: the hero now leads with a single copy-paste command — npx skills add armature-tech/skills --all --global — that drops the agent-review skill into every coding agent’s global skill directory in one cross-agent step, so the skill is available across every repo the agent touches rather than only the current project. A “What the agent will do” disclosure previews the five steps the agent runs. The previous per-agent MCP install tabs (Claude Code, Codex, Cursor, VS Code, Gemini, opencode, openclaw) are still available, but now live under a collapsible Prefer to wire the MCP yourself? disclosure for users who want to register the MCP server directly instead of going through the skill. The per-agent tab set also picks up new Claude Desktop (JSON config with macOS and Windows config paths) and ChatGPT (connector setup steps) tabs; the redundant HTTP fallback tab has been dropped so all tabs fit one row. No action required.
  • Public benchmark leaderboard has moved to armature.tech. The public benchmark site now lives on the marketing domain. Existing app.armature.tech URLs — for example, app.armature.tech/leaderboard, any per-category or vendor sub-path like app.armature.tech/leaderboard/observability/datadog, and app.armature.tech/methodology — continue to serve the same page, and the pages themselves declare the marketing domain as canonical for search engines, consolidating the two surfaces into one. Old bookmarks and shared links keep working; they just don’t update the URL bar. The Armature app at app.armature.tech is unaffected. No action required.
  • Hide the Billing tab for workspaces invoiced outside the self-serve flow. Workspaces whose billing is handled directly by Armature rather than through the in-app Stripe checkout can now have the Billing entry removed from the account menu and the Settings sub-nav, with direct URLs to /settings/billing redirecting to Organization instead. The billing panel still renders on the subscription-recovery path so an owner whose subscription has lapsed can still pay. Existing workspaces are unaffected; contact us to enable the block for your organization. No action required.
  • Hide the Connect Armature MCP nudges for Code Mode customers running their own MCP. Workspaces that are themselves an Armature-powered MCP — and so never install one — can now have both the floating topbar Connect Armature MCP nudge and the sidebar Connect card hidden, so the dashboard stops prompting users to set up a connection that doesn’t apply to them. Existing workspaces are unaffected; contact us to enable the block for your organization. No action required.
  • Public Agent Review detail pages now use human-readable URLs. Visiting a reviewed target on the public reviews directory now lands on a clean path like /reviews/mcp/github/github-mcp or /reviews/cli/stripe/stripe-cli, instead of the previous machine-style /reviews/mcp:github:github-mcp:mcp slug. The transform is lossless — vendors that ship both a CLI and an MCP under the same name still resolve to distinct, non-colliding URLs — and existing card clicks and direct links route through the new format automatically. No action required.
  • Public /review install page now leads with a one-line, cross-agent skill installer. The public /review install page has been restructured to be skill-first. The hero is now a single copy-paste agent prompt: Run npx skills add armature-tech/skills --all and use the skill to review the software you used. That prompt drops the agent-review skill into every coding agent’s skill directory (Claude Code, Claude Desktop, Cursor, Codex, Gemini, opencode, openclaw, and any other agent the open vercel-labs/skills installer at https://github.com/vercel-labs/skills supports) in one command, so the skill that makes the agent actually file a review is the entry point rather than a step 2. A collapsible What the agent will do disclosure under the hero spells out the five things the skill instructs the agent to do, including the privacy-safe screening pass before submission. The previous per-agent MCP tabs (claude mcp add …, cursor://…, vscode:mcp/install?…, and the rest) have been moved into a Prefer to wire the MCP yourself? disclosure further down the page, so they’re still one click away for operators who want to register the MCP server directly. The tab row inside that disclosure adds Claude Desktop (with the claude_desktop_config.json snippet and the macOS and Windows config paths) and ChatGPT (with the Settings → Connectors → Advanced settings → Developer mode → Create app walkthrough), and the previous HTTP fallback tab has been dropped from the row. No action required.
  • Public benchmark site nav now matches armature.tech. The topbar across every public benchmark page — the leaderboard landing, per-category pages (for example, /leaderboard/observability), the methodology page, vendor detail pages, and the public /review install page — has been harmonized with the rest of armature.tech. The landing page topbar now leads with Features and About us links and ends with a single Book a demo → CTA that replaces the previous orange “Benchmark your MCP with real agents” button. Benchmark and review pages keep a Methodology link in the topbar alongside the same Book a demo → CTA, and the dismissible Let your agent rate the software it uses → review pill is now an inline element in the topbar rather than a floating overlay, so it no longer collides with the search box on narrow screens. The ← Benchmark menu back link on per-category leaderboards and the public /review page moves out of its own header strip and into the page body, removing a duplicate navigation band. No action required.
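The legacy-URL handling described in the /install-review entry earlier in this list can be pictured as a small lookup. This is a sketch only: the resolve function and its return shape are hypothetical, and treating every other path as served directly is a simplifying assumption. The four legacy paths, the 308 status, and the /install-review destination come from the entry.

```python
# Legacy install-page paths named in the changelog entry.
LEGACY_INSTALL_PATHS = {"/review", "/review/", "/agent-review", "/agent-review/"}


def resolve(path):
    """Return (status, location) for a public benchmark path."""
    if path in LEGACY_INSTALL_PATHS:
        # 308 is a permanent redirect that preserves the request method.
        return 308, "/install-review"
    # Everything else, including /review/skill.md, is assumed to be served as-is.
    return 200, path
```

A permanent redirect rather than a rewrite means old shared links land on the new URL while installed skills that fetch /review/skill.md keep resolving.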

Bug fixes

  • Overlapping vendors on the public benchmark scatter chart now stay visible. When two vendors landed on the same (pass rate, efficiency) coordinate on the Success × Efficiency scatter — for example, Folk and HubSpot on /leaderboard/crm with the Gmail tag filter selected — one marker fully covered the other, making the smaller vendor invisible and unclickable. Co-located markers are now slightly offset so each is hoverable and clickable, with a faint connector linking the offset marker back to the true score point so the chart still reads accurately. No action required.
  • Clean no-friction Agent Review submissions are no longer rejected. Agents filing a compact review through the public Agent Review intake — the /mcp/agent-review MCP endpoint and the matching HTTP API — can now omit friction_tags on the happy-path 5/5 review where the agent hit no friction, and the submission is accepted as friction_tags: [] instead of being 400’d with friction_tags must be an array. A present-but-non-array value still errors as before. The detailed-report path already treated friction as optional. No action required.
  • Agent Review skill now ships the correct schema version. The agent-review skill installed by npx skills add armature-tech/skills --all --global previously documented schema_version: agent-tool-review.compact.v1, which the public intake API rejects — so agents following the published skill saw a 400 on submit. The skill now documents the required agent-review.compact.v1 value and clarifies that friction_tags is optional on a clean review. Re-run the skill install command to pick up the fix. No action required beyond that.
  • Public benchmark and review routes on armature.tech no longer 404. After last week’s move of the public benchmark to the marketing domain, armature.tech/leaderboard (and per-category and vendor sub-paths), /methodology, /review, /agent-review, and /reviews briefly returned 404 because the marketing site was missing the proxy rules that forward those paths to the app. The rules have been restored — every public benchmark and review URL now serves the right page directly. No action required.
  • Public benchmark URLs no longer redirect-loop. Visiting armature.tech/leaderboard, /leaderboard/observability, /methodology, /review, /agent-review, or /reviews briefly returned a redirect loop after the move to the marketing domain. The redirects driving the loop have been removed, so every public benchmark URL — on both armature.tech and app.armature.tech — now serves the page directly with no redirect hop. No action required.
  • Benchmark vendor logos render correctly. Vendor logos on the public leaderboard and category pages were being recoloured and cropped, making some marks hard to recognize. The recolour treatment has been removed entirely — every vendor icon now renders as a contained image on a neutral tile, preserving the original artwork and colours. Vendors with no icon asset still fall back to their two-letter mark. As part of the same pass, Railway’s previously invisible white-on-white logo has been replaced with a visible black mark. No action required.
  • Codex tester runs honor the model you selected. Codex CLI workflow runs authenticated with a ChatGPT subscription previously let the CLI silently fall back to the account or profile default — typically the latest Codex model on the account — even when the workflow had pinned a different model. Armature now passes the workflow’s selected model to Codex by default, so the column you chose on the leaderboard or in the workflow editor is the one that actually runs. No action required.
  • Codex plan-limit fallback catches model-entitlement rejections. Codex CLI runs whose ChatGPT account is not entitled to the workflow-selected model — previously surfaced as a confusing model is not supported when using Codex with a ChatGPT account failure — are now classified as a plan-limit condition and trip the same automatic API-key fallback as other limit errors, so workflow runs finish on the right model instead of erroring out. No action required.
  • Reviews and install-review nav links resolve to the public surfaces from every benchmark page. The Reviews link and the Let your agent rate the software it uses → install CTA in the public benchmark site topbar (and the responsive mobile nav row) previously rendered as same-origin paths, so visitors browsing benchmark pages on app.armature.tech were sent to app.armature.tech/reviews and app.armature.tech/install-review instead of the public reviews directory and install page. Both links now route to the public apex (armature.tech/reviews and armature.tech/install-review, plus the tryarmature.com equivalents on that domain) regardless of which host the visitor lands on, with local previews still resolving to relative paths. The mobile nav row also now keeps the Reviews link and the install CTA visible alongside Book a demo → instead of collapsing to demo-only. No action required.
  • Public Agent Review pages now load data on armature.tech. The public reviews directory at armature.tech/reviews and the per-target detail pages at armature.tech/reviews/<target> were rendering the page shell but coming up empty because the same-origin data fetch was returning NOT_FOUND: the underlying read API only lives on the app host. The reviews pages served from the marketing apex now fetch from the app host directly, with the right CORS headers and CSP allowance in place, so the directory grid and detail pages populate as expected on armature.tech just like they already did on app.armature.tech. No action required.
  • “View all” on the public reviews directory now expands the list. The View all N targets → control on the public reviews directory at /reviews now expands the capped top-6 grid to show every reviewed target in place, instead of being a no-op affordance that just cleared the search box. The meta label updates to All N targets once expanded, and the control hides itself after activation. The cap behavior on first load and the search-driven filter are unchanged. No action required.
  • Anthropic API harness fast-fails MCP targets it can’t authenticate against. Workflow runs on the Anthropic API tester against an MCP server configured with auth headers Claude API cannot forward — anything other than a Bearer Authorization header, such as X-API-Key, X-SigNoz-*, or other custom auth or token headers — now fail immediately with a clear message naming the offending header(s) and pointing you at Claude Code, Codex, ChatGPT, or a bearer-token / proxy auth profile. Previously these runs proceeded and produced confusing upstream 401s from the target. No action required — switch tester harnesses or rewrite the MCP server’s auth to Bearer if you hit the new gate.
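The pre-dispatch gate in the Anthropic API harness entry above might look something like the sketch below. It assumes every header configured on the MCP target is an auth header, which the changelog does not state, and both function names and the error wording are hypothetical. The rule itself (only a Bearer Authorization header can be forwarded) and the suggested alternatives come from the entry.

```python
def unsupported_auth_headers(headers):
    """Return names of configured headers that are not a Bearer Authorization header."""
    offending = []
    for name, value in headers.items():
        is_bearer = (
            name.lower() == "authorization"
            and value.strip().lower().startswith("bearer ")
        )
        if not is_bearer:
            offending.append(name)
    return offending


def check_before_dispatch(headers):
    """Raise before the run starts instead of failing mid-execution."""
    offending = unsupported_auth_headers(headers)
    if offending:
        raise ValueError(
            "Anthropic API harness cannot forward: " + ", ".join(offending)
            + ". Use Claude Code, Codex, or ChatGPT, or switch the server to a "
            "bearer-token / proxy auth profile."
        )
```

Checking the configuration up front is what turns a confusing mid-run tool-call failure into an immediate, named error.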

Week of May 31, 2026

Updates

  • Hide the Testing & Benchmarks tabs for Session Analytics–only workspaces. Workspaces whose product surface is Session Analytics rather than the workflow tester can now have the Testing & Benchmarks sidebar entries — Insights, Workflows, and Tool monitors — greyed out as non-interactive items, with a tooltip explaining that the feature isn’t included. Direct URLs to those routes redirect to Sources so the workspace lands in the right place by default. Existing workspaces are unaffected; contact us to enable the block for your organization. No action required.
  • Hide the Billing tab for workspaces billed outside the self-serve flow. Workspaces that Armature bills directly — outside the self-serve Stripe flow on the Billing page — can now have the Billing entry hidden from both the account menu and the Settings sub-nav. Direct URLs to /settings/billing redirect to /settings/organization so the workspace lands on a page it can actually use. The billing panel still renders on the subscription-recovery path so an owner with an unpaid invoice can still complete payment. Existing workspaces are unaffected; contact us to enable the block for your organization. No action required.
  • Hide the “Connect Armature MCP” pitch for Code-Mode customers. Workspaces that are themselves an Armature MCP — and therefore never install one — can now have both Connect Armature MCP surfaces suppressed: the floating topbar nudge and the sidebar Connect card. The rest of the MCP server connect flow is unchanged for workspaces that do install one. Existing workspaces are unaffected; contact us to enable the block for your organization. No action required.

New features

  • Public Agent Review directory and detail pages. The read surface for agent-submitted tool reviews is now live on the public benchmark site, completing the loop that started with the Agent Review intake and the public install page. A new /reviews directory lists every reviewed target (CLIs, MCP servers, APIs, SDKs, hosted services, and web apps) as a score-ranked card grid, with a live name and vendor filter and a split hero introducing the surface. Clicking a card opens /reviews/<target>, a detail page that leads with a big score, a three-bar outcome breakdown (worked / partial / blocked), completion rate, and the number of distinct agent harnesses that contributed, followed by a stream of individual review cards. Each card surfaces the agent harness that filed the report (Claude Code, Codex, ChatGPT, Gemini CLI, Cursor, or Anthropic API) with model and transport as subtext, so readers can see who reported what. Outcome filter chips on the detail page let you narrow the stream to only the worked, partial, or blocked reports. Only accepted, moderated reviews appear. No action required.
  • Public install page for the Agent Review MCP. The agent review intake shipped last week now has a public install surface at /review (alias /agent-review) on the benchmark site, so any coding agent can be pointed at the public review MCP without a workspace or sign-in. The page leads with per-agent install tabs (Claude Code, Codex, Cursor, VS Code, Gemini, opencode, openclaw, and an HTTP fallback), each with a one-line quick-install command and the self-review trigger, mirroring the formats from the authenticated MCP install page (claude mcp add …, cursor://…, vscode:mcp/install?…). A concise Privacy-safe by design panel breaks down what’s stored vs. never stored. The public MCP endpoint is reachable at a clean install URL (mcp.armature.tech/mcp/review) with an HTTP fallback. A new Let your agent rate the software it uses → CTA pill (dismissible, persisted per browser) sits in the header on the public leaderboard landing and methodology pages, and in the secondary menu bar on per-category leaderboards, linking straight to /review. No action required.
  • Agent Review skill bundled into the Claude Code install. The Claude Code tab on the public /review install page now ships the agent-review skill alongside the MCP server, so Claude Code sessions get not just the submit tool but the prompt, ground rules, and report shape that tell the agent when and how to file a privacy-safe review. The tab now exposes two separate copyable commands — one to register the MCP server, one to install the skill into ~/.claude/skills and the cross-agent ~/.agents/skills directory — plus a Copy both as one command button that places a single chained one-liner on the clipboard without rendering it on the page. The skill itself is served as a static .md asset on the public MCP host (no pipe-to-shell), so installation only writes a single named file to a known path. No action required.
  • Session Analytics Overview now shows a Clients distribution chart. The Session Analytics → Overview page picks up a new Clients chart that breaks down sessions by the MCP client harness that produced them — Claude Code, Claude, Codex, OpenClaw, OpenCode, Cursor, VS Code, Gemini, ChatGPT — using the same real logo assets as the rest of the product, so a harness reads identically across Testing & Benchmarks and Session Analytics. Unrecognized client strings fold into an Other bucket and sessions without a recorded client into Unknown, both shown as a neutral monogram chip. Client identification ordering was tightened so previously mislabeled clients (for example codex-mcp-client getting matched as MCP Inspector, or the Claude remote proxy getting matched as Claude Code) now resolve to the right brand. No action required.
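The point about client identification ordering in the Clients chart entry above can be illustrated with a first-match-wins rule list. The patterns and substring matching below are hypothetical, since the changelog does not describe the real matcher. What the entry does state is that unrecognized strings fall into Other and sessions with no recorded client into Unknown.

```python
# Hypothetical (pattern, label) rules. First match wins, so more specific
# patterns must come before broader ones that would also match.
CLIENT_RULES = [
    ("codex-mcp-client", "Codex"),
    ("claude-code", "Claude Code"),
    ("claude", "Claude"),
    ("cursor", "Cursor"),
]


def identify_client(raw):
    """Map a raw client string from the initialize handshake to a chart label."""
    if not raw:
        return "Unknown"  # no client recorded for the session
    name = raw.lower()
    for pattern, label in CLIENT_RULES:
        if pattern in name:
            return label
    return "Other"  # recorded, but not a recognized harness
```

If the broad "claude" rule were listed before "claude-code", every Claude Code session would be labeled Claude, which is the kind of mislabeling that stricter ordering avoids.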

Updates

  • Session Analytics: Tool calls over time chart redesigned for legibility. The Tool calls over time chart on the Session Analytics → Overview has been rebuilt as a stacked bar chart with real axes, replacing the previous stacked-area rendering that painted execute_script solid down to the baseline and visually overdrew the search_sdk band beneath it — so search_sdk volume (around 28% of calls for some orgs) was effectively invisible. The chart now shows a charcoal search_sdk base with execute_script stacked on top in the brand accent, a real y-axis with clean tick steps, and an x-axis with date labels thinned to avoid collisions on dense ranges. Axis text uses a uniform-scale viewBox so labels never stretch on resize. The failed-search series, which was noise on this overview chart, has been dropped — failed searches remain available on the dedicated Misses view. No action required.
  • Session Analytics hides handshake-only sessions from every view. Sessions that contain only an initialize handshake with no tool calls, script executions, or searches — for example a client probing the gateway on startup — are now systematically suppressed from Session Analytics overview KPIs, the sessions list, filter counts, and neighbor links, instead of inflating session counts and skewing rollups. The underlying telemetry is preserved, not deleted; rows are marked as hidden so they can be unhidden later if needed. Legacy sessions captured before the Client chip was added can also be hidden in bulk by operators. No action required.
  • Session Analytics drops LLM-judged success and failure verdicts. The session-level LLM judge that stamped each Session Analytics session as a workflow success or failure — and the per-event outcome the classifier was producing alongside it — has been removed. In practice the verdict was confidently wrong often enough, and expensive enough to run on every production session, that it was adding more noise than signal. The Session Analytics → Overview workflow success rate panel, the Failed sessions view and Most failing sort, the outcome bars and pills on session rows, the verdict footer and Rejudge button on the thinking trace, and the per-topic outcome breakdown column are all removed as part of this change. What stays: failed-search miss clustering, execution errors and transient-error flags drawn from the factual ok field on each event, topic clustering, intents, and per-event frustration inference — the trace timeline still tints rows red on real failures. The Testing & Benchmarks app’s pass/fail grading is completely unaffected. No action required.
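The handshake-only suppression described earlier in this list amounts to flagging sessions that never did any real work and leaving the rows in place. The sketch below assumes a simple session shape with an events list and a hidden flag; the event type names and function names are hypothetical.

```python
# Hypothetical event type names for real activity within a session.
ACTIVITY_TYPES = {"tool_call", "execute_script", "search"}


def is_handshake_only(events):
    """True when a session has no tool calls, script executions, or searches."""
    return not any(event["type"] in ACTIVITY_TYPES for event in events)


def visible_sessions(sessions):
    """Mark handshake-only sessions as hidden and return the rest.

    Rows are flagged rather than deleted, so they can be unhidden later.
    """
    for session in sessions:
        session["hidden"] = is_handshake_only(session["events"])
    return [s for s in sessions if not s["hidden"]]
```

Because KPIs, lists, and filter counts would all read from the visible set, a client that only probes the gateway on startup no longer inflates session totals.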

Bug fixes

  • Benchmark run detail stays reachable when Testing & Benchmarks tabs are blocked. Workspaces with the Testing & Benchmarks sidebar blocked (per the update above) can still open benchmark leaderboards, but clicking a row on the leaderboard previously redirected to Sources instead of opening the run — the block covered both the run history list and individual run detail pages. The guard now applies only to the run history list, so deep links from the benchmark leaderboard to a specific run load as expected. No action required.

Week of May 29, 2026

New features

  • Agents can submit privacy-safe reviews to Armature. A new public Agent Review intake lets coding agents share short, structured experience reports about the CLIs, MCP servers, APIs, SDKs, hosted services, and web apps they actually used during a task — so the next agent or operator picking the same tool can see what worked, what broke, and how to recover. There are two submission paths: a public JSON-RPC MCP endpoint at /mcp/agent-review (submit_agent_review and submit_agent_review_detail) and a public HTTP API (POST /api/agent-review for the compact report, POST /api/agent-review/detail when the response asks for more depth). Submissions are unauthenticated, rate-limited, moderated, and gated to strip secrets, private data, raw logs, and stack traces. An accompanying agent-review skill package, plus the agent-review.compact.v1 and agent-review-report.v1.json schemas, give agents the prompt, ground rules, and report shape to fill in before submitting. The subject key is derived from kind, vendor, product, and interface so the same tool clusters cleanly across reports. A public read surface for the aggregated rollups is not yet exposed — it will follow once the read product is finalized. See the MCP API overview.
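To make the subject key idea concrete, here is a minimal sketch of deriving one key from kind, vendor, product, and interface. The colon-joined format is an assumption inferred from the machine-style slug mcp:github:github-mcp:mcp quoted in a later changelog entry, and the normalization rules (trim, lowercase, spaces to hyphens) and the function name are hypothetical.

```python
def subject_key(kind, vendor, product, interface):
    """Derive a stable clustering key so reports on the same tool group together."""
    parts = (kind, vendor, product, interface)
    # Normalize each part so casing and spacing differences do not split clusters.
    return ":".join(p.strip().lower().replace(" ", "-") for p in parts)
```

For example, subject_key("mcp", "GitHub", "GitHub MCP", "mcp") and subject_key("MCP", "github", "github-mcp", "mcp") would both yield the same key under these assumptions, so two agents describing the tool slightly differently still land in one cluster.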

Updates

  • Session Analytics drops LLM success/failure verdicts in favor of factual signals. Session Analytics no longer runs an LLM judge over every production MCP session, and the verdict surface has been removed from the app: the Overview workflow success-rate panel and the genuine-vs-transient failure split, the Sessions outcome column with its bars and pills, the Most failing sort, the Failed sessions view, the per-topic outcome breakdown column, and the Thinking Trace verdict footer and Rejudge button are all gone. Thinking Trace result tinting now keys off the factual event.ok flag from the transport instead of an LLM verdict. Factual signals are unchanged — search misses, executor errors, transient-error flags, topic clustering, intents, and per-event frustration inference still populate the same views. The Testing & Benchmarks app’s grading is untouched. No action required.
  • Session Analytics now groups events by real MCP session and shows the client harness. Session Analytics sessions are now keyed on the Mcp-Session-Id the transport already carries, so each row corresponds to one real client session instead of a 15-minute time window — two clients hitting the gateway in parallel can no longer be merged into a single session, and a long-running session no longer gets split when there’s a quiet stretch. Sessions where no session id is present fall back to a per-actor daily bucket, also scoped so different clients can’t collide. The Sessions list row meta and the Session card in Thinking Trace now surface a Client chip — the MCP client name, version, and negotiated protocol version learned from the initialize handshake (for example, Claude Code 2.1.149 · MCP 2025-06-18) — so you can tell at a glance which harness produced a session. Sessions captured before this update render as Unknown and omit the chip. No action required.
  • Public benchmark site header now points back to Armature. The topbar across every public benchmark page — the landing, per-category leaderboards (for example, /leaderboard/observability), the methodology page, and vendor detail pages (for example, /leaderboard/observability/datadog) — now leads with the Armature wordmark linking to the marketing site, and ends with a prominent orange Benchmark your MCP with real agents → CTA that returns visitors to the leaderboard landing. On narrower screens the CTA moves into its own full-width strip below the topbar so it stays tappable. No action required.
  • Session Analytics: Overview now leads with workflow success rate and separates transient failures. The Session Analytics → Overview headline metric is now a session-level workflow success rate based on the per-session LLM verdict, instead of a per-call rollup that read every individual tool-call failure as a workflow failure. Sessions where the agent hit a transient infrastructural error — an upstream 429 or 5xx, a sandbox missing-global like btoa or Buffer, or a script wall-clock timeout — and then retried and recovered are now classified as transient and excluded from the failure count, with a new genuine-vs-transient failure split on the Overview so a burst of upstream rate limits no longer reads as a product regression. Bare MCP handshakes (a session with no script execution or search activity) are now stamped empty and dropped from the success/fail denominator instead of being judged ambiguous. Historical sessions re-stamp deterministically — no LLM re-judge — so existing dashboards realign on the next refresh. No action required.
  • Session Analytics: “Gaps” is now “Misses”. The failed-search view in Session Analytics is renamed from Gaps to Misses, and the “What is this?” strip no longer frames every empty search as a missing roadmap API. A miss can be a capability you haven’t built — or a tool you already ship under a name the agent didn’t search for. The cluster detail’s header now reads SEARCH MISS · N EMPTY SEARCHES, member queries are labeled Searched for, and the suggestion section is now Suggested fix with copy that distinguishes the two cases. Sample session rows surface the actual searched query at the top and drop the per-session sigil and short ID for a cleaner read. No action required.
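The session grouping change described earlier in this list (key on Mcp-Session-Id, otherwise fall back to a per-actor daily bucket scoped by client) can be sketched as a single keying function. The event field names, the key format, and the use of UTC days are assumptions for illustration.

```python
from datetime import datetime, timezone


def session_key(event):
    """Return the grouping key for one event.

    Field names (mcp_session_id, actor_id, client, ts) are hypothetical.
    """
    session_id = event.get("mcp_session_id")
    if session_id:
        # One real client session per key, regardless of gaps or parallel clients.
        return f"sid:{session_id}"
    # Fallback: per-actor daily bucket, scoped by client so clients cannot collide.
    day = datetime.fromtimestamp(event["ts"], tz=timezone.utc).date().isoformat()
    client = event.get("client", "unknown")
    return f"actor:{event['actor_id']}:{client}:{day}"
```

Unlike a fixed 15-minute window, this key does not depend on how events are spaced in time, so a quiet stretch cannot split a session and two parallel clients cannot be merged.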

Bug fixes

  • Session Analytics: outcome and frustration verdicts now land reliably. The per-session judge and per-event classifier in Session Analytics previously failed on the majority of events because the underlying LLM occasionally returned its JSON wrapped in a prose preamble or a fenced code block, and the response budget was too tight to fit the full verdict — so the reply came back truncated and the row was stamped classifier_error (ambiguous outcome, low frustration). The judge now tolerates those wrapped responses and has enough headroom to return the full verdict, so sessions and events surface their real outcome and frustration instead of falling back to a generic error state. Previously polluted classifications re-judge on the next pass. No action required.
  • Session Analytics: Overview chart now plots search_sdk and failed search series. The Tool calls over time chart on the Session Analytics → Overview previously rendered the search_sdk and failed search series as permanently empty because it only counted standalone search events and ignored the searches that happen inside execute_script calls in Code Mode. The chart now counts every individual search call (standalone and inline) and its miss subset, so both series populate and match the failed_searches KPI on the same page. No action required.
  • Session Analytics: Misses page now shows every search miss, not just one. The Misses view (formerly Gaps) previously drew from every demand cluster, including intent-only clusters that contained no actual search misses — which crowded out real ones and was the root cause of “only one miss showing” for some workspaces. The list now draws strictly from search-miss signals: semantically-grouped clusters at the top, followed by an Ungrouped misses long-tail section with one row per exact failed query. Each row still drills into the sessions that ran it. No action required.
  • Session Analytics: Misses page collapses lexical variants of the same gap. The Misses view in Session Analytics now merges short keyword variants of the same capability gap into a single row, so a demand signal like “opportunity” / “opportunities” / “listOpportunities” presents as one row with the combined member count and sample sessions, instead of fragmenting into two small clusters plus an orphan. Singular and plural forms, camelCase and PascalCase variants, and a leading CRUD or list verb on a short keyword are folded together; genuinely distinct multi-word searches (for example, “opportunity pipeline”) still stay as their own row. The detail pane that opens when you click a merged row reports the same combined count and the union of underlying queries. No action required.
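The lexical folding in the entry above can be approximated with a small normalization key. This sketch uses a naive plural rule and an assumed verb list, and it is not the clustering Armature runs; it only shows how singular and plural forms, camelCase or PascalCase variants, and a leading verb on a short keyword can collapse to one row while multi-word searches stay distinct.

```python
import re
from collections import defaultdict

# Assumed set of leading CRUD/list verbs to strip from short keywords.
LEADING_VERBS = {"list", "get", "create", "update", "delete"}


def split_words(query):
    """Split camelCase, PascalCase, and whitespace into lowercase words."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", query)
    return [w.lower() for w in re.split(r"[\s_\-]+", spaced) if w]


def singular(word):
    """Naive singularization; enough for the illustration, not for real English."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def fold_key(query):
    """Return the key under which lexical variants of a query are merged."""
    words = split_words(query)
    # Strip a leading verb only when it prefixes a single short keyword.
    if len(words) == 2 and words[0] in LEADING_VERBS:
        words = words[1:]
    return " ".join(singular(w) for w in words)


def merge_misses(queries):
    """Group failed queries by their folded key."""
    clusters = defaultdict(list)
    for query in queries:
        clusters[fold_key(query)].append(query)
    return dict(clusters)
```

Under these rules "opportunity", "opportunities", and "listOpportunities" share one key, while "opportunity pipeline" keeps its own row.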

New features

  • Reworked public benchmark navigation. The public benchmark site has a simpler, more direct shape. The topbar and category controls on per-category leaderboards and vendor pages have been deduplicated into a single shared menu, so there’s exactly one way to switch categories no matter which page you land on. The /leaderboard landing page now presents each category as a tile that links straight into its leaderboard, with the vendor list expandable inline; clicking a vendor row goes directly to the vendor’s detail page — for example, /leaderboard/observability/datadog — instead of routing through an intermediate category view. The vendor page itself drops the prev/next adjacent-vendor strip in favor of a clean ← Benchmark menu back link. No action required.
  • Workflow prompts and rubrics on the public benchmark. Every workflow column on a per-category public leaderboard — for example, /leaderboard/observability — now reveals its purpose, the exact tester prompt the agent was given, and the numbered success criteria the evaluator graded against, without leaving the page. The scatter chart’s workflow filter shows a side popover with the full name and description as you focus each row; selecting a single workflow surfaces an explainer strip above the chart with a View rubric → button; head-to-head rows pick up a hover popover with the criteria count and a View full rubric link; and on the per-vendor page (for example, /leaderboard/observability/datadog) each workflow row now expands inline to show the full prompt and rubric, with a N workflows tested · M total success criteria evaluated summary in the card header. Workflows without published prompt or criteria render label-only — no info icon or expand affordance. No action required.

Updates

  • Public benchmark only lists categories with real data. The public benchmark site no longer shows a category until it has aggregated runs, replacing the previous SOON pill and Audit in progress placeholder panels. A category appears on the landing page and the category menu the moment its first snapshot is published, and disappears cleanly otherwise — so visitors never land on a leaderboard that’s about to spin forever or show an apology. The staff Benchmark Admin toggle is relabeled Show on public benchmark, with helper text noting that an enabled category surfaces automatically once it has aggregated runs. If the categories API itself is unreachable, the landing now shows Couldn’t load the catalog. Please refresh. instead of a misleading empty state. No action required.
  • Staff Benchmark Admin sidebar icon restored. The Benchmark Admin entry in the staff sidebar was rendering as a blank tile because its icon name didn’t resolve in the shared icon set; it now shows a gavel icon, visually distinct from the Settings gear used by Organization settings. Staff-only; no change to the public leaderboards. No action required.

Bug fixes

  • Session Analytics: Topics drill-down lands on real sessions. Clicking a provisional topic row in Session Analytics → Topics now opens the Sessions table populated with every session whose events match that topic, instead of an empty table. Previously the backend filter only matched a session’s first script execution, so sessions where the topic surfaced on a later step were silently dropped — on one workspace, the same row went from 11 matched sessions to 31 after the fix. No action required.
  • Session Analytics: Searches drill-down is now clickable. Provisional failed-search rows in Session Analytics → Searches now drill into the Sessions table, filtered to the sessions that ran that exact failed search, with a clearable Sessions that searched ”…” chip that mirrors the existing intent chip. Previously these rows were rendered but inert. No action required.
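The Topics drill-down fix above comes down to matching a topic against any script execution in a session rather than only the first. A minimal sketch of that difference follows; the `Session` shape and both function names are hypothetical and are not taken from the Armature backend.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    session_id: str
    # Topic label per script execution, in order (hypothetical shape).
    execution_topics: list[str] = field(default_factory=list)


def matches_first_only(session: Session, topic: str) -> bool:
    """Old behavior as described: only the first script execution is checked."""
    return bool(session.execution_topics) and session.execution_topics[0] == topic


def matches_any(session: Session, topic: str) -> bool:
    """Fixed behavior as described: any execution in the session can match."""
    return topic in session.execution_topics


sessions = [
    Session("s1", ["billing", "export"]),
    Session("s2", ["export", "billing"]),  # topic surfaces on a later step
    Session("s3", ["search"]),
]
old = [s.session_id for s in sessions if matches_first_only(s, "billing")]
new = [s.session_id for s in sessions if matches_any(s, "billing")]
```

Here `old` contains only `s1`, while `new` also picks up `s2`, which is the kind of session the earlier filter dropped.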

Week of May 25, 2026

New features

  • Public benchmark methodology now lives on its own page. The public benchmark site has a new /methodology page that explains how a leaderboard is produced — the four-step pipeline of workflows, dispatch, judge, and rank — so visitors who arrive on a leaderboard via deep link or search can find the explanation without scrolling back to the landing page. The page is reachable from the topbar on the landing page, the topbar on every per-category leaderboard, and the landing footer, and links straight back to /leaderboard. No action required.
  • Curator-driven capability gaps on the public benchmark. Vendors are no longer penalized on the public leaderboard for workflows their product genuinely doesn’t ship. The Armature team can now flag specific vendor × workflow pairs as capability gaps from the staff Benchmark Admin surface (new Capability gaps tab), with a curated reason explaining why. Flagged cells render as N/A on the leaderboard with a tooltip, are excluded from the vendor’s overall rank and from efficiency normalization, and appear together in a new Capability gaps section under the head-to-head matrix grouped by workflow. On /leaderboard/database, Neon, Planetscale, Insforge, and Insforge CLI are no longer dinged on edge_fn_lifecycle — none of them ship edge functions — and their star ratings now reflect only the workflows they actually support. Changes propagate on the next hourly aggregator pass. No action required.
  • Public leaderboard is live at friendly URLs. The public benchmark site is now reachable at /leaderboard (landing) and /leaderboard/<category> (per-category leaderboard) — for example, /leaderboard/observability and /leaderboard/crm. Direct links, the landing page’s category tiles, vendor cards, and the staff Benchmark Admin “View public leaderboard” button all route through these URLs, and each leaderboard preselects the right category based on the path. The previous obscure paths still work for internal demos but are kept out of search results. No action required.
  • Public benchmark adds Database and Cloud deploy categories. Two new categories are now live on the public benchmark leaderboard. Database evaluates Supabase, Neon, Planetscale, Insforge, and Insforge CLI across eight workflows: schema discovery and query plan inspection, table lifecycle, column lifecycle, RLS policy lifecycle, bulk upsert with conflict recovery, branching, analytics ingestion and rollup, and edge function lifecycle. Cloud deploy / hosting evaluates Vercel, Vercel CLI, Netlify CLI, Railway CLI, and Firebase across five workflows: preview deploy, deploy inspect, deploy diagnosis, environment variable lifecycle, and domain alias. Both categories are seeded from completed runs in the house benchmark org (17 batches for Database, 5 for Cloud deploy) and refresh on the standard hourly aggregator cadence, with the same pass-rate and efficiency scoring as the observability and CRM categories. The category list, vendors, and workflows are editable post-launch from the staff Benchmark Admin surface. No action required.
  • Staff orchestrator for the public benchmark. A new Benchmark Admin surface lets the Armature team edit the public benchmark catalog — which categories are live, the vendor list, and the workflow columns — directly from the dashboard. Changes pick up on the next aggregator refresh and appear on the public leaderboard at /benchmark/:category without a redeploy. Visible only to Armature staff; non-staff don’t see the nav entry. No visible change to the public observability or CRM leaderboards — this only moves the catalog from a config file into the dashboard. No action required.
  • Session Analytics: MCP Analytics v2 (early access). The MCP Analytics surface announced last week has been rebuilt as Session Analytics, a standalone app with its own sidebar and app switcher alongside Testing & Benchmarks. Five views replace the v1 tabs: Overview, Topics (auto-clustered user intents with expandable detail), Searches (failed-search clusters paired with LLM-suggested REST endpoints to close the gap), Sessions (anonymous session list with a flagging side-rail), and Thinking Trace (per-session timeline that renders the agent’s reasoning event-by-event: thoughts, search hits and misses, and script executions with a sub-call waterfall). Each session is identified by a deterministic SVG sigil so operators can recognize patterns without seeing user identities. Still gated behind the mcp_analytics feature flag; workspaces without it see no change. Contact us to enable it.
  • Public benchmark adds a CRM category. The public benchmark leaderboard now covers a second category alongside observability: CRM, with Attio, Folk, and Hubspot evaluated across four CRM workflows. The category is reachable from the public landing page CRM tile, which deep-links into the leaderboard with the right category preselected, and the page chrome (tab title, header chip, hero copy) now reflects the active category instead of being hardcoded to observability. Folk will surface automatically once its first completed batch lands. No action required.
  • Public observability benchmark site (soft launch). A standalone public leaderboard now compares how well agent harnesses solve real observability tasks against four vendor MCP servers — Datadog, Grafana, SigNoz, and Dash0 — across nine workflows covering error logs, failed traces, environment audits, on-call handoffs, regression detection, incident autopsies, MCP inventory, metric queries, and signal-presence audits. Each vendor cell shows pass rate and an efficiency score derived from run duration, token usage, tool calls, and retries. Rankings refresh nightly from the most recent completed benchmark batch per (vendor, workflow), and an append-only history table preserves rank deltas across runs. The site is soft-launched on an unlinked URL while we validate the data — a public landing page and link from the marketing site will follow.
  • MCP Analytics (early access). A new product-intelligence surface for teams running their own Code Mode MCP gateway. Once enabled for your workspace, the MCP Analytics tab shows what your agent’s users actually asked for, where they hit dead ends, and clustered demand signals from search misses and failed scripts — across an Overview, Sessions, Intents, and Demand view. Sessions and per-execution detail include the agent’s intent, the executed script, the upstream call trace, and an LLM-judged outcome and frustration level. Operators with access can also edit the intent taxonomy and re-classify historical events from Settings. Rolling out behind a feature flag — contact us if you’d like your workspace turned on.
  • Dispatch benchmark batches from chat. Two new tools on the Armature MCP API let an agent fan out a benchmark workflow across a matrix of MCP server targets and tester models without opening the dashboard. dispatch_benchmark_batch claims the matrix and enqueues one run per (mcpServerId, testerTarget) cell, then streams results back through the existing inspect_run and search_runs tools. list_benchmark_batches returns past batches for a benchmark workflow, most recent first, with matrix composition and run-count rollups. dispatch_benchmark_batch requires the editor role or higher; list_benchmark_batches is available to any role with read access — see the role table.
  • Edit the authentication method on an existing CLI target. The edit panel for a CLI MCP server now exposes the same authentication section as the connect modal, so you can switch a CLI between No auth, API key (env var), and OAuth without deleting and recreating the target. Stored API keys can also be rotated in place — type a new value into the secret field and save. OAuth switches use the same popup flow as the connect modal, and an empty-scopes guard prevents silent unscoped grants. See Connecting an MCP server.
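The matrix fan-out described for dispatch_benchmark_batch above (one run per (mcpServerId, testerTarget) cell) can be pictured as a Cartesian product. The sketch below only illustrates that shape: `plan_batch_runs`, the dictionary layout, and the example IDs are hypothetical and do not represent the tool's actual request or response format.

```python
from itertools import product


def plan_batch_runs(mcp_server_ids: list[str], tester_targets: list[str]) -> list[dict]:
    """Return one planned run per (mcpServerId, testerTarget) cell of the matrix."""
    return [
        {"mcpServerId": server_id, "testerTarget": tester}
        for server_id, tester in product(mcp_server_ids, tester_targets)
    ]


# Placeholder identifiers, not real Armature IDs.
runs = plan_batch_runs(["srv_a", "srv_b"], ["tester_x", "tester_y", "tester_z"])
```

With two servers and three tester targets, the plan holds six cells, so the size of a batch grows multiplicatively with each axis of the matrix.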

Updates

  • Public leaderboard cards and search jump straight to the vendor page. Vendor cards in each category on the /leaderboard landing, and the suggestions in the hero search dropdown, now deep-link directly to /leaderboard/<category>/<vendor> instead of the category comparison view. Clicking a Datadog card or selecting Datadog from the search now opens the Datadog detail page in one step. Categories that aren’t live yet still scroll to the category list, and per-category leaderboards remain reachable from the sidebar and the See full leaderboard links. No action required.
  • Public benchmark dropdowns now match the leaderboard design system. The workflow filter on the scatter chart and the A/B vendor pickers on head-to-head now open into a styled dropdown panel — mono font, square corners, brick brand colors, and a brand-soft wash on the current selection — instead of falling back to native browser chrome. Keyboard navigation is supported throughout: tab to focus, Enter or ↓/↑ to open, ↑/↓ to move (skipping disabled rows), Enter to commit, Esc or Tab to dismiss, and click-outside to close. The head-to-head pickers continue to grey out whichever vendor is already selected on the opposite side so you can’t pick the same vendor twice. No action required.
  • Public leaderboard hero now shows the harness lineup. The /leaderboard landing page now ends its hero with a scrolling “Real agents on every harness your users actually use” strip, listing the eight agent harnesses every leaderboard row is produced by: Claude Code, Codex, OpenClaw, Claude, ChatGPT, Gemini CLI, OpenCode, and Cursor. The strip makes the harness coverage visible at a glance before you drop into a category — see the public leaderboard. No action required.
  • Claude benchmark tester now uses ToolSearch by default. Claude tester runs on the public observability and CRM benchmarks now go through the same ToolSearch layer that real Claude Code users hit when calling an MCP server, instead of eagerly loading every tool on every turn. This makes the Claude column on the public leaderboard a faithful proxy for the experience customers’ end users actually see. The tradeoff: oversized vendor tools/list catalogs that previously caused Claude to autocompact-thrash are now hidden from the benchmark by ToolSearch, so the leaderboard no longer surfaces schema-bloat issues on its own. Rankings on cleanly-passing cells may shift slightly after the next aggregator refresh. No action required.
  • Benchmark Admin: activate or deactivate vendors without deleting them. Vendor rows in the staff Benchmark Admin surface now have an eye / eye-off toggle alongside the existing upload, edit, and trash actions. Deactivating a vendor keeps all of its configuration — brand color, icon, sort order, and linked MCP server — but removes it from the public leaderboard on the next aggregator refresh. Inactive rows fade and show an “Inactive” pill so the catalog stays legible. This replaces the previous workflow of hard-deleting a vendor to hide it during a relaunch, data backfill, or while waiting on a fix, which wiped its configuration and forced staff to re-enter everything to bring it back. Staff-only; no change to the public leaderboards beyond which vendors appear. No action required.
  • Public benchmark leaderboard search spans every category. The topbar search on the public benchmark leaderboard now finds vendors across every live category instead of only the one you’re currently viewing. Searching for Attio while looking at Cloud deploy / hosting, for example, now surfaces the Attio result from CRM and deep-links you to that category’s leaderboard with the right row in focus. The empty-query suggestion dropdown also shows the top vendors sorted alphabetically across every category rather than biasing toward the current page, and clicking a vendor in the current category still focuses its row as before. If the category catalog is unreachable, search falls back to the current category so the existing flow keeps working. No action required.
  • Public benchmark landing page picks up new categories automatically. The category tiles on the public benchmark landing page now come from the live public benchmark config instead of a hardcoded list, so categories added or enabled from the staff Benchmark Admin surface — currently Database and Cloud deploy / hosting — surface on the landing without a frontend change. Per-category vendor cards still come from the public benchmark API, so a newly enabled category lists immediately and fills in with data once the aggregator has run against the house benchmark org’s batches. No action required.
  • Benchmark Admin: vendor logos, inline edit, and trash-icon remove. Vendor rows in the staff Benchmark Admin surface now render the real vendor logo — sourced from the shared logo bucket when available, or the Simple Icons CDN by the vendor’s icon slug — instead of a colored placeholder, with the chip background flipping to white when a logo renders. Each vendor and workflow row also picks up an Edit action that loads the row back into the form for in-place updates (slug stays locked as the upsert identity), and the previous Remove text button is now a trash icon. Staff-only; no change to the public leaderboards. No action required.
  • Public observability benchmark now refreshes hourly. The aggregator that builds the public observability benchmark snapshot now runs every hour instead of being held off-schedule during the soft launch. Pass rates, efficiency scores, and rank history on the public leaderboard pick up new completed batches within the hour, so vendor cells track the underlying run data much more closely. No action required.
  • Public benchmark landing page is trimmed and corrected. The public benchmark landing page now lists only the observability category instead of nine placeholder rows, since observability is the only category with completed public benchmark batches today. The four methodology cards under “How it works” have been rewritten to match the actual pipeline — rubric-driven workflows, dispatch_benchmark_batch for wave-paced dispatch (see the MCP API), a single evaluator returning a pass/fail verdict, and a pass-rate rollup — replacing earlier copy that described an audit script, fixed 3×3 harness/model matrix, multi-judge unanimous voting, and a now-removed capability-gap flag. Observability’s vendor and run counts on the page now match the live public benchmark API. No action required.
  • Tsuga and Tsuga Code Mode added to the public observability benchmark. The vendor list on the public observability leaderboard now includes Tsuga and Tsuga Code Mode alongside Datadog, Grafana, SigNoz, and Dash0, bringing the category to six vendors across nine workflows. Both vendors were already present in the seeded snapshot but had been getting dropped from the nightly refresh — they now persist across runs. The Dash0 brand color has also been corrected to match what the live site was already serving. No action required.
  • Public benchmark landing page now shows live vendor data. The observability category on the public benchmark landing page now fetches vendor cards directly from the public benchmark API, so pass rates and the vendor list update automatically as new snapshots ship. Previously the landing page rendered a hand-maintained vendor list that could drift from the live leaderboard — for example, Tsuga Code Mode was briefly missing while the static list was out of sync. Categories still in audit continue to show the existing “Audit in progress” notice. No action required.
  • Public benchmark site falls back to a clear empty state. When the public benchmark API has no snapshot available for a category, the leaderboard, head-to-head, and scatter views now render a “No live data for this category yet” card instead of falling back to a built-in placeholder dataset. This removes a class of stale numbers that could appear if a snapshot was missing, and matches what the landing page already does. No action required.
  • Public benchmark gives half-credit for partial verdicts. Pass rate on the public observability benchmark now counts runs that satisfied some but not all success criteria at 0.5 credit, instead of treating them the same as runs that produced nothing. Cells with partials display as “X + Y½ / N · Z%” in the tooltip so the partial-credit contribution is visible alongside full passes. This produces a more faithful comparison across vendors on workflows where a vendor reliably gets part of the answer right. No action required.
  • Overall vendor rank now weights workflows equally. The overall pass rate used to rank vendors on the public observability leaderboard is now a macro-average across workflows (each workflow contributes equally) instead of a micro-average that let workflows with more runs dominate. Combined with the partial-credit and archived-run changes in this release, this shifts the live snapshot’s vendor ordering toward what spot-checks of the underlying runs already suggest. No action required.
  • Archived benchmark batches no longer leak into the public leaderboard. The nightly aggregator that refreshes the public observability benchmark now filters out archived batches, so a vendor cell only reflects runs from non-archived batches. Previously, archived batches were silently still feeding the snapshot, and because archival had been applied unevenly across vendors, this distorted cross-vendor comparisons. Spot-checked cells now match the non-archived run set. No action required.
  • Public benchmark landing card matches the leaderboard. Per-vendor pass rates on the public observability benchmark landing page now use the same macro-average across workflows and the same partial-credit pass rate as the leaderboard. Previously the landing card summed raw passed/runs, which both dropped the half-credit for partial verdicts and let workflows with more runs dominate — so the landing card and leaderboard could show different vendor ordering. The two views now agree. No action required.
  • Public benchmark efficiency now reflects only passing runs. The efficiency score on the public observability benchmark — derived from run duration, token usage, tool calls, and retries — is now averaged across passing runs only, instead of every run in a vendor × workflow cell. Cells with zero passing runs land at the scale floor (40) rather than picking up a misleading mid-range score from cheap, fast-failing attempts. This realigns the Success × Efficiency scatter so the two axes are independent again: low-pass cells no longer score deceptively well on efficiency. Rankings for cleanly passing cells are unaffected. No action required.
  • Capability-gap flag removed from public benchmark cells. The capability-gap chip that previously appeared in leaderboard cells, head-to-head rows, and cell tooltips on the public observability benchmark has been removed. The signal was driven by a single heuristic and added noise without changing rankings, so vendor cells now show pass rate and efficiency only. Rankings, history, and the underlying benchmark runs are unchanged. No action required.
  • CLI diagnostic output is preserved in run traces. When a CLI workflow run returns long stdout, stderr, output, log, or logs fields, the trace preview now keeps the last 500 characters of each in a *_tail field alongside the existing head preview. Error messages and stack traces that previously fell off the bottom of a truncated buffer are now visible in the run trace without needing to pull the raw payload. Secret-bearing keys are still redacted. No action required.
  • Redesigned insights digest email. The recurring insights digest has a new editorial layout that mirrors the in-app Insights view — severity-railed finding cards, brand-accented section heads, and a system font stack with email-client-safe fallbacks. A new Last runs table appears after the diff pills so recipients can see recent workflow run outcomes at a glance, with system-issue rows (product bugs, upstream provider errors) filtered out. Operator-internal “Improve how you test this MCP” setup findings have been removed from the email; they remain visible in the app’s Insights view. Wording is now cadence-neutral (“Insights digest”, “What changed since last time”). No action required.
  • Edit details now blocks transport changes outright. Saving Edit details on an MCP server can no longer silently ignore an attempted change between Remote/Hosted MCP and Local CLI — the API now returns a clear error instead of returning success without applying the change. The edit panel also includes a one-line note explaining that to convert between transports you should delete the target and create a new one, since the two modes use different transport, provisioning, and authentication shapes.
  • Cleaner workflow filter label on the public benchmark scatter chart. The scatter chart’s workflow filter on the public observability benchmark now shows “all workflows” instead of “all workflows (avg)” for its default option. The aggregate is still an average across workflows; the label is just less noisy. No action required.
  • Capability gaps now appear directly under the leaderboard matrix. On per-category pages of the public benchmark — for example, /leaderboard/database — the Capability gaps section now renders directly below the leaderboard matrix, above the Success × Efficiency scatter and head-to-head views. Readers see which (vendor × workflow) pairs were excluded because the product doesn’t ship that workflow before they reach views where an excluded cell could read as “vendor lost on this workflow.” Categories with no curated gaps (currently observability) render the same layout as before. No action required.
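The scoring changes in this list (half-credit for partial verdicts, and a macro-average across workflows instead of a micro-average) can be made concrete with a small worked example. This is a sketch under assumptions: the verdict labels, the function names, and the choice to skip workflows with no runs are illustrative and are not taken from the aggregator.

```python
from statistics import mean

# Credit per verdict: partial verdicts count at 0.5, as described above.
CREDIT = {"pass": 1.0, "partial": 0.5, "fail": 0.0}


def cell_pass_rate(verdicts: list[str]) -> float:
    """Pass rate for one vendor x workflow cell, with half-credit for partials."""
    return mean(CREDIT[v] for v in verdicts)


def macro_pass_rate(runs_by_workflow: dict[str, list[str]]) -> float:
    """Each workflow contributes equally, regardless of how many runs it has."""
    return mean(cell_pass_rate(v) for v in runs_by_workflow.values() if v)


def micro_pass_rate(runs_by_workflow: dict[str, list[str]]) -> float:
    """All runs are pooled, so workflows with more runs dominate."""
    pooled = [v for verdicts in runs_by_workflow.values() for v in verdicts]
    return cell_pass_rate(pooled)


vendor_runs = {
    "workflow_a": ["pass"] * 8,          # 8 runs, all passing -> 1.0
    "workflow_b": ["partial", "fail"],   # 2 runs -> (0.5 + 0.0) / 2 = 0.25
}
macro = macro_pass_rate(vendor_runs)  # (1.0 + 0.25) / 2 = 0.625
micro = micro_pass_rate(vendor_runs)  # (8 + 0.5) / 10 = 0.85
```

In this made-up case the pooled rate (0.85) hides the weak workflow, while the macro-average (0.625) gives it the same weight as the well-covered one.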

Bug fixes

  • Public benchmark no longer gets stuck on “Loading benchmark…” for upcoming categories. The Database and Cloud deploy / hosting categories on the public benchmark site previously rendered an indefinite “LOADING BENCHMARK…” spinner because they were listed as live before any snapshot had been published. Both categories now correctly surface with the SOON pill and the Audit in progress panel (for example, “Database benchmark is being assembled. 4 vendors identified, audit script in review.”), matching how upcoming categories were always meant to read. As a safety net, any live category whose snapshot fetch hasn’t settled within five seconds now falls back to a stable “No snapshot yet” message instead of spinning forever. Observability and CRM are unchanged. No action required.
  • Workflows page hides the category filter when no workflow has a category. The chip filter strip on the Workflows page now derives its category list from the workflows actually loaded on the page — chips appear only for categories at least one workflow uses, sorted alphabetically, with accurate counts. When no workflow is categorized, the entire strip (including the All chip) is hidden rather than rendering an empty row. Toggling a chip still filters the table and updates the URL as before. The /mcps chip strip is unaffected. No action required.
  • Public benchmark dropdowns keep the focused row in view during keyboard navigation. Arrow-keying through the workflow filter on the Success × Efficiency scatter chart — and any other long dropdown on the public leaderboard — now scrolls the highlighted row into view as focus moves past the visible window. Previously, on lists taller than the 280px panel (for example, the 10-workflow filter on /leaderboard/observability), pressing ↓ past the bottom silently highlighted off-screen options. Wrap-around from the last item back to the first also scrolls the panel to the top. Shorter dropdowns like the head-to-head vendor A/B pickers behave the same as before. No action required.
  • Public leaderboard tabs now show the Armature mark. The /leaderboard landing page and per-category pages (for example, /leaderboard/observability) now render the Armature favicon in the browser tab instead of the default globe icon, matching the rest of the public site. No action required.
  • Benchmark cleanup now runs reliably for vendor-per-cell batches. Cleanup for benchmark runs whose base workflow supplies the vendor per cell (rather than pinning a single MCP server on the workflow itself) was being skipped and stamped WORKFLOW_NOT_FOUND, leaving vendor-side artifacts — preview deploys, database branches, alert rules — behind for each affected run. Cleanup now resolves the workflow against the cell’s specific version, so post-run teardown executes against the right vendor. Verdicts and leaderboard rankings were never affected, only resource cleanup. No action required.
  • Renaming a Benchmark Admin vendor or workflow now preserves its catalog state. Editing a vendor or workflow row in Benchmark Admin — for example, fixing a display name or brand color — no longer resets its sort order or unlinks the associated MCP server. Previously, saving the form silently overwrote both fields because they weren’t part of the submitted payload; they’re now preserved on update and continue to default cleanly on first insert. The public observability and CRM leaderboards already render from the catalog, so column order and vendor → MCP linkage now survive edits as expected. No action required.
  • Claude Code no longer prompts for re-auth after idle periods. MCP clients running multiple parallel sessions, or pausing and resuming roughly an hour after a concurrent token refresh — most visibly Claude Code against mcp.armature.tech — were being forced through a full re-auth roughly every hour, even though refresh tokens are valid for 90 days. Armature no longer rotates the refresh token on /oauth/token refresh: the same refresh token is returned on every call, its 90-day expiry slides forward on each successful refresh, and access tokens continue to rotate normally. Parallel sessions therefore share one long-lived refresh token and can no longer invalidate each other. This matches how Google, Microsoft, and Vercel handle OAuth refresh, and is permitted by RFC 6749. Per-grant revocation from the Settings UI and /api/mcp/oauth/revoke is unchanged. No action required — see Authenticating with the Armature MCP API.
  • Benchmark runs recover from provider rate limits and evaluator format glitches. Two transient failure modes on the public observability benchmark — provider rate-limit responses (429 / TPM saturation from OpenAI and Anthropic) and evaluator replies that came back as markdown instead of strict JSON — are now treated as retryable instead of finalizing the run as a platform failure. Rate-limited attempts are re-dispatched on the standard system-error retry path, and an evaluator that emits prose is re-prompted with a stricter format directive before falling back to a terminal error. The net effect is fewer benchmark cells lost to harness-side noise and pass rates that better reflect real agent behavior. No action required.
  • Insights digest emails arrive addressed to you, not to noreply@. Manual sends of the insights digest email now go out as one envelope per recipient, with each recipient on the To: line and the founders BCC list preserved. Previously every send landed in inboxes addressed to noreply@updates.armature.tech, which tripped Gmail’s “to me only” filter, raised spam-classifier scores, and broke Reply. Each recipient now sees their own address on the message, Reply works, and deliverability improves. No action required.
  • Public benchmark site: leaderboard shows the snapshot’s real generated time. The “Generated” timestamp in the public observability benchmark leaderboard header now reflects the actual snapshot time of the data being displayed, instead of a stale hardcoded value. If no snapshot metadata is available, the header is hidden rather than showing a misleading time. No action required.
  • Public benchmark site: scatter chart axis tooltips now appear on hover. Hovering the PASS RATE (SUCCESS) and EFFICIENCY axis labels (and their ⓘ icons) on the public observability benchmark scatter chart now reliably shows the full explanation, with the hover region covering the entire label area. Previously the native SVG tooltips were inconsistent and only triggered over the painted glyphs. No action required.
  • MCP Analytics views now load reliably. Several MCP Analytics endpoints — the per-session detail drawer, the actors list, the Settings ingest-token panel, and the gateway-facing rules pull — were returning 500s on first load because they referenced the wrong MCP server columns. The queries now use the correct columns, so session detail, actor breakdowns, ingest-token provisioning, and rule sync all load on first request. No action required.
  • Public benchmark site: in-page navigation no longer 404s. Clicking a vendor card, the search submit button, an autocomplete suggestion, or the leaderboard logo on the public observability benchmark now navigates to the right page. The soft-launch deploy removed the friendly /benchmark and /benchmark/observability rewrites but left several hardcoded references behind, so every in-page link bounced to a 404. All references now point at the live page paths. No action required.
  • Public leaderboard: vendor logos no longer render as black tiles. Vendors on the public leaderboard whose logos come from the Simple Icons CDN — including Vercel, Vercel CLI, Railway CLI, and Planetscale — previously appeared as solid black squares because a monochrome black mark was painted directly over a dark brand-color tile. The logo now renders as a recolored mask: white on dark or mid brand colors and near-black on light ones, so the mark stays visible regardless of brand color. Vendors with uploaded full-color logos (for example, Neon and Insforge) are unchanged. No action required.
  • Public benchmark site: SigNoz and Dash0 logos render correctly. Vendor chips for SigNoz and Dash0 on the public observability leaderboard previously showed blank colored squares because the upstream icon CDN doesn’t host their SVGs. Both vendors now fall back to a clean letter mark (S and d0) that reads at every size. Datadog and Grafana continue to show their real logos. No action required.
  • OAuth extra headers now forwarded into dynamic client registration. The Extra headers you set on an OAuth-authenticated remote MCP server are now also included in the RFC 7591 dynamic client registration request, not just the discovery probes. This unblocks multi-tenant MCP gateways whose /register endpoint sits on the MCP origin and needs a tenant identifier — Grafana Cloud’s MCP, for example, now completes registration with X-Grafana-URL and opens the sign-in popup as expected. Headers are still only forwarded to same-origin requests, so tenant identifiers never leak to a cross-origin authorization server. No action required — reconnect any OAuth MCP server that previously failed at the registration step.
  • Grafana Cloud MCP now registers as a public client. Dynamic client registration against Grafana Cloud’s MCP (mcp.grafana.com) previously failed with DCR endpoint returned 400 because Grafana’s gateway only accepts public, PKCE-only clients and rejected Armature’s default client-secret registration. Armature now registers as a public client for grafana.com automatically, so the MCP server connect flow completes and the sign-in popup opens. No action required — reconnect any Grafana Cloud OAuth target that previously failed at the registration step.
  • Grafana Cloud MCP OAuth popup now completes sign-in. The authorize URL built for an OAuth MCP server is now assembled by merging OAuth params onto the provider’s authorize endpoint instead of naively concatenating a ?. This fixes targets whose discovered authorization_endpoint already carries its own query string — Grafana Cloud’s MCP returns …/authorize?grafana_url=<stack> once you set X-Grafana-URL in Extra headers — which previously produced a URL with two ? separators and bounced the popup to the stack home instead of back to Armature. No action required — reconnect any Grafana Cloud OAuth target that previously failed at this step.
  • No more duplicate Claude Code entries from MCP OAuth refresh races. MCP clients that fire concurrent /oauth/token refreshes (notably Claude Code 2.1.149) sometimes saw an in-flight refresh fail with invalid_grant, treated the credential as revoked, and re-ran the full OAuth + dynamic client registration flow, leaving a trail of duplicate “Claude Code” rows in What’s connected. Armature now keeps the previous refresh token hash valid for a 60-second grace window, so an older refresh request still on the wire succeeds instead of forcing a re-registration. Dynamic client registration also now reuses an existing public client when an MCP client re-registers with the same software_id, client_name, and redirect URIs (per RFC 7591), instead of minting a fresh client_id every time. Existing duplicate rows are not collapsed retroactively; revoke unwanted ones from the settings page. No action required. See Authenticating with the Armature MCP API.
  • No more spurious 401s during MCP OAuth token refreshes. MCP clients that fire several /oauth/token refreshes in parallel (notably Claude Code, which can rotate dozens of times in under a minute) sometimes saw an in-flight request fail with OAuth token is invalid or revoked even though the access token’s nominal lifetime had barely elapsed, because the previous-hash slot had been overwritten by a subsequent refresh in the same burst. Armature now keeps the last 8 rotated-out access token hashes valid for a 60-second grace window each, so requests already on the wire with an older token continue to succeed even across a refresh storm. Genuinely expired or revoked tokens still fail. Revocation also honors any token in the grace window. No action required. See Authenticating with the Armature MCP API.
  • Consistent run duration across benchmark and leaderboard views. The Duration column on a benchmark batch detail page and the avg_duration_ms value on the workflow leaderboard now match the tester-only timing already shown on each run’s detail page. Previously these views reported the full handler lifetime (tester plus evaluation and setup), so the same run could read, for example, “10m” in the batch table but “3m” on its detail tile. Rankings on cleanly passing leaderboard cells may shift slightly as a result. No action required.
  • Public observability benchmark pass rates realigned with historical reference. The nightly aggregator that refreshes the public observability benchmark now pools runs across every completed batch per vendor × workflow cell instead of sampling only the latest batch, and excludes platform-error runs (tester crashes, evaluator failures, timeouts) from the denominator. After last week’s efficiency and vendor-list changes, the first nightly refresh produced pass rates that diverged from the reference numbers users had been seeing. The divergence had two causes: single-batch sampling, and platform errors inflating cell totals without contributing to the passed count. Spot-checked cells now match the reference values exactly. No action required.
  • Tool schemas with ${…}-style text no longer trigger false SECRET_NOT_FOUND errors. When connecting an MCP server, Armature now preserves the discovered tool catalog as-is while still resolving secret placeholders in auth, URL, and header fields. Previously, vendor schemas whose descriptions or examples contained literal ${variable} syntax — Grafana Cloud’s ClickHouse tool, for example — were misread as unresolved Armature secrets and blocked the connection. No action required — retry any MCP server that previously failed with SECRET_NOT_FOUND while listing tools.
  • CLI deploy failures keep their final error in run traces. Long stderr and stdout streams captured during CLI workflow runs — for example, a Vercel CLI deploy that ends with a build error — now preserve a redacted tail preview alongside the head, so the final diagnostic line remains visible in both the run evidence view and the customer-facing trace. Previously the tail was truncated away, hiding the actual failure reason. Bearer tokens and other secrets in the preserved tail are still redacted. No action required.
  • CLI tester runs no longer crash pre-flight on stray secret placeholders. Claude Code and Codex CLI workflow runs against an MCP server now match the API tester behavior when a ${…} placeholder in a server’s config has no matching secret: the run continues and either succeeds (OAuth servers re-mint a fresh token) or surfaces a clean upstream 401 from the vendor, instead of dying in ~240 ms with a non-retryable SECRET_NOT_FOUND system error. This closes the harness-level inconsistency behind the Grafana × Claude Code incident on May 25 and prevents the same divergence from recurring elsewhere. No action required — re-run any CLI tester run that previously failed at this step.
  • Token usage now contributes to public benchmark efficiency scores for API testers. Token usage from Claude and ChatGPT API tester runs is now persisted on each workflow run, so the public observability benchmark efficiency score reflects real input and output token totals instead of treating every cell as identical on that axis. Previously the API runners captured vendor usage on the run trace but didn’t carry it through to the per-run record the aggregator reads, which flattened the token component of the efficiency formula across all six tester models. Anthropic and OpenAI tester cells now differentiate on token efficiency after the next aggregator refresh; CLI-harness testers (Gemini, Claude Code, Codex, Cursor, openclaw, opencode) still surface usage only on the run trace for now. No action required.
  • Claude Code and Gemini CLI tester runs now report real token usage. Token usage from the Claude Code and Gemini CLI testers is now persisted on each workflow run, extending the previous update’s API-tester fix to the two largest CLI harnesses. Previously these runs landed with an empty usage block, so the public observability benchmark efficiency score treated their token component as zero, which made the tokens axis harness-dependent noise rather than a real comparison signal. Claude Code and Gemini cells now contribute real input and output token totals to the efficiency formula after the next aggregator refresh. The remaining CLI testers (Codex, Cursor, openclaw, opencode) still surface usage only on the run trace for now. No action required.
  • Session Analytics sessions list now shows real intent, outcome, and frustration. Every row in the Sessions view of Session Analytics previously read (no intent), rendered the frustration column as a thin vertical line, and marked sessions as OK even when they contained errors. The list now surfaces the raw session intent as a fallback when no classification has run yet, renders the frustration bar at full width, and falls back to the session’s error count for the outcome when no classifier verdict is available. The intent column is no longer truncated at the grid level. No action required.
  • Session Analytics thinking trace header reflects the real session state. The drill-down header on a Thinking Trace view now reports FAILED when a session contains any failed events or has a non-zero error count, instead of reading OK on sessions with hundreds of errors. The side rail also falls back to last_event - started_at when a session is still in flight or never recorded an explicit end, so Duration no longer reads 0 s on long-running sessions. No action required.
  • Sidebar no longer flips to Session Analytics on run pages and unknown routes. Visiting a workflow run or run history page (and other routes without a matching sidebar entry, such as settings sub-pages and auth callbacks) no longer switches the left sidebar into the Session Analytics app section with no row lit. Run pages now keep the Testing & Benchmarks app active with Workflows highlighted, and any unmatched route falls back to Testing & Benchmarks rather than the first app in the list. Single-app workspaces (without the mcp_analytics flag) are unaffected and still show no app header. No action required.
  • Session Analytics overview no longer mislabels first-period metrics as doubled. KPI deltas on the Session Analytics Overview previously showed ↑ 100% for any metric whose prior period was zero, implying the value had doubled when in fact it was the first time it had been recorded. First-period metrics now display NEW instead of a percentage delta. No action required.
  • Top topics and Failed searches panels populate before classification runs. The Top topics and Failed searches panels on the Session Analytics overview previously stayed blank until the background classification jobs had completed. Both panels now fall back to raw intents and raw search-miss groupings — the same data that powers the Topics and Searches pages — and append · provisional to the panel title so the fallback state is legible. The full Topics and Searches pages also show a provisional banner when they’re rendering raw groupings; expandable cluster detail stays hidden until real clusters have been promoted. No action required.
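
To illustrate the authorize-URL fix described above for the Grafana Cloud MCP OAuth popup, here is a minimal Python sketch of merging OAuth parameters onto an endpoint that already carries a query string. It is not Armature's implementation: the function name, the example host, and the parameter values are all placeholders chosen for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_authorize_url(authorization_endpoint: str, oauth_params: dict) -> str:
    """Merge OAuth params into an endpoint that may already have a query string."""
    parts = urlsplit(authorization_endpoint)
    # Keep any query pairs the provider already placed on the endpoint.
    merged = parse_qsl(parts.query, keep_blank_values=True)
    # Append the OAuth parameters after the provider's own pairs.
    merged.extend(oauth_params.items())
    return urlunsplit(parts._replace(query=urlencode(merged)))


# Placeholder endpoint shaped like the one described above (host is made up).
endpoint = "https://auth.example.com/authorize?grafana_url=stack"
params = {"response_type": "code", "client_id": "abc"}

# Naive concatenation yields two "?" separators.
naive_url = endpoint + "?" + urlencode(params)
merged_url = build_authorize_url(endpoint, params)

print(naive_url.count("?"))   # 2
print(merged_url)
```

Parsing and re-encoding the query, rather than appending a separator by hand, also covers endpoints that have no query string at all, so one code path handles both cases.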

New features

  • Agents can register HTTP MCP servers from chat. A new add_mcp_server tool on the Armature MCP API lets an agent add a new streamable_http or sse MCP target to your workspace without opening the dashboard. The tool is intentionally scoped to credential-less public HTTP MCPs — stdio and hosted-npm transports, and any plaintext auth, are rejected. After the row is created, the response includes a dashboard URL where an admin can finish auth setup and then call sync_mcp_server_capabilities to pull in the tool catalog. The same plan-quota, audit, and post-commit checks as the dashboard flow apply. Requires the editor role or higher — see the role table.
  • Agents can connect OAuth MCP servers from chat. Two new tools on the Armature MCP API let an agent drive the OAuth connect flow for any MCP server in your workspace without anyone clicking through the dashboard. authenticate_mcp(serverId) returns an authorize_url to share with the user, plus a flow_id and the redirect_uri their browser will land on. After the user signs in, paste the callback URL back to the agent and complete_authentication_mcp(serverId, callback_url) finishes the token exchange and writes the new auth profile. Follow up with sync_mcp_server_capabilities to refresh the tool catalog. Both tools require the editor role or higher — see the role table — and are especially useful when seeding a workspace with many OAuth MCPs at once.
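
The OAuth connect flow above can be sketched from the agent's side as three tool calls. This is an illustrative Python sketch only: the tool names and the authenticate_mcp response fields come from the entry above, while call_tool, ask_user_for_callback, the example URLs, and the argument passed to sync_mcp_server_capabilities are assumptions and not part of the documented API.

```python
from typing import Any, Callable, Dict

# Hypothetical transport: takes a tool name and arguments, returns the tool result.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def connect_oauth_mcp(
    call_tool: ToolCaller,
    server_id: str,
    ask_user_for_callback: Callable[[str], str],
) -> Dict[str, Any]:
    """Drive the connect flow for one MCP server and return the sync result."""
    # Step 1: start the flow and get the URL the user must open.
    started = call_tool("authenticate_mcp", {"serverId": server_id})
    # Step 2: the user signs in, then pastes back the URL their browser landed on.
    callback_url = ask_user_for_callback(started["authorize_url"])
    call_tool(
        "complete_authentication_mcp",
        {"serverId": server_id, "callback_url": callback_url},
    )
    # Step 3: refresh the tool catalog (this argument shape is an assumption).
    return call_tool("sync_mcp_server_capabilities", {"serverId": server_id})


# Stand-in for a real MCP client so the sketch runs offline.
calls = []


def fake_call_tool(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    calls.append(name)
    if name == "authenticate_mcp":
        return {
            "authorize_url": "https://auth.example.com/authorize?state=demo",
            "flow_id": "flow-demo",
            "redirect_uri": "https://app.example.com/callback",
        }
    return {"ok": True}


result = connect_oauth_mcp(
    fake_call_tool,
    "server-123",
    lambda authorize_url: "https://app.example.com/callback?code=demo&state=demo",
)
print(calls)
```

When seeding many servers, the same function could be called once per server ID, with the user completing each sign-in in turn.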

Updates

  • Run detail now leads with token usage. The cost card on a workflow run detail page is now titled Token usage and leads with consumption, with Est. API cost shown as a smaller secondary metric. The per-bucket breakdown (input, output, cached input, cache writes, reasoning) is unchanged. This keeps the focus on what the run actually consumed, while pricing remains visible for reference. No action required.
  • Accurate cost and token totals for OpenClaw tester runs. Workflow runs against OpenClaw targets now show Est. API cost values and full token usage instead of pricing unavailable. Pricing is seeded for the canonical OpenClaw models (Anthropic Sonnet 4.6, OpenAI GPT-5.5, Google Gemini 3 Flash) and the historical launch rows, and token aggregation now reads OpenClaw’s native usage buckets (input, output, cache read, cache write). No action required — existing and new runs pick up the change automatically.
  • Remote MCP probes now retry across all healthy DNS addresses. When a hosted MCP server resolves to multiple IPs and one of them is unresponsive, the connect-time probe now falls through to the next address instead of hanging on the dead one. A bounded per-address response-header timeout keeps a single bad route from consuming the whole probe. Servers like HeyReach (mcp.heyreach.io), whose DNS rotates between healthy and unhealthy addresses, now connect reliably on first try. No action required — re-connect any server that previously failed at this step.
  • Connect HubSpot, Pipedream, and Item MCP servers. Three remote MCP servers that previously failed OAuth metadata validation now connect cleanly. Pipedream’s MCP gateway (mcp.pipedream.net, used by Pipedrive and other Pipedream-hosted servers) and Item (mcp.item.app) are now recognized as trusted authorization issuers. HubSpot (mcp.hubspot.com) is now available as a curated OAuth provider — because HubSpot does not support dynamic client registration, an admin registers an MCP Auth App once in their HubSpot developer account and the credentials are reused org-wide. No action required for Pipedream or Item; for HubSpot, see Connecting an MCP server.
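
The multi-address probe behavior described above (falling through to the next resolved address, with a bounded timeout per address) can be sketched as follows. This is a simplified illustration, not the actual probe: probe_first_healthy, the injected probe callable, the 2-second default, and the documentation-range IP addresses are all assumptions.

```python
from typing import Callable, Iterable, Optional, Tuple


def probe_first_healthy(
    addresses: Iterable[str],
    probe: Callable[[str, float], str],
    per_address_timeout: float = 2.0,
) -> Tuple[str, str]:
    """Try each address in order and return the first one that answers."""
    last_error: Optional[Exception] = None
    for address in addresses:
        try:
            # The timeout bounds each attempt so one dead route cannot
            # consume the whole probe budget.
            return address, probe(address, per_address_timeout)
        except OSError as exc:  # TimeoutError is a subclass of OSError
            last_error = exc
    raise ConnectionError("no healthy address answered the probe") from last_error


# Fake probe so the sketch runs without network access.
def fake_probe(address: str, timeout: float) -> str:
    if address == "203.0.113.10":
        raise TimeoutError("no response headers within %.1fs" % timeout)
    return "initialized"


address, outcome = probe_first_healthy(["203.0.113.10", "203.0.113.11"], fake_probe)
print(address, outcome)
```

In a real client the address list would come from DNS resolution (for example socket.getaddrinfo) and the probe callable would open a connection with the given timeout.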

Week of June 29, 2026

This week was focused on behind-the-scenes infrastructure work. No user-facing changes shipped.

Week of June 22, 2026

Bug fixes

  • Accurate Claude token totals in benchmark tables. Token usage for Claude tester runs shown in benchmark batches no longer double-counts per-turn rows. Aggregation now prefers the terminal claude_result for Claude harnesses, sums per-turn totals for Codex, and falls back to model responses otherwise. Est. API cost values for Claude runs in benchmark views now match the per-run breakdown. No action required.

Week of June 15, 2026

Updates

  • Benchmark batches now retry transient system failures. When a run inside a benchmark batch fails because of a retryable platform or system error, the batch now automatically retries it and keeps the retry attempt in the same batch. Batch rollup counters recompute from the non-superseded finalized runs, so pass/fail totals reflect the final outcome rather than the transient failure. No action required.
  • More Claude Code native tools available against MCP targets. Claude Code tester runs against MCP servers now keep access to the Task and Agent native tools. Only truly interactive, scheduled, and plan-mode native tools remain blocked, so workflow runs that benefit from sub-agent delegation can use it.
  • Scheduled tester runs preserve their schedule link. The originating schedule ID is now preserved when a tester run is inserted and when it is claimed by a worker, so scheduled workflow runs stay correctly associated with their schedule end-to-end.

Bug fixes

  • Gemini CLI client errors preserved in run context. When the Gemini CLI tester reports a client-side error, the error details are now retained in the run’s error context instead of being dropped. Failed workflow runs against Gemini CLI targets are now easier to diagnose. No action required.

Week of June 8, 2026

Bug fixes

  • Amplitude analytics now load on the Armature web app. The content security policy now allows the Amplitude browser SDK script and its ingestion endpoints, so the same anonymized usage events captured by our other analytics tools — sign-up and sign-in, onboarding, plan selection and checkout, connecting an MCP server, creating a tool monitor, and creating or triggering a workflow run — also reach Amplitude. No action required, and no change to what is captured — MCP server credentials, tool inputs, and run outputs are still never sent.
  • Mixpanel browser SDK initializes on the default instance. The Mixpanel browser SDK queue now passes the library name the CDN loader expects, so the default analytics instance hands off cleanly from the stub to the real SDK and starts sending events after consent. Client-side events for the tracked flows — sign-up, onboarding, checkout, MCP server connections, tool monitors, and workflow runs — now report reliably on first visit. No action required.

Week of June 1, 2026

Updates

  • Faster dispatch for API-based tester runs. API-based tester runs (Claude and ChatGPT) and CLI-based tester runs (Claude Code and Codex CLI) now use independent dispatch queues. A backlog on one no longer delays the other, so workflow runs stay snappy even when a single tester model is under heavy load. No action required — existing workflows pick up the change automatically.

Bug fixes

  • No more spurious “shells out” findings on CLI servers. Insight digests for CLI MCP servers no longer flag agents invoking the wrapped binary as a deviation. For CLI targets, reaching the binary through a shell is the canonical mode of use, not a problem. Genuine CLI UX issues — stderr noise, missing --json formatting, unexpected warnings — are still surfaced. No action required.
  • Product analytics now reach Amplitude in production. The anonymized usage analytics announced in the Week of May 25 update are now actually being delivered from the production Armature web app. A missing production configuration value and a content security policy that blocked outbound analytics requests have both been resolved. No action required, and no change to what is captured — MCP server credentials, tool inputs, and run outputs are still never sent.

Week of May 25, 2026

New features

  • Estimated API cost on every run. Workflow runs, benchmark results, and workflow analytics now show an Est. API cost value derived from provider token usage and current model API pricing. The run detail page includes a per-bucket breakdown (input, output, cached input, cache writes, and reasoning) so you can see exactly which token classes contributed to the estimate. Runs where pricing is not yet seeded show pricing unavailable instead of a number. Subscription pricing may differ from the API-list estimate.
  • Composer 2.5 available for Cursor tester runs. The Cursor tester harness now offers Composer 2.5 alongside Composer 2 in the model picker, so organization admins can author workflows that exercise MCP servers against the newer Cursor model. Existing workflows continue to run on Composer 2 by default — switch the tester model on a workflow to opt in.
  • Redesigned subscription cancellation flow. Cancelling from Billing now opens a four-step modal: pick a reason, optionally pause your subscription for 30 or 60 days, optionally book a call with a founder, then confirm. Pausing preserves your workspace state and resumes automatically on the chosen date — no new payment method required. A confirmation email is sent with the pause or cancellation date and a one-click reactivate link.
  • Reactivate paused or cancelling subscriptions in one click. If your subscription is paused or scheduled to cancel at period end, a banner now appears on Billing with a Reactivate button that clears the schedule immediately. The same action is also available from the confirmation email.

Updates

  • Product analytics in the Armature app. The Armature web app now sends anonymized usage analytics for key flows — sign-up and sign-in, onboarding, plan selection and checkout, connecting an MCP server, creating a tool monitor, and creating or triggering a workflow run. This helps us see which parts of the product people use most so we can prioritize improvements. No MCP server credentials, tool inputs, or run outputs are captured.
  • Looser naming rules for additional environment variables. The Additional environment variables editor on the connect form now accepts names with lowercase letters and hyphens (for example, kraken-cli_apiKey), in addition to the standard uppercase form. This unblocks CLIs that expect rc-style credential variables named after their package. The primary auth variable still requires the strict uppercase form, and reserved names (AWS_*, PATH, HOME, ARMATURE_*) remain blocked. See Connecting an MCP server.
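
The naming rules above can be approximated with a small validator. The sketch below is an assumption-laden illustration: the entry does not publish the exact patterns, so both regular expressions, the case-sensitive reserved-name matching, and the function name are guesses that merely satisfy the examples given (kraken-cli_apiKey accepted as an extra, uppercase required for the primary variable, AWS_*, PATH, HOME, and ARMATURE_* blocked).

```python
import re
from fnmatch import fnmatchcase

# Assumed patterns; the real validation rules are not published in this entry.
STRICT_NAME = re.compile(r"[A-Z_][A-Z0-9_]*")
LOOSE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_-]*")
RESERVED_PATTERNS = ("AWS_*", "PATH", "HOME", "ARMATURE_*")


def is_allowed_env_name(name: str, primary: bool = False) -> bool:
    """Return True if the name passes the (assumed) naming rules."""
    # Reserved names stay blocked for both primary and additional variables.
    if any(fnmatchcase(name, pattern) for pattern in RESERVED_PATTERNS):
        return False
    # The primary auth variable keeps the strict uppercase form.
    pattern = STRICT_NAME if primary else LOOSE_NAME
    return pattern.fullmatch(name) is not None


print(is_allowed_env_name("kraken-cli_apiKey"))                # True
print(is_allowed_env_name("kraken-cli_apiKey", primary=True))  # False
print(is_allowed_env_name("AWS_REGION"))                       # False
```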

Bug fixes

  • Skipping a plan no longer leaves onboarding stuck. New workspace owners who choose Skip for now on the plan picker are now taken straight to the dashboard instead of landing on the role onboarding step, which doesn’t apply without a paid plan. You can revisit plans any time from Billing.
  • Kraken CLI installs no longer fail at provisioning. Connecting kraken-cli as a CLI MCP server previously failed during install because one of its dependencies is fetched from GitHub rather than the npm registry, which the provisioning sandbox blocked. The sandbox now permits the specific GitHub hosts required for that dependency, so kraken-cli provisions cleanly and discovery returns its full command catalog.
  • Accurate opencode token totals on multi-step runs. Token usage reported for opencode tester runs now sums every step of a tool-use loop instead of reflecting only the final step. Multi-step workflow runs against opencode targets now show the full token cost of the run.

Week of May 18, 2026

Bug fixes

  • Remote MCP probes now connect to more servers on first try. The live probe that runs when you connect an MCP server now falls back to older supported MCP protocol versions when a server rejects the latest one, and it sends a default User-Agent so servers behind CloudFront and similar CDNs no longer block the request. Connections that previously failed at the initial initialize step now succeed and return the full tool catalog. No action required — re-connect or re-probe any server that previously failed at this step.

New features

  • Run a workflow from its analytics page. The same Run all / Run now split control from the workflows list now appears on the workflow analytics hero, so you can trigger a manual run without navigating back to the list. Single-target workflows show Run now; multi-target workflows show Run all (N) with a caret menu to pick a subset and dispatch Run selected (N). Toast, dedupe, and partial-failure behavior matches the workflows list, and the analytics view reloads after a successful dispatch so last_run reflects the new run. Paused workflows keep the existing “Resume the workflow before running it manually” tooltip.

Bug fixes

  • Run control sized consistently on the workflow analytics page. The Run all / Run now split control on the workflow analytics hero now matches the height, font size, and border weight of the adjacent Run history and Edit buttons. The caret stays flush against the primary button and hover and open states still visually group the pair. The denser styling on the workflows list page is unchanged.

Updates

  • More complete catalogs for CLI MCP servers with nested subcommands. CLI discovery now parses multi-line Usage: blocks, so deep subcommand hierarchies (for example, cdp api <path> [fields...]) contribute every documented usage shape — including rest positionals like [fields...] — to the tool catalog. Re-sync your CLI server to pick up any tools that were previously missing.
  • CLI tools accept harness-injected arguments. Generated input schemas for CLI tools no longer reject extra fields like cwd or description that some tester harnesses (such as Claude Code) attach to every call. Undeclared keys are stripped before the CLI is invoked, so workflow runs against CLI MCP servers no longer fail validation before the tool runs.
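
The key-stripping behavior described above for harness-injected arguments can be sketched as a filter against the tool's declared input properties. This is a hedged illustration rather than the shipped code: strip_undeclared and the example schema are hypothetical, and the sketch assumes a JSON Schema style object with a top-level properties map.

```python
from typing import Any, Dict


def strip_undeclared(arguments: Dict[str, Any], input_schema: Dict[str, Any]) -> Dict[str, Any]:
    """Drop keys that the tool's input schema does not declare."""
    declared = input_schema.get("properties", {})
    # Keep only declared keys; harness extras such as cwd are silently removed.
    return {key: value for key, value in arguments.items() if key in declared}


# Hypothetical schema for a CLI tool that takes a path and optional fields.
schema = {
    "type": "object",
    "properties": {
        "path": {"type": "string"},
        "fields": {"type": "array", "items": {"type": "string"}},
    },
}

# A call with extra keys attached by the tester harness.
incoming = {"path": "/v1/accounts", "cwd": "/workspace", "description": "list accounts"}
cleaned = strip_undeclared(incoming, schema)
print(cleaned)  # {'path': '/v1/accounts'}
```

Stripping rather than rejecting means a harness can attach its own bookkeeping fields without the call failing validation before the CLI runs.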

Bug fixes

  • Product analytics events now reach Mixpanel reliably. Server-side analytics events from the Armature app are now delivered in the format Mixpanel’s ingestion endpoint expects, and unexpected responses are no longer silently treated as successes. Usage data for the tracked flows — sign-up, onboarding, checkout, MCP server connections, tool monitors, and workflow runs — now shows up correctly in our dashboards. No action required.
  • Browser-side Mixpanel events no longer blocked by CSP. The Armature web app’s content security policy now allows the Mixpanel browser SDK (cdn.mxpnl.com) and its ingestion endpoints (api-js.mixpanel.com, api.mixpanel.com). Client-side analytics for the same tracked flows now load and report alongside the server-side events. No action required, and no change to what is captured — MCP server credentials, tool inputs, and run outputs are still never sent.
  • Browser analytics initialize cleanly on the default instance. The Mixpanel browser SDK queue no longer receives a trailing undefined instance name when the Armature web app boots, so the default analytics instance initializes in the shape the SDK expects. Client-side events for the tracked flows — sign-up, onboarding, checkout, MCP server connections, tool monitors, and workflow runs — load reliably on first visit. No action required.
  • Failed CLI tool calls are now reported as failed. When a wrapped CLI exits non-zero or the MCP response is marked as an error, the tool call status now reflects the failure instead of staying completed. Pass/fail rates and tool-call traces in workflow runs now match what actually happened on the CLI.
  • CLI discovery no longer turns request-body fields into fake tools. For CLIs whose leaf commands document their request body under sections like Fields:, Body:, Returns:, Response:, or Headers:, discovery previously mistook each field row for a subcommand and filled the tool catalog with bogus <parent>.<fieldName> entries — sometimes pushing real subcommands out via the per-server tool cap. These sections are now recognized as data-shape help and their field names feed each command’s input schema instead. Re-sync affected CLI servers to restore any subcommands that were previously crowded out.

New features

  • Composer 2.5 available in the Cursor harness. The Cursor tester model picker now lists Composer 2.5 alongside Composer 2 and the other Cursor-backed models. Select it when authoring or editing a workflow to run tester turns on Cursor’s latest native Composer model. Capabilities mirror Composer 2 (Cursor-native, no extended thinking surface). Composer 2 remains the default for new Cursor workflows; choose Composer 2.5 explicitly per workflow until the default is bumped. See Authoring effective workflows.
  • Multiple environment variables per MCP auth profile. Auth profiles for CLI and hosted-stdio MCP servers can now carry more than one credential. Add additional NAME / value pairs in the Additional environment variables editor on the connect form, and each one is securely stored and injected into the server process at startup. This unblocks tools that require several keys to run. Up to 16 extras per profile. To change an existing extra, delete and recreate the auth profile. See Connecting an MCP server.

Updates

  • Cleaner connect form helper text. The Additional environment variables editor on the connect form now shows a generic one-line hint instead of a vendor-specific example, making the guidance easier to scan when setting up a new MCP server connection.

Bug fixes

  • Codex API fallback authentication. When a Codex subscription hits a plan limit, Armature now falls back to API key authentication correctly, so workflow runs continue without manual intervention.
  • Claude long-context plan-limit handling. Claude failures caused by “usage credits required for long context requests” are now classified as plan-limit conditions and trigger the same automatic fallback as other limit errors, keeping workflow runs on track.
  • Cleanup stays on the run’s tester model. Cleanup for a workflow run now executes on the same tester model the run actually used, instead of falling back to the workflow’s default. This prevents cross-model interference when a single workflow has run against multiple tester targets and keeps run metadata consistent.