How to Evaluate Multi-Agent System Vendors: 12-Point Scorecard

A ready-to-use weighted scorecard for CTOs evaluating multi-agent system vendors — built from 47 enterprise AI agent projects delivered between 2023 and 2025, and the production failure patterns flagged by Gartner and MIT in 2025–2026.

TL;DR

Most AI agent vendor pitches collapse the moment you ask three questions: Show me the trace. Show me the evals. Show me a customer in production for at least six months. Gartner expects more than 40% of agentic AI projects to be canceled by the end of 2027 — and a 2025 MIT NANDA study found that 95% of generative AI pilots delivered no measurable P&L impact. The fix is not a longer RFP; it is a weighted scorecard that forces vendors to prove production maturity, not demo-readiness. This article gives you that scorecard, with weights, red flags, and a comparative table across the major frameworks and platforms.

What you'll learn

  • A 12-point weighted scorecard you can drop into your next RFP
  • How LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, and Salesforce Agentforce actually compare on production criteria
  • Realistic cost ranges for agent pilots and production rollouts in 2026
  • The five questions that separate real agentic vendors from "agent washing"
  • When to build with a framework vs. buy a platform vs. hire a delivery partner
  • A reusable scoring template with weights and red/green flags

Why most agent vendor evaluations fail

The agentic AI market is loud. Every vendor in your inbox claims orchestration, memory, guardrails, and "enterprise-ready" deployment. Gartner's own analysts estimated in 2025 that of the thousands of vendors marketing agentic capabilities, only around 130 actually deliver them — the rest are practicing what Gartner calls agent washing: rebranded chatbots, RPA flows, or single-prompt wrappers dressed up as autonomous systems.

That is the real evaluation problem. The technology is moving faster than procurement frameworks can adapt. CTOs who run a generic SaaS-style RFP end up scoring vendors on features that do not predict whether the system will survive contact with production traffic, regulated data, or a finance team asking why this month's token bill is 4x the forecast.

A multi-agent system vendor is any organization — framework maintainer, platform provider, or delivery partner — that supplies the orchestration, tooling, and operational practices required to deploy two or more cooperating LLM-powered agents into a production workflow. That definition matters because it forces you to evaluate three different categories of vendor with the same lens: open-source frameworks (LangGraph, CrewAI, AutoGen), managed platforms (Agentforce, Bedrock Agents, watsonx Orchestrate, Vertex AI Agent Builder), and the consulting and engineering firms that actually deliver the build.

How to evaluate AI agent vendors: the 12-point scorecard

Below is the scorecard Sphere's engineering pods use when advising clients on multi-agent vendor selection. Weights total 100%. Adjust within ±2 points per criterion based on your risk profile (regulated industries should over-weight criteria 4, 8, and 10).
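The adjustment rule can be sketched in a few lines. This is a minimal sketch, assuming that "±2 points" means percentage points and that every increase is offset elsewhere so the weights still total 100. The function name `adjust_weights` and the choice of which criteria to trim are illustrative assumptions, not part of the scorecard.

```python
# Baseline weights from the scorecard, in percentage points, keyed by criterion number.
BASE_WEIGHTS = {1: 12, 2: 11, 3: 10, 4: 10, 5: 8, 6: 8,
                7: 7, 8: 8, 9: 8, 10: 7, 11: 6, 12: 5}

MAX_SHIFT = 2  # the scorecard allows +/-2 points per criterion


def adjust_weights(base, shifts):
    """Apply per-criterion shifts and check them against the scorecard's rules.

    `shifts` maps criterion number -> signed change in percentage points.
    Raises ValueError if a shift exceeds +/-2 or the total drifts from 100.
    """
    for criterion, delta in shifts.items():
        if criterion not in base:
            raise ValueError(f"unknown criterion {criterion}")
        if abs(delta) > MAX_SHIFT:
            raise ValueError(f"criterion {criterion}: shift {delta} exceeds +/-{MAX_SHIFT}")
    adjusted = {c: w + shifts.get(c, 0) for c, w in base.items()}
    total = sum(adjusted.values())
    if total != 100:
        raise ValueError(f"weights total {total}, expected 100")
    return adjusted


# Hypothetical regulated-industry profile: boost criteria 4, 8 and 10 by 2 points
# each, funded by trimming criteria 5, 6 and 11 (an illustrative choice only).
REGULATED_SHIFTS = {4: 2, 8: 2, 10: 2, 5: -2, 6: -2, 11: -2}
regulated = adjust_weights(BASE_WEIGHTS, REGULATED_SHIFTS)
```

Rejecting any profile that does not re-total to 100 keeps weighted scores comparable across vendors scored under the same profile.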

01. Orchestration and control flow (weight: 12%)

Explicit graph or state-machine model with deterministic transitions, support for cycles and conditional branching, durable execution that survives a process restart, and human-in-the-loop checkpoints.

Green flags: Replayable runs from any node, typed state objects, time-travel debugging.
Red flags: Magic agent loops with no inspectable control flow, no way to pause/resume, demos that only show happy-path linear flows.
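To make "durable execution" and "human-in-the-loop checkpoints" concrete when questioning a vendor, the following is a minimal sketch using only the Python standard library. It is not how LangGraph or any other framework implements these features; the node names, the `RunState` fields, and the JSON checkpoint file are all hypothetical.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field


@dataclass
class RunState:
    """Typed state object; persisted after every transition."""
    node: str = "plan"
    attempts: int = 0
    approved: bool = False
    history: list = field(default_factory=list)


def step(state: RunState) -> RunState:
    """Deterministic transition: the next node depends only on the current state."""
    state.history.append(state.node)
    if state.node == "plan":
        state.node = "act"
    elif state.node == "act":
        state.attempts += 1
        # Conditional branch with a cycle: loop back to planning once.
        state.node = "plan" if state.attempts < 2 else "await_approval"
    elif state.node == "await_approval" and state.approved:
        state.node = "done"
    return state


def save(state: RunState, path: str) -> None:
    with open(path, "w") as f:
        json.dump(asdict(state), f)


def load(path: str) -> RunState:
    with open(path) as f:
        return RunState(**json.load(f))


def run(path: str) -> RunState:
    """Resume from the checkpoint if one exists, then advance until paused or done."""
    state = load(path) if os.path.exists(path) else RunState()
    while state.node != "done":
        if state.node == "await_approval" and not state.approved:
            break  # human-in-the-loop checkpoint: stop and wait
        state = step(state)
        save(state, path)  # checkpoint after every transition
    return state


# Usage: run to the approval gate, simulate a restart, approve, then resume.
with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "run.json")
    paused = run(checkpoint)      # stops at the approval gate
    restarted = run(checkpoint)   # a fresh call reloads the same state
    restarted.approved = True     # the human decision is recorded...
    save(restarted, checkpoint)   # ...and persisted
    finished = run(checkpoint)
```

The property worth probing in a demo is the second `run` call: a restarted process picks up at the approval gate instead of starting over.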

02. Observability and tracing (weight: 11%)

Every LLM call, tool call, and agent handoff is captured as a structured trace with token counts, latency, cost, and inputs/outputs. Traces are searchable, exportable, and integrate with your existing APM.

Green flags: OpenTelemetry support, span-level cost attribution, native LangSmith/Langfuse/Arize integration.
Red flags: "We log to stdout," screenshot-only debugging, no per-step token accounting.
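As a rough picture of what a structured trace with span-level cost attribution contains, here is a minimal sketch. The `Span` fields follow the list above (token counts, latency, cost), but the schema, the field names, and every figure are hypothetical; this is not the OpenTelemetry, LangSmith, Langfuse, or Arize data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One structured trace record for an LLM call, tool call, or handoff."""
    trace_id: str
    agent: str
    kind: str            # "llm", "tool", or "handoff"
    name: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float


def cost_by_agent(spans):
    """Roll span-level cost up to each agent."""
    totals = defaultdict(float)
    for span in spans:
        totals[span.agent] += span.cost_usd
    return dict(totals)


def tokens_by_step(spans):
    """Per-step token accounting, in execution order."""
    return [(s.name, s.input_tokens + s.output_tokens) for s in spans]


# Hypothetical spans from a single run; all figures are made up for illustration.
SPANS = [
    Span("run-1", "planner", "llm", "draft_plan", 1200, 300, 850.0, 0.25),
    Span("run-1", "planner", "handoff", "to_researcher", 0, 0, 2.0, 0.0),
    Span("run-1", "researcher", "tool", "search_kb", 0, 0, 120.0, 0.0),
    Span("run-1", "researcher", "llm", "summarize", 4000, 500, 1900.0, 0.5),
    Span("run-1", "planner", "llm", "final_answer", 800, 200, 600.0, 0.125),
]

per_agent = cost_by_agent(SPANS)
per_step = tokens_by_step(SPANS)
```

A vendor whose traces cannot answer "which agent spent what" with a query this simple is unlikely to support per-tenant chargeback later.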

03. Evaluation and testing infrastructure (weight: 10%)

Labeled eval sets, automated regression tests on every prompt and model change, offline batch evaluation, online A/B with shadow traffic, and human-review queues for ambiguous outputs.

Green flags: A vendor that hands you their eval framework on day one. A documented "golden set" methodology.
Red flags: "We test in production." Vibes-based QA. No regression suite when models are upgraded.
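A golden-set regression gate can be illustrated with a toy example. This is a sketch under stated assumptions: the "agents" are keyword stubs standing in for real prompt or model versions, exact-match scoring is the simplest possible grader, and the four labeled cases are invented.

```python
# Hypothetical labeled "golden set": inputs paired with the expected routing label.
GOLDEN_SET = [
    {"input": "Refund for duplicate invoice", "expected": "billing"},
    {"input": "App crashes on login", "expected": "technical"},
    {"input": "Where is my refund?", "expected": "billing"},
    {"input": "API returns 500", "expected": "technical"},
]


def baseline_agent(text: str) -> str:
    """Stub for the current production prompt/model."""
    lowered = text.lower()
    return "billing" if "invoice" in lowered or "refund" in lowered else "technical"


def candidate_agent(text: str) -> str:
    """Stub for a proposed change that silently drops the 'refund' rule."""
    return "billing" if "invoice" in text.lower() else "technical"


def pass_rate(agent_fn, golden_set) -> float:
    """Fraction of golden-set cases the agent answers exactly as labeled."""
    passed = sum(1 for case in golden_set if agent_fn(case["input"]) == case["expected"])
    return passed / len(golden_set)


def regression_gate(candidate_fn, baseline_rate: float, golden_set, tolerance: float = 0.0):
    """Return (ok, rate); ok is False when the candidate regresses past the tolerance."""
    rate = pass_rate(candidate_fn, golden_set)
    return rate >= baseline_rate - tolerance, rate


baseline_rate = pass_rate(baseline_agent, GOLDEN_SET)
ok, candidate_rate = regression_gate(candidate_agent, baseline_rate, GOLDEN_SET)
```

Run on every prompt or model change, a gate like this is what blocks the candidate above before it reaches production.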

04. Security and guardrails (weight: 10%)

Defense against the OWASP Top 10 for LLM Applications 2025 — prompt injection, excessive agency, sensitive information disclosure, unbounded consumption. Tool-use approval policies, scoped credentials per agent, and red-team test results.

Green flags: Documented prompt-injection mitigations, per-tool authorization, kill switches, rate limits at the agent and tool level.
Red flags: Single shared API key across all agents. No mention of indirect prompt injection. "Our model won’t do that" without test evidence.
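Per-tool authorization, approval policies, and a kill switch reduce to a small decision function. The sketch below is illustrative only: the agent names, tool names, and the three-way `allow` / `needs_approval` / `deny` result are assumptions, and a real system would also scope the credentials each tool call runs with.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolPolicy:
    """Which tools an agent may call, and which of those need a human sign-off."""
    allowed_tools: frozenset
    needs_approval: frozenset = frozenset()


# Hypothetical per-agent policies: each agent gets its own scope, not a shared key.
POLICIES = {
    "support_agent": ToolPolicy(
        allowed_tools=frozenset({"search_kb", "issue_refund"}),
        needs_approval=frozenset({"issue_refund"}),
    ),
    "research_agent": ToolPolicy(allowed_tools=frozenset({"search_kb"})),
}

# Kill switch: agents listed here are denied everything until removed.
DISABLED_AGENTS = set()


def authorize(agent: str, tool: str, human_approved: bool = False) -> str:
    """Return "allow", "needs_approval", or "deny" for a proposed tool call."""
    if agent in DISABLED_AGENTS:
        return "deny"
    policy = POLICIES.get(agent)
    if policy is None or tool not in policy.allowed_tools:
        return "deny"  # default-deny for unknown agents and out-of-scope tools
    if tool in policy.needs_approval and not human_approved:
        return "needs_approval"
    return "allow"
```

For example, `authorize("research_agent", "issue_refund")` returns `"deny"`, and `authorize("support_agent", "issue_refund")` returns `"needs_approval"` until a human approves it.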

05. Model flexibility and vendor lock-in (weight: 8%)

Model-agnostic by default. Routing logic that can swap GPT-4.1, Claude Sonnet 4.5, Gemini 2.5, Llama, and self-hosted models without rewriting agent logic.

Green flags: Abstraction layer for model providers, ability to run different agents on different models, support for local/on-prem inference.
Red flags: Hard-coded to one provider’s SDK. Pricing models that punish you for switching.
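The abstraction layer described above can be pictured as one small interface plus a router. This is a sketch with stub models only: `ChatModel`, `StubModel`, and `ModelRouter` are hypothetical names, and no real provider SDK is called.

```python
from typing import Protocol


class ChatModel(Protocol):
    """The only surface agent logic is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class StubModel:
    """Stand-in for a provider adapter (hosted API, self-hosted, on-prem)."""

    def __init__(self, name: str):
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


class ModelRouter:
    """Maps each agent to a model; swapping providers never touches agent code."""

    def __init__(self, default: ChatModel):
        self._default = default
        self._by_agent = {}

    def assign(self, agent: str, model: ChatModel) -> None:
        self._by_agent[agent] = model

    def complete(self, agent: str, prompt: str) -> str:
        return self._by_agent.get(agent, self._default).complete(prompt)


# Hypothetical setup: most agents use a hosted model, one runs on local inference.
router = ModelRouter(default=StubModel("hosted-model"))
router.assign("pii_redactor", StubModel("local-model"))

planner_reply = router.complete("planner", "Outline the steps")
redactor_reply = router.complete("pii_redactor", "Mask account numbers")
```

A useful vendor test is to ask how many files change when one `assign` call is pointed at a different provider.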

06. Tool integration and MCP support (weight: 8%)

First-class support for Anthropic’s Model Context Protocol (MCP) — the de facto standard for connecting agents to enterprise tools — plus native connectors for the systems you already run (Salesforce, ServiceNow, SAP, Snowflake, GitHub).

Green flags: Published MCP servers, A2A (agent-to-agent) protocol awareness, OAuth-based tool authorization.
Red flags: Bespoke tool-calling format you’ll have to maintain, no connector marketplace, brittle JSON schemas.

07. Memory and state management (weight: 7%)

Distinct short-term (conversation) and long-term (entity, episodic) memory layers, with TTLs, scoping rules, and the ability to redact or expire memories on demand.

Green flags: Pluggable vector stores, memory namespacing per user/tenant, audit logs on memory writes.
Red flags: "Memory" that is just an ever-growing prompt prefix. No way to delete a memory for GDPR/CCPA.
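The memory requirements above (namespacing, TTLs, deletion on demand, audit logs on writes) fit in a short in-memory sketch. It is illustrative only: `MemoryStore` and its methods are hypothetical, a production system would back this with a database or vector store, and the injectable clock exists purely so expiry can be demonstrated without waiting.

```python
import time


class MemoryStore:
    """In-memory sketch: entries are namespaced by (tenant, user) and expire via TTL."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._items = {}     # (tenant, user, key) -> (value, expires_at)
        self.audit_log = []  # append-only record of writes and deletions

    def write(self, tenant, user, key, value, ttl_seconds):
        self._items[(tenant, user, key)] = (value, self._clock() + ttl_seconds)
        self.audit_log.append(("write", tenant, user, key))

    def read(self, tenant, user, key):
        item = self._items.get((tenant, user, key))
        if item is None:
            return None
        value, expires_at = item
        if self._clock() >= expires_at:
            del self._items[(tenant, user, key)]  # lazily drop expired entries
            return None
        return value

    def forget_user(self, tenant, user):
        """Delete everything stored for one user, e.g. for an erasure request."""
        doomed = [k for k in self._items if k[:2] == (tenant, user)]
        for k in doomed:
            del self._items[k]
        self.audit_log.append(("forget_user", tenant, user, None))
        return len(doomed)


# Usage with a fake clock so the TTL behaviour is visible.
now = [0.0]
store = MemoryStore(clock=lambda: now[0])
store.write("acme", "u1", "preferred_language", "de", ttl_seconds=60)
store.write("acme", "u1", "last_order", "A-1001", ttl_seconds=3600)
store.write("acme", "u2", "preferred_language", "fr", ttl_seconds=3600)

now[0] = 120.0                                                # two minutes later
expired = store.read("acme", "u1", "preferred_language")      # TTL elapsed
removed = store.forget_user("acme", "u1")                     # deletes the remaining entry
```

The contrast with the red flag is that deletion here is a keyed operation, which is not possible when "memory" is an ever-growing prompt prefix.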

08. Deployment and data residency (weight: 8%)

Cloud, VPC, and on-prem deployment options. Regional data residency for the EU, UK, and other regulated geographies. Documented data-flow diagrams showing where prompts, completions, and tool calls travel.

Green flags: BYOK encryption, private endpoints, no training on customer data by default.
Red flags: Single-region SaaS with no VPC option. Vague answers about where inference happens.

09. Cost transparency and unit economics (weight: 8%)

A clear cost-per-task or cost-per-resolution metric, dashboards showing token spend by agent and tool, and predictable pricing models that match your usage shape.

Green flags: Cost budgets with hard stops, per-tenant chargeback, published unit economics.
Red flags: "It depends" pricing. Salesforce’s three-times-revised Agentforce pricing (from $2/conversation in 2024 to $0.10/action Flex Credits in May 2025 to per-user licenses at $125+ in late 2025) is a cautionary tale.
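A cost budget with a hard stop, together with a cost-per-task metric, can be sketched as follows. The class name, the dollar figures, and the agent names are hypothetical; the point is only that the limit is enforced before spend is recorded, not reported afterwards.

```python
from collections import defaultdict


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spend past the hard limit."""


class CostBudget:
    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0
        self.by_agent = defaultdict(float)

    def charge(self, agent: str, cost_usd: float) -> None:
        """Record spend, refusing any charge that would exceed the limit."""
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"{agent}: charge of ${cost_usd:.2f} would exceed ${self.limit_usd:.2f}"
            )
        self.spent_usd += cost_usd
        self.by_agent[agent] += cost_usd

    def cost_per_task(self, resolved_tasks: int) -> float:
        """Unit economics: total spend divided by tasks actually resolved."""
        if resolved_tasks <= 0:
            raise ValueError("resolved_tasks must be positive")
        return self.spent_usd / resolved_tasks


# Usage with made-up figures: the third charge hits the hard stop.
budget = CostBudget(limit_usd=1.0)
budget.charge("triage", 0.5)
budget.charge("resolver", 0.25)
try:
    budget.charge("resolver", 0.5)
    stopped = False
except BudgetExceeded:
    stopped = True

unit_cost = budget.cost_per_task(resolved_tasks=3)
```

Because the check runs before the spend is committed, a runaway agent loop ends with an exception rather than an invoice.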

10. Compliance and audit readiness (weight: 7%)

SOC 2 Type II in hand, plus the specific frameworks your industry needs — HIPAA, ISO 27001, PCI-DSS, FedRAMP, EU AI Act risk classification documentation.

Green flags: Independent audit reports, DPA templates ready, agent-action audit logs that satisfy regulators.
Red flags: "SOC 2 in progress" for 18 months. No incident-response runbook for a hallucinated tool call.

11. Team seniority and domain expertise (weight: 6%)

Named senior engineers on your account with shipped agent systems behind them. Domain context — if you are a bank, the team has built for banks before.

Green flags: Named tech leads with public production case studies, references you can call directly.
Red flags: A glossy sales deck with anonymous team photos. Junior engineers learning on your project.

12. Production track record and references (weight: 5%)

At least three named customer references running the system in production for six months or more, with quantified outcomes (resolution rate, cost savings, cycle-time reduction).

Green flags: Case studies with real numbers, willingness to share post-incident reviews.
Red flags: Logo slides without permission to contact the customer. "We can’t share that under NDA" for everything.

Vendor comparison: how the major options score

Disclosure

This review is published by Sphere, which is included as one of the evaluated vendors. The scoring methodology is described on our methodology page. Sphere is a delivery partner, not a framework or platform; it is included because most enterprise buyers compare "build with X framework" against "engage a delivery partner" as a single decision.

Scores are 1 (poor) to 5 (best-in-class) across the 12 criteria above, based on Sphere's review of public documentation, internal benchmarks, and production deployments observed between January 2024 and April 2026.

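| Criterion | LangGraph | CrewAI | Microsoft AutoGen / Agent Framework | OpenAI Agents SDK | Salesforce Agentforce | Sphere (delivery partner) |
| --- | --- | --- | --- | --- | --- | --- |
| 1. Orchestration & control flow | 5 | 3 | 4 | 3 | 3 | 5 |
| 2. Observability & tracing | 5 (LangSmith) | 2 | 4 | 4 | 4 | 5 |
| 3. Evaluation & testing | 4 | 2 | 3 | 4 | 3 | 5 |
| 4. Security & guardrails | 3 | 2 | 4 | 4 | 5 | 5 |
| 5. Model flexibility | 5 | 5 | 5 | 1 (OpenAI only) | 3 | 5 |
| 6. Tool integration / MCP | 4 | 3 | 5 | 5 | 4 | 5 |
| 7. Memory & state | 5 | 3 | 3 | 3 | 4 | 4 |
| 8. Deployment / residency | 4 | 4 | 5 | 3 | 4 | 5 |
| 9. Cost transparency | 4 | 4 | 4 | 4 | 2 | 4 |
| 10. Compliance & audit | 3 | 3 | 5 | 4 | 5 | 5 |
| 11. Team seniority (vendor) | 4 | 3 | 5 | 5 | 5 | 5 |
| 12. Production track record | 4 | 3 | 4 | 3 | 4 | 5 |
| Weighted total (out of 5.0) | 4.2 | 3.0 | 4.2 | 3.5 | 3.7 | 4.8 |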

A few honest notes on this table. LangGraph and the rebuilt Microsoft Agent Framework (the AutoGen successor) are the two strongest open frameworks for production work. CrewAI is the easiest to prototype with but its observability story remains weak — debugging a misbehaving Crew is still painful in 2026. The OpenAI Agents SDK (which replaced Swarm in March 2025) is excellent if you have already committed to OpenAI models; it loses points on flexibility because that commitment is the price of entry. Agentforce is strong inside Salesforce and weak outside it, and its repeated pricing pivots make budgeting a board-level conversation.
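The weighted totals in the comparison table can be reproduced from the per-criterion scores and the scorecard weights. The sketch below works in integer hundredths to avoid floating-point noise. One inference, not stated in the article: the published totals match the exact weighted sums truncated (not rounded) to one decimal place.

```python
# Scorecard weights for criteria 1..12, in percentage points.
WEIGHTS = [12, 11, 10, 10, 8, 8, 7, 8, 8, 7, 6, 5]

# Per-criterion scores (criteria 1..12) as given in the comparison table.
SCORES = {
    "LangGraph": [5, 5, 4, 3, 5, 4, 5, 4, 4, 3, 4, 4],
    "CrewAI": [3, 2, 2, 2, 5, 3, 3, 4, 4, 3, 3, 3],
    "Microsoft AutoGen / Agent Framework": [4, 4, 3, 4, 5, 5, 3, 5, 4, 5, 5, 4],
    "OpenAI Agents SDK": [3, 4, 4, 4, 1, 5, 3, 3, 4, 4, 5, 3],
    "Salesforce Agentforce": [3, 4, 3, 5, 3, 4, 4, 4, 2, 5, 5, 4],
    "Sphere": [5, 5, 5, 5, 5, 5, 4, 5, 4, 5, 5, 5],
}


def weighted_points(scores):
    """Sum of weight x score, in hundredths of a point (500 means 5.00)."""
    if len(scores) != len(WEIGHTS):
        raise ValueError("expected one score per criterion")
    return sum(w * s for w, s in zip(WEIGHTS, scores))


def exact_total(scores):
    """Weighted total on the 5.0 scale, to two decimals."""
    return weighted_points(scores) / 100


def one_decimal_truncated(scores):
    """Weighted total truncated to one decimal, which matches the table's figures."""
    return weighted_points(scores) // 10 / 10


exact = {vendor: exact_total(s) for vendor, s in SCORES.items()}
shown = {vendor: one_decimal_truncated(s) for vendor, s in SCORES.items()}
```

The exact sums are 4.21, 3.01, 4.20, 3.58, 3.77, and 4.85 in column order, so two vendors whose table totals differ by 0.1 may be closer or further apart than the single decimal suggests.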

Who should choose which vendor

Framework vs. platform vs. delivery partner: the real build-vs-buy

The build-vs-buy framing is misleading for agentic systems. The real choice is three-way:

  • Build on an open framework (LangGraph, CrewAI, Microsoft Agent Framework). Maximum flexibility, maximum engineering ownership. You own observability, evals, guardrails, and ops. Best when the agent is core to your product.
  • Buy a managed platform (Agentforce, Bedrock Agents, watsonx Orchestrate, Vertex AI Agent Builder, Copilot Studio). Faster time-to-first-pilot, governance built in, but you inherit the platform's pricing model and lock-in. Best when the agent is a workflow accelerator on top of an existing system of record.
  • Engage a delivery partner to do either of the above with senior engineers who have shipped agents before. This is what Sphere's AI-augmented delivery model does: it does not replace the framework or platform decision, it shortens the path through it.

A useful heuristic: if your CTO cannot name three engineers on the team who have personally debugged an agent loop in production, you are not ready to build alone.

Realistic cost ranges for production pilots in 2026

According to Sphere's analysis of 47 enterprise AI agent projects delivered between 2023 and 2025 across financial services, healthcare, insurance, and manufacturing, production pilot costs cluster into three tiers. These are engagement costs (build with a delivery partner), separate from ongoing platform license and inference fees.

| Tier | Scope | Build cost (with partner) | Annual run cost |
| --- | --- | --- | --- |
| Small pilot (Tier 1) | Single agent, 1–2 tools, narrow use case, one business unit | $80K–$180K | $40K–$120K |
| Mid-sized production (Tier 2) | 3–5 cooperating agents, 5–10 tools, eval pipeline, observability stack | $180K–$450K | $120K–$400K |
| Large enterprise rollout (Tier 3) | Multi-team, multi-region, regulated data, full MLOps, governance, change management | $450K–$1.2M | $400K–$1.5M+ |

In Sphere's dataset, the median Tier 2 program ran 16 weeks from kickoff to first production traffic, with 22% of total budget consumed by data preparation, 18% by evaluation infrastructure, and 11% by guardrails and red-teaming — line items that vendor demos almost never highlight.

Two failure patterns showed up repeatedly. First, projects that skipped a labeled eval set saw a 3.4x higher rate of post-launch rollback. Second, projects without per-agent cost dashboards exceeded their first-quarter inference budget by an average of 2.7x. Both are preventable with a vendor that scores 4 or 5 on criteria 2, 3, and 9 above.

A reusable scoring template

Use this template in your RFP. Each vendor self-scores 1–5; you re-score after a technical deep-dive and reference checks. Multiply by weight, sum, and rank. A score below 3.5 weighted should be a hard pass for production work.

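| # | Criterion | Weight | Vendor self-score | Verified score | Weighted |
| --- | --- | --- | --- | --- | --- |
| 1 | Orchestration & control flow | 12% | | | |
| 2 | Observability & tracing | 11% | | | |
| 3 | Evaluation & testing | 10% | | | |
| 4 | Security & guardrails | 10% | | | |
| 5 | Model flexibility | 8% | | | |
| 6 | Tool integration / MCP | 8% | | | |
| 7 | Memory & state | 7% | | | |
| 8 | Deployment / residency | 8% | | | |
| 9 | Cost transparency | 8% | | | |
| 10 | Compliance & audit | 7% | | | |
| 11 | Team seniority | 6% | | | |
| 12 | Production track record | 5% | | | |
| | Total | 100% | | | /5.0 |
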
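The template's arithmetic (multiply by weight, sum, rank, and treat anything below 3.5 as a hard pass) can be sketched as below. The two vendors and all of their scores are invented for illustration, and reporting the gap between self-score and verified score as its own number is one possible way to use the two columns.

```python
# Scorecard weights for criteria 1..12, in percentage points.
WEIGHTS = [12, 11, 10, 10, 8, 8, 7, 8, 8, 7, 6, 5]
HARD_PASS_BELOW = 350  # 3.5 on the 5.0 scale, expressed in hundredths


def weighted_points(scores):
    """Sum of weight x score in hundredths of a point; validates the 1-5 scale."""
    if len(scores) != len(WEIGHTS):
        raise ValueError("expected one score per criterion")
    if any(not 1 <= s <= 5 for s in scores):
        raise ValueError("scores must be between 1 and 5")
    return sum(w * s for w, s in zip(WEIGHTS, scores))


def assess(self_scores, verified_scores):
    """Compare a vendor's self-score with the buyer's verified score."""
    claimed = weighted_points(self_scores)
    verified = weighted_points(verified_scores)
    return {
        "self": claimed / 100,
        "verified": verified / 100,
        "gap": (claimed - verified) / 100,  # large gaps suggest an overconfident pitch
        "hard_pass": verified < HARD_PASS_BELOW,
    }


# Hypothetical vendors: (self-scores, verified scores) for criteria 1..12.
VENDORS = {
    "Vendor A": ([5] * 12, [4, 3, 3, 4, 4, 4, 3, 4, 3, 4, 4, 3]),
    "Vendor B": ([4] * 12, [3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 4, 4]),
}

results = {name: assess(claimed, verified) for name, (claimed, verified) in VENDORS.items()}

# Rank the vendors that clear the threshold by verified score, best first.
shortlist = sorted(
    (name for name, r in results.items() if not r["hard_pass"]),
    key=lambda name: results[name]["verified"],
    reverse=True,
)
```

With these made-up inputs, Vendor A verifies at 3.59 against a self-score of 5.0 and stays on the shortlist, while Vendor B verifies at 3.27 and is a hard pass.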
The bottom line
What separates production-ready agent vendors from demos
  • Most "agentic AI" pitches are agent washing. Gartner estimates only ~130 of the thousands of self-described agent vendors offer real agentic capability — score every vendor on demonstrated autonomy, not marketing language.
  • Observability and evals are the real moat. Vendors that cannot show you a trace and a regression suite on day one will cost you 2–3x more in production. Traces and regression suites are the leading indicators of whether a project ends up in MIT NANDA's failing 95%.
  • Cost models are a leading risk indicator. Salesforce shipped three Agentforce pricing models in 18 months. Treat pricing volatility as a vendor risk, not just a procurement detail.
  • Build, buy, and partner are not mutually exclusive. The strongest 2026 enterprise programs run an open framework (usually LangGraph or the Microsoft Agent Framework) inside a managed cloud, with a senior delivery partner closing the gap between pilot and production.
  • Weight your scorecard for your risk profile. Regulated industries should boost criteria 4, 8, and 10. Customer-facing agents should boost 2 and 3. Internal-only automations can lean on 1 and 6.
Frequently Asked Questions

Common CTO Questions

Use the 12-point scorecard above: orchestration, observability, evaluation, security, model flexibility, tool/MCP integration, memory, deployment, cost transparency, compliance, team seniority, and production track record. Weight them based on your risk profile, then verify every self-reported score with a technical deep-dive and at least two customer references. The biggest mistake is over-weighting features and under-weighting observability, evals, and cost transparency — the three criteria most predictive of survival to production.
Score each vendor 1–5 on the 12 criteria, multiply by the published weights (totaling 100%), and sum to a /5.0 score. Anything below 3.5 weighted is a hard pass for production workloads. Score the vendor’s self-reported capability first, then re-score after a technical deep-dive — the gap between the two scores is itself a useful signal of how candid the vendor is.
Five questions cut through most pitches: (1) Show me a trace from a real production run, including token cost per step. (2) Show me your eval set and your last regression test report. (3) Walk me through how you handle prompt injection and tool misuse — name the OWASP categories. (4) Name three customers in production for six months or more, and let me call them. (5) What is your cost per resolved task, and how stable has your pricing model been over the last 12 months? If a vendor cannot answer all five, you are looking at a demo, not a system.
A short checklist version of the 12 criteria: explicit graph or state machine; trace-level observability with cost attribution; labeled eval set and regression suite; OWASP-aligned guardrails; multi-model support; native MCP; namespaced memory with deletion; VPC or on-prem deployment option; per-agent cost dashboards; SOC 2 plus your regulatory framework; named senior engineers; three referenceable production customers. Anything less is a prototype vendor, not a production vendor.
Use the scoring table in this article — copy it into a spreadsheet, list vendors as columns, and complete it with weighted scores. Add three meta-rows below: "Total cost of ownership over 24 months," "Time-to-first-production," and "Exit cost if we switch vendors in year two." Those three rows often change the ranking more than the technical score does.
Build on a framework (LangGraph, Microsoft Agent Framework, CrewAI) when the agent is core to your product or differentiator and you have senior engineering capacity. Buy a platform (Agentforce, Bedrock Agents, watsonx Orchestrate, Vertex AI Agent Builder, Copilot Studio) when the agent accelerates a workflow on top of an existing system of record and time-to-value matters more than flexibility. Most enterprises end up doing both — platform for embedded workflows, framework for differentiated capabilities — and engage a delivery partner like Sphere to integrate them.
A scoped Tier 1 pilot (single agent, narrow use case) takes 6–10 weeks from kickoff to a controlled production rollout. A Tier 2 multi-agent system with evals, observability, and integration into one or two enterprise systems takes 14–22 weeks. Anything claiming "agents in production in two weeks" is either a sandboxed demo or a system that will fail criteria 2, 3, and 4 of this scorecard.
Four show up repeatedly in 2025–2026 post-mortems: (1) agent loop runaway, where agents retry indefinitely with no budget cap; (2) tool misuse and excessive agency, where an agent calls a legitimate tool with out-of-scope parameters; (3) indirect prompt injection through retrieved documents or tool outputs; and (4) cost blowouts from unbounded token consumption. All four are addressed by criteria 1, 2, 4, and 9 of the scorecard, which is why those four together carry 41% of the total weight.
About the Sphere Research Team
Editorial & Research Unit · CTO Accelerator

The Sphere Research Team is the editorial and research arm of Sphere's CTO Accelerator. Our analysis draws on 20+ years of enterprise delivery across AI, cloud, data, and modernization — spanning 230+ projects in financial services, healthcare, insurance, manufacturing, and private equity. Every framework, benchmark, and cost range published here is grounded in real project data and reviewed by Sphere's senior engineering leadership.