1. Evaluation Logic
This section sets out the design principles behind the Zaira Standard. They explain why the standard is structured the way it is and guide interpretation of the criteria.
The standard scales with tool complexity
The simplest tools face the simplest evaluation. A library installed via npm install and used locally is evaluated against the Base Standard alone: documentation, lifecycle health, supply chain integrity, installation simplicity. No authentication criteria. No API surface area management. No rate limit communication. Those criteria don't apply because the tool doesn't have those concerns.
As a tool's complexity increases (it exposes an API, runs as a hosted service, handles destructive operations, requires credentials) complexity modules activate and add criteria proportional to that complexity. A cloud database with auth, writes, and a REST API is evaluated against the Base Standard plus four complexity modules. Evaluation scope matches the tool's surface area.
Every criterion in a tool's evaluation applies to that tool. There are no N/A markings, no skipped criteria, no denominator adjustments. If a criterion is activated in a tool's evaluation, it is relevant to that evaluation.
Health over activity
The standard measures whether a tool functions, not whether it is being actively developed. A stable, feature-complete library with zero open CVEs, passing continuous integration, and accurate documentation satisfies health criteria even without recent commits; a project with weekly commits and unaddressed issues does not. Criteria that might otherwise penalize inactivity (B4, B11) are defined in terms of empirical health signals (documentation-behavior correspondence, vulnerability patching cadence, installation success on current runtimes) rather than recency signals. Completion is a state demonstrated through continued health, not declared through version increments or announcements.
Discoverability is outside scope
The Zaira Standard evaluates whether a tool is ready for agents to use, not whether agents can find it. Discoverability (registry presence, search engine indexing, structured metadata) is a separate concern. Conflating discoverability with agent-readiness would penalize tools whose distribution is weak and reward tools whose distribution is strong, independent of agent-readiness.
Agent capability floor
The Zaira Standard defines a minimum agent capability threshold: the capability floor. Agents scoring below the floor are not target consumers of Zaira Standard evaluation results.
The capability floor for Zaira Standard v0.9 is 35% on SWE-Bench Pro V9, administered by SEAL. This benchmark evaluates base model and sub-agent performance on software engineering tasks with tool usage. It strips wrapper scaffolding and vendor-specific enhancements.
The floor is defined by benchmark score, not by model name. The capability floor is a normative parameter of each standard version, reviewed with each minor revision (see §5) following the standard change process.
2. How the Standard Works
Module Activation
Every tool starts with the Base Standard: 15 criteria that apply universally. Then, based on what the tool does and how it's accessed, complexity modules activate:
| Module | Trigger Question | Criteria Added |
|---|---|---|
| Programmatic Interface | Does the tool expose an API, SDK, or MCP server? | 16 |
| Network Service | Is the tool a hosted/remote service? | 8 |
| Write Operations | Can the tool create, modify, or delete data/resources? | 4 |
| Authentication | Does the tool require credentials or tokens? | 4 |
| CLI | Does the tool have a command-line interface? | 4 |
After complexity modules, one or more domain modules may apply based on the tool's functional domains (Payments, Databases, Communications, and others defined in §9).
Each trigger question is a simple yes/no. A tool may trigger zero, one, or many complexity modules. The triggers are independent: a tool can be a Network Service without having a CLI, or have Write Operations without requiring Authentication.
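The activation rule can be sketched as a lookup from yes/no trigger answers to modules. This is an illustrative sketch, not part of the standard: the names `COMPLEXITY_MODULES` and `activated_criteria` are hypothetical, the criteria counts are copied from the table above, and domain modules are left out because their criteria are defined in §9.

```python
# Criteria counts copied from the Module Activation table.
BASE_CRITERIA = 15
COMPLEXITY_MODULES = {
    "Programmatic Interface": 16,
    "Network Service": 8,
    "Write Operations": 4,
    "Authentication": 4,
    "CLI": 4,
}


def activated_criteria(triggers: dict[str, bool]) -> tuple[list[str], int]:
    """Return the activated complexity modules and the resulting criteria count.

    `triggers` maps a module name to the yes/no answer to its trigger
    question; a missing entry is treated as "no". Domain modules are not
    counted in this sketch.
    """
    unknown = set(triggers) - set(COMPLEXITY_MODULES)
    if unknown:
        raise ValueError(f"unknown module(s): {sorted(unknown)}")
    # Each trigger is evaluated independently of the others.
    active = [name for name in COMPLEXITY_MODULES if triggers.get(name, False)]
    total = BASE_CRITERIA + sum(COMPLEXITY_MODULES[name] for name in active)
    return active, total
```

Under these assumptions, a tool that answers yes only to Write Operations and CLI is evaluated against 15 + 4 + 4 = 23 criteria before any domain module is added.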
Example Evaluations
| Tool | Base | Interface | Network | Write | Auth | CLI | Domain(s) | Total Criteria |
|---|---|---|---|---|---|---|---|---|
| SQLite | ✓ | | | | | | | 15 |
| lodash | ✓ | | | | | | | 15 |
| React | ✓ | | | | | | Frameworks | 18 |
| git CLI | ✓ | | | ✓ | | ✓ | | 23 |
| Terraform CLI | ✓ | | | ✓ | ✓ | ✓ | Hosting | 30 |
| Neon (DB) | ✓ | ✓ | ✓ | ✓ | ✓ | | Databases | 51 |
| Stripe | ✓ | ✓ | ✓ | ✓ | ✓ | | Payments | 51 |
| Supabase | ✓ | ✓ | ✓ | ✓ | ✓ | | DB + Auth + Hosting | 57 |
| AWS Redshift | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Databases | 55 |
The simplest tools have the smallest evaluation scope. The most complex tools activate every applicable module.
3. How to Read This Document
Criterion Structure
Each criterion includes:
| Field | Meaning |
|---|---|
| ID | Unique identifier (e.g., B1, AU3, NS5) |
| Name | Short descriptive name |
| Description | What this criterion evaluates |
| Requirement Level | MUST (binary gate), SHOULD (expected), or MAY (aspirational) |
| Weight | Critical (×2) or Standard (×1): determines point multiplier |
| Scoring Gradient | 0 (Failing), 1 (Basic), 2 (Good), 3 (Excellent) |
Criterion IDs use module-based prefixes: B (Base Standard), PI (Programmatic Interface), NS (Network Service), WO (Write Operations), AU (Authentication), CLI (CLI), PM (Payments), CM (Communications), DB (Databases), HI (Hosting), AP (Auth Providers), FL (Frameworks). IDs are sequential within each module, so the prefix of a criterion's ID indicates which module it belongs to.
Requirement Levels (RFC 2119)
- MUST: Binary gate. Score ≥1 required for Agent Ready or Agent Native. A tool that scores 0 on any MUST criterion in its activated modules cannot achieve either designation. MUST gates only appear in complexity modules: the Base Standard has no MUST gates.
- SHOULD: Expected for meaningful agent readiness. Scored and weighted. Low scores reduce tier eligibility.
- MAY: Aspirational. Demonstrates excellence. Primarily differentiates the highest tier.
Weight Categories
- Critical (×2): Criteria with the strongest documented impact on agent success rates, including error handling, description quality, and authentication.
- Standard (×1): All other criteria. Important but with less dramatic measured impact or less universal applicability.
Open-Source vs. Commercial Tool Splits
Where a criterion measures fundamentally different things depending on whether the tool is open-source or commercial, two scoring gradients are provided. An open-source library's sustainability risk is contributor concentration; a commercial SaaS tool's sustainability risk is corporate strategy and sunset policy. Both matter, but they require different evidence.
- Open-source tools: Use the open-source rubric (listed first).
- Commercial/closed-source tools: Use the commercial rubric instead.
- Hybrid tools (open-source core with commercial hosted offering): Evaluate against the open-source rubric for the open-source artifact. If the primary product being evaluated is the hosted service, use the commercial rubric.
Scores are directly comparable: a score of 2 on either rubric indicates the same qualitative level (Good).
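The rubric choice described above reduces to a two-input decision. The following is a minimal sketch under that reading; `Rubric`, `select_rubric`, and both parameter names are hypothetical and do not appear in the standard.

```python
from enum import Enum


class Rubric(Enum):
    OPEN_SOURCE = "open-source"
    COMMERCIAL = "commercial"


def select_rubric(open_source: bool, primary_is_hosted_service: bool) -> Rubric:
    """Pick the scoring rubric for a split criterion.

    open_source: the artifact under evaluation is open-source (including the
        open-source core of a hybrid tool).
    primary_is_hosted_service: the product being evaluated is the commercial
        hosted offering rather than the open-source artifact.
    """
    if open_source and not primary_is_hosted_service:
        return Rubric.OPEN_SOURCE
    # Closed-source tools, and hybrid tools evaluated as a hosted service,
    # both use the commercial rubric.
    return Rubric.COMMERCIAL
```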
4. Scoring System
How Criteria Map to Points
Each criterion is scored on a 0-3 scale:
| Score | Label | Meaning |
|---|---|---|
| 0 | Failing | Does not meet minimum requirements; actively harms agent usability |
| 1 | Basic | Minimum viable implementation; functional but limited |
| 2 | Good | Solid implementation; meaningfully supports agent workflows |
| 3 | Excellent | Exemplary implementation; designed with agents as a first-class consumer |
Point Calculation
Criterion Score = Raw Score (0-3) × Weight (1 or 2)
Module Score = Sum of all criterion scores in module
Total Score = Sum of all module scores
Percentage = Total Score / Maximum Possible Score for activated modules
Because modules only activate when relevant, there are no N/A adjustments. The denominator is always the maximum possible score for the specific modules a tool activates.
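The point calculation can be written out as a short sketch. The class and function names (`CriterionResult`, `total_and_max`, `percentage`) are hypothetical; only the weights (×2, ×1) and the 0-3 scale come from the text above.

```python
from dataclasses import dataclass

MAX_RAW_SCORE = 3


@dataclass(frozen=True)
class CriterionResult:
    criterion_id: str
    raw_score: int   # 0-3 on the scoring gradient
    critical: bool   # Critical criteria carry weight 2, Standard weight 1

    @property
    def weight(self) -> int:
        return 2 if self.critical else 1

    @property
    def points(self) -> int:
        return self.raw_score * self.weight

    @property
    def max_points(self) -> int:
        return MAX_RAW_SCORE * self.weight


def total_and_max(results: list[CriterionResult]) -> tuple[int, int]:
    """Sum earned points and maximum possible points over activated criteria."""
    for result in results:
        if not 0 <= result.raw_score <= MAX_RAW_SCORE:
            raise ValueError(f"{result.criterion_id}: raw score out of range")
    total = sum(result.points for result in results)
    maximum = sum(result.max_points for result in results)
    return total, maximum


def percentage(results: list[CriterionResult]) -> float:
    """Total Score / Maximum Possible Score, expressed as a percentage."""
    total, maximum = total_and_max(results)
    if maximum == 0:
        raise ValueError("no criteria to score")
    return 100 * total / maximum
```

Because only activated criteria are passed in, the denominator never needs an N/A adjustment. For a Base-only tool with 2 Critical and 13 Standard criteria, the maximum is 2 × 6 + 13 × 3 = 51.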
What is a good score?
A Zaira Score indicates a tool's agent-readiness. Scores fall into three descriptive bands, each of which characterizes a range of agent-usability. These bands describe how a score is interpreted; they are not certification outcomes.
Agent-Ready Base
Describes the score of a tool whose entire surface is covered by the Base Standard: a utility library, a pure-function module, or any tool without an API, network calls, write operations, or credentials. A score in this band indicates that the fundamentals (documentation, lifecycle health, supply chain integrity, installation) meet the standard. No additional evaluation applies because the tool does not expose further surface area.
| Indicator | What it looks like |
|---|---|
| Tool shape | Triggers zero complexity modules |
| Overall score | ≥60% of Base Standard |
| Zero scores | No more than 3 SHOULD criteria score 0 |
Agent Ready
Describes the score of a tool with substantial surface area (APIs, hosted services, write operations, authentication) that is substantively usable by agents for standard workflows across that surface. MUST gates pass, every applicable module is covered, and any remaining gaps are documented in the scorecard.
| Indicator | What it looks like |
|---|---|
| Tool shape | Triggers one or more complexity modules |
| Overall score | ≥60% |
| MUST criteria | All score ≥1 |
| Zero scores | No more than 5 MUST or SHOULD criteria score 0 |
Agent Native
The highest band. Describes the score of a tool that treats agents as first-class consumers across its full surface: no criterion at zero on any MUST or SHOULD, every MUST scoring at least 2, and an overall percentage of 80% or higher. A score in this band satisfies every Agent Ready indicator plus additional thresholds.
| Indicator | What it looks like |
|---|---|
| Tool shape | Triggers one or more complexity modules |
| Overall score | ≥80% |
| MUST criteria | All score ≥2 |
| Zero scores | No MUST or SHOULD criterion scores 0 |
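
The three indicator tables can be combined into a single band check. This is a sketch under stated assumptions: `classify_band` and its parameters are hypothetical names, thresholds are compared against the unrounded percentage, and a score that meets no band returns `None` because the text does not name a band below Agent Ready.

```python
from typing import Optional


def classify_band(
    percentage: float,
    complexity_modules: int,
    must_scores: list[int],
    should_scores: list[int],
) -> Optional[str]:
    """Return the descriptive band for a score, or None if no band applies.

    must_scores / should_scores hold the raw 0-3 scores of the MUST and
    SHOULD criteria in the activated modules.
    """
    should_zeros = sum(1 for score in should_scores if score == 0)
    must_zeros = sum(1 for score in must_scores if score == 0)

    if complexity_modules == 0:
        # The Base Standard has no MUST gates, so only SHOULD zeros count.
        if percentage >= 60 and should_zeros <= 3:
            return "Agent-Ready Base"
        return None

    zeros = must_zeros + should_zeros
    if percentage >= 80 and all(s >= 2 for s in must_scores) and zeros == 0:
        return "Agent Native"
    if percentage >= 60 and all(s >= 1 for s in must_scores) and zeros <= 5:
        return "Agent Ready"
    return None
```

Agent Native is tested first because every score that satisfies it also satisfies the Agent Ready indicators.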
Anti-Gaming Mechanisms
- MUST gates. MUST criteria in activated modules are binary gates: score 0 on any and the tool cannot achieve Agent Ready or Agent Native regardless of total score.
- Critical weighting. Criteria with the strongest empirical impact on agent success count double, preventing optimization toward lower-weighted criteria.
- Zero-score limits. Agent Ready allows no more than 5 SHOULD/MUST criteria at 0; Agent Native allows none. This prevents broad neglect across any part of the evaluation.
- Evidence requirements. Every score requires documented evidence (automated test output, URL, or evaluator rationale). No score without evidence.
Together, these mechanisms ensure that band placement reflects agent-readiness across a tool's full surface area, not selective optimization of lower-weighted criteria.
Scoring Examples
Example 1: Simple library (lodash)
Modules: Base only (15 criteria)
- 2 Critical × 3 × 2 = 12
- 13 Standard × 3 × 1 = 39
- Max: 51
No complexity modules triggered. Scores produced under this profile fall in the Agent-Ready Base band: the Base Standard covers the tool's entire surface, so the fundamentals are the complete evaluation.
Example 2: CLI tool with write operations (Terraform CLI)
Modules: Base (15) + Write Operations (4) + Authentication (4) + CLI (4) = 27 criteria
- Critical criteria: B1, B10, WO1, AU1 = 4 Critical
- 4 Critical × 3 × 2 = 24
- 23 Standard × 3 × 1 = 69
- Max: 93
MUST gates: AU1
Eligible for Agent Ready or Agent Native band.
Example 3: Full SaaS payment platform (Stripe)
Modules: Base (15) + Programmatic Interface (16) + Network Service (8) + Write Operations (4) + Authentication (4) + Payments (4) = 51 criteria
- Critical criteria: B1, B10, PI1, PI2, PI3, PI4, PI10, NS1, NS2, WO1, AU1 = 11 Critical
- 11 Critical × 3 × 2 = 66
- 40 Standard × 3 × 1 = 120
- Max: 186
MUST gates: PI1, PI2, PI15, NS1, AU1
Eligible for Agent Ready or Agent Native band.
Example scores:
- Base: 34/51 (67%)
- Programmatic Interface: 42/63 (67%)
- Network Service: 20/30 (67%)
- Write Operations: 11/15 (73%)
- Authentication: 11/15 (73%)
- Payments: 8/12 (67%)
Total: 126/186 = 68%
Band check:
- ≥60%? Yes → Agent Ready candidate
- All MUSTs ≥1? (PI1 ✓, PI2 ✓, PI15 ✓, NS1 ✓, AU1 ✓) Yes ✓
- ≤5 SHOULD/MUST criteria score 0? Yes (2 zeros) ✓
- Result: Agent Ready
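The arithmetic in Example 3 can be checked by rolling the per-module scores up into a total. The sketch below uses the example figures as given; `STRIPE_EXAMPLE` and `roll_up` are hypothetical names.

```python
# (earned points, maximum points) per module, copied from the example scores.
STRIPE_EXAMPLE = {
    "Base": (34, 51),
    "Programmatic Interface": (42, 63),
    "Network Service": (20, 30),
    "Write Operations": (11, 15),
    "Authentication": (11, 15),
    "Payments": (8, 12),
}


def roll_up(module_scores: dict[str, tuple[int, int]]) -> tuple[int, int, float]:
    """Return (total earned, total maximum, overall percentage)."""
    earned = sum(score for score, _ in module_scores.values())
    maximum = sum(limit for _, limit in module_scores.values())
    if maximum == 0:
        raise ValueError("no modules to roll up")
    return earned, maximum, 100 * earned / maximum


earned, maximum, overall = roll_up(STRIPE_EXAMPLE)
print(f"{earned}/{maximum} = {overall:.0f}%")
```

The overall percentage is computed from the summed points rather than by averaging the module percentages, so larger modules carry proportionally more weight.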
5. Versioning & Governance
The Zaira Standard is public and stable. Revisions are made only when durable shifts in agent-tool interaction warrant them, not in response to short-term trends. Version changes are announced at least 30 days before they take effect, giving implementers and downstream evaluators time to adapt.
Version Scheme
- v0.9: Current version. Scoring thresholds and weights may be refined based on evaluation data before v1.0.
- v1.0: Stable release with finalized thresholds.
- v1.x (Minor): Additive criteria, refined scoring, weight adjustments. Backward compatible.
- v2.0 (Major): Structural changes (module additions/removals, score band or threshold revisions).
6. Scope & Limitations
The Zaira Standard evaluates agent usability. It does not evaluate:
- Tool quality or fitness for purpose. Whether the tool is good at what it does.
- Security guarantees. A tool can meet all Safety criteria and still have undiscovered vulnerabilities. The standard evaluates evidence and practices, not absence of risk.
- Performance benchmarks. Response time, throughput, and uptime are not evaluated (though health endpoints are).
- Pricing fairness. Whether pricing is transparent and machine-readable is evaluated, not whether it's competitive.
- Training data representation. How well current models know a tool is not under the tool's control.
- Compliance certifications. SOC 2, ISO 27001, HIPAA, etc. are noted when present but not replicated.
- Runtime enforcement. The standard publishes evaluation data; enforcement is the responsibility of agent runtimes and policy engines.
- Human developer experience. Criteria are evaluated from the agent's perspective.
- Discoverability. Whether agents can find a tool is outside the standard's scope.
7. Base Standard
15 criteria that apply to every tool, regardless of type.
The Base Standard evaluates the fundamentals: whether an agent can learn to use a tool, whether it is safe to depend on, and whether it will remain available in the future. Every tool (from a 50-line utility library to a cloud platform) is evaluated against these criteria.
The Base Standard has no MUST gates. Simple tools meet the Base Standard or do not, based on overall score. MUST gates appear in complexity modules, where failure has greater operational impact.
Documentation & Usability
Documentation and installation quality.
Documentation and installation are the entry point. An agent encountering a tool for the first time needs machine-readable, structured, self-contained information to understand capabilities and usage, and a friction-free path to get it installed and working. More documentation is not better. Irrelevant docs actively harm agent performance. Quality, structure, and machine-readability matter more than volume.
B1 Machine-Readable Documentation Formats
Whether documentation is available in formats agents can directly consume, not just human-rendered HTML requiring JavaScript.
| Score | Description |
|---|---|
| 0. Failing | Interactive-only docs (Swagger UI without downloadable spec); video tutorials; CSS-styled HTML requiring JS rendering |
| 1. Basic | Docs available as static HTML or Markdown; README with usage instructions |
| 2. Good | Comprehensive Markdown docs; llms.txt present; docs available via content negotiation or direct download |
| 3. Excellent | llms.txt + AGENTS.md + content negotiation (Accept: text/markdown) + version-matched bundled docs or MCP documentation server |
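
Content negotiation, as referenced in the rubric, means an agent asks for Markdown through the `Accept` header instead of scraping rendered HTML. The sketch below shows one way an agent might probe for it using only the Python standard library; the function names and the example URL are hypothetical, and whether a given documentation server honors the header is an assumption to verify per tool.

```python
from typing import Optional
from urllib.request import Request, urlopen


def build_markdown_request(url: str) -> Request:
    """Build a GET request that asks the server for a Markdown representation."""
    return Request(url, headers={"Accept": "text/markdown"})


def fetch_markdown(url: str, timeout: float = 10.0) -> Optional[str]:
    """Return the body if the server answered with Markdown, else None.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses (for
    example 406 Not Acceptable); callers may want to catch it.
    """
    with urlopen(build_markdown_request(url), timeout=timeout) as response:
        if response.headers.get_content_type() != "text/markdown":
            return None  # server ignored the Accept header
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# Hypothetical usage; docs.example.com is a placeholder host.
# body = fetch_markdown("https://docs.example.com/guide")
```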
B2 Code Example Coverage & Quality
The density, quality, realism, and progressive complexity of code examples.
| Score | Description |
|---|---|
| 0. Failing | No code examples; or examples use placeholder data ("string", 123) |
| 1. Basic | At least one example per major feature; examples use realistic data |
| 2. Good | Examples for most features/methods; include both success and error scenarios; copy-pasteable |
| 3. Excellent | Progressive complexity (minimal → common → advanced); 1–5 examples per feature focusing on ambiguous cases; error recovery examples included |
B3 Documentation Structure & Self-Containment
Whether documentation sections are self-contained (extractable independently), structured for agent consumption, and appropriately chunked.
| Score | Description |
|---|---|
| 0. Failing | Documentation is a single monolithic page; no logical sections; requires full-document context to understand any part |
| 1. Basic | Logical section divisions with headings; most sections are readable independently |
| 2. Good | Answer-first format; descriptive headings that function as queries; self-contained sections of 100–200 words; cross-references include inline context |
| 3. Excellent | All sections independently extractable; tables for structured data; critical information in first 30% of each section; optimized for retrieval-augmented generation |
B4 Documentation Accuracy & Synchronization
Whether documentation accurately reflects actual tool behavior. Accuracy takes precedence over recency. Documentation that has been unchanged for an extended period but accurately describes current tool behavior scores higher than recently updated documentation that contains errors.
| Score | Description |
|---|---|
| 0. Failing | Documented behavior contradicts actual tool behavior; or documented methods/functions don't exist; or docs describe a different version than the current release |
| 1. Basic | No known major inaccuracies; documented examples produce expected results |
| 2. Good | Docs-as-code (versioned alongside product); documented examples tested; version-matched (docs specify which version they describe) |
| 3. Excellent | Docs updated with every release; CI/CD blocks deployment without doc updates; automated drift detection |
B5 Getting Started Completeness
Whether an agent can go from zero to working usage using only the documentation.
| Score | Description |
|---|---|
| 0. Failing | No getting-started guide; or guide requires significant external knowledge |
| 1. Basic | Getting-started guide exists; covers basic setup |
| 2. Good | Guide is completable by following docs alone without external knowledge; includes installation, first usage, and expected output |
| 3. Excellent | Agent-specific quickstart or integration guide; includes common pitfalls; minimal viable example under 20 lines of code; first successful usage achievable in <5 minutes |
B6 Changelog & Migration Guidance
Whether changes are communicated in structured, parseable formats.
| Score | Description |
|---|---|
| 0. Failing | No changelog; or changelog is unstructured prose buried in blog posts |
| 1. Basic | Changelog exists with dated entries |
| 2. Good | Structured and parseable changelog (consistent format); semantic versioning; breaking changes clearly marked |
| 3. Excellent | Changelog available as structured data (JSON, RSS/Atom feed); migration guides for breaking changes; deprecated markers in code or specs |
B7 Installation & Configuration Simplicity
How easily an agent can install, set up, and start using a tool.
| Score | Description |
|---|---|
| 0. Failing | GUI installer required; complex multi-step build process; missing executables with no clear resolution |
| 1. Basic | Package manager install (npm install, pip install); basic documentation |
| 2. Good | Single command install; environment variable configuration; clear error messages on misconfiguration |
| 3. Excellent | Single static binary or zero-dependency install; zero-config usage possible; scaffolding tools for project setup; JSON Schema for configuration validation |
Safety Fundamentals
Supply chain integrity and vulnerability disclosure.
Two safety criteria apply to every tool regardless of type: supply chain integrity and vulnerability disclosure. Together they establish the minimum baseline: provenance verifiability and a defined path for reporting security issues.
B8 Supply Chain Integrity
Whether publisher identity is verified, releases are signed, and the supply chain is tamper-evident.
Open-source tools: evaluate this section:
| Score | Description |
|---|---|
| 0. Failing | Anonymous or unclear maintainer identity; no integrity signals |
| 1. Basic | Verified publisher on npm/PyPI; GitHub org verification; some identity signals present |
| 2. Good | Signed tags or releases; dependency pinning (lock files); verified domain → repo → artifact chain |
| 3. Excellent | Sigstore/cosign signing; SLSA provenance attestation; reproducible builds; SBOM published; complete identity chain (domain → repo → artifact maintainer match) |
Trust signal hierarchy: Reproducible builds > SLSA attestation > Code signing > Verified domain > GitHub org verification > npm/PyPI verified publisher > SBOM > security.txt
Commercial/closed-source tools: evaluate this section instead:
| Score | Description |
|---|---|
| 0. Failing | No verifiable publisher identity; SDK or agent distributed through unofficial channels; no integrity signals |
| 1. Basic | Verified company domain; SDK published under verified org on npm/PyPI; official distribution channels clearly identified |
| 2. Good | SDKs signed or published with verified provenance; official distribution channels documented; dependency pinning in SDK; checksums for downloadable artifacts |
| 3. Excellent | SDKs with code signing and provenance attestation; SOC 2 Type II or equivalent supply chain controls; SBOM for SDK dependencies; documented build and release security practices |
B9 Vulnerability Disclosure & Security Contact
Whether a clear, machine-readable path exists for reporting security vulnerabilities.
| Score | Description |
|---|---|
| 0. Failing | No clear vulnerability reporting path |
| 1. Basic | Generic security contact exists (email address, contact form) |
| 2. Good | Published vulnerability disclosure policy with clear process and timeline commitments |
| 3. Excellent | security.txt present (RFC 9116) at /.well-known/security.txt; published vulnerability disclosure policy; bug bounty program; clear response commitments (acknowledgment, assessment, and fix timelines) |
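
An evaluator could check the machine-readable part of this criterion by fetching `/.well-known/security.txt` and confirming the two fields RFC 9116 requires, `Contact` and `Expires`. The parser below is a minimal sketch with hypothetical function names; it ignores OpenPGP signature handling and does not validate field values beyond the expiry timestamp.

```python
from datetime import datetime, timezone
from typing import Optional

REQUIRED_FIELDS = ("contact", "expires")  # required by RFC 9116


def parse_security_txt(text: str) -> dict[str, list[str]]:
    """Collect 'Name: value' fields, skipping comments and blank lines."""
    fields: dict[str, list[str]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, separator, value = line.partition(":")
        if not separator:
            continue  # not a field line
        fields.setdefault(name.strip().lower(), []).append(value.strip())
    return fields


def missing_required_fields(fields: dict[str, list[str]]) -> list[str]:
    return [name for name in REQUIRED_FIELDS if name not in fields]


def is_expired(fields: dict[str, list[str]], now: Optional[datetime] = None) -> bool:
    """Treat a missing Expires field as expired in this sketch."""
    expires = fields.get("expires")
    if not expires:
        return True
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    stamp = datetime.fromisoformat(expires[0].replace("Z", "+00:00"))
    return stamp <= (now or datetime.now(timezone.utc))
```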
Lifecycle Health
Long-term reliability and continuity.
These criteria evaluate long-term viability: whether a tool will continue to work, be maintained, and remain safe to depend on. A tool that scores well at evaluation but is abandoned soon after offers limited long-term reliability for agent dependency.
Open-source and commercial tools have fundamentally different risk profiles. An open-source tool's risk is contributor abandonment; a commercial tool's risk is corporate sunset or acquisition. Where this distinction matters, criteria provide separate rubrics. For open-source tools, most criteria are fully automatable from public data (GitHub API, package registries, OpenSSF Scorecard).
B10 Project Sustainability
The likelihood that this tool will continue to be maintained and supported over time.
Open-source tools: evaluate this section:
| Score | Description |
|---|---|
| 0. Failing | Bus factor of 1 (single contributor accounts for >50% of contributions); or no commits in 12+ months with open issues |
| 1. Basic | Bus factor of 2–3; some contributor diversity |
| 2. Good | Bus factor of 4–10; multiple active contributors; no single contributor >50% of recent commits |
| 3. Excellent | Bus factor >10; organizational backing; contributor pipeline visible (new contributors joining) |
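
The rubric defines a bus factor of 1 as a single contributor accounting for more than 50% of contributions. One common generalization, assumed here and not stated by the standard, is the smallest number of contributors whose combined contributions exceed 50%. The function name and input shape are hypothetical.

```python
def bus_factor(commits_by_contributor: dict[str, int]) -> int:
    """Smallest number of top contributors who together exceed 50% of commits."""
    total = sum(commits_by_contributor.values())
    if total <= 0:
        raise ValueError("at least one commit is required")
    running = 0
    ranked = sorted(commits_by_contributor.values(), reverse=True)
    for rank, count in enumerate(ranked, start=1):
        running += count
        if running * 2 > total:  # strictly more than half
            return rank
    raise AssertionError("unreachable: all contributors exceed 50% together")
```

The bus factor alone does not determine the score: the rubric also considers organizational backing, contributor diversity, and commit recency alongside open issues.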
Commercial/closed-source tools: evaluate this section instead:
| Score | Description |
|---|---|
| 0. Failing | No visible team or organization; single-person operation with no stated continuity plan |
| 1. Basic | Established company; identifiable team; product actively marketed |
| 2. Good | Company with public funding or revenue signals; dedicated product team; published product roadmap |
| 3. Excellent | Publicly traded or well-funded company; product is a core revenue line (not a side project); published sunset/migration policy; data export API |
B11 Maintenance Health
Whether the tool demonstrates operational health. Active health is not synonymous with recent commit activity: it is evidence that the tool continues to function and that unresolved issues receive maintainer response. A project with no open issues and no recent commits satisfies this criterion; a project with numerous unanswered issues and no recent commits does not. The distinction is measurable.
| Score | Description |
|---|---|
| 0. Failing | Unanswered issues accumulating (>10 open issues with no maintainer response in 90+ days); or unpatched known vulnerabilities >90 days old; or fails to install on current LTS runtimes |
| 1. Basic | Open issues receive some response; no unpatched critical vulnerabilities; installs and runs on current platforms |
| 2. Good | <7 days median issue response when issues exist; dependencies up to date or pinned to non-vulnerable versions; CI passing on current runtimes |
| 3. Excellent | <48 hours issue triage; proactive dependency updates; CI tests against multiple runtime versions; clear triage labels; published response time commitments. OR for mature stable projects: zero unpatched vulnerabilities; CI passing on current LTS runtimes; <7 day response on the last 5 issues filed (whenever they were filed); dependencies pinned to non-vulnerable versions; no open issues older than 180 days without maintainer response |
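
The score-0 conditions are all measurable, which is what lets a dormant but healthy project pass. The sketch below encodes only the Failing row; `HealthSnapshot` and its field names are hypothetical, and how each value is collected (issue tracker API, vulnerability database, install test) is left open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HealthSnapshot:
    # Open issues with no maintainer response in 90+ days.
    stale_unanswered_issues: int
    # Age in days of the oldest unpatched known vulnerability, if any.
    oldest_unpatched_vuln_days: Optional[int]
    # Result of an install test on current LTS runtimes.
    installs_on_current_lts: bool


def fails_maintenance_health(snapshot: HealthSnapshot) -> bool:
    """True if any score-0 condition from the B11 rubric holds."""
    too_many_stale = snapshot.stale_unanswered_issues > 10
    old_vulnerability = (
        snapshot.oldest_unpatched_vuln_days is not None
        and snapshot.oldest_unpatched_vuln_days > 90
    )
    return too_many_stale or old_vulnerability or not snapshot.installs_on_current_lts
```

Commit recency is deliberately absent from the inputs: a project with no recent commits, no stale issues, and no unpatched vulnerabilities does not fail this check.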
B12 Semver Adherence & Version Stability
Whether breaking changes are confined to major versions and the tool follows predictable versioning.
| Score | Description |
|---|---|
| 0. Failing | No versioning strategy; breaking changes in minor/patch releases; or perpetually pre-1.0 with breaking changes |
| 1. Basic | Versioned releases exist; some semver adherence |
| 2. Good | Semver-compliant; breaking changes in major versions only; pre-1.0 tools clearly labeled as unstable |
| 3. Excellent | Strict semver; documented API stability guarantees; machine-readable compatibility matrices; LTS versions for production use |
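
Whether breaking changes are confined to major versions can be checked release by release. The sketch below assumes plain `MAJOR.MINOR.PATCH` version strings (an optional leading `v`, pre-release, and build suffixes are stripped) and that the evaluator already knows whether a release contains a breaking change; all function names are hypothetical, and pre-1.0 releases would need separate handling.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse 'v1.2.3', '1.2.3-rc.1', or '1.2.3+build' into (1, 2, 3)."""
    core = version.removeprefix("v").split("+", 1)[0].split("-", 1)[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return major, minor, patch


def bump_kind(old: str, new: str) -> str:
    """Classify a release as a 'major', 'minor', or 'patch' bump."""
    before, after = parse_version(old), parse_version(new)
    if after <= before:
        raise ValueError("new version must be greater than old version")
    if after[0] > before[0]:
        return "major"
    if after[1] > before[1]:
        return "minor"
    return "patch"


def breaks_semver(old: str, new: str, has_breaking_change: bool) -> bool:
    """True if a breaking change shipped in a minor or patch release."""
    return has_breaking_change and bump_kind(old, new) != "major"
```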
B13 Governance & Continuity
Whether the tool has governance structures that reduce single-entity risk and provide continuity assurance.
Open-source tools: evaluate this section:
| Score | Description |
|---|---|
| 0. Failing | Single individual maintainer with no organizational backing; no succession plan |
| 1. Basic | Multiple maintainers with informal governance; or backed by a single company |
| 2. Good | Open governance model; contributor guidelines; decision-making process documented; multiple organizational contributors |
| 3. Excellent | Foundation governance (CNCF, Apache, Linux Foundation); formal succession planning; multiple organizational contributors with commit rights |
Commercial/closed-source tools: evaluate this section instead:
| Score | Description |
|---|---|
| 0. Failing | No public information about the company or team; no terms of service addressing continuity |
| 1. Basic | Established company with identifiable leadership; standard terms of service |
| 2. Good | Published data portability/export mechanisms; documented SLA; company financials or funding publicly known |
| 3. Excellent | Publicly traded or independently audited financials; published sunset policy with migration timeline commitments; data escrow or open-source fallback clause; contractual SLA with uptime guarantees |
B14 Security Track Record
Vulnerability response speed and proactive security practices.
Open-source tools: evaluate this section:
| Score | Description |
|---|---|
| 0. Failing | Known unpatched vulnerabilities >90 days old; no security response history; OpenSSF Scorecard <3/10 |
| 1. Basic | Vulnerabilities patched within 90 days; some security practices visible; OpenSSF Scorecard 3–5/10 |
| 2. Good | Vulnerabilities patched within 30 days; code review enforced; branch protection enabled; OpenSSF Scorecard 5–7/10 |
| 3. Excellent | Vulnerabilities patched within 14 days; comprehensive security practices; CI security scanning; OpenSSF Scorecard >7/10; Code-Review check passing |
Commercial/closed-source tools: evaluate this section instead:
| Score | Description |
|---|---|
| 0. Failing | No public security information; no evidence of security practices; known incidents with no public response |
| 1. Basic | Security contact or security.txt exists; incidents acknowledged publicly; some security practices described on website |
| 2. Good | Published security practices page; SOC 2 Type I or equivalent; vulnerability disclosure policy with timeline commitments; incident post-mortems published |
| 3. Excellent | SOC 2 Type II or ISO 27001 certified; bug bounty program; incident post-mortems with root cause analysis; proactive security advisories; SDK dependencies regularly audited |
B15 Terms & Licensing Stability
Whether the terms under which the tool is available are stable and free from change risk signals.
Open-source tools: evaluate this section:
| Score | Description |
|---|---|
| 0. Failing | No license specified; or non-standard/proprietary license with no stability commitment |
| 1. Basic | OSI-approved license |
| 2. Good | Stable OSI license with no change risk signals (no single-company >80% commits + broad CLA + cloud competition pattern) |
| 3. Excellent | Stable license + none of the known change risk indicators; or irrevocable license grant; foundation-held copyright |
Commercial/closed-source tools: evaluate this section instead:
| Score | Description |
|---|---|
| 0. Failing | No published terms of service; or terms allow unilateral changes with no notice |
| 1. Basic | Published terms of service; clear commercial licensing terms |
| 2. Good | Pricing commitments of 12+ months; terms require 90+ days notice for material changes; grandfathering policy for existing customers |
| 3. Excellent | Multi-year pricing commitments or published pricing history demonstrating stability; contractual protection against adverse term changes; machine-readable pricing API; published API deprecation policy with 12+ month windows |
8. Complexity Modules
Complexity modules add criteria based on how the tool is accessed and what it does. Each module is activated by a yes/no trigger question. A tool may activate zero, one, or many complexity modules: the triggers are independent.
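A tool that activates at least one complexity module is assessed against the Agent Ready and Agent Native bands (see §4); which band it reaches, if any, depends on its overall score, MUST gates, and zero-score count. Tools evaluated on the Base Standard alone are assessed against the Agent-Ready Base band.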
8.1 Module: Programmatic Interface
Trigger: Does the tool expose an API (REST, GraphQL, gRPC), SDK, or MCP server?
16 criteria. Evaluates the quality and safety of the agent-facing programmatic interface: descriptions, schemas, outputs, naming, protocol support, and interface-level security.
Tool description quality is identified across 13+ independent sources as the factor with the greatest single impact on agent success rates. Across description quality, schema design, output size, and naming, minimalism correlates with improved agent performance: fewer tools, tighter schemas, smaller outputs, and more precise names each improve measured outcomes. Interface-level security (input sanitization, prompt injection resistance) applies to any programmatic interface, whether read-only or read-write.
PI1 Interface Reference Completeness
Whether the programmatic interface (API endpoints, SDK methods, MCP tools) is documented with sufficient detail for an agent to use without guessing.
| Score | Description |
|---|---|
| 0. Failing | No interface documentation; or docs exist but cover <50% of methods/endpoints |
| 1. Basic | Methods/endpoints are documented; >50% have basic descriptions |
| 2. Good | >80% of methods/endpoints have request/response examples; parameter types and constraints documented |
| 3. Excellent | 100% coverage with examples, edge cases, and error scenarios documented per method/endpoint; parameter constraints include formats, ranges, and valid values |
PI2 Tool/Endpoint Description Quality
The completeness, specificity, and actionability of descriptions attached to tools, functions, or API endpoints.
| Score | Description |
|---|---|
| 0. Failing | Missing descriptions, name restatement ("Gets data"), no parameter descriptions |
| 1. Basic | Basic description of what tool/endpoint does; some parameter documentation |
| 2. Good | Specific descriptions with when-to-use guidance; all parameters described with types and examples; 1–2 usage examples per tool |
| 3. Excellent | When-to-use AND when-NOT-to-use; inline examples with realistic data; enum values listed; return format documented; 1–5 examples focusing on ambiguous cases; edge cases noted |
PI3 Tool Count & Surface Area Management
The number of tools/endpoints exposed to an agent at once, and whether mechanisms exist to manage surface area.
| Score | Description |
|---|---|
| 0. Failing | 50+ undifferentiated tools; each tool = one REST endpoint (API-mirroring pattern); or 31–49 tools with no meaningful grouping or surface area management |
| 1. Basic | ≤30 tools; some logical grouping |
| 2. Good | 5–15 focused tools designed around user outcomes (not API operations); logical grouping |
| 3. Excellent | 5–15 tools + dynamic discovery/deferred loading for larger catalogs; code-mode pattern for complex APIs; semantic search over tool catalog |
PI4 Input Schema Design
The degree to which input validation prevents common agent errors through strict schemas, constrained formats, and actionable feedback.
| Score | Description |
|---|---|
| 0. Failing | No schema validation; loose types; accepts malformed input silently |
| 1. Basic | Basic JSON Schema with types; some required field marking |
| 2. Good | Strict schemas with enums for constrained fields; format examples; ≤3 nesting levels; all properties documented |
| 3. Excellent | Flat top-level primitives; comprehensive enum/default/description; ≤500 tokens per tool schema; strict: true compatible; additionalProperties: false |
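Illustrative sketch (non-normative): the PI4 "Good" and "Excellent" rows can be partially checked mechanically. The schema below and the helper names `nesting_depth` and `lint_schema` are hypothetical, invented for this example; the checks cover only nesting depth, `additionalProperties: false`, and per-property descriptions, not the token budget.

```python
from typing import Any

# Hypothetical tool schema: flat primitives, enum/default/description on
# constrained fields, and additionalProperties set to false.
CREATE_INVOICE_SCHEMA: dict[str, Any] = {
    "type": "object",
    "additionalProperties": False,
    "required": ["customer_id", "amount_minor", "currency"],
    "properties": {
        "customer_id": {"type": "string", "description": "Customer ID, e.g. cus_123"},
        "amount_minor": {"type": "integer", "minimum": 1,
                         "description": "Amount in the smallest currency unit"},
        "currency": {"type": "string", "enum": ["usd", "eur", "jpy"],
                     "description": "Lowercase ISO 4217 code"},
        "send_email": {"type": "boolean", "default": False,
                       "description": "Email the invoice on creation"},
    },
}


def nesting_depth(schema: dict[str, Any]) -> int:
    """Count object nesting levels; a flat object of primitives has depth 1.

    Simplification: only nested objects under "properties" are followed,
    so objects inside array "items" are not counted.
    """
    if schema.get("type") != "object":
        return 0
    children = schema.get("properties", {}).values()
    return 1 + max((nesting_depth(child) for child in children), default=0)


def lint_schema(schema: dict[str, Any], max_depth: int = 3) -> list[str]:
    """Return a list of human-readable problems; an empty list means no findings."""
    problems: list[str] = []
    if schema.get("additionalProperties") is not False:
        problems.append("additionalProperties must be false")
    depth = nesting_depth(schema)
    if depth > max_depth:
        problems.append(f"nesting depth {depth} exceeds {max_depth}")
    for name, prop in schema.get("properties", {}).items():
        if "description" not in prop:
            problems.append(f"{name}: missing description")
    return problems
```

A schema that passes these checks is not necessarily "Excellent"; the sketch only flags the most common structural gaps.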
PI5 Output Quality & Token Efficiency
Whether responses contain high-signal, bounded-size data with pagination and format optimization.
| Score | Description |
|---|---|
| 0. Failing | Unbounded responses with no size constraints (100K+ tokens possible); opaque UUIDs without context; no pagination |
| 1. Basic | Pagination exists; typical responses <10K tokens |
| 2. Good | Paginated with cursor metadata (has_more, next_cursor); compact summaries; semantic identifiers; filtering parameters |
| 3. Excellent | Token-budgeted responses; outputSchema defined; concise mode available; cursor-based pagination; CSV/TSV option for tabular data; response size bounded by default |
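Illustrative sketch (non-normative): one way a service might return bounded pages with the cursor metadata named in the "Good" row (`has_more`, `next_cursor`). The opaque base64 cursor encoding and the function names are assumptions made for this example, not a required format.

```python
import base64
import json
from typing import Any, Optional


def encode_cursor(offset: int) -> str:
    """Wrap the next offset in an opaque, URL-safe token (assumed format)."""
    return base64.urlsafe_b64encode(json.dumps({"offset": offset}).encode()).decode()


def decode_cursor(cursor: str) -> int:
    """Recover the offset from a token produced by encode_cursor."""
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))["offset"]


def paginate(items: list[Any], cursor: Optional[str] = None, limit: int = 50) -> dict[str, Any]:
    """Return one bounded page plus the metadata an agent needs to continue."""
    start = decode_cursor(cursor) if cursor else 0
    page = items[start:start + limit]
    end = start + len(page)
    has_more = end < len(items)
    return {
        "data": page,
        "has_more": has_more,
        # next_cursor is null on the last page so the stop condition is explicit.
        "next_cursor": encode_cursor(end) if has_more else None,
    }
```

Because `limit` has a default, the response size is bounded even when the caller passes no pagination parameters.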
PI6 Response Envelope Consistency
Whether all API endpoints return responses in the same structural shape.
| Score | Description |
|---|---|
| 0. Failing | Variable response shapes across endpoints; inconsistent naming; fields omitted when null; type instability (field is sometimes string, sometimes array) |
| 1. Basic | Mostly consistent; some endpoints deviate; same general pattern |
| 2. Good | Consistent envelope structure; consistent naming convention (snake_case or camelCase, not mixed); null fields included as null (scalars) or [] (collections) |
| 3. Excellent | Identical envelope everywhere; automated linting enforces consistency; type stability guaranteed; published response schema |
PI7 Naming & Namespacing
The predictability, distinctiveness, and collision-resistance of tool, function, or endpoint names.
| Score | Description |
|---|---|
| 0. Failing | Generic names like search, get_data, doThing; inconsistent casing; no namespace prefix |
| 1. Basic | Descriptive names; consistent casing (snake_case preferred); no service prefix |
| 2. Good | Service-prefixed snake_case (e.g., stripe_create_charge, github_list_issues) |
| 3. Excellent | Service-prefixed + self-descriptive + unique within multi-server ecosystem + predictable pattern across all tools |
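Illustrative sketch (non-normative): the "Good" row (service-prefixed snake_case) is simple to check automatically. The function name `check_tool_names` and the regular expression are assumptions chosen for this example; uniqueness across a multi-server ecosystem is not checked.

```python
import re

# Lowercase words separated by single underscores, e.g. github_list_issues.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(?:_[a-z0-9]+)*$")


def check_tool_names(names: list[str], service_prefix: str) -> dict[str, list[str]]:
    """Map each non-conforming tool name to the list of rules it breaks."""
    problems: dict[str, list[str]] = {}
    for name in names:
        issues: list[str] = []
        if not SNAKE_CASE.match(name):
            issues.append("not snake_case")
        if not name.startswith(service_prefix + "_"):
            issues.append(f"missing '{service_prefix}_' prefix")
        if issues:
            problems[name] = issues
    return problems
```

For example, `check_tool_names(["stripe_create_charge", "search"], "stripe")` reports only `search`, because it lacks the service prefix.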
PI8 Behavioral Metadata & Annotations
Machine-readable metadata declaring whether a tool is read-only, destructive, idempotent, and whether it interacts with external entities.
| Score | Description |
|---|---|
| 0. Failing | No behavioral annotations; all tools appear equivalent |
| 1. Basic | readOnlyHint set on obvious read-only tools |
| 2. Good | All four MCP annotations set accurately (readOnlyHint, destructiveHint, idempotentHint, openWorldHint) on every tool |
| 3. Excellent | Full annotations + output annotations (audience, priority) + risk ratings + HTTP method semantics matching behavior (GET = read-only, DELETE = destructive) |
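Illustrative sketch (non-normative): a minimal audit for the "Good" row, which expects all four MCP annotations on every tool. The tool definitions are represented here as plain dictionaries with an `annotations` key; the function name `audit_annotations` and the contradiction rule are assumptions for this example.

```python
from typing import Any

REQUIRED_HINTS = ("readOnlyHint", "destructiveHint", "idempotentHint", "openWorldHint")


def audit_annotations(tool: dict[str, Any]) -> list[str]:
    """Return problems found in one tool definition's behavioral annotations."""
    annotations = tool.get("annotations", {})
    # Every hint must be an explicit boolean, not merely left to a default.
    problems = [
        f"{hint} not set"
        for hint in REQUIRED_HINTS
        if not isinstance(annotations.get(hint), bool)
    ]
    # A tool cannot plausibly be both read-only and destructive.
    if annotations.get("readOnlyHint") is True and annotations.get("destructiveHint") is True:
        problems.append("readOnlyHint and destructiveHint are both true")
    return problems
```

This checks presence and internal consistency only; whether each hint is accurate still requires comparing it against the tool's actual behavior.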
PI9 MCP Implementation Quality
When an MCP server exists (official or community), the quality of that implementation.
| Score | Description |
|---|---|
| 0. Failing | No MCP server exists (official or community); OR MCP server exists but has critical quality issues: 50+ undifferentiated tools, no descriptions, command injection vulnerabilities, no error handling |
| 1. Basic | Tools have descriptions; auth documented; read and write operations present |
| 2. Good | 5–20 focused tools designed around outcomes (not API mirroring); all four MCP annotations set accurately; read-only and read-write tools clearly separated; documented auth |
| 3. Excellent | All of above + deferred loading / dynamic discovery for large catalogs; safety-tiered tools (read/write/destructive separated); read-only mode available; output annotations; supported clients documented; maintained alongside product releases |
Always evaluated: PI9 is included in the evaluation of every tool that triggers the Programmatic Interface module and is never excluded from the denominator. Tools without an MCP server score 0 on PI9. Because PI9 is a MAY criterion, a score of 0 has limited impact on reaching the Agent Ready band, where MAY criteria primarily contribute to the overall percentage. It meaningfully affects the Agent Native band, which requires that no SHOULD or MUST criterion scores 0 and where MAY criteria still count toward the ≥80% overall threshold. In practice, a tool can reach Agent Ready without MCP, but Agent Native demands either an MCP server or enough excellence elsewhere to absorb the PI9 zero. This creates a directional incentive toward MCP adoption without penalizing non-adoption at the Agent Ready band.
PI10 Programmatic Setup / Time to First API Call
The amount of tool-specific configuration required after an agent has valid credentials (or no credentials are needed) before it can make a successful API call. This criterion measures post-credential setup quality: how well the tool minimizes friction between credential acquisition and a successful first API call.
Separation of concerns with AU1: Credential acquisition (account creation, key generation, OAuth setup) is evaluated under AU1 (Non-Interactive Authentication Methods). PI10 measures everything after authentication is solved. A tool that requires browser-based account creation is already penalized on AU1; PI10 does not double-count that friction.
| Score | Description |
|---|---|
| 0. Failing | >10 minutes post-credential setup; multiple dashboard-only configuration steps required before first API call; tool-specific configuration requires human interaction |
| 1. Basic | 5–10 minutes; 1–2 tool-specific configuration steps (project creation, API enablement, webhook setup) |
| 2. Good | 2–5 minutes; single environment variable or config file; sandbox/test mode available immediately; clear error on misconfiguration |
| 3. Excellent | <2 minutes; zero-config possible for basic usage; test/sandbox works immediately with credentials alone; config validation with actionable errors; programmatic project setup via API |
PI11 API Workflow Coverage
The percentage of common workflows completable entirely through the API without requiring web dashboard interaction.
| Score | Description |
|---|---|
| 0. Failing | Core functionality requires dashboard; API covers <50% of common workflows |
| 1. Basic | Core CRUD operations available via API; some configuration requires dashboard |
| 2. Good | >80% of common workflows completable via API; dashboard-only steps documented |
| 3. Excellent | 100% of functionality available via API; no dashboard-only features for any common workflow |
PI12 Versioning & API Stability
Whether the API uses explicit versioning with adequate deprecation signals and managed breaking changes.
| Score | Description |
|---|---|
| 0. Failing | No versioning strategy; unannounced breaking changes |
| 1. Basic | Version identifier exists (URL path, header, or parameter); some deprecation notices |
| 2. Good | Explicit versioning with documented deprecation policy; deprecated: true in specs; 6+ month deprecation windows |
| 3. Excellent | Semver-adherent; Sunset headers; machine-readable deprecation timeline; previous version maintained for 12+ months after deprecation |
PI13 SDK Availability & Quality
Whether official SDKs exist in languages agents commonly use, and whether they're well-maintained.
| Score | Description |
|---|---|
| 0. Failing | No SDK; raw HTTP only |
| 1. Basic | Official SDK in 1 major language (Python or TypeScript/JavaScript) |
| 2. Good | Official SDKs in 2+ major languages; idiomatic to each; typed interfaces |
| 3. Excellent | SDKs in 4+ languages; type-safe with branded types; auto-generated from OpenAPI spec; maintained in sync with API releases |
PI14 Agent Protocol Availability
Whether the tool provides high-quality programmatic interfaces for agent interaction. A well-designed REST API, an MCP server, or both are valid paths.
| Score | Description |
|---|---|
| 0. Failing | No programmatic interface; GUI/dashboard only; or API exists but is undocumented |
| 1. Basic | REST API with basic documentation; OR community MCP server exists |
| 2. Good | Well-designed REST API with OpenAPI spec and SDKs in 2+ languages; OR official MCP server with documented auth and core operation coverage |
| 3. Excellent | Excellent REST API with comprehensive SDKs AND official MCP server; OR one interface executed at exceptional quality (e.g., Stripe-quality API without MCP, or best-in-class MCP without REST) |
PI15 Input Sanitization & Injection Resistance
Whether the tool demonstrates evidence of input sanitization and defense against injection attacks through schema design, security infrastructure, documentation, and architectural patterns. Because PI15 is a MUST gate, a tool that exposes a programmatic interface with no evidence of input sanitization cannot reach the Agent Ready or Agent Native band regardless of total score.
| Score | Description |
|---|---|
| 0. Failing | No evidence of input sanitization: no schema validation, no parameterized queries, no security infrastructure, no documentation of input handling practices |
| 1. Basic | Basic input validation evidenced: strict input schemas with type checking (from PI4); parameterized queries documented; or WAF/CDN security infrastructure detected |
| 2. Good | Comprehensive sanitization evidence: strict schemas with additionalProperties: false across all endpoints; parameterized operations documented throughout; security infrastructure present; input validation practices documented |
| 3. Excellent | All of above + allowlist-based input validation documented where feasible; security testing in CI (detected via workflow analysis); defense-in-depth architecture documented; vendor-provided security assessment or third-party audit results available |
PI16 Prompt Injection Resistance
Tool-level defense-in-depth against prompt injection: strict schemas, output sanitization, injection-resistant designs.
| Score | Description |
|---|---|
| 0. Failing | Unsafe patterns present or encouraged; no awareness of injection risks; tool descriptions contain narrative or references to other tools |
| 1. Basic | Strict input schema validation; parameterized operations; minimal description surface |
| 2. Good | Output structured with clear field boundaries (JSON); response size limits; descriptions concise and self-contained; no cross-tool references in descriptions |
| 3. Excellent | Explicit design mitigations documented; policy layer for untrusted content; output validation; structured action metadata; separation of untrusted content from control flow |
8.2 Module: Network Service
Trigger: Is the tool a hosted or remote service (SaaS, PaaS, cloud API)?
8 criteria. Evaluates concerns specific to services that run remotely: error handling, rate limits, health endpoints, observability, sandboxing, environment separation, asynchronous operations, and data portability and pricing transparency.
NS1 Error Response Quality & Structure
Whether error responses provide structured, machine-parseable information enabling agents to diagnose problems, determine retryability, and execute recovery actions.
| Score | Description |
|---|---|
| 0. Failing | HTML error pages; empty responses; generic "Something went wrong"; silent failures (no error flag set) |
| 1. Basic | JSON errors with human-readable message and machine-readable error code; MCP errors set isError: true |
| 2. Good | RFC 9457 compliant (type, title, status, detail); all validation errors reported simultaneously (not one-at-a-time); field-level identification; doc_url per error type |
| 3. Excellent | All of above + is_retriable boolean + retry_after_seconds + suggested alternative actions + hierarchical error taxonomy (e.g., Stripe: type → code → decline_code) + numbered recovery steps |
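Illustrative sketch (non-normative): an error body combining the RFC 9457 members from the "Good" row with the `is_retriable` and `retry_after_seconds` extensions from the "Excellent" row. The function name, the `errors` array layout, and the example.com type URI used in the test are hypothetical.

```python
from typing import Any, Optional


def problem_response(
    status: int,
    type_uri: str,
    title: str,
    detail: str,
    is_retriable: bool,
    retry_after_seconds: Optional[int] = None,
    field_errors: Optional[list[dict[str, str]]] = None,
) -> tuple[int, dict[str, str], dict[str, Any]]:
    """Build (status, headers, body) for a machine-parseable error response."""
    body: dict[str, Any] = {
        # Core RFC 9457 problem-details members.
        "type": type_uri,
        "title": title,
        "status": status,
        "detail": detail,
        # Extension member: tells the agent whether retrying can succeed.
        "is_retriable": is_retriable,
    }
    headers = {"Content-Type": "application/problem+json"}
    if retry_after_seconds is not None:
        body["retry_after_seconds"] = retry_after_seconds
        headers["Retry-After"] = str(retry_after_seconds)
    if field_errors:
        # Report every validation failure at once, with field-level identification.
        body["errors"] = field_errors
    return status, headers, body
```

Returning all field errors in a single response lets an agent correct its request in one step rather than discovering failures one at a time.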
NS2 Rate Limit Communication
Whether rate limits are communicated proactively and include machine-actionable timing signals.
| Score | Description |
|---|---|
| 0. Failing | No rate limit headers; no Retry-After on 429 responses; undocumented limits |
| 1. Basic | Retry-After on 429 responses; limits documented somewhere |
| 2. Good | Rate limit headers on every response (X-RateLimit-Remaining, X-RateLimit-Limit, X-RateLimit-Reset); per-key limits; scope declared (per-endpoint vs. global) |
| 3. Excellent | Full header suite on all responses; batch endpoints to reduce call count; resource-aware cost metadata (e.g., operationCost: { credits: 5 }); per-agent rate limits |
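Illustrative sketch (non-normative): how an agent-side client might turn the headers above into a wait time. It assumes `Retry-After` carries delta-seconds and `X-RateLimit-Reset` carries a Unix epoch timestamp; services differ on both points, so the function `seconds_until_retry` is an example rather than a general parser.

```python
def seconds_until_retry(status: int, headers: dict[str, str], now: float) -> float:
    """Return how long to wait before the next request (0.0 means send now)."""
    # Header names are case-insensitive, so normalize before lookup.
    h = {key.lower(): value for key, value in headers.items()}
    if status == 429 and "retry-after" in h:
        # Assumption: delta-seconds form; the HTTP-date form is not handled here.
        return float(h["retry-after"])
    if int(h.get("x-ratelimit-remaining", "1")) > 0:
        return 0.0
    # Assumption: reset is an epoch timestamp in seconds.
    reset_at = float(h.get("x-ratelimit-reset", now))
    return max(0.0, reset_at - now)
```

With headers on every response, the client can pause before exhausting its quota instead of learning about the limit from a 429.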
NS3 Health & Status Communication
Whether the tool provides structured health endpoints that agents can query to assess availability.
| Score | Description |
|---|---|
| 0. Failing | No health endpoint; HTML status pages only |
| 1. Basic | /health endpoint returns JSON with aggregate up/down status |
| 2. Good | Component-level status; Retry-After on 503 responses; maintenance schedule available |
| 3. Excellent | Per-dependency status; degradation warnings in response metadata; application/health+json format (IETF Internet-Draft); planned maintenance pre-signaled |
NS4 Audit & Observability
Whether the tool logs agent interactions with sufficient detail for forensic analysis and compliance.
| Score | Description |
|---|---|
| 0. Failing | No meaningful logs or observability for API/tool interactions |
| 1. Basic | Basic request/response logging; API key identified in logs |
| 2. Good | Audit logs with correlation IDs; sensitive-data redaction; rate limits enforced with logged violations; agent identity distinguished from human in logs |
| 3. Excellent | OpenTelemetry-compatible trace/span IDs; immutable append-only audit logs; delegation chain logging; anomaly detection or alerting; per-action risk tier logging |
NS5 Test/Sandbox Environment Support
Whether the tool provides sandbox environments, test keys, and safe experimentation modes.
| Score | Description |
|---|---|
| 0. Failing | No test mode; no sandbox; all mutations occur in the production environment |
| 1. Basic | Test mode exists but with limited simulation |
| 2. Good | Separate sandbox environment + basic behavioral simulation + API-verifiable mode (test responses indicate test mode) |
| 3. Excellent | Structurally distinct test/live keys (prefixed like sk_test_); separate sandbox URLs; full behavioral simulation; multiple sandboxes; time simulation (Stripe test clocks, Neon database branching) |
NS6 Environment Separation
Whether the tool architecturally separates development, staging, and production environments.
| Score | Description |
|---|---|
| 0. Failing | No environment separation; single set of credentials for all environments; test and production data co-mingled |
| 1. Basic | Separate environments exist but share credentials or configuration |
| 2. Good | Distinct credentials per environment; environment clearly indicated in API responses; preview/staging deployments available |
| 3. Excellent | Environment-specific URLs and credentials; database branching for isolated experimentation; deploy previews via API; environment promotion workflow (dev → staging → prod) |
NS7 Asynchronous Operation Support
Whether long-running operations return immediately with a durable handle and provide status mechanisms.
| Score | Description |
|---|---|
| 0. Failing | Long-running operations block until complete; timeouts cause retries with potential duplicates |
| 1. Basic | HTTP 202 Accepted pattern with task/job ID on some operations |
| 2. Good | Consistent async pattern across all long-running operations; polling endpoint with status; estimated_seconds in 202 response |
| 3. Excellent | Full async with lifecycle states (working → completed/failed/cancelled); both polling and webhook notification; blocking result endpoint for simple cases; progress reporting |
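Illustrative sketch (non-normative): a client-side polling loop over the lifecycle states listed in the "Excellent" row. The `get_status` callable stands in for a hypothetical status endpoint returning a dictionary with a `state` field; the field name and the fixed polling interval are assumptions for this example.

```python
import time
from typing import Any, Callable

# States after which the job will not change again.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def poll_until_done(
    get_status: Callable[[], dict[str, Any]],
    interval_seconds: float = 1.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Poll a job handle until it reaches a terminal state or the budget runs out."""
    for _ in range(max_polls):
        job = get_status()
        if job["state"] in TERMINAL_STATES:
            return job
        sleep(interval_seconds)
    raise TimeoutError(f"job still not terminal after {max_polls} polls")
```

Because the handle is durable, a timeout here does not force a resubmission: the agent can resume polling later without risking a duplicate operation.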
NS8 Data Portability & Pricing Transparency
Whether the service provides programmatic access to pricing, usage tracking, and data export. Agents operating autonomously cannot parse marketing pages, "Contact Sales" buttons, or dashboard-only usage tracking: they need structured, machine-readable access to costs, consumption, and data portability.
| Score | Description |
|---|---|
| 0. Failing | No data export capability; pricing only on marketing pages; no usage tracking API |
| 1. Basic | Manual data export (dashboard); published pricing page; basic usage visible in dashboard |
| 2. Good | Programmatic data export API; published pricing with clear unit costs; usage tracking API; billing alerts |
| 3. Excellent | Bulk export API with standard formats (CSV, JSON, Parquet); machine-readable pricing API or structured pricing page; real-time usage tracking; spending limit API; cost estimation before provisioning |
Relationship to HI1 (Cost Guardrails): HI1 evaluates mechanisms to prevent cost overruns (spending limits, auto-stop, cost caps). NS8 evaluates information availability: can agents determine what something costs, how much has been spent, and whether data can be extracted? A tool can score well on NS8 (transparent pricing, usage API) while scoring poorly on HI1 (no spending limits), or vice versa.
8.3 Module: Write Operations
Trigger: Can the tool create, modify, or delete data or resources?
4 criteria. Evaluates safeguards for irreversible actions: destructive operation safety, dry-run capability, idempotency, and multi-step error handling.
WO1 Destructive Operation Safety
Mechanisms that prevent agents from executing irreversible destructive operations without appropriate safeguards.
| Score | Description |
|---|---|
| 0. Failing | No guardrails; agent gets full read/write/delete access by default; no confirmation patterns |
| 1. Basic | Database-level or API-level permissions with agent-specific restricted roles; some operations require confirmation |
| 2. Good | Layered defenses: read-only modes + lexical blocklists (DROP, DELETE, TRUNCATE) + human confirmation gates for high-risk operations; soft delete support |
| 3. Excellent | Physical write prevention (read-only replicas); destructive ops excluded from agent-facing interfaces; structural prevention patterns (auth-capture for payments, plan-apply for infra); draft/preview/publish separation |
WO2 Dry-Run / Validation Capability
Whether the tool provides mechanisms to validate requests without executing them.
| Score | Description |
|---|---|
| 0. Failing | No dry-run or validation capability |
| 1. Basic | Validation endpoint exists for some operations |
| 2. Good | Dry-run parameter or validation endpoint for most mutating operations; returns what would happen without side effects |
| 3. Excellent | Dry-run executes full validation chain (Terraform plan, Kubernetes server-side dry-run); standardized parameter (e.g., validate_only: true per Google AIP-163); diff output showing proposed changes |
WO3 Idempotency & Safe Retry Support
Whether mutating operations accept idempotency keys to prevent duplicate side effects when agents retry failed requests.
| Score | Description |
|---|---|
| 0. Failing | No idempotency support; retries cause duplicate side effects |
| 1. Basic | Idempotency-Key accepted on critical mutating operations |
| 2. Good | Idempotency enforced with 24h+ key persistence; concurrent request handling via locking; documented key behavior |
| 3. Excellent | Comprehensive idempotency across all non-idempotent operations; conflict detection (same key, different params → 409); Stripe-model parameter validation |
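Illustrative sketch (non-normative): server-side handling of an idempotency key with the conflict detection described in the "Excellent" row (same key, different parameters → 409). The class `IdempotencyStore` is hypothetical and deliberately minimal: it keeps keys in memory and omits the key-expiry window and the locking for concurrent requests that the "Good" row requires.

```python
import hashlib
import json
from typing import Any, Callable


class IdempotencyStore:
    """In-memory record of idempotency keys and the responses they produced."""

    def __init__(self) -> None:
        self._seen: dict[str, tuple[str, Any]] = {}

    def execute(
        self,
        key: str,
        params: dict[str, Any],
        operation: Callable[[dict[str, Any]], Any],
    ) -> tuple[int, Any]:
        """Run the operation at most once per key; return (status, response)."""
        # Fingerprint the parameters so a reused key with new params is detectable.
        fingerprint = hashlib.sha256(
            json.dumps(params, sort_keys=True).encode()
        ).hexdigest()
        if key in self._seen:
            saved_fingerprint, saved_response = self._seen[key]
            if saved_fingerprint != fingerprint:
                return 409, {"error": "idempotency key reused with different parameters"}
            # Retry of the same request: replay the stored response, no side effect.
            return 200, saved_response
        response = operation(params)
        self._seen[key] = (fingerprint, response)
        return 200, response
```

The replay branch is what makes agent retries safe: a request repeated after a timeout returns the original result instead of performing the mutation twice.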
WO4 Workflow Error Communication
Whether multi-step operations communicate progress, partial success, and resumability.
| Score | Description |
|---|---|
| 0. Failing | No step-level feedback; atomic success-or-fail with no intermediate state visibility |
| 1. Basic | Failed step identified in error response; no resume capability |
| 2. Good | Completed/failed/pending step enumeration; resume tokens or checkpoint IDs; severity indication (reversible vs. irreversible failure) |
| 3. Excellent | Full checkpoint-based recovery; draft/preview/publish separation; compensating transactions for partial failures; 202 Accepted + polling for multi-step workflows |
8.4 Module: Authentication
Trigger: Does the tool require credentials, API keys, OAuth, or any form of authentication?
4 criteria. Authentication is widely cited as the most persistent unresolved problem in agent-tool interaction. It functions as a binary gate: a tool that satisfies every other criterion but cannot be authenticated by an agent provides no agent utility.
AU1 Non-Interactive Authentication Methods
Whether the tool supports at least one authentication method that agents can complete without human interaction.
| Score | Description |
|---|---|
| 0. Failing | Only browser-based OAuth requiring human interaction; CAPTCHA-gated; 2FA with no bypass for service accounts |
| 1. Basic | API keys available; basic documentation for key usage |
| 2. Good | API keys + Client Credentials grant + M2M documentation + Device Flow for delegated access |
| 3. Excellent | Multiple non-interactive methods + brokered credentials + programmatic key creation/rotation via API |
AU1 is a MUST gate. If a tool requires authentication, support for at least one non-interactive authentication method is required. A score of 0 on AU1 disqualifies the tool from the Agent Ready or Agent Native band regardless of overall score.
AU2 Permission Granularity
How finely the tool allows scoping what an agent can access and do.
| Score | Description |
|---|---|
| 0. Failing | Single admin key with full access; no scoping mechanism |
| 1. Basic | Read/write separation available |
| 2. Good | Per-resource scoped keys + fine-grained OAuth scopes + insufficient permissions error includes required scope |
| 3. Excellent | Per-resource per-operation scoping + machine-readable permission manifests + deny-by-default for destructive operations |
AU3 Credential Lifecycle Management
Whether the tool supports automated credential rotation, refresh, expiry signaling, and per-agent revocation.
| Score | Description |
|---|---|
| 0. Failing | Manual rotation only; no programmatic credential management |
| 1. Basic | API for key creation/rotation + refresh tokens |
| 2. Good | Automatic rotation + zero-downtime overlap + per-key revocation + expiry metadata |
| 3. Excellent | Brokered credentials + dual-secret rotation + proactive refresh guidance + per-key audit trail |
AU4 Agent Identity Support
Whether the tool treats AI agents as a distinct identity type.
| Score | Description |
|---|---|
| 0. Failing | Shared credentials only; no way to distinguish agent from human |
| 1. Basic | Service accounts with some scoping |
| 2. Good | M2M auth with client_credentials + agent-specific rate limits |
| 3. Excellent | Agent as first-class identity type + Token Vault + CIBA + per-action audit trail |
8.5 Module: CLI
Trigger: Does the tool have a command-line interface?
4 criteria. Evaluates agent-specific CLI concerns: non-interactive execution, structured output, cross-platform behavior, and configuration safety.
CLI1 Non-Interactive Execution
The ability to run a tool without any human interaction: no confirmation prompts, no editor invocations, and no TTY-dependent output.
| Score | Description |
|---|---|
| 0. Failing | Tool hangs or crashes without TTY; interactive prompts with no bypass |
| 1. Basic | Some non-interactive flags exist (--yes, --no-input); some prompts remain |
| 2. Good | Non-interactive flags for most prompts; CI mode detection; --json output mode |
| 3. Excellent | Auto-detects non-TTY environment; flags for all interactive points; JSON output implies non-interactive; NO_COLOR=1 support; separate stderr/stdout |
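Illustrative sketch (non-normative): the decision logic a CLI might use to satisfy the "Excellent" row. The flag names come from the table above; treating a non-empty `CI` environment variable as a non-interactive signal is a common convention assumed here, and the function names are hypothetical.

```python
from collections.abc import Mapping, Sequence

# Flags that explicitly or implicitly disable prompting.
NON_INTERACTIVE_FLAGS = {"--yes", "--no-input", "--json"}


def is_interactive(stdin_is_tty: bool, env: Mapping[str, str], flags: Sequence[str]) -> bool:
    """Decide whether the CLI may prompt the user."""
    if NON_INTERACTIVE_FLAGS.intersection(flags):
        return False  # JSON output implies non-interactive mode
    if env.get("CI"):
        return False  # assumed convention: CI systems set CI to a non-empty value
    return stdin_is_tty  # never prompt when no terminal is attached


def use_color(stdout_is_tty: bool, env: Mapping[str, str]) -> bool:
    """Emit ANSI colors only to a terminal and only if NO_COLOR is unset or empty."""
    return stdout_is_tty and not env.get("NO_COLOR")
```

In a real CLI the inputs would come from `sys.stdin.isatty()`, `sys.stdout.isatty()`, `os.environ`, and `sys.argv`; passing them as arguments keeps the logic testable.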
CLI2 Structured Output Mode
Whether the CLI provides machine-parseable output alongside human-readable output. Agents consuming CLI output require structured data that can be parsed reliably. Without structured output, agents fall back to parsing formatted text through regular expressions, which is brittle across tool versions and locales.
| Score | Description |
|---|---|
| 0. Failing | Text-only output; no --json or equivalent flag; ANSI colors/formatting in default output with no disable mechanism; exit code 0/non-zero only with no structured error information |
| 1. Basic | --json or --format json flag available for primary commands; basic exit codes (0 = success, non-zero = failure); stderr and stdout may be mixed |
| 2. Good | JSON output available on all major commands; meaningful exit codes with descriptive stderr; stderr and stdout cleanly separated; NO_COLOR=1 or --no-color supported |
| 3. Excellent | Multiple structured formats (JSON + YAML + custom templates); structured output implies non-interactive mode (Terraform pattern: --json implies --input=false); --porcelain stability guarantee across versions (Git pattern); semantic exit codes (distinct codes for distinct failure modes); consistent JSON schema across CLI versions |
CLI3 Cross-Platform Consistency
Whether the CLI behaves identically across Linux, macOS, and Windows. Agents trained primarily on Linux/macOS generate commands that fail silently on Windows: path separators, line endings, shell syntax, and temp directory locations all differ. A tool that works on one platform but behaves differently on another creates unpredictable agent failures.
| Score | Description |
|---|---|
| 0. Failing | Single-platform only (e.g., bash-only scripts); hard-coded platform-specific paths (/tmp/, C:\); no Windows support |
| 1. Basic | Available on Linux, macOS, and Windows; but behavior or output may differ across platforms; platform-specific installation instructions |
| 2. Good | Cross-platform binary distribution or container; consistent output format across platforms; path handling works with both / and \; no platform-specific shell syntax required |
| 3. Excellent | CI tests on all three major platforms; byte-identical output across platforms; single static binary or zero-dependency install; devcontainer or Nix support for environment reproducibility; platform-specific differences documented |
CLI4 Configuration Format Safety
Whether the tool's configuration format is safe for agent generation. YAML's whitespace sensitivity and implicit type coercion produce subtle, silent failures in agent-generated configuration: single-space indentation errors change data structure without raising syntax errors, and implicit type coercion (the "Norway problem" in which NO becomes false) silently corrupts data. JSON Schema validation materially reduces these failures by enabling agents to validate configuration before applying it.
| Score | Description |
|---|---|
| 0. Failing | YAML-only config with no schema validation; no config validation command; implicit type coercion undocumented |
| 1. Basic | Config format documented; basic structure validation (file parses without error); YAML accepted but JSON alternative available |
| 2. Good | JSON or TOML as primary config format; JSON Schema exists for config files; standalone validation command available (validate, check, lint); actionable error messages on misconfiguration |
| 3. Excellent | JSON or TOML primary with published JSON Schema; schema-driven IDE and agent autocompletion; validation runs automatically before any destructive action; error messages include specific fix suggestions; no implicit type coercion; config secrets isolated from main config file |
9. Domain Modules
Domain modules add criteria based on a tool's functional domains. A tool may trigger one or more domain modules: for example, Supabase (Databases + Auth Providers + Hosting) or Firebase (Databases + Auth Providers + Communications). Each activated domain module adds its criteria to the tool's evaluation, expanding both the numerator and the denominator, as complexity modules do.
9.1 Module: Payments & Financial
Applies to: Payment processors, billing platforms, financial APIs
| ID | Criterion | Req | Weight |
|---|---|---|---|
| PM1 | Idempotency Depth | SHOULD | Standard |
| PM2 | Test Simulation Fidelity | SHOULD | Standard |
| PM3 | Compliance Automation | MAY | Standard |
| PM4 | Currency & Amount Safety | SHOULD | Standard |
PM1 Idempotency Depth
Whether the payment API provides deep idempotency beyond basic key acceptance, including parameter validation, key persistence windows, and concurrent request serialization. Agents retry failed requests frequently; without robust idempotency, retries create duplicate charges.
| Score | Description |
|---|---|
| 0. Failing | No idempotency support; retried requests create duplicate charges |
| 1. Basic | Idempotency key header accepted; duplicate requests return cached response |
| 2. Good | Key acceptance + documented persistence window (e.g., 24 hours) + concurrent request serialization |
| 3. Excellent | Parameter validation (same key + different params → error/409), documented key lifetime, concurrent locking, idempotency across all POST/write endpoints |
PM2 Test Simulation Fidelity
How comprehensively the platform simulates real payment scenarios in test mode, including decline codes, dispute flows, subscription lifecycle, and webhook events. Agents cannot safely learn payment integration on live data.
| Score | Description |
|---|---|
| 0. Failing | No test mode; or test mode limited to basic success/fail with no scenario simulation |
| 1. Basic | Test/sandbox environment with key separation; basic test card numbers for success and generic decline |
| 2. Good | Multiple test cards covering specific decline codes and card brands; webhook forwarding/simulation; isolated test data |
| 3. Excellent | 30+ test cards with specific scenarios; Test Clocks API for time-dependent flows (subscriptions, trials); dispute/refund simulation; CLI event triggering and replay |
PM3 Compliance Automation
Whether the platform automates regulatory compliance burdens (tax calculation, PCI scope reduction, 3DS/SCA flows) so agents don't need jurisdiction-specific knowledge. An agent creating a payment flow should not need to understand VAT rules for 200 countries.
| Score | Description |
|---|---|
| 0. Failing | No compliance automation; agent must manually implement tax calculation, PCI handling, and 3DS flows |
| 1. Basic | Hosted checkout or client-side tokenization reduces PCI scope; basic 3DS support via redirects |
| 2. Good | Built-in tax engine (enable via API); automatic 3DS/SCA handling with machine-readable requires_action status; PCI scope fully eliminated via hosted flows |
| 3. Excellent | Merchant of Record model (platform handles all tax, compliance, remittance); or built-in tax engine covering 200+ markets with threshold monitoring and VAT ID validation |
PM4 Currency & Amount Safety
Whether the API prevents currency-related agent errors through clear unit documentation, smallest-unit enforcement, zero-decimal currency handling, and validation of ambiguous amounts. Currency math errors are among the highest-impact agent errors in payment systems.
| Score | Description |
|---|---|
| 0. Failing | Ambiguous amount units (unclear if cents or dollars); no zero-decimal currency handling; no minimum amount enforcement |
| 1. Basic | Documentation states amounts are in smallest currency unit; minimum charge amount enforced |
| 2. Good | Explicit unit in API responses; zero-decimal currencies (JPY) and three-decimal currencies (BHD) documented; amount validation with clear error messages |
| 3. Excellent | Currency-aware validation rejecting ambiguous amounts; explicit decimal count per currency in API metadata; auth-capture pattern support for human review before charge |
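
As a rough sketch of currency-aware validation, the function below converts a decimal string into integer smallest units and rejects amounts that are ambiguous for the given currency. The function name `to_minor_units` and the small exponent table are assumptions for illustration; the exponents shown (USD and EUR 2, JPY 0, BHD 3) follow ISO 4217, but a real API would carry the full table.

```python
from decimal import Decimal

# Illustrative subset of ISO 4217 minor-unit exponents.
MINOR_UNIT_EXPONENT = {"USD": 2, "EUR": 2, "JPY": 0, "BHD": 3}


def to_minor_units(amount, currency):
    """Convert a decimal string such as "10.50" into integer minor units."""
    code = currency.upper()
    if code not in MINOR_UNIT_EXPONENT:
        raise ValueError(f"unknown currency: {currency}")
    if not isinstance(amount, str):
        # Floats such as 10.1 cannot be represented exactly; require strings.
        raise TypeError("pass the amount as a decimal string to avoid float rounding")
    value = Decimal(amount)
    if not value.is_finite():
        raise ValueError("amount must be a finite number")
    exponent = MINOR_UNIT_EXPONENT[code]
    scaled = value * (Decimal(10) ** exponent)
    if scaled != scaled.to_integral_value():
        # e.g. "10.5" for JPY, which has no fractional unit.
        raise ValueError(f"{amount} has more than {exponent} decimal places for {code}")
    if scaled <= 0:
        raise ValueError("amount must be positive")
    return int(scaled)
```

Under these assumptions, `to_minor_units("10.50", "USD")` yields 1050 and `to_minor_units("1000", "JPY")` yields 1000, while `to_minor_units("10.5", "JPY")` is rejected rather than rounded.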
9.2 Module: Communications
Applies to: Email, SMS, messaging, notification platforms
| ID | Criterion | Req | Weight |
|---|---|---|---|
| CM1 | Irreversibility Safeguards | SHOULD | Standard |
| CM2 | Delivery Verification | SHOULD | Standard |
| CM3 | Webhook/Event Infrastructure | SHOULD | Standard |
CM1 Irreversibility Safeguards
Whether the platform provides safety mechanisms that prevent agents from sending irreversible communications without review, including sandbox/test modes, draft-then-send patterns, batch limits, and scheduled send with cancellation. Sent messages cannot be recalled.
| Score | Description |
|---|---|
| 0. Failing | No sandbox mode; no batch limits; no draft/preview capability; agent can send unlimited messages immediately |
| 1. Basic | Test/sandbox mode available (messages validated but not delivered); basic rate limiting on outbound sends |
| 2. Good | Sandbox mode + batch send limits (≤1,000 per call) + rate limiting; draft/preview API or scheduled send with cancellation window |
| 3. Excellent | Sandbox mode validating full request format; per-second rate limits as safety brakes; draft-then-send pattern with human approval gate; scheduled send with cancellation; loop prevention circuit breaker |
CM2 Delivery Verification
Whether the platform provides structured, machine-readable delivery status tracking, including bounce categorization (hard/soft), complaint tracking, and suppression list management. Agents need programmatic feedback to know whether messages were actually delivered.
| Score | Description |
|---|---|
| 0. Failing | No delivery status feedback; fire-and-forget sending with no bounce or complaint data |
| 1. Basic | Basic delivery/bounce webhooks; suppression list exists but is not API-accessible |
| 2. Good | Structured delivery receipts (delivered/bounced/complained); bounce categorization (hard/soft); API-accessible suppression lists; unsubscribe handling |
| 3. Excellent | Full event lifecycle (processed → delivered → opened → clicked → unsubscribed → complained); automatic suppression management; per-recipient status tracking; bounce type classification with machine-readable codes |
CM3 Webhook/Event Infrastructure
Whether the platform supports programmatic webhook configuration, cryptographic signature verification, event replay, and structured event payloads. Agents managing communication workflows need reliable, verifiable event delivery, not dashboard-only webhook setup.
| Score | Description |
|---|---|
| 0. Failing | No webhook support; or webhooks require dashboard-only configuration with no signature verification |
| 1. Basic | Webhook URLs configurable via API; events delivered as structured JSON payloads |
| 2. Good | API-managed webhooks + cryptographic signature verification (HMAC or ECDSA); standard event types across the delivery lifecycle |
| 3. Excellent | Full CRUD webhook management via API; signature verification; event replay capability; batched event delivery; per-stream webhook URLs; inbound message processing via webhooks |
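
Cryptographic signature verification, as named in the rubric above, might look like the following HMAC-SHA256 sketch. The signing scheme (timestamp, a dot, then the raw body), the function names, and the five-minute tolerance are assumptions for illustration; each platform defines its own header format and signed payload.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, timestamp: str, body: bytes) -> str:
    """Hypothetical scheme: HMAC-SHA256 over "<timestamp>.<raw body>"."""
    message = timestamp.encode("utf-8") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_webhook(secret, timestamp, body, signature, now, tolerance_seconds=300):
    """Return True only if the signature matches and the timestamp is recent."""
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    # Reject stale deliveries to limit replay of captured requests.
    if abs(now - sent_at) > tolerance_seconds:
        return False
    expected = sign_payload(secret, timestamp, body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

The check runs against the raw request bytes, not a re-serialized JSON object, since re-serialization can change whitespace or key order and invalidate the signature.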
9.3 Module: Databases
Applies to: Databases, data platforms, ORMs
| ID | Criterion | Req | Weight |
|---|---|---|---|
| DB1 | Safe Experimentation | SHOULD | Standard |
| DB2 | Schema Introspection Quality | SHOULD | Standard |
| DB3 | Query Interface Safety | SHOULD | Standard |
| DB4 | Connection Management | MAY | Standard |
DB1 Safe Experimentation
Whether the database provides mechanisms for agents to experiment without risking production data, including branching, read-only replicas, point-in-time recovery, and copy-on-write environments. Agents occasionally issue destructive operations in error. The database must render those operations reversible.
| Score | Description |
|---|---|
| 0. Failing | No branching, snapshots, or recovery mechanism; destructive operations are permanent |
| 1. Basic | Point-in-time recovery (PITR) available; manual backup/restore process |
| 2. Good | Read-only replicas available; PITR with reasonable granularity; snapshot/clone capability (minutes to create) |
| 3. Excellent | Instant copy-on-write branching (<1s creation); branch reset to parent state; schema-only and full-data branch modes; PITR with fine granularity |
DB2 Schema Introspection Quality
Whether the database exposes machine-readable schema metadata, including table/column types, relationships, constraints, and semantic descriptions. Agents generating SQL require accurate schema context, but full schema exports are prohibitively large for direct LLM context injection.
| Score | Description |
|---|---|
| 0. Failing | No programmatic schema discovery; agent must guess table structure or rely on documentation alone |
| 1. Basic | Standard schema discovery (e.g., information_schema, SHOW TABLES); table and column names with types exposed |
| 2. Good | Full schema with foreign key/relationship metadata; constraint enumeration; schema accessible via HTTP API (not just SQL) |
| 3. Excellent | Semantic catalog with natural-language column/table descriptions (e.g., COMMENT ON); token-efficient schema representation; schema caching with DDL-change invalidation |
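
To make "token-efficient schema representation" concrete, the sketch below uses SQLite (chosen only because it ships with Python) to condense introspection output into one line per table plus one line per foreign key. The helper name `compact_schema` and the output format are assumptions for illustration, not a format the standard prescribes.

```python
import sqlite3


def compact_schema(conn):
    """Summarize tables, column types, and foreign keys as compact text."""
    lines = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' "
        "AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        columns = []
        # table_info rows: cid, name, type, notnull, dflt_value, pk
        for _cid, name, col_type, notnull, _default, pk in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            flags = (" PK" if pk else "") + (" NOT NULL" if notnull else "")
            columns.append(f"{name} {col_type}{flags}")
        lines.append(f"{table}({', '.join(columns)})")
        # foreign_key_list rows: id, seq, table, from, to, on_update, on_delete, match
        for row in conn.execute(f'PRAGMA foreign_key_list("{table}")'):
            lines.append(f"  FK {table}.{row[3]} -> {row[2]}.{row[4]}")
    return "\n".join(lines)
```

A summary like this carries the relationship metadata an agent needs for joins while staying far smaller than a full DDL export.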
DB3 Query Interface Safety
Whether the database enforces safe query patterns, including parameterized queries, row-level security, query validation before execution, and protection against the high error rate of agent-generated SQL. Agent-generated SQL fails at higher rates than human-written SQL in measured studies.
| Score | Description |
|---|---|
| 0. Failing | Raw SQL string concatenation accepted; no parameterized query enforcement; no row-level security |
| 1. Basic | Parameterized queries supported; basic SQL injection prevention |
| 2. Good | Parameterized queries enforced by default; row-level security (RLS) available; query explain/validation before execution |
| 3. Excellent | RLS enabled by default on new tables; query validation with cost estimation; read-only query mode for exploration; guardrails against broad DELETE/UPDATE without WHERE clauses |
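
Two of the behaviors above, parameterized queries and a guardrail against broad DELETE/UPDATE statements, can be sketched client-side with SQLite. The wrapper `guarded_execute` is hypothetical, and its WHERE check is a naive text match (a WHERE inside a string literal or subquery would satisfy it); a database enforcing this natively would inspect the parsed statement instead.

```python
import re
import sqlite3

_WRITE_STATEMENT = re.compile(r"^\s*(DELETE|UPDATE)\b", re.IGNORECASE)
_HAS_WHERE = re.compile(r"\bWHERE\b", re.IGNORECASE)


def guarded_execute(conn, sql, params=()):
    """Run a statement with bound parameters, refusing unbounded writes."""
    if _WRITE_STATEMENT.match(sql) and not _HAS_WHERE.search(sql):
        raise PermissionError("refusing DELETE/UPDATE without a WHERE clause")
    # Values travel separately from the SQL text, so they are never parsed as SQL.
    return conn.execute(sql, params)
```

Because values are bound rather than concatenated, a string such as `x'); DROP TABLE users; --` is stored as ordinary data instead of being executed.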
DB4 Connection Management
Whether the database provides HTTP/REST access, managed connection pooling, and edge-compatible drivers. Agents running in serverless and edge environments (Vercel Edge, Cloudflare Workers) often cannot open or hold long-lived TCP connections, so HTTP-based access is frequently the only viable path.
| Score | Description |
|---|---|
| 0. Failing | TCP-only access; no connection pooling; no serverless-compatible drivers |
| 1. Basic | Managed connection pooling available; standard database drivers with connection management |
| 2. Good | HTTP/REST API available alongside TCP; serverless-compatible drivers; connection pooling with scale-to-zero |
| 3. Excellent | Auto-generated HTTP/REST API (e.g., PostgREST); WebSocket support for multi-statement transactions; edge-compatible drivers; scale-to-zero with sub-second cold starts |
9.4 Module: Hosting & Infrastructure
Applies to: Cloud platforms, PaaS, serverless, container services
| ID | Criterion | Req | Weight |
|---|---|---|---|
| HI1 | Cost Guardrails | SHOULD | Standard |
| HI2 | Deployment Lifecycle Completeness | SHOULD | Standard |
| HI3 | Preview/Staging Deployments | MAY | Standard |
HI1 Cost Guardrails
Whether the platform provides spending limits, auto-stop/scale-down, cost estimation, and usage tracking that agents can use programmatically. Agents do not reliably model the cost impact of provisioning decisions. Without guardrails, they may allocate resources and fail to release them.
| Score | Description |
|---|---|
| 0. Failing | No spending limits or cost controls; no usage tracking API; API defaults more permissive than console defaults |
| 1. Basic | Usage-based pricing with basic spending alerts; manual cost controls available via dashboard |
| 2. Good | Spending limits configurable via API; auto-stop for idle resources; usage tracking API; cost alerts with configurable thresholds |
| 3. Excellent | Cost estimation before deployment; per-project spending caps via API; auto-scale-down to zero when idle; real-time cost tracking; API defaults match or are more restrictive than console defaults |
HI2 Deployment Lifecycle Completeness
Whether the full deployment lifecycle (build, deploy, rollback, scale, log access, and environment management) is available via API, CLI, or MCP. An agent that can deploy but cannot roll back, or that can deploy but cannot access logs, has an operationally unsafe capability gap.
| Score | Description |
|---|---|
| 0. Failing | Dashboard-only deployment; no API or CLI for triggering deploys or reading logs |
| 1. Basic | Deploy and status check available via API/CLI; log access available but limited |
| 2. Good | Build, deploy, log access, and environment variable management via API; rollback available (redeploy previous version) |
| 3. Excellent | Full lifecycle via API/CLI/MCP: deploy, rollback, scale, streaming logs, environment management; MCP server covering read and write operations across the lifecycle |
HI3 Preview/Staging Deployments
Whether the platform supports creating isolated preview/staging environments via API, including branch deployments, ephemeral environments, and automatic cleanup. Agents that deploy directly to production without a preview step can cause failures that are difficult to recover from.
| Score | Description |
|---|---|
| 0. Failing | No preview or staging environment support; all deployments go directly to production |
| 1. Basic | Manual staging environment available; preview deployments require dashboard configuration |
| 2. Good | Preview deployments creatable via API; branch-based deployments; rollback to previous deployment via API |
| 3. Excellent | Automatic preview deployment per branch/PR via API; ephemeral environments with automatic cleanup; instant rollback by promoting previous deployment; progressive rollout support (canary/blue-green) |
9.5 Module: Auth Providers
Applies to: Identity/authentication platforms (Auth0, Clerk, Firebase Auth, etc.)
| ID | Criterion | Req | Weight |
|---|---|---|---|
| AP1 | Agent-as-End-User Support | SHOULD | Standard |
| AP2 | Social/External Connection API | SHOULD | Standard |
| AP3 | Token Architecture Transparency | MAY | Standard |
AP1 Agent-as-End-User Support
Whether the auth platform supports flows where the end user is an agent, not a human with a browser. Standard OAuth redirects, approval interfaces, and email-based verification do not function when the end user is a non-interactive agent. CIBA, Device Flow, Client Credentials, and dedicated agent identity types address this gap.
| Score | Description |
|---|---|
| 0. Failing | All auth flows require browser-based interaction (redirects, consent screens); no machine-to-machine support |
| 1. Basic | Client Credentials grant supported for M2M authentication; basic service account support |
| 2. Good | Client Credentials + Device Flow or CIBA for async human approval; token vault or credential delegation for agents acting on behalf of users |
| 3. Excellent | Dedicated agent identity type (not retrofitted service accounts); credential vault with 35+ integrations; async authorization (CIBA) with push notification approval; scoped, time-bounded agent credentials with full audit trail |
AP2 Social/External Connection API
Whether social/OAuth provider connections, redirect URIs, email templates, and session settings can be configured entirely via API, without requiring dashboard interaction. An agent bootstrapping auth for a new project must be able to complete setup programmatically.
| Score | Description |
|---|---|
| 0. Failing | Social connections and auth configuration require dashboard-only setup; no Management API |
| 1. Basic | Core auth settings configurable via API; some provider setup (e.g., social connections) still requires dashboard |
| 2. Good | Social connections configurable via API for most providers; email templates accessible via API; redirect URI management via API |
| 3. Excellent | All configuration API-driven (social providers, email templates, branding, custom domains); 60+ social providers configurable via API; dynamic client registration support |
AP3 Token Architecture Transparency
Whether the platform clearly documents delegation chains, token lifetimes, refresh semantics, and trust boundaries, and whether it supports emerging standards for agent-to-app authorization. Opaque token architectures prevent agents from reasoning about their own permissions and capabilities.
| Score | Description |
|---|---|
| 0. Failing | Opaque token architecture; no documentation of token lifetimes, refresh semantics, or delegation chains |
| 1. Basic | Token lifetimes and refresh semantics documented; basic scope documentation |
| 2. Good | Delegation chain documentation; Rich Authorization Requests (RAR) support; fine-grained authorization (FGA); credential rotation via API |
| 3. Excellent | Support for agent-to-app protocols (XAA or equivalent); per-action authorization logging; dual-secret rotation without downtime; brokered credentials preventing LLM token exposure; EU AI Act-ready audit trail |
9.6 Module: Frameworks & Libraries
Applies to: Web frameworks, ORMs, UI libraries, build tools
| ID | Criterion | Req | Weight |
|---|---|---|---|
| FL1 | Type System Quality | SHOULD | Standard |
| FL2 | Scaffolding & Code Generation | MAY | Standard |
| FL3 | Configuration Validation | SHOULD | Standard |
FL1 Type System Quality
Whether the framework provides strong, expressive types (TypeScript types, Python type hints, or equivalent) that constrain agent-generated code at compile time. Type systems provide the tightest feedback loop for agents: milliseconds to detect errors versus seconds or minutes for runtime failures.
| Score | Description |
|---|---|
| 0. Failing | No type definitions; untyped JavaScript, untyped Python, or equivalent; agents get no compile-time feedback |
| 1. Basic | Type definitions available (e.g., @types/ package, basic type hints); core API surface typed |
| 2. Good | Comprehensive types across full API surface; generated types from schema (e.g., Prisma, GraphQL codegen); type inference support reducing annotation burden |
| 3. Excellent | Branded/nominal types preventing ID confusion (e.g., BuildingID vs. CustomerID); generated types with schema-first design; types covering edge cases and error states; type-check performance stable as schema grows |
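
The branded-ID idea in the score-3 row can be approximated in Python with `typing.NewType`. This is a sketch only: the function `building_label` and the sample IDs are hypothetical, and the protection exists at type-check time (for example under mypy), not at runtime.

```python
from typing import NewType

# Distinct nominal types over the same underlying representation.
BuildingID = NewType("BuildingID", str)
CustomerID = NewType("CustomerID", str)


def building_label(building_id: BuildingID) -> str:
    """Accepts only BuildingID as far as a static type checker is concerned."""
    return f"building:{building_id}"


label = building_label(BuildingID("b-17"))

# A static checker rejects the call below because a CustomerID is not a
# BuildingID. NewType adds no runtime check, so it would still execute;
# the error surfaces only during type checking.
# building_label(CustomerID("c-42"))
```

The value of the pattern for agents is that swapping two string IDs becomes a type error reported before the code runs, rather than a wrong lookup discovered later.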
FL2 Scaffolding & Code Generation
Whether the framework provides CLI generators, project templates, and code scaffolding that work non-interactively. Agents benefit from scaffolding over hand-constructing project structure, but scaffolding tools that depend on interactive prompts (arrow-key menus, confirmation dialogs) are inaccessible to agents.
| Score | Description |
|---|---|
| 0. Failing | No scaffolding tools; or scaffolding requires interactive prompts with no CLI flag bypass |
| 1. Basic | Project scaffolding CLI available; can generate basic project structure with default options via flags |
| 2. Good | Project + component/module generators; templates for common patterns; all prompts bypassable via CLI flags |
| 3. Excellent | Full non-interactive scaffolding with --yes/--defaults flags; generates project-specific configuration (e.g., AGENTS.md, type definitions); template library covering common patterns; generator output is immediately buildable/runnable |
FL3 Configuration Validation
Whether the framework validates configuration files with actionable error messages and provides validation as a standalone command (not just at runtime). Agents generate configuration frequently and require immediate feedback on misconfiguration; deferred failures at application startup or runtime are operationally costly.
| Score | Description |
|---|---|
| 0. Failing | No configuration validation; silent misconfiguration; runtime crashes on bad config with unhelpful errors |
| 1. Basic | Runtime validation with error messages on misconfiguration; configuration file format documented |
| 2. Good | JSON Schema for configuration files enabling editor validation; actionable error messages with suggested fixes; validation runs at startup before executing |
| 3. Excellent | Standalone validation command (lint, check, validate) runnable without starting the application; JSON Schema published for IDE/agent integration; error messages include specific fix suggestions; type-safe configuration with compile-time checking |
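
A standalone validation step with fix suggestions could be as small as the sketch below. The schema, its keys (`port`, `log_level`, `workers`), and the function `validate_config` are all hypothetical; the point is that validation returns actionable messages without starting the application.

```python
import difflib

# Hypothetical schema: key -> (expected type, required)
SCHEMA = {"port": (int, True), "log_level": (str, False), "workers": (int, False)}


def validate_config(config):
    """Return a list of human- and agent-readable error strings (empty if valid)."""
    errors = []
    for key, value in config.items():
        if key not in SCHEMA:
            # Suggest the closest known key so a typo has an obvious fix.
            close = difflib.get_close_matches(key, list(SCHEMA), n=1)
            hint = f"; did you mean '{close[0]}'?" if close else ""
            errors.append(f"unknown key '{key}'{hint}")
            continue
        expected, _required = SCHEMA[key]
        if not isinstance(value, expected):
            errors.append(
                f"'{key}' must be {expected.__name__}, got {type(value).__name__}"
            )
    for key, (_expected, required) in SCHEMA.items():
        if required and key not in config:
            errors.append(f"missing required key '{key}'")
    return errors
```

Wrapped in a `validate` or `check` subcommand, a function like this gives an agent feedback on a generated configuration in a single call, before any startup cost is paid.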
Appendix A: Criteria Quick Reference
Base Standard (15 criteria)
| ID | Criterion | Group | Req | Weight |
|---|---|---|---|---|
| B1 | Machine-Readable Documentation Formats | Documentation & Usability | SHOULD | Critical |
| B2 | Code Example Coverage & Quality | Documentation & Usability | SHOULD | Standard |
| B3 | Documentation Structure & Self-Containment | Documentation & Usability | SHOULD | Standard |
| B4 | Documentation Accuracy & Synchronization | Documentation & Usability | SHOULD | Standard |
| B5 | Getting Started Completeness | Documentation & Usability | SHOULD | Standard |
| B6 | Changelog & Migration Guidance | Documentation & Usability | MAY | Standard |
| B7 | Installation & Configuration Simplicity | Documentation & Usability | SHOULD | Standard |
| B8 | Supply Chain Integrity | Safety | SHOULD | Standard |
| B9 | Vulnerability Disclosure & Security Contact | Safety | SHOULD | Standard |
| B10 | Project Sustainability | Lifecycle | SHOULD | Critical |
| B11 | Maintenance Health | Lifecycle | SHOULD | Standard |
| B12 | Semver Adherence & Version Stability | Lifecycle | SHOULD | Standard |
| B13 | Governance & Continuity | Lifecycle | MAY | Standard |
| B14 | Security Track Record | Lifecycle | SHOULD | Standard |
| B15 | Terms & Licensing Stability | Lifecycle | SHOULD | Standard |
Module: Programmatic Interface (16 criteria)
| ID | Criterion | Req | Weight |
|---|---|---|---|
| PI1 | Interface Reference Completeness | MUST | Critical |
| PI2 | Tool/Endpoint Description Quality | MUST | Critical |
| PI3 | Tool Count & Surface Area Management | SHOULD | Critical |
| PI4 | Input Schema Design | SHOULD | Critical |
| PI5 | Output Quality & Token Efficiency | SHOULD | Standard |
| PI6 | Response Envelope Consistency | SHOULD | Standard |
| PI7 | Naming & Namespacing | SHOULD | Standard |
| PI8 | Behavioral Metadata & Annotations | SHOULD | Standard |
| PI9 | MCP Implementation Quality | MAY | Standard |
| PI10 | Programmatic Setup / TTFC | SHOULD | Critical |
| PI11 | API Workflow Coverage | SHOULD | Standard |
| PI12 | Versioning & API Stability | SHOULD | Standard |
| PI13 | SDK Availability & Quality | SHOULD | Standard |
| PI14 | Agent Protocol Availability | SHOULD | Standard |
| PI15 | Input Sanitization & Injection Resistance | MUST | Standard |
| PI16 | Prompt Injection Resistance | SHOULD | Standard |
Module: Network Service (8 criteria)
| ID | Criterion | Req | Weight |
|---|---|---|---|
| NS1 | Error Response Quality & Structure | MUST | Critical |
| NS2 | Rate Limit Communication | SHOULD | Critical |
| NS3 | Health & Status Communication | SHOULD | Standard |
| NS4 | Audit & Observability | SHOULD | Standard |
| NS5 | Test/Sandbox Environment Support | SHOULD | Standard |
| NS6 | Environment Separation | SHOULD | Standard |
| NS7 | Asynchronous Operation Support | MAY | Standard |
| NS8 | Data Portability & Pricing Transparency | MAY | Standard |
Module: Write Operations (4 criteria)
| ID | Criterion | Req | Weight |
|---|---|---|---|
| WO1 | Destructive Operation Safety | SHOULD | Critical |
| WO2 | Dry-Run / Validation Capability | MAY | Standard |
| WO3 | Idempotency & Safe Retry Support | SHOULD | Standard |
| WO4 | Workflow Error Communication | SHOULD | Standard |
Module: Authentication (4 criteria)
| ID | Criterion | Req | Weight |
|---|---|---|---|
| AU1 | Non-Interactive Authentication Methods | MUST | Critical |
| AU2 | Permission Granularity | SHOULD | Standard |
| AU3 | Credential Lifecycle Management | SHOULD | Standard |
| AU4 | Agent Identity Support | MAY | Standard |
Module: CLI (4 criteria)
| ID | Criterion | Req | Weight |
|---|---|---|---|
| CLI1 | Non-Interactive Execution | SHOULD | Standard |
| CLI2 | Structured Output Mode | SHOULD | Standard |
| CLI3 | Cross-Platform Consistency | SHOULD | Standard |
| CLI4 | Configuration Format Safety | MAY | Standard |
⚑ = Open-source/commercial split rubric.
Domain Module: Payments & Financial (4 criteria)
| ID | Criterion | Req | Weight |
|---|---|---|---|
| PM1 | Idempotency Depth | SHOULD | Standard |
| PM2 | Test Simulation Fidelity | SHOULD | Standard |
| PM3 | Compliance Automation | MAY | Standard |
| PM4 | Currency & Amount Safety | SHOULD | Standard |
Domain Module: Communications (3 criteria)
| ID | Criterion | Req | Weight |
|---|---|---|---|
| CM1 | Irreversibility Safeguards | SHOULD | Standard |
| CM2 | Delivery Verification | SHOULD | Standard |
| CM3 | Webhook/Event Infrastructure | SHOULD | Standard |
Domain Module: Databases (4 criteria)
| ID | Criterion | Req | Weight |
|---|---|---|---|
| DB1 | Safe Experimentation | SHOULD | Standard |
| DB2 | Schema Introspection Quality | SHOULD | Standard |
| DB3 | Query Interface Safety | SHOULD | Standard |
| DB4 | Connection Management | MAY | Standard |
Domain Module: Hosting & Infrastructure (3 criteria)
| ID | Criterion | Req | Weight |
|---|---|---|---|
| HI1 | Cost Guardrails | SHOULD | Standard |
| HI2 | Deployment Lifecycle Completeness | SHOULD | Standard |
| HI3 | Preview/Staging Deployments | MAY | Standard |
Domain Module: Auth Providers (3 criteria)
| ID | Criterion | Req | Weight |
|---|---|---|---|
| AP1 | Agent-as-End-User Support | SHOULD | Standard |
| AP2 | Social/External Connection API | SHOULD | Standard |
| AP3 | Token Architecture Transparency | MAY | Standard |
Domain Module: Frameworks & Libraries (3 criteria)
| ID | Criterion | Req | Weight |
|---|---|---|---|
| FL1 | Type System Quality | SHOULD | Standard |
| FL2 | Scaffolding & Code Generation | MAY | Standard |
| FL3 | Configuration Validation | SHOULD | Standard |
Totals: 15 base + 16 interface + 8 network + 4 write + 4 auth + 4 CLI = 51 base + complexity criteria | 20 domain-specific across 6 modules (4+3+4+3+3+3) | 5 MUST gates (all in complexity modules) | A tool may trigger multiple domain modules