Zaira Standard v0.9

1. Evaluation Logic

The design principles behind the Zaira Standard. These explain why the standard is structured the way it is and guide interpretation of the criteria.

The standard scales with tool complexity

The simplest tools face the simplest evaluation. A library installed via npm install and used locally is evaluated against the Base Standard alone: documentation, lifecycle health, supply chain integrity, installation simplicity. No authentication criteria. No API surface area management. No rate limit communication. Those criteria don't apply because the tool doesn't have those concerns.

As a tool's complexity increases (it exposes an API, runs as a hosted service, handles destructive operations, requires credentials) complexity modules activate and add criteria proportional to that complexity. A cloud database with auth, writes, and a REST API is evaluated against the Base Standard plus four complexity modules. Evaluation scope matches the tool's surface area.

Every criterion in a tool's evaluation applies to that tool. There are no N/A markings, no skipped criteria, no denominator adjustments. If a criterion is activated in a tool's evaluation, it is relevant to that evaluation.

Health over activity

The standard measures whether a tool functions, not whether it is being actively developed. A stable, feature-complete library with zero open CVEs, passing continuous integration, and accurate documentation satisfies health criteria even without recent commits; a project with weekly commits and unaddressed issues does not. Criteria that might otherwise penalize inactivity (B4, B11) are defined in terms of empirical health signals (documentation-behavior correspondence, vulnerability patching cadence, installation success on current runtimes) rather than recency signals. Completion is a state demonstrated through continued health, not declared through version increments or announcements.

Discoverability is outside scope

The Zaira Standard evaluates whether a tool is ready for agents to use, not whether agents can find it. Discoverability (registry presence, search engine indexing, structured metadata) is a separate concern. Conflating discoverability with agent-readiness would penalize tools whose distribution is weak and reward tools whose distribution is strong, independent of agent-readiness.

Agent capability floor

The Zaira Standard defines a minimum agent capability threshold: the capability floor. Agents scoring below the floor are not target consumers of Zaira Standard evaluation results.

The capability floor for Zaira Standard v0.9 is 35% on SWE-Bench Pro V9, administered by SEAL. This benchmark evaluates base model and sub-agent performance on software engineering tasks with tool usage. It strips wrapper scaffolding and vendor-specific enhancements.

The floor is defined by benchmark score, not by model name. The capability floor is a normative parameter of each standard version, reviewed with each minor revision (see §5) following the standard change process.

2. How the Standard Works

Module Activation

Every tool starts with the Base Standard: 15 criteria that apply universally. Then, based on what the tool does and how it's accessed, complexity modules activate:

Module	Trigger Question	Criteria Added
Programmatic Interface	Does the tool expose an API, SDK, or MCP server?	16
Network Service	Is the tool a hosted/remote service?	8
Write Operations	Can the tool create, modify, or delete data/resources?	4
Authentication	Does the tool require credentials or tokens?	4
CLI	Does the tool have a command-line interface?	4

After complexity modules, one or more domain modules may apply based on the tool's functional domains (Payments, Databases, Communications, and others defined in §9).

Each trigger question is a simple yes/no. A tool may trigger zero, one, or many complexity modules. The triggers are independent: a tool can be a Network Service without having a CLI, or have Write Operations without requiring Authentication.
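The activation logic can be sketched as a small lookup. This is an illustrative sketch, not part of the standard: the function name `activated_modules` and the input shapes are assumptions. The complexity-module counts come from the table above; domain-module counts are supplied by the caller because they are defined in §9.

```python
# Criteria counts taken from the Module Activation table.
BASE_CRITERIA = 15
COMPLEXITY_MODULES = {
    "Programmatic Interface": 16,
    "Network Service": 8,
    "Write Operations": 4,
    "Authentication": 4,
    "CLI": 4,
}


def activated_modules(triggers, domain_criteria=None):
    """Return (module names, total criteria count) for a tool.

    triggers: dict mapping a complexity module name to the yes/no answer
        to its trigger question; names that are absent count as "no".
    domain_criteria: optional dict of domain module name -> criteria count
        (hypothetical input; the real counts are defined in section 9).
    """
    unknown = set(triggers) - set(COMPLEXITY_MODULES)
    if unknown:
        raise ValueError(f"unknown complexity modules: {sorted(unknown)}")

    modules = ["Base"]  # every tool starts with the Base Standard
    total = BASE_CRITERIA
    for name, count in COMPLEXITY_MODULES.items():
        if triggers.get(name, False):
            modules.append(name)
            total += count
    for name, count in (domain_criteria or {}).items():
        modules.append(name)
        total += count
    return modules, total
```

A tool that answers yes only to the Write Operations and CLI questions activates Base, Write Operations, and CLI, for 15 + 4 + 4 = 23 criteria.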

Example Evaluations

Tool	Base	Interface	Network	Write	Auth	CLI	Domain(s)	Total Criteria
SQLite	✓							15
lodash	✓							15
React	✓						Frameworks	18
git CLI	✓			✓		✓		23
Terraform CLI	✓			✓	✓	✓	Hosting	30
Neon (DB)	✓	✓	✓	✓	✓		Databases	51
Stripe	✓	✓	✓	✓	✓		Payments	51
Supabase	✓	✓	✓	✓	✓		DB + Auth + Hosting	57
AWS Redshift	✓	✓	✓	✓	✓	✓	Databases	55

The simplest tools have the smallest evaluation scope. The most complex tools activate every applicable module.

3. How to Read This Document

Criterion Structure

Each criterion includes:

Field	Meaning
ID	Unique identifier (e.g., `B1`, `AU3`, `NS5`)
Name	Short descriptive name
Description	What this criterion evaluates
Requirement Level	`MUST` (binary gate), `SHOULD` (expected), or `MAY` (aspirational)
Weight	`Critical` (×2) or `Standard` (×1): determines point multiplier
Scoring Gradient	0 (Failing), 1 (Basic), 2 (Good), 3 (Excellent)

Criterion IDs use module-based prefixes: B (Base Standard), PI (Programmatic Interface), NS (Network Service), WO (Write Operations), AU (Authentication), CLI (CLI), PM (Payments), CM (Communications), DB (Databases), HI (Hosting), AP (Auth Providers), FL (Frameworks). IDs are sequential within each module: a criterion's ID indicates which module it belongs to.
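Because the prefix identifies the module, an ID can be resolved mechanically. A minimal sketch, assuming IDs consist of an uppercase prefix followed by a positive integer, as in the examples above; the function name `parse_criterion_id` is hypothetical.

```python
import re

# Prefix -> module name, as listed in the paragraph above.
MODULE_PREFIXES = {
    "B": "Base Standard",
    "PI": "Programmatic Interface",
    "NS": "Network Service",
    "WO": "Write Operations",
    "AU": "Authentication",
    "CLI": "CLI",
    "PM": "Payments",
    "CM": "Communications",
    "DB": "Databases",
    "HI": "Hosting",
    "AP": "Auth Providers",
    "FL": "Frameworks",
}

# Assumed ID shape: uppercase prefix, then a sequence number without leading zeros.
_ID_PATTERN = re.compile(r"([A-Z]+)([1-9][0-9]*)")


def parse_criterion_id(criterion_id):
    """Split an ID such as 'AU3' into (module name, sequence number)."""
    match = _ID_PATTERN.fullmatch(criterion_id)
    if match is None or match.group(1) not in MODULE_PREFIXES:
        raise ValueError(f"not a recognized criterion ID: {criterion_id!r}")
    return MODULE_PREFIXES[match.group(1)], int(match.group(2))
```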

Requirement Levels (RFC 2119)

MUST: Binary gate. Score ≥1 required for Agent Ready or Agent Native. A tool that scores 0 on any MUST criterion in its activated modules cannot achieve either designation. MUST gates only appear in complexity modules: the Base Standard has no MUST gates.
SHOULD: Expected for meaningful agent readiness. Scored and weighted. Low scores reduce tier eligibility.
MAY: Aspirational. Demonstrates excellence. Primarily differentiates the highest tier.

Weight Categories

Critical (×2): Criteria with the strongest documented impact on agent success rates, including error handling, description quality, and authentication.
Standard (×1): All other criteria. Important but with less dramatic measured impact or less universal applicability.

Open-Source vs. Commercial Tool Splits

Where a criterion measures fundamentally different things depending on whether the tool is open-source or commercial, two scoring gradients are provided. An open-source library's sustainability risk is contributor concentration; a commercial SaaS tool's sustainability risk is corporate strategy and sunset policy. Both matter, but they require different evidence.

Open-source tools: Use the open-source rubric (listed first).
Commercial/closed-source tools: Use the commercial rubric instead.
Hybrid tools (open-source core with commercial hosted offering): Evaluate against the open-source rubric for the open-source artifact. If the primary product being evaluated is the hosted service, use the commercial rubric.

Scores are directly comparable: a score of 2 on either rubric indicates the same qualitative level (Good).

4. Scoring System

How Criteria Map to Points

Each criterion is scored on a 0-3 scale:

Score	Label	Meaning
0	Failing	Does not meet minimum requirements; actively harms agent usability
1	Basic	Minimum viable implementation; functional but limited
2	Good	Solid implementation; meaningfully supports agent workflows
3	Excellent	Exemplary implementation; designed with agents as a first-class consumer

Point Calculation

Criterion Score = Raw Score (0-3) × Weight (1 or 2)

Module Score = Sum of all criterion scores in module
Total Score = Sum of all module scores

Percentage = (Total Score / Maximum Possible Score for activated modules) × 100

Because modules only activate when relevant, there are no N/A adjustments. The denominator is always the maximum possible score for the specific modules a tool activates.
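The point calculation can be expressed directly in code. This is a sketch for illustration only; the function names and the `(raw_score, weight)` pair representation are assumptions, not part of the standard.

```python
WEIGHTS = {"Critical": 2, "Standard": 1}
MAX_RAW_SCORE = 3


def criterion_points(raw_score, weight):
    """Criterion Score = Raw Score (0-3) x Weight (1 or 2)."""
    if raw_score not in (0, 1, 2, 3):
        raise ValueError("raw score must be 0, 1, 2, or 3")
    return raw_score * WEIGHTS[weight]


def total_and_percentage(criteria):
    """Score every criterion in a tool's activated modules.

    criteria: list of (raw_score, weight) pairs, weight being
        "Critical" or "Standard".
    Returns (total score, maximum possible score, percentage).
    """
    if not criteria:
        raise ValueError("at least one criterion is required")
    total = sum(criterion_points(raw, weight) for raw, weight in criteria)
    # The denominator covers only the criteria actually activated.
    maximum = sum(MAX_RAW_SCORE * WEIGHTS[weight] for _, weight in criteria)
    return total, maximum, 100 * total / maximum
```

For a Base-only evaluation (2 Critical and 13 Standard criteria), scoring 2 on every criterion gives 2 × 2 × 2 + 13 × 2 = 34 points out of a maximum of 51.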

What is a good score?

A Zaira Score indicates a tool's agent-readiness. Scores fall into three descriptive bands, each of which characterizes a range of agent-usability. These bands describe how a score is interpreted; they are not certification outcomes.

Agent-Ready Base

Describes the score of a tool whose entire surface is covered by the Base Standard: a utility library, a pure-function module, or any tool without an API, network calls, write operations, or credentials. A score in this band indicates that the fundamentals (documentation, lifecycle health, supply chain integrity, installation) meet the standard. No additional evaluation applies because the tool does not expose further surface area.

Indicator	What it looks like
Tool shape	Triggers zero complexity modules
Overall score	≥60% of Base Standard
Zero scores	No more than 3 SHOULD criteria score 0

Agent Ready

Describes the score of a tool with substantial surface area (APIs, hosted services, write operations, authentication) that is substantively usable by agents for standard workflows across that surface. MUST gates pass, every applicable module is covered, and any remaining gaps are documented in the scorecard.

Indicator	What it looks like
Tool shape	Triggers one or more complexity modules
Overall score	≥60%
MUST criteria	All score ≥1
Zero scores	No more than 5 MUST or SHOULD criteria score 0

Agent Native

The highest band. Describes the score of a tool that treats agents as first-class consumers across its full surface: no criterion at zero on any MUST or SHOULD, every MUST scoring at least 2, and an overall percentage of 80% or higher. A score in this band satisfies every Agent Ready indicator plus additional thresholds.

Indicator	What it looks like
Tool shape	Triggers one or more complexity modules
Overall score	≥80%
MUST criteria	All score ≥2
Zero scores	No MUST or SHOULD criterion scores 0
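The three indicator tables can be combined into a single check. A minimal sketch under stated assumptions: the function name and parameters are hypothetical, and returning `None` for a score that meets no band's thresholds is a choice made here, since the standard does not name that case.

```python
def score_band(percentage, triggers_complexity_module, must_scores, zero_count):
    """Return the band a score falls into, or None if it meets no band.

    percentage: overall score as a percentage (0-100).
    triggers_complexity_module: True if one or more complexity modules activate.
    must_scores: raw scores (0-3) of every MUST criterion in activated modules.
    zero_count: number of MUST or SHOULD criteria scoring 0 (for Base-only
        tools this counts SHOULD criteria, since the Base Standard has no MUSTs).
    """
    if not triggers_complexity_module:
        if percentage >= 60 and zero_count <= 3:
            return "Agent-Ready Base"
        return None

    # Check the stricter band first: Agent Native implies Agent Ready.
    if percentage >= 80 and all(s >= 2 for s in must_scores) and zero_count == 0:
        return "Agent Native"
    if percentage >= 60 and all(s >= 1 for s in must_scores) and zero_count <= 5:
        return "Agent Ready"
    return None
```

Under this sketch, a tool at 68% with every MUST at 1 or higher and two zero-scored SHOULD criteria is classified as Agent Ready.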

Anti-Gaming Mechanisms

MUST gates. MUST criteria in activated modules are binary gates: score 0 on any and the tool cannot achieve Agent Ready or Agent Native regardless of total score.
Critical weighting. Criteria with the strongest empirical impact on agent success count double, preventing optimization toward lower-weighted criteria.
Zero-score limits. Agent Ready allows no more than 5 SHOULD/MUST criteria at 0; Agent Native allows none. This prevents broad neglect across any part of the evaluation.
Evidence requirements. Every score requires documented evidence (automated test output, URL, or evaluator rationale). No score without evidence.

Together, these mechanisms ensure that band placement reflects agent-readiness across a tool's full surface area, not selective optimization of lower-weighted criteria.

Scoring Examples

Example 1: Simple library (lodash)

Modules: Base only (15 criteria)

2 Critical × 3 × 2 = 12
13 Standard × 3 × 1 = 39
Max: 51

No complexity modules triggered. Scores produced under this profile are interpreted against the Agent-Ready Base band: the Base Standard covers the tool's entire surface, so the fundamentals are the complete evaluation.

Example 2: CLI tool with write operations (Terraform CLI)

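Modules: Base (15) + Write Operations (4) + Authentication (4) + CLI (4) = 27 criteria. The Hosting domain module listed for Terraform CLI under Example Evaluations is left out of this example.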

Critical criteria: B1, B10, WO1, AU1 = 4 Critical
4 Critical × 3 × 2 = 24
23 Standard × 3 × 1 = 69
Max: 93

MUST gates: AU1
Eligible for Agent Ready or Agent Native band.

Example 3: Full SaaS payment platform (Stripe)

Modules: Base (15) + Programmatic Interface (16) + Network Service (8) + Write Operations (4) + Authentication (4) + Payments (4) = 51 criteria

Critical criteria: B1, B10, PI1, PI2, PI3, PI4, PI10, NS1, NS2, WO1, AU1 = 11 Critical
11 Critical × 3 × 2 = 66
40 Standard × 3 × 1 = 120
Max: 186

MUST gates: PI1, PI2, PI15, NS1, AU1
Eligible for Agent Ready or Agent Native band.

Example scores:

Base: 34/51 (67%)
Programmatic Interface: 42/63 (67%)
Network Service: 20/30 (67%)
Write Operations: 11/15 (73%)
Authentication: 11/15 (73%)
Payments: 8/12 (67%)

Total: 126/186 = 68%

Band check:

≥60%? Yes → Agent Ready candidate
All MUSTs ≥1? (PI1 ✓, PI2 ✓, PI15 ✓, NS1 ✓, AU1 ✓) Yes ✓
≤5 SHOULD/MUST criteria score 0? Yes (2 zeros) ✓
Result: Agent Ready

5. Versioning & Governance

The Zaira Standard is public and stable. Revisions are made only when durable shifts in agent-tool interaction warrant them, not in response to short-term trends. Version changes are announced at least 30 days before they take effect, giving implementers and downstream evaluators time to adapt.

Version Scheme

v0.9: Current version. Scoring thresholds and weights may be refined based on evaluation data before v1.0.
v1.0: Stable release with finalized thresholds.
v1.x (Minor): Additive criteria, refined scoring, weight adjustments. Backward compatible.
v2.0 (Major): Structural changes (module additions/removals, score band or threshold revisions).

6. Scope & Limitations

The Zaira Standard evaluates agent usability. It does not evaluate:

Tool quality or fitness for purpose. Whether the tool is good at what it does.
Security guarantees. A tool can meet all Safety criteria and still have undiscovered vulnerabilities. The standard evaluates evidence and practices, not absence of risk.
Performance benchmarks. Response time, throughput, and uptime are not evaluated (though health endpoints are).
Pricing fairness. Whether pricing is transparent and machine-readable is evaluated, not whether it's competitive.
Training data representation. How well current models know a tool is not under the tool's control.
Compliance certifications. SOC 2, ISO 27001, HIPAA, etc. are noted when present but not replicated.
Runtime enforcement. The standard publishes evaluation data; enforcement is the responsibility of agent runtimes and policy engines.
Human developer experience. Criteria are evaluated from the agent's perspective.
Discoverability. Whether agents can find a tool is outside the standard's scope.

7. Base Standard

15 criteria that apply to every tool, regardless of type.

The Base Standard evaluates the fundamentals: whether an agent can learn to use a tool, whether it is safe to depend on, and whether it will remain available in the future. Every tool (from a 50-line utility library to a cloud platform) is evaluated against these criteria.

The Base Standard has no MUST gates. Simple tools meet the Base Standard or do not, based on overall score. MUST gates appear in complexity modules, where failure has greater operational impact.

Documentation & Usability

Documentation and installation quality.

Documentation and installation are the entry point. An agent encountering a tool for the first time needs machine-readable, structured, self-contained information to understand capabilities and usage, and a friction-free path to get it installed and working. More documentation is not better. Irrelevant docs actively harm agent performance. Quality, structure, and machine-readability matter more than volume.

`B1` Machine-Readable Documentation Formats

Requirement: SHOULD | Weight: Critical (×2)

Whether documentation is available in formats agents can directly consume, not just human-rendered HTML requiring JavaScript.

Score	Description
0. Failing	Interactive-only docs (Swagger UI without downloadable spec); video tutorials; CSS-styled HTML requiring JS rendering
1. Basic	Docs available as static HTML or Markdown; README with usage instructions
2. Good	Comprehensive Markdown docs; `llms.txt` present; docs available via content negotiation or direct download
3. Excellent	`llms.txt` + `AGENTS.md` + content negotiation (`Accept: text/markdown`) + version-matched bundled docs or MCP documentation server
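From the agent's side, the content negotiation named in the rubric is an ordinary HTTP request with an `Accept` header. A minimal sketch using only the Python standard library: the helper names and the example URL are hypothetical, and the root-level `/llms.txt` location is assumed from the llms.txt convention rather than stated in this standard.

```python
from urllib.parse import urljoin
from urllib.request import Request


def markdown_request(docs_url):
    """Build a GET request asking the server for a Markdown representation."""
    return Request(docs_url, headers={"Accept": "text/markdown"})


def llms_txt_url(site_root):
    """Return the assumed llms.txt location at the root of the site."""
    return urljoin(site_root, "/llms.txt")


# Sending the request is left to the caller, for example:
#   from urllib.request import urlopen
#   with urlopen(markdown_request("https://docs.example.com/guide")) as resp:
#       served_markdown = resp.headers.get_content_type() == "text/markdown"
```

A server that honors the header responds with a Markdown body; one that ignores it typically returns HTML, which an evaluator can detect from the response's content type.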

`B2` Code Example Coverage & Quality

Requirement: SHOULD | Weight: Standard (×1)

The density, quality, realism, and progressive complexity of code examples.

Score	Description
0. Failing	No code examples; or examples use placeholder data (`"string"`, `123`)
1. Basic	At least one example per major feature; examples use realistic data
2. Good	Examples for most features/methods; include both success and error scenarios; copy-pasteable
3. Excellent	Progressive complexity (minimal → common → advanced); 1–5 examples per feature focusing on ambiguous cases; error recovery examples included

`B3` Documentation Structure & Self-Containment

Requirement: SHOULD | Weight: Standard (×1)

Whether documentation sections are self-contained (extractable independently), structured for agent consumption, and appropriately chunked.

Score	Description
0. Failing	Documentation is a single monolithic page; no logical sections; requires full-document context to understand any part
1. Basic	Logical section divisions with headings; most sections are readable independently
2. Good	Answer-first format; descriptive headings that function as queries; self-contained sections of 100–200 words; cross-references include inline context
3. Excellent	All sections independently extractable; tables for structured data; critical information in first 30% of each section; optimized for retrieval-augmented generation

`B4` Documentation Accuracy & Synchronization

Requirement: SHOULD | Weight: Standard (×1)

Whether documentation accurately reflects actual tool behavior. Accuracy takes precedence over recency. Documentation that has been unchanged for an extended period but accurately describes current tool behavior scores higher than recently updated documentation that contains errors.

Score	Description
0. Failing	Documented behavior contradicts actual tool behavior; or documented methods/functions don't exist; or docs describe a different version than the current release
1. Basic	No known major inaccuracies; documented examples produce expected results
2. Good	Docs-as-code (versioned alongside product); documented examples tested; version-matched (docs specify which version they describe)
3. Excellent	Docs updated with every release; CI/CD blocks deployment without doc updates; automated drift detection

`B5` Getting Started Completeness

Requirement: SHOULD | Weight: Standard (×1)

Whether an agent can go from zero to working usage using only the documentation.

Score	Description
0. Failing	No getting-started guide; or guide requires significant external knowledge
1. Basic	Getting-started guide exists; covers basic setup
2. Good	Guide is completable by following docs alone without external knowledge; includes installation, first usage, and expected output
3. Excellent	Agent-specific quickstart or integration guide; includes common pitfalls; minimal viable example under 20 lines of code; first successful usage achievable in <5 minutes

`B6` Changelog & Migration Guidance

Requirement: MAY | Weight: Standard (×1)

Whether changes are communicated in structured, parseable formats.

Score	Description
0. Failing	No changelog; or changelog is unstructured prose buried in blog posts
1. Basic	Changelog exists with dated entries
2. Good	Structured and parseable changelog (consistent format); semantic versioning; breaking changes clearly marked
3. Excellent	Changelog available as structured data (JSON, RSS/Atom feed); migration guides for breaking changes; `deprecated` markers in code or specs

`B7` Installation & Configuration Simplicity

Requirement: SHOULD | Weight: Standard (×1)

How easily an agent can install, set up, and start using a tool.

Score	Description
0. Failing	GUI installer required; complex multi-step build process; missing executables with no clear resolution
1. Basic	Package manager install (`npm install`, `pip install`); basic documentation
2. Good	Single command install; environment variable configuration; clear error messages on misconfiguration
3. Excellent	Single static binary or zero-dependency install; zero-config usage possible; scaffolding tools for project setup; JSON Schema for configuration validation

Safety Fundamentals

Supply chain integrity and vulnerability disclosure.

Two safety criteria apply to every tool regardless of type: supply chain integrity and vulnerability disclosure. Together they establish the minimum baseline: provenance verifiability and a defined path for reporting security issues.

`B8` Supply Chain Integrity

Requirement: SHOULD | Weight: Standard (×1)

Whether publisher identity is verified, releases are signed, and the supply chain is tamper-evident.

Open-source tools: evaluate this section:

Score	Description
0. Failing	Anonymous or unclear maintainer identity; no integrity signals
1. Basic	Verified publisher on npm/PyPI; GitHub org verification; some identity signals present
2. Good	Signed tags or releases; dependency pinning (lock files); verified domain → repo → artifact chain
3. Excellent	Sigstore/cosign signing; SLSA provenance attestation; reproducible builds; SBOM published; complete identity chain (domain → repo → artifact maintainer match)

Trust signal hierarchy: Reproducible builds > SLSA attestation > Code signing > Verified domain > GitHub org verification > npm/PyPI verified publisher > SBOM > security.txt

Commercial/closed-source tools: evaluate this section instead:

Score	Description
0. Failing	No verifiable publisher identity; SDK or agent distributed through unofficial channels; no integrity signals
1. Basic	Verified company domain; SDK published under verified org on npm/PyPI; official distribution channels clearly identified
2. Good	SDKs signed or published with verified provenance; official distribution channels documented; dependency pinning in SDK; checksums for downloadable artifacts
3. Excellent	SDKs with code signing and provenance attestation; SOC 2 Type II or equivalent supply chain controls; SBOM for SDK dependencies; documented build and release security practices

`B9` Vulnerability Disclosure & Security Contact

Requirement: SHOULD | Weight: Standard (×1)

Whether a clear, machine-readable path exists for reporting security vulnerabilities.

Score	Description
0. Failing	No clear vulnerability reporting path
1. Basic	Generic security contact exists (email address, contact form)
2. Good	Published vulnerability disclosure policy with clear process and timeline commitments
3. Excellent	`security.txt` present (RFC 9116) at `/.well-known/security.txt`; published vulnerability disclosure policy; bug bounty program; clear response commitments (acknowledgment, assessment, and fix timelines)
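A `security.txt` file is a plain-text list of `Field: value` lines, so its presence and basic validity can be checked automatically. A minimal sketch, assuming the file has already been fetched from `/.well-known/security.txt`; the function names are hypothetical, and the sketch does not verify an OpenPGP signature or whether the `Expires` date has passed.

```python
def parse_security_txt(text):
    """Parse security.txt content into {lowercased field name: [values]}."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue  # skip blank lines, comments, and non-field lines
        name, value = line.split(":", 1)
        fields.setdefault(name.strip().lower(), []).append(value.strip())
    return fields


def has_required_fields(fields):
    """RFC 9116 requires at least one Contact field and exactly one Expires field."""
    return len(fields.get("contact", [])) >= 1 and len(fields.get("expires", [])) == 1
```

Passing this check is evidence toward the `security.txt` part of the score-3 description only; the disclosure policy, bug bounty, and response commitments need separate evidence.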

Lifecycle Health

Long-term reliability and continuity.

These criteria evaluate long-term viability: whether a tool will continue to work, be maintained, and remain safe to depend on. A tool that scores well at evaluation but is abandoned soon after offers limited long-term reliability for agent dependency.

Open-source and commercial tools have fundamentally different risk profiles. An open-source tool's risk is contributor abandonment; a commercial tool's risk is corporate sunset or acquisition. Where this distinction matters, criteria provide separate rubrics. For open-source tools, most criteria are fully automatable from public data (GitHub API, package registries, OpenSSF Scorecard).

`B10` Project Sustainability

Requirement: SHOULD | Weight: Critical (×2)

The likelihood that this tool will continue to be maintained and supported over time.

Open-source tools: evaluate this section:

Score	Description
0. Failing	Bus factor of 1 (single contributor accounts for >50% of contributions); or no commits in 12+ months with open issues
1. Basic	Bus factor of 2–3; some contributor diversity
2. Good	Bus factor of 4–10; multiple active contributors; no single contributor >50% of recent commits
3. Excellent	Bus factor >10; organizational backing; contributor pipeline visible (new contributors joining)
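The bus-factor thresholds in this rubric can be computed from per-contributor commit counts. A minimal sketch, assuming bus factor means the smallest number of contributors who together account for more than 50% of commits (this matches the score-0 wording above but is an assumption for higher values). The function names and input shape are hypothetical, and the remaining conditions (organizational backing, contributor pipeline, the 12-month inactivity clause) are not checked.

```python
def bus_factor(commits_by_contributor):
    """Smallest number of contributors who together hold >50% of commits.

    commits_by_contributor: dict of contributor name -> commit count.
    """
    total = sum(commits_by_contributor.values())
    if total <= 0:
        raise ValueError("no commits to analyze")
    covered = 0
    counts = sorted(commits_by_contributor.values(), reverse=True)
    for n, count in enumerate(counts, start=1):
        covered += count
        if covered * 2 > total:  # strictly more than half
            return n
    raise AssertionError("unreachable when total > 0")


def bus_factor_score(factor):
    """Map a bus factor to the 0-3 ranges in the open-source rubric."""
    if factor <= 1:
        return 0
    if factor <= 3:
        return 1
    if factor <= 10:
        return 2
    return 3
```

With commit counts of 30, 30, 20, and 20, no single contributor exceeds half of the total but the top two together do, so the bus factor is 2 and the bus-factor portion of the rubric maps to a score of 1.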

Commercial/closed-source tools: evaluate this section instead:

Score	Description
0. Failing	No visible team or organization; single-person operation with no stated continuity plan
1. Basic	Established company; identifiable team; product actively marketed
2. Good	Company with public funding or revenue signals; dedicated product team; published product roadmap
3. Excellent	Publicly traded or well-funded company; product is a core revenue line (not a side project); published sunset/migration policy; data export API

`B11` Maintenance Health

Requirement: SHOULD | Weight: Standard (×1)

Whether the tool demonstrates operational health. Operational health is not synonymous with recent commit activity: it is evidence that the tool continues to function and that unresolved issues receive maintainer response. A project with no open issues and no recent commits satisfies this criterion; a project with numerous unanswered issues and no recent commits does not. The distinction is measurable.

Score	Description
0. Failing	Unanswered issues accumulating (>10 open issues with no maintainer response in 90+ days); or unpatched known vulnerabilities >90 days old; or fails to install on current LTS runtimes
1. Basic	Open issues receive some response; no unpatched critical vulnerabilities; installs and runs on current platforms
2. Good	<7 days median issue response when issues exist; dependencies up to date or pinned to non-vulnerable versions; CI passing on current runtimes
3. Excellent	<48 hours issue triage; proactive dependency updates; CI tests against multiple runtime versions; clear triage labels; published response time commitments. OR for mature stable projects: zero unpatched vulnerabilities; CI passing on current LTS runtimes; <7 day response on the last 5 issues filed (whenever they were filed); dependencies pinned to non-vulnerable versions; no open issues older than 180 days without maintainer response
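Two of the signals in this rubric, stale unanswered issues and median response time, can be derived from issue timestamps. A minimal sketch: the issue dictionary keys (`opened`, `first_response`, `open`) are a hypothetical shape, not a real tracker API, and "no maintainer response in 90+ days" is read here as an open issue at least 90 days old with no maintainer response at all.

```python
from datetime import datetime, timedelta
from statistics import median


def stale_unanswered(issues, now, max_age_days=90):
    """Count open issues at least max_age_days old with no maintainer response."""
    cutoff = now - timedelta(days=max_age_days)
    return sum(
        1
        for issue in issues
        if issue["open"]
        and issue["first_response"] is None
        and issue["opened"] <= cutoff
    )


def median_response_days(issues):
    """Median days from filing to first maintainer response.

    Returns None when no issue has a response, which covers projects
    with no issues at all.
    """
    delays = [
        (issue["first_response"] - issue["opened"]).total_seconds() / 86400
        for issue in issues
        if issue["first_response"] is not None
    ]
    return median(delays) if delays else None
```

Under this reading, a `stale_unanswered` count above 10 corresponds to the first score-0 condition, and a median below 7 days corresponds to the response-time condition for a score of 2.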

`B12` Semver Adherence & Version Stability

Requirement: SHOULD | Weight: Standard (×1)

Whether breaking changes are confined to major versions and the tool follows predictable versioning.

Score	Description
0. Failing	No versioning strategy; breaking changes in minor/patch releases; or perpetually pre-1.0 with breaking changes
1. Basic	Versioned releases exist; some semver adherence
2. Good	Semver-compliant; breaking changes in major versions only; pre-1.0 tools clearly labeled as unstable
3. Excellent	Strict semver; documented API stability guarantees; machine-readable compatibility matrices; LTS versions for production use
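Whether breaking changes stay confined to major versions can be checked against a release history. A minimal sketch, assuming plain `MAJOR.MINOR.PATCH` version strings (no pre-release or build tags) and a caller-supplied flag saying whether each release contained a breaking change; both function names are hypothetical. Releases in the 0.x range are held to the same rule here, which is stricter than Semantic Versioning itself requires.

```python
def parse_version(version):
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def semver_violations(releases):
    """Return versions that shipped a breaking change without a major bump.

    releases: chronologically ordered list of
        (version string, has_breaking_change) pairs.
    """
    violations = []
    previous = None
    for version, breaking in releases:
        current = parse_version(version)
        if previous is not None and breaking and current[0] == previous[0]:
            violations.append(version)
        previous = current
    return violations
```

An empty result is consistent with the "breaking changes in major versions only" condition for a score of 2; any entry in the result points toward the score-0 condition of breaking changes in minor or patch releases.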

`B13` Governance & Continuity

Requirement: MAY | Weight: Standard (×1)

Whether the tool has governance structures that reduce single-entity risk and provide continuity assurance.

Open-source tools: evaluate this section:

Score	Description
0. Failing	Single individual maintainer with no organizational backing; no succession plan
1. Basic	Multiple maintainers with informal governance; or backed by a single company
2. Good	Open governance model; contributor guidelines; decision-making process documented; multiple organizational contributors
3. Excellent	Foundation governance (CNCF, Apache, Linux Foundation); formal succession planning; multiple organizational contributors with commit rights

Commercial/closed-source tools: evaluate this section instead:

Score	Description
0. Failing	No public information about the company or team; no terms of service addressing continuity
1. Basic	Established company with identifiable leadership; standard terms of service
2. Good	Published data portability/export mechanisms; documented SLA; company financials or funding publicly known
3. Excellent	Publicly traded or independently audited financials; published sunset policy with migration timeline commitments; data escrow or open-source fallback clause; contractual SLA with uptime guarantees

`B14` Security Track Record

Requirement: SHOULD | Weight: Standard (×1)

Vulnerability response speed and proactive security practices.

Open-source tools: evaluate this section:

Score	Description
0. Failing	Known unpatched vulnerabilities >90 days old; no security response history; OpenSSF Scorecard <3/10
1. Basic	Vulnerabilities patched within 90 days; some security practices visible; OpenSSF Scorecard 3–5/10
2. Good	Vulnerabilities patched within 30 days; code review enforced; branch protection enabled; OpenSSF Scorecard 5–7/10
3. Excellent	Vulnerabilities patched within 14 days; comprehensive security practices; CI security scanning; OpenSSF Scorecard >7/10; Code-Review check passing

Commercial/closed-source tools: evaluate this section instead:

Score	Description
0. Failing	No public security information; no evidence of security practices; known incidents with no public response
1. Basic	Security contact or security.txt exists; incidents acknowledged publicly; some security practices described on website
2. Good	Published security practices page; SOC 2 Type I or equivalent; vulnerability disclosure policy with timeline commitments; incident post-mortems published
3. Excellent	SOC 2 Type II or ISO 27001 certified; bug bounty program; incident post-mortems with root cause analysis; proactive security advisories; SDK dependencies regularly audited

`B15` Terms & Licensing Stability

Requirement: SHOULD | Weight: Standard (×1)

Whether the terms under which the tool is available are stable and free from change risk signals.

Open-source tools: evaluate this section:

Score	Description
0. Failing	No license specified; or non-standard/proprietary license with no stability commitment
1. Basic	OSI-approved license
2. Good	Stable OSI license with no change risk signals (no single-company >80% commits + broad CLA + cloud competition pattern)
3. Excellent	Stable license + none of the known change risk indicators; or irrevocable license grant; foundation-held copyright

Commercial/closed-source tools: evaluate this section instead:

Score	Description
0. Failing	No published terms of service; or terms allow unilateral changes with no notice
1. Basic	Published terms of service; clear commercial licensing terms
2. Good	Pricing commitments of 12+ months; terms require 90+ days notice for material changes; grandfathering policy for existing customers
3. Excellent	Multi-year pricing commitments or published pricing history demonstrating stability; contractual protection against adverse term changes; machine-readable pricing API; published API deprecation policy with 12+ month windows

8. Complexity Modules

Complexity modules add criteria based on how the tool is accessed and what it does. Each module is activated by a yes/no trigger question. A tool may activate zero, one, or many complexity modules: the triggers are independent.

Activating at least one complexity module means a tool's score is interpreted against the Agent Ready and Agent Native bands (see §4). Tools evaluated on the Base Standard alone are interpreted against the Agent-Ready Base band.

8.1 Module: Programmatic Interface

Trigger: Does the tool expose an API (REST, GraphQL, gRPC), SDK, or MCP server?

16 criteria. Evaluates the quality and safety of the agent-facing programmatic interface: descriptions, schemas, outputs, naming, protocol support, and interface-level security.

Tool description quality is identified across 13+ independent sources as the factor with the greatest single impact on agent success rates. Across description quality, schema design, output size, and naming, minimalism correlates with improved agent performance: fewer tools, tighter schemas, smaller outputs, and more precise names each improve measured outcomes. Interface-level security (input sanitization, prompt injection resistance) applies to any programmatic interface, whether read-only or read-write.

`PI1` Interface Reference Completeness

Requirement: MUST | Weight: Critical (×2)

Whether the programmatic interface (API endpoints, SDK methods, MCP tools) is documented with sufficient detail for an agent to use without guessing.

Dual-interface tools: For tools with multiple programmatic interfaces (e.g., REST API and MCP server), the MUST gate evaluates the primary interface's documentation coverage. The primary interface is determined automatically during classification based on the tool's documented recommended integration path (typically REST API for tools with both REST and MCP). Poor documentation coverage on a secondary interface is captured in the criterion's overall score and in PI9 (MCP Implementation Quality), but does not independently trigger the MUST gate failure. This prevents penalizing tools for offering an additional interface: a tool should not be disincentivized from publishing an MCP server by the risk that its MCP documentation triggers a MUST gate failure when its REST API documentation is comprehensive.

Score	Description
0. Failing	No interface documentation; or docs exist but cover <50% of methods/endpoints
1. Basic	Methods/endpoints are documented; ≥50% have basic descriptions
2. Good	>80% of methods/endpoints have request/response examples; parameter types and constraints documented
3. Excellent	100% coverage with examples, edge cases, and error scenarios documented per method/endpoint; parameter constraints include formats, ranges, and valid values

`PI2`Tool/Endpoint Description Quality

Requirement: MUST | Weight: Critical (×2)

The completeness, specificity, and actionability of descriptions attached to tools, functions, or API endpoints.

Score	Description
0. Failing	Missing descriptions, name restatement ("Gets data"), no parameter descriptions
1. Basic	Basic description of what tool/endpoint does; some parameter documentation
2. Good	Specific descriptions with when-to-use guidance; all parameters described with types and examples; 1–2 usage examples per tool
3. Excellent	When-to-use AND when-NOT-to-use; inline examples with realistic data; enum values listed; return format documented; 1–5 examples focusing on ambiguous cases; edge cases noted

`PI3`Tool Count & Surface Area Management

Requirement: SHOULD | Weight: Critical (×2)

The number of tools/endpoints exposed to an agent at once, and whether mechanisms exist to manage surface area.

Score	Description
0. Failing	50+ undifferentiated tools; each tool = one REST endpoint (API-mirroring pattern); or 31–49 tools with no meaningful grouping or surface area management
1. Basic	≤30 tools; some logical grouping
2. Good	5–15 focused tools designed around user outcomes (not API operations); logical grouping
3. Excellent	5–15 tools + dynamic discovery/deferred loading for larger catalogs; code-mode pattern for complex APIs; semantic search over tool catalog

`PI4`Input Schema Design

Requirement: SHOULD | Weight: Critical (×2)

The degree to which input validation prevents common agent errors through strict schemas, constrained formats, and actionable feedback.

Score	Description
0. Failing	No schema validation; loose types; accepts malformed input silently
1. Basic	Basic JSON Schema with types; some required field marking
2. Good	Strict schemas with enums for constrained fields; format examples; ≤3 nesting levels; all properties documented
3. Excellent	Flat top-level primitives; comprehensive enum/default/description; ≤500 tokens per tool schema; `strict: true` compatible; `additionalProperties: false`
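
The schema properties named in the rows above can be checked mechanically. The following is a minimal sketch, not part of the standard: the schema `CREATE_INVOICE_SCHEMA` and the helpers `nesting_depth` and `audit_schema` are hypothetical names introduced for illustration, and the audit covers only three of the listed properties (extra properties rejected, nesting depth, descriptions present). It does not measure token counts or `strict: true` compatibility.

```python
from typing import Any

# Hypothetical tool input schema (illustration only): flat primitives,
# an enum for the constrained field, a default, and a description per property.
CREATE_INVOICE_SCHEMA: dict[str, Any] = {
    "type": "object",
    "additionalProperties": False,
    "required": ["customer_id", "amount_minor", "currency"],
    "properties": {
        "customer_id": {"type": "string", "description": "Customer identifier."},
        "amount_minor": {"type": "integer", "minimum": 1,
                         "description": "Amount in the smallest currency unit."},
        "currency": {"type": "string", "enum": ["usd", "eur", "jpy"],
                     "description": "Lowercase ISO 4217 code."},
        "send_email": {"type": "boolean", "default": False,
                       "description": "Email the invoice to the customer."},
    },
}


def nesting_depth(schema: dict[str, Any]) -> int:
    """Count object levels; a flat object of primitives has depth 1."""
    if schema.get("type") == "object":
        children = schema.get("properties", {}).values()
        return 1 + max((nesting_depth(child) for child in children), default=0)
    if schema.get("type") == "array":
        return nesting_depth(schema.get("items", {}))
    return 0


def audit_schema(schema: dict[str, Any]) -> list[str]:
    """Return findings; an empty list means none of the three checks failed."""
    findings = []
    if schema.get("additionalProperties") is not False:
        findings.append("additionalProperties is not false")
    if nesting_depth(schema) > 3:
        findings.append("more than 3 nesting levels")
    for name, prop in schema.get("properties", {}).items():
        if not prop.get("description"):
            findings.append(f"property '{name}' has no description")
    return findings
```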

`PI5`Output Quality & Token Efficiency

Requirement: SHOULD | Weight: Standard (×1)

Whether responses contain high-signal, bounded-size data with pagination and format optimization.

Score	Description
0. Failing	Unbounded responses with no size constraints (100K+ tokens possible); opaque UUIDs without context; no pagination
1. Basic	Pagination exists; typical responses <10K tokens
2. Good	Paginated with cursor metadata (`has_more`, `next_cursor`); compact summaries; semantic identifiers; filtering parameters
3. Excellent	Token-budgeted responses; `outputSchema` defined; concise mode available; cursor-based pagination; CSV/TSV option for tabular data; response size bounded by default
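
One way to produce the cursor metadata named in the Good row (`has_more`, `next_cursor`) with a bounded default page size is sketched below. The function names, the `data` key, the cursor encoding, and the limits are assumptions made for illustration; the standard does not prescribe them.

```python
import base64
import json
from typing import Any, Optional

DEFAULT_LIMIT = 50   # assumed default; bounds response size when the caller sets nothing
MAX_LIMIT = 100      # assumed hard cap


def encode_cursor(offset: int) -> str:
    """Pack the next offset into an opaque, URL-safe token."""
    raw = json.dumps({"offset": offset}).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: str) -> int:
    """Recover the offset from a token produced by encode_cursor."""
    raw = base64.urlsafe_b64decode(cursor.encode("ascii"))
    return int(json.loads(raw)["offset"])


def list_page(items: list[Any], cursor: Optional[str] = None,
              limit: int = DEFAULT_LIMIT) -> dict[str, Any]:
    """Return one bounded page plus the metadata an agent needs to continue."""
    limit = max(1, min(limit, MAX_LIMIT))
    start = decode_cursor(cursor) if cursor else 0
    end = start + limit
    has_more = end < len(items)
    return {
        "data": items[start:end],
        "has_more": has_more,
        "next_cursor": encode_cursor(end) if has_more else None,
    }
```

An agent can loop on `has_more` and pass `next_cursor` back unchanged, without needing to understand how the cursor is encoded.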

`PI6`Response Envelope Consistency

Requirement: SHOULD | Weight: Standard (×1)

Whether all API endpoints return responses in the same structural shape.

Score	Description
0. Failing	Variable response shapes across endpoints; inconsistent naming; fields omitted when null; type instability (field is sometimes string, sometimes array)
1. Basic	Mostly consistent; some endpoints deviate; same general pattern
2. Good	Consistent envelope structure; consistent naming convention (snake_case or camelCase, not mixed); null fields included as `null` (scalars) or `[]` (collections)
3. Excellent	Identical envelope everywhere; automated linting enforces consistency; type stability guaranteed; published response schema
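
The null-handling rule in the Good row (scalars as `null`, collections as `[]`, never omitted) can be illustrated with a small normalizer. This is a sketch under assumptions: the field lists and the `data`/`error` envelope keys are hypothetical, not taken from any particular API.

```python
from typing import Any

# Hypothetical field declarations for one resource type.
SCALAR_FIELDS = ("id", "name", "email")
COLLECTION_FIELDS = ("tags",)


def to_envelope(record: dict[str, Any]) -> dict[str, Any]:
    """Wrap a record so every response carries the same keys and types."""
    data: dict[str, Any] = {field: record.get(field) for field in SCALAR_FIELDS}
    for field in COLLECTION_FIELDS:
        # Missing or null collections become an empty list, never an absent key.
        data[field] = list(record.get(field) or [])
    return {"data": data, "error": None}
```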

`PI7`Naming & Namespacing

Requirement: SHOULD | Weight: Standard (×1)

The predictability, distinctiveness, and collision-resistance of tool, function, or endpoint names.

Score	Description
0. Failing	Generic names like `search`, `get_data`, `doThing`; inconsistent casing; no namespace prefix
1. Basic	Descriptive names; consistent casing (snake_case preferred); no service prefix
2. Good	Service-prefixed snake_case (e.g., `stripe_create_charge`, `github_list_issues`)
3. Excellent	Service-prefixed + self-descriptive + unique within multi-server ecosystem + predictable pattern across all tools
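
The naming pattern in the Good row lends itself to an automated check. The sketch below is illustrative only; `check_tool_names` and its issue strings are hypothetical, and the regular expression encodes one reasonable reading of "service-prefixed snake_case" (lowercase, at least two underscore-separated words).

```python
import re

# Lowercase words joined by underscores, with at least two words.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")


def check_tool_names(names: list[str], service_prefix: str) -> dict[str, list[str]]:
    """Map each problematic tool name to the list of issues found."""
    problems: dict[str, list[str]] = {}
    seen: set[str] = set()
    for name in names:
        issues = []
        if not SNAKE_CASE.match(name):
            issues.append("not multi-word snake_case")
        if not name.startswith(service_prefix + "_"):
            issues.append("missing service prefix")
        if name in seen:
            issues.append("duplicate name")
        seen.add(name)
        if issues:
            problems[name] = issues
    return problems
```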

`PI8`Behavioral Metadata & Annotations

Requirement: SHOULD | Weight: Standard (×1)

Machine-readable metadata declaring whether a tool is read-only, destructive, idempotent, and whether it interacts with external entities.

Score	Description
0. Failing	No behavioral annotations; all tools appear equivalent
1. Basic	`readOnlyHint` set on obvious read-only tools
2. Good	All four MCP annotations set accurately (`readOnlyHint`, `destructiveHint`, `idempotentHint`, `openWorldHint`) on every tool
3. Excellent	Full annotations + output annotations (audience, priority) + risk ratings + HTTP method semantics matching behavior (GET = read-only, DELETE = destructive)
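
A check for the Good row (all four annotations present on every tool) might look like the following sketch. The tool dictionaries use a simplified, assumed shape with a `name` and an `annotations` mapping; `annotation_gaps` is a hypothetical helper. It verifies presence and one obvious contradiction, not accuracy, which still requires comparing annotations against actual behavior.

```python
from typing import Any

REQUIRED_HINTS = ("readOnlyHint", "destructiveHint", "idempotentHint", "openWorldHint")


def annotation_gaps(tools: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Map each tool name to the hints that are missing or contradictory."""
    gaps: dict[str, list[str]] = {}
    for tool in tools:
        annotations = tool.get("annotations", {})
        issues = [hint for hint in REQUIRED_HINTS
                  if not isinstance(annotations.get(hint), bool)]
        # A tool cannot be both read-only and destructive.
        if annotations.get("readOnlyHint") is True and annotations.get("destructiveHint") is True:
            issues.append("readOnlyHint and destructiveHint are both true")
        if issues:
            gaps[tool["name"]] = issues
    return gaps
```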

`PI9`MCP Implementation Quality

Requirement: MAY | Weight: Standard (×1)

When an MCP server exists (official or community), the quality of that implementation.

Score	Description
0. Failing	No MCP server exists (official or community); OR MCP server exists but has critical quality issues: 50+ undifferentiated tools, no descriptions, command injection vulnerabilities, no error handling
1. Basic	Tools have descriptions; auth documented; read and write operations present
2. Good	5–20 focused tools designed around outcomes (not API mirroring); all four MCP annotations set accurately; read-only and read-write tools clearly separated; documented auth
3. Excellent	All of above + deferred loading / dynamic discovery for large catalogs; safety-tiered tools (read/write/destructive separated); read-only mode available; output annotations; supported clients documented; maintained alongside product releases

Always evaluated: PI9 is always included in the evaluation for tools that trigger the Programmatic Interface module (it is never excluded from the denominator). Tools without an MCP server score 0 on PI9. Because PI9 is a MAY criterion, a score of 0 has limited impact on reaching the Agent Ready band (where MAY criteria primarily contribute to the overall percentage) but meaningfully affects the Agent Native band (which requires that no SHOULD or MUST criterion score 0, and where MAY criteria still contribute to the ≥80% overall threshold). The practical effect: a tool's score can reach Agent Ready without MCP, but Agent Native demands either an MCP server or enough excellence elsewhere to absorb the PI9 zero. This creates a directional incentive toward MCP adoption without penalizing non-adoption at the Agent Ready band.
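
To make the size of the PI9 zero concrete, the sketch below computes a weighted percentage for the sixteen Programmatic Interface criteria only. The scoring formula is defined in §4 and is not reproduced in this section, so the formula used here (sum of score × weight divided by sum of 3 × weight) is an assumption for illustration. The weights are the ones stated in the criterion headers above. A real evaluation would also include the Base Standard and any other activated modules.

```python
# Weights as stated in this module: Critical = 2, Standard = 1.
PI_WEIGHTS = {
    "PI1": 2, "PI2": 2, "PI3": 2, "PI4": 2, "PI5": 1, "PI6": 1, "PI7": 1, "PI8": 1,
    "PI9": 1, "PI10": 2, "PI11": 1, "PI12": 1, "PI13": 1, "PI14": 1, "PI15": 1, "PI16": 1,
}
MAX_SCORE = 3


def weighted_percentage(scores: dict[str, int], weights: dict[str, int]) -> float:
    """ASSUMED formula: earned weighted points over maximum weighted points."""
    earned = sum(scores[cid] * weights[cid] for cid in weights)
    possible = sum(MAX_SCORE * weights[cid] for cid in weights)
    return 100.0 * earned / possible


# Hypothetical tool: Excellent everywhere except PI9, where no MCP server exists.
scores = {cid: MAX_SCORE for cid in PI_WEIGHTS}
scores["PI9"] = 0
module_percentage = weighted_percentage(scores, PI_WEIGHTS)
```

Under these assumptions the module percentage stays above 95%, which is consistent with the statement that excellence elsewhere can absorb the PI9 zero.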

`PI10`Programmatic Setup / Time to First API Call

Requirement: SHOULD | Weight: Critical (×2)

The amount of tool-specific configuration required after an agent has valid credentials (or no credentials are needed) before it can make a successful API call. This criterion measures post-credential setup quality: how well the tool minimizes friction between credential acquisition and a successful first API call.

Separation of concerns with AU1: Credential acquisition (account creation, key generation, OAuth setup) is evaluated under AU1 (Non-Interactive Authentication Methods). PI10 measures everything after authentication is solved. A tool that requires browser-based account creation is already penalized on AU1; PI10 does not double-count that friction.

Score	Description
0. Failing	>10 minutes post-credential setup; multiple dashboard-only configuration steps required before first API call; tool-specific configuration requires human interaction
1. Basic	5–10 minutes; 1–2 tool-specific configuration steps (project creation, API enablement, webhook setup)
2. Good	2–5 minutes; single environment variable or config file; sandbox/test mode available immediately; clear error on misconfiguration
3. Excellent	<2 minutes; zero-config possible for basic usage; test/sandbox works immediately with credentials alone; config validation with actionable errors; programmatic project setup via API

`PI11`API Workflow Coverage

Requirement: SHOULD | Weight: Standard (×1)

The percentage of common workflows completable entirely through the API without requiring web dashboard interaction.

Score	Description
0. Failing	Core functionality requires dashboard; API covers <50% of common workflows
1. Basic	Core CRUD operations available via API; some configuration requires dashboard
2. Good	>80% of common workflows completable via API; dashboard-only steps documented
3. Excellent	100% of functionality available via API; no dashboard-only features for any common workflow

`PI12`Versioning & API Stability

Requirement: SHOULD | Weight: Standard (×1)

Whether the API uses explicit versioning with adequate deprecation signals and managed breaking changes.

Score	Description
0. Failing	No versioning strategy; unannounced breaking changes
1. Basic	Version identifier exists (URL path, header, or parameter); some deprecation notices
2. Good	Explicit versioning with documented deprecation policy; `deprecated: true` in specs; 6+ month deprecation windows
3. Excellent	Semver-adherent; `Sunset` headers; machine-readable deprecation timeline; previous version maintained for 12+ months after deprecation
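
On the consuming side, an agent can turn a `Sunset` header into a machine-actionable deadline. The sketch below assumes the header carries an HTTP-date, as the `Sunset` header field specification (RFC 8594) defines; the function name and the dictionary-of-headers input are illustrative choices.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def days_until_sunset(headers: Mapping[str, str], now: datetime) -> Optional[int]:
    """Return whole days until the Sunset date, or None if no header is present.

    `now` must be timezone-aware, for example datetime.now(timezone.utc).
    """
    normalized = {key.lower(): value for key, value in headers.items()}
    value = normalized.get("sunset")
    if value is None:
        return None
    sunset = parsedate_to_datetime(value)  # parses the HTTP-date format
    return (sunset - now).days
```

A negative result means the sunset date has already passed and the agent should stop relying on that version.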

`PI13`SDK Availability & Quality

Requirement: SHOULD | Weight: Standard (×1)

Whether official SDKs exist in languages agents commonly use, and whether they're well-maintained.

Score	Description
0. Failing	No SDK; raw HTTP only
1. Basic	Official SDK in 1 major language (Python or TypeScript/JavaScript)
2. Good	Official SDKs in 2+ major languages; idiomatic to each; typed interfaces
3. Excellent	SDKs in 4+ languages; type-safe with branded types; auto-generated from OpenAPI spec; maintained in sync with API releases

`PI14`Agent Protocol Availability

Requirement: SHOULD | Weight: Standard (×1)

Whether the tool provides high-quality programmatic interfaces for agent interaction. A well-designed REST API, an MCP server, or both are valid paths.

Score	Description
0. Failing	No programmatic interface; GUI/dashboard only; or API exists but is undocumented
1. Basic	REST API with basic documentation; OR community MCP server exists
2. Good	Well-designed REST API with OpenAPI spec and SDKs in 2+ languages; OR official MCP server with documented auth and core operation coverage
3. Excellent	Excellent REST API with comprehensive SDKs AND official MCP server; OR one interface executed at exceptional quality (e.g., Stripe-quality API without MCP, or best-in-class MCP without REST)

`PI15`Input Sanitization & Injection Resistance

Requirement: MUST | Weight: Standard (×1)

Whether the tool demonstrates evidence of input sanitization and defense against injection attacks through schema design, security infrastructure, documentation, and architectural patterns. Because PI15 is a MUST gate, a tool that exposes a programmatic interface with no evidence of input sanitization cannot reach the Agent Ready or Agent Native band regardless of total score.

Score	Description
0. Failing	No evidence of input sanitization: no schema validation, no parameterized queries, no security infrastructure, no documentation of input handling practices
1. Basic	Basic input validation evidenced: strict input schemas with type checking (from PI4); parameterized queries documented; or WAF/CDN security infrastructure detected
2. Good	Comprehensive sanitization evidence: strict schemas with `additionalProperties: false` across all endpoints; parameterized operations documented throughout; security infrastructure present; input validation practices documented
3. Excellent	All of above + allowlist-based input validation documented where feasible; security testing in CI (detected via workflow analysis); defense-in-depth architecture documented; vendor-provided security assessment or third-party audit results available

`PI16`Prompt Injection Resistance

Requirement: SHOULD | Weight: Standard (×1)

Tool-level defense-in-depth against prompt injection: strict schemas, output sanitization, injection-resistant designs.

Score	Description
0. Failing	Unsafe patterns present or encouraged; no awareness of injection risks; tool descriptions contain narrative or references to other tools
1. Basic	Strict input schema validation; parameterized operations; minimal description surface
2. Good	Output structured with clear field boundaries (JSON); response size limits; descriptions concise and self-contained; no cross-tool references in descriptions
3. Excellent	Explicit design mitigations documented; policy layer for untrusted content; output validation; structured action metadata; separation of untrusted content from control flow

8.2 Module: Network Service

Trigger: Is the tool a hosted or remote service (SaaS, PaaS, cloud API)?

8 criteria. Evaluates concerns specific to services that run remotely: error handling, rate limits, health endpoints, observability, sandboxing, environment separation, asynchronous operations, and data portability and pricing transparency.

`NS1`Error Response Quality & Structure

Requirement: MUST | Weight: Critical (×2)

Whether error responses provide structured, machine-parseable information enabling agents to diagnose problems, determine retryability, and execute recovery actions.

Score	Description
0. Failing	HTML error pages; empty responses; generic "Something went wrong"; silent failures (no error flag set)
1. Basic	JSON errors with human-readable message and machine-readable error code; MCP errors set `isError: true`
2. Good	RFC 9457 compliant (`type`, `title`, `status`, `detail`); all validation errors reported simultaneously (not one-at-a-time); field-level identification; `doc_url` per error type
3. Excellent	All of above + `is_retriable` boolean + `retry_after_seconds` + suggested alternative actions + hierarchical error taxonomy (e.g., Stripe: type → code → decline_code) + numbered recovery steps
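
The following sketch shows what an error body combining the Good and Excellent rows might look like, and how an agent could act on it. The four core members (`type`, `title`, `status`, `detail`) and the `application/problem+json` media type come from RFC 9457. `is_retriable` and `retry_after_seconds` are the extension members named in the table. The `errors` member, the function names, and the action strings are assumptions introduced for illustration.

```python
from typing import Any, Optional


def problem_response(status: int, type_uri: str, title: str, detail: str, *,
                     is_retriable: bool,
                     retry_after_seconds: Optional[int] = None,
                     field_errors: Optional[list[dict[str, str]]] = None):
    """Build (status, headers, body) for a structured error response."""
    body: dict[str, Any] = {
        "type": type_uri, "title": title, "status": status, "detail": detail,
        "is_retriable": is_retriable,
    }
    headers = {"Content-Type": "application/problem+json"}
    if retry_after_seconds is not None:
        body["retry_after_seconds"] = retry_after_seconds
        headers["Retry-After"] = str(retry_after_seconds)
    if field_errors:
        # All validation errors reported at once, each tied to a field.
        body["errors"] = field_errors
    return status, headers, body


def next_action(body: dict[str, Any]) -> str:
    """Agent-side decision based only on machine-readable members."""
    if body.get("is_retriable"):
        return f"retry_after:{body.get('retry_after_seconds', 0)}"
    return "fix_request" if body.get("errors") else "abort"
```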

`NS2`Rate Limit Communication

Requirement: SHOULD | Weight: Critical (×2)

Whether rate limits are communicated proactively and include machine-actionable timing signals.

Score	Description
0. Failing	No rate limit headers; no `Retry-After` on 429 responses; undocumented limits
1. Basic	`Retry-After` on 429 responses; limits documented somewhere
2. Good	Rate limit headers on every response (`X-RateLimit-Remaining`, `X-RateLimit-Limit`, `X-RateLimit-Reset`); per-key limits; scope declared (per-endpoint vs. global)
3. Excellent	Full header suite on all responses; batch endpoints to reduce call count; resource-aware cost metadata (e.g., `operationCost: { credits: 5 }`); per-agent rate limits
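
An agent can use these headers to pause before it is throttled, not only after a 429. The sketch below makes two assumptions that vary between services and should be checked against each service's documentation: `X-RateLimit-Reset` is treated as a Unix epoch timestamp in seconds, and `Retry-After` is handled only in its delta-seconds form (the HTTP-date form is ignored here).

```python
from typing import Mapping


def seconds_to_wait(status: int, headers: Mapping[str, str], now_epoch: float) -> float:
    """Return how long to pause before the next request (0.0 means go ahead)."""
    normalized = {key.lower(): value for key, value in headers.items()}

    # Reactive path: the server has already refused the request.
    retry_after = normalized.get("retry-after", "")
    if status == 429 and retry_after.isdigit():
        return float(retry_after)

    # Proactive path: quota is exhausted, so wait until the window resets.
    # ASSUMPTION: the reset value is epoch seconds, not a relative delay.
    if normalized.get("x-ratelimit-remaining") == "0" and "x-ratelimit-reset" in normalized:
        return max(0.0, float(normalized["x-ratelimit-reset"]) - now_epoch)

    return 0.0
```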

`NS3`Health & Status Communication

Requirement: SHOULD | Weight: Standard (×1)

Whether the tool provides structured health endpoints that agents can query to assess availability.

Score	Description
0. Failing	No health endpoint; HTML status pages only
1. Basic	`/health` endpoint returns JSON with aggregate up/down status
2. Good	Component-level status; `Retry-After` on 503 responses; maintenance schedule available
3. Excellent	Per-dependency status; degradation warnings in response metadata; `application/health+json` format (IETF Internet-Draft); planned maintenance pre-signaled

`NS4`Audit & Observability

Requirement: SHOULD | Weight: Standard (×1)

Whether the tool logs agent interactions with sufficient detail for forensic analysis and compliance.

Score	Description
0. Failing	No meaningful logs or observability for API/tool interactions
1. Basic	Basic request/response logging; API key identified in logs
2. Good	Audit logs with correlation IDs; sensitive-data redaction; rate limits enforced with logged violations; agent identity distinguished from human in logs
3. Excellent	OpenTelemetry-compatible trace/span IDs; immutable append-only audit logs; delegation chain logging; anomaly detection or alerting; per-action risk tier logging

`NS5`Test/Sandbox Environment Support

Requirement: SHOULD | Weight: Standard (×1)

Whether the tool provides sandbox environments, test keys, and safe experimentation modes.

Score	Description
0. Failing	No test mode; no sandbox; all mutations occur in the production environment
1. Basic	Test mode exists but with limited simulation
2. Good	Separate sandbox environment + basic behavioral simulation + API-verifiable mode (test responses indicate test mode)
3. Excellent	Structurally distinct test/live keys (prefixed like `sk_test_`); separate sandbox URLs; full behavioral simulation; multiple sandboxes; time simulation (Stripe test clocks, Neon database branching)

`NS6`Environment Separation

Requirement: SHOULD | Weight: Standard (×1)

Whether the tool architecturally separates development, staging, and production environments.

Score	Description
0. Failing	No environment separation; single set of credentials for all environments; test and production data co-mingled
1. Basic	Separate environments exist but share credentials or configuration
2. Good	Distinct credentials per environment; environment clearly indicated in API responses; preview/staging deployments available
3. Excellent	Environment-specific URLs and credentials; database branching for isolated experimentation; deploy previews via API; environment promotion workflow (dev → staging → prod)

`NS7`Asynchronous Operation Support

Requirement: MAY | Weight: Standard (×1)

Whether long-running operations return immediately with a durable handle and provide status mechanisms.

Score	Description
0. Failing	Long-running operations block until complete; timeouts cause retries with potential duplicates
1. Basic	HTTP 202 Accepted pattern with task/job ID on some operations
2. Good	Consistent async pattern across all long-running operations; polling endpoint with status; `estimated_seconds` in 202 response
3. Excellent	Full async with lifecycle states (working → completed/failed/cancelled); both polling and webhook notification; blocking result endpoint for simple cases; progress reporting
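
The lifecycle in the Excellent row can be sketched as a small in-memory job store. The state names follow the table (working, completed, failed, cancelled). The class, method names, and response fields other than `estimated_seconds` are hypothetical, and a real service would persist jobs durably.

```python
import uuid
from typing import Any, Optional

TERMINAL_STATES = {"completed", "failed", "cancelled"}


class JobStore:
    """Minimal async-operation registry: submit returns at once, poll reports state."""

    def __init__(self) -> None:
        self._jobs: dict[str, dict[str, Any]] = {}

    def submit(self, estimated_seconds: int) -> tuple[int, dict[str, Any]]:
        """Register a job and return 202 with a durable handle."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = {"job_id": job_id, "status": "working",
                              "estimated_seconds": estimated_seconds, "result": None}
        return 202, dict(self._jobs[job_id])

    def finish(self, job_id: str, status: str, result: Optional[Any] = None) -> None:
        """Move a working job into exactly one terminal state."""
        job = self._jobs[job_id]
        if status not in TERMINAL_STATES:
            raise ValueError(f"not a terminal state: {status}")
        if job["status"] in TERMINAL_STATES:
            raise ValueError("job already finished")
        job.update(status=status, result=result)

    def poll(self, job_id: str) -> tuple[int, dict[str, Any]]:
        """Return the current state without blocking."""
        job = self._jobs.get(job_id)
        if job is None:
            return 404, {"error": "unknown_job"}
        return 200, dict(job)
```

Because the handle is returned before the work completes, a client timeout leads to another poll, not to a duplicate submission.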

`NS8`Data Portability & Pricing Transparency

Requirement: MAY | Weight: Standard (×1)

Whether the service provides programmatic access to pricing, usage tracking, and data export. Agents operating autonomously cannot parse marketing pages, "Contact Sales" buttons, or dashboard-only usage tracking: they need structured, machine-readable access to costs, consumption, and data portability.

Score	Description
0. Failing	No data export capability; pricing only on marketing pages; no usage tracking API
1. Basic	Manual data export (dashboard); published pricing page; basic usage visible in dashboard
2. Good	Programmatic data export API; published pricing with clear unit costs; usage tracking API; billing alerts
3. Excellent	Bulk export API with standard formats (CSV, JSON, Parquet); machine-readable pricing API or structured pricing page; real-time usage tracking; spending limit API; cost estimation before provisioning

Relationship to HI1 (Cost Guardrails): HI1 evaluates mechanisms to prevent cost overruns (spending limits, auto-stop, cost caps). NS8 evaluates information availability: can agents determine what something costs, how much has been spent, and whether data can be extracted? A tool can score well on NS8 (transparent pricing, usage API) while scoring poorly on HI1 (no spending limits), or vice versa.

8.3 Module: Write Operations

Trigger: Can the tool create, modify, or delete data or resources?

4 criteria. Evaluates safeguards for irreversible actions: destructive operation safety, dry-run capability, idempotency, and multi-step error handling.

`WO1`Destructive Operation Safety

Requirement: SHOULD | Weight: Critical (×2)

Mechanisms that prevent agents from executing irreversible destructive operations without appropriate safeguards.

Score	Description
0. Failing	No guardrails; agent gets full read/write/delete access by default; no confirmation patterns
1. Basic	Database-level or API-level permissions with agent-specific restricted roles; some operations require confirmation
2. Good	Layered defenses: read-only modes + lexical blocklists (DROP, DELETE, TRUNCATE) + human confirmation gates for high-risk operations; soft delete support
3. Excellent	Physical write prevention (read-only replicas); destructive ops excluded from agent-facing interfaces; structural prevention patterns (auth-capture for payments, plan-apply for infra); draft/preview/publish separation
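
The layered defenses in the Good row (read-only mode, lexical blocklist, human confirmation gate) can be sketched for SQL statements as follows. This is an illustration only: `gate_statement` and its return values are hypothetical, and keyword matching is a coarse filter that produces false positives (for example, a keyword inside a string literal) and is not a substitute for database-level permissions.

```python
import re

# Keywords taken from the blocklist example in the table above.
BLOCKED = re.compile(r"\b(DROP|DELETE|TRUNCATE)\b", re.IGNORECASE)


def gate_statement(sql: str, *, read_only: bool = True,
                   human_approved: bool = False) -> str:
    """Return 'allow', 'deny', or 'confirm' for a single SQL statement."""
    # Layer 1: in read-only mode, anything that is not a SELECT is refused.
    if read_only and not sql.lstrip().upper().startswith("SELECT"):
        return "deny"
    # Layer 2: blocklisted keywords need explicit human approval.
    if BLOCKED.search(sql):
        return "allow" if human_approved else "confirm"
    return "allow"
```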

`WO2`Dry-Run / Validation Capability

Requirement: MAY | Weight: Standard (×1)

Whether the tool provides mechanisms to validate requests without executing them.

Score	Description
0. Failing	No dry-run or validation capability
1. Basic	Validation endpoint exists for some operations
2. Good	Dry-run parameter or validation endpoint for most mutating operations; returns what would happen without side effects
3. Excellent	Dry-run executes full validation chain (Terraform plan, Kubernetes server-side dry-run); standardized parameter (e.g., `validate_only: true` per Google AIP-163); diff output showing proposed changes

`WO3`Idempotency & Safe Retry Support

Requirement: SHOULD | Weight: Standard (×1)

Whether mutating operations accept idempotency keys to prevent duplicate side effects when agents retry failed requests.

Score	Description
0. Failing	No idempotency support; retries cause duplicate side effects
1. Basic	Idempotency-Key accepted on critical mutating operations
2. Good	Idempotency enforced with 24h+ key persistence; concurrent request handling via locking; documented key behavior
3. Excellent	Comprehensive idempotency across all non-idempotent operations; conflict detection (same key, different params → 409); Stripe-model parameter validation
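
A minimal server-side sketch of the behavior described in the Good and Excellent rows: replaying a key returns the stored response, reusing a key with different parameters yields 409, and keys expire after a persistence window. The class, the error code string, and the in-memory storage are assumptions for illustration, and the concurrent-request locking named in the Good row is deliberately left out.

```python
import hashlib
import json
import time
from typing import Any, Callable


class IdempotencyStore:
    """Remember the outcome of each idempotency key for a fixed window."""

    def __init__(self, ttl_seconds: float = 24 * 3600,
                 clock: Callable[[], float] = time.time) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, dict[str, Any]] = {}

    def execute(self, key: str, params: dict[str, Any],
                operation: Callable[[dict[str, Any]], tuple[int, Any]]) -> tuple[int, Any]:
        """Run `operation` at most once per live key."""
        fingerprint = hashlib.sha256(
            json.dumps(params, sort_keys=True).encode("utf-8")).hexdigest()
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry["stored_at"] < self._ttl:
            if entry["fingerprint"] != fingerprint:
                # Same key, different parameters: refuse to guess the intent.
                return 409, {"error": "idempotency_key_reused_with_different_params"}
            return entry["status"], entry["body"]  # replay, no second side effect
        status, body = operation(params)
        self._entries[key] = {"fingerprint": fingerprint, "status": status,
                              "body": body, "stored_at": now}
        return status, body
```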

`WO4`Workflow Error Communication

Requirement: SHOULD | Weight: Standard (×1)

Whether multi-step operations communicate progress, partial success, and resumability.

Score	Description
0. Failing	No step-level feedback; atomic success-or-fail with no intermediate state visibility
1. Basic	Failed step identified in error response; no resume capability
2. Good	Completed/failed/pending step enumeration; resume tokens or checkpoint IDs; severity indication (reversible vs. irreversible failure)
3. Excellent	Full checkpoint-based recovery; draft/preview/publish separation; compensating transactions for partial failures; 202 Accepted + polling for multi-step workflows

8.4 Module: Authentication

Trigger: Does the tool require credentials, API keys, OAuth, or any form of authentication?

4 criteria. Authentication is widely cited as the most persistent unresolved problem in agent-tool interaction. It functions as a binary gate: a tool that satisfies every other criterion but cannot be authenticated by an agent provides no agent utility.

`AU1`Non-Interactive Authentication Methods

Requirement: MUST | Weight: Critical (×2)

Whether the tool supports at least one authentication method that agents can complete without human interaction.

Score	Description
0. Failing	Only browser-based OAuth requiring human interaction; CAPTCHA-gated; 2FA with no bypass for service accounts
1. Basic	API keys available; basic documentation for key usage
2. Good	API keys + Client Credentials grant + M2M documentation + Device Flow for delegated access
3. Excellent	Multiple non-interactive methods + brokered credentials + programmatic key creation/rotation via API

AU1 is a MUST gate. If a tool requires authentication, support for at least one non-interactive authentication method is required. A score of 0 on AU1 disqualifies the tool from the Agent Ready or Agent Native band regardless of overall score.

`AU2`Permission Granularity

Requirement: SHOULD | Weight: Standard (×1)

How finely the tool allows scoping what an agent can access and do.

Score	Description
0. Failing	Single admin key with full access; no scoping mechanism
1. Basic	Read/write separation available
2. Good	Per-resource scoped keys + fine-grained OAuth scopes + insufficient permissions error includes required scope
3. Excellent	Per-resource per-operation scoping + machine-readable permission manifests + deny-by-default for destructive operations

`AU3`Credential Lifecycle Management

Requirement: SHOULD | Weight: Standard (×1)

Whether the tool supports automated credential rotation, refresh, expiry signaling, and per-agent revocation.

Score	Description
0. Failing	Manual rotation only; no programmatic credential management
1. Basic	API for key creation/rotation + refresh tokens
2. Good	Automatic rotation + zero-downtime overlap + per-key revocation + expiry metadata
3. Excellent	Brokered credentials + dual-secret rotation + proactive refresh guidance + per-key audit trail

`AU4`Agent Identity Support

Requirement: MAY | Weight: Standard (×1)

Whether the tool treats AI agents as a distinct identity type.

Score	Description
0. Failing	Shared credentials only; no way to distinguish agent from human
1. Basic	Service accounts with some scoping
2. Good	M2M auth with `client_credentials` + agent-specific rate limits
3. Excellent	Agent as first-class identity type + Token Vault + CIBA + per-action audit trail

8.5 Module: CLI

Trigger: Does the tool have a command-line interface?

4 criteria. Evaluates agent-specific CLI concerns: non-interactive execution, structured output, cross-platform behavior, and configuration safety.

`CLI1`Non-Interactive Execution

Requirement: SHOULD | Weight: Standard (×1)

The ability to run a tool without any human interaction: no confirmation prompts, no editor invocations, no TTY-dependent output.

Score	Description
0. Failing	Tool hangs or crashes without TTY; interactive prompts with no bypass
1. Basic	Some non-interactive flags exist (`--yes`, `--no-input`); some prompts remain
2. Good	Non-interactive flags for most prompts; CI mode detection; `--json` output mode
3. Excellent	Auto-detects non-TTY environment; flags for all interactive points; JSON output implies non-interactive; `NO_COLOR=1` support; separate stderr/stdout
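
The detection logic in the Excellent row can be sketched in a few lines. The function names and parameters are hypothetical. Treating a set `CI` environment variable as a signal of non-interactive use is a common convention, not a requirement of this standard. `NO_COLOR` is honored when it is set to a non-empty value.

```python
import os
import sys
from typing import Mapping, Optional


def is_interactive(stdin=None, env: Optional[Mapping[str, str]] = None, *,
                   json_output: bool = False, no_input: bool = False) -> bool:
    """Decide whether the CLI may prompt the user."""
    stdin = sys.stdin if stdin is None else stdin
    env = os.environ if env is None else env
    if json_output or no_input:   # JSON output implies non-interactive
        return False
    if env.get("CI"):             # assumed convention: CI systems set this
        return False
    return stdin.isatty()         # no TTY means nobody can answer a prompt


def use_color(stream, env: Mapping[str, str]) -> bool:
    """Emit ANSI color only to a terminal and only if NO_COLOR is unset or empty."""
    return stream.isatty() and not env.get("NO_COLOR")
```

When `is_interactive` returns false, a command that would otherwise prompt should fail with a clear error or apply a documented default, so that an agent never waits on input that will not arrive.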

`CLI2`Structured Output Mode

Requirement: SHOULD | Weight: Standard (×1)

Whether the CLI provides machine-parseable output alongside human-readable output. Agents consuming CLI output require structured data that can be parsed reliably. Without structured output, agents fall back to parsing formatted text through regular expressions, which is brittle across tool versions and locales.

Score	Description
0. Failing	Text-only output; no `--json` or equivalent flag; ANSI colors/formatting in default output with no disable mechanism; exit code 0/non-zero only with no structured error information
1. Basic	`--json` or `--format json` flag available for primary commands; basic exit codes (0 = success, non-zero = failure); stderr and stdout may be mixed
2. Good	JSON output available on all major commands; meaningful exit codes with descriptive stderr; stderr and stdout cleanly separated; `NO_COLOR=1` or `--no-color` supported
3. Excellent	Multiple structured formats (JSON + YAML + custom templates); structured output implies non-interactive mode (Terraform pattern: `--json` implies `--input=false`); `--porcelain` stability guarantee across versions (Git pattern); semantic exit codes (distinct codes for distinct failure modes); consistent JSON schema across CLI versions
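
A sketch of how a CLI might keep stdout, stderr, and exit codes cleanly separated while offering a JSON mode. The exit code values and names are hypothetical (the standard asks only that distinct failure modes get distinct codes), as are the `render` function and the error object shape.

```python
import json
from enum import IntEnum
from typing import Any


class ExitCode(IntEnum):
    """Hypothetical semantic exit codes: one per failure mode."""
    OK = 0
    GENERAL_ERROR = 1
    USAGE_ERROR = 2
    AUTH_ERROR = 3
    NOT_FOUND = 4


def render(result: dict[str, Any], exit_code: ExitCode,
           as_json: bool) -> tuple[str, str, int]:
    """Return (stdout_text, stderr_text, process_exit_code)."""
    if exit_code is ExitCode.OK:
        if as_json:
            return json.dumps(result, sort_keys=True), "", int(exit_code)
        lines = [f"{key}: {value}" for key, value in sorted(result.items())]
        return "\n".join(lines), "", int(exit_code)
    # Failures never write to stdout, so piped output stays parseable.
    message = str(result.get("message", ""))
    if as_json:
        error = {"error": exit_code.name.lower(), "message": message}
        return "", json.dumps(error, sort_keys=True), int(exit_code)
    return "", f"error: {message}", int(exit_code)
```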

`CLI3`Cross-Platform Consistency

Requirement: SHOULD | Weight: Standard (×1)

Whether the CLI behaves identically across Linux, macOS, and Windows. Agents trained primarily on Linux/macOS generate commands that fail silently on Windows: path separators, line endings, shell syntax, and temp directory locations all differ. A tool that works on one platform but behaves differently on another creates unpredictable agent failures.

Score	Description
0. Failing	Single-platform only (e.g., bash-only scripts); hard-coded platform-specific paths (`/tmp/`, `C:\`); no Windows support
1. Basic	Available on Linux, macOS, and Windows; but behavior or output may differ across platforms; platform-specific installation instructions
2. Good	Cross-platform binary distribution or container; consistent output format across platforms; path handling works with both `/` and `\`; no platform-specific shell syntax required
3. Excellent	CI tests on all three major platforms; byte-identical output across platforms; single static binary or zero-dependency install; devcontainer or Nix support for environment reproducibility; platform-specific differences documented

`CLI4`Configuration Format Safety

Requirement: MAY | Weight: Standard (×1)

Whether the tool's configuration format is safe for agent generation. YAML's whitespace sensitivity and implicit type coercion produce subtle, silent failures in agent-generated configuration: single-space indentation errors change data structure without raising syntax errors, and implicit type coercion (the "Norway problem" in which NO becomes false) silently corrupts data. JSON Schema validation materially reduces these failures by enabling agents to validate configuration before applying it.

Score	Description
0. Failing	YAML-only config with no schema validation; no config validation command; implicit type coercion undocumented
1. Basic	Config format documented; basic structure validation (file parses without error); YAML accepted but JSON alternative available
2. Good	JSON or TOML as primary config format; JSON Schema exists for config files; standalone validation command available (`validate`, `check`, `lint`); actionable error messages on misconfiguration
3. Excellent	JSON or TOML primary with published JSON Schema; schema-driven IDE and agent autocompletion; validation runs automatically before any destructive action; error messages include specific fix suggestions; no implicit type coercion; config secrets isolated from main config file

9. Domain Modules

Domain modules add criteria based on a tool's functional domains. A tool may trigger one or more domain modules: for example, Supabase (Databases + Auth Providers + Hosting) or Firebase (Databases + Auth Providers + Communications). Each activated domain module adds its criteria to the tool's evaluation, expanding both the numerator and the denominator, as complexity modules do.

9.1 Module: Payments & Financial

Applies to: Payment processors, billing platforms, financial APIs

ID	Criterion	Req	Weight
PM1	Idempotency Depth	SHOULD	Standard
PM2	Test Simulation Fidelity	SHOULD	Standard
PM3	Compliance Automation	MAY	Standard
PM4	Currency & Amount Safety	SHOULD	Standard

`PM1`Idempotency Depth

Requirement: SHOULD | Weight: Standard (×1)

Whether the payment API provides deep idempotency beyond basic key acceptance, including parameter validation, key persistence windows, and concurrent request serialization. Agents retry failed requests frequently; without robust idempotency, retries create duplicate charges.

Score	Description
0. Failing	No idempotency support; retried requests create duplicate charges
1. Basic	Idempotency key header accepted; duplicate requests return cached response
2. Good	Key acceptance + documented persistence window (e.g., 24 hours) + concurrent request serialization
3. Excellent	Parameter validation (same key + different params → error/409), documented key lifetime, concurrent locking, idempotency across all POST/write endpoints

`PM2`Test Simulation Fidelity

Requirement: SHOULD | Weight: Standard (×1)

How comprehensively the platform simulates real payment scenarios in test mode, including decline codes, dispute flows, subscription lifecycle, and webhook events. Agents cannot safely learn payment integration on live data.

Score	Description
0. Failing	No test mode; or test mode limited to basic success/fail with no scenario simulation
1. Basic	Test/sandbox environment with key separation; basic test card numbers for success and generic decline
2. Good	Multiple test cards covering specific decline codes and card brands; webhook forwarding/simulation; isolated test data
3. Excellent	30+ test cards with specific scenarios; Test Clocks API for time-dependent flows (subscriptions, trials); dispute/refund simulation; CLI event triggering and replay

`PM3`Compliance Automation

Requirement: MAY | Weight: Standard (×1)

Whether the platform automates regulatory compliance burdens (tax calculation, PCI scope reduction, 3DS/SCA flows) so agents don't need jurisdiction-specific knowledge. An agent creating a payment flow should not need to understand VAT rules for 200 countries.

Score	Description
0. Failing	No compliance automation; agent must manually implement tax calculation, PCI handling, and 3DS flows
1. Basic	Hosted checkout or client-side tokenization reduces PCI scope; basic 3DS support via redirects
2. Good	Built-in tax engine (enable via API); automatic 3DS/SCA handling with machine-readable `requires_action` status; PCI scope fully eliminated via hosted flows
3. Excellent	Merchant of Record model (platform handles all tax, compliance, remittance); or built-in tax engine covering 200+ markets with threshold monitoring and VAT ID validation

`PM4`Currency & Amount Safety

Requirement: SHOULD | Weight: Standard (×1)

Whether the API prevents currency-related agent errors through clear unit documentation, smallest-unit enforcement, zero-decimal currency handling, and validation of ambiguous amounts. Currency math errors are among the highest-impact agent errors in payment systems.

Score	Description
0. Failing	Ambiguous amount units (unclear if cents or dollars); no zero-decimal currency handling; no minimum amount enforcement
1. Basic	Documentation states amounts are in smallest currency unit; minimum charge amount enforced
2. Good	Explicit unit in API responses; zero-decimal currencies (JPY) and three-decimal currencies (BHD) documented; amount validation with clear error messages
3. Excellent	Currency-aware validation rejecting ambiguous amounts; explicit decimal count per currency in API metadata; auth-capture pattern support for human review before charge
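
As a minimal sketch of currency-aware validation, the function below converts a decimal string to an integer amount in the smallest currency unit and rejects amounts with more decimal places than the currency allows. The function name `to_minor_units` and the small exponent table are illustrative assumptions; a real API would carry the full ISO 4217 list in its metadata.

```python
from decimal import Decimal

# Illustrative subset of ISO 4217 minor-unit exponents (assumption: a real
# service would load the complete table).
MINOR_UNIT_EXPONENT = {"USD": 2, "EUR": 2, "JPY": 0, "BHD": 3}


def to_minor_units(amount: str, currency: str) -> int:
    """Convert a decimal string such as "10.50" to an integer minor-unit amount."""
    code = currency.upper()
    if code not in MINOR_UNIT_EXPONENT:
        raise ValueError(f"unknown currency: {currency!r}")
    exponent = MINOR_UNIT_EXPONENT[code]
    value = Decimal(amount)
    if not value.is_finite() or value <= 0:
        raise ValueError("amount must be a positive, finite number")
    scaled = value * (Decimal(10) ** exponent)
    # Reject amounts that cannot be represented exactly in this currency.
    if scaled != scaled.to_integral_value():
        raise ValueError(
            f"{amount} has more than {exponent} decimal places for {code}"
        )
    return int(scaled)
```

Under these assumptions, `to_minor_units("10.50", "USD")` yields 1050 and `to_minor_units("500", "JPY")` yields 500, while `"500.5"` in JPY is rejected rather than silently rounded.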

9.2 Module: Communications

Applies to: Email, SMS, messaging, notification platforms

ID	Criterion	Req	Weight
CM1	Irreversibility Safeguards	SHOULD	Standard
CM2	Delivery Verification	SHOULD	Standard
CM3	Webhook/Event Infrastructure	SHOULD	Standard

`CM1`Irreversibility Safeguards

Requirement: SHOULD | Weight: Standard (×1)

Whether the platform provides safety mechanisms to prevent agents from sending irreversible communications without review, including sandbox/test modes, draft-then-send patterns, batch limits, and scheduled send with cancellation. Sent messages cannot be recalled.

Score	Description
0. Failing	No sandbox mode; no batch limits; no draft/preview capability; agent can send unlimited messages immediately
1. Basic	Test/sandbox mode available (messages validated but not delivered); basic rate limiting on outbound sends
2. Good	Sandbox mode + batch send limits (≤1,000 per call) + rate limiting; draft/preview API or scheduled send with cancellation window
3. Excellent	Sandbox mode validating full request format; per-second rate limits as safety brakes; draft-then-send pattern with human approval gate; scheduled send with cancellation; loop prevention circuit breaker

`CM2`Delivery Verification

Requirement: SHOULD | Weight: Standard (×1)

Whether the platform provides structured, machine-readable delivery status tracking, including bounce categorization (hard/soft), complaint tracking, and suppression list management. Agents need programmatic feedback to know if messages were actually delivered.

Score	Description
0. Failing	No delivery status feedback; fire-and-forget sending with no bounce or complaint data
1. Basic	Basic delivery/bounce webhooks; suppression list exists but is not API-accessible
2. Good	Structured delivery receipts (delivered/bounced/complained); bounce categorization (hard/soft); API-accessible suppression lists; unsubscribe handling
3. Excellent	Full event lifecycle (processed → delivered → opened → clicked → unsubscribed → complained); automatic suppression management; per-recipient status tracking; bounce type classification with machine-readable codes
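
To illustrate why machine-readable bounce categories matter, the sketch below shows how an agent might derive a suppression set from structured delivery events. The event fields (`recipient`, `type`, `bounce_type`) and the soft-bounce threshold are hypothetical and not taken from any specific provider.

```python
from collections import Counter


def update_suppressions(events, soft_bounce_limit=3):
    """Return the set of recipients that should no longer be mailed."""
    suppressed = set()
    soft_counts = Counter()
    for event in events:
        recipient = event["recipient"]
        kind = event["type"]
        if kind == "complained":
            suppressed.add(recipient)
        elif kind == "bounced":
            if event.get("bounce_type") == "hard":
                # Hard bounces are permanent failures: suppress immediately.
                suppressed.add(recipient)
            else:
                # Soft bounces are transient: suppress only after repeated failures.
                soft_counts[recipient] += 1
                if soft_counts[recipient] >= soft_bounce_limit:
                    suppressed.add(recipient)
        elif kind == "delivered":
            soft_counts[recipient] = 0  # a success resets the soft-bounce streak
    return suppressed
```

Without the hard/soft distinction in the event payload, an agent would have to either suppress every bounced address or keep retrying permanently invalid ones.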

`CM3`Webhook/Event Infrastructure

Requirement: SHOULD | Weight: Standard (×1)

Whether the platform supports programmatic webhook configuration, cryptographic signature verification, event replay, and structured event payloads. Agents managing communication workflows need reliable, verifiable event delivery, not dashboard-only webhook setup.

Score	Description
0. Failing	No webhook support; or webhooks require dashboard-only configuration with no signature verification
1. Basic	Webhook URLs configurable via API; events delivered as structured JSON payloads
2. Good	API-managed webhooks + cryptographic signature verification (HMAC or ECDSA); standard event types across the delivery lifecycle
3. Excellent	Full CRUD webhook management via API; signature verification; event replay capability; batched event delivery; per-stream webhook URLs; inbound message processing via webhooks
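
Signature verification on the receiving side can be sketched with the Python standard library. The signing scheme here (HMAC-SHA256 over `timestamp.body`, with a five-minute tolerance) is an assumption chosen for illustration; each platform documents its own header names and message layout.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, timestamp: str, body: bytes) -> str:
    """Hex HMAC-SHA256 over "<timestamp>.<body>" (hypothetical layout)."""
    message = timestamp.encode("utf-8") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_webhook(secret, timestamp, body, signature, now, tolerance_seconds=300):
    """Return True only for a fresh, correctly signed event."""
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    # Reject stale events to limit replay of captured requests.
    if abs(now - sent_at) > tolerance_seconds:
        return False
    expected = sign_payload(secret, timestamp, body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

The handler should verify against the raw request bytes before parsing JSON, since re-serializing the payload can change the bytes that were signed.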

9.3 Module: Databases

Applies to: Databases, data platforms, ORMs

ID	Criterion	Req	Weight
DB1	Safe Experimentation	SHOULD	Standard
DB2	Schema Introspection Quality	SHOULD	Standard
DB3	Query Interface Safety	SHOULD	Standard
DB4	Connection Management	MAY	Standard

`DB1`Safe Experimentation

Requirement: SHOULD | Weight: Standard (×1)

Whether the database provides mechanisms for agents to experiment without risking production data, including branching, read-only replicas, point-in-time recovery, and copy-on-write environments. Agents occasionally issue destructive operations in error. The database must render those operations reversible.

Score	Description
0. Failing	No branching, snapshots, or recovery mechanism; destructive operations are permanent
1. Basic	Point-in-time recovery (PITR) available; manual backup/restore process
2. Good	Read-only replicas available; PITR with reasonable granularity; snapshot/clone capability (minutes to create)
3. Excellent	Instant copy-on-write branching (<1s creation); branch reset to parent state; schema-only and full-data branch modes; PITR with fine granularity

`DB2`Schema Introspection Quality

Requirement: SHOULD | Weight: Standard (×1)

Whether the database exposes machine-readable schema metadata, including table/column types, relationships, constraints, and semantic descriptions. Agents generating SQL require accurate schema context, but full schema exports are prohibitively large for direct LLM context injection.

Score	Description
0. Failing	No programmatic schema discovery; agent must guess table structure or rely on documentation alone
1. Basic	Standard schema discovery (e.g., `information_schema`, `SHOW TABLES`); table and column names with types exposed
2. Good	Full schema with foreign key/relationship metadata; constraint enumeration; schema accessible via HTTP API (not just SQL)
3. Excellent	Semantic catalog with natural-language column/table descriptions (e.g., `COMMENT ON`); token-efficient schema representation; schema caching with DDL-change invalidation
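
A token-efficient schema representation might look like the sketch below, which uses SQLite only because it ships with Python. The one-line-per-table output format and the function name `compact_schema` are assumptions for illustration; other databases would read the same facts from `information_schema`.

```python
import sqlite3


def compact_schema(conn):
    """Render tables, column types, and foreign keys as a few short lines."""
    lines = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' "
        "AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        columns = []
        # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
        for _cid, name, col_type, notnull, _default, pk in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            flags = (" pk" if pk else "") + (" not null" if notnull else "")
            columns.append(f"{name} {col_type}{flags}")
        lines.append(f"{table}({', '.join(columns)})")
        # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
        for row in conn.execute(f'PRAGMA foreign_key_list("{table}")'):
            lines.append(f"  {table}.{row[3]} -> {row[2]}.{row[4]}")
    return "\n".join(lines)
```

Relationship lines such as `orders.customer_id -> customers.id` give an agent the join paths it needs without a full DDL dump.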

`DB3`Query Interface Safety

Requirement: SHOULD | Weight: Standard (×1)

Whether the database enforces safe query patterns, including parameterized queries, row-level security, and query validation before execution, to guard against the high error rate of agent-generated SQL. Agent-generated SQL fails at higher rates than human-written SQL in measured studies.

Score	Description
0. Failing	Raw SQL string concatenation accepted; no parameterized query enforcement; no row-level security
1. Basic	Parameterized queries supported; basic SQL injection prevention
2. Good	Parameterized queries enforced by default; row-level security (RLS) available; query explain/validation before execution
3. Excellent	RLS enabled by default on new tables; query validation with cost estimation; read-only query mode for exploration; guardrails against broad `DELETE`/`UPDATE` without `WHERE` clauses
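
Two of the safeguards above, parameter binding and a guardrail against unbounded writes, can be sketched client-side with `sqlite3`. The wrapper name `guarded_execute` is hypothetical, and the regular-expression check is a deliberately simple heuristic (it does not parse SQL, so a `WHERE` inside a string literal would fool it). A database scoring at level 3 would enforce this server-side.

```python
import re
import sqlite3

_WRITE_STATEMENT = re.compile(r"^\s*(DELETE|UPDATE)\b", re.IGNORECASE)
_HAS_WHERE = re.compile(r"\bWHERE\b", re.IGNORECASE)


class UnsafeStatement(Exception):
    """Raised for a DELETE or UPDATE that would touch every row."""


def guarded_execute(conn, sql, params=()):
    """Run a statement with bound parameters, refusing unbounded writes."""
    if _WRITE_STATEMENT.match(sql) and not _HAS_WHERE.search(sql):
        raise UnsafeStatement(f"refusing statement without WHERE: {sql!r}")
    # Values travel separately from the SQL text, so they are never
    # interpreted as SQL.
    return conn.execute(sql, params)
```

With bound parameters, a hostile value such as `x'); DROP TABLE users; --` is stored as plain text rather than executed.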

`DB4`Connection Management

Requirement: MAY | Weight: Standard (×1)

Whether the database provides HTTP/REST access, managed connection pooling, and edge-compatible drivers. Agents running in serverless and edge environments (Vercel Edge, Cloudflare Workers) often cannot open or hold raw TCP connections, so HTTP-based access is frequently the only viable path.

Score	Description
0. Failing	TCP-only access; no connection pooling; no serverless-compatible drivers
1. Basic	Managed connection pooling available; standard database drivers with connection management
2. Good	HTTP/REST API available alongside TCP; serverless-compatible drivers; connection pooling with scale-to-zero
3. Excellent	Auto-generated HTTP/REST API (e.g., PostgREST); WebSocket support for multi-statement transactions; edge-compatible drivers; scale-to-zero with sub-second cold starts

9.4 Module: Hosting & Infrastructure

Applies to: Cloud platforms, PaaS, serverless, container services

ID	Criterion	Req	Weight
HI1	Cost Guardrails	SHOULD	Standard
HI2	Deployment Lifecycle Completeness	SHOULD	Standard
HI3	Preview/Staging Deployments	MAY	Standard

`HI1`Cost Guardrails

Requirement: SHOULD | Weight: Standard (×1)

Whether the platform provides spending limits, auto-stop/scale-down, cost estimation, and usage tracking that agents can use programmatically. Agents do not reliably model the cost impact of provisioning decisions. Without guardrails, they may allocate resources and fail to release them.

Score	Description
0. Failing	No spending limits or cost controls; no usage tracking API; API defaults more permissive than console defaults
1. Basic	Usage-based pricing with basic spending alerts; manual cost controls available via dashboard
2. Good	Spending limits configurable via API; auto-stop for idle resources; usage tracking API; cost alerts with configurable thresholds
3. Excellent	Cost estimation before deployment; per-project spending caps via API; auto-scale-down to zero when idle; real-time cost tracking; API defaults match or are more restrictive than console defaults

`HI2`Deployment Lifecycle Completeness

Requirement: SHOULD | Weight: Standard (×1)

Whether the full deployment lifecycle (build, deploy, rollback, scale, log access, and environment management) is available via API, CLI, or MCP. An agent that can deploy but cannot roll back, or that can deploy but cannot access logs, has an operationally unsafe capability gap.

Score	Description
0. Failing	Dashboard-only deployment; no API or CLI for triggering deploys or reading logs
1. Basic	Deploy and status check available via API/CLI; log access available but limited
2. Good	Build, deploy, log access, and environment variable management via API; rollback available (redeploy previous version)
3. Excellent	Full lifecycle via API/CLI/MCP: deploy, rollback, scale, streaming logs, environment management; MCP server covering read and write operations across the lifecycle

`HI3`Preview/Staging Deployments

Requirement: MAY | Weight: Standard (×1)

Whether the platform supports creating isolated preview/staging environments via API, including branch deployments, ephemeral environments, and automatic cleanup. Agents deploying directly to production without a preview step risk failures that are difficult to recover from.

Score	Description
0. Failing	No preview or staging environment support; all deployments go directly to production
1. Basic	Manual staging environment available; preview deployments require dashboard configuration
2. Good	Preview deployments creatable via API; branch-based deployments; rollback to previous deployment via API
3. Excellent	Automatic preview deployment per branch/PR via API; ephemeral environments with automatic cleanup; instant rollback by promoting previous deployment; progressive rollout support (canary/blue-green)

9.5 Module: Auth Providers

Applies to: Identity/authentication platforms (Auth0, Clerk, Firebase Auth, etc.)

ID	Criterion	Req	Weight
AP1	Agent-as-End-User Support	SHOULD	Standard
AP2	Social/External Connection API	SHOULD	Standard
AP3	Token Architecture Transparency	MAY	Standard

`AP1`Agent-as-End-User Support

Requirement: SHOULD | Weight: Standard (×1)

Whether the auth platform supports flows where the end user is an agent, not a human with a browser. Standard OAuth redirects, approval interfaces, and email-based verification do not function when the end user is a non-interactive agent. CIBA, Device Flow, Client Credentials, and dedicated agent identity types address this gap.

Score	Description
0. Failing	All auth flows require browser-based interaction (redirects, consent screens); no machine-to-machine support
1. Basic	Client Credentials grant supported for M2M authentication; basic service account support
2. Good	Client Credentials + Device Flow or CIBA for async human approval; token vault or credential delegation for agents acting on behalf of users
3. Excellent	Dedicated agent identity type (not retrofitted service accounts); credential vault with 35+ integrations; async authorization (CIBA) with push notification approval; scoped, time-bounded agent credentials with full audit trail
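
The Client Credentials grant named at score 1 is defined in RFC 6749 and needs no browser. The sketch below only builds the token request so that it can be inspected without network access. The token URL, client ID, and scope shown in the usage note are placeholders, and the helper name `build_token_request` is hypothetical.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(token_url, client_id, client_secret, scope=None):
    """Build (but do not send) an OAuth 2.0 Client Credentials token request."""
    form = {"grant_type": "client_credentials"}
    if scope:
        form["scope"] = scope
    # Client credentials go in an HTTP Basic header, form-encoded first.
    pair = (
        f"{urllib.parse.quote_plus(client_id)}:"
        f"{urllib.parse.quote_plus(client_secret)}"
    )
    basic = base64.b64encode(pair.encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        token_url,
        data=urllib.parse.urlencode(form).encode("ascii"),
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )
```

An agent would pass the result to `urllib.request.urlopen` and read `access_token` from the JSON response, for example after `build_token_request("https://auth.example.com/oauth/token", "agent-1", "s3cret", scope="read:invoices")`. No redirect, consent screen, or email verification is involved.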

`AP2`Social/External Connection API

Requirement: SHOULD | Weight: Standard (×1)

Whether social/OAuth provider connections, redirect URIs, email templates, and session settings can be configured entirely via API, without requiring dashboard interaction. An agent bootstrapping auth for a new project must be able to complete setup programmatically.

Score	Description
0. Failing	Social connections and auth configuration require dashboard-only setup; no Management API
1. Basic	Core auth settings configurable via API; some provider setup (e.g., social connections) still requires dashboard
2. Good	Social connections configurable via API for most providers; email templates accessible via API; redirect URI management via API
3. Excellent	All configuration API-driven (social providers, email templates, branding, custom domains); 60+ social providers configurable via API; dynamic client registration support

`AP3`Token Architecture Transparency

Requirement: MAY | Weight: Standard (×1)

Whether the platform clearly documents delegation chains, token lifetimes, refresh semantics, and trust boundaries, and whether it supports emerging standards for agent-to-app authorization. Opaque token architectures prevent agents from reasoning about their own permissions and capabilities.

Score	Description
0. Failing	Opaque token architecture; no documentation of token lifetimes, refresh semantics, or delegation chains
1. Basic	Token lifetimes and refresh semantics documented; basic scope documentation
2. Good	Delegation chain documentation; Rich Authorization Requests (RAR) support; fine-grained authorization (FGA); credential rotation via API
3. Excellent	Support for agent-to-app protocols (XAA or equivalent); per-action authorization logging; dual-secret rotation without downtime; brokered credentials preventing LLM token exposure; EU AI Act-ready audit trail

9.6 Module: Frameworks & Libraries

Applies to: Web frameworks, ORMs, UI libraries, build tools

ID	Criterion	Req	Weight
FL1	Type System Quality	SHOULD	Standard
FL2	Scaffolding & Code Generation	MAY	Standard
FL3	Configuration Validation	SHOULD	Standard

`FL1`Type System Quality

Requirement: SHOULD | Weight: Standard (×1)

Whether the framework provides strong, expressive types (TypeScript types, Python type hints, or equivalent) that constrain agent-generated code at compile time. Type systems provide the tightest feedback loop for agents: milliseconds to detect errors versus seconds or minutes for runtime failures.

Score	Description
0. Failing	No type definitions; untyped JavaScript, untyped Python, or equivalent; agents get no compile-time feedback
1. Basic	Type definitions available (e.g., `@types/` package, basic type hints); core API surface typed
2. Good	Comprehensive types across full API surface; generated types from schema (e.g., Prisma, GraphQL codegen); type inference support reducing annotation burden
3. Excellent	Branded/nominal types preventing ID confusion (e.g., `BuildingID` vs. `CustomerID`); generated types with schema-first design; types covering edge cases and error states; type-check performance stable as schema grows
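
In Python, `typing.NewType` gives an effect comparable to the branded types mentioned at score 3. The sketch below reuses the `BuildingID` and `CustomerID` names from the rubric; the function `building_label` is hypothetical. The distinction is enforced by static checkers such as mypy or pyright, not at runtime.

```python
from typing import NewType

# Distinct static types that are both plain strings at runtime.
BuildingID = NewType("BuildingID", str)
CustomerID = NewType("CustomerID", str)


def building_label(building_id: BuildingID) -> str:
    return f"building:{building_id}"


building = BuildingID("b_7")
customer = CustomerID("c_42")
label = building_label(building)

# A static type checker rejects the next call because CustomerID is not
# BuildingID, even though both wrap str:
# building_label(customer)
```

The mix-up is caught during type checking, before any code runs, which is the fast feedback loop this criterion rewards.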

`FL2`Scaffolding & Code Generation

Requirement: MAY | Weight: Standard (×1)

Whether the framework provides CLI generators, project templates, and code scaffolding that work non-interactively. Agents benefit from scaffolding over hand-constructing project structure, but scaffolding tools that depend on interactive prompts (arrow-key menus, confirmation dialogs) are inaccessible to agents.

Score	Description
0. Failing	No scaffolding tools; or scaffolding requires interactive prompts with no CLI flag bypass
1. Basic	Project scaffolding CLI available; can generate basic project structure with default options via flags
2. Good	Project + component/module generators; templates for common patterns; all prompts bypassable via CLI flags
3. Excellent	Full non-interactive scaffolding with `--yes`/`--defaults` flags; generates project-specific configuration (e.g., `AGENTS.md`, type definitions); template library covering common patterns; generator output is immediately buildable/runnable

`FL3`Configuration Validation

Requirement: SHOULD | Weight: Standard (×1)

Whether the framework validates configuration files with actionable error messages and provides validation as a standalone command (not just at runtime). Agents generate configuration frequently and require immediate feedback on misconfiguration; deferred failures at application startup or runtime are operationally costly.

Score	Description
0. Failing	No configuration validation; silent misconfiguration; runtime crashes on bad config with unhelpful errors
1. Basic	Runtime validation with error messages on misconfiguration; configuration file format documented
2. Good	JSON Schema for configuration files enabling editor validation; actionable error messages with suggested fixes; validation runs at startup before executing
3. Excellent	Standalone validation command (`lint`, `check`, `validate`) runnable without starting the application; JSON Schema published for IDE/agent integration; error messages include specific fix suggestions; type-safe configuration with compile-time checking
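
A standalone validation step can be as small as the sketch below. The configuration keys (`port`, `log_level`), the allowed values, and the function names are all hypothetical. The point is the shape: validation returns actionable messages and an exit code without starting the application.

```python
import json

ALLOWED_LOG_LEVELS = ("debug", "info", "warn", "error")
KNOWN_KEYS = {"port", "log_level"}


def validate_config(config):
    """Return a list of actionable error strings; an empty list means valid."""
    errors = []
    port = config.get("port")
    if isinstance(port, bool) or not isinstance(port, int):
        errors.append(
            f"port: expected an integer, got {port!r}; use a number such as 8080"
        )
    elif not 1 <= port <= 65535:
        errors.append(f"port: {port} is out of range; choose a value from 1 to 65535")
    level = config.get("log_level", "info")
    if level not in ALLOWED_LOG_LEVELS:
        errors.append(
            f"log_level: {level!r} is not allowed; "
            f"use one of {', '.join(ALLOWED_LOG_LEVELS)}"
        )
    for key in sorted(set(config) - KNOWN_KEYS):
        errors.append(f"{key}: unknown key; remove it or check for a typo")
    return errors


def check_file(path):
    """Validate a JSON config file (assumed to hold an object); return an exit code."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    errors = validate_config(config)
    for message in errors:
        print(message)
    return 1 if errors else 0
```

Each message names the offending key and suggests a fix, so an agent can correct the file and re-run the check in one step.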

Appendix A: Criteria Quick Reference

Base Standard (15 criteria)

ID	Criterion	Group	Req	Weight
B1	Machine-Readable Documentation Formats	Documentation & Usability	SHOULD	Critical
B2	Code Example Coverage & Quality	Documentation & Usability	SHOULD	Standard
B3	Documentation Structure & Self-Containment	Documentation & Usability	SHOULD	Standard
B4	Documentation Accuracy & Synchronization	Documentation & Usability	SHOULD	Standard
B5	Getting Started Completeness	Documentation & Usability	SHOULD	Standard
B6	Changelog & Migration Guidance	Documentation & Usability	MAY	Standard
B7	Installation & Configuration Simplicity	Documentation & Usability	SHOULD	Standard
B8	Supply Chain Integrity	Safety	SHOULD	Standard
B9	Vulnerability Disclosure & Security Contact	Safety	SHOULD	Standard
B10	Project Sustainability	Lifecycle	SHOULD	Critical
B11	Maintenance Health	Lifecycle	SHOULD	Standard
B12	Semver Adherence & Version Stability	Lifecycle	SHOULD	Standard
B13	Governance & Continuity	Lifecycle	MAY	Standard
B14	Security Track Record	Lifecycle	SHOULD	Standard
B15	Terms & Licensing Stability	Lifecycle	SHOULD	Standard

Module: Programmatic Interface (16 criteria)

ID	Criterion	Req	Weight
PI1	Interface Reference Completeness	MUST	Critical
PI2	Tool/Endpoint Description Quality	MUST	Critical
PI3	Tool Count & Surface Area Management	SHOULD	Critical
PI4	Input Schema Design	SHOULD	Critical
PI5	Output Quality & Token Efficiency	SHOULD	Standard
PI6	Response Envelope Consistency	SHOULD	Standard
PI7	Naming & Namespacing	SHOULD	Standard
PI8	Behavioral Metadata & Annotations	SHOULD	Standard
PI9	MCP Implementation Quality	MAY	Standard
PI10	Programmatic Setup / TTFC	SHOULD	Critical
PI11	API Workflow Coverage	SHOULD	Standard
PI12	Versioning & API Stability	SHOULD	Standard
PI13	SDK Availability & Quality	SHOULD	Standard
PI14	Agent Protocol Availability	SHOULD	Standard
PI15	Input Sanitization & Injection Resistance	MUST	Standard
PI16	Prompt Injection Resistance	SHOULD	Standard

Module: Network Service (8 criteria)

ID	Criterion	Req	Weight
NS1	Error Response Quality & Structure	MUST	Critical
NS2	Rate Limit Communication	SHOULD	Critical
NS3	Health & Status Communication	SHOULD	Standard
NS4	Audit & Observability	SHOULD	Standard
NS5	Test/Sandbox Environment Support	SHOULD	Standard
NS6	Environment Separation	SHOULD	Standard
NS7	Asynchronous Operation Support	MAY	Standard
NS8	Data Portability & Pricing Transparency	MAY	Standard

Module: Write Operations (4 criteria)

ID	Criterion	Req	Weight
WO1	Destructive Operation Safety	SHOULD	Critical
WO2	Dry-Run / Validation Capability	MAY	Standard
WO3	Idempotency & Safe Retry Support	SHOULD	Standard
WO4	Workflow Error Communication	SHOULD	Standard

Module: Authentication (4 criteria)

ID	Criterion	Req	Weight
AU1	Non-Interactive Authentication Methods	MUST	Critical
AU2	Permission Granularity	SHOULD	Standard
AU3	Credential Lifecycle Management	SHOULD	Standard
AU4	Agent Identity Support	MAY	Standard

Module: CLI (4 criteria)

ID	Criterion	Req	Weight
CLI1	Non-Interactive Execution	SHOULD	Standard
CLI2	Structured Output Mode	SHOULD	Standard
CLI3	Cross-Platform Consistency	SHOULD	Standard
CLI4	Configuration Format Safety	MAY	Standard

⚑ = Open-source/commercial split rubric.

Domain Module: Payments & Financial (4 criteria)

ID	Criterion	Req	Weight
PM1	Idempotency Depth	SHOULD	Standard
PM2	Test Simulation Fidelity	SHOULD	Standard
PM3	Compliance Automation	MAY	Standard
PM4	Currency & Amount Safety	SHOULD	Standard

Domain Module: Communications (3 criteria)

ID	Criterion	Req	Weight
CM1	Irreversibility Safeguards	SHOULD	Standard
CM2	Delivery Verification	SHOULD	Standard
CM3	Webhook/Event Infrastructure	SHOULD	Standard

Domain Module: Databases (4 criteria)

ID	Criterion	Req	Weight
DB1	Safe Experimentation	SHOULD	Standard
DB2	Schema Introspection Quality	SHOULD	Standard
DB3	Query Interface Safety	SHOULD	Standard
DB4	Connection Management	MAY	Standard

Domain Module: Hosting & Infrastructure (3 criteria)

ID	Criterion	Req	Weight
HI1	Cost Guardrails	SHOULD	Standard
HI2	Deployment Lifecycle Completeness	SHOULD	Standard
HI3	Preview/Staging Deployments	MAY	Standard

Domain Module: Auth Providers (3 criteria)

ID	Criterion	Req	Weight
AP1	Agent-as-End-User Support	SHOULD	Standard
AP2	Social/External Connection API	SHOULD	Standard
AP3	Token Architecture Transparency	MAY	Standard

Domain Module: Frameworks & Libraries (3 criteria)

ID	Criterion	Req	Weight
FL1	Type System Quality	SHOULD	Standard
FL2	Scaffolding & Code Generation	MAY	Standard
FL3	Configuration Validation	SHOULD	Standard

Totals: 15 base + 16 interface + 8 network + 4 write + 4 auth + 4 CLI = 51 base + complexity criteria | 20 domain-specific across 6 modules (4+3+4+3+3+3) | 5 MUST gates (all in complexity modules) | A tool may trigger multiple domain modules
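
The totals can be re-derived from the per-module counts in this appendix. The sketch below is a bookkeeping aid only: it counts activated criteria and does not apply weights or compute scores. The function name `criteria_count` is hypothetical.

```python
BASE_CRITERIA = 15

COMPLEXITY_MODULES = {
    "Programmatic Interface": 16,
    "Network Service": 8,
    "Write Operations": 4,
    "Authentication": 4,
    "CLI": 4,
}

DOMAIN_MODULES = {
    "Payments & Financial": 4,
    "Communications": 3,
    "Databases": 4,
    "Hosting & Infrastructure": 3,
    "Auth Providers": 3,
    "Frameworks & Libraries": 3,
}


def criteria_count(complexity=(), domains=()):
    """Number of criteria in an evaluation for the given activated modules."""
    return (
        BASE_CRITERIA
        + sum(COMPLEXITY_MODULES[name] for name in complexity)
        + sum(DOMAIN_MODULES[name] for name in domains)
    )


all_base_and_complexity = criteria_count(complexity=COMPLEXITY_MODULES)
all_domain = sum(DOMAIN_MODULES.values())
```

Here `all_base_and_complexity` is 51 and `all_domain` is 20, matching the totals line. A tool that triggers Databases, Auth Providers, and Hosting & Infrastructure adds 10 domain criteria to whatever base and complexity criteria apply to it.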
