Building for Agents v0.9

tl;dr

Your tool has a new class of user that reads structured text, calls your API, and moves on. This guide covers eight areas where developer tools succeed or fail with AI agents: documentation, API design, error handling, authentication, safety, CLI, health signals, and sandboxing. Each section is grounded in published research and references the relevant Zaira Standard criteria.

Short on time? Jump to "Where to start" at the end of this guide.

1. The shift

Your tool has a new class of user. It does not read your landing page. It does not watch your conference talk. It does not browse your docs site looking for inspiration. It reads structured text, calls your API, parses the response, and moves on. If it fails, it retries with a different approach or picks a different tool entirely. It has no patience, no institutional memory, and no ability to click through a dashboard to fix a misconfiguration.

This is not a future state. Neon reports that over 80% of databases provisioned on their platform are now created by AI agents. Anthropic's Claude Code, Cursor, and similar coding agents make tool selection decisions mid-workflow, choosing databases, auth providers, hosting platforms, and payment processors based on whatever structured information they can find. The quality of that information, and the quality of your tool's interface, determines whether agents choose you, use you successfully, or abandon you after the first error.

This guide covers what agent-ready tools look like in practice. Each section addresses a surface area of your product: documentation, API design, error handling, authentication, safety, CLI, health signals, and testing infrastructure. The practices are grounded in published research and in the evaluation criteria defined by the Zaira Standard (referenced by criterion ID where relevant, e.g. B1, PI3). But the goal is not to teach you how to pass an evaluation. The goal is to help you build a tool that agents can actually use.

2. Documentation is an interface, not a manual

The most common assumption tool teams make about agent consumption is that agents read docs the way humans do: start at the top, scan for relevance, follow links. They do not. An agent encountering your tool for the first time needs machine-readable, structured, self-contained text that answers specific questions. It pulls a chunk of your docs into its context window, generates a plan, and executes. If the chunk it pulled is wrong, incomplete, or too large, it fails.

More documentation is not better. Research on long-context LLM performance shows accuracy degrades by 30% or more when relevant information sits in the middle of a large context window (Liu et al., "Lost in the Middle"). Dumping your entire API reference into an agent's context actively degrades performance. The goal is the right information, structured for extraction, not maximum volume.

Machine-readable formats

Three formats have emerged as the primary channels for agent documentation consumption:

llms.txt is a plaintext file at your domain root that provides a structured overview of your tool: what it does, how to authenticate, key endpoints, and links to detailed docs. Over 844,000 websites have implemented it as of early 2026. It costs nothing to create, and it is 2.5x cheaper for agents to consume than MCP-based documentation delivery. If you do one thing after reading this guide, add an llms.txt file.

AGENTS.md is a repository-level file that gives coding agents the context they need to work with your codebase or SDK. Over 60,000 projects have adopted it, with early data showing 35-55% fewer agent-generated bugs in projects that include one. Where llms.txt describes your tool to agents evaluating whether to use it, AGENTS.md describes your codebase to agents already working inside it.

Content negotiation (Accept: text/markdown) lets agents request documentation in a format they can parse directly, without scraping HTML. This is the most technically involved option, but it means agents can pull exactly the docs they need from your existing documentation infrastructure.

None of these replace your human-facing docs. They sit alongside them. The cost of maintaining all three is low because the content overlaps with what you already publish. B1
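As a rough illustration of the content negotiation option, the sketch below shows how a docs server might choose between HTML and Markdown from an Accept header. The helper name pick_doc_format and the two supported types are assumptions made for this example, not part of any particular framework.

```python
def pick_doc_format(accept_header, supported=("text/html", "text/markdown")):
    """Return the supported media type the client prefers most.

    Hypothetical helper: falls back to the first supported type when the
    header is empty or names nothing the server can produce.
    """
    preferences = []
    for position, part in enumerate(accept_header.split(",")):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        preferences.append((-quality, position, media_type))

    # Highest q first; ties keep the order the client listed them in.
    for neg_quality, _, media_type in sorted(preferences):
        if neg_quality == 0:
            continue  # q=0 means "not acceptable"
        if media_type in supported:
            return media_type
        if media_type == "*/*":
            return supported[0]
    return supported[0]
```

An agent sending Accept: text/markdown gets Markdown, while a browser sending its usual HTML-first header keeps getting HTML from the same URL.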

Structure for extraction

Agents do not read linearly. They retrieve chunks. This means every section of your documentation should be independently useful when extracted from its surrounding context.

The research points to specific structural patterns that improve agent performance. Put the answer first: research on LLM citation patterns shows 44.2% of citations come from the first 30% of text on a page. Keep sections between 134 and 167 words (the empirically optimal chunk size for retrieval-augmented generation). Use tables for structured data (tables achieve a 2.3x citation rate compared to equivalent prose). Make headings descriptive enough to function as search queries.

A section titled "Authentication" is less useful than "How to authenticate with API keys." A section that opens with three paragraphs of background before stating the configuration format is less useful than one that opens with the configuration format and follows with context. B3

Examples that teach

Code examples are among the highest-value documentation assets for agents. In one benchmark, Claude 3 Haiku improved from 11% to 75% accuracy when provided with three well-chosen examples. A separate Anthropic study found tool use examples improved accuracy from 72% to 90%.

The key word is "well-chosen." Performance gains plateau around 5-6 examples. Beyond that, additional examples consume context without improving outcomes. Focus your examples on:

The minimal working case (one example that gets an agent from zero to a successful call)
Ambiguous cases (where the correct usage is not obvious from the type signature)
Error recovery (what to do when the obvious approach fails)

Placeholder data ("string", 123, foo) teaches nothing. Use realistic data that demonstrates the actual shape and constraints of your inputs and outputs. B2 B5

3. Design your API surface for a context window

When an agent uses your tool, every tool description, every schema definition, and every response payload occupies space in a finite context window. The design of your programmatic interface (REST API, SDK, or MCP server) determines how much of that window your tool consumes and how effectively agents can use what remains.

The research on this is unambiguous: less is more.

Fewer tools, better descriptions

Research across multiple benchmarks shows tool selection accuracy drops from approximately 90% at 1-30 tools to roughly 14% at 100+ tools. The degradation is steep and consistent across model families.

Block's content management system is the canonical case study. Their initial MCP server exposed 50 tools (one per REST endpoint) and achieved 31% agent success. They redesigned around 8 outcome-oriented tools and success rose to 89% (Block Engineering, 2025). The pattern that failed was the intuitive one: mirror the REST API, one tool per endpoint. The pattern that worked was designing tools around what agents actually try to accomplish.

If your API has more than 15 endpoints, do not expose them all as individual tools. Group operations by user intent. A manage_subscription tool that handles create, update, cancel, and status is more usable than four separate tools. For large catalogs, implement deferred loading or dynamic discovery, where agents start with a small set of tools and request additional ones as needed. This pattern improved accuracy from 49% to 74% while reducing context consumption by 85% (Anthropic, 2025). PI3
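The manage_subscription idea above might look something like the following sketch. The action names come from the text; the in-memory store, the parameter names, and the return shape are assumptions made so the example runs on its own.

```python
from enum import Enum


class SubscriptionAction(str, Enum):
    CREATE = "create"
    UPDATE = "update"
    CANCEL = "cancel"
    STATUS = "status"


# Hypothetical in-memory store standing in for a real billing backend.
_subscriptions = {}


def manage_subscription(action, customer_id, plan=None):
    """Single outcome-oriented tool replacing four endpoint-shaped tools."""
    action = SubscriptionAction(action)  # raises ValueError on unknown actions
    if action is SubscriptionAction.CREATE:
        _subscriptions[customer_id] = {"plan": plan, "status": "active"}
    elif action is SubscriptionAction.UPDATE:
        _subscriptions[customer_id]["plan"] = plan
    elif action is SubscriptionAction.CANCEL:
        _subscriptions[customer_id]["status"] = "canceled"
    # STATUS falls through and just reports the current record.
    return {"customer_id": customer_id, **_subscriptions[customer_id]}
```

The agent sees one tool with a constrained action field instead of four similarly named tools, which is the trade the grouping advice is aiming for. A production version would also handle unknown customers instead of letting a KeyError escape.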

Descriptions are your most important interface

Tool description quality is consistently identified across research (13+ independent studies) as a critical factor in agent tool use success. Anthropic's engineering team reported spending "more time optimizing tool descriptions than the overall prompt." Claude achieved state-of-the-art SWE-bench performance through description refinements alone.

Yet 97.1% of MCP tool descriptions have at least one quality defect (MCP Bench, 2025). Noisy or vague descriptions cause a 22-point accuracy drop.

A good tool description answers four questions: What does this tool do? When should an agent use it? When should an agent not use it? What does the output look like? The "when not to use it" guidance is particularly valuable because it prevents agents from selecting the wrong tool based on superficial name matching, which is one of the most common failure modes when multiple tools have similar names. PI2

Tight schemas, small outputs

Parameter errors are a leading failure type in agent tool use: 44% parameter value mismatch, 45% type mismatch, 19% missing parameters. Your input schema is your primary defense.

Use enums for constrained fields. Set additionalProperties: false. Keep nesting to three levels or fewer. Flattening parameter spaces improved performance by 47% in one study (Anthropic, 2025). OpenAI's strict: true mode achieves 100% schema conformance, but only works with schemas that are already tight. A loose schema cannot be made strict by the caller.
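A minimal sketch of what a tight, flat schema can look like, together with a deliberately tiny validator that returns field-level problems. The schema contents (project_id, region, read_only, and the region values) and the validate_flat helper are hypothetical; a real service would more likely rely on a full JSON Schema library.

```python
CREATE_BRANCH_SCHEMA = {
    "type": "object",
    "properties": {
        "project_id": {"type": "string"},
        "region": {"type": "string", "enum": ["us-east", "eu-west"]},
        "read_only": {"type": "boolean"},
    },
    "required": ["project_id", "region"],
    "additionalProperties": False,
}

# Only the types this flat example needs.
_PYTHON_TYPES = {"string": str, "boolean": bool}


def validate_flat(args, schema):
    """Check a flat argument dict; return a list of field-level problems."""
    problems = []
    props = schema["properties"]
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"{name}: missing required parameter")
    for name, value in args.items():
        if name not in props:
            if schema.get("additionalProperties") is False:
                problems.append(f"{name}: unknown parameter")
            continue
        spec = props[name]
        if not isinstance(value, _PYTHON_TYPES[spec["type"]]):
            problems.append(f"{name}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name}: must be one of {spec['enum']}")
    return problems
```

Each problem names the parameter and the reason, which covers the three failure types listed above: wrong values, wrong types, and missing parameters.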

On the output side, one tested MCP server averaged 557,766 tokens per response. Sixteen tools in a single benchmark exceeded 128K tokens. Context overflow accounted for 63.4% of Claude Sonnet 4 failures on SWE-bench Pro. Cloudflare's "Codemode" pattern achieved a 99.9% token reduction (from 1.17M tokens to roughly 1,000) by returning computed summaries instead of raw data.

Paginate by default. Bound response sizes. Offer a concise mode. If your API returns a list, return the first page with a cursor, not the entire dataset. PI4 PI5
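One way to sketch "first page with a cursor" is shown below. The function name list_page, the default and maximum limits, and the base64-encoded offset used as the cursor are all assumptions for illustration; production cursors are usually tied to a stable sort key, not a raw offset.

```python
import base64


def list_page(items, cursor=None, limit=50, max_limit=100):
    """Return one bounded page plus an opaque cursor for the next one."""
    # The server, not the caller, sets the ceiling on page size.
    limit = max(1, min(limit, max_limit))
    start = int(base64.urlsafe_b64decode(cursor).decode()) if cursor else 0
    page = items[start:start + limit]
    end = start + len(page)
    next_cursor = None
    if end < len(items):
        next_cursor = base64.urlsafe_b64encode(str(end).encode()).decode()
    return {
        "data": page,
        "next_cursor": next_cursor,
        "has_more": next_cursor is not None,
    }
```

An agent that only needs the first few results never pays for the rest, and one that needs everything can follow next_cursor until has_more is false.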

Name things so agents can tell them apart

Across the MCP ecosystem, 775 tools share identical names on different servers. "search" appears in 32 servers. When an agent has access to multiple tools called search, it guesses, and it guesses wrong often enough to matter.

Service-prefix your tool names: stripe_create_charge, github_list_issues, neon_create_branch. Use snake_case consistently (GPT-4o's tokenizer handles it best). Make names self-descriptive enough that an agent can infer what the tool does without reading the description. PI7

Tell agents what your tools do to the world

MCP annotations (readOnlyHint, destructiveHint, idempotentHint, openWorldHint) declare whether a tool reads data, modifies it, can be safely retried, or interacts with external systems. ChatGPT uses these annotations to decide whether to show a confirmation dialog. OpenAI's Apps SDK makes them required; "incorrect labels are a common cause of rejection."

Set all four annotations on every tool. Agents and agent runtimes use them to make safety decisions before your tool is ever called. Without annotations, a harmless read-only tool is indistinguishable from a destructive one, and cautious runtimes will treat both as destructive. PI8
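To make the effect of annotations concrete, here is a sketch of how a cautious runtime might gate tool calls. The four hint names are the MCP annotations named above; the tool names and the needs_confirmation policy are hypothetical and do not describe any specific runtime's behavior.

```python
TOOLS = {
    "acme_list_invoices": {
        "readOnlyHint": True,
        "destructiveHint": False,
        "idempotentHint": True,
        "openWorldHint": False,
    },
    "acme_delete_invoice": {
        "readOnlyHint": False,
        "destructiveHint": True,
        "idempotentHint": True,
        "openWorldHint": False,
    },
    "acme_legacy_tool": {},  # no annotations declared
}


def needs_confirmation(tool_name):
    """Hypothetical runtime policy: confirm anything not declared safe."""
    hints = TOOLS[tool_name]
    if hints.get("readOnlyHint") is True:
        return False
    # Missing hints are treated as worst case, so unannotated tools get gated.
    return hints.get("destructiveHint", True)
```

Under this policy the unannotated tool is gated exactly like the destructive one, even if it is harmless in practice.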

Defend your interface against injection

Security audits of MCP servers have found high rates of command injection and path traversal vulnerabilities. Separate research has demonstrated high attack success rates against coding agents via prompt injection through tool outputs. These are not exotic attacks. They exploit the gap between "this input came from a user" and "this input came from an LLM that may have been influenced by untrusted content."

Your primary defenses are the ones you should already have: strict input schemas with type checking and additionalProperties: false, parameterized queries (never string concatenation), and a WAF or CDN in front of your API. Beyond that, keep tool descriptions concise and self-contained (no narrative, no cross-tool references that an attacker could exploit), structure outputs with clear field boundaries so untrusted content cannot escape into control flow, and document your input validation practices.

This is a binary gate in the Zaira Standard: a tool that exposes a programmatic interface with no evidence of input sanitization cannot be certified at any tier, regardless of total score. PI15 PI16
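The parameterized-query defense can be shown in a few lines with Python's built-in sqlite3 module. The users table and the find_user helper are assumptions for the example; the point is only that the value travels as a bound parameter and never becomes part of the SQL text.

```python
import sqlite3


def find_user(conn, email):
    """Look up a user with a bound parameter instead of string concatenation."""
    # The "?" placeholder keeps the value out of the SQL text entirely.
    return conn.execute(
        "SELECT id, email FROM users WHERE email = ?", (email,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users (email) VALUES (?)", ("ada@example.com",))
```

A classic injection string such as ' OR '1'='1 is compared as a literal email address here and simply matches no rows.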

4. Errors are recovery instructions

When an agent hits an error, it has three options: retry, try something different, or give up. The quality of your error response determines which one it picks.

Research on agent error recovery shows that with unstructured error messages, recovery rates sit around 20%. With structured, actionable error responses, recovery rates reach 95%. The PALADIN framework for autonomous error recovery demonstrated 89.68% success compared to a 32.76% baseline, driven almost entirely by the quality of error information available to the agent.

What structured means

An HTML error page is invisible to an agent. A generic {"error": "Something went wrong"} tells the agent nothing useful. A structured error response gives the agent enough information to diagnose the problem and attempt recovery without human help.

At minimum, an error response needs:

A machine-readable error code (not just an HTTP status)
A human-readable message (agents use these too)
Field-level identification for validation errors (which parameter was wrong, and why)

RFC 9457 (Problem Details for HTTP APIs) defines a standard structure: type, title, status, detail, and instance, served as application/problem+json. Most major agent frameworks parse it natively.

The reference implementation is Stripe's error hierarchy: type (the broadest category), code (specific error), decline_code (payment-specific), plus doc_url linking to documentation for every error type. This three-level taxonomy lets agents make retry decisions without human intervention. At the highest tier, errors include is_retriable booleans, retry_after_seconds, and suggested alternative actions. NS1
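A sketch of an error body that combines the RFC 9457 members with the retry hints described above. The problem helper, the api.example.com URL, and the code and field member names are assumptions; RFC 9457 allows extension members like these but does not define them.

```python
import json


def problem(status, code, title, detail, *, field=None,
            is_retriable=False, retry_after_seconds=None):
    """Build an RFC 9457-style body with retry hints as extension members."""
    body = {
        "type": f"https://api.example.com/errors/{code}",  # hypothetical docs URL
        "title": title,
        "status": status,
        "detail": detail,
        "code": code,
        "is_retriable": is_retriable,
    }
    if field is not None:
        body["field"] = field  # which parameter was wrong, for validation errors
    if retry_after_seconds is not None:
        body["retry_after_seconds"] = retry_after_seconds
    return json.dumps(body)
```

With a body like this an agent can branch on code, decide whether to retry from is_retriable, and point at the offending parameter without parsing prose.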

Rate limits are errors you can prevent

Rate limiting is consistently one of the largest degradation factors in agent performance benchmarks. Limits are fine. Invisible limits are the problem. Agents operating at machine speed burn through allocations in seconds if they have no signals to pace themselves.

The fix is straightforward: include rate limit headers on every response, not just on 429s. X-RateLimit-Remaining, X-RateLimit-Limit, and X-RateLimit-Reset give agents the information they need to self-throttle. On 429 responses, include a Retry-After header with a specific duration.

Only 2.4% of MCP servers implement rate limiting at all. This means agents calling most MCP tools have no mechanism for pacing, and the tool's only defense is to reject requests after the damage is done. NS2
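From the client side, the headers above are enough for a simple pacing rule. The sketch assumes X-RateLimit-Reset carries a Unix timestamp in seconds and Retry-After carries a number of seconds; both conventions vary between APIs, and the seconds_to_wait helper is hypothetical. Headers are passed as a plain dict with exact-case keys to keep the example small.

```python
def seconds_to_wait(status, headers, now):
    """Decide how long a client should pause before its next request."""
    if status == 429 and "Retry-After" in headers:
        return float(headers["Retry-After"])
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    window = max(0.0, reset_at - now)
    if remaining <= 0:
        return window  # budget spent: wait for the window to reset
    return window / remaining  # spread what is left across the window
```

This only works if the server sends the headers on successful responses too, which is why the advice is to include them on every response and not just on 429s.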

5. Authentication without a browser

Authentication remains one of the least-solved problems in agent-tool interaction. A tool with perfect documentation, a clean API, and structured errors is completely useless to an agent that cannot authenticate. And most authentication flows assume a human sitting in front of a browser.

OAuth redirects, "Approve" buttons, CAPTCHA challenges, email verification links, and SMS-based 2FA all assume a human with a browser. Some agents can drive browsers, but the experience is fragile and not universally supported. If your tool's only authentication path requires browser interaction, you are depending on the least reliable capability in the agent stack.

The non-interactive path

At minimum, your tool needs one authentication method that agents can complete without human involvement:

API keys are the simplest path. An agent (or the human configuring it) sets an environment variable, and the agent authenticates on every request. Most tools already support this. The gap is usually in documentation and key management, not in the mechanism itself.

Client Credentials grant (OAuth 2.0) is the standard for machine-to-machine authentication. The agent presents a client ID and secret, receives a token, and uses it. No browser, no redirect, no human.

Device Flow (OAuth 2.0) handles the case where an agent needs to act on behalf of a human. The agent displays a URL and code, the human approves in their browser, and the agent receives a token. This is a one-time approval, not a per-request interaction.

One survey of 492 MCP servers found that none implemented authentication. Across MCP servers that do implement auth, only 8.5% use OAuth; 53% rely on static secrets with no rotation capability. AU1

Scope down, not up

Industry surveys report that over-privileged agents have significantly higher security incident rates than properly scoped agents. Yet a majority of organizations give agents more access than equivalent humans.

The fix is permission granularity. Read/write separation is the minimum. Per-resource scoped keys are better. Per-resource, per-operation scoping with deny-by-default for destructive operations is best. When an agent requests an operation it lacks permission for, the error should tell it exactly which scope is required. AU2
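A minimal sketch of per-resource, per-operation scoping with deny-by-default, where the denial names the scope the agent would need. The resource:operation scope format, the authorize helper, and the error fields are assumptions for this example.

```python
def authorize(granted_scopes, resource, operation):
    """Deny by default and name the exact scope that would allow the call."""
    required = f"{resource}:{operation}"
    if required in granted_scopes:
        return {"allowed": True}
    return {
        "allowed": False,
        "status": 403,
        "code": "insufficient_scope",
        "required_scope": required,
        "detail": f"This key lacks the '{required}' scope.",
    }
```

An agent holding only invoices:read that attempts a delete learns it needs invoices:delete and can ask its operator for exactly that, instead of guessing or requesting broad access.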

Credentials as infrastructure

Non-human identities outnumber human identities 144:1 in typical enterprise environments (Okta, 2025). Credential management for agents is not a nice-to-have. It is infrastructure.

Your tool should support programmatic credential rotation (not just manual key regeneration in a dashboard), refresh tokens with documented lifetimes, per-key revocation (not just "revoke all keys"), and expiry metadata so agents can proactively refresh before credentials expire. Okta's research recommends 5-15 minute token lifetimes for agents with automated refresh, a significant departure from the multi-month lifetimes common in human-oriented systems. AU3
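Expiry metadata is what makes proactive refresh possible. The sketch below assumes the token response includes an absolute expiry time; the should_refresh helper and the 60-second margin are illustrative choices, not recommendations from the research cited above.

```python
from datetime import datetime, timedelta, timezone


def should_refresh(expires_at, now=None, margin=timedelta(seconds=60)):
    """True when the token is inside the refresh margin or already expired."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= expires_at - margin
```

With short-lived tokens, an agent can call this before each request and refresh early, so it never has to recover from a mid-task 401.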

6. Make destruction difficult by default

Documented incidents from 2024-2025 include AI agents wiping production databases, deleting live records and fabricating replacement data to mask the damage, and sending duplicate communications dozens of times. These are not hypothetical scenarios.

Agents make destructive mistakes for the same reason humans do: the operation was available, and the guardrails were insufficient. The difference is that agents make mistakes faster, at higher volume, and without the gut-check moment where a human hovers over the "Delete All" button and reconsiders.

Structural prevention over confirmation dialogs

"Are you sure? (y/N)" does not work when the user is software. An agent will answer "y" because it decided to run the command before it saw the prompt. Confirmation dialogs are a human safety mechanism. Agent safety requires structural prevention.

The most effective patterns make destructive outcomes impossible through architecture, not policy:

Auth-capture separation (Stripe's model): with manual capture enabled, authorizing a payment and capturing funds are separate API calls. The agent creates the payment intent, and a human (or a separate approval flow) captures it. Because no single call both authorizes and captures, an agent cannot charge a customer in one accidental operation.

Plan-apply separation (Terraform's model): terraform plan shows what will change, terraform apply executes it. The planning step has no side effects. An agent can plan as many times as it needs, review the output, and only apply when the plan matches expectations. This pattern is, according to HashiCorp, "heavily used by automation agents."

Interface exclusion (Railway's model): Railway's MCP server intentionally omits destructive operations. You cannot delete a project through the MCP interface. The operation exists in the dashboard, but the agent-facing surface does not expose it. The most effective guardrail is the one you never have to enforce.

Beyond these architectural patterns, layered defenses add depth: read-only modes as the default, lexical blocklists for SQL operations (DROP, DELETE, TRUNCATE), soft delete instead of hard delete, and human confirmation gates for operations above a risk threshold. WO1

Dry-run and idempotency

Two supporting practices make destructive operations safer even when they cannot be structurally prevented.

Dry-run modes let agents validate a request without executing it. Google Cloud standardized this with validate_only: true across their APIs (AIP-163). Kubernetes supports server-side dry-run that exercises the full validation chain. The pattern is uncommon but high-value: agents can test their understanding of an operation before committing to it.
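A dry-run flag can be sketched as a handler that always validates and only writes when asked to. The validate_only parameter name follows the Google convention mentioned above; the create_record helper, the dict-backed store, and the validation rules are assumptions for illustration.

```python
def create_record(store, record, validate_only=False):
    """Validate always; write only when validate_only is False."""
    errors = []
    if not record.get("name"):
        errors.append("name: required")
    elif record["name"] in store:
        errors.append("name: already exists")
    if errors:
        return {"ok": False, "errors": errors}
    if not validate_only:
        store[record["name"]] = record
    return {"ok": True, "dry_run": validate_only}
```

Because the dry run goes through the same validation path as the real call, a passing dry run tells the agent something meaningful about whether the real call will succeed.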

Idempotency keys prevent duplicate side effects when agents retry failed requests. Without them, a network timeout on a payment request can result in a double charge when the agent retries. Stripe retains idempotency keys for 24 hours with parameter validation (reusing a key with different parameters returns an error rather than silently processing the new request). OpenAI's Instant Checkout requires idempotency as mandatory for all payment operations. WO2 WO3
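The idempotency behavior described above, including rejection of a reused key with different parameters, can be sketched in memory as follows. The IdempotentCharges class and its fields are hypothetical; a real implementation would persist keys, expire them after a retention window, and handle concurrent requests.

```python
class IdempotentCharges:
    """In-memory sketch; a real store would persist keys and expire them."""

    def __init__(self):
        self._seen = {}  # idempotency key -> (params, response)
        self.charges_made = 0

    def charge(self, idempotency_key, customer_id, amount_cents):
        params = (customer_id, amount_cents)
        if idempotency_key in self._seen:
            saved_params, saved_response = self._seen[idempotency_key]
            if saved_params != params:
                raise ValueError("idempotency key reused with different parameters")
            return saved_response  # replay: no second side effect
        self.charges_made += 1
        response = {
            "charge_id": f"ch_{self.charges_made}",
            "amount_cents": amount_cents,
        }
        self._seen[idempotency_key] = (params, response)
        return response
```

An agent that times out and retries with the same key gets the original response back, and the customer is charged once.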

7. Build a CLI agents can drive

Coding agents run CLI tools constantly. They install packages, run builds, execute tests, deploy services, and manage infrastructure, all through shell commands. Terminal-Bench, the largest evaluation of agent CLI usage to date, found that across 32,155+ trials, the best agent achieved only 78.4% task resolution. CLI tools are harder for agents than APIs, and most of the difficulty comes from design choices that assume a human is watching the terminal.

Non-interactive execution

Interactive prompts are a common hard blocker for agent CLI usage. The confirmation dialog problem from Section 6 applies here too, but CLIs have additional traps: pagers, editors, and TTY-dependent output. The 2019 AWS CLI pager incident broke thousands of CI pipelines when aws started piping output through less by default, requiring a TTY that automated systems did not have.

Every interactive prompt in your CLI needs a flag to bypass it: --yes, --no-input, --non-interactive, or equivalent. The best tools auto-detect non-TTY environments and suppress prompts automatically. When --json is passed, treat it as an implicit non-interactive flag (this is the Terraform pattern, where -json implies -input=false). CLI1
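A sketch of prompt handling that respects a bypass flag, treats JSON mode as non-interactive, and detects a missing TTY. The confirm helper and its parameter names are assumptions; failing closed when nobody can answer is one reasonable policy among several.

```python
import sys


def confirm(question, *, assume_yes=False, json_output=False, is_tty=None):
    """Ask a human only when one is plausibly there to answer."""
    if is_tty is None:
        is_tty = sys.stdin.isatty()
    if assume_yes:
        return True
    if json_output or not is_tty:
        # No prompt in machine contexts: fail closed rather than hang.
        raise SystemExit(f"refusing to prompt for {question!r}; pass --yes to proceed")
    return input(f"{question} (y/N) ").strip().lower() == "y"
```

The important property is that the command never blocks waiting for input an agent cannot provide: it either proceeds because the caller opted in, or exits with a message that says how to opt in.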

Structured output

Without --json or equivalent, agents parse your CLI output with regular expressions. This works until you change a column header, adjust spacing, add a color code, or localize a message. It is fragile by construction.

The fix is a structured output flag on every command that produces output. JSON is the standard. Git's --porcelain flag takes this further by providing a stability guarantee: the output format will not change across versions, giving agents a reliable parsing target. Semantic exit codes (distinct codes for distinct failure modes, not just 0 and 1) let agents diagnose failures without parsing stderr.

Watch for stream mixing. npm sends error messages to stdout. Docker has five or more documented issues where stdout and stderr are interleaved. When an agent parses stdout as JSON and gets an error message mixed in, the parse fails silently or produces garbage. Keep stderr and stdout cleanly separated. CLI2
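Stream separation and semantic exit codes can be sketched together. The exit code values, the render and emit helpers, and the error shape are hypothetical; what matters is that stdout carries only parseable results and every diagnostic goes to stderr.

```python
import json
import sys

# Hypothetical mapping of failure modes to distinct exit codes.
EXIT_OK, EXIT_USAGE, EXIT_AUTH, EXIT_NOT_FOUND = 0, 2, 3, 4


def render(result=None, error=None, exit_code=EXIT_OK):
    """Return (stream_name, line, exit_code) without touching real streams."""
    if error is not None:
        return "stderr", json.dumps({"error": error}), exit_code
    return "stdout", json.dumps(result), exit_code


def emit(result=None, error=None, exit_code=EXIT_OK):
    """Write results to stdout and diagnostics to stderr, never both."""
    stream_name, line, code = render(result, error, exit_code)
    print(line, file=sys.stderr if stream_name == "stderr" else sys.stdout)
    return code
```

A command would end with sys.exit(emit(...)), so an agent can branch on the exit code first and only read stderr when it needs the detail.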

Cross-platform consistency

Agents trained primarily on Linux and macOS generate commands that fail on Windows. sed -i has different syntax between macOS and Linux. Path separators, line endings, shell quoting, and temp directory locations all vary. Early SWE-bench Windows benchmarks showed significant performance degradation for coding agents in Windows container environments, though cross-platform support is improving.

Single static binary distribution (the Go and Rust pattern) is one of the most agent-friendly patterns for CLIs: no runtime dependencies, no PATH issues, no version conflicts. If your tool is not a single binary, test on all three major platforms in CI and document any platform-specific behavior. CLI3

Configuration format safety

YAML's whitespace sensitivity and implicit type coercion create silent failures when agents generate config files. The "Norway problem" (the country code NO coerced to the boolean false under YAML 1.1 rules) corrupts data without a syntax error. A single-space indentation error can change the data structure without any parse error.

JSON and TOML do not have these problems. If your tool uses YAML, provide a JSON alternative. Better: publish a JSON Schema for your configuration format. This lets agents validate config before applying it, catching errors at generation time rather than at runtime. Tools like tsconfig.json and package.json succeed reliably with agents; Helm charts (YAML plus Go templates) fail frequently. CLI4

8. Health signals agents can verify

An agent recommending your tool is making a bet on your future, not just your present. A tool that works today but gets abandoned next year is a poor recommendation. Agents (and the systems that configure them) factor health signals into selection decisions, and most of these signals are already public. The question is whether yours tell a coherent story.

Verify your supply chain

The Postmark MCP impersonation incident (2025) was the first confirmed malicious MCP server: a package named postmark-mcp on npm that silently BCC'd all emails to an attacker, accumulating over 1,600 downloads before removal. The OpenClaw audit found 1,184 malicious AI agent skills. Supply chain attacks targeting agent tool infrastructure are no longer theoretical.

What to do: establish a verified identity chain from your domain to your repository to your published artifact. Sign your releases. Add SLSA provenance attestation if your build system supports it. At the more accessible end, verify your publisher status on npm or PyPI, verify your GitHub organization, and add a security.txt file (RFC 9116) at /.well-known/security.txt. These cost almost nothing and signal operational maturity. B8 B9

Prove your health, do not declare it

The Zaira Standard measures health, not activity. A project with zero open issues and no recent commits can be healthy: it may simply be finished. A project with 50 unanswered issues and no recent commits is abandoned. The distinction is measurable.

The concrete indicators: issues getting responses, known vulnerabilities patched within 30 days, clean installs on current LTS runtimes, dependencies pinned to non-vulnerable versions. For open-source tools, run the OpenSSF Scorecard (19 automated security checks, 0-10 scale). The single most important check is Code-Review (whether changes are reviewed before merge). A high Scorecard score is a strong, machine-verifiable health signal that agents can interpret without human help. B11 B14

Reduce your bus factor

Critical open-source dependencies have a 36% chance of losing their only contributor annually. Sixty percent of maintainers are unpaid, and 60% have quit or considered quitting (Eghbal, "Working in Public," 2020; Tidelift maintainer survey, 2024). On the commercial side, Google has sunset 293+ products, and Openbase (YC S20, $3.6M raised, 500K monthly users) shut down with no migration path.

For open-source tools: diversify your contributor base and pursue foundation governance (CNCF, Apache, Linux Foundation) when the project reaches the scale to support it. A bus factor of 1 (one contributor accounting for more than 50% of commits) is the single most predictive risk factor for project abandonment. For commercial tools: publish a sunset policy, provide a data export API, and make it clear that the product is a core revenue line, not a side project. B10 B13

Stabilize your terms

An identifiable pattern preceded every major open-source license change in recent years (MongoDB, Redis, Terraform): single company controlling more than 80% of commits, a broad CLA, no foundation governance, and cloud competition pressure. A license change can invalidate an entire integration overnight.

For open-source tools: if you control more than 80% of commits and have a broad CLA, you carry the license-change risk pattern whether you intend to change or not. Foundation governance or an irrevocable license grant resolves the ambiguity. For commercial tools: publish your pricing history, commit to notice periods for material changes, and grandfather existing customers. B15

9. Sandbox everything

Every agent mistake should be a reversible mistake. This is the single principle that unites test environments, environment separation, simulation fidelity, and data portability. If an agent can cause irreversible harm in your system without explicitly entering a production context, your tool is not agent-ready.

Test environments as first-class infrastructure

Stripe's test infrastructure is the reference implementation: structurally distinct test and live API keys (prefixed sk_test_ and sk_live_), over 50 test card numbers covering specific decline codes and card brands, a Test Clocks API for simulating time-dependent flows like subscriptions and trials, and CLI-based event triggering for testing webhook handlers. An agent can develop, test, and validate an entire payment integration without touching real money.

The key design choice is structural distinction. If test and production credentials look the same, agents will mix them up. If they are structurally different (different prefixes, different base URLs, different response metadata indicating the current mode), the confusion is preventable. NS5
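Structural distinction pays off because it can be checked mechanically. The sketch below uses the sk_test_ and sk_live_ prefixes mentioned above; the key_mode and require_test_key helpers are hypothetical guards an agent harness might run before executing tests.

```python
def key_mode(api_key):
    """Classify a key by its structural prefix."""
    if api_key.startswith("sk_test_"):
        return "test"
    if api_key.startswith("sk_live_"):
        return "live"
    raise ValueError("unrecognized key prefix")


def require_test_key(api_key):
    """Hypothetical guard for agent-run test suites: refuse anything but a test key."""
    mode = key_mode(api_key)
    if mode != "test":
        raise PermissionError(f"expected a test key, got a {mode} key")
    return api_key
```

A check like this is only possible when the mode is encoded in the credential itself; with look-alike keys, the same mistake would surface only after a live call had already been made.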

Environment separation

Development, staging, and production should be architecturally separate: distinct credentials, distinct URLs, and clear indicators in every API response showing which environment the agent is operating in.

Lack of dev/prod separation has directly caused documented production data loss incidents. Audits of AI-built applications have found open database defaults exposing tens of thousands of user records. An estimated 20% of AI-built applications have serious vulnerabilities or configuration errors, many stemming from the absence of environment boundaries. NS6

Domain-specific simulation

Different tool categories have different sandbox requirements.

Payment platforms need test cards, decline code simulation, dispute and refund flows, and time simulation for subscriptions. Paddle's MCP server includes 11 event/simulation tools specifically for agent testing.

Communication platforms need sandbox modes that validate request format without delivering messages. SendGrid's sandbox_mode.enable: true validates the full request without sending. Twilio's loop prevention circuit breaker (30 messages per 30 seconds) acts as a safety brake for runaway agents.

Database platforms need branching or copy-on-write environments where agents can experiment with schema changes and query patterns without risking production data. Neon provisions copy-on-write branches in approximately 500ms with reset_from_parent for agents that make schema mistakes. The branch is disposable. The production database is not.

Hosting platforms need preview deployments, branch-based environments, and rollback mechanisms accessible via API. An agent that can deploy but not roll back has a dangerous capability gap.

The pattern across all of these: agents need a space that is fast to create, fast to discard, and cheap to make mistakes in. The closer your sandbox is to production behavior, the more useful it is, and the less likely an agent is to discover differences the hard way in production. (Zaira Standard: domain modules PM2, CM1, DB1, HI3)

Where to start

If you have read this far and are wondering where to begin, here are three high-impact changes that apply to almost every tool:

Add llms.txt and AGENTS.md. These two files take an afternoon to create and immediately make your tool visible and usable to agents. The cost-to-impact ratio is hard to beat.
Make your errors structured and actionable. If your API returns HTML error pages or generic messages, fix that first. Structured errors with machine-readable codes and retryability signals consistently produce large, measurable improvements in agent success rates.
Reduce your tool's context footprint. If your API has dozens of endpoints, consider grouping them into outcome-oriented tools. If your responses are unbounded, paginate by default. Agents working in constrained context windows fail when your tool consumes too much of the budget.

These are starting points, not the whole picture. The right priorities depend on your tool's complexity surface: a payment API's highest-priority work is different from a CLI tool's. Read the sections that match what you build, and use the Zaira Standard criteria references to see exactly how each practice maps to the evaluation framework.

The agent economy is not a future state. It is the current state, and the tools that work well with agents will be the tools agents recommend. This guide, and the Zaira Standard it accompanies, exist to make "works well with agents" a measurable, improvable property rather than a marketing claim.

The standard defines the criteria. This guide describes the practices. Both are living documents that will evolve as the research advances and as we learn from evaluating real tools. If you have feedback, corrections, or evidence that contradicts something here, we want to hear it: team@zairalabs.ai.