Tool Use — How AI Agents Reach Out of the Model
TL;DR: Tool use is the AI capability that lets a model call external functions — search, code execution, calculators, APIs, databases — rather than generating answers from training data alone. It is the technical foundation that turns chatbots into agents. The model decides when to call a tool, which tool, and with what arguments; the tool returns a result; the model uses the result to continue its work. Tool use is the mechanism under all 2026 agentic work, from Anthropic Claude Managed Agents to OpenAI Codex to LangChain agents. The discipline of designing tools well determines whether an agent is reliable or theatrical.
Simple explanation
A chatbot generates answers from what it learned during training. If you ask “what’s the weather in Tokyo right now?”, the chatbot guesses based on patterns it saw — it doesn’t actually know.
An agent with tool use can ask. It has access to a “get_weather” function. When the question comes in, the agent recognizes “this needs current data,” calls the tool with the argument “Tokyo,” gets back the actual current temperature, and uses that in its response. The model didn’t generate the weather; it generated the decision to look up the weather.
The shift from generation-only to generation + tool-use is the technical foundation of every real agent. Without tools, an AI is limited to what was in its training data. With tools, it can reach out to anything the tools connect to — search engines, customer databases, payment APIs, code execution environments, IoT devices.
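In API terms, the weather exchange above looks roughly like this. This is a vendor-neutral sketch: the tool name, message shapes, and stubbed lookup are illustrative, not any specific provider's schema.

```python
# Illustrative sketch of one tool-use exchange (names are hypothetical,
# not a specific vendor's API).

# 1. The application advertises a tool to the model.
tools = [{
    "name": "get_weather",
    "description": "Returns current weather for a city. Use for any "
                   "question about present conditions.",
    "parameters": {"city": {"type": "string"}},
}]

# 2. The model does not answer directly; it emits a structured call.
model_output = {"tool_call": {"name": "get_weather",
                              "arguments": {"city": "Tokyo"}}}

# 3. The application executes the function and returns the result.
def get_weather(city):
    # Stubbed lookup standing in for a real weather API.
    return {"city": city, "temp_c": 18, "conditions": "light rain"}

call = model_output["tool_call"]
result = get_weather(**call["arguments"])

# 4. The result goes back into context; the model writes the final answer.
print(result)  # {'city': 'Tokyo', 'temp_c': 18, 'conditions': 'light rain'}
```

The model generated step 2 only; everything else is ordinary application code.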
Why it matters for business
Three operational consequences:
- Reliability on factual claims. A model asked to “look up our customer’s recent orders” can call the CRM API directly instead of guessing. This collapses the glossary/hallucination failure mode for any task where the underlying data exists in a reachable system.
- Action, not just analysis. Tool use is the difference between “here are three hotels in Rome” (information) and “I booked the Westin for Tuesday” (action). The glossary/ai-agent category exists because of tool use.
- Compound capabilities. Tools can call other tools. An agent with a “search,” “summarize,” “send email,” and “schedule meeting” toolset can chain them: search for the latest news, summarize the relevant article, draft an email about it, schedule a follow-up meeting. The leverage compounds with each well-designed tool.
The business framing: tool use is what makes AI work scale beyond conversation. Conversation-only AI is a research assistant. Tool-using AI is an operator.
How tool use works (the basic mechanics)
The standard pattern:
- The model is given a list of available tools with their names, descriptions, and parameter schemas. The description is critical — the model decides whether to use a tool based largely on the description.
- The user makes a request. The model decides whether the request requires a tool. If yes, it generates a tool call — a structured request specifying which tool and which arguments.
- The tool executes. The application code runs the function (an API call, a database query, code execution, etc.) and returns the result.
- The model continues with the result. The tool output becomes part of the context; the model uses it to produce the final response, possibly calling additional tools along the way.
The model never runs tools — it generates structured requests that the application code runs. This is important for security and reliability: the application controls what tools exist, what permissions they have, and what happens if a tool fails.
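The four steps above can be sketched as a loop. The model is mocked out here so the example runs standalone; the tool names and message format are hypothetical, but the control flow — the application owning the registry and the execution while the model only emits requests — is the pattern the text describes.

```python
# Minimal tool-use loop with the model mocked out. The key point:
# the model only *requests* calls; the application owns execution.
import json

# The application decides which tools exist and what they do.
TOOLS = {
    "get_orders": lambda customer_id: [{"order_id": 1, "total": 42.50}],
}

def mock_model(messages):
    """Stand-in for an LLM call. Emits a tool call the first time,
    then a final answer once a tool result is in context."""
    if any(m["role"] == "tool" for m in messages):
        return {"role": "assistant",
                "content": "Customer 17 has 1 recent order."}
    return {"role": "assistant",
            "tool_call": {"name": "get_orders",
                          "arguments": {"customer_id": 17}}}

def run_agent(user_request):
    messages = [{"role": "user", "content": user_request}]
    while True:
        reply = mock_model(messages)
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]            # no tool needed: final answer
        fn = TOOLS.get(call["name"])           # the app controls what exists
        result = fn(**call["arguments"]) if fn else {"error": "unknown tool"}
        messages.append(reply)
        messages.append({"role": "tool", "content": json.dumps(result)})

print(run_agent("Show me customer 17's recent orders"))
```

Swapping `mock_model` for a real API call (and `TOOLS` for real functions) is essentially what agent frameworks do under the hood.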
Tool design is the discipline
Whether tool use actually works in production depends almost entirely on tool design. A given model with well-designed tools dramatically outperforms the same model with poorly designed ones.
What good tool design looks like:
- Clear descriptions. The tool description tells the model when to use it. Vague descriptions (“does various things with users”) produce inconsistent usage. Specific descriptions (“Returns the last 10 orders for a given customer_id; use for any question about a customer’s recent purchase history”) produce reliable usage.
- Narrow, composable scope. Many small tools beat a few large ones. A “get_customer” tool and a “get_orders” tool are easier for the model to use correctly than a single “get_customer_with_recent_orders_and_billing_history” mega-tool.
- Structured parameters with validation. Free-form text parameters invite hallucination; typed parameters (enums, IDs, dates with explicit formats) prevent it.
- Clear error semantics. When a tool fails, the error message should tell the model what to do next, not just what went wrong. “Customer not found — retry with a valid customer_id from the previous get_customers call” is operational; “Error 404” is not.
- Idempotency and side-effect explicitness. Destructive tools (send_email, delete_record) should require explicit confirmation patterns; read tools (get_customer) can be called freely. Per automation/ai-agent-organization, destructive-action confirmations are one of the six security layers.
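The design rules above applied to two hypothetical tools might look like this. The schemas follow the JSON-Schema style most function-calling APIs accept, but the tool names, fields, and stubbed datastore are illustrative assumptions, not a real API.

```python
# Sketch of the tool-design rules applied: specific description, enum
# parameters, operational errors, explicit destructive-action opt-in.
# All names and data here are hypothetical.

GET_ORDERS = {
    "name": "get_orders",
    # Specific description: tells the model exactly when to use the tool.
    "description": ("Returns the last 10 orders for a given customer_id; "
                    "use for any question about a customer's recent "
                    "purchase history."),
    "parameters": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string",
                            "description": "ID from a prior get_customers call"},
            # Enum instead of free text: the model cannot invent a status.
            "status": {"type": "string",
                       "enum": ["open", "shipped", "returned"]},
        },
        "required": ["customer_id"],
    },
}

DELETE_RECORD = {
    "name": "delete_record",
    "description": "Permanently deletes a record. Destructive; requires confirm=true.",
    "parameters": {
        "type": "object",
        "properties": {
            "record_id": {"type": "string"},
            # Side-effect explicitness: the model must opt in to the effect.
            "confirm": {"type": "boolean"},
        },
        "required": ["record_id", "confirm"],
    },
}

def get_orders(customer_id, status=None):
    if customer_id not in {"C-100", "C-101"}:      # stubbed datastore
        # Operational error: tells the model what to do next, not just
        # what went wrong.
        return {"error": "Customer not found - retry with a valid "
                         "customer_id from the previous get_customers call"}
    return {"orders": [{"order_id": "O-1", "status": "shipped"}]}

print(get_orders("C-999"))
print(get_orders("C-100"))
```

Note what the mega-tool alternative would lose: with `get_orders` kept narrow, the error message can point the model back at `get_customers` as a concrete next step.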
Tool use in 2026 (the production landscape)
- Function calling (OpenAI, Anthropic native APIs) — the basic primitive. Tools defined inline; model returns structured tool calls; application executes.
- Model Context Protocol (MCP) — emerging cross-platform standard for tool definition and access. Anthropic-driven but increasingly multi-vendor. Lets agents access services (Notion, GitHub, Slack, file systems) through a standard interface.
- Managed-platform tool infrastructure — Anthropic Claude Managed Agents, OpenAI Codex, and similar platforms provide tool execution as part of the managed runtime. The practitioner defines what tools exist; the platform handles invocation and sandboxing.
- Custom tool ecosystems — practitioners building domain-specific tool libraries (LangChain tools, custom MCP servers, internal API wrappers).
Karpathy’s “neural networks as operating systems” prediction (glossary/agent-engineering) describes the architectural inversion: tools are the deterministic substrate that the LLM-as-host-process calls into. The future-state model is that tools are the OS primitives; the LLM is the application layer choosing which primitives to invoke.
Connection to wiki frameworks
- glossary/agent-engineering — Tool use is the basic primitive under Karpathy’s professional discipline. Coordinating multiple agents with multiple tools reliably and safely is what agent engineering operationalizes.
- glossary/ai-agent — Tool use is what makes something an agent rather than a chatbot. The category is constituted by tool-use capability.
- glossary/guardrails — The discipline of bounding what tools can do, when, and with what permissions. Tool use and guardrails are paired — every powerful tool needs a corresponding guardrail.
- automation/ai-agent-organization — Tool use shows up explicitly in technique #3 (model specialization for different tasks) and #8 (layered security for destructive tools). Practitioner playbook.
- tools/claude-managed-agents — Anthropic’s managed-platform realization of tool infrastructure.
- comparisons/managed-agents-vs-diy — The make-or-buy decision for tool-execution infrastructure.
- automation/multi-agent-patterns — Tool use generalizes to agent use — calling a sub-agent is structurally equivalent to calling a tool.
- glossary/llm — The model that makes the tool-calling decision.
Honest limits
- Tool use doesn’t eliminate hallucination. The model can still hallucinate by calling the wrong tool, passing wrong arguments, or misinterpreting tool output. Tools shift the failure mode rather than eliminating it.
- Tool design takes real engineering work. The vendor-provided tools are starting points; production-quality tools for a specific business need custom design, error handling, and security review.
- Latency compounds. Every tool call adds round-trip time. Agents that call many tools sequentially are slow; concurrent tool calling helps but requires more careful design.
- Permissions are a hard problem. A tool that can read customer data and a tool that can send emails together can leak data via email. Permission composition across multiple tools is one of the underappreciated risks of production agent deployment.
- Tool description quality is the bottleneck. The same underlying tool can work great with one description and fail with another. This is a soft prompt-engineering layer that most practitioners under-invest in.
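The latency point above is the one limit with a straightforward partial fix: when tool calls are independent (read-only, no ordering dependency), they can run concurrently instead of serially. A minimal sketch with `asyncio`; the tool names and the simulated round-trip delay are illustrative.

```python
# Concurrent execution of independent tool calls. Three simulated 0.2s
# round trips complete in ~0.2s total instead of ~0.6s serial.
import asyncio
import time

async def call_tool(name, delay):
    await asyncio.sleep(delay)      # stand-in for an API round trip
    return {"tool": name, "ok": True}

async def main():
    start = time.perf_counter()
    # Three independent reads: gather runs them concurrently.
    results = await asyncio.gather(
        call_tool("get_customer", 0.2),
        call_tool("get_orders", 0.2),
        call_tool("get_billing", 0.2),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
print(len(results), round(elapsed, 1))
```

The caveat from the text still applies: this only works when the calls genuinely don't depend on each other's output, which is a design decision, not a runtime optimization.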
Related
- glossary/ai-agent — The category constituted by tool use
- glossary/agent-engineering — The professional discipline that includes tool design
- glossary/guardrails — Paired discipline; every powerful tool needs corresponding guardrails
- glossary/agentic-memory — Memory and tools are the two main “reach outside the model” mechanisms
- automation/ai-agent-organization — Practitioner playbook including tool security
- automation/multi-agent-patterns — Sub-agent calls are structurally tool calls
- tools/claude-managed-agents — Managed-platform tool infrastructure
- comparisons/managed-agents-vs-diy — Make-or-buy decision for tool runtime
- glossary/hallucination — The failure mode tool use partially addresses
- glossary/llm — The underlying technology
- glossary/llm-evals — How tool-use behavior is measured
Key Takeaways
- Tool use is what turns a chatbot into an agent. A chatbot generates from training; an agent reaches out to tools (search, APIs, databases, code execution, other agents).
- The mechanics: model receives tool descriptions → user makes request → model decides whether to call a tool and which one → application executes → result flows back to model → model continues.
- The model never executes tools — it generates structured tool-call requests that the application code runs. This is what makes tool use safe and auditable.
- Tool design is the discipline. Clear descriptions, narrow scope, structured parameters, operational error messages, explicit side-effect semantics — all matter more than which model you use.
- Tool use is the foundation under all 2026 agentic work — Managed Agents, MCP, Codex, LangChain, custom platforms. Different surfaces, same primitive.
- Tool use partially addresses hallucination but doesn’t eliminate it. It shifts the failure mode (now the model can call the wrong tool or misread output) rather than removing it.
Sources
- Anthropic Function Calling and Tool Use documentation (2024–2026)
- OpenAI Function Calling and Assistants API documentation (2023–2026)
- Model Context Protocol (MCP) specification — Anthropic-driven, increasingly cross-vendor
- Practitioner consensus from automation/ai-agent-organization (Pimenov), glossary/agent-engineering (Karpathy Sequoia 2026), and the LangChain / LangGraph / Letta documentation
- Karpathy, A. (May 2026). From Vibe Coding to Agentic Engineering (Sequoia AI Ascent) — the NN-as-OS prediction frames tool use as the deterministic substrate