What Is Tool Use? How LLMs Call External APIs and Functions


What Is Tool Use?

Tool Use (also called function calling) is the pattern that lets a large language model (LLM) call external functions, APIs, or databases during a conversation rather than answering purely from its internal knowledge. With Tool Use, a model can fetch today’s weather, query an internal CRM, run a calculator, send an email, or trigger any other deterministic system—and then weave the result back into a natural-language reply.

Think of it like a chef who knows plenty of recipes but still needs an oven, a blender, and a refrigerator. The recipe book is the LLM’s training data; the appliances are the tools. Without them, the chef can talk about cooking but cannot actually cook. Tool Use is how you give an LLM the appliances it needs to execute real work.

How to Pronounce Tool Use

tool yoos (/tuːl juːs/)

function calling (/ˈfʌŋk.ʃən ˈkɔːl.ɪŋ/) — used interchangeably in most docs

Anthropic’s docs say “tool use”; OpenAI’s docs say “function calling”. Both refer to the same mechanism in current APIs, and many engineers use the terms interchangeably. Match your provider’s vocabulary when reading documentation, but don’t worry that the two names describe different technologies.

How Tool Use Works

Tool Use is a multi-turn protocol. The model does not execute anything itself; instead it issues a structured request to the application, which runs the tool and returns the result. Keep this loop in mind when debugging.

The Tool Use lifecycle

1. Define tools (name, input_schema)
2. Send them to the model along with the user question
3. Model returns a tool_use block
4. App runs the actual tool
5. Feed the result back (tool_result)
6. Model gives the final answer

Steps 1–2: Declare tools and send the request

You declare each available tool as JSON Schema: its name, a short description of when to use it, and the arguments it accepts. Models rely heavily on those descriptions—a vague description almost always produces the wrong tool choice.

Step 3: The tool_use block

If the model decides a tool is needed, it stops text generation and returns a tool_use block (Anthropic) or a tool_calls object (OpenAI) that includes the chosen tool name and a JSON argument object.

Steps 4–6: Execute and hand back

Your application runs the tool (HTTP call, DB query, shell command) and feeds the result back to the model as a tool_result. The model then produces the final answer. For complex agents this loop repeats many times; Claude Code, for example, runs dozens of tool calls in a typical session.

Tool Use Usage and Examples

Here is a minimal Anthropic-API example. The application defines a stubbed get_weather tool and lets Claude decide when to call it.

import anthropic, json

client = anthropic.Anthropic()

tools = [{
    "name": "get_weather",
    "description": "Get current weather for a major city. Only major US/JP cities are supported.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. Tokyo, San Francisco"}
        },
        "required": ["city"]
    }
}]

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

resp = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    tools=tools,
    messages=messages,
)

# stop_reason == "tool_use" signals Claude wants to invoke a tool.
if resp.stop_reason == "tool_use":
    tool_use = next(b for b in resp.content if b.type == "tool_use")
    # 1) Run the tool for real (HTTP call, DB query, etc.)
    result = {"city": tool_use.input["city"], "temp_c": 18, "condition": "sunny"}

    # 2) Append the assistant turn + tool_result and call again.
    messages.append({"role": "assistant", "content": resp.content})
    messages.append({
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": tool_use.id,
            "content": json.dumps(result)
        }]
    })
    final = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )
    print(final.content[0].text)

How OpenAI differs

OpenAI still uses the term function calling in much of its documentation, but its modern tools parameter works almost identically to Anthropic’s API. LangChain, LlamaIndex, and similar frameworks abstract the differences away if you need portability.
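
For comparison, here is the same get_weather tool declared in both vocabularies. This is a sketch: the field names follow each provider's documented request shapes at the time of writing, but verify them against current API docs before relying on them.

```python
# The same tool in both declaration styles. The inner JSON Schema is
# identical; only the envelope differs.

anthropic_tool = {
    "name": "get_weather",
    "description": "Get current weather for a major city.",
    "input_schema": {                # Anthropic: "input_schema"
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

openai_tool = {
    "type": "function",              # OpenAI wraps the schema in a "function" object
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a major city.",
        "parameters": {              # OpenAI: "parameters"
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# The schema payloads themselves are interchangeable.
assert anthropic_tool["input_schema"] == openai_tool["function"]["parameters"]
```

Porting between providers is therefore mostly a matter of re-wrapping the same schema, which is exactly what the framework adapters do for you.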

Advantages and Disadvantages of Tool Use

Advantages

  • Fresh data: fetch prices, weather, or inventory that live outside the model’s training window
  • Exact computation: hand arithmetic or unit conversion to a calculator rather than trusting model output
  • Agentic behavior: send emails, update records, control hardware—side effects become possible
  • Reduced hallucination: grounded answers when tools return authoritative data

Disadvantages

  • Latency and cost: multi-turn loops burn more tokens and wall-clock time
  • Security surface: any tool that mutates state (send money, delete rows) deserves extra safeguards
  • Bad arguments: models occasionally hallucinate argument values—always validate server-side
  • Prompt injection: attacker-controlled tool output can try to redirect the model

Tool Use vs RAG

Both let an LLM use external information, but their architectures differ—this matters when picking an approach.

| Aspect       | Tool Use                                      | RAG                                     |
|--------------|-----------------------------------------------|-----------------------------------------|
| Who initiates | Model chooses a tool                         | App runs retrieval before the model     |
| Target       | Arbitrary functions and APIs                  | Document / vector search                |
| Typical use  | Agents, automation                            | Internal knowledge Q&A                  |
| Relationship | Can call a RAG retriever as one of its tools  | Becomes one tool inside a broader agent |

A Short History of Tool Use

The concept of letting language models invoke external actions predates GPT-4. Early prototypes in 2022 asked the model to emit special string tokens (like “SEARCH(…)”) that the host application parsed out of the reply. Toolformer, a 2023 research paper from Meta AI, showed that LLMs could learn when and how to call tools by training on synthetic examples of tool-annotated text. That work sparked the mainstream adoption of Tool Use patterns in production APIs.

OpenAI shipped the first widely-used production API under the name function calling in mid-2023. Anthropic followed with its own Tool Use API in 2024, adopting a near-identical JSON-schema-based protocol. By 2025 every major provider—Google Gemini, Mistral, Cohere, and open-weights frameworks like Ollama—offered a compatible surface. The rapid convergence means that most new applications can pick a provider and switch later with modest code changes.

A parallel innovation worth noting is the Model Context Protocol (MCP), introduced by Anthropic in late 2024. MCP standardizes how servers expose tools to any AI client. Instead of hardwiring tool definitions into every application, teams can run MCP servers and attach the same tools to Claude Code, Cursor, or custom agents. This is especially valuable for enterprises that want to reuse access-controlled internal APIs across many AI products.

Best Practices for Tool Use

Designing reliable Tool Use systems is more engineering than prompt craft. Keep the following conventions in mind.

1. Make tool names verb-first and specific

Prefer getWeatherByCity over weather. The model uses the name as a primary signal, and vague names lead to mis-calls. Keep names stable across versions as well: renaming a tool can silently break saved prompts and workflows that reference the old name.

2. Write descriptions that explain both when and when not to call

Every description should include two parts: a positive rule (“Call this when the user asks for weather in a major US city”) and a negative rule (“Do not call this for historical weather or forecasts beyond 3 days”). It is important to include examples of both kinds of user input.

3. Validate arguments server-side

Never assume the arguments you receive are safe. Check types, bounds, and allow-lists before executing anything. Remember that LLMs can hallucinate plausible-looking but wrong values, and prompt injection can smuggle malicious parameters through untrusted input. A good habit is to treat tool arguments like HTML form data: never trust, always validate.
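
A minimal sketch of that habit, assuming a hypothetical SUPPORTED_CITIES allow-list for the get_weather tool from the earlier example:

```python
# Server-side validation before executing a tool call. The allow-list
# and error messages are illustrative, not from any SDK.
SUPPORTED_CITIES = {"Tokyo", "Osaka", "San Francisco", "New York"}

def validate_weather_args(args: dict) -> tuple[bool, str]:
    """Check types and allow-list membership; never trust model output."""
    city = args.get("city")
    if not isinstance(city, str):
        return False, "city must be a string"
    if city not in SUPPORTED_CITIES:
        return False, f"unsupported city: {city!r}"
    return True, "ok"

# A hallucinated or injected argument is rejected before any API call runs:
ok, reason = validate_weather_args({"city": "Atlantis"})
```

On failure, return the reason as the tool_result so the model can recover by asking the user for a supported city instead of silently proceeding.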

4. Gate destructive actions with confirmation

Any tool that writes to state—sending email, creating tickets, charging credit cards—should require an explicit user confirmation step before execution. Claude Code, for example, asks the user before running shell commands that could cause harm. You should mirror that pattern in your own agents.

5. Log everything for replay

Production agents accumulate failures. Log every tool call, its arguments, and its response so you can replay sessions during debugging. Include a timestamp, the model version, and the user ID. When a bug report comes in at 2 a.m., that log is the only way to understand what happened.
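
One way to sketch such a log record; the field set here is a suggestion, not a standard:

```python
import json
import time
import uuid

def log_tool_call(tool_name: str, args: dict, result: dict,
                  model: str, user_id: str) -> dict:
    """Build one structured, replayable log record per tool call."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),       # timestamp for ordering during replay
        "model": model,          # model version matters when behavior shifts
        "user_id": user_id,
        "tool": tool_name,
        "args": args,
        "result": result,
    }
    print(json.dumps(record))    # in production, ship to your log pipeline
    return record
```

Because every record carries the model version and arguments, a session can be replayed step by step when that 2 a.m. bug report arrives.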

6. Budget your turns

A surprisingly common failure mode is agents that loop: the model calls a tool, gets a confusing result, calls it again, and again. Set a hard cap on tool-call turns (often 10–30 depending on the task) and surface a graceful error if the cap is exceeded. Monitor the cost tail as well: each additional turn adds latency and token spend, and because the growing conversation is resent every turn, later turns cost more than earlier ones.
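
A turn cap can be enforced in the driving loop itself. In the sketch below, call_model and run_tool are hypothetical stand-ins for your API client and tool executor:

```python
MAX_TOOL_TURNS = 10   # hard cap; tune per task

def run_agent(call_model, run_tool, max_turns: int = MAX_TOOL_TURNS) -> str:
    """Drive the tool loop but refuse to spin forever."""
    for _ in range(max_turns):
        response = call_model()
        if response["type"] == "final":
            return response["text"]
        # In a real agent the tool result is appended to the conversation
        # before the next call_model invocation.
        run_tool(response["tool"], response["args"])
    # Cap exceeded: surface a graceful error instead of looping on.
    return "Sorry, this request needed too many steps. Please narrow it down."
```

The cap lives in application code, not in the prompt, so even a confused model cannot burn an unbounded number of turns.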

Parallel Tool Calls and Batching

Modern Tool Use APIs support returning multiple tool calls in one assistant turn, which is often called parallel tool use. Instead of three sequential round-trips, the model can say “fetch the order, look up the customer, and check the refund policy” all at once, letting the application execute them concurrently. This is especially useful for latency-sensitive agents like customer support bots. It is important to note that not every provider supports parallel calls the same way; check your target API’s current limits before designing around this pattern.
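
Concurrent execution on the application side can be as simple as a thread pool. The registry and tool names below are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def execute_parallel(tool_calls: list, registry: dict) -> list:
    """Run independent tool calls concurrently.
    registry maps tool name -> callable; results keep request order."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(registry[c["name"]], **c["args"])
                   for c in tool_calls]
        return [f.result() for f in futures]

# Example: three independent lookups the model requested in one turn.
registry = {
    "get_order":    lambda order_id: {"order_id": order_id, "status": "shipped"},
    "get_customer": lambda customer_id: {"customer_id": customer_id, "tier": "gold"},
    "get_policy":   lambda topic: {"topic": topic, "days": 30},
}
calls = [
    {"name": "get_order",    "args": {"order_id": "o1"}},
    {"name": "get_customer", "args": {"customer_id": "c9"}},
    {"name": "get_policy",   "args": {"topic": "refund"}},
]
results = execute_parallel(calls, registry)
```

Because the three lookups are independent, total latency is roughly that of the slowest call rather than the sum of all three.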

Common Misconceptions

Misconception 1: The model calls the API itself

No. The model only emits a JSON request for a tool. Your application decides whether to actually execute it. All security and authorization live on your side.

Misconception 2: Function calling and Tool Use are different features

They are the same idea. OpenAI labels it “function calling” for historical reasons, Anthropic labels it “tool use”. The runtime behavior is equivalent.

Misconception 3: Declaring a tool forces the model to use it

Models will skip tools when they think they already know the answer. If you need to force a specific tool, set tool_choice in the request.
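
The request fragments below show the tool_choice shapes for both providers as documented at the time of writing; double-check your SDK version before relying on them.

```python
# Forcing a specific tool. Shapes follow each provider's documented
# tool_choice parameter -- confirm against current API docs.

anthropic_request = {
    "model": "claude-sonnet-4-5",
    "tool_choice": {"type": "tool", "name": "get_weather"},  # force this tool
}

openai_request = {
    "model": "gpt-4o",
    "tool_choice": {"type": "function", "function": {"name": "get_weather"}},
}

# Anthropic also accepts {"type": "auto"} (the default, model decides) and
# {"type": "any"} (some tool must be used; model picks which).
```

Forcing a tool is useful in pipelines where a step must always produce a structured call, such as an extraction stage that should never answer in free text.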

Real-World Use Cases

  • Coding agents: Claude Code and Cursor run Read/Edit/Bash as tools
  • Customer support bots: look up order status, create refunds, escalate tickets
  • Ops automation: Slack notifications, JIRA ticket creation, GitHub PR comments driven by AI
  • Data analysis copilots: generate and execute SQL, then narrate results
  • Research agents: call web search, download PDFs, summarize across sources
  • Robotics and IoT: expose sensors and actuators as tools with strict schemas

Frequently Asked Questions (FAQ)

Q1. Do all Claude models support Tool Use?

A. Most Claude 3 and later models (Haiku, Sonnet, Opus families) support it. Check Anthropic’s model reference for your exact model.

Q2. How is billing handled?

A. Tool definitions count as input tokens, and every turn (initial call, tool_result, final answer) is billed. Complex agents can use significantly more tokens than a single-turn chat.

Q3. Can I call multiple tools at once?

A. Yes. Both Claude and OpenAI support parallel tool calls in a single assistant response.

Q4. How long should my tool description be?

A. Vague descriptions cause misuse. Aim for 2–5 sentences that cover what the tool does, when to call it, and when not to call it.

Q5. How do I defend against malicious tool output?

A. Validate arguments server-side, require explicit approval for destructive actions, log every call, and follow the principle of least privilege in API credentials.

Conclusion

  • Tool Use lets an LLM request external functions mid-conversation
  • Function calling is essentially the same pattern under a different brand name
  • The model only proposes calls; the app executes them, so security is the app’s job
  • It unlocks fresh data, accurate computation, and genuine agentic workflows
  • Always validate arguments and gate destructive actions

Advanced Tool Use Patterns for Production Systems

Designing Robust Tool Schemas

The quality of tool use depends heavily on how precisely you describe each tool’s purpose, inputs, and outputs. You should treat the tool schema as user-facing API documentation: include concrete examples of when the tool should be invoked and when it should not. Note that parameter enumerations are especially valuable for reducing malformed calls. Keep in mind that the model reads descriptions as hints, not hard constraints, so phrasing matters as much as schema typing.

Important: name parameters consistently across tools. If one tool uses user_id and another uses userId, the model may confuse them and produce invalid calls. Adopt a project-wide naming convention and apply it uniformly.
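
A sketch of an enum-constrained schema using a consistent user_id parameter name; the create_ticket tool is hypothetical:

```python
# Enum-constrained parameter plus a project-wide naming convention.
create_ticket_tool = {
    "name": "create_ticket",
    "description": "Create a support ticket. Do not use for status queries.",
    "input_schema": {
        "type": "object",
        "properties": {
            "priority": {
                "type": "string",
                "enum": ["low", "medium", "high"],  # enum cuts malformed calls
                "description": "Ticket priority; default to 'low' if unsure.",
            },
            "user_id": {                 # same name every other tool uses
                "type": "string",
                "description": "Internal user ID, e.g. 'u_123'.",
            },
        },
        "required": ["priority", "user_id"],
    },
}
```

With the enum in place, the model cannot invent a "critical" priority: it either picks a listed value or the server-side validator rejects the call.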

Error Handling and Retry Strategies

Production tools fail for many reasons: downstream API outages, permission errors, invalid parameters, rate limits. You should encode errors in a structured format (status code, error type, human-readable message) and return them as the tool result. The model can then decide whether to retry, escalate, or ask the user for more information. It is important to enforce retry caps at the tool layer rather than relying on the model to stop retrying, because a misaligned model may enter an infinite retry loop that burns tokens.
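
One possible shape for that structured error envelope, with the retry cap enforced in the executor; tool_error and execute_with_retries are illustrative helpers, not SDK functions:

```python
def tool_error(error_type: str, message: str, retryable: bool) -> dict:
    """Structured error envelope returned as the tool_result content."""
    return {"status": "error", "error_type": error_type,
            "message": message, "retryable": retryable}

MAX_RETRIES = 2   # enforced by the executor, never left to the model

def execute_with_retries(fn, *args) -> dict:
    """Run a tool with a hard retry cap on transient failures."""
    last = tool_error("unknown", "tool was never executed", retryable=False)
    for _ in range(MAX_RETRIES + 1):
        try:
            return {"status": "ok", "data": fn(*args)}
        except TimeoutError as exc:   # other exception types would get their
            last = tool_error("timeout", str(exc), retryable=True)
    return last   # cap reached; model decides: escalate or ask the user
```

The model sees a machine-readable status either way, and the retry budget lives in code, so a misaligned model cannot trigger an infinite retry loop.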

Parallel Tool Execution

Claude’s support for parallel tool calls allows multiple independent tools to execute concurrently, significantly reducing latency for tasks that need to gather information from several sources. You should carefully map dependencies: if tool B requires output from tool A, the model must invoke them sequentially. Keep in mind that most dashboards and status pages combine data from multiple independent sources, making them natural fits for parallel tool use. Important: avoid parallelizing state-mutating operations unless you have strong consistency guarantees.

Security and Access Controls

Tool use expands the surface area of your system. You should restrict the tools a particular agent can access based on user role, tenant, or context. Note that secrets should never be embedded in the tool description. Authentication tokens must be injected by the server-side executor at call time, not passed through the model. Important: log every tool invocation with its input arguments (redacted) and outcome, and review these logs regularly to detect abuse patterns.
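
A role-based allow-list can be applied before the tools array is ever sent to the model. Roles and tool names here are hypothetical:

```python
# Role -> tool allow-list; applied server-side before each model call.
TOOL_ACCESS = {
    "viewer": {"get_order", "get_policy"},
    "agent":  {"get_order", "get_policy", "create_refund"},
    "admin":  {"get_order", "get_policy", "create_refund", "delete_customer"},
}

def tools_for_role(all_tools: list, role: str) -> list:
    """Only expose the tools this user's role may invoke.
    Unknown roles get nothing (fail closed)."""
    allowed = TOOL_ACCESS.get(role, set())
    return [t for t in all_tools if t["name"] in allowed]
```

Because the model never sees a tool it cannot call, it cannot even attempt a privileged action; the same check should still run again at execution time as defense in depth.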

Cost Optimization for Tool Use Workflows

Tool-use conversations tend to accumulate long context: every tool call and tool result is added to the conversation. You should design tools to return compact, structured summaries rather than raw payloads. Keep in mind that large JSON bodies inflate token usage without necessarily helping the model. A two-tier pattern works well: tools return a summary by default and expose an explicit detail-fetching tool for cases where the model needs the full data.
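
The two-tier pattern might look like this; search_orders and get_order_detail are hypothetical tools backed by an in-memory stand-in for your datastore:

```python
# Two-tier pattern: compact summaries by default, full payload on demand.
FULL_ORDERS = {
    "o1": {"id": "o1", "status": "shipped",
           "items": [{"sku": "A", "qty": 2}],
           "history": ["created", "paid", "shipped"],
           "address": "1 Example St"},
}

def search_orders(query: str) -> list:
    """Summary tool: return only the fields the model usually needs."""
    return [{"id": o["id"], "status": o["status"]}
            for o in FULL_ORDERS.values()]

def get_order_detail(order_id: str) -> dict:
    """Detail tool: fetch the full record only when the model asks for it."""
    return FULL_ORDERS[order_id]
```

Most turns consume only the small summary; the bulky history and address fields enter the context only when the model explicitly calls the detail tool.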

Operational Patterns for Tool-Using Agents

Agent Loop Construction

Tool use forms the backbone of modern autonomous agents. You should design the agent loop to clearly separate reasoning, tool invocation, and result integration. A typical structure: the loop gathers current context, calls the model, inspects the response for tool calls, executes those calls in a sandbox, appends results, and iterates until a stopping condition. Important: implement a hard cap on iterations, total tokens, and wall-clock time so that runaway loops cannot drain budget or hang the system. Keep in mind that observability here is essential: emit structured logs for each loop iteration.
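
A skeleton of such a loop with all three budgets enforced; call_model and execute_tool are stand-ins for your API client and sandboxed executor:

```python
import time

def agent_loop(call_model, execute_tool,
               max_iters: int = 20,
               max_tokens: int = 50_000,
               max_seconds: float = 60.0) -> dict:
    """Agent loop with hard caps on iterations, tokens, and wall-clock time."""
    start, tokens_used = time.monotonic(), 0
    for i in range(max_iters):
        if time.monotonic() - start > max_seconds:
            return {"status": "timeout", "iterations": i}
        resp = call_model()
        tokens_used += resp.get("tokens", 0)
        if tokens_used > max_tokens:
            return {"status": "token_budget_exceeded", "iterations": i}
        if not resp.get("tool_calls"):            # no calls -> final answer
            return {"status": "done", "answer": resp.get("text"),
                    "iterations": i}
        for call in resp["tool_calls"]:
            execute_tool(call)   # results are folded into call_model's context
    return {"status": "iteration_cap", "iterations": max_iters}
```

Each exit path returns a structured status, which doubles as the per-iteration observability record the section calls for.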

Testing and Evaluation for Tool Use

Tool-using agents require tests beyond what traditional software demands. You should maintain a library of scenarios covering happy paths, missing data, ambiguous instructions, malicious prompts, and partial failures. Note that regression testing often uses deterministic mocks for tool responses so that model behavior can be compared across versions. Important: include latency budgets in your tests. An agent that produces correct outputs in ten seconds may be useless if the user expects a response in two.
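
A deterministic mock layer can be as small as a lookup table; the canned responses below are illustrative:

```python
# Canned tool responses keyed by (tool name, argument values), so the same
# scenario replays identically across model versions.
MOCK_RESPONSES = {
    ("get_weather", "Tokyo"): {"temp_c": 18, "condition": "sunny"},
    ("get_weather", "Osaka"): {"temp_c": 21, "condition": "cloudy"},
}

def mock_tool(name: str, **args):
    """Replay canned results instead of hitting live APIs."""
    key = (name, *sorted(str(v) for v in args.values()))
    if key not in MOCK_RESPONSES:
        # Fail loudly: an unscripted call means model behavior drifted.
        raise KeyError(f"unscripted tool call: {key}")
    return MOCK_RESPONSES[key]
```

Swapping this in for the real executor turns an agent run into a reproducible test case: any call outside the script surfaces immediately as a regression.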

Advanced teams adopt evaluation frameworks that can replay production traces offline with different models or prompt variations. Keep in mind that this enables data-driven iteration rather than guesswork. You should version every evaluation scenario alongside the code so that regressions are immediately traceable to specific changes.

Combining Tool Use with Retrieval

Many production systems blend tool use with RAG patterns: a retrieval tool returns relevant documents, and the model uses those documents to craft its response. You should treat retrieval as just another tool, with clear schemas and logging. Important: monitor retrieval precision and recall because hallucinations often trace back to poor retrieval rather than model limitations. Keep in mind that hybrid architectures (keyword + vector search) tend to outperform pure vector search for enterprise content.
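
Treating retrieval as a tool means giving it the same schema discipline as any other tool. The search_docs tool below is a hypothetical stub standing in for a real hybrid retriever:

```python
# A retrieval tool declared like any other tool, plus a stub implementation.
retrieval_tool = {
    "name": "search_docs",
    "description": ("Search internal documentation. Call for questions about "
                    "company policy or product specs; do not call for "
                    "arithmetic or general chit-chat."),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "top_k": {"type": "integer", "minimum": 1, "maximum": 10},
        },
        "required": ["query"],
    },
}

def search_docs(query: str, top_k: int = 5) -> list:
    """Stub retriever; swap in your hybrid keyword + vector search."""
    corpus = [{"id": "d1", "text": "Refunds are allowed within 30 days."}]
    return corpus[:top_k]   # log query, top_k, and hits for precision review
```

Because the retriever is just another registered tool, the same logging, validation, and evaluation machinery described above covers it for free.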

Responsible Tool Use in User-Facing Systems

When tools have real-world side effects (sending emails, executing trades, posting to social media), you should insert human-in-the-loop confirmations at critical decision points. Important: never let the agent perform irreversible actions without an explicit confirmation signal. Keep in mind that users increasingly expect transparency: expose what tools the agent is about to call, show intermediate reasoning, and provide a mechanism to cancel or correct the agent mid-flight.

Future Outlook for Tool Use

Near-Term Evolution

Over the next twelve to twenty-four months, Tool Use is expected to evolve along several dimensions. You should anticipate deeper integration with surrounding developer tooling, improved reliability, and expanded ecosystems of third-party extensions. Important: teams that invest early in the operational fundamentals (observability, cost controls, evaluation) will be positioned to adopt new capabilities faster than teams that retrofit them later. Keep in mind that the pace of change in this space tends to compress traditional planning horizons, so roadmaps should include explicit review checkpoints.

Note that many organizations underestimate the operational maturity required to make new AI capabilities durable. You should budget explicitly for evaluation datasets, human-in-the-loop review workflows, and incident response capacity alongside the headline feature work.

Workforce and Skills Implications

Adoption of Tool Use changes the skill profile organizations need. You should invest in training programs that help practitioners reason about model behavior, craft effective prompts, and evaluate outputs critically. Important: technical training alone is insufficient. Build rituals (weekly showcases, monthly retrospectives, quarterly policy reviews) so that learning compounds across the organization. Keep in mind that senior engineers and subject-matter experts are often the most impactful early adopters because they can recognize subtle output quality issues that less experienced reviewers might miss.

Strategic Considerations for Leaders

Leaders evaluating Tool Use should consider both upside (productivity, new product surfaces, customer experience) and downside (regulatory exposure, reliability risk, vendor concentration). You should develop scenario plans that cover vendor pricing changes, capability leaps by competitors, and regulatory restrictions. Important: maintain optionality where possible by abstracting provider-specific details behind internal interfaces and maintaining relationships with multiple vendors. Keep in mind that AI platform bets made today will shape organizational capabilities for years, so these decisions deserve board-level attention in many organizations.

Recommended Next Steps

Teams beginning or expanding their use of Tool Use should start with a small number of high-signal pilots, instrument them thoroughly, and iterate in public within the organization. You should document what worked, what did not, and why, so that knowledge accumulates rather than evaporating. Important: appoint a clear owner for the Tool Use program who is accountable for both outcomes and risk posture. Keep in mind that small, disciplined deployments that prove value tend to win sustained executive support, while sprawling exploratory efforts often stall before reaching production impact.
