NVIDIA NIM Tool Calling: Building a 60-Line Agent Loop Without a Framework

Most agent frameworks hide the control loop behind abstractions. NVIDIA NIM lets you build the loop yourself in under 60 lines of Python, exposing exactly how the model decides which tool to call and when to stop.

This is the final piece of a five-part series that started with a basic NIM chat call and added retrieval, guardrails, and self-hosting. Now the model gets two tools (a clock and a retriever) and decides which one to use based on the user’s question.

What the Loop Actually Does

The agent loop is a state machine with three outcomes per iteration:

Model returns a final answer (stop)
Model requests a tool call (execute, append result, loop)
Loop limit reached (stop with error)

The conversation history grows with each turn. When the model requests a tool, you append its assistant message (the one carrying the tool_calls field), then one message with role: tool per call that carries that function's output.

# Assumes `client` is an OpenAI-compatible client pointed at a NIM endpoint
messages = [
    {"role": "system", "content": "You are a USC campus assistant."},
    {"role": "user", "content": "What time is it?"}
]

response = client.chat.completions.create(
    model="meta/llama-3.1-70b-instruct",
    messages=messages,
    tools=tools_schema,
    max_tokens=512
)

# Model returns tool_calls instead of content
message = response.choices[0].message
if message.tool_calls:
    # Append the assistant's request before the tool results that answer it
    messages.append(message.model_dump())
    for call in message.tool_calls:
        result = execute_tool(call.function.name, call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": result
        })
    # Loop again with updated messages

The model sees the tool schema in the tools parameter, decides whether to call one, and returns structured JSON in tool_calls if it does. Your code executes the function, appends the result, and calls NIM again.

Tool Schema Contract

NIM expects tools in OpenAI function-calling format. Each tool is a JSON object of type function whose function field holds a name, a description, and a JSON Schema describing its parameters.

tools_schema = [
    {
        "type": "function",
        "function": {
            "name": "get_current_time",
            "description": "Returns the current time in Los Angeles",
            "parameters": {
                "type": "object",
                "properties": {},
                "required": []
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_usc_knowledge",
            "description": "Search USC campus information using semantic retrieval",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query"
                    }
                },
                "required": ["query"]
            }
        }
    }
]

The model uses the description to decide when to call the tool. If you write “Returns the current time,” the model will call it for time-related questions. If you write “Use this when the user asks about time,” you bias the model toward calling it more often.

Decision Logic Inside the Model

The model sees the tools array and decides within a single response whether to:

Answer directly (no tool needed)
Call one tool
Call multiple tools in parallel

NIM models trained for tool calling (like Llama 3.1 70B Instruct) output a special token sequence that the API parses into the tool_calls field. The model does not execute anything. It returns a structured request.

Your orchestration code maps function.name to a Python function and calls it. The model never sees your function implementation, only the schema and the result you append.
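To make the request shape concrete, here is a sketch of what one entry in tool_calls looks like once serialized, and how your code reads it. The id value is invented for illustration. The detail that trips people up is that arguments arrives as a JSON string, not a dict.

```python
import json

# Illustrative payload: the id is made up; the field names follow the
# OpenAI-compatible tool_calls format used throughout this article.
tool_call = {
    "id": "call_abc123",
    "type": "function",
    "function": {
        "name": "search_usc_knowledge",
        "arguments": "{\"query\": \"library hours\"}",  # a JSON string, not a dict
    },
}

name = tool_call["function"]["name"]
args = json.loads(tool_call["function"]["arguments"])  # parse before use

print(name, args["query"])
```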

Execution and Error Handling

The simplest execution dispatcher is a dictionary:

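import json
from datetime import datetime
from zoneinfo import ZoneInfo

# Every handler takes the parsed arguments dict, so dispatch is uniform.
# `retriever` is assumed to exist (the retrieval component from earlier in the series).
TOOLS = {
    "get_current_time": lambda args: datetime.now(ZoneInfo("America/Los_Angeles")).strftime("%I:%M %p"),
    "search_usc_knowledge": lambda args: str(retriever.search(args["query"])),
}

def execute_tool(name, arguments_json):
    # Check the name up front so a missing argument key is not misreported as an unknown tool
    if name not in TOOLS:
        return f"Error: Unknown tool {name}"
    try:
        # Parse inside the try: the model can emit malformed or empty arguments
        args = json.loads(arguments_json or "{}")
        return TOOLS[name](args)
    except Exception as e:
        return f"Error executing {name}: {e}"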

When a tool fails, you have three options:

Return the error as a string (model sees it, may retry or apologize)
Retry the tool call (add retry logic in your dispatcher)
Stop the loop (raise an exception, surface to user)

The first option is simplest. The model sees “Error: Database timeout” and can tell the user “I couldn’t retrieve that information right now.”
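The second option can be layered on top of the first. Below is a minimal sketch of a retry wrapper, assuming each tool is a function that takes the parsed arguments dict; with_retries, attempts, and base_delay are names and defaults chosen for illustration, not part of NIM or the original loop.

```python
import time

def with_retries(fn, attempts=3, base_delay=0.2):
    """Wrap a tool function so failures are retried with exponential backoff.

    `attempts` and `base_delay` are illustrative defaults, not tuned values.
    """
    def wrapped(args):
        last_error = None
        for attempt in range(attempts):
            try:
                return fn(args)
            except Exception as e:  # in practice, catch only transient errors
                last_error = e
                if attempt < attempts - 1:
                    time.sleep(base_delay * (2 ** attempt))
        # All attempts failed: fall back to option one and report the error
        return f"Error: {last_error}"
    return wrapped
```

Wrapping only the tools that talk to flaky backends keeps the dispatcher itself unchanged, and the model still sees a plain error string if every attempt fails.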

Loop Termination and Guardrails

You need two guardrails:

Maximum iterations (prevent infinite loops)
Maximum tool calls per turn (prevent runaway parallel calls)

MAX_ITERATIONS = 5
MAX_TOOL_CALLS_PER_TURN = 3

def run_agent(messages):
    for iteration in range(MAX_ITERATIONS):
        response = client.chat.completions.create(
            model="meta/llama-3.1-70b-instruct",
            messages=messages,
            tools=tools_schema,
            max_tokens=512
        )
        message = response.choices[0].message

        # No tool calls means the model produced its final answer
        if not message.tool_calls:
            return message.content

        if len(message.tool_calls) > MAX_TOOL_CALLS_PER_TURN:
            return "Error: Too many tool calls requested"

        # Execute tools, append results, continue (see State Management)

    return "Error: Loop limit reached"

The model can get stuck in a loop if it keeps calling the same tool with slightly different arguments. The iteration limit prevents this. The per-turn limit prevents the model from requesting 50 parallel searches.

State Management

The conversation history is your state. Every tool call and result gets appended:

messages.append(response.choices[0].message.model_dump())  # Assistant message with tool_calls

for call in response.choices[0].message.tool_calls:
    result = execute_tool(call.function.name, call.function.arguments)
    messages.append({
        "role": "tool",
        "tool_call_id": call.id,
        "name": call.function.name,
        "content": result
    })

The tool_call_id links the result back to the request. The model uses this to match results to calls when it requested multiple tools in parallel.
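Combining the guardrails from the previous section with the append logic above gives the whole loop. The following is a sketch rather than the article's exact code: the run_agent signature is an assumption, client stands for any OpenAI-compatible client pointed at a NIM endpoint, and execute_tool is passed in so the loop can be exercised with a stub.

```python
MAX_ITERATIONS = 5
MAX_TOOL_CALLS_PER_TURN = 3

def run_agent(client, messages, tools_schema, execute_tool,
              model="meta/llama-3.1-70b-instruct"):
    """Run the tool-calling loop until the model answers or a limit trips.

    `execute_tool(name, arguments_json)` must return a string.
    `messages` is mutated in place so the caller keeps the full history.
    """
    for _ in range(MAX_ITERATIONS):
        response = client.chat.completions.create(
            model=model, messages=messages, tools=tools_schema, max_tokens=512
        )
        message = response.choices[0].message

        if not message.tool_calls:
            return message.content  # final answer

        if len(message.tool_calls) > MAX_TOOL_CALLS_PER_TURN:
            return "Error: Too many tool calls requested"

        # Record the assistant's request first, then one tool message per call
        messages.append(message.model_dump())
        for call in message.tool_calls:
            result = execute_tool(call.function.name, call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result,
            })

    return "Error: Loop limit reached"
```

Passing the client and dispatcher as parameters is a design choice for testability; a script can just as well use module-level globals as the earlier snippets do.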

If you want to persist state across sessions, serialize the messages array to a database. When the user returns, load it and continue the loop.
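As one possible shape for that persistence, the sketch below stores each session's history as a JSON document in SQLite. The table name, column names, and function names are all illustrative, and it assumes every entry in messages is a plain dict (for example, assistant messages already converted with model_dump()).

```python
import json
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a tiny session store. Pass a file path to persist to disk."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions ("
        "session_id TEXT PRIMARY KEY, messages TEXT NOT NULL)"
    )
    return conn

def save_messages(conn, session_id, messages):
    # Store the whole history as one JSON document, replacing any earlier copy
    conn.execute(
        "INSERT OR REPLACE INTO sessions (session_id, messages) VALUES (?, ?)",
        (session_id, json.dumps(messages)),
    )
    conn.commit()

def load_messages(conn, session_id):
    # Unknown sessions start with an empty history
    row = conn.execute(
        "SELECT messages FROM sessions WHERE session_id = ?", (session_id,)
    ).fetchone()
    return json.loads(row[0]) if row else []
```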

Observability Hooks

You want to log:

Which tool the model chose
The arguments it passed
The result returned
How many iterations the loop took

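import logging
import time

logger = logging.getLogger(__name__)

def execute_tool_with_logging(name, arguments_json, call_id):
    # Log the model's choice and its raw arguments before executing
    logger.info(f"Tool call: {name}", extra={
        "call_id": call_id,
        "arguments": arguments_json
    })

    start = time.perf_counter()  # monotonic clock, suited to measuring durations
    result = execute_tool(name, arguments_json)
    duration = time.perf_counter() - start

    # Log size and latency rather than the result body itself
    logger.info(f"Tool result: {name}", extra={
        "call_id": call_id,
        "duration_ms": duration * 1000,
        "result_length": len(result)
    })

    return result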

This gives you a trace of every decision the model made. You can replay failed loops by re-running the same messages array. If your tools return sensitive data (PII, credentials), truncate or mask the logged result before writing it to disk.
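If you do decide to log result bodies, a small redaction step can sit between the tool and the logger. This is a minimal sketch: the email pattern and the 200-character cap are illustrative assumptions, and real PII detection needs patterns matched to your own data.

```python
import re

# Deliberately simple email matcher, for illustration only
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def redact_for_log(result, max_chars=200):
    """Mask email addresses and cap length before a tool result is logged."""
    masked = EMAIL_RE.sub("[EMAIL]", result)
    if len(masked) > max_chars:
        masked = masked[:max_chars] + "...[truncated]"
    return masked
```

Only the logged copy is redacted; the model still receives the full result in the messages array.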

When to Use This vs. a Framework

Scenario	Raw Loop	Framework
Two to five tools, simple logic	Use raw loop	Overkill
Need human approval per tool	Use raw loop (insert approval step)	Framework may abstract it away
Multi-agent orchestration	Use raw loop (you control routing)	Framework helps if it matches your pattern
Complex state machines	Raw loop only if no framework fits	Use a framework (if it fits)
Debugging tool selection	Raw loop (full visibility)	Framework adds indirection
Production at scale	Raw loop (if you need control)	Framework (if you trust it)

Frameworks like LangGraph and AutoGen add state persistence, retry logic, and visual debugging. They also add abstraction layers that hide what the model actually returned.

Start with the raw loop. Add a framework when you need features you don’t want to build yourself.

Failure Modes

Model refuses to call a tool: The description was too vague or the user question didn’t match. Rewrite the description to be more explicit.

Model calls the wrong tool: The descriptions overlap. Make them mutually exclusive or add a routing tool that picks the right one.

Infinite loop: The model keeps calling the same tool. The MAX_ITERATIONS guardrail stops runaway loops globally. For finer control, add a check that stops if the same tool is called twice in a row with identical arguments.
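The repeated-call check can be sketched as a comparison of normalized call keys. The helper names here are hypothetical; arguments are canonicalized through JSON so that key order and whitespace differences do not hide a repeat.

```python
import json

def call_key(name, arguments_json):
    """Normalize a tool call to a comparable key: name plus canonical JSON args."""
    try:
        canonical = json.dumps(json.loads(arguments_json or "{}"), sort_keys=True)
    except json.JSONDecodeError:
        canonical = arguments_json  # compare malformed arguments verbatim
    return (name, canonical)

def is_repeat(previous_key, name, arguments_json):
    """True when this call is identical to the previous one."""
    return previous_key is not None and previous_key == call_key(name, arguments_json)
```

Inside the loop you would keep the key of the last executed call and stop, or return an error string to the model, when is_repeat returns True.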

Tool returns too much data: The result exceeds the context window. Truncate the result or summarize it before appending.
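A simple version of the truncation approach is sketched below. Characters are used as a rough stand-in for tokens, and the 4000-character budget is an illustrative assumption rather than a limit taken from NIM.

```python
def truncate_result(result, max_chars=4000):
    """Cap a tool result and tell the model that it was cut."""
    if len(result) <= max_chars:
        return result
    omitted = len(result) - max_chars
    return f"{result[:max_chars]}\n[truncated: {omitted} more characters omitted]"
```

Stating the omission explicitly lets the model mention that its answer is based on partial data, or issue a narrower query.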

Parallel tool calls fail: One tool succeeds, one fails. The model sees partial results. Either retry the failed call or let the model handle the partial data.

Technical Verdict

Use this pattern if:

You control all tool implementations. The 60-line loop assumes synchronous execution. If your tools are third-party APIs with unpredictable latency (>500ms), you need async handling or a framework that manages it.
Tool execution is fast (<500ms per call). The loop blocks on each tool. Slow tools will time out user requests. Frameworks like LangGraph handle async tool execution and timeouts.
You need to inspect every model decision. The raw loop gives you full visibility into tool selection, arguments, and results. Frameworks add indirection that makes debugging harder.
Your team understands state machines. The loop is simple but requires reasoning about message history, termination conditions, and error propagation. If your team is unfamiliar with these patterns, a framework provides guardrails.
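The latency constraint in the first two points can be softened without a framework by bounding how long the loop waits on each tool. The sketch below uses a thread pool from the standard library; call_with_timeout, the pool size, and the 0.5-second default are assumptions for illustration. It limits the caller's wait only, since a Python thread that overruns cannot be killed and keeps running in the background.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared pool; an overrunning worker keeps its thread until the tool returns
_POOL = ThreadPoolExecutor(max_workers=4)

def call_with_timeout(fn, args, timeout_s=0.5):
    """Run fn(args) in a worker thread and stop waiting after timeout_s seconds."""
    future = _POOL.submit(fn, args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return f"Error: tool timed out after {timeout_s}s"
    except Exception as e:
        return f"Error: {e}"
```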

Avoid this pattern if:

Tools require human approval between calls and you want that workflow prebuilt. You can insert an approval step into the raw loop yourself, but frameworks like AutoGen and LangChain have built-in human-in-the-loop primitives.
You need automatic retries across multiple servers. The raw loop runs in a single process. Distributed retry logic requires state persistence and orchestration that frameworks provide.
Tool latency is unpredictable. The synchronous loop will block. Frameworks handle async execution, timeouts, and fallback strategies.
You need visual debugging. LangGraph and similar tools provide DAG visualizations of agent execution. The raw loop requires manual log inspection.
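For comparison, the hand-rolled approval step mentioned in the first point is small. This is a sketch under assumptions: execute_with_approval and cli_approve are hypothetical names, and approve can be any callable that returns True or False (a terminal prompt, a web callback, a policy check).

```python
def execute_with_approval(name, arguments_json, execute_tool, approve):
    """Ask approve(name, arguments_json) before running a tool.

    Both `execute_tool` and `approve` are supplied by the caller.
    """
    if not approve(name, arguments_json):
        # Report the denial as a tool result so the model can respond to it
        return f"Error: call to {name} was not approved"
    return execute_tool(name, arguments_json)

def cli_approve(name, arguments_json):
    """Terminal prompt; anything other than 'y' counts as a denial."""
    answer = input(f"Allow {name}({arguments_json})? [y/N] ")
    return answer.strip().lower() == "y"
```

What this does not give you is what the frameworks add on top: pausing a run, persisting it, and resuming after an approval that arrives minutes or hours later.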

The loop is simple. The model returns a structured request. You execute it. You append the result. You call the model again. That’s the entire pattern. Build it once so you know what frameworks are doing under the hood.

Source Links

Primary article: From Chatbot to Agent with NVIDIA NIM