
Cloud vs. Device Agents: What Hybrid Multi-Agent Systems Reveal About Cost, Latency, and Orchestration Trade-offs

Routing tasks between cloud LLMs and on-device SLMs requires careful orchestration. Here's what breaks at the boundary and when each approach wins.

Source: arxiv.org

Every production agent builder faces the same question: when do you pay for a cloud frontier model, and when do you run a cheaper small language model on the device? A new arXiv paper (2605.30102v1) from Rainone et al. examines hybrid multi-agent systems that route tasks between cloud LLMs and on-device SLMs. The findings are specific: more compute does not always win, and the optimal architecture depends heavily on task structure.

The paper adapts two representative multi-agent architectures to support hybrid inference and measures how design choices shift the Pareto frontier of power, cost, and performance. The results matter because they expose the orchestration boundary where most hybrid systems break.

The Orchestration Boundary Problem

Hybrid systems introduce a state handoff problem. When you route a task from a local SLM to a cloud LLM (or vice versa), you need to serialize agent context: conversation history, tool call results, intermediate reasoning steps, and any domain-specific state.

The naive approach duplicates orchestration logic on both sides. You end up with two separate agent loops, one on-device and one in the cloud, and a brittle handoff layer that tries to keep them synchronized. When the handoff fails, you lose context or repeat expensive work.

The paper’s approach uses a unified orchestration layer that treats cloud and device models as interchangeable backends. The orchestrator maintains a single state machine and routes individual inference calls based on task characteristics and current resource constraints.

Routing Strategies and When They Work

The paper evaluates three routing strategies:

Static routing assigns tasks to cloud or device models based on predefined rules. Simple tasks (fact lookup, basic formatting) go to the SLM. Complex reasoning or multi-step planning goes to the LLM. This works when task complexity is predictable, but it fails when tasks have variable difficulty or when the SLM can handle more than you expect.
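A rule-based policy of this kind can be sketched in a few lines. The task-type labels and the rule table below are illustrative assumptions, not values from the paper; the point is only that the routing decision is a fixed lookup made before any model runs.

```python
from enum import Enum


class Backend(Enum):
    DEVICE = "device"
    CLOUD = "cloud"


# Hypothetical task-type labels; this rule table is an assumption for illustration.
STATIC_RULES = {
    "fact_lookup": Backend.DEVICE,
    "formatting": Backend.DEVICE,
    "complex_reasoning": Backend.CLOUD,
    "multi_step_planning": Backend.CLOUD,
}


def route_static(task_type: str, default: Backend = Backend.CLOUD) -> Backend:
    """Look up the backend for a task type; unknown types fall back to `default`."""
    return STATIC_RULES.get(task_type, default)
```

Defaulting unknown task types to the cloud favors accuracy over cost. Flipping the default to the device favors cost, at the risk of wrong answers on tasks the rules never anticipated.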

Dynamic routing uses a lightweight classifier to predict task difficulty and route accordingly. The classifier runs on-device and looks at input length, vocabulary complexity, and whether the task requires external tool calls. This adds latency (the classifier itself takes time) but reduces unnecessary cloud calls by 30-40% in the paper’s experiments.
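As a rough sketch of what such a classifier could look like, the example below scores difficulty from the three signals mentioned above. The feature definitions, weights, and threshold are placeholders chosen for illustration; a real classifier would be fitted on labeled data, and the paper's actual model is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class TaskFeatures:
    num_tokens: int
    rare_word_ratio: float  # fraction of tokens outside a common-word list
    needs_tools: bool


def extract_features(text: str, common_words: set[str], needs_tools: bool) -> TaskFeatures:
    """Whitespace tokenization is a simplification; swap in a real tokenizer as needed."""
    tokens = text.lower().split()
    rare = sum(1 for token in tokens if token not in common_words)
    ratio = rare / len(tokens) if tokens else 0.0
    return TaskFeatures(len(tokens), ratio, needs_tools)


def difficulty_score(features: TaskFeatures) -> float:
    """Hand-weighted score in [0, 1]; the weights are placeholders, not fitted values."""
    length_term = min(features.num_tokens / 200, 1.0)
    return (
        0.4 * length_term
        + 0.4 * features.rare_word_ratio
        + 0.2 * float(features.needs_tools)
    )


def route_dynamic(features: TaskFeatures, threshold: float = 0.5) -> str:
    """Send tasks at or above the difficulty threshold to the cloud."""
    return "cloud" if difficulty_score(features) >= threshold else "device"
```

Raising `threshold` biases the router toward local execution, which is the knob to turn for latency-sensitive workloads.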

Cascading execution starts every task on the SLM and escalates to the cloud LLM only when confidence falls below a threshold. The SLM returns both an answer and a confidence score. If confidence is high, you’re done. If not, you send the original input plus the SLM’s attempt to the cloud model. This minimizes cloud costs, but every escalated task pays for the local inference and then the cloud round-trip on top of it.

Cost and Latency Trade-offs

The paper measures three dimensions: task accuracy, monetary cost (cloud API calls), and edge energy consumption (on-device inference). The three are tightly coupled, and how they trade off depends on task type.

| Routing Strategy | Cloud Cost Reduction | Latency Penalty | Accuracy Loss | Best For |
| --- | --- | --- | --- | --- |
| Static (rule-based) | 40-60% | None | 5-15% | Predictable task distribution (fact lookup, data extraction) |
| Dynamic (classifier) | 30-50% | +50-100ms | 2-8% | Variable task complexity (mixed query types) |
| Cascading | 50-70% | +200-500ms (on escalation) | 0-3% | Cost-sensitive, latency-tolerant (batch processing) |

The latency penalty on cascading execution is the killer. When you escalate, you pay for both the local inference and the cloud round-trip. Network latency dominates: even a weak local model that returns in 50ms beats a strong cloud model that takes 300ms for the round-trip, assuming the local model is accurate enough.
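The arithmetic behind this can be made explicit. The sketch below assumes a simple model in which every task pays the SLM latency and escalated tasks additionally pay the cloud round-trip; the function names are illustrative, and the model ignores queuing and serialization overhead.

```python
def expected_cascade_latency_ms(slm_ms: float, cloud_ms: float, escalation_rate: float) -> float:
    """Mean latency when every task runs on the SLM and a fraction also goes to the cloud."""
    return slm_ms + escalation_rate * cloud_ms


def escalated_latency_ms(slm_ms: float, cloud_ms: float) -> float:
    """Latency of a single escalated task: local inference plus cloud round-trip."""
    return slm_ms + cloud_ms


def break_even_escalation_rate(slm_ms: float, cloud_ms: float) -> float:
    """Escalation rate below which cascading beats cloud-only on mean latency."""
    return 1.0 - slm_ms / cloud_ms
```

With the 50ms and 300ms figures above and an assumed 20% escalation rate, mean latency comes to 110ms, while each escalated task takes 350ms. On mean latency alone, cascading beats sending everything to the cloud as long as fewer than roughly 83% of tasks escalate. The tail is a different story, since escalated tasks are always slower than a direct cloud call.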

The paper finds that SLMs can effectively benefit from LLM assistance, but the optimal architecture is highly task-dependent. For structured data extraction, static routing works fine. For open-ended reasoning, cascading execution preserves accuracy while cutting costs. For real-time interactions, you need dynamic routing with aggressive local-first bias.

State Management Across the Boundary

The hardest part is serializing agent state for handoff. The paper’s hybrid architectures maintain a shared context object that includes:

  • Conversation history (user messages and assistant responses)
  • Tool call log (which tools were invoked, with what arguments, and what they returned)
  • Intermediate reasoning steps (chain-of-thought fragments from the SLM)
  • Confidence scores and routing metadata

When escalating from SLM to LLM, the orchestrator packages this context into a structured prompt. The LLM sees the original user input, the SLM’s attempt, and the full tool call history. This prevents the LLM from repeating work the SLM already did.

The serialization format matters. JSON works for structured data but bloats token counts. The paper uses a compact text format that interleaves tool calls with reasoning steps. This format keeps token counts low compared to JSON while remaining human-readable for debugging:

User: What's the weather in Paris and the current EUR/USD rate?

SLM attempt (confidence: 0.6):
Thought: Need to call weather API and forex API
Tool: weather_api(location="Paris")
Result: 18°C, partly cloudy
Tool: forex_api(pair="EUR/USD")
Result: API timeout

LLM continuation:
Thought: SLM got weather but forex failed. Retry forex with fallback endpoint.
Tool: forex_api_backup(pair="EUR/USD")
Result: 1.08
Answer: Paris is 18°C and partly cloudy. EUR/USD is 1.08.
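A renderer for the SLM portion of a transcript like the one above might look as follows. The `ToolCall` and `Attempt` types and the `render_handoff` function are hypothetical names introduced for this sketch; the paper's exact format specification is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict
    result: str


@dataclass
class Attempt:
    source: str  # which model produced the attempt, e.g. "SLM"
    confidence: float
    thought: str
    tool_calls: list[ToolCall] = field(default_factory=list)


def render_handoff(user_input: str, attempt: Attempt) -> str:
    """Render the user input and one prior attempt as compact, line-oriented text."""
    lines = [
        f"User: {user_input}",
        "",
        f"{attempt.source} attempt (confidence: {attempt.confidence}):",
        f"Thought: {attempt.thought}",
    ]
    for call in attempt.tool_calls:
        # Arguments are rendered as name="value" pairs, interleaved with results
        args = ", ".join(f'{key}="{value}"' for key, value in call.args.items())
        lines.append(f"Tool: {call.name}({args})")
        lines.append(f"Result: {call.result}")
    return "\n".join(lines)
```

Keeping one record per line makes it straightforward to diff the serialized handoff against the orchestrator's internal state when debugging.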

Observability and Failure Modes

Hybrid systems fail in predictable ways:

Context loss during handoff. If the serialization layer drops tool call results or reasoning steps, the cloud LLM hallucinates or repeats work. The fix is to log every handoff and validate that the serialized context matches the orchestrator’s internal state.

Confidence miscalibration. SLMs often overestimate their confidence, leading to incorrect answers that never escalate. The paper uses a separate calibration dataset to tune confidence thresholds per task type.
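One simple way to tune such a threshold is sketched below. It is an assumed procedure rather than the paper's: given held-out pairs of SLM confidence and correctness for one task type, it picks the lowest threshold at which the answers that would be accepted locally meet a target accuracy.

```python
def pick_threshold(samples: list[tuple[float, bool]], target_accuracy: float) -> float:
    """Lowest confidence threshold whose accepted answers meet target_accuracy.

    samples: (slm_confidence, was_correct) pairs from a held-out calibration set.
    Returns 1.01 (escalate everything) when no threshold meets the target.
    """
    candidates = sorted({confidence for confidence, _ in samples})
    for threshold in candidates:
        # Answers at or above the threshold are kept locally, never escalated
        accepted = [correct for confidence, correct in samples if confidence >= threshold]
        if sum(accepted) / len(accepted) >= target_accuracy:
            return threshold
    return 1.01
```

Running this once per task type yields a threshold table rather than a single global constant. Accuracy among accepted answers is not guaranteed to rise monotonically with the threshold, so it is worth plotting the full curve before trusting the lowest qualifying value.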

Network partitions. When the device loses connectivity, cloud-routed tasks fail. The orchestrator needs a fallback path: either queue the task for retry or downgrade to the SLM with a warning to the user.
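The downgrade path can be as small as a wrapper around the two inference calls. This sketch assumes connectivity failures surface as Python's built-in `ConnectionError` or `TimeoutError`; real HTTP clients usually raise their own exception types, which would need to be caught instead.

```python
def infer_with_fallback(cloud_infer, device_infer, prompt: str):
    """Try the cloud backend; on a connectivity error, answer locally and flag it.

    Returns (answer, degraded). degraded=True means the SLM answered a task
    that was routed to the cloud, so the caller can warn the user.
    """
    try:
        return cloud_infer(prompt), False
    except (ConnectionError, TimeoutError):
        return device_infer(prompt), True
```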

Cost runaway. Without rate limiting, a bug in the routing logic can send every task to the cloud. The paper recommends per-user and per-task-type quotas, plus circuit breakers that force local execution when cloud costs exceed a threshold.
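A circuit breaker of this kind can be approximated with a sliding window over recent cloud spend. The class below is a minimal sketch under that assumption; the name `CostCircuitBreaker`, the window semantics, and the injectable clock are choices made for this example, not details from the paper.

```python
import time
from collections import deque


class CostCircuitBreaker:
    """Blocks cloud calls while spend inside a sliding time window meets the budget."""

    def __init__(self, budget_usd: float, window_s: float, clock=time.monotonic):
        self.budget_usd = budget_usd
        self.window_s = window_s
        self.clock = clock  # injectable so tests can control time
        self._spend = deque()  # (timestamp, cost_usd), oldest first

    def record(self, cost_usd: float) -> None:
        """Record the cost of a completed cloud call."""
        self._spend.append((self.clock(), cost_usd))

    def _evict_expired(self) -> None:
        cutoff = self.clock() - self.window_s
        while self._spend and self._spend[0][0] <= cutoff:
            self._spend.popleft()

    def allow_cloud(self) -> bool:
        """True while spend inside the window is still under budget."""
        self._evict_expired()
        return sum(cost for _, cost in self._spend) < self.budget_usd
```

The orchestrator would check `allow_cloud()` before routing to the cloud backend and force local execution when it returns False. Keeping one breaker per user or per task type gives the quota behavior described above.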

Observability requires tracking routing decisions, handoff success rates, and cost per task type. The paper’s architectures emit structured logs for every routing decision:

{
  "task_id": "abc123",
  "input_tokens": 45,
  "routing_decision": "escalate_to_cloud",
  "slm_confidence": 0.58,
  "slm_latency_ms": 62,
  "llm_latency_ms": 340,
  "total_cost_usd": 0.0012,
  "accuracy": 1.0
}

These logs feed into dashboards that show cost per task type, escalation rate, and latency distributions. When escalation rate spikes, you know the SLM is struggling. When cost spikes without accuracy gains, you know the routing logic is broken.
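As a sketch of the aggregation step behind such a dashboard, the function below computes escalation rate and mean cost per task type from JSON log lines. It assumes each record carries a `task_type` field, which the sample log above does not show, and groups records without one under "unknown".

```python
import json
from collections import defaultdict


def summarize(log_lines: list[str]) -> dict:
    """Per-task-type escalation rate and mean cost from JSON routing logs."""
    totals = defaultdict(lambda: {"tasks": 0, "escalated": 0, "cost_usd": 0.0})
    for line in log_lines:
        record = json.loads(line)
        # task_type is an assumed field; records without it share one bucket
        bucket = totals[record.get("task_type", "unknown")]
        bucket["tasks"] += 1
        bucket["escalated"] += int(record["routing_decision"] == "escalate_to_cloud")
        bucket["cost_usd"] += record["total_cost_usd"]
    return {
        task_type: {
            "escalation_rate": bucket["escalated"] / bucket["tasks"],
            "mean_cost_usd": bucket["cost_usd"] / bucket["tasks"],
        }
        for task_type, bucket in totals.items()
    }
```

Alerting on a sudden change in `escalation_rate` for a single task type is usually more informative than watching the global rate, because a struggling SLM on one task can hide inside a healthy average.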

Architecture: Unified Orchestrator with Pluggable Backends

The paper’s hybrid architectures use a single orchestrator that treats cloud and device models as pluggable inference backends. The orchestrator maintains the agent loop: parse user input, decide which tools to call, invoke tools, route inference to the appropriate backend, and return results.

The key abstraction is an InferenceBackend interface:

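from dataclasses import dataclass, field

# Minimal placeholder types so the sketch runs; the fields are illustrative.
@dataclass
class InferenceResult:
    text: str
    confidence: float

@dataclass
class AgentContext:
    attempts: list = field(default_factory=list)  # earlier results, oldest first

    def add_attempt(self, result: InferenceResult) -> None:
        self.attempts.append(result)

class InferenceBackend:
    def infer(self, prompt: str, context: AgentContext) -> InferenceResult:
        raise NotImplementedError

    def estimate_cost(self, prompt: str) -> float:
        raise NotImplementedError

    def estimate_latency(self, prompt: str) -> float:
        raise NotImplementedError

class CloudLLMBackend(InferenceBackend):
    def __init__(self, api_client, serialize_context):
        self.api_client = api_client  # injected client exposing complete()
        self.serialize_context = serialize_context  # AgentContext -> str

    def infer(self, prompt, context):
        # Serialize context, call cloud API, parse response
        serialized = self.serialize_context(context)
        response = self.api_client.complete(prompt, serialized)
        return InferenceResult(response.text, response.confidence)

class DeviceSLMBackend(InferenceBackend):
    def __init__(self, local_model, calibrator):
        self.local_model = local_model  # injected model exposing generate()
        self.calibrator = calibrator  # maps a raw result to a confidence score

    def infer(self, prompt, context):
        # Run local model, return result with confidence
        result = self.local_model.generate(prompt)
        confidence = self.calibrator.score(result)
        return InferenceResult(result.text, confidence)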

The orchestrator uses a routing policy to select the backend:

ESCALATION_THRESHOLD = 0.7  # illustrative value; tune per task type

class HybridOrchestrator:
    def __init__(self, cloud_backend, device_backend, router):
        self.cloud = cloud_backend
        self.device = device_backend
        self.router = router

    def execute_task(self, user_input):
        context = AgentContext()
        backend = self.router.select_backend(user_input, context)

        result = backend.infer(user_input, context)

        # Only device results can escalate; compare backends by identity
        if backend is self.device and result.confidence < ESCALATION_THRESHOLD:
            # Escalate to cloud, passing along the SLM's attempt
            context.add_attempt(result)
            result = self.cloud.infer(user_input, context)

        return result

This design keeps orchestration logic in one place and makes it easy to swap routing policies or add new backends (e.g., a mid-tier model hosted on a local server).

When Greater Compute Does Not Help

The paper’s most surprising finding: greater frontier-level compute does not consistently translate to better performance in hybrid systems. In tool-calling tasks on domain-specific schemas, a well-tuned SLM with access to the right tools outperforms a frontier LLM that lacks domain-specific context.

The reason is tool calling. SLMs can be fine-tuned on domain-specific tool schemas and learn to invoke tools reliably. Frontier LLMs are more general but often hallucinate tool arguments or fail to recognize when a tool is needed. When the SLM knows exactly which tool to call and the LLM has to reason about it, the SLM wins on both latency and accuracy.

This suggests a different hybrid strategy: use the SLM for tool orchestration and the LLM only for complex reasoning within tool calls. The SLM decides which tools to invoke and in what order. When a tool requires natural language understanding (e.g., parsing a user’s ambiguous request), the SLM delegates that subtask to the LLM and continues orchestration.

Technical Verdict

Use hybrid multi-agent systems when:

  • You have a mix of simple and complex tasks with predictable patterns
  • Cloud API costs are a significant fraction of your operating budget
  • You can tolerate 50-100ms of additional per-inference latency for routing decisions (not including network round-trip time for cloud calls)
  • You have the infrastructure to log and monitor routing decisions
  • Your SLM achieves at least 70% accuracy on representative tasks (lower accuracy means excessive escalation)

Avoid hybrid systems when:

  • Task complexity is uniformly high (just use the cloud LLM)
  • End-to-end latency requirements are under 200ms including network round-trip time (cloud calls will exceed this budget)
  • You lack the engineering capacity to build and maintain a unified orchestrator
  • Your task distribution is unpredictable and changes frequently (routing policies will require constant retuning)
  • Network connectivity is unreliable (handoffs will fail frequently)

The orchestration boundary is where hybrid systems break. If you cannot serialize agent context reliably and route tasks based on measured confidence, you will end up with duplicated logic and brittle handoffs. The paper’s unified orchestrator pattern solves this, but it requires careful state management and observability.

For teams building production agents, the takeaway is clear: measure routing decisions, log handoffs, and tune confidence thresholds per task type. The optimal architecture is not static. It shifts as your SLM improves, as cloud costs change, and as your task distribution evolves.
