Multi-Tenant Agent Architecture: How Amazon Bedrock AgentCore Isolates State, Secrets, and Compute Across SaaS Customers

Deep dive into namespace isolation, credential scoping, per-tenant rate limits, and state partitioning for running customer agents on shared infrastruct...

Source: aws.amazon.com
Amazon published official guidance on May 21, 2026 for building multi-tenant agentic applications using Bedrock AgentCore. That makes AWS the first major cloud provider to document production patterns for the hardest unsolved problem in agent deployment: running hundreds of customer agents on shared infrastructure without state leakage, credential cross-contamination, or noisy-neighbor resource exhaustion.

The challenge is not running one agent. The challenge is running 500 customer agents on the same compute pool where tenant A’s conversation history never appears in tenant B’s context window, where one customer’s runaway tool loop doesn’t starve everyone else’s quota, and where credential scoping prevents cross-tenant API access.

The Three Isolation Patterns

AWS defines three tenant isolation models for agent runtimes, each with different trade-offs for cost, blast radius, and operational complexity.

Silo: Each tenant gets dedicated infrastructure. One agent runtime per customer, one database instance per customer, one set of credentials per customer. Maximum isolation, maximum cost. You pay for idle capacity when a tenant’s agent isn’t running.

Pool: All tenants share the same infrastructure. One agent runtime handles requests from all customers, multiplexed by tenant ID. Minimum cost, maximum noisy-neighbor risk. One tenant’s infinite loop can degrade service for everyone.

Bridge: Hybrid model. Premium tenants get dedicated runtimes (silo), standard tenants share a pool. This is the practical choice for SaaS products with tiered pricing.

Pattern | Isolation Strength | Cost Efficiency | Blast Radius | Operational Overhead
Silo | High | Low | Single tenant | High (N × infrastructure)
Pool | Low | High | All tenants | Low (shared ops)
Bridge | Medium | Medium | Tier-specific | Medium (mixed ops)

State Partitioning: Conversation History and Memory

Agent state includes conversation history, tool outputs, intermediate reasoning steps, and long-term memory. In a pooled architecture, all of this must be namespaced by tenant ID at the storage layer.

AgentCore uses DynamoDB for state persistence with a composite partition key: tenant_id#session_id. Every read and write includes the tenant ID in the query predicate. This prevents cross-tenant data leakage at the database level, but it requires disciplined key design.

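The failure mode: if your application code constructs a query without the tenant ID filter, you’ve just exposed all customer data. DynamoDB has no built-in row-level security feature. The closest equivalent is IAM fine-grained access control: a policy condition on dynamodb:LeadingKeys restricts a role to items whose partition key matches the tenant’s prefix. This enforces tenant scoping outside application code, but it requires tenant-scoped credentials for each request, which adds latency and complexity.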
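
The composite key and the mandatory tenant predicate can be sketched as a thin wrapper that fixes the tenant at construction time, so no call site can build a key without it. This is a minimal sketch under stated assumptions, not AgentCore's implementation: the `TenantScopedStore` class is hypothetical, and a plain dict stands in for the DynamoDB table.

```python
from dataclasses import dataclass


def partition_key(tenant_id: str, session_id: str) -> str:
    """Build the composite key tenant_id#session_id, rejecting ambiguous input."""
    for name, value in (("tenant_id", tenant_id), ("session_id", session_id)):
        if not value or "#" in value:
            raise ValueError(f"{name} must be non-empty and must not contain '#'")
    return f"{tenant_id}#{session_id}"


@dataclass
class TenantScopedStore:
    """Hypothetical wrapper: every read and write goes through partition_key,
    so the tenant ID is always part of the lookup."""
    tenant_id: str
    table: dict  # stand-in for the shared (pooled) table

    def put_turn(self, session_id: str, turn: str) -> None:
        key = partition_key(self.tenant_id, session_id)
        self.table.setdefault(key, []).append(turn)

    def history(self, session_id: str) -> list:
        key = partition_key(self.tenant_id, session_id)
        return list(self.table.get(key, []))


shared_table = {}
store_a = TenantScopedStore("tenant_a", shared_table)
store_b = TenantScopedStore("tenant_b", shared_table)
store_a.put_turn("s1", "hello")
```

Both stores share one table, yet `store_b.history("s1")` returns an empty list because its key is `tenant_b#s1`. Rejecting `#` inside either ID keeps one tenant from crafting an ID that collides with another tenant's key.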

For long-term memory (facts the agent should remember across sessions), AgentCore integrates with Amazon MemoryDB with tenant-scoped namespaces. Each tenant’s memory store is logically isolated, but physically pooled for cost efficiency.

Credential Scoping: Per-Tenant API Keys and Service Accounts

Agents call external APIs: CRM systems, payment processors, internal microservices. Each tenant brings their own API keys. The question is where you store them and how you scope access.

AgentCore uses AWS Secrets Manager with a naming convention: /{tenant_id}/api_keys/{service_name}. When an agent needs to call Stripe on behalf of tenant X, it fetches /tenant_x/api_keys/stripe. The IAM policy attached to the agent runtime allows reads only for secrets matching the current tenant’s namespace.
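
The naming convention and the matching IAM scope can be sketched as two small helpers. This is an illustration, not AgentCore's actual policy: the allowed ID format, the helper names, and the region and account values passed in are assumptions. Validating each path segment matters because a tenant ID containing `/` or `*` would otherwise widen the namespace.

```python
import re

# Assumed format for tenant IDs and service names (hypothetical constraint).
_SEGMENT = re.compile(r"[a-z0-9_]{1,64}")


def _check(label: str, value: str) -> None:
    if not _SEGMENT.fullmatch(value):
        raise ValueError(f"invalid {label}: {value!r}")


def secret_name(tenant_id: str, service_name: str) -> str:
    """Return /{tenant_id}/api_keys/{service_name} after validating both parts."""
    _check("tenant_id", tenant_id)
    _check("service_name", service_name)
    return f"/{tenant_id}/api_keys/{service_name}"


def tenant_secret_policy(tenant_id: str, region: str, account_id: str) -> dict:
    """Build an IAM policy document allowing reads only in one tenant's namespace."""
    _check("tenant_id", tenant_id)
    resource = (
        f"arn:aws:secretsmanager:{region}:{account_id}:"
        f"secret:/{tenant_id}/api_keys/*"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "secretsmanager:GetSecretValue",
                "Resource": resource,
            }
        ],
    }


policy = tenant_secret_policy("tenant_x", "us-east-1", "123456789012")
```

The trailing `*` in the resource ARN also covers the random suffix that Secrets Manager appends to secret ARNs.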

This works until you need cross-tenant service accounts. Example: your SaaS product has a shared Slack integration. All tenants post to the same Slack workspace, but each tenant should only see their own messages. You need a shared credential with tenant-scoped logic in the tool implementation.

The pattern: store shared credentials in a global namespace (/shared/api_keys/slack), but pass the tenant ID as metadata in every API call. The tool implementation must enforce tenant filtering in the query parameters or request body.

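# Tool implementation with tenant scoping (sketch using boto3 and slack_sdk)
import boto3
from slack_sdk import WebClient

secrets = boto3.client("secretsmanager")

def post_to_slack(tenant_id: str, message: str) -> None:
    if not tenant_id:
        raise ValueError("tenant_id is required")  # Fail closed, never post unscoped
    
    # Shared credential from the global namespace
    slack_token = secrets.get_secret_value(SecretId="/shared/api_keys/slack")["SecretString"]
    client = WebClient(token=slack_token)
    
    # Tenant-scoped channel lookup (get_tenant_channel is an application-defined helper)
    channel_id = get_tenant_channel(tenant_id)
    
    client.chat_postMessage(
        channel=channel_id,
        text=message,
        # Slack message metadata requires an event_type and an event_payload
        metadata={"event_type": "agent_post", "event_payload": {"tenant_id": tenant_id}},  # For audit trail
    )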

The failure mode: forgetting to pass tenant_id to the tool. One customer’s agent posts to another customer’s Slack channel.
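
One way to make a forgotten tenant ID fail loudly is to carry it in a request-scoped context instead of a function argument. The sketch below uses Python's standard `contextvars` module. The names `tenant_scope`, `require_tenant`, and `post_to_slack_scoped` are hypothetical, and the tool body is a stub that returns what it would send.

```python
import contextvars
from contextlib import contextmanager

# No default value: reading the variable outside a scope raises LookupError.
_current_tenant = contextvars.ContextVar("current_tenant")


@contextmanager
def tenant_scope(tenant_id: str):
    """Bind a tenant ID for the duration of one agent invocation."""
    if not tenant_id:
        raise ValueError("tenant_id is required")
    token = _current_tenant.set(tenant_id)
    try:
        yield
    finally:
        _current_tenant.reset(token)


def require_tenant() -> str:
    """Return the bound tenant ID, or fail closed if none is bound."""
    try:
        return _current_tenant.get()
    except LookupError:
        raise RuntimeError("tool called outside a tenant scope") from None


def post_to_slack_scoped(message: str) -> dict:
    # Stub tool: a real implementation would resolve the tenant's channel here.
    return {"tenant_id": require_tenant(), "text": message}


with tenant_scope("tenant_a"):
    sent = post_to_slack_scoped("deploy finished")
```

Calling `post_to_slack_scoped` outside a `tenant_scope` block raises `RuntimeError` instead of posting with a missing or stale tenant.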

Rate Limiting and Cost Controls

In a pooled architecture, one tenant’s agent can exhaust the shared model quota. If tenant A runs 10,000 inference requests in an hour, tenant B’s requests start getting throttled.

AgentCore implements per-tenant rate limits using Amazon API Gateway with usage plans. Each tenant gets a quota (requests per second, requests per day) based on their pricing tier. The gateway enforces limits before requests reach the agent runtime.
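
The per-tenant throttling idea can be illustrated with an in-process token bucket keyed by tenant. This is only a sketch of the concept, not API Gateway configuration: the `TenantRateLimiter` class, the tier names, and the rate and burst numbers are assumptions.

```python
import time


class TenantRateLimiter:
    """Token bucket per tenant; limits come from the tenant's pricing tier."""

    def __init__(self, tier_limits: dict, clock=time.monotonic):
        self._tier_limits = tier_limits  # tier -> (tokens per second, burst size)
        self._clock = clock
        self._buckets = {}  # tenant_id -> (tokens, last refill timestamp)

    def allow(self, tenant_id: str, tier: str) -> bool:
        rate, burst = self._tier_limits[tier]
        now = self._clock()
        tokens, last = self._buckets.get(tenant_id, (burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(burst, tokens + (now - last) * rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[tenant_id] = (tokens, now)
        return allowed


# Hypothetical tiers: standard gets 1 request/second with a burst of 2.
fake_now = [0.0]
limiter = TenantRateLimiter(
    {"standard": (1.0, 2.0), "premium": (10.0, 20.0)},
    clock=lambda: fake_now[0],
)
results = [limiter.allow("tenant_a", "standard") for _ in range(3)]
```

Because each tenant has its own bucket, `tenant_a` exhausting its burst has no effect on the tokens available to `tenant_b`.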

For cost attribution, every model inference call is tagged with tenant_id. AWS Cost Explorer aggregates spend by tag, so you can bill customers based on actual usage.

The noisy-neighbor problem remains: even with rate limits, a tenant can max out their quota with expensive operations (long context windows, tool loops, multi-step reasoning). AgentCore adds a timeout per agent invocation (default 5 minutes) and a maximum tool call depth (default 10). If an agent exceeds either limit, the runtime kills the session and returns an error.
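
A guard for the two limits might look like the sketch below. It assumes "tool call depth" means the number of tool calls made within one invocation, which is an interpretation and not something the guidance spells out. `InvocationGuard` and `LimitExceeded` are hypothetical names.

```python
import time


class LimitExceeded(RuntimeError):
    """Raised when an invocation exceeds its time or tool-call budget."""


class InvocationGuard:
    def __init__(self, timeout_s: float = 300.0, max_tool_calls: int = 10,
                 clock=time.monotonic):
        self._clock = clock
        self._deadline = clock() + timeout_s  # 5-minute default
        self._max_tool_calls = max_tool_calls
        self.tool_calls = 0

    def before_tool_call(self) -> None:
        """Call once before every tool invocation in the agent loop."""
        if self._clock() >= self._deadline:
            raise LimitExceeded("invocation timed out")
        if self.tool_calls >= self._max_tool_calls:
            raise LimitExceeded("tool call limit reached")
        self.tool_calls += 1


guard = InvocationGuard(max_tool_calls=2)
guard.before_tool_call()
guard.before_tool_call()
try:
    guard.before_tool_call()
    outcome = "allowed"
except LimitExceeded as exc:
    outcome = str(exc)
```

The runtime would catch `LimitExceeded`, end the session, and return an error to the caller.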

Routing Layer: Request Multiplexing

In a pooled architecture, the routing layer decides which agent runtime handles which request. AgentCore uses an Application Load Balancer with tenant-aware routing rules.

Incoming requests include a tenant_id header (extracted from the JWT or API key). The ALB routes requests to the appropriate target group:

  • Premium tenants (silo model): dedicated target group with dedicated agent runtimes
  • Standard tenants (pool model): shared target group with autoscaling agent runtimes

For the pool model, the ALB uses least-outstanding-requests routing. This spreads load across healthy instances and prevents one slow agent invocation from blocking other tenants.
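
The routing decision reduces to two steps: pick a target group from the tenant's tier, then pick the instance with the fewest in-flight requests. The sketch below models that logic in plain Python. It is not ALB configuration, and the target group names and tier labels are assumptions.

```python
def target_group_for(tenant_id: str, tier: str) -> str:
    """Map a tenant to a target group: dedicated for premium, shared otherwise."""
    if tier == "premium":
        return f"silo-{tenant_id}"  # hypothetical naming scheme
    if tier == "standard":
        return "shared-pool"
    raise ValueError(f"unknown tier: {tier!r}")


def pick_instance(outstanding: dict) -> str:
    """Least-outstanding-requests: choose the instance with the fewest
    in-flight requests, breaking ties by instance ID for determinism."""
    if not outstanding:
        raise ValueError("no healthy instances")
    return min(outstanding, key=lambda instance: (outstanding[instance], instance))


group = target_group_for("tenant_a", "premium")
chosen = pick_instance({"i-a": 3, "i-b": 1, "i-c": 2})
```

An instance stuck on one slow invocation keeps a higher outstanding count, so new requests drift toward the other instances.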

The autoscaling policy is tenant-aware. If the pool’s aggregate request rate exceeds 80% of capacity, add more instances. If a single tenant consistently uses more than 20% of pool capacity, flag them for migration to a dedicated silo.
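
The two thresholds can be expressed as a single decision function. This sketch evaluates one snapshot of per-tenant request rates. A real policy would need a sustained window to capture "consistently", which is omitted here. The function name and the units are assumptions.

```python
def scaling_decision(tenant_rps: dict, pool_capacity_rps: float,
                     scale_out_at: float = 0.80, silo_flag_at: float = 0.20):
    """Return (scale_out, flagged_tenants) for one snapshot of pool load."""
    total = sum(tenant_rps.values())
    scale_out = total > scale_out_at * pool_capacity_rps
    flagged = sorted(
        tenant for tenant, rps in tenant_rps.items()
        if rps > silo_flag_at * pool_capacity_rps
    )
    return scale_out, flagged


# Hypothetical snapshot: pool capacity of 100 requests/second.
scale_out, flagged = scaling_decision(
    {"tenant_a": 50.0, "tenant_b": 10.0, "tenant_c": 25.0},
    pool_capacity_rps=100.0,
)
```

In this snapshot the aggregate rate is 85% of capacity, so the pool scales out, and `tenant_a` and `tenant_c` are flagged for a possible move to a silo.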

Observability: Per-Tenant Metrics and Traces

You need separate dashboards for each tenant. Tenant A shouldn’t see tenant B’s error rates or latency percentiles.

AgentCore emits CloudWatch metrics with tenant_id as a dimension. You can filter by tenant in CloudWatch Metrics Insights or build per-tenant dashboards in Grafana.
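
For custom metrics of your own, a datum with a tenant dimension follows the shape that the CloudWatch `put_metric_data` API expects. The sketch below only builds the payload. The metric name, the namespace in the comment, and the `tenant_metric` helper are assumptions.

```python
def tenant_metric(tenant_id: str, name: str, value: float,
                  unit: str = "Milliseconds") -> dict:
    """Build one MetricData entry with tenant_id as a dimension."""
    if not tenant_id:
        raise ValueError("tenant_id is required")
    return {
        "MetricName": name,
        "Dimensions": [{"Name": "tenant_id", "Value": tenant_id}],
        "Value": value,
        "Unit": unit,
    }


datum = tenant_metric("tenant_a", "InvocationLatency", 1840.0)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("cloudwatch").put_metric_data(
#       Namespace="AgentPlatform", MetricData=[datum])
```

Each distinct dimension value creates a separate metric in CloudWatch, so a tenant dimension multiplies the number of metrics by the number of tenants.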

For distributed tracing, AgentCore uses AWS X-Ray with tenant ID in the trace metadata. Every agent invocation, tool call, and model inference gets a trace segment tagged with the tenant. You can query traces by tenant to debug customer-specific issues.

The challenge: trace sampling. If you sample 1% of requests, low-volume tenants might not have enough traces for debugging. AgentCore uses adaptive sampling: 100% sampling for tenants with fewer than 100 requests per day, 10% for medium-volume tenants, 1% for high-volume tenants.
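
The tiered rates can be sketched as a lookup plus a deterministic sampling decision. The 100 requests per day boundary comes from the text above. The boundary between medium and high volume is not stated, so the 10,000 used here is a placeholder assumption, as is hashing the trace ID to decide.

```python
import hashlib


def sample_rate(requests_per_day: int, high_volume_threshold: int = 10_000) -> float:
    """Return the trace sampling rate for a tenant's daily request volume."""
    if requests_per_day < 100:
        return 1.0   # low volume: keep every trace
    if requests_per_day < high_volume_threshold:
        return 0.10  # medium volume
    return 0.01      # high volume


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministic decision: the same trace ID always gets the same answer."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


rates = [sample_rate(n) for n in (50, 100, 5_000, 10_000)]
```

Hashing the trace ID keeps the decision consistent across services that see the same trace, so a sampled trace is not left with missing segments.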

Deployment Shape: Serverless vs. Long-Running

AgentCore runs on AWS Lambda for the pool model and ECS Fargate for the silo model. Lambda makes sense for pooled workloads with spiky traffic. You pay per invocation, and cold starts are acceptable for most agent use cases (conversational latency is already measured in seconds).

For silo deployments, Fargate gives you dedicated compute with predictable performance. Each tenant’s agent runs in its own Fargate task, which does not share a kernel, CPU, or memory with other tasks. You pay for reserved capacity, but you avoid noisy-neighbor issues.

The trade-off: Lambda has a 15-minute execution limit. If your agent runs long workflows (multi-hour data processing, batch API calls), you need Step Functions or Fargate.
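
The deployment choice described in this section can be summarized as a small decision function. The return labels and the `choose_compute` name are hypothetical. Only the 15-minute Lambda limit and the pool/silo split come from the text.

```python
LAMBDA_MAX_SECONDS = 15 * 60  # Lambda's maximum execution time


def choose_compute(isolation_model: str, expected_runtime_s: float) -> str:
    """Pick a compute target from the isolation model and expected runtime."""
    if isolation_model == "silo":
        return "fargate"
    if isolation_model != "pool":
        raise ValueError(f"unknown isolation model: {isolation_model!r}")
    if expected_runtime_s >= LAMBDA_MAX_SECONDS:
        # Too long for a single Lambda invocation.
        return "step-functions-or-fargate"
    return "lambda"


choice = choose_compute("pool", expected_runtime_s=45.0)
```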

Failure Modes and Mitigation

Cross-tenant data leakage: Forgetting to filter by tenant_id in a database query. Mitigation: restrict access by partition key with IAM fine-grained access control in DynamoDB (the dynamodb:LeadingKeys condition key), and use separate tables per tenant for high-sensitivity data.

Credential leakage: Storing tenant API keys in environment variables instead of Secrets Manager. Mitigation: never pass secrets as environment variables, always fetch from Secrets Manager at runtime.

Quota exhaustion: One tenant’s agent runs an infinite tool loop and exhausts the shared model quota. Mitigation: per-tenant rate limits, per-invocation timeouts, maximum tool call depth.

Noisy neighbor: One tenant’s expensive agent invocations slow down the pool. Mitigation: autoscaling based on aggregate load, tenant-aware routing, flag high-usage tenants for migration to silo.

Prompt injection across tenants: Tenant A crafts a prompt that tricks the agent into revealing tenant B’s data. Mitigation: strict input validation, separate model instances per tenant (silo model), output filtering.

Technical Verdict

Use AgentCore’s multi-tenant patterns if you are building a SaaS product with agentic features and you need to support more than 10 customers on shared infrastructure. The Bridge model (premium tenants in silos, standard tenants in pools) is the right default for most products.

Avoid pooled architectures if your customers have strict compliance requirements (HIPAA, PCI-DSS) or if cross-tenant data leakage would be catastrophic. In those cases, go full silo or run separate AWS accounts per tenant.

Skip AgentCore entirely if you are building a single-tenant application or if you need custom orchestration logic that doesn’t fit the framework’s opinionated design. The overhead of learning AgentCore’s abstractions isn’t worth it for simple use cases.

The hard part is not the infrastructure. The hard part is disciplined tenant scoping in every query, every API call, and every tool implementation. One missing tenant_id filter is a data breach.