AgentCore (Bedrock) Pricing Explained and When Self-Hosting Wins

Goal: help cloud teams pick the right cost model for agentic workloads.
What you get: a clear pricing framework for Amazon Bedrock AgentCore, hidden cost drivers to watch, an illustrative monthly bill, and a decision matrix for Bedrock vs self-hosting.
Need a second opinion on your architecture or a quick cost sanity check? Book a Bedrock Cost Architecture Review with Scalevise.
Why AgentCore pricing feels opaque
Agentic systems are a stack, not a single SKU. With Bedrock AgentCore you pay across multiple layers:
- Orchestration: agent planning, tool-use calls, function invocations, and guardrails.
- Model inference: per-token pricing for the foundation model(s) your agent uses.
- Retrieval and knowledge: vector search or Kendra, embedding generation, storage, and sync.
- Observability and safety: logging, metrics, prompt/response traces, content filters.
- Networking and VPC: PrivateLink endpoints, cross-AZ traffic, NAT, egress to other AWS services.
The total is the sum of small meters that scale quickly with traffic. Your goal is to keep expensive paths rare and predictable.
Core pricing dimensions to model
1) Requests per minute and concurrency
- Each account and Region has default service quotas for Bedrock APIs and AgentCore.
- Cost grows non-linearly with throughput once you add retries, tool calls, and retrieval hops.
- Practical tip: design for max requests per minute (RPM) and max concurrent sessions, not just daily totals.
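Sizing for concurrency rather than daily totals can be sketched with Little's law: in-flight sessions equal arrival rate times average turn latency. A minimal sketch (the numbers are illustrative placeholders, not AWS quotas):

```python
import math

def required_concurrency(peak_rpm: float, avg_turn_seconds: float) -> int:
    """Little's law: in-flight sessions = arrival rate x average latency."""
    arrivals_per_second = peak_rpm / 60.0
    return math.ceil(arrivals_per_second * avg_turn_seconds)

# Example: 300 requests/minute with 8 s average agent turns
print(required_concurrency(300, 8.0))  # 5 req/s x 8 s = 40 concurrent sessions
```

Compare that number against your account's default quotas before launch, not after.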
2) Cold starts and warm pools
- Agents may need warm-up after inactivity.
- A warm pool (or periodic pre-warming) improves latency but increases baseline cost.
- Cache the agent plan and the retrieval results when possible to avoid repeated expensive steps.
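Caching retrieval results can be as simple as a TTL map. A minimal sketch, assuming exact-match keys (production systems often key on semantic similarity instead):

```python
import time

class TTLCache:
    """Tiny TTL cache so repeated questions skip the embed -> search hop."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry and entry[0] > time.monotonic():
            return entry[1]
        self._store.pop(key, None)  # expired or missing
        return None

    def put(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)

cache = TTLCache(ttl_seconds=300)
cache.put("refund policy", ["chunk-1", "chunk-7"])
print(cache.get("refund policy"))  # ['chunk-1', 'chunk-7']
```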
3) Memory and session state
- Agent memory can live in several places: Bedrock session state, a vector store, DynamoDB, or S3.
- You pay for storage, reads/writes, and sometimes tokenization calls to keep memory consistent.
- Decide early if memory is ephemeral per run or persistent per user. Persistent memory creates a recurring footprint.
4) Retrieval and embeddings
- Retrieval-augmented generation often doubles the cost path: embeddings → store → search → model call.
- Cap embedding length, batch indexing, and use delta sync for knowledge bases.
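The embed → search → model path can be priced per turn with a small function. Every unit price below is a placeholder to be replaced with current Region pricing; the point is that retrieved context feeds straight into the model-input term:

```python
def rag_turn_cost(query_tokens, context_tokens, output_tokens,
                  searches_per_turn=1.5,
                  embed_price_per_1k=0.0001,   # placeholder prices throughout
                  search_price_each=0.0004,
                  input_price_per_1k=0.003,
                  output_price_per_1k=0.015):
    """Rough per-turn cost of the RAG path: embed query -> search -> model call."""
    embed = query_tokens / 1000 * embed_price_per_1k
    search = searches_per_turn * search_price_each
    model = ((query_tokens + context_tokens) / 1000 * input_price_per_1k
             + output_tokens / 1000 * output_price_per_1k)
    return embed + search + model

# Doubling retrieved context raises the model-input term directly:
print(rag_turn_cost(200, 2000, 400))
print(rag_turn_cost(200, 4000, 400))
```

This is why capping chunk count and chunk size pays off twice: once in search cost, and again in model input tokens.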
5) VPC and data movement
- Private VPC endpoints remove public egress but introduce endpoint hours and cross-AZ costs.
- Place compute, storage, and Bedrock endpoints in the same AZ where possible.
- Avoid chatty architectures that bounce across services and AZs.
6) Observability and guardrails
- Tracing each step is essential for reliability and audits, yet verbose logs can be a surprise line item.
- Sample logs at a fixed rate in production. Keep full traces for a subset of sessions.
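Fixed-rate sampling works best when it is deterministic per session, so a sampled session keeps full traces for every turn. A minimal sketch using a hash of the session ID:

```python
import hashlib

def should_trace(session_id: str, sample_rate: float = 0.10) -> bool:
    """Deterministic sampling: the same session is always in or always out."""
    digest = hashlib.sha256(session_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate

sampled = sum(should_trace(f"session-{i}") for i in range(10_000))
print(sampled)  # roughly 1,000 of 10,000 sessions
```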
Hidden cost drivers most teams miss
- Tool-call storms. Unbounded tool use per turn can triple your per-request cost. Enforce a step budget.
- Over-retrieval. Two or three retriever hops per turn is normal. Ten is not.
- Long contexts by default. Use shorter system prompts and compress history.
- Embeddings at write time. Push embedding creation to off-peak batch windows.
- Cross-Region experiments. Keep POCs in the same Region as your data and endpoints.
- Verbose logging in hot paths. Move full traces behind a feature flag.
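A step budget is the cheapest guard against tool-call storms and over-retrieval. A minimal sketch (the agent loop and step kinds are hypothetical; the pattern is the point: hard-cap expensive steps per session):

```python
class StepBudget:
    """Hard cap on expensive steps per session."""

    def __init__(self, max_tool_calls: int = 6, max_retrieval_hops: int = 3):
        self.limits = {"tool": max_tool_calls, "retrieval": max_retrieval_hops}
        self.used = {"tool": 0, "retrieval": 0}

    def spend(self, kind: str) -> bool:
        """Return True if the step is allowed; False once the budget is gone."""
        if self.used[kind] >= self.limits[kind]:
            return False
        self.used[kind] += 1
        return True

budget = StepBudget(max_tool_calls=2)
print([budget.spend("tool") for _ in range(4)])  # [True, True, False, False]
```

When the budget runs out, return a best-effort answer or escalate to a human instead of looping.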
Illustrative monthly bill (10k monthly sessions)
This is a worked example to show the categories and how they add up. Use your own unit prices and volumes to project the real total.
Cost line | Unit driver | Est. volume | Unit price (example) | Subtotal |
---|---|---|---|---|
Agent orchestration steps | 6 steps per session | 60,000 steps | $X / 1k steps | $A |
Model inference | 2 calls per session, avg N tokens | 20,000 calls | $Y / 1k tokens | $B |
Retrieval search | 1.5 searches per session | 15,000 searches | $Z per 1k | $C |
Embedding generation (batch) | 5M tokens indexed | — | $M / 1k tokens | $D |
Vector DB storage | 2M vectors | — | $S per GB-month | $E |
Logging and traces | 10% of sessions sampled | 1,000 traces | $L per GB | $F |
VPC endpoints | 2 endpoints × 720h | — | $V per hour | $G |
Cross-AZ traffic | 10% of requests cross AZ | — | $T per GB | $H |
Estimated monthly total | — | — | — | $A + $B + $C + $D + $E + $F + $G + $H |
How to use this:
- Replace unit prices with current AWS pricing in your Region.
- Stress test with 3 mixes: baseline, peak week, and launch day.
- Add 10 to 20% headroom for growth, retries, and safety rails.
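The table above reduces to a small spreadsheet-style model. A sketch with placeholder unit prices standing in for the $X/$Y/… symbols; substitute current AWS pricing for your Region before trusting the total:

```python
def monthly_bill(lines: dict[str, tuple[float, float]], headroom: float = 0.15):
    """lines maps cost line -> (volume_in_units, price_per_unit).
    Returns per-line subtotals and a total with growth headroom applied."""
    subtotals = {name: vol * price for name, (vol, price) in lines.items()}
    total = sum(subtotals.values()) * (1 + headroom)
    return subtotals, round(total, 2)

# All unit prices below are placeholders, not AWS list prices.
example = {
    "orchestration (1k steps)": (60.0, 0.50),
    "inference (1k tokens)":    (20_000.0, 0.01),
    "retrieval (1k searches)":  (15.0, 0.40),
    "vpc endpoints (hours)":    (1_440.0, 0.01),
}
subtotals, total = monthly_bill(example)
print(total)
```

Rerun the same model with your baseline, peak-week, and launch-day volumes to stress test the three mixes.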
When self-hosting beats Bedrock AgentCore
Bedrock is strong when you need managed scale, compliance anchors, and rapid rollout. Self-hosting can win when:
- Traffic is stable and high: fixed GPU nodes plus an inference server can reduce your per-request cost at scale.
- You need non-Bedrock models or heavy control: advanced routing, custom safety layers, or experimental models may be easier off-platform.
- Strict data residency or air-gapped workloads: some customers require on-prem or private cloud only.
- You want aggressive cost tuning: quantized models, speculative decoding, and caching across users are easier when you own the stack.
Break-even thought experiment
Let C_b be your all-in Bedrock cost per 1k agent turns and C_s your self-hosted cost per 1k turns. Include GPU amortization, ops hours, observability, and storage in C_s.
Self-hosting is attractive when:
Monthly_turns × (C_b − C_s) > switch_costs / payback_months
(with Monthly_turns measured in thousands, since both costs are per 1k turns).
Many teams see payback within 3 to 6 months once traffic passes a few hundred thousand turns per month.
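The inequality rearranges into a payback calculation. A sketch with illustrative numbers only; your C_b, C_s, and switch costs will differ:

```python
def payback_months(monthly_turns_k: float, c_bedrock: float,
                   c_selfhost: float, switch_costs: float) -> float:
    """Months until cumulative savings cover the one-off switch cost.
    monthly_turns_k is in thousands; costs are per 1k turns."""
    monthly_savings = monthly_turns_k * (c_bedrock - c_selfhost)
    if monthly_savings <= 0:
        return float("inf")  # self-hosting never pays back
    return switch_costs / monthly_savings

# Illustrative: 400k turns/month, $90 vs $50 per 1k turns, $60k switch cost
print(payback_months(400, 90, 50, 60_000))  # 3.75 months
```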
Want a real model for your board deck? We can build a one-page TCO sheet that sources your actual volumes and unit prices.
See also: Hidden Cost of SaaS AI Agents and Self-Hosting Savings
Decision matrix: Bedrock AgentCore vs self-hosting
Criterion | AgentCore (Bedrock) | Self-hosted |
---|---|---|
Time to value | Fast setup, managed scale | Slower to start, full control |
Per-turn cost at small scale | Usually lower | Often higher |
Per-turn cost at large scale | Can rise with orchestration + retrieval | Drops with tuned models and caching |
Model flexibility | Bedrock catalog | Any model you can serve |
Data residency | Regional, with VPC options | You decide (on-prem, private cloud) |
Observability & safety | Managed guardrails and logs | You build the rails you need |
Vendor lock-in risk | Medium | Low if you abstract with middleware |
Architecture patterns that cut your bill
- Routing tier before the agent. Send low-stakes turns to a smaller model. Save the flagship model for high-value tasks.
- Step budget per session. Cap tool calls and retrieval hops.
- Shorten prompts and memory. Summarize conversation state.
- Cache knowledge. Keep hot chunks close to the agent.
- Batch embeddings. Nightly jobs with rate-limited indexing.
- Observability sampling. Full traces only when debugging or auditing.
- Same-AZ placement. Keep endpoints, storage, and compute together.
- Middleware abstraction. Decouple your app from provider APIs so switching stays cheap.
See: GDPR-compliant AI middleware and AI privacy & security guide.
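The routing-tier pattern above can be sketched in a few lines. The classifier here is a trivial keyword-and-length heuristic and the model IDs are hypothetical placeholders; in practice you would use a cheap classifier model or embedding similarity:

```python
SMALL_MODEL = "small-model-id"        # hypothetical placeholder
FLAGSHIP_MODEL = "flagship-model-id"  # hypothetical placeholder

HIGH_VALUE_HINTS = ("refund", "contract", "compliance", "escalate")

def route(turn: str) -> str:
    """Send low-stakes turns to the small model; save the flagship
    for long or high-value requests."""
    text = turn.lower()
    if len(text) > 500 or any(h in text for h in HIGH_VALUE_HINTS):
        return FLAGSHIP_MODEL
    return SMALL_MODEL

print(route("What are your opening hours?"))    # small-model-id
print(route("I want a refund for this order"))  # flagship-model-id
```

Keeping this tier in your own middleware, rather than inside any one provider's SDK, is what keeps switching cheap.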
Implementation checklist
- [ ] Define target RPM and concurrency for peak and baseline.
- [ ] Decide memory model: ephemeral, per user, or per account.
- [ ] Set a step budget per session.
- [ ] Add a routing tier for cheap/expensive paths.
- [ ] Place services in the same AZ and enable VPC endpoints.
- [ ] Batch embeddings and cap chunk sizes.
- [ ] Sample traces and keep full logs for audits only.
- [ ] Build a monthly TCO sheet with your unit prices.
- [ ] Add middleware for portability and policy enforcement.
- [ ] Run a one-week A/B cost test: Bedrock path vs self-hosted path.
What Scalevise can do next
- Bedrock Cost Architecture Review: we map your current flow, identify expensive hops, and deliver a plan to reduce cost per turn.
- Middleware for cost and compliance: routing, caching, consent logging, and audit trails in one control layer. Read more: AI GDPR, Privacy & Security for Businesses.
- Self-hosting pilots: quantized models, inference servers, and eval harnesses to prove quality at lower cost.
Contact: https://scalevise.com/contact