AgentCore Memory & VPC Costs: The Missing Line Items

AgentCore Memory & VPC Costs
AgentCore Memory & VPC Costs
TL;DR
Most AgentCore bills are driven by memory footprint and VPC design rather than pure model tokens. Make memory explicit (ephemeral vs persistent), keep endpoints and data in the same AZ, sample logs, and cap tool/retrieval steps. If traffic is stable, a hybrid with self-hosting can drop cost per turn.

1) Pricing quick view (what actually shows up on the bill)

Cost category What it includes Why it spikes
Agent orchestration Planning steps, tool invocations, guardrails Unbounded tool loops, retries, multi-hop tasks
Model inference Tokens for your chosen model(s) Long prompts, verbose history, no function calling
Retrieval & embeddings Vector search, embedding generation, storage Over-retrieval, indexing everything, no delta sync
Memory & session state Bedrock session state, vector store, Dynamo/S3 Persistent per-user memory with frequent reads/writes
VPC endpoints PrivateLink endpoints, endpoint hours per AZ Endpoints in multiple AZs or Regions
Cross-AZ / egress Data moving between AZs or out of VPC Chatty architectures crossing services/AZs
Observability & safety Logs, traces, content filters Verbose tracing in hot paths

Key insight: Memory and VPC choices multiply each other. A persistent memory model with multi-AZ endpoints can cost more than the model tokens.


2) Memory models and their cost footprint

a) Ephemeral per run

  • Memory lives only during a single agent turn or session.
  • Lowest storage/read cost. Requires good prompt/state design.
  • Use for: transactional Q&A, deterministic tasks.

b) Persistent per user

  • Memory follows a user across sessions.
  • Costs include storage, reads/writes, and embedding refresh.
  • Use for: account context, multi-step sales cycles, support history.

c) Shared/team memory

  • Topics, FAQs, product snippets shared across users.
  • Cheaper than per-user memory but needs governance to avoid bloat.

Cost controls for memory

  • Summarize conversation history aggressively.
  • Store facts, not full transcripts.
  • Batch embeddings nightly; use delta sync (only new/changed).
  • Cap chunk size and total tokens per document.
  • Add a TTL on low-value memory to auto-expire.

3) VPC design that avoids surprise charges

Place services together

  • Keep AgentCore endpoints, vector DB, and storage in the same AZ.
  • Minimize hops through NAT; prefer VPC endpoints for S3/Bedrock.

Right-size PrivateLink

  • Each endpoint has hourly cost per AZ.
  • Start with one AZ, scale out when concurrency demands it.
  • Avoid “just in case” endpoints in three AZs if your app runs in one.

Limit cross-AZ chat

  • Co-locate agents, DB, caches, and app workers.
  • Consolidate retrieval into one call with top-k chunks rather than multiple micro-calls across services.

Observability discipline

  • Sample traces (for example 5–10%).
  • Keep full payload logs only for audit windows or incident analysis.

4) Worked example: a transparent monthly bill (plug your own unit prices)

Scenario: 50k sessions/month, each with 2 model calls, 1 retrieval, and persistent per-user memory.
Line item Driver (example) Volume Unit price (fill in) Subtotal
Orchestration steps 5 steps/session 250,000 steps ___ / 1k steps =
Model inference 2 calls × N tokens 100,000 calls ___ / 1k tokens =
Retrieval searches 1.2 per session 60,000 searches ___ / 1k =
Embedding generation Nightly delta sync 3M tokens ___ / 1k tokens =
Vector storage 1.5M vectors ___ / GB-month =
VPC endpoints 1 endpoint × 720h ___ / hour =
Cross-AZ traffic 8% of traffic ___ / GB =
Logs & traces 10% sampled, gzip ___ / GB =
Estimated monthly total Σ

How to use this table

  1. Fill unit prices for your Region.
  2. Run three mixes: baseline, peak week, launch day.
  3. Add 10–20% headroom for retries and safety.
  4. Recompute after each optimization to prove savings.

5) A simple cost model you can copy

Let:

  • S = sessions/month
  • k = model calls per session
  • r = retrieval calls per session
  • m_read/m_write = memory read/write ops per session
  • E_h = VPC endpoint hours
  • X_az = GB cross-AZ

Total cost ≈

Orchestration(S)

  • ModelTokens(S × k × avg_tokens)
  • Retrieval(S × r)
  • Embeddings(index_tokens_month)
  • Memory(m_read, m_write, storage_GB)
  • VPC(E_h)
  • CrossAZ(X_az)
  • Logs(log_GB × sample_rate)

Sensitivity

  • Reduce k and r with function calling and better retrieval.
  • Cut avg_tokens via prompt/state compression.
  • Lower E_h and X_az by consolidating in one AZ.
  • Drop log_GB with sampling and short retention.

6) How to cut the bill without hurting outcomes

  • Routing tier before the agent: send low-stakes turns to a smaller/cheaper model, reserve the flagship for high-value paths.
  • Step budgets: cap planning/tools per session and fail fast with a graceful reply.
  • Prompt/state compression: summarize aggressively; switch to function calling for structured steps.
  • Retrieval hygiene: pre-rank, top-k, no duplicate chunks; delta sync for updates.
  • Response caching: cache static answers and policy snippets.
  • Same-AZ placement: pin endpoints, DB, and app workers to one AZ.
  • Observability sampling: 5–10% traces in prod, 100% only during incidents.

See also:


7) When self-hosting beats AgentCore

Self-hosting can win when:

  • Traffic is stable and high.
  • You need custom policies or models outside Bedrock.
  • Data residency mandates private cloud/on-prem.
  • You want aggressive optimizations: quantization, speculative decoding, shared caches.

Break-even intuition
If C_b = Bedrock cost/1k turns and C_s = self-hosted cost/1k turns, then self-hosting is attractive when:

Monthly_turns × (C_b − C_s) > switch_costs / payback_months

Many teams reach payback in 3–6 months at sustained volumes.


8) Implementation checklist

  • [ ] Define RPM and max concurrency for baseline and peak.
  • [ ] Choose memory model: ephemeral, per user, or team.
  • [ ] Put a step budget on planning & tools.
  • [ ] Add a routing tier for cheap vs premium paths.
  • [ ] Place services in one AZ and use VPC endpoints.
  • [ ] Batch embeddings; enable delta sync.
  • [ ] Sample traces; short retention on non-audit logs.
  • [ ] Build a TCO sheet with your Region prices.
  • [ ] Run a one-week A/B: AgentCore vs self-hosted path.
  • [ ] Review results and lock in optimizations.

9) FAQs

What usually drives costs most?
Persistent memory with frequent reads/writes, and VPC design (endpoint hours, cross-AZ traffic). Tokens matter, but plumbing wins or loses the month.

How do I reduce retrieval cost?
Pre-rank, use smaller chunks, cap top-k, and remove duplicates. Batch indexing and update via delta sync.

How do I know if I need multi-AZ?
Only if you require high availability across AZ failures. Otherwise, single-AZ with fast recovery is cheaper and simpler.


10) Let’s make your bill predictable

If you want a second set of eyes on your architecture:

Bedrock Cost Architecture Review

  • We map your cost path, model a baseline/peak/launch bill, and deliver a concrete cut-plan for memory, VPC and retrieval.
  • Typical outcome: −20–30% cost per turn in 30 days without quality loss.