Hosting AI Models on AWS Bedrock and Azure AI Foundry: Cost Control That Scales
Managed inference, governance, and FinOps across AWS and Azure
Executive summary
Enterprise teams increasingly host foundation models in the cloud instead of training everything from scratch. Amazon Bedrock and Microsoft Azure AI Foundry (built on Azure OpenAI Service and a broader model catalog) offer managed APIs, security, and compliance building blocks—but uncontrolled token usage, wrong capacity models, and missing observability can turn a pilot into a surprise invoice. This article outlines how to deploy production workloads on both platforms and keep cost under control without blocking innovation.
Why managed hosting instead of self‑managed GPUs?
Self-hosting can win for steady, high-volume workloads with dedicated platform teams. For many organisations, managed services reduce undifferentiated heavy lifting: patching, regional availability, negotiated SLAs, and integration with identity and data governance. The trade-off is unit economics you must actively manage—especially when usage spikes with adoption.
- Predictable security posture: Private connectivity, encryption, key management, and audit trails align with enterprise policies.
- Faster time-to-value: Swap or compare models through APIs instead of rebuilding clusters for each experiment.
- Elastic scale: Burst for campaigns without long-lived GPU capacity—if you pair elasticity with budgets and alerts.
AWS Bedrock: what to standardise early
Amazon Bedrock exposes multiple foundation models behind a common API surface. Cost control starts with account structure and guardrails, not only model choice.
- Model selection: Match task complexity to model tier—use smaller, faster models for classification, routing, and extraction; reserve the largest multimodal models for tasks that truly need them.
- On-Demand vs Provisioned Throughput: On-Demand fits spiky or exploratory traffic. If you have sustained tokens-per-minute requirements, evaluate Provisioned Throughput to stabilise cost for predictable peaks—always compare to measured usage over a representative window.
- Batch inference: For offline scoring, summarisation backlogs, or dataset labelling, batch APIs amortise work and often improve price-per-token versus interactive chat paths.
- Prompt caching & reuse: Where the platform supports caching long system prompts or retrieved context, reuse reduces billed input tokens for repeated structures.
- Regional strategy: Price and data residency vary by region; centralise workloads where compliance allows to avoid accidental multi-region sprawl.
- Observability: Emit per-application request IDs, log token counts client-side where possible, and correlate with CloudWatch and cost allocation tags for chargeback.
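Before committing to Provisioned Throughput, a back-of-envelope comparison against measured On-Demand usage helps frame the decision. The sketch below uses hypothetical per-1K-token prices and a hypothetical monthly commitment, not current Bedrock rates; substitute figures from the Bedrock pricing page and your own observed traffic.

```python
# Sketch: compare On-Demand token pricing against a flat monthly
# Provisioned Throughput commitment. All prices are placeholders.

def monthly_on_demand_cost(input_tokens: int, output_tokens: int,
                           price_in_per_1k: float,
                           price_out_per_1k: float) -> float:
    """Pay-as-you-go cost for one month of measured traffic."""
    return ((input_tokens / 1000) * price_in_per_1k
            + (output_tokens / 1000) * price_out_per_1k)

def breakeven_tokens_per_month(commitment_per_month: float,
                               blended_price_per_1k: float) -> float:
    """Monthly token volume above which the commitment beats On-Demand."""
    return commitment_per_month / blended_price_per_1k * 1000

# Illustrative numbers: 500M input / 100M output tokens per month.
on_demand = monthly_on_demand_cost(500_000_000, 100_000_000, 0.003, 0.015)
print(f"On-Demand estimate: ${on_demand:,.0f}/month")
print(f"Break-even at a $20k/month commitment: "
      f"{breakeven_tokens_per_month(20_000, 0.005):,.0f} tokens")
```

Run this over a representative window of real usage, not a single busy week, before signing up for capacity.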
Azure AI Foundry: deployments and spend controls
Azure AI Foundry brings together model deployment, tooling, and governance on Azure. You will typically interact with deployed models (including Azure OpenAI–compatible endpoints) and surrounding services for search, evaluation, and monitoring.
- Deployment types: Understand the difference between consumption-based endpoints and reserved capacity options (such as provisioned throughput units where applicable). Steady production traffic with narrow latency targets often benefits from reserved capacity; bursty internal tools may stay on pay-as-you-go.
- Rate limits and quotas: Set subscription and workspace-level limits so one tenant cannot exhaust shared capacity—pair with retry policies and client-side backoff.
- Content safety and filters: Block categories of abuse early to avoid wasted inference on disallowed prompts—treat safety filters as both risk and cost controls.
- Integration with Azure Monitor: Track latency, token usage, and error rates; export to dashboards for FinOps reviews alongside cost management exports.
- Private networking: Use Private Endpoints and managed identities to reduce attack surface while keeping traffic on predictable paths for compliance.
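The rate-limit guidance above pairs quotas with client-side backoff. A generic retry wrapper can be sketched as follows; the `RuntimeError` is a stand-in for whatever throttling exception (HTTP 429) your SDK raises, and the delay parameters are placeholders to tune against your own limits.

```python
# Sketch: capped exponential backoff with full jitter for throttled calls.

import random
import time

def with_backoff(call, max_retries: int = 5, base_delay: float = 0.5,
                 cap: float = 30.0, sleep=time.sleep):
    """Retry a throttled call; re-raise once retries are exhausted."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RuntimeError:  # stand-in for a 429 / throttling error
            if attempt == max_retries:
                raise
            # Full jitter: random delay up to the capped exponential bound.
            delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
            sleep(delay)
```

Jittered backoff spreads retries out so that many throttled clients do not hammer the endpoint in lockstep.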
Enterprise agents on Azure: pricing and where to host
For production AI agents on Microsoft’s stack, plan spend and architecture from the same place: official Microsoft Foundry pricing, agent services, and optional compute for custom sidecars or APIs.
- Microsoft Foundry pricing — hub for platform and feature billing (including the Foundry Models row for foundation-model usage); pick region and currency, then cross-check token and throughput assumptions. The Foundry experience can be explored without a subscription; billable services follow their listed meters (see the page FAQ).
- Azure AI Agent Service pricing — orchestration and hosting for enterprise agent workloads (redirects to the current Agent Service pricing detail page).
- Microsoft Foundry documentation — build, evaluate, deploy, and govern agents and apps in one place.
- Azure pricing calculator — estimate monthly cost before wide rollout; sign in for programme-specific rates.
- Azure Container Apps — serverless-style hosting for HTTP or queue-driven agent services that sit beside Foundry APIs.
- Azure Kubernetes Service (AKS) — when you need cluster-level control (network policies, GPU node pools, multi-tenant namespaces) for bespoke agent runtimes.
Together, the Microsoft Foundry pricing hub and Agent Service rates give finance and platform teams a defensible baseline for enterprise agent roadmaps.
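Before opening the pricing calculator, a rough monthly estimate from expected traffic gives finance a starting number to challenge. The rates and volumes below are illustrative placeholders only; replace them with meters from the Microsoft Foundry pricing hub.

```python
# Sketch: back-of-envelope monthly spend for an agent workload.
# All rates and traffic figures are hypothetical.

def estimate_monthly_spend(requests_per_day: int, avg_in_tokens: int,
                           avg_out_tokens: int, price_in_per_1k: float,
                           price_out_per_1k: float, days: int = 30) -> float:
    """Token-based spend estimate for one workload over a month."""
    tokens_in = requests_per_day * avg_in_tokens * days
    tokens_out = requests_per_day * avg_out_tokens * days
    return ((tokens_in / 1000) * price_in_per_1k
            + (tokens_out / 1000) * price_out_per_1k)

# Example: 10k requests/day, 1,200 input and 300 output tokens each.
print(f"${estimate_monthly_spend(10_000, 1200, 300, 0.003, 0.015):,.0f}/month")
```

Keep the estimate per workload, not per subscription, so the numbers line up with the cost-centre model described below.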
Cross-cloud cost control playbook
Regardless of vendor, the same FinOps patterns apply.
1. Token budgets and ownership
Assign cost centres to each workload—customer support bot, internal copilot, batch pipeline—with monthly token or spend caps. Surface near-real-time usage to product owners; engineers optimise what product measures.
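A minimal sketch of a per-workload budget with a soft alert threshold, assuming usage is fed in from billing exports or client-side token counts; cap values are illustrative.

```python
# Sketch: per-workload monthly token budget with an early-warning level.

class TokenBudget:
    def __init__(self, monthly_cap: int, alert_ratio: float = 0.8):
        self.monthly_cap = monthly_cap
        self.alert_ratio = alert_ratio
        self.used = 0

    def record(self, tokens: int) -> str:
        """Record usage and return the action the caller should take."""
        self.used += tokens
        if self.used >= self.monthly_cap:
            return "block"   # hard cap: stop or require approval
        if self.used >= self.alert_ratio * self.monthly_cap:
            return "alert"   # notify the product owner
        return "ok"
```

The "alert" state is what makes this workable: product owners see trouble at 80% of budget, not on the invoice.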
2. Routing and smaller models
Insert a router layer (rules-based or lightweight classifier) to send simple queries to smaller, cheaper models and escalate only when confidence is low. This single pattern often yields the largest savings without hurting quality.
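A rules-based router can start as simple as the sketch below; the model identifiers and keyword list are placeholders to adapt to your own traffic, and a lightweight classifier can replace the keyword check later without changing the interface.

```python
# Sketch: rules-based model router. Identifiers are placeholders.

CHEAP_MODEL = "small-fast-model"        # placeholder id
LARGE_MODEL = "large-multimodal-model"  # placeholder id

# Hypothetical markers of simple classification/extraction-style queries.
SIMPLE_KEYWORDS = {"classify", "extract", "route", "category", "label"}

def route(query: str, max_simple_len: int = 200) -> str:
    """Return the model id to use for this query."""
    words = set(query.lower().split())
    if len(query) <= max_simple_len and words & SIMPLE_KEYWORDS:
        return CHEAP_MODEL
    return LARGE_MODEL
```

Log every routing decision with the request ID so that evaluation (pattern 7) can later check whether escalations were actually needed.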
3. Retrieval instead of giant prompts
Prefer RAG with concise retrieved chunks over stuffing entire documents into the context window. Fewer input tokens directly lower cost and often improve accuracy.
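The idea can be illustrated with a naive term-overlap ranking that keeps only the most relevant chunks; a production retriever would use embedding similarity, but the cost mechanism is the same: only the top-k chunks reach the prompt.

```python
# Sketch: keep only the top-k retrieved chunks by naive term overlap
# instead of sending a whole document.

def top_k_chunks(query: str, chunks: list, k: int = 3) -> list:
    """Rank chunks by shared terms with the query; return the best k."""
    q_terms = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q_terms & set(c.lower().split())),
                    reverse=True)
    return scored[:k]
```

If a document averages 20 chunks and you send 3, input tokens for that context drop by roughly 85% per request.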
4. Caching answers
Cache deterministic or near-deterministic responses at the application edge (Redis, CDN, API gateway) for FAQs and repeated analytics questions—do not re-infer identical prompts every time.
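A minimal in-process sketch of the pattern, keyed by a normalised prompt hash; in production the dict would be Redis or a gateway cache, and you would add TTLs for answers that can go stale.

```python
# Sketch: answer cache keyed by a normalised prompt hash, so trivially
# different phrasings of the same FAQ hit the same entry.

import hashlib

class AnswerCache:
    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(prompt: str) -> str:
        # Collapse case and whitespace before hashing.
        normalised = " ".join(prompt.lower().split())
        return hashlib.sha256(normalised.encode()).hexdigest()

    def get_or_compute(self, prompt: str, infer) -> str:
        """Return the cached answer, calling infer() only on a miss."""
        key = self._key(prompt)
        if key not in self._store:
            self._store[key] = infer(prompt)
        return self._store[key]
```

Every cache hit is an inference call you did not pay for; for FAQ-heavy support bots, hit rates above 30% are common.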
5. Batch and off-peak workloads
Schedule summarisation, indexing, and evaluation jobs in batch or off-peak windows when discounts or lower contention apply.
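A simple gate for non-urgent jobs, assuming a fixed local off-peak window; the hours are placeholders to align with your own discount windows or contention patterns.

```python
# Sketch: hold non-urgent jobs until an off-peak window (hours are
# placeholders; 22:00 to 06:00 local in this example).

from datetime import datetime, time

OFF_PEAK_START = time(22, 0)
OFF_PEAK_END = time(6, 0)

def in_off_peak(now: datetime) -> bool:
    """True if the given moment falls in the overnight window."""
    t = now.time()
    return t >= OFF_PEAK_START or t < OFF_PEAK_END
```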
6. Structured outputs
Ask for JSON or schema-constrained outputs to shorten follow-up turns and reduce multi-step chat loops.
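Validating the requested shape locally keeps a malformed reply from costing another round trip. A minimal sketch, assuming a flat two-field schema; the field names are hypothetical.

```python
# Sketch: validate a model's JSON reply against a fixed shape before use.

import json

REQUIRED_FIELDS = {"intent": str, "priority": int}  # hypothetical schema

def parse_structured_reply(raw: str):
    """Return the parsed object, or None if it fails schema checks."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    for field, typ in REQUIRED_FIELDS.items():
        if not isinstance(obj.get(field), typ):
            return None
    return obj
```

On a None result you retry once with a corrective instruction or fall back, rather than entering an open-ended repair conversation.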
7. Continuous evaluation
Run periodic benchmarks when switching models—if a cheaper model matches quality on your evaluation set, promote it in routing rules.
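The promotion rule can be made explicit: a sketch comparing accuracy on a shared evaluation set, with a configurable tolerance so that a marginal score difference does not block a large cost saving.

```python
# Sketch: decide whether a cheaper model may replace the incumbent
# in routing rules, based on a shared evaluation set.

def accuracy(predictions: list, labels: list) -> float:
    """Fraction of predictions matching the reference labels."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

def should_promote(cheap_score: float, incumbent_score: float,
                   tolerance: float = 0.01) -> bool:
    """True if the cheaper model is within tolerance of the incumbent."""
    return cheap_score >= incumbent_score - tolerance
```

Run this on every model or price change, not once; yesterday's routing table is rarely this quarter's optimum.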
Governance without gridlock
Cost control is not only technical. Establish lightweight approval for new high-spend endpoints, document model choices in architecture decision records, and train teams on prompt hygiene (verbosity is expensive). Align security reviews with cost reviews so private endpoints and logging stay in place as you scale.
How Workstation can help
Workstation helps teams design multi-cloud AI platforms: landing zones, identity, observability, and safe patterns for Bedrock and Azure AI Foundry—including FinOps dashboards and governance that developers will actually follow. For architecture reviews or delivery support, contact info@workstation.co.uk.