Deployment Decision Guide

Framework for choosing between SaaS API, Azure AI Foundry, AWS Bedrock, Google Vertex AI, or self-hosted deployment.

This guide provides a systematic framework for choosing how to deploy AI models: via direct API, enterprise cloud platforms (Azure AI Foundry, AWS Bedrock, Google Vertex AI), or self-hosted infrastructure. The deployment model you choose profoundly affects cost, control, compliance, and operational responsibility.

Deployment Models Overview

Direct API (SaaS)

What it is: Call the provider’s cloud API directly (e.g., OpenAI API, Claude API, Gemini API).

Key Characteristics:

  • Data sent to provider’s infrastructure
  • Provider manages everything (updates, scaling, security)
  • Pay-per-use pricing
  • Fast setup (hours to days)
  • Limited control over infrastructure

Typical Cost: Lowest initial; scales linearly with usage

Enterprise Cloud Platforms

What it is: AI models deployed through cloud provider’s managed service (Azure AI Foundry, AWS Bedrock, Google Vertex AI).

Key Characteristics:

  • Data processed within your cloud tenancy
  • Provider manages model infrastructure
  • You control security, networking, IAM integration
  • Enterprise SLA and support
  • Unified platform for multiple models

Typical Cost: Moderate premium over direct API; enterprise features included

Self-Hosted (On-Premise or Private Cloud)

What it is: Run AI models on your own infrastructure (data center servers or dedicated cloud VMs).

Key Characteristics:

  • Complete data control (never leaves your infrastructure)
  • You manage everything (deployment, scaling, security, updates)
  • Requires technical expertise and ongoing operations
  • High fixed costs; low variable costs

Typical Cost: High initial investment; economical at very high volumes

Decision Framework

Step 1: Assess Data Sensitivity and Compliance

Question: Can your data be processed on third-party infrastructure?

Scenario A: Data Cannot Leave Your Infrastructure

Requirements indicating this:

  • Classified government/defense data
  • Explicit data sovereignty mandates prohibiting cloud processing
  • Air-gapped environment requirements
  • Industry regulations prohibiting third-party access
  • Extreme competitive sensitivity (trade secrets, M&A)

Your only option: Self-hosted deployment

Models available: Llama 4, Mistral, DeepSeek (open-source/downloadable models only)

Stop here and skip ahead to the self-hosted guidance below (When to Self-Host and the Self-Hosted Deployment checklist).


Scenario B: Cloud Processing Acceptable with Controls

Requirements:

  • GDPR (EU data protection)
  • HIPAA (US healthcare)
  • Financial regulations (GLBA, PCI DSS)
  • General corporate data governance
  • Data residency preferences (but not absolute mandates)

Your options: Enterprise Cloud Platforms (Azure, AWS, Google) or Self-Hosted

Proceed to Step 2.


Scenario C: Standard Business Data (Low Sensitivity)

Characteristics:

  • Public or internal-only data
  • No regulatory restrictions on third-party processing
  • General business content (non-confidential)

Your options: All deployment models (Direct API, Cloud Platforms, Self-Hosted)

Proceed to Step 2.

Step 2: Evaluate Volume and Cost

Question: What is your expected monthly token usage?

Low Volume (<10M tokens/month = <$100-300/month API cost)

Recommendation: Direct API

Rationale:

  • Self-hosted infrastructure costs ($2-10K/month) far exceed API costs
  • Enterprise platform premiums not justified at low volume
  • Simplicity and speed matter more than cost optimization

Exception: Even at low volume, use Cloud Platform if compliance requires (HIPAA, data residency).


Medium Volume (10M-1B tokens/month = $100-10K/month API cost)

Recommendation: Direct API or Cloud Platform based on other factors

Rationale:

  • API costs meaningful but not prohibitive
  • Cloud platform premium (typically 10-30% over direct API) justified for:
    • Enterprise compliance needs
    • Integration with cloud infrastructure
    • Need for unified governance
  • Self-hosted not yet economical (unless specific constraints require it)

Decision factors:

  • Compliance needs → Cloud Platform
  • Cost-sensitive, no compliance requirements → Direct API
  • Existing cloud investment → Cloud Platform

High Volume (>1B tokens/month = >$10K/month API cost)

Recommendation: Evaluate self-hosted alongside Cloud Platforms

Rationale:

  • API costs substantial ($10K-100K+/month)
  • Self-hosted infrastructure investment justifiable
  • Break-even typically at $100-300/day in API costs (~400K predictions/month)

Calculate TCO:

API (e.g., Gemini Flash at $0.075 input / $0.30 output per 1M tokens):

  • 1B tokens/month input: $75
  • 1B tokens/month output: $300
  • Total: $375/month

Self-Hosted (e.g., Llama 4 on cloud GPU):

  • GPU cloud VMs: $2,000-2,500/month
  • Engineering (0.5 FTE): $4,000-7,000/month
  • Total: $6,000-9,500/month

At this volume, the API is still cheaper. Self-hosting makes sense at 5-10x this volume, or when API options are eliminated by other constraints.
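The trade-off above can be sketched as a quick break-even calculation. The per-token prices and the self-hosted fixed cost are this guide's illustrative figures, not vendor quotes, and `breakeven_tokens_m` is a helper name invented here:

```python
# Illustrative break-even: the monthly token volume at which API spend
# matches a fixed self-hosted cost. Prices are this guide's example
# figures (Gemini Flash-class rates), not vendor quotes.

INPUT_PRICE_PER_M = 0.075   # $ per 1M input tokens
OUTPUT_PRICE_PER_M = 0.30   # $ per 1M output tokens

def api_cost(input_tokens_m: float, output_tokens_m: float) -> float:
    """Monthly API cost in dollars, volumes given in millions of tokens."""
    return input_tokens_m * INPUT_PRICE_PER_M + output_tokens_m * OUTPUT_PRICE_PER_M

def breakeven_tokens_m(self_hosted_monthly: float, output_ratio: float = 0.5) -> float:
    """Total monthly tokens (in millions) where API cost equals the
    self-hosted fixed cost, given the fraction of tokens that are output."""
    blended = (1 - output_ratio) * INPUT_PRICE_PER_M + output_ratio * OUTPUT_PRICE_PER_M
    return self_hosted_monthly / blended

# The guide's mid-range self-hosted estimate: ~$7,750/month
print(api_cost(1_000, 1_000))            # 1B in + 1B out -> 375.0
print(round(breakeven_tokens_m(7_750)))  # 41333 (i.e. ~41B tokens/month)
```

Under these assumptions, break-even lands in the tens of billions of tokens per month, which is consistent with the "5-10x this volume" guidance above.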

Very High Volume (>10B tokens/month): Self-hosted becomes economically attractive. Investment in infrastructure and team justified by savings.

Step 3: Assess Technical Capability

Question: Do you have AI/ML infrastructure expertise?

Limited Technical Capability

Characteristics:

  • No dedicated ML engineering team
  • Limited cloud infrastructure experience
  • Small or no DevOps team

Recommendation: Direct API or Cloud Platform with managed services

Rationale:

  • Self-hosting requires significant expertise (ML ops, GPU infrastructure, model serving)
  • Managed services minimize operational burden
  • Cloud platforms abstract complexity while providing controls

Choose Direct API if: cost-sensitive, low compliance needs

Choose Cloud Platform if: enterprise requirements, integration needs


Moderate Technical Capability

Characteristics:

  • Cloud infrastructure team exists
  • DevOps capabilities
  • Can learn ML-specific tools
  • No dedicated ML engineers (yet)

Recommendation: Cloud Platform

Rationale:

  • Can leverage cloud’s managed AI services
  • Build expertise gradually
  • Cloud platform serves as a stepping stone toward self-hosting, should it eventually be needed

Strong Technical Capability

Characteristics:

  • Dedicated ML engineering team
  • Proven experience with model deployment
  • Strong DevOps and infrastructure capabilities
  • GPU infrastructure experience

Recommendation: Self-hosted or Cloud Platform, based on other factors

Rationale:

  • Have capability to self-host successfully
  • Decision depends on cost, control priorities, and volume
  • Can implement hybrid: self-hosted for volume, APIs for variety

Step 4: Infrastructure and Ecosystem

Question: What is your existing cloud infrastructure?

Heavily Microsoft-Centric

Indicators:

  • Azure infrastructure
  • Microsoft 365, Active Directory
  • .NET development stack

Recommendation: Azure AI Foundry

Models Available:

  • OpenAI (GPT-4, GPT-5, o-series) - primary partnership
  • DeepSeek (R1)
  • Llama, Mistral, and 1,800+ model catalog

Benefits:

  • Unified Microsoft ecosystem
  • Single procurement relationship
  • Integrated billing, IAM, security
  • Strong OpenAI relationship (latest models first)

Heavily AWS-Centric

Indicators:

  • AWS infrastructure dominates
  • Heavy use of Lambda, S3, DynamoDB
  • AWS security/compliance frameworks

Recommendation: AWS Bedrock

Models Available:

  • Claude (Anthropic) - primary partnership
  • Llama, Cohere, AI21, Stability AI, Amazon Titan
  • Multi-vendor model marketplace

Benefits:

  • Deep AWS ecosystem integration
  • Managed service simplicity
  • Multiple model options
  • Claude preferred for coding use cases

Note: OpenAI models are not available on AWS Bedrock; for GPT models, use Azure OpenAI Service.


Heavily Google Cloud-Centric

Indicators:

  • Google Cloud Platform infrastructure
  • BigQuery, Google Workspace
  • Data-heavy ML workflows

Recommendation: Google Vertex AI

Models Available:

  • Gemini (2.5 Pro, Flash) - primary offering
  • Claude, Llama, Mistral via Model Garden

Benefits:

  • Native Gemini access (1M context, multimodal)
  • Strong MLOps capabilities
  • Unified data and AI platform
  • Best-in-class fine-tuning suite

Multi-Cloud or Cloud-Agnostic

Indicators:

  • No dominant cloud provider
  • Intentional multi-cloud strategy
  • Avoiding vendor lock-in priority

Recommendation: Direct APIs or Self-Hosted (open-source models)

Rationale:

  • Direct APIs avoid cloud platform lock-in
  • Self-hosted Llama/Mistral provides maximum portability
  • Can deploy across multiple clouds as needed

Hybrid Approach:

  • Use each cloud’s native AI where already invested
  • Maintain abstraction layer for model switching
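The abstraction-layer idea can be sketched as a minimal provider-agnostic interface. All class and function names here are hypothetical, and the vendor calls are stubbed for illustration:

```python
# Hypothetical abstraction layer: application code depends on one small
# interface, never on a specific vendor SDK, so deployment models can
# be swapped without touching call sites.
from typing import Protocol

class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class DirectAPIModel:
    """Would wrap a vendor SDK call; stubbed here for illustration."""
    def __init__(self, model_name: str):
        self.model_name = model_name
    def complete(self, prompt: str) -> str:
        return f"[{self.model_name}] response to: {prompt}"

class SelfHostedModel:
    """Would call an in-house inference endpoint; stubbed here."""
    def __init__(self, endpoint: str):
        self.endpoint = endpoint
    def complete(self, prompt: str) -> str:
        return f"[self-hosted @ {self.endpoint}] response to: {prompt}"

def summarize(model: ChatModel, text: str) -> str:
    # Call sites see only ChatModel, so switching between a direct API
    # and a self-hosted deployment is a one-line configuration change.
    return model.complete(f"Summarize: {text}")

print(summarize(DirectAPIModel("gemini-flash"), "quarterly report"))
print(summarize(SelfHostedModel("http://gpu-01:8000"), "quarterly report"))
```

The same pattern applies to cloud-platform SDKs: each gets a thin adapter conforming to the shared interface.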

Step 5: Compliance and Risk

Question: What are your regulatory and compliance requirements?

HIPAA (US Healthcare)

Requirement: Business Associate Agreement (BAA)

Options:

  • ✓ Azure OpenAI Service (OpenAI with BAA)
  • ✓ AWS Bedrock (Claude, others with BAA)
  • ✓ Google Vertex AI (Gemini with BAA)
  • ✓ Self-hosted (full control, you manage compliance)
  • ✗ Direct APIs (typically no BAA for direct relationships)

Recommendation: Use cloud platforms with BAA, not direct APIs


GDPR (EU Data Protection)

Requirement: EU data residency, Data Processing Agreement (DPA)

Options:

  • ✓ Mistral (European company, GDPR-native)
  • ✓ Cloud platforms in EU regions (Azure EU, AWS EU, Google EU)
  • ✓ Self-hosted in EU
  • △ Direct APIs with DPA (check data processing locations)

Recommendation:

  • First choice: Mistral (European provider)
  • Alternative: Cloud platforms deployed in EU regions
  • Maximum control: Self-hosted in EU

Government / Defense

Requirement: FedRAMP, ITAR, classified data handling

Options:

  • ✓ Self-hosted on approved infrastructure
  • △ FedRAMP-certified cloud platforms (for appropriate classification levels)
  • ✗ Commercial APIs (typically prohibited)

Recommendation: Self-hosted for classified; FedRAMP cloud for unclassified government


Financial Services

Requirement: SOC 2, data controls, audit trails

Options:

  • ✓ All major cloud platforms (SOC 2 certified)
  • ✓ Self-hosted (full control)
  • △ Direct APIs (verify SOC 2 certification)

Recommendation: Cloud platforms for managed compliance; self-hosted for maximum control

Deployment Model Comparison Matrix

| Factor | Direct API | Azure AI Foundry | AWS Bedrock | Google Vertex AI | Self-Hosted |
|---|---|---|---|---|---|
| Setup Time | Hours-days | Days-weeks | Days-weeks | Days-weeks | Months |
| Initial Cost | Very low | Low-moderate | Low-moderate | Low-moderate | Very high |
| Ongoing Cost | Usage-based | Slightly higher than API | Slightly higher than API | Slightly higher than API | Fixed (infrastructure) |
| Data Control | Low | High | High | High | Maximum |
| Compliance | Limited | Excellent (BAA, HIPAA) | Excellent (BAA, HIPAA) | Excellent (BAA, HIPAA) | Full (you manage) |
| Scalability | Automatic | Automatic | Automatic | Automatic | Manual |
| Maintenance | Provider | Provider | Provider | Provider | You |
| Customization | Limited | Moderate | Moderate | High (fine-tuning) | Maximum |
| Model Selection | Single provider | 1,800+ models | Multi-vendor | Gemini + Model Garden | Open-source only |
| Vendor Lock-In | High (to model) | Moderate (to Azure) | Moderate (to AWS) | Moderate (to Google) | None |
| Support | Basic | Enterprise | Enterprise | Enterprise | Self-support |

Decision Tree Summary

START
  │
  ├─ Data MUST stay on-premise?
  │   └─ YES → Self-Hosted (Llama, Mistral)
  │   └─ NO → Continue
  │
  ├─ HIPAA/regulated healthcare?
  │   └─ YES → Cloud Platform with BAA (Azure/AWS/Google)
  │   └─ NO → Continue
  │
  ├─ Volume > 10B tokens/month?
  │   └─ YES → Evaluate Self-Hosted (economical at scale)
  │   └─ NO → Continue
  │
  ├─ Strong cloud investment?
  │   ├─ Azure → Azure AI Foundry (OpenAI, DeepSeek)
  │   ├─ AWS → AWS Bedrock (Claude primary)
  │   ├─ Google → Vertex AI (Gemini primary)
  │   └─ None → Continue
  │
  ├─ Volume < 10M tokens/month?
  │   └─ YES → Direct API (simplicity, low cost)
  │   └─ NO → Continue
  │
  └─ Compliance needs (GDPR, SOC 2)?
      └─ YES → Cloud Platform (managed compliance)
      └─ NO → Direct API (lowest cost)

Deployment Recommendations by Scenario

Scenario 1: Startup MVP (Seed Stage)

Context: Building quickly, limited budget, exploring use cases

Recommendation: Direct API (GPT-4o, Claude, or Gemini Flash)

Rationale:

  • Speed to market critical
  • Low volume (API costs minimal)
  • No infrastructure team
  • Can migrate later if needed

Model Choice:

  • Quality-first: GPT-4o or Claude
  • Cost-first: Gemini Flash or DeepSeek-V3

Scenario 2: Scale-Up (Series B, Growing Volume)

Context: Product-market fit achieved, scaling usage, need reliability

Recommendation: Cloud Platform (based on existing cloud investment)

Rationale:

  • Volume increasing (API costs meaningful)
  • Need enterprise SLA and support
  • Growing compliance requirements
  • Can leverage existing cloud relationship

Implementation:

  • Azure if Microsoft-heavy
  • AWS if AWS-heavy
  • Google if data/ML-heavy on Google Cloud

Scenario 3: Enterprise (Global 2000)

Context: Multiple use cases, high volume, strict compliance, multi-cloud

Recommendation: Hybrid: Cloud Platforms + Self-Hosted

Strategy:

  • Cloud platforms for managed critical workloads (customer-facing, mid-volume)
  • Self-hosted Llama/Mistral for highest-volume or most sensitive data
  • Direct APIs for experimentation and non-critical tools

Implementation:

  • Azure AI Foundry: OpenAI for customer apps
  • Self-hosted Llama 4: High-volume internal processing
  • AWS Bedrock: Claude for coding workflows
  • Gemini API: Multimodal experiments

Scenario 4: Regulated Industry (Healthcare, Finance, Government)

Context: Strict compliance, audit requirements, data sovereignty

Recommendation:

  • Healthcare (HIPAA): Cloud Platform with BAA (Azure/AWS/Google)
  • Finance: Cloud Platform or self-hosted
  • EU: Mistral or cloud platforms in EU regions
  • Defense/Classified: Self-hosted only

Critical: Verify BAA, data residency, and compliance certifications before deployment.


Scenario 5: High-Volume Cost Optimization

Context: Processing >10B tokens/month, cost is primary concern

Recommendation: Self-Hosted Llama 4 or Mistral

Rationale:

  • API costs $10K-100K+/month
  • Infrastructure investment ($2-2.5K/month GPU + engineering) cheaper at scale
  • No per-token costs after infrastructure
  • Can fine-tune for specific needs

Break-even: Self-hosting typically requires 10-30x higher token throughput before it beats API pricing

Platform-Specific Guidance

When to Choose Azure AI Foundry

Best for:

  • Microsoft-centric organizations
  • Need OpenAI models with enterprise controls
  • Want 1,800+ model catalog
  • Require HIPAA compliance with OpenAI
  • Already using Azure infrastructure

Models: OpenAI (primary), DeepSeek, Llama, Mistral, 1,800+ others

Strengths: Largest model catalog, OpenAI partnership, Microsoft ecosystem integration

When to Choose AWS Bedrock

Best for:

  • AWS-centric organizations
  • Claude preferred (best coding, nuanced responses)
  • Multi-model strategy
  • Serverless and managed services preference

Models: Claude (primary), Llama, Cohere, AI21, Stability AI, Amazon Titan

Strengths: Claude access, multi-vendor flexibility, deep AWS integration

Note: No OpenAI models (use Azure for GPT)

When to Choose Google Vertex AI

Best for:

  • Google Cloud organizations
  • Gemini preferred (1M context, multimodal, price-performance)
  • Data-heavy ML workflows
  • Advanced fine-tuning needs

Models: Gemini (primary), Claude, Llama, Mistral (via Model Garden)

Strengths: Best fine-tuning suite, Gemini 1M context, unified data+AI platform

When to Self-Host

Best for:

  • Data cannot leave infrastructure (sovereignty, classification)
  • Volume >10B tokens/month (economical at scale)
  • Need full customization (fine-tuning, model modification)
  • Avoiding vendor lock-in strategic priority
  • Strong ML engineering capability exists

Models: Llama 4, Mistral, DeepSeek (open-source only)

Requirements: GPU infrastructure, ML ops expertise, ongoing maintenance

Cost Comparison at Different Volumes

Example: 100M Tokens/Month (50M input, 50M output)

Direct API (Gemini Flash):

  • Input: 50M × $0.075 = $3.75
  • Output: 50M × $0.30 = $15
  • Total: $18.75/month

Cloud Platform Premium (~20% higher):

  • ~$22.50/month

Self-Hosted (Llama 4 on cloud GPU):

  • GPU VM: $2,000-2,500/month
  • Engineering (0.25 FTE): $2,000-3,500/month
  • Total: $4,000-6,000/month

Winner: Direct API (self-hosted is roughly 200-300x more expensive at this volume)


Example: 10B Tokens/Month (5B input, 5B output)

Direct API (Gemini Flash):

  • Input: 5B × $0.075 = $375
  • Output: 5B × $0.30 = $1,500
  • Total: $1,875/month

Cloud Platform Premium (~20%):

  • ~$2,250/month

Self-Hosted:

  • GPU infrastructure (multiple): $5,000-7,000/month
  • Engineering (0.5-1 FTE): $4,000-7,000/month
  • Total: $9,000-14,000/month

Winner: Still the API, but the gap is narrowing. At roughly 5-10x this volume, self-hosted becomes competitive.


Example: 100B Tokens/Month (50B input, 50B output)

Direct API (Gemini Flash):

  • Input: 50B × $0.075 = $3,750
  • Output: 50B × $0.30 = $15,000
  • Total: $18,750/month

Self-Hosted:

  • GPU infrastructure (scaled): $10,000-15,000/month
  • Engineering (1-2 FTE): $8,000-14,000/month
  • Total: $18,000-29,000/month

Winner: Competitive. Self-hosted now justifiable, especially for data sovereignty benefits. At higher volumes, self-hosted wins clearly.
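The three worked examples can be reproduced with a short script; the self-hosted ranges are this guide's estimates, hard-coded for comparison:

```python
# Reproduces the guide's three cost examples: API cost computed from
# per-token prices vs the guide's self-hosted estimates (hard-coded).
def api_cost_usd(input_b: float, output_b: float,
                 in_price: float = 0.075, out_price: float = 0.30) -> float:
    """API cost for volumes in billions of tokens; prices are per 1M tokens."""
    return input_b * 1_000 * in_price + output_b * 1_000 * out_price

# label: (input billions, output billions, self-hosted $/month range)
examples = {
    "100M tokens": (0.05, 0.05, (4_000, 6_000)),
    "10B tokens":  (5.0, 5.0, (9_000, 14_000)),
    "100B tokens": (50.0, 50.0, (18_000, 29_000)),
}
for label, (inp, out, (sh_low, sh_high)) in examples.items():
    api = api_cost_usd(inp, out)
    print(f"{label}/month: API ${api:,.2f} vs self-hosted ${sh_low:,}-${sh_high:,}")
```

The crossover is visible directly: the API wins by orders of magnitude at 100M tokens and only reaches parity around 100B tokens per month.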

Implementation Checklist

Direct API Deployment

  • Select provider (OpenAI, Claude, Gemini, DeepSeek)
  • Review terms of service and data usage policies
  • Obtain API keys
  • Implement authentication and rate limiting
  • Set up billing alerts
  • Test with sample requests
  • Implement error handling and retries
  • Monitor usage and costs
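Two of the checklist items, error handling and retries, can be sketched with a generic backoff wrapper. The transport call is stubbed here because the real call depends on your provider SDK; `with_retries` and `TransientAPIError` are names invented for this sketch:

```python
# Minimal retry-with-exponential-backoff wrapper for API calls.
# The actual network call is stubbed; in practice it would be a
# provider SDK call or an HTTP request.
import random
import time

class TransientAPIError(Exception):
    """Stand-in for retryable failures (rate limits / 429, server 5xx)."""

def with_retries(fn, max_attempts: int = 5, base_delay: float = 0.5):
    """Call fn(), retrying transient failures with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientAPIError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Backoff: base * 2^attempt, scaled by random jitter (50-150%)
            delay = base_delay * (2 ** attempt) * (0.5 + random.random())
            time.sleep(delay)

# Demo: a stubbed call that fails twice, then succeeds.
calls = {"n": 0}
def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientAPIError("rate limited")
    return "ok"

print(with_retries(flaky_call, base_delay=0.01))  # "ok" after 2 retries
```

Non-retryable errors (authentication failures, invalid requests) should be excluded from the retry loop and raised immediately.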

Cloud Platform Deployment

  • Choose platform (Azure, AWS, Google)
  • Provision AI service (AI Foundry, Bedrock, Vertex AI)
  • Configure IAM and access controls
  • Set up networking (VPCs, private endpoints if needed)
  • Integrate with existing cloud services
  • Configure monitoring and logging
  • Establish cost controls and budgets
  • Sign BAA if HIPAA required
  • Verify compliance certifications
  • Test deployment with sample workloads

Self-Hosted Deployment

  • Select model (Llama 4, Mistral, DeepSeek)
  • Provision GPU infrastructure (cloud VMs or on-premise)
  • Install model serving framework (vLLM, TensorRT-LLM, Ollama)
  • Deploy and test model
  • Implement load balancing and scaling
  • Set up monitoring (performance, errors, resource usage)
  • Establish security controls (access, encryption)
  • Plan update and maintenance procedures
  • Train team on operations
  • Document runbooks for common issues

Common Pitfalls to Avoid

Choosing self-hosted too early: Infrastructure costs far exceed API costs at low-medium volumes. Only self-host when volume justifies or constraints require.

Ignoring compliance until late: HIPAA, GDPR, data residency requirements eliminate or constrain deployment options. Address early.

Underestimating self-hosted operational burden: Self-hosting requires ongoing engineering time, at minimum 0.25-0.5 FTE and often more.

Not calculating full TCO: Compare apples-to-apples including infrastructure, engineering time, support, and opportunity costs.

Vendor lock-in without realizing: Deep cloud platform integration creates switching costs. Maintain abstraction layer if portability matters.

Direct API for regulated data: HIPAA, classified data, strict GDPR often require cloud platforms with BAAs or self-hosting, not direct APIs.

Summary

Choose Direct API when:

  • Volume low (<10M tokens/month)
  • No strict compliance requirements
  • Speed and simplicity prioritized
  • Limited technical resources

Choose Cloud Platform when:

  • HIPAA, GDPR, or compliance frameworks required
  • Medium-high volume (10M-10B tokens/month)
  • Enterprise support and SLA needed
  • Existing cloud investment to leverage

Choose Self-Hosted when:

  • Data must stay on-premise (sovereignty, classification)
  • Very high volume (>10-50B tokens/month)
  • Need maximum customization
  • Strong ML ops capability exists

For most organizations: Start with Direct API (learn fast, low risk), graduate to Cloud Platforms as volume and compliance needs grow, consider self-hosted only when volume justifies or constraints require.

The optimal strategy is often hybrid: Use different deployment models for different use cases based on sensitivity, volume, and requirements.