Deployment Decision Guide

Framework for choosing between SaaS API, Azure AI Foundry, AWS Bedrock, Google Vertex AI, or self-hosted deployment.

This guide provides a systematic framework for choosing how to deploy AI models: via direct API, enterprise cloud platforms (Azure AI Foundry, AWS Bedrock, Google Vertex AI), or self-hosted infrastructure. The deployment model you choose profoundly affects cost, control, compliance, and operational responsibility.

Deployment Models Overview

Direct API (SaaS)

What it is: Call the provider’s cloud API directly (e.g., OpenAI API, Claude API, Gemini API).

Key Characteristics:

  • Data sent to provider’s infrastructure
  • Provider manages everything (updates, scaling, security)
  • Pay-per-use pricing
  • Fast setup (hours to days)
  • Limited control over infrastructure

Typical Cost: Lowest initial; scales linearly with usage

Enterprise Cloud Platforms

What it is: AI models deployed through cloud provider’s managed service (Azure AI Foundry, AWS Bedrock, Google Vertex AI).

Key Characteristics:

  • Data processed within your cloud tenancy
  • Provider manages model infrastructure
  • You control security, networking, IAM integration
  • Enterprise SLA and support
  • Unified platform for multiple models

Typical Cost: Moderate premium over direct API; enterprise features included

Self-Hosted (On-Premise or Private Cloud)

What it is: Run AI models on your own infrastructure (data center servers or dedicated cloud VMs).

Key Characteristics:

  • Complete data control (never leaves your infrastructure)
  • You manage everything (deployment, scaling, security, updates)
  • Requires technical expertise and ongoing operations
  • High fixed costs; low variable costs

Typical Cost: High initial investment; economical at very high volumes

Decision Framework

Step 1: Assess Data Sensitivity and Compliance

Question: Can your data be processed on third-party infrastructure?

Scenario A: Data Cannot Leave Your Infrastructure

Requirements indicating this:

  • Classified government/defense data
  • Explicit data sovereignty mandates prohibiting cloud processing
  • Air-gapped environment requirements
  • Industry regulations prohibiting third-party access
  • Extreme competitive sensitivity (trade secrets, M&A)

Your only option: Self-hosted deployment

Models available: Llama 4, Mistral, DeepSeek (open-source/downloadable models only)

Stop here and skip ahead to the self-hosted guidance below (When to Self-Host and the Self-Hosted Deployment checklist).


Scenario B: Cloud Processing Acceptable with Controls

Requirements:

  • GDPR (EU data protection)
  • HIPAA (US healthcare)
  • Financial regulations (GLBA, PCI DSS)
  • General corporate data governance
  • Data residency preferences (but not absolute mandates)

Your options: Enterprise Cloud Platforms (Azure, AWS, Google) or Self-Hosted

Proceed to Step 2.


Scenario C: Standard Business Data (Low Sensitivity)

Characteristics:

  • Public or internal-only data
  • No regulatory restrictions on third-party processing
  • General business content (non-confidential)

Your options: All deployment models (Direct API, Cloud Platforms, Self-Hosted)

Proceed to Step 2.

Step 2: Evaluate Volume and Cost

Question: What is your expected monthly token usage?

Low Volume (<10M tokens/month = <$100-300/month API cost)

Recommendation: Direct API

Rationale:

  • Self-hosted infrastructure costs ($2-10K/month) far exceed API costs
  • Enterprise platform premiums not justified at low volume
  • Simplicity and speed matter more than cost optimization

Exception: Even at low volume, use Cloud Platform if compliance requires (HIPAA, data residency).


Medium Volume (10M-1B tokens/month = $100-10K/month API cost)

Recommendation: Direct API or Cloud Platform based on other factors

Rationale:

  • API costs meaningful but not prohibitive
  • Cloud platform premium (typically 10-30% over direct API) justified for:
    • Enterprise compliance needs
    • Integration with cloud infrastructure
    • Need for unified governance
  • Self-hosted not yet economical (unless specific constraints require it)

Decision factors:

  • Compliance needs → Cloud Platform
  • Cost-sensitive, no compliance requirements → Direct API
  • Existing cloud investment → Cloud Platform

High Volume (>1B tokens/month = >$10K/month API cost)

Recommendation: Evaluate self-hosted alongside Cloud Platforms

Rationale:

  • API costs substantial ($10K-100K+/month)
  • Self-hosted infrastructure investment justifiable
  • Break-even typically at $100-300/day in API costs (~400K predictions/month)

Calculate TCO:

API (e.g., Gemini Flash at $0.075 input / $0.30 output per 1M tokens):

  • 1B tokens/month input: $75
  • 1B tokens/month output: $300
  • Total: $375/month

Self-Hosted (e.g., Llama 4 on cloud GPU):

  • GPU cloud VMs: $2,000-2,500/month
  • Engineering (0.5 FTE): $4,000-7,000/month
  • Total: $6,000-9,500/month

At this volume, the API is still cheaper. Self-hosting makes sense at 5-10x this volume, or when API options are eliminated by other constraints.
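The trade-off above can be sketched as a quick break-even calculation. The per-token prices and the self-hosted fixed cost are this guide's illustrative figures, not vendor quotes, and `breakeven_tokens_m` is a helper name invented here:

```python
# Illustrative break-even: the monthly token volume at which API spend
# matches a fixed self-hosted cost. Prices are this guide's example
# figures (Gemini Flash-class rates), not vendor quotes.

INPUT_PRICE_PER_M = 0.075   # $ per 1M input tokens
OUTPUT_PRICE_PER_M = 0.30   # $ per 1M output tokens

def api_cost(input_tokens_m: float, output_tokens_m: float) -> float:
    """Monthly API cost in dollars, volumes given in millions of tokens."""
    return input_tokens_m * INPUT_PRICE_PER_M + output_tokens_m * OUTPUT_PRICE_PER_M

def breakeven_tokens_m(self_hosted_monthly: float, output_ratio: float = 0.5) -> float:
    """Total monthly tokens (in millions) where API cost equals the
    self-hosted fixed cost, given the fraction of tokens that are output."""
    blended = (1 - output_ratio) * INPUT_PRICE_PER_M + output_ratio * OUTPUT_PRICE_PER_M
    return self_hosted_monthly / blended

# The guide's mid-range self-hosted estimate: ~$7,750/month
print(api_cost(1_000, 1_000))            # 1B in + 1B out -> 375.0
print(round(breakeven_tokens_m(7_750)))  # 41333 (i.e. ~41B tokens/month)
```

Under these assumptions, break-even lands in the tens of billions of tokens per month, which is consistent with the "5-10x this volume" guidance above.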

Very High Volume (>10B tokens/month): Self-hosted becomes economically attractive. Investment in infrastructure and team justified by savings.

Step 3: Assess Technical Capability

Question: Do you have AI/ML infrastructure expertise?

Limited Technical Capability

Characteristics:

  • No dedicated ML engineering team
  • Limited cloud infrastructure experience
  • Small or no DevOps team

Recommendation: Direct API or Cloud Platform with managed services

Rationale:

  • Self-hosting requires significant expertise (ML ops, GPU infrastructure, model serving)
  • Managed services minimize operational burden
  • Cloud platforms abstract complexity while providing controls

Choose Direct API if: cost-sensitive, low compliance needs

Choose Cloud Platform if: enterprise requirements, integration needs


Moderate Technical Capability

Characteristics:

  • Cloud infrastructure team exists
  • DevOps capabilities
  • Can learn ML-specific tools
  • No dedicated ML engineers (yet)

Recommendation: Cloud Platform

Rationale:

  • Can leverage cloud’s managed AI services
  • Build expertise gradually
  • Cloud platform serves as a stepping stone toward self-hosting, should it eventually be needed

Strong Technical Capability

Characteristics:

  • Dedicated ML engineering team
  • Proven experience with model deployment
  • Strong DevOps and infrastructure capabilities
  • GPU infrastructure experience

Recommendation: Self-hosted or Cloud Platform, based on other factors

Rationale:

  • Have capability to self-host successfully
  • Decision depends on cost, control priorities, and volume
  • Can implement hybrid: self-hosted for volume, APIs for variety

Step 4: Infrastructure and Ecosystem

Question: What is your existing cloud infrastructure?

Heavily Microsoft-Centric

Indicators:

  • Azure infrastructure
  • Microsoft 365, Active Directory
  • .NET development stack

Recommendation: Azure AI Foundry

Models Available:

  • OpenAI (GPT-4, GPT-5, o-series) - primary partnership
  • DeepSeek (R1)
  • Llama, Mistral, and 1,800+ model catalog

Benefits:

  • Unified Microsoft ecosystem
  • Single procurement relationship
  • Integrated billing, IAM, security
  • Strong OpenAI relationship (latest models first)

Heavily AWS-Centric

Indicators:

  • AWS infrastructure dominates
  • Heavy use of Lambda, S3, DynamoDB
  • AWS security/compliance frameworks

Recommendation: AWS Bedrock

Models Available:

  • Claude (Anthropic) - primary partnership
  • Llama, Cohere, AI21, Stability AI, Amazon Titan
  • Multi-vendor model marketplace

Benefits:

  • Deep AWS ecosystem integration
  • Managed service simplicity
  • Multiple model options
  • Claude preferred for coding use cases

Note: OpenAI models are not available on AWS Bedrock; for GPT models, use Azure OpenAI Service.


Heavily Google Cloud-Centric

Indicators:

  • Google Cloud Platform infrastructure
  • BigQuery, Google Workspace
  • Data-heavy ML workflows

Recommendation: Google Vertex AI

Models Available:

  • Gemini (2.5 Pro, Flash) - primary offering
  • Claude, Llama, Mistral via Model Garden

Benefits:

  • Native Gemini access (1M context, multimodal)
  • Strong MLOps capabilities
  • Unified data and AI platform
  • Best-in-class fine-tuning suite

Multi-Cloud or Cloud-Agnostic

Indicators:

  • No dominant cloud provider
  • Intentional multi-cloud strategy
  • Avoiding vendor lock-in priority

Recommendation: Direct APIs or Self-Hosted (open-source models)

Rationale:

  • Direct APIs avoid cloud platform lock-in
  • Self-hosted Llama/Mistral provides maximum portability
  • Can deploy across multiple clouds as needed

Hybrid Approach:

  • Use each cloud’s native AI where already invested
  • Maintain abstraction layer for model switching
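The abstraction-layer idea can be sketched as a minimal provider-agnostic interface. All class and function names here are hypothetical, and the vendor calls are stubbed for illustration:

```python
# Hypothetical abstraction layer: application code depends on one small
# interface, never on a specific vendor SDK, so deployment models can
# be swapped without touching call sites.
from typing import Protocol

class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class DirectAPIModel:
    """Would wrap a vendor SDK call; stubbed here for illustration."""
    def __init__(self, model_name: str):
        self.model_name = model_name
    def complete(self, prompt: str) -> str:
        return f"[{self.model_name}] response to: {prompt}"

class SelfHostedModel:
    """Would call an in-house inference endpoint; stubbed here."""
    def __init__(self, endpoint: str):
        self.endpoint = endpoint
    def complete(self, prompt: str) -> str:
        return f"[self-hosted @ {self.endpoint}] response to: {prompt}"

def summarize(model: ChatModel, text: str) -> str:
    # Call sites see only ChatModel, so switching between a direct API
    # and a self-hosted deployment is a one-line configuration change.
    return model.complete(f"Summarize: {text}")

print(summarize(DirectAPIModel("gemini-flash"), "quarterly report"))
print(summarize(SelfHostedModel("http://gpu-01:8000"), "quarterly report"))
```

The same pattern applies to cloud-platform SDKs: each gets a thin adapter conforming to the shared interface.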

Step 5: Compliance and Risk

Question: What are your regulatory and compliance requirements?

HIPAA (US Healthcare)

Requirement: Business Associate Agreement (BAA)

Options:

  • ✓ Azure OpenAI Service (OpenAI with BAA)
  • ✓ AWS Bedrock (Claude, others with BAA)
  • ✓ Google Vertex AI (Gemini with BAA)
  • ✓ Self-hosted (full control, you manage compliance)
  • ✗ Direct APIs (typically no BAA for direct relationships)

Recommendation: Use cloud platforms with BAA, not direct APIs


GDPR (EU Data Protection)

Requirement: EU data residency, Data Processing Agreement (DPA)

Options:

  • ✓ Mistral (European company, GDPR-native)
  • ✓ Cloud platforms in EU regions (Azure EU, AWS EU, Google EU)
  • ✓ Self-hosted in EU
  • △ Direct APIs with DPA (check data processing locations)

Recommendation:

  • First choice: Mistral (European provider)
  • Alternative: Cloud platforms deployed in EU regions
  • Maximum control: Self-hosted in EU

Government / Defense

Requirement: FedRAMP, ITAR, classified data handling

Options:

  • ✓ Self-hosted on approved infrastructure
  • △ FedRAMP-certified cloud platforms (for appropriate classification levels)
  • ✗ Commercial APIs (typically prohibited)

Recommendation: Self-hosted for classified; FedRAMP cloud for unclassified government


Financial Services

Requirement: SOC 2, data controls, audit trails

Options:

  • ✓ All major cloud platforms (SOC 2 certified)
  • ✓ Self-hosted (full control)
  • △ Direct APIs (verify SOC 2 certification)

Recommendation: Cloud platforms for managed compliance; self-hosted for maximum control

Deployment Model Comparison Matrix

| Factor | Direct API | Azure AI Foundry | AWS Bedrock | Google Vertex AI | Self-Hosted |
|---|---|---|---|---|---|
| Setup Time | Hours-days | Days-weeks | Days-weeks | Days-weeks | Months |
| Initial Cost | Very low | Low-moderate | Low-moderate | Low-moderate | Very high |
| Ongoing Cost | Usage-based | Slightly higher than API | Slightly higher than API | Slightly higher than API | Fixed (infrastructure) |
| Data Control | Low | High | High | High | Maximum |
| Compliance | Limited | Excellent (BAA, HIPAA) | Excellent (BAA, HIPAA) | Excellent (BAA, HIPAA) | Full (you manage) |
| Scalability | Automatic | Automatic | Automatic | Automatic | Manual |
| Maintenance | Provider | Provider | Provider | Provider | You |
| Customization | Limited | Moderate | Moderate | High (fine-tuning) | Maximum |
| Model Selection | Single provider | 1,800+ models | Multi-vendor | Gemini + Model Garden | Open-source only |
| Vendor Lock-In | High (to model) | Moderate (to Azure) | Moderate (to AWS) | Moderate (to Google) | None |
| Support | Basic | Enterprise | Enterprise | Enterprise | Self-support |

Decision Tree Summary

START
  │
  ├─ Data MUST stay on-premise?
  │   └─ YES → Self-Hosted (Llama, Mistral)
  │   └─ NO → Continue
  │
  ├─ HIPAA/regulated healthcare?
  │   └─ YES → Cloud Platform with BAA (Azure/AWS/Google)
  │   └─ NO → Continue
  │
  ├─ Volume > 10B tokens/month?
  │   └─ YES → Evaluate Self-Hosted (economical at scale)
  │   └─ NO → Continue
  │
  ├─ Strong cloud investment?
  │   ├─ Azure → Azure AI Foundry (OpenAI, DeepSeek)
  │   ├─ AWS → AWS Bedrock (Claude primary)
  │   ├─ Google → Vertex AI (Gemini primary)
  │   └─ None → Continue
  │
  ├─ Volume < 10M tokens/month?
  │   └─ YES → Direct API (simplicity, low cost)
  │   └─ NO → Continue
  │
  └─ Compliance needs (GDPR, SOC 2)?
      └─ YES → Cloud Platform (managed compliance)
      └─ NO → Direct API (lowest cost)

Deployment Recommendations by Scenario

Scenario 1: Startup MVP (Seed Stage)

Context: Building quickly, limited budget, exploring use cases

Recommendation: Direct API (GPT-4o, Claude, or Gemini Flash)

Rationale:

  • Speed to market critical
  • Low volume (API costs minimal)
  • No infrastructure team
  • Can migrate later if needed

Model Choice:

  • Quality-first: GPT-4o or Claude
  • Cost-first: Gemini Flash or DeepSeek-V3

Scenario 2: Scale-Up (Series B, Growing Volume)

Context: Product-market fit achieved, scaling usage, need reliability

Recommendation: Cloud Platform (based on existing cloud investment)

Rationale:

  • Volume increasing (API costs meaningful)
  • Need enterprise SLA and support
  • Growing compliance requirements
  • Can leverage existing cloud relationship

Implementation:

  • Azure if Microsoft-heavy
  • AWS if AWS-heavy
  • Google if data/ML-heavy on Google Cloud

Scenario 3: Enterprise (Global 2000)

Context: Multiple use cases, high volume, strict compliance, multi-cloud

Recommendation: Hybrid: Cloud Platforms + Self-Hosted

Strategy:

  • Cloud platforms for managed critical workloads (customer-facing, mid-volume)
  • Self-hosted Llama/Mistral for highest-volume or most sensitive data
  • Direct APIs for experimentation and non-critical tools

Implementation:

  • Azure AI Foundry: OpenAI for customer apps
  • Self-hosted Llama 4: High-volume internal processing
  • AWS Bedrock: Claude for coding workflows
  • Gemini API: Multimodal experiments

Scenario 4: Regulated Industry (Healthcare, Finance, Government)

Context: Strict compliance, audit requirements, data sovereignty

Recommendation:

  • Healthcare (HIPAA): Cloud Platform with BAA (Azure/AWS/Google)
  • Finance: Cloud Platform or self-hosted
  • EU: Mistral or cloud platforms in EU regions
  • Defense/Classified: Self-hosted only

Critical: Verify BAA, data residency, and compliance certifications before deployment.


Scenario 5: High-Volume Cost Optimization

Context: Processing >10B tokens/month, cost is primary concern

Recommendation: Self-Hosted Llama 4 or Mistral

Rationale:

  • API costs $10K-100K+/month
  • Infrastructure investment ($2-2.5K/month GPU + engineering) cheaper at scale
  • No per-token costs after infrastructure
  • Can fine-tune for specific needs

Break-even: Self-hosting typically requires 10-30x higher token throughput before it beats API pricing

Platform-Specific Guidance

When to Choose Azure AI Foundry

Best for:

  • Microsoft-centric organizations
  • Need OpenAI models with enterprise controls
  • Want 1,800+ model catalog
  • Require HIPAA compliance with OpenAI
  • Already using Azure infrastructure

Models: OpenAI (primary), DeepSeek, Llama, Mistral, 1,800+ others

Strengths: Largest model catalog, OpenAI partnership, Microsoft ecosystem integration

When to Choose AWS Bedrock

Best for:

  • AWS-centric organizations
  • Claude preferred (best coding, nuanced responses)
  • Multi-model strategy
  • Serverless and managed services preference

Models: Claude (primary), Llama, Cohere, AI21, Stability AI, Amazon Titan

Strengths: Claude access, multi-vendor flexibility, deep AWS integration

Note: No OpenAI models (use Azure for GPT)

When to Choose Google Vertex AI

Best for:

  • Google Cloud organizations
  • Gemini preferred (1M context, multimodal, price-performance)
  • Data-heavy ML workflows
  • Advanced fine-tuning needs

Models: Gemini (primary), Claude, Llama, Mistral (via Model Garden)

Strengths: Best fine-tuning suite, Gemini 1M context, unified data+AI platform

When to Self-Host

Best for:

  • Data cannot leave infrastructure (sovereignty, classification)
  • Volume >10B tokens/month (economical at scale)
  • Need full customization (fine-tuning, model modification)
  • Avoiding vendor lock-in strategic priority
  • Strong ML engineering capability exists

Models: Llama 4, Mistral, DeepSeek (open-source only)

Requirements: GPU infrastructure, ML ops expertise, ongoing maintenance

Cost Comparison at Different Volumes

Example: 100M Tokens/Month (50M input, 50M output)

Direct API (Gemini Flash):

  • Input: 50M × $0.075 = $3.75
  • Output: 50M × $0.30 = $15
  • Total: $18.75/month

Cloud Platform Premium (~20% higher):

  • ~$22.50/month

Self-Hosted (Llama 4 on cloud GPU):

  • GPU VM: $2,000-2,500/month
  • Engineering (0.25 FTE): $2,000-3,500/month
  • Total: $4,000-6,000/month

Winner: Direct API (self-hosted is roughly 200-300x more expensive at this volume)


Example: 10B Tokens/Month (5B input, 5B output)

Direct API (Gemini Flash):

  • Input: 5B × $0.075 = $375
  • Output: 5B × $0.30 = $1,500
  • Total: $1,875/month

Cloud Platform Premium (~20%):

  • ~$2,250/month

Self-Hosted:

  • GPU infrastructure (multiple): $5,000-7,000/month
  • Engineering (0.5-1 FTE): $4,000-7,000/month
  • Total: $9,000-14,000/month

Winner: Still the API, but the gap is narrowing. At roughly 5-10x this volume, self-hosted becomes competitive.


Example: 100B Tokens/Month (50B input, 50B output)

Direct API (Gemini Flash):

  • Input: 50B × $0.075 = $3,750
  • Output: 50B × $0.30 = $15,000
  • Total: $18,750/month

Self-Hosted:

  • GPU infrastructure (scaled): $10,000-15,000/month
  • Engineering (1-2 FTE): $8,000-14,000/month
  • Total: $18,000-29,000/month

Winner: Competitive. Self-hosted now justifiable, especially for data sovereignty benefits. At higher volumes, self-hosted wins clearly.
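The three worked examples can be reproduced with a short script; the self-hosted ranges are this guide's estimates, hard-coded for comparison:

```python
# Reproduces the guide's three cost examples: API cost computed from
# per-token prices vs the guide's self-hosted estimates (hard-coded).
def api_cost_usd(input_b: float, output_b: float,
                 in_price: float = 0.075, out_price: float = 0.30) -> float:
    """API cost for volumes in billions of tokens; prices are per 1M tokens."""
    return input_b * 1_000 * in_price + output_b * 1_000 * out_price

# label: (input billions, output billions, self-hosted $/month range)
examples = {
    "100M tokens": (0.05, 0.05, (4_000, 6_000)),
    "10B tokens":  (5.0, 5.0, (9_000, 14_000)),
    "100B tokens": (50.0, 50.0, (18_000, 29_000)),
}
for label, (inp, out, (sh_low, sh_high)) in examples.items():
    api = api_cost_usd(inp, out)
    print(f"{label}/month: API ${api:,.2f} vs self-hosted ${sh_low:,}-${sh_high:,}")
```

The crossover is visible directly: the API wins by orders of magnitude at 100M tokens and only reaches parity around 100B tokens per month.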

Implementation Checklist

Direct API Deployment

  • Select provider (OpenAI, Claude, Gemini, DeepSeek)
  • Review terms of service and data usage policies
  • Obtain API keys
  • Implement authentication and rate limiting
  • Set up billing alerts
  • Test with sample requests
  • Implement error handling and retries
  • Monitor usage and costs
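Two of the checklist items, error handling and retries, can be sketched with a generic backoff wrapper. The transport call is stubbed here because the real call depends on your provider SDK; `with_retries` and `TransientAPIError` are names invented for this sketch:

```python
# Minimal retry-with-exponential-backoff wrapper for API calls.
# The actual network call is stubbed; in practice it would be a
# provider SDK call or an HTTP request.
import random
import time

class TransientAPIError(Exception):
    """Stand-in for retryable failures (rate limits / 429, server 5xx)."""

def with_retries(fn, max_attempts: int = 5, base_delay: float = 0.5):
    """Call fn(), retrying transient failures with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientAPIError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Backoff: base * 2^attempt, scaled by random jitter (50-150%)
            delay = base_delay * (2 ** attempt) * (0.5 + random.random())
            time.sleep(delay)

# Demo: a stubbed call that fails twice, then succeeds.
calls = {"n": 0}
def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientAPIError("rate limited")
    return "ok"

print(with_retries(flaky_call, base_delay=0.01))  # "ok" after 2 retries
```

Non-retryable errors (authentication failures, invalid requests) should be excluded from the retry loop and raised immediately.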

Cloud Platform Deployment

  • Choose platform (Azure, AWS, Google)
  • Provision AI service (AI Foundry, Bedrock, Vertex AI)
  • Configure IAM and access controls
  • Set up networking (VPCs, private endpoints if needed)
  • Integrate with existing cloud services
  • Configure monitoring and logging
  • Establish cost controls and budgets
  • Sign BAA if HIPAA required
  • Verify compliance certifications
  • Test deployment with sample workloads

Self-Hosted Deployment

  • Select model (Llama 4, Mistral, DeepSeek)
  • Provision GPU infrastructure (cloud VMs or on-premise)
  • Install model serving framework (vLLM, TensorRT-LLM, Ollama)
  • Deploy and test model
  • Implement load balancing and scaling
  • Set up monitoring (performance, errors, resource usage)
  • Establish security controls (access, encryption)
  • Plan update and maintenance procedures
  • Train team on operations
  • Document runbooks for common issues

Common Pitfalls to Avoid

Choosing self-hosted too early: Infrastructure costs far exceed API costs at low-medium volumes. Only self-host when volume justifies or constraints require.

Ignoring compliance until late: HIPAA, GDPR, data residency requirements eliminate or constrain deployment options. Address early.

Underestimating self-hosted operational burden: Self-hosting requires ongoing engineering time, at minimum 0.25-0.5 FTE and often more.

Not calculating full TCO: Compare apples-to-apples including infrastructure, engineering time, support, and opportunity costs.

Vendor lock-in without realizing: Deep cloud platform integration creates switching costs. Maintain abstraction layer if portability matters.

Direct API for regulated data: HIPAA, classified data, strict GDPR often require cloud platforms with BAAs or self-hosting, not direct APIs.

Summary

Choose Direct API when:

  • Volume low (<10M tokens/month)
  • No strict compliance requirements
  • Speed and simplicity prioritized
  • Limited technical resources

Choose Cloud Platform when:

  • HIPAA, GDPR, or compliance frameworks required
  • Medium-high volume (10M-10B tokens/month)
  • Enterprise support and SLA needed
  • Existing cloud investment to leverage

Choose Self-Hosted when:

  • Data must stay on-premise (sovereignty, classification)
  • Very high volume (>10-50B tokens/month)
  • Need maximum customization
  • Strong ML ops capability exists

For most organizations: Start with Direct API (learn fast, low risk), graduate to Cloud Platforms as volume and compliance needs grow, consider self-hosted only when volume justifies or constraints require.

The optimal strategy is often hybrid: Use different deployment models for different use cases based on sensitivity, volume, and requirements.