AI Model Deployment with Azure Developer CLI

Chapter Navigation:

📚 Course Home: AZD For Beginners
📖 Current Chapter: Chapter 2 - AI-First Development
⬅️ Previous: Microsoft Foundry Integration
➡️ Next: AI Workshop Lab
🚀 Next Chapter: Chapter 3: Configuration

This guide explains how to deploy AI models with AZD templates, from model selection through production deployment patterns.

Validation note (2026-03-25): The AZD workflow in this guide was checked against azd 1.23.12. For AI deployments that take longer than the default service deployment window, current AZD releases support azd deploy --timeout <seconds>.

Model Selection Strategy
AZD Configuration for AI Models
Deployment Patterns
Model Management
Production Considerations
Monitoring and Observability

Model Selection Strategy

Microsoft Foundry Models

Choose the right model for your use case:

# azure.yaml - Model configuration
services:
  ai-service:
    project: ./infra
    host: containerapp
    config:
      AZURE_OPENAI_MODELS: |
        [
          {
            "name": "gpt-4.1-mini",
            "version": "2025-04-14",
            "deployment": "gpt-4.1-mini",
            "capacity": 10,
            "format": "OpenAI"
          },
          {
            "name": "text-embedding-3-large",
            "version": "1",
            "deployment": "text-embedding-3-large",
            "capacity": 30,
            "format": "OpenAI"
          }
        ]
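
The application can validate this setting at startup instead of failing on the first model call. The following is a minimal sketch, assuming the AZURE_OPENAI_MODELS value reaches the process as a string holding the JSON array shown above. `ModelDeployment` and `parse_model_config` are hypothetical helper names, not part of AZD.

```python
import json
from dataclasses import dataclass

# Keys used by each entry in the AZURE_OPENAI_MODELS array above
REQUIRED_KEYS = ("name", "version", "deployment", "capacity", "format")


@dataclass(frozen=True)
class ModelDeployment:
    """One entry of the model list (hypothetical helper type)."""
    name: str
    version: str
    deployment: str
    capacity: int
    format: str


def parse_model_config(raw: str) -> list[ModelDeployment]:
    """Parse the JSON array and fail fast when an entry is incomplete."""
    entries = json.loads(raw)
    if not isinstance(entries, list):
        raise ValueError("AZURE_OPENAI_MODELS must be a JSON array")
    models = []
    for entry in entries:
        missing = set(REQUIRED_KEYS) - set(entry)
        if missing:
            raise ValueError(f"model entry is missing keys: {sorted(missing)}")
        models.append(ModelDeployment(**{key: entry[key] for key in REQUIRED_KEYS}))
    return models


if __name__ == "__main__":
    example = (
        '[{"name": "text-embedding-3-large", "version": "1", '
        '"deployment": "text-embedding-3-large", "capacity": 30, "format": "OpenAI"}]'
    )
    for model in parse_model_config(example):
        print(model.deployment, model.capacity)
```

Failing at startup keeps a malformed model list from surfacing later as an opaque request error.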

Model Capacity Planning

Model Type	Use Case	Recommended Capacity (units of 1,000 TPM)	Cost Considerations
gpt-4.1-mini	Chat, Q&A	10-50	Cost-effective for most workloads
gpt-4.1	Complex reasoning	20-100	Higher cost, use for premium features
text-embedding-3-large	Search, RAG	30-120	Strong default choice for semantic search and retrieval
Whisper	Speech-to-text	10-50	Audio processing workloads

AZD Configuration for AI Models

Bicep Template Configuration

Create model deployments through Bicep templates:

// infra/main.bicep
@description('OpenAI model deployments')
param openAiModelDeployments array = [
  {
    name: 'gpt-4.1-mini'
    model: {
      format: 'OpenAI'
      name: 'gpt-4.1-mini'
      version: '2025-04-14'
    }
    sku: {
      name: 'Standard'
      capacity: 10
    }
  }
  {
    name: 'text-embedding-3-large'
    model: {
      format: 'OpenAI'
      name: 'text-embedding-3-large'
      version: '1'
    }
    sku: {
      name: 'Standard'
      capacity: 30
    }
  }
]

resource openAi 'Microsoft.CognitiveServices/accounts@2023-05-01' = {
  name: openAiAccountName
  location: location
  kind: 'OpenAI'
  properties: {
    customSubDomainName: openAiAccountName
    networkAcls: {
      defaultAction: 'Allow'
    }
    publicNetworkAccess: 'Enabled'
  }
  sku: {
    name: 'S0'
  }
}

@batchSize(1)
resource deployment 'Microsoft.CognitiveServices/accounts/deployments@2023-05-01' = [for deployment in openAiModelDeployments: {
  parent: openAi
  name: deployment.name
  properties: {
    model: deployment.model
  }
  sku: deployment.sku
}]

Environment Variables

Configure your application environment:

# .env configuration
AZURE_OPENAI_ENDPOINT=https://your-openai-resource.openai.azure.com/
AZURE_OPENAI_API_VERSION=2024-02-15-preview
AZURE_OPENAI_CHAT_DEPLOYMENT=gpt-4.1-mini
AZURE_OPENAI_EMBED_DEPLOYMENT=text-embedding-3-large
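
A minimal sketch of how application code might load these four variables, assuming they are already present in the process environment. `OpenAISettings` and `load_settings` are hypothetical names introduced for illustration.

```python
import os
from dataclasses import dataclass

# Maps settings fields to the variable names from the .env example above
ENV_NAMES = {
    "endpoint": "AZURE_OPENAI_ENDPOINT",
    "api_version": "AZURE_OPENAI_API_VERSION",
    "chat_deployment": "AZURE_OPENAI_CHAT_DEPLOYMENT",
    "embed_deployment": "AZURE_OPENAI_EMBED_DEPLOYMENT",
}


@dataclass(frozen=True)
class OpenAISettings:
    endpoint: str
    api_version: str
    chat_deployment: str
    embed_deployment: str


def load_settings(env=None) -> OpenAISettings:
    """Read the variables and report every missing one in a single error."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    values = {field: env[name] for field, name in ENV_NAMES.items()}
    # Drop the trailing slash so later URL joins do not produce "//"
    values["endpoint"] = values["endpoint"].rstrip("/")
    return OpenAISettings(**values)
```

Passing a plain dictionary as `env` makes the loader easy to exercise in unit tests without touching the real environment.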

Deployment Patterns

Pattern 1: Single-Region Deployment

# azure.yaml - Single region
services:
  ai-app:
    project: ./src
    host: containerapp
    config:
      AZURE_OPENAI_ENDPOINT: ${AZURE_OPENAI_ENDPOINT}
      AZURE_OPENAI_CHAT_DEPLOYMENT: gpt-4.1-mini

Best for:

Development and testing
Single-market applications
Cost optimization

Pattern 2: Multi-Region Deployment

// Multi-region deployment
param regions array = ['eastus2', 'westus2', 'francecentral']

resource openAiMultiRegion 'Microsoft.CognitiveServices/accounts@2023-05-01' = [for region in regions: {
  name: '${openAiAccountName}-${region}'
  location: region
  // ... configuration
}]

Best for:

Global applications
High availability requirements
Load distribution
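
With one account per region, the client needs a rule for choosing an endpoint. The following is one possible sketch of ordered failover, not something AZD provides: it tries each regional endpoint in turn and returns the first successful result. The `send` callable and the endpoint hostnames in the usage example are assumptions for illustration.

```python
from typing import Callable, Sequence


def call_with_failover(endpoints: Sequence[str], send: Callable[[str], str]) -> str:
    """Try each regional endpoint in order and return the first success."""
    errors = []
    for endpoint in endpoints:
        try:
            return send(endpoint)
        except Exception as exc:  # broad on purpose: any failure moves to the next region
            errors.append(f"{endpoint}: {exc}")
    raise RuntimeError("all regions failed: " + "; ".join(errors))


if __name__ == "__main__":
    # Hypothetical hostnames following the '${openAiAccountName}-${region}' naming above
    regions = [
        "https://myaccount-eastus2.openai.azure.com",
        "https://myaccount-westus2.openai.azure.com",
    ]

    def fake_send(endpoint: str) -> str:
        # Stand-in for a real request; simulates an outage in the first region
        if "eastus2" in endpoint:
            raise ConnectionError("region unavailable")
        return f"handled by {endpoint}"

    print(call_with_failover(regions, fake_send))
```

Ordered failover is the simplest policy; latency-based or weighted routing would need a different selection rule.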

Pattern 3: Hybrid Deployment

Combine Microsoft Foundry Models with other AI services:

// Hybrid AI services
resource cognitiveServices 'Microsoft.CognitiveServices/accounts@2023-05-01' = {
  name: cognitiveServicesName
  location: location
  kind: 'CognitiveServices'
  properties: {
    customSubDomainName: cognitiveServicesName
  }
  sku: {
    name: 'S0'
  }
}

resource documentIntelligence 'Microsoft.CognitiveServices/accounts@2023-05-01' = {
  name: documentIntelligenceName
  location: location
  kind: 'FormRecognizer'
  properties: {
    customSubDomainName: documentIntelligenceName
  }
  sku: {
    name: 'S0'
  }
}

Model Management

Version Control

Track model versions in your AZD configuration:

{
  "models": {
    "chat": {
      "name": "gpt-4.1-mini",
      "version": "2025-04-14",
      "fallback": "gpt-4.1"
    },
    "embedding": {
      "name": "text-embedding-3-large",
      "version": "1"
    }
  }
}
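
The `fallback` field only helps if something reads it. A minimal sketch of one way to resolve a model name from this configuration follows, assuming the caller already knows which models are deployed. `resolve_model` is a hypothetical helper.

```python
def resolve_model(config: dict, role: str, available: set[str]) -> str:
    """Return the preferred model for a role, or its fallback if configured."""
    entry = config["models"][role]
    if entry["name"] in available:
        return entry["name"]
    fallback = entry.get("fallback")
    if fallback and fallback in available:
        return fallback
    raise LookupError(f"no available model for role '{role}'")


if __name__ == "__main__":
    config = {
        "models": {
            "chat": {"name": "gpt-4.1-mini", "fallback": "gpt-4.1"},
            "embedding": {"name": "text-embedding-3-large"},
        }
    }
    # The preferred chat model is not deployed here, so the fallback is chosen
    print(resolve_model(config, "chat", {"gpt-4.1", "text-embedding-3-large"}))
```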

Model Updates

Use AZD hooks for model updates:

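#!/bin/bash
# hooks/predeploy.sh
set -euo pipefail

echo "Checking model availability..."
# List the models offered by the account and keep only the one this app needs
az cognitiveservices account list-models \
  --name "$AZURE_OPENAI_ACCOUNT_NAME" \
  --resource-group "$AZURE_RESOURCE_GROUP" \
  --query "[?name=='gpt-4.1-mini']"

# If the deployment takes longer than the default timeout, raise the limit when
# you run the deploy from your terminal. Do not call azd deploy inside this
# hook: azd runs predeploy as part of deploy, so the hook would re-enter itself.
#   azd deploy --timeout 1800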
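
The hook above prints the query result but does not fail when the model is absent. One possible sketch of a stricter check follows, assuming the unfiltered JSON output of the list-models command is a JSON array whose items carry a `name` field (the same field the `--query` expression filters on). `model_is_available` is a hypothetical helper.

```python
import json


def model_is_available(list_models_json: str, model_name: str) -> bool:
    """Return True if the CLI output contains a model with the given name."""
    models = json.loads(list_models_json)
    return any(item.get("name") == model_name for item in models)


if __name__ == "__main__":
    # Sample data standing in for real CLI output
    sample = '[{"name": "gpt-4.1-mini"}, {"name": "text-embedding-3-large"}]'
    if not model_is_available(sample, "gpt-4.1-mini"):
        raise SystemExit("required model is not available in this account")
    print("model available")
```

A hook script could pipe the CLI output into a check like this and exit non-zero to stop the deployment.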

A/B Testing

Use a parameter to switch the chat deployment's name and capacity when running an A/B test:

param enableABTesting bool = false

resource chatDeployment 'Microsoft.CognitiveServices/accounts/deployments@2023-05-01' = {
  parent: openAi
  name: 'gpt-4.1-mini-${enableABTesting ? 'v1' : 'prod'}'
  properties: {
    model: {
      format: 'OpenAI'
      name: 'gpt-4.1-mini'
      version: '2025-04-14'
    }
  }
  sku: {
    name: 'Standard'
    capacity: enableABTesting ? 5 : 10
  }
}
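
The template only creates the deployment; splitting traffic between variants happens in application code. A minimal sketch of deterministic, weight-based assignment follows, under the assumption that each user should always see the same variant. `pick_deployment` is hypothetical, and the second deployment name in the usage example does not come from the template above.

```python
import hashlib


def pick_deployment(user_id: str, variants: dict[str, int]) -> str:
    """Map a user to a deployment name in proportion to integer weights."""
    total = sum(variants.values())
    if total <= 0:
        raise ValueError("at least one variant needs a positive weight")
    # Hash the user id so the same user always lands in the same bucket
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    for name, weight in variants.items():
        if bucket < weight:
            return name
        bucket -= weight
    raise AssertionError("unreachable: bucket is always below the total weight")


if __name__ == "__main__":
    # 'gpt-4.1-mini-v2' is a hypothetical second variant for illustration
    split = {"gpt-4.1-mini-v1": 50, "gpt-4.1-mini-v2": 50}
    print(pick_deployment("user-123", split))
```

Hashing instead of random choice keeps a user's experience stable across requests and makes results attributable to one variant.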

Production Considerations

Capacity Planning

Calculate required capacity based on usage patterns:

# Capacity calculation example
def calculate_required_capacity(
    requests_per_minute: int,
    avg_prompt_tokens: int,
    avg_completion_tokens: int,
    safety_margin: float = 0.2
) -> int:
    """Calculate required TPM capacity."""
    total_tokens_per_request = avg_prompt_tokens + avg_completion_tokens
    total_tpm = requests_per_minute * total_tokens_per_request
    return int(total_tpm * (1 + safety_margin))

# Example usage
required_capacity = calculate_required_capacity(
    requests_per_minute=10,
    avg_prompt_tokens=500,
    avg_completion_tokens=200,
    safety_margin=0.3
)
print(f"Required capacity: {required_capacity} TPM")
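
The result above is in tokens per minute, while the `capacity` value in a deployment SKU is a unit count. A minimal sketch of the conversion follows, assuming a Standard deployment where one capacity unit corresponds to 1,000 TPM; verify this ratio for your model and deployment type. `tpm_to_capacity_units` is a hypothetical helper.

```python
import math

# Assumption: one Standard capacity unit corresponds to 1,000 tokens per minute
TOKENS_PER_CAPACITY_UNIT = 1000


def tpm_to_capacity_units(required_tpm: int) -> int:
    """Round a TPM requirement up to whole capacity units (minimum 1)."""
    if required_tpm < 0:
        raise ValueError("required_tpm must not be negative")
    return max(1, math.ceil(required_tpm / TOKENS_PER_CAPACITY_UNIT))


if __name__ == "__main__":
    # 10 requests/min * 700 tokens * 1.3 safety margin = 9,100 TPM
    print(tpm_to_capacity_units(9100))  # prints 10
```

Rounding up rather than to the nearest unit keeps the safety margin from being silently eroded.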

Auto-scaling Configuration

Configure auto-scaling for Container Apps:

resource containerApp 'Microsoft.App/containerApps@2024-03-01' = {
  name: containerAppName
  properties: {
    template: {
      scale: {
        minReplicas: 1
        maxReplicas: 10
        rules: [
          {
            name: 'http-rule'
            http: {
              metadata: {
                concurrentRequests: '10'
              }
            }
          }
          {
            name: 'cpu-rule'
            custom: {
              type: 'cpu'
              metadata: {
                type: 'Utilization'
                value: '70'
              }
            }
          }
        ]
      }
    }
  }
}
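
To reason about the HTTP rule, it can help to estimate the replica count for a given load. The following is a simplified sketch, assuming the scaler targets 10 concurrent requests per replica and clamps the result to the configured bounds; the real scaler also evaluates the CPU rule and applies its own timing behavior. `estimate_replicas` is a hypothetical helper.

```python
import math


def estimate_replicas(
    concurrent_requests: int,
    per_replica: int = 10,   # concurrentRequests in the http-rule above
    min_replicas: int = 1,   # minReplicas
    max_replicas: int = 10,  # maxReplicas
) -> int:
    """Estimate replicas needed for a load, clamped to the scale bounds."""
    desired = math.ceil(concurrent_requests / per_replica)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    for load in (0, 35, 500):
        print(load, estimate_replicas(load))
```

Any load above 100 concurrent requests saturates at 10 replicas with these settings, which is the point where `maxReplicas` or the per-replica target needs revisiting.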

Cost Optimization

Implement cost controls:

@description('Enable cost management alerts')
param enableCostAlerts bool = true

resource budgetAlert 'Microsoft.Consumption/budgets@2023-05-01' = if (enableCostAlerts) {
  name: 'ai-budget-alert'
  properties: {
    timePeriod: {
      startDate: '2024-01-01'
      endDate: '2024-12-31'
    }
    timeGrain: 'Monthly'
    amount: 1000
    category: 'Cost'
    notifications: {
      Actual_GreaterThan_80_Percent: {
        enabled: true
        operator: 'GreaterThan'
        threshold: 80
        contactEmails: [
          'admin@yourcompany.com'
        ]
      }
    }
  }
}
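
The notification above uses the `GreaterThan` operator with a threshold of 80, expressed as a percentage of the budget amount. A small sketch of that arithmetic follows, for checking when the alert would fire; `budget_alert_fires` is a hypothetical helper and does not call any Azure API.

```python
def budget_alert_fires(
    actual_spend: float,
    budget_amount: float = 1000.0,    # amount in the budget above
    threshold_percent: float = 80.0,  # threshold in the notification above
) -> bool:
    """Return True when spend is strictly above the threshold share of the budget."""
    trigger_at = budget_amount * threshold_percent / 100.0
    return actual_spend > trigger_at


if __name__ == "__main__":
    print(budget_alert_fires(800.0))   # exactly at the threshold: no alert
    print(budget_alert_fires(800.01))  # above the threshold: alert
```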

Monitoring and Observability

Application Insights Integration

Configure monitoring for AI workloads:

resource applicationInsights 'Microsoft.Insights/components@2020-02-02' = {
  name: applicationInsightsName
  location: location
  kind: 'web'
  properties: {
    Application_Type: 'web'
    WorkspaceResourceId: logAnalyticsWorkspace.id
  }
}

// Custom metrics for AI models
resource aiMetrics 'Microsoft.Insights/components/analyticsItems@2020-02-02' = {
  parent: applicationInsights
  name: 'ai-model-metrics'
  properties: {
    content: '''
      customEvents
      | where name == "AI_Model_Request"
      | extend model = tostring(customDimensions.model)
      | extend tokens = toint(customDimensions.tokens)
      | extend latency = toint(customDimensions.latency_ms)
      | summarize 
          requests = count(),
          avg_tokens = avg(tokens),
          avg_latency = avg(latency)
        by model, bin(timestamp, 5m)
    '''
    type: 'query'
    scope: 'shared'
  }
}

Custom Metrics

Track AI-specific metrics:

# Custom telemetry for AI models
# Requires the `applicationinsights` package (pip install applicationinsights)
from applicationinsights import TelemetryClient

class AITelemetry:
    def __init__(self, instrumentation_key: str):
        self.client = TelemetryClient(instrumentation_key)
    
    def track_model_request(self, model: str, tokens: int, latency_ms: int, success: bool):
        """Track AI model request metrics."""
        self.client.track_event(
            'AI_Model_Request',
            {
                'model': model,
                'tokens': str(tokens),
                'latency_ms': str(latency_ms),
                'success': str(success)
            }
        )
        
    def track_model_error(self, model: str, error_type: str, error_message: str):
        """Track AI model errors."""
        self.client.track_exception(
            type=error_type,
            value=error_message,
            properties={
                'model': model,
                'component': 'ai_model'
            }
        )
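
Callers still need to measure latency before they can report it. One possible sketch is a context manager that times the wrapped call and hands the result to any callable with the same argument order as `track_model_request` above. `timed_model_call` is a hypothetical helper, and the tracker is injected so the sketch runs without the telemetry SDK.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def timed_model_call(
    model: str,
    track: Callable[[str, int, int, bool], None],
) -> Iterator[dict]:
    """Time the wrapped block and report (model, tokens, latency_ms, success)."""
    result = {"tokens": 0}  # the caller fills in the token count
    success = False
    start = time.perf_counter()
    try:
        yield result
        success = True
    finally:
        # Runs on both success and failure, so errors are still recorded
        latency_ms = int((time.perf_counter() - start) * 1000)
        track(model, result["tokens"], latency_ms, success)


if __name__ == "__main__":
    def print_tracker(model: str, tokens: int, latency_ms: int, success: bool) -> None:
        print(model, tokens, latency_ms, success)

    with timed_model_call("gpt-4.1-mini", print_tracker) as call:
        time.sleep(0.01)      # stand-in for the real model request
        call["tokens"] = 700
```

In an application, `telemetry.track_model_request` could be passed as `track` in place of the printing stand-in.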

Health Checks

Implement AI service health monitoring:

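# Health check endpoints
import os

from fastapi import FastAPI, HTTPException
import httpx

app = FastAPI()

# Connection settings come from the environment; AZURE_OPENAI_API_KEY must be
# set in addition to the variables in the .env example
AZURE_OPENAI_ENDPOINT = os.environ["AZURE_OPENAI_ENDPOINT"].rstrip("/")
AZURE_OPENAI_API_KEY = os.environ["AZURE_OPENAI_API_KEY"]
AZURE_OPENAI_API_VERSION = os.environ["AZURE_OPENAI_API_VERSION"]

@app.get("/health/ai-models")
async def check_ai_models():
    """Check AI model availability."""
    try:
        # Test OpenAI connection
        async with httpx.AsyncClient(timeout=10.0) as client:
            response = await client.get(
                f"{AZURE_OPENAI_ENDPOINT}/openai/deployments",
                params={"api-version": AZURE_OPENAI_API_VERSION},
                headers={"api-key": AZURE_OPENAI_API_KEY},
            )
    except httpx.HTTPError as e:
        # Network failures and timeouts mean the service is unreachable
        raise HTTPException(status_code=503, detail=f"Health check failed: {e}")

    # Checked outside the try block so this error is not caught and re-wrapped
    if response.status_code != 200:
        raise HTTPException(status_code=503, detail="AI models unavailable")
    return {"status": "healthy", "models": response.json()}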

Next Steps

Review the Microsoft Foundry Integration Guide for service integration patterns
Complete the AI Workshop Lab for hands-on experience
Implement Production AI Practices for enterprise deployments
Explore the AI Troubleshooting Guide for common issues

Resources
