S5 Slidefactory - Complete Azure Setup Guide
Last Updated: 2025-12-02
Version: Production-ready
Environment: Azure Container Apps
Table of Contents
- Architecture Overview
- Azure Resources
- Network Configuration
- Container Apps Configuration
- Database Setup
- Redis Configuration
- Storage Configuration
- N8N Queue Mode Setup
- CI/CD Pipeline
- Environment Variables
- Deployment Process
- Monitoring & Observability
- Security Configuration
- Cost Analysis
- Disaster Recovery
- Troubleshooting
Architecture Overview
High-Level Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ AZURE CLOUD │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Azure Container Apps Environment │ │
│ │ Virtual Network: 10.0.0.0/16 │ │
│ ├──────────────────────────────────────────────────────────────────┤ │
│ │ │ │
│ │ ┌─────────────────┐ ┌─────────────────┐ ┌────────────────┐ │ │
│ │ │ Web (Preview) │ │ Web (Prod) │ │ N8N Main │ │ │
│ │ │ slidefactory- │ │ slidefactory- │ │ slidefactory- │ │ │
│ │ │ web-preview │ │ web │ │ n8n │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ FastAPI │ │ FastAPI │ │ Queue Mode │ │ │
│ │ │ Port 8000 │ │ Port 8000 │ │ UI/API/Hooks │ │ │
│ │ │ 1-3 replicas │ │ 2-5 replicas │ │ 1 replica │ │ │
│ │ └─────────────────┘ └─────────────────┘ └────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────┐ ┌─────────────────┐ ┌────────────────┐ │ │
│ │ │ Worker (Preview)│ │ Worker (Prod) │ │ N8N Workers │ │ │
│ │ │ slidefactory- │ │ slidefactory- │ │ slidefactory- │ │ │
│ │ │ worker-preview │ │ worker │ │ n8n-worker │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ Celery │ │ Celery │ │ Queue Workers │ │ │
│ │ │ Background │ │ Background │ │ 2-10 replicas │ │ │
│ │ │ 1-2 replicas │ │ 1-3 replicas │ │ Auto-scaling │ │ │
│ │ └─────────────────┘ └─────────────────┘ └────────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Private Endpoints (10.0.2.0/24) │ │
│ ├──────────────────────────────────────────────────────────────────┤ │
│ │ │ │
│ │ ┌─────────────────┐ ┌─────────────────┐ ┌────────────────┐ │ │
│ │ │ PostgreSQL │ │ Redis Cache │ │ Blob Storage │ │ │
│ │ │ 10.0.2.4:5432 │ │ 10.0.2.5:6380 │ │ 10.0.2.6 │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ - slidefactory │ │ DB 2: Celery │ │ presentations │ │ │
│ │ │ - n8n │ │ DB 6: N8N │ │ templates │ │ │
│ │ │ pgvector ext. │ │ │ │ documents │ │ │
│ │ └─────────────────┘ └─────────────────┘ └────────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Azure Container Registry (ACR) │ │
│ │ slidefactoryacr.azurecr.io │ │
│ │ - slidefactory:preview-{sha} │ │
│ │ - slidefactory:prod-{sha} │ │
│ │ - n8n-custom:1.121.3 │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
▲
│
┌───────┴────────┐
│ GitHub Actions │
│ CI/CD Pipeline │
└────────────────┘
Deployment Flow
Developer → GitHub (preview/main) → GitHub Actions → Build Docker Image →
ACR Push → Container App Update → Rolling Deployment
Data Flow
User Request → FastAPI (Web) → PostgreSQL/Redis/Storage
→ N8N API (for workflows)
→ Celery (via Worker) → Background Tasks
N8N Workflow → N8N Main (queue mode) → Redis Queue (DB 6) →
N8N Workers (2-10 replicas) → Execute Workflows
Azure Resources
Resource Group Structure
Production Resource Group: rg-slidefactory-prod-001
All resources (both preview and production environments) are in a single resource group.
Complete Resource List
| Resource Name | Type | Environment | Purpose |
|---|---|---|---|
| slidefactory-web-preview | Container App | Preview | FastAPI web application (preview) |
| slidefactory-worker-preview | Container App | Preview | Celery background worker (preview) |
| slidefactory-web | Container App | Production | FastAPI web application (prod) |
| slidefactory-worker | Container App | Production | Celery background worker (prod) |
| slidefactory-n8n | Container App | Shared | N8N main instance (queue mode) |
| slidefactory-n8n-worker | Container App | Shared | N8N worker pool (2-10 replicas) |
| slidefactory-postgres | PostgreSQL Flexible Server | Shared | Database (both envs + n8n) |
| slidefactory-redis | Azure Cache for Redis | Shared | Cache + Celery + N8N queue |
| slidefactoryprod | Storage Account | Shared | Blob storage (all envs) |
| slidefactoryacr | Container Registry | Shared | Docker images |
| log-slidefactory | Log Analytics Workspace | Shared | Centralized logging |
| appi-slidefactory | Application Insights | Shared | Application telemetry |
Azure Subscription
- Subscription ID: 022ab726-cdb5-4a02-bf2b-bea8d87d8e83
- Region: West Europe
- Resource Group: rg-slidefactory-prod-001
Network Configuration
Virtual Network
VNET: vnet-slidefactory (10.0.0.0/16)
Subnets:
- Apps Subnet: snet-apps (10.0.1.0/24) - Container Apps
- Data Subnet: snet-data (10.0.2.0/24) - Private endpoints
Private Endpoints
| Service | IP Address | DNS Zone |
|---|---|---|
| PostgreSQL | 10.0.2.4:5432 | privatelink.postgres.database.azure.com |
| Redis | 10.0.2.5:6380 | privatelink.redis.cache.windows.net |
| Blob Storage | 10.0.2.6 | privatelink.blob.core.windows.net |
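The address plan above can be sanity-checked mechanically. The following is a minimal sketch using only Python's standard `ipaddress` module; the helper name `misplaced_endpoints` is hypothetical and not part of the Slidefactory codebase.

```python
import ipaddress

# Address plan as documented in this guide.
VNET = ipaddress.ip_network("10.0.0.0/16")
APPS_SUBNET = ipaddress.ip_network("10.0.1.0/24")
DATA_SUBNET = ipaddress.ip_network("10.0.2.0/24")

PRIVATE_ENDPOINTS = {
    "PostgreSQL": "10.0.2.4",
    "Redis": "10.0.2.5",
    "Blob Storage": "10.0.2.6",
}


def misplaced_endpoints(endpoints, subnet=DATA_SUBNET):
    """Return the services whose IP lies outside the given subnet."""
    return [
        name
        for name, ip in endpoints.items()
        if ipaddress.ip_address(ip) not in subnet
    ]


# Both subnets must sit inside the VNET and must not overlap each other.
assert APPS_SUBNET.subnet_of(VNET) and DATA_SUBNET.subnet_of(VNET)
assert not APPS_SUBNET.overlaps(DATA_SUBNET)

print(misplaced_endpoints(PRIVATE_ENDPOINTS))  # prints [] for the documented plan
```

A check like this is cheap to run whenever a private endpoint is added or a subnet is resized.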
Firewall Rules
PostgreSQL:
- Allow Azure services: Yes
- Allow Container Apps subnet: 10.0.1.0/24
Redis:
- Private endpoint only
- TLS required
- Port: 6380 (SSL)
Storage Account:
- Public access: Enabled (with SAS tokens)
- Allow Azure services: Yes
- Allow Container Apps subnet: 10.0.1.0/24
DNS Configuration
- Private DNS zones automatically created for private endpoints
- Azure-provided DNS for Container Apps
- Custom domain not configured (using Azure-provided URLs)
Container Apps Configuration
Container Apps Environment
- Environment: One shared environment for all container apps
- Virtual Network Integration: Yes (10.0.1.0/24)
- Log Analytics: Enabled
- Dapr: Not used
Web Service (Preview)
Name: slidefactory-web-preview
URL: https://slidefactory-web-preview.thankfulsmoke-fef50a06.westeurope.azurecontainerapps.io
Image: slidefactoryacr.azurecr.io/slidefactory:preview-{sha}
Resources:
- CPU: 0.5 cores
- Memory: 1.0 GB
- Min Replicas: 1
- Max Replicas: 3
Scaling Rules:
- CPU Utilization: >75% triggers scale up
- HTTP Requests: >100 concurrent requests
Health Probes:
- Liveness: GET /health (30s interval, 3 retries)
- Readiness: GET /health (10s interval, 3 retries)
- Startup: GET /health (5s interval, 60s timeout)
Ingress:
- External ingress enabled
- Target port: 8000
- Transport: HTTP/2
- Allow insecure: No (HTTPS only)
Command: /code/scripts/start-web-azure.sh
Web Service (Production)
Name: slidefactory-web
URL: https://slidefactory-web.thankfulsmoke-fef50a06.westeurope.azurecontainerapps.io
Image: slidefactoryacr.azurecr.io/slidefactory:prod-{sha}
Resources:
- CPU: 1.0 cores
- Memory: 2.0 GB
- Min Replicas: 2
- Max Replicas: 5
Scaling Rules:
- CPU Utilization: >70% triggers scale up
- HTTP Requests: >200 concurrent requests
Health Probes: Same as preview
Ingress: Same as preview
Command: /code/scripts/start-web-azure.sh
Worker Service (Preview)
Name: slidefactory-worker-preview
Image: slidefactoryacr.azurecr.io/slidefactory:preview-{sha}
Resources:
- CPU: 0.5 cores
- Memory: 1.0 GB
- Min Replicas: 1
- Max Replicas: 2
Scaling Rules:
- CPU Utilization: >80% triggers scale up
Ingress: Internal only (port 8080 for health checks)
Command: /code/scripts/start-worker-azure-healthcheck.sh
Worker Service (Production)
Name: slidefactory-worker
Image: slidefactoryacr.azurecr.io/slidefactory:prod-{sha}
Resources:
- CPU: 1.0 cores
- Memory: 2.0 GB
- Min Replicas: 1
- Max Replicas: 3
Scaling Rules:
- CPU Utilization: >75% triggers scale up
Ingress: Internal only (port 8080 for health checks)
Command: /code/scripts/start-worker-azure-healthcheck.sh
Database Setup
PostgreSQL Flexible Server
Name: slidefactory-postgres (internal reference)
Host: 10.0.2.4
Port: 5432
Version: PostgreSQL 15
Tier:
- SKU: General Purpose (D2s_v3)
- vCores: 2
- Memory: 8 GB
- Storage: 128 GB (auto-grow enabled)
Databases:
1. slidefactory - Main application database
2. n8n - N8N workflow database
Extensions:
- pgvector - Vector similarity search (for RAG/context system)
Configuration Parameters:
- max_connections: 100
- work_mem: 16MB
- shared_buffers: 2GB
- effective_cache_size: 6GB
Backup:
- Automated backups: Enabled
- Retention: 30 days
- Geo-redundant: Yes (production)
- Point-in-time restore: Available
Networking:
- Public access: Disabled
- Private endpoint: 10.0.2.4
- SSL/TLS: Required
- Firewall: Allow Azure services
Maintenance:
- Window: Sunday 02:00-06:00 UTC
- Minor version updates: Automatic
- Major version updates: Manual
Admin User: dbadmin
Connection String Format: postgresql://dbadmin:{password}@10.0.2.4:5432/{database}?sslmode=require
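Passwords that contain characters such as `@` or `/` break the URL form shown above unless they are percent-encoded. A minimal sketch of building the documented connection string safely follows; the function name `postgres_url` is hypothetical, and the defaults simply mirror the values in this section.

```python
from urllib.parse import quote


def postgres_url(password, database, user="dbadmin", host="10.0.2.4", port=5432):
    """Build the documented connection string with percent-encoded credentials."""
    # safe="" also encodes "/" so the password cannot be mistaken for a path.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}?sslmode=require"
    )


# Example with a placeholder password (not a real credential).
print(postgres_url("p@ss/word", "n8n"))
```

The same helper covers both databases on the server (`slidefactory` and `n8n`) by changing only the `database` argument.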
Database Schema (Slidefactory)
Managed via Alembic migrations. Key tables:
Users & Auth:
- users - User accounts
- api_keys - API authentication tokens
- sessions - User sessions (in-memory/Redis)
Templates & Presentations:
- templates - PowerPoint template metadata
- n8n_processes - N8N workflow execution tracking
- presentations - Generated presentation metadata
Context/RAG System:
- context_documents - Uploaded documents
- context_chunks - Document chunks with pgvector embeddings
- contexts - Context collections for RAG
Migrations:
- Location: alembic/versions/
- Applied automatically on web container startup via start-web-azure.sh
Database Initialization
For Fresh Database:
# Run init.sql to create databases and extensions
psql -U postgres -f init.sql
# Mark as Alembic-managed
alembic stamp head
# Apply any pending migrations
alembic upgrade head
For Existing Database (production migration):
Redis Configuration
Azure Cache for Redis
Name: slidefactory-redis
Host: slidefactory-redis.redis.cache.windows.net
Port: 6380 (SSL)
Version: Redis 6.x
Tier:
- SKU: Standard C2
- Memory: 2.5 GB
- Replicas: 1
Database Allocation:
- DB 0: Default (unused)
- DB 1: Application cache (unused)
- DB 2: Celery broker + result backend (Slidefactory)
- DB 6: N8N queue mode (Bull queue)
Configuration:
- TLS: Required
- Port: 6380 (SSL only, 6379 disabled)
- Eviction policy (max memory policy): allkeys-lru (evict least recently used keys)
- Persistence: RDB snapshots (hourly)
Networking:
- Public access: Disabled
- Private endpoint: 10.0.2.5
- TLS: Required
Connection String Format:
- Slidefactory (DB 2): redis://:password@slidefactory-redis.redis.cache.windows.net:6380/2?ssl=true
- N8N Queue (DB 6): redis://:password@slidefactory-redis.redis.cache.windows.net:6380/6?ssl=true
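Because Celery and N8N share one Redis instance, a URL pointing at the wrong database index is an easy mistake to make. The sketch below parses a connection string in the format documented above and reports which allocation it targets; `redis_target` and `ALLOCATION` are hypothetical names introduced for illustration.

```python
from urllib.parse import urlsplit

# Database allocation as documented in this section.
ALLOCATION = {
    2: "Celery broker + result backend",
    6: "N8N queue mode (Bull queue)",
}


def redis_target(url):
    """Return (port, db_index, purpose) for a Redis URL."""
    parts = urlsplit(url)
    # The path component carries the database index, e.g. "/2"; empty means DB 0.
    db = int(parts.path.lstrip("/") or 0)
    return parts.port, db, ALLOCATION.get(db, "unallocated")


celery_url = "redis://:password@slidefactory-redis.redis.cache.windows.net:6380/2?ssl=true"
print(redis_target(celery_url))
```

This assumes the password is already percent-encoded; an unencoded `/` in the password would confuse the URL parser.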
Monitoring:
- Metrics: CPU, Memory, Connections, Hit rate
- Alerts: High memory (>80%), Connection errors
Storage Configuration
Azure Blob Storage
Account Name: slidefactoryprod
Type: StorageV2 (general purpose v2)
Performance: Standard
Replication: GRS (Geo-redundant storage)
Access Tier: Hot
Containers:
| Container | Purpose | Access Level | Lifecycle Policy |
|---|---|---|---|
| presentations | Generated presentations | Private | Cool tier after 90 days, delete after 2 years |
| templates | PowerPoint templates | Private | No auto-deletion |
| documents | RAG/context documents | Private | No auto-deletion |
| static | S5 branding assets | Blob (public read) | No auto-deletion |
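The lifecycle policy on the presentations container amounts to a simple age-based rule. The sketch below restates it in Python for illustration only; the real policy is enforced by Azure Storage lifecycle management, and the function name, the use of last-modified date, and the 730-day reading of "2 years" are assumptions.

```python
from datetime import date, timedelta

COOL_AFTER = timedelta(days=90)
DELETE_AFTER = timedelta(days=2 * 365)  # "2 years", ignoring leap days (assumption)


def lifecycle_action(last_modified: date, today: date) -> str:
    """Return the tier or action the documented policy implies for a blob."""
    age = today - last_modified
    if age > DELETE_AFTER:
        return "delete"
    if age > COOL_AFTER:
        return "cool"
    return "hot"


print(lifecycle_action(date(2025, 1, 1), date(2025, 6, 1)))  # 151 days old -> "cool"
```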
Networking:
- Public access: Enabled (with SAS tokens)
- Private endpoint: 10.0.2.6
- Firewall: Allow Azure services + Container Apps subnet
Features:
- Soft delete: Enabled (14 days for blobs, 7 days for containers)
- Versioning: Disabled
- Change feed: Disabled
- Blob indexing: Disabled
Access Methods:
- Connection String (used by app): see AZURE_STORAGE_CONNECTION_STRING under Environment Variables
- Presigned URLs (for downloads):
  - Generated via get_presigned_url() in the storage client
  - Expiration: 1-24 hours (configurable)
  - Used for direct downloads (no proxy through FastAPI)
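The 1-24 hour expiration window can be enforced by clamping whatever lifetime a caller requests. This is a sketch of that idea only; it does not reproduce the actual `get_presigned_url()` implementation, and `presigned_expiry` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

MIN_EXPIRY = timedelta(hours=1)
MAX_EXPIRY = timedelta(hours=24)


def presigned_expiry(requested: timedelta, now: datetime) -> datetime:
    """Clamp the requested lifetime into the documented 1-24 hour window."""
    clamped = min(max(requested, MIN_EXPIRY), MAX_EXPIRY)
    return now + clamped


now = datetime(2025, 12, 2, 10, 0, tzinfo=timezone.utc)
print(presigned_expiry(timedelta(minutes=5), now))  # raised to the 1 hour minimum
print(presigned_expiry(timedelta(days=3), now))     # capped at the 24 hour maximum
```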
Storage Client Factory:
- Location: app/filemanager/storage/factory.py
- Backend selection via STORAGE_PROVIDER env var
- Supports: Azure Blob, MinIO (local dev)
N8N Queue Mode Setup
Architecture
N8N runs in distributed queue mode with:
- 1 Main Instance: Handles UI, API, webhooks, scheduling (does NOT execute workflows)
- 2-10 Worker Instances: Execute workflows in parallel (auto-scaling)
- Redis Queue: Job distribution via Bull queue (DB 6)
- Shared PostgreSQL: Workflow definitions and execution history (10.0.2.4:5432/n8n)
Main Instance
Name: slidefactory-n8n
URL: https://slidefactory-n8n.thankfulsmoke-fef50a06.westeurope.azurecontainerapps.io
Image: slidefactoryacr.azurecr.io/n8n-custom:1.121.3
Custom image includes pre-installed community nodes:
- n8n-nodes-document-generator - Document generation
- Additional nodes as needed
Resources:
- CPU: 0.5 cores
- Memory: 1.0 GB
- Replicas: 1 (fixed, no auto-scaling)
Key Environment Variables:
EXECUTIONS_MODE=queue
QUEUE_BULL_REDIS_HOST=slidefactory-redis.redis.cache.windows.net
QUEUE_BULL_REDIS_PORT=6380
QUEUE_BULL_REDIS_DB=6
QUEUE_BULL_REDIS_TLS=true
N8N_ENCRYPTION_KEY=<must-be-identical-across-all-instances>
DB_TYPE=postgresdb
DB_POSTGRESDB_HOST=10.0.2.4
DB_POSTGRESDB_DATABASE=n8n
N8N_BASIC_AUTH_USER=admin
N8N_BASIC_AUTH_PASSWORD=<secret>
EXECUTIONS_TIMEOUT=3600
EXECUTIONS_TIMEOUT_MAX=7200
QUEUE_WORKER_LOCK_DURATION=7200000
QUEUE_WORKER_MAX_STALLED_COUNT=10
QUEUE_WORKER_STALLED_INTERVAL=60000
N8N_COMMUNITY_PACKAGES_ENABLED=true
Persistent Storage:
- Azure File Share: n8n-nodes-storage
- Mount path: /home/node/.n8n
- Stores: Community nodes, credentials, settings
Ingress:
- External ingress enabled
- Target port: 5678 (N8N default)
- HTTPS only
Worker Instances
Name: slidefactory-n8n-worker
Image: slidefactoryacr.azurecr.io/n8n-custom:1.121.3 (same as main)
Resources:
- CPU: 1.0 cores
- Memory: 2.0 GB
- Min Replicas: 2 (can be 1 for persistent nodes, 0 for cost savings)
- Max Replicas: 10
Auto-Scaling Rules:
- CPU Utilization: >70%
- Memory Utilization: >80%
- Cool-down period: 5 minutes
Command: n8n worker
Environment Variables: Same as main instance (except N8N_HOST, WEBHOOK_URL, etc.)
Critical: N8N_ENCRYPTION_KEY MUST be identical to main instance!
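A mismatch between main and worker settings is easiest to catch before deployment by diffing the two environment maps. The sketch below is one way to do that; `env_mismatches` is a hypothetical helper, and `MAIN_ONLY` lists only the two exceptions named above (the "etc." in this guide means the real exclusion list is longer and would need to be filled in).

```python
# Variables that legitimately differ between main and workers.
# Only the two named in this guide are listed; extend as needed (assumption).
MAIN_ONLY = {"N8N_HOST", "WEBHOOK_URL"}


def env_mismatches(main_env, worker_env):
    """Return the sorted names of shared variables whose values differ."""
    shared = (set(main_env) | set(worker_env)) - MAIN_ONLY
    return sorted(k for k in shared if main_env.get(k) != worker_env.get(k))


main = {"N8N_ENCRYPTION_KEY": "key-a", "EXECUTIONS_MODE": "queue", "N8N_HOST": "example"}
worker = {"N8N_ENCRYPTION_KEY": "key-b", "EXECUTIONS_MODE": "queue"}
print(env_mismatches(main, worker))  # ['N8N_ENCRYPTION_KEY']
```

A variable that is missing on one side is reported too, because `dict.get` returns `None` for it.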
Deployment
Automated (via GitHub Actions):
# Trigger workflow
# Go to: Actions → "N8N - Deploy N8N Queue Mode to Azure"
# Select action: deploy-all, deploy-main, deploy-workers, rollback
# Set worker replicas (min/max)
# Set N8N version (default: 1.121.3)
Manual (via script):
# Set required environment variables
export REDIS_PASSWORD="..."
export DB_PASSWORD="..."
export N8N_ENCRYPTION_KEY="..."
export N8N_ADMIN_PASSWORD="..."
# Deploy
./scripts/deploy-n8n-queue-mode.sh deploy-all
# Or deploy components separately
./scripts/deploy-n8n-queue-mode.sh deploy-main
./scripts/deploy-n8n-queue-mode.sh deploy-workers
# Verify
./scripts/deploy-n8n-queue-mode.sh verify
# Rollback
./scripts/deploy-n8n-queue-mode.sh rollback
Monitoring N8N Queue
Check Worker Count:
az containerapp revision list \
--name slidefactory-n8n-worker \
--resource-group rg-slidefactory-prod-001 \
--query "[?properties.active].{name:name, replicas:properties.replicas}"
Check Redis Queue:
# Connect to Redis
redis-cli -h slidefactory-redis.redis.cache.windows.net \
-p 6380 -a "<password>" --tls -n 6
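# Check queue depth
# Key names assume Bull's default "bull" prefix and n8n's "jobs" queue;
# adjust them if QUEUE_BULL_PREFIX is set. "completed" is a sorted set, hence ZCARD.
LLEN "bull:jobs:wait"
LLEN "bull:jobs:active"
ZCARD "bull:jobs:completed"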
View Worker Logs:
az containerapp logs show \
--name slidefactory-n8n-worker \
--resource-group rg-slidefactory-prod-001 \
--follow
CI/CD Pipeline
GitHub Actions Workflows
Location: .github/workflows/
Preview Deployment
File: .github/workflows/preview.yml
Trigger: Push to preview branch
Target: Preview environment
Steps:
1. Checkout code
2. Login to Azure (using AZURE_CREDENTIALS secret)
3. Login to Azure Container Registry
4. Auto-detect core version from wheel file
5. Build Docker image with build arg CORE_VERSION
6. Push image with tag preview-{sha}
7. Update slidefactory-web-preview container app
8. Update slidefactory-worker-preview container app
Secrets Required:
- AZURE_CREDENTIALS - Azure service principal JSON
- Environment variables are configured in Container App
Duration: ~5-10 minutes
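Steps 4 and 6 above can be sketched as follows. The workflow's actual detection logic is not shown in this guide, so this is only an illustration that relies on the standard wheel filename layout; the wheel name `slidefactory_core-1.4.2-py3-none-any.whl` and both function names are hypothetical.

```python
from pathlib import Path


def core_version_from_wheel(wheel_path: str) -> str:
    """Extract the version field from a wheel filename."""
    name = Path(wheel_path).name
    if not name.endswith(".whl"):
        raise ValueError(f"not a wheel file: {wheel_path}")
    # Wheel names: {distribution}-{version}(-{build})?-{python}-{abi}-{platform}.whl
    parts = name[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename: {wheel_path}")
    return parts[1]


def image_tag(environment: str, sha: str) -> str:
    """Compose the ACR image reference used by the preview and production workflows."""
    return f"slidefactoryacr.azurecr.io/slidefactory:{environment}-{sha}"


print(core_version_from_wheel("dist/slidefactory_core-1.4.2-py3-none-any.whl"))
print(image_tag("preview", "abc123"))
```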
Production Deployment
File: .github/workflows/production.yml
Trigger: Push to main branch
Target: Production environment
Steps: Same as preview, but:
- Uses AZURE_CREDENTIALS secret (same resource group)
- Pushes image with tag prod-{sha}
- Updates slidefactory-web and slidefactory-worker
Duration: ~5-10 minutes
N8N Custom Image Build
File: .github/workflows/build-n8n-custom.yml
Trigger: Manual workflow dispatch or push to specific path
Target: Custom N8N image with community nodes
Steps:
1. Build custom N8N image from n8nio/n8n:{version}
2. Install community nodes via npm
3. Push to ACR as n8n-custom:{version}
N8N Queue Mode Deployment
File: .github/workflows/deploy-n8n-queue-mode.yml
Trigger: Manual workflow dispatch
Target: N8N main + worker instances
Inputs:
- action: deploy-main, deploy-workers, deploy-all, rollback
- worker_min_replicas: Min worker count (default: 1)
- worker_max_replicas: Max worker count (default: 10)
- n8n_version: N8N image version (default: 1.121.3)
Secrets Required:
- AZURE_CREDENTIALS
- PROD_REDIS_PASSWORD
- PROD_DB_PASSWORD
- N8N_ENCRYPTION_KEY
- N8N_ADMIN_PASSWORD
Environment Variables
Web & Worker Services
All environment variables are configured in Container App configuration (encrypted at rest).
Database:
Redis:
REDIS_HOST=slidefactory-redis.redis.cache.windows.net
REDIS_PORT=6380
REDIS_PASSWORD={secret}
REDIS_DB=2
REDIS_SSL=true
CELERY_BROKER_URL=redis://:{password}@slidefactory-redis.redis.cache.windows.net:6380/2?ssl=true
CELERY_RESULT_BACKEND=redis://:{password}@slidefactory-redis.redis.cache.windows.net:6380/2?ssl=true
Storage:
STORAGE_PROVIDER=azure
AZURE_STORAGE_CONNECTION_STRING=DefaultEndpointsProtocol=https;AccountName=slidefactoryprod;AccountKey={key};EndpointSuffix=core.windows.net
AZURE_STORAGE_ACCOUNT_NAME=slidefactoryprod
AZURE_STORAGE_ACCOUNT_KEY={secret}
AI Providers (examples):
AI_PROVIDER=openai
AI_MODEL=gpt-4o
OPENAI_API_KEY={secret}
AZURE_OPENAI_API_KEY={secret}
AZURE_OPENAI_ENDPOINT=https://....openai.azure.com/
ANTHROPIC_API_KEY={secret}
N8N Integration:
N8N_API_URL=https://slidefactory-n8n.thankfulsmoke-fef50a06.westeurope.azurecontainerapps.io
N8N_API_KEY={secret}
Context/RAG:
Authentication:
Application:
N8N Environment Variables
See N8N Queue Mode Setup section above.
Deployment Process
Standard Deployment Flow
Preview:
1. Developer commits to preview branch
2. GitHub Actions workflow triggers automatically
3. Docker image built with core package + S5 branding
4. Image pushed to ACR with tag preview-{sha}
5. Container Apps updated with new image
6. Rolling deployment (zero downtime)
7. Health checks verify successful deployment
Production:
1. Merge preview → main (after testing)
2. GitHub Actions workflow triggers automatically
3. Same build process as preview
4. Image tagged as prod-{sha}
5. Production Container Apps updated
6. Rolling deployment with health checks
Manual Deployment
Update Container App:
# Login
az login
# Update preview web
az containerapp update \
--name slidefactory-web-preview \
--resource-group rg-slidefactory-prod-001 \
--image slidefactoryacr.azurecr.io/slidefactory:preview-abc123
# Update production web
az containerapp update \
--name slidefactory-web \
--resource-group rg-slidefactory-prod-001 \
--image slidefactoryacr.azurecr.io/slidefactory:prod-abc123
Rollback Procedures
Quick Rollback (switch to previous revision):
# List revisions
az containerapp revision list \
--name slidefactory-web \
--resource-group rg-slidefactory-prod-001 \
--output table
# Activate previous revision
az containerapp revision activate \
--name slidefactory-web \
--resource-group rg-slidefactory-prod-001 \
--revision slidefactory-web--previous-revision
Git Rollback (trigger redeployment):
Pre-Deployment Checklist
Before production deployment:
- [ ] Preview deployment successful
- [ ] All tests passing (run python scripts/test_deploy.py)
- [ ] Smoke tests pass on preview
- [ ] Database migrations tested (if any)
- [ ] N8N workflows tested
- [ ] Performance acceptable (<2s response times)
- [ ] No errors in logs
- [ ] S5 branding displays correctly
- [ ] Azure AD login working
Monitoring & Observability
Application Insights
Resource: appi-slidefactory
Instrumentation Key: Available in portal
Telemetry:
- HTTP requests/responses (timing, status codes)
- Dependencies (database, Redis, storage, external APIs)
- Exceptions and errors (with stack traces)
- Custom events and metrics
- User sessions and page views
Retention: 90 days
Key Metrics:
- Request rate (requests/sec)
- Response time (avg, p95, p99)
- Failure rate (%)
- Dependency duration
- Exception count
Log Analytics
Resource: log-slidefactory
Workspace ID: Available in portal
Data Sources:
- Container Apps logs (stdout/stderr)
- Application Insights telemetry
- Azure resource logs (PostgreSQL, Redis, Storage)
Retention: 90 days
Sample Queries:
// Container App errors
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "slidefactory-web"
| where Log_s contains "ERROR"
| order by TimeGenerated desc
| take 100
// N8N worker activity
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "slidefactory-n8n-worker"
| where Log_s contains "Workflow executed"
| summarize count() by bin(TimeGenerated, 5m)
// Response time trends
requests
| where name contains "/api/"
| summarize avg(duration), percentile(duration, 95) by bin(timestamp, 5m)
Alerts
Configured alerts (email notifications):
Container App Health:
- Health probe failures (>3 consecutive)
- App not responding (503 errors)
- High restart rate (>5 restarts/hour)
Resource Utilization:
- CPU >80% for 5 minutes
- Memory >85% for 5 minutes
- Disk space >90%
Database:
- Connection failures
- High connection count (>80% of max)
- Long-running queries (>30s)
Redis:
- Connection errors
- High memory usage (>80%)
- High eviction rate
Storage:
- Throttling errors (429)
- High latency (>1s avg)
Dashboards
Azure Portal:
- Container App metrics (CPU, memory, requests)
- Database metrics (connections, queries, storage)
- Redis metrics (memory, connections, hit rate)
Application Insights:
- Live metrics stream
- Application map (dependencies)
- Performance blade (requests, dependencies)
- Failures blade (exceptions, failed requests)
Health Endpoints
Web Service: GET /health
- Returns: {"status": "healthy"} (200 OK)
- Checks: Database connection, Redis connection
Worker Service: GET /health (port 8080)
- Returns: {"status": "healthy"} (200 OK)
- Checks: Redis connection, Celery broker
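The health endpoints aggregate several dependency checks into one response. The framework-free sketch below shows that aggregation; the healthy payload matches the one documented above, while the 503 status and the shape of the unhealthy payload are assumptions, as is the helper name `health_response`.

```python
def health_response(checks):
    """Run each check and return (status_code, payload).

    `checks` maps a dependency name to a zero-argument callable
    that raises an exception when the dependency is unavailable.
    """
    failed = []
    for name, check in checks.items():
        try:
            check()
        except Exception:
            failed.append(name)
    if failed:
        # Unhealthy response shape is an assumption, not taken from the app.
        return 503, {"status": "unhealthy", "failed": failed}
    return 200, {"status": "healthy"}


def ok():
    """Stand-in for a real connectivity check that succeeds."""


print(health_response({"database": ok, "redis": ok}))  # (200, {'status': 'healthy'})
```

In the web service the two checks would be the database and Redis connections; in the worker, Redis and the Celery broker.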
Security Configuration
Identity & Access Management
Service Principal (for GitHub Actions):
- Name: github-actions-slidefactory
- Role: Contributor on rg-slidefactory-prod-001
- Used for: Automated deployments
Managed Identity (Container Apps):
- System-assigned identity for each container app
- Used for: Accessing ACR (image pull)
API Keys (application-level):
- Stored in api_keys table
- Scopes: * (admin), specific endpoints
- Created via CLI: slidefactory api-key create
Authentication
User Authentication:
- Azure AD integration (OAuth 2.0)
- Session-based (stored in Redis)
- Local accounts (for testing)
API Authentication:
- Bearer token: Authorization: Bearer sf_xxxxx
- API key validation via database lookup
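Before the database lookup, the key has to be pulled out of the Authorization header. A minimal sketch of that first step follows, assuming every key carries the `sf_` prefix seen in the example above; `extract_api_key` is a hypothetical name, and the lookup against the api_keys table is deliberately left out.

```python
from typing import Optional


def extract_api_key(authorization: Optional[str]) -> Optional[str]:
    """Return the API key from an Authorization header, or None if malformed."""
    if not authorization:
        return None
    scheme, _, token = authorization.partition(" ")
    # Scheme names are case-insensitive; the "sf_" prefix check is an assumption.
    if scheme.lower() != "bearer" or not token.startswith("sf_"):
        return None
    return token


print(extract_api_key("Bearer sf_xxxxx"))  # sf_xxxxx
print(extract_api_key("Basic abc"))        # None
```

Rejecting malformed headers early keeps obviously invalid requests from reaching the database.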
Network Security
TLS/HTTPS:
- All external traffic: HTTPS only (enforced by Container Apps)
- PostgreSQL: TLS required
- Redis: TLS required (port 6380)
- Storage: HTTPS required
Firewall:
- Database: Only accessible from Container Apps subnet
- Redis: Only accessible via private endpoint
- Storage: Accessible from Container Apps + management IPs
Private Endpoints:
- Database, Redis, and Storage all use private endpoints
- No public internet access to the database or Redis; Storage additionally keeps public access enabled for SAS-token downloads (see Storage Configuration)
Secrets Management
GitHub Secrets (for CI/CD):
- AZURE_CREDENTIALS - Service principal JSON
- PROD_REDIS_PASSWORD - Redis password
- PROD_DB_PASSWORD - PostgreSQL password
- N8N_ENCRYPTION_KEY - N8N encryption key
- N8N_ADMIN_PASSWORD - N8N admin password
Container App Configuration:
- Environment variables stored encrypted at rest
- Not logged in Container App logs
- Accessed only by application code
Not Using:
- Azure Key Vault (future consideration)
- Customer-managed encryption keys
Data Protection
Encryption at Rest:
- All Azure services use Microsoft-managed keys
- Blob Storage: Encrypted
- PostgreSQL: Encrypted
- Redis: Encrypted
Encryption in Transit:
- TLS 1.2+ required for all connections
- HTTPS only for web traffic
- Database and Redis require TLS
Backup & Recovery:
- Database: Automated backups (30 days retention)
- Blob Storage: Soft delete (14 days)
- Container Images: Retained in ACR
Cost Analysis
Monthly Cost Estimate
Production Environment (~$500-600/month):
| Service | Cost/Month | Notes |
|---|---|---|
| Container Apps (Web) | $100-150 | 2-5 replicas, 1 vCPU, 2GB RAM |
| Container Apps (Worker) | $50-80 | 1-3 replicas, 1 vCPU, 2GB RAM |
| N8N Main | $30-40 | 1 replica, 0.5 vCPU, 1GB RAM |
| N8N Workers | $150-200 | 2-10 replicas, 1 vCPU, 2GB RAM |
| PostgreSQL | $80-100 | General Purpose, 2 vCores, 8GB RAM |
| Redis | $40-50 | Standard C2, 2.5GB |
| Blob Storage | $20-30 | GRS, Hot tier, ~1TB |
| Container Registry | $5 | Standard SKU |
| Log Analytics | $10-20 | 90 day retention |
| Application Insights | $5-10 | Based on telemetry volume |
| Total | ~$500-600 | |
Preview Environment (~$70-110/month):
| Service | Cost/Month | Notes |
|---|---|---|
| Container Apps (Web) | $40-60 | 1-3 replicas, 0.5 vCPU, 1GB RAM |
| Container Apps (Worker) | $30-50 | 1-2 replicas, 0.5 vCPU, 1GB RAM |
| Shared services | - | Uses production DB, Redis, Storage |
| Total | ~$70-110 | |
Combined Total: ~$600-750/month
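The totals can be cross-checked by summing the itemized ranges from the two tables. The sketch below does exactly that; note that the line items add up to a somewhat wider band than the rounded headline figures, which is worth keeping in mind when budgeting. The dictionary and function names are introduced here for illustration.

```python
# (low, high) monthly estimates in USD, copied from the tables above.
PRODUCTION = {
    "Container Apps (Web)": (100, 150),
    "Container Apps (Worker)": (50, 80),
    "N8N Main": (30, 40),
    "N8N Workers": (150, 200),
    "PostgreSQL": (80, 100),
    "Redis": (40, 50),
    "Blob Storage": (20, 30),
    "Container Registry": (5, 5),
    "Log Analytics": (10, 20),
    "Application Insights": (5, 10),
}
PREVIEW = {
    "Container Apps (Web)": (40, 60),
    "Container Apps (Worker)": (30, 50),
}


def total_range(items):
    """Sum the low and high ends of every line item."""
    return (
        sum(low for low, _ in items.values()),
        sum(high for _, high in items.values()),
    )


print(total_range(PRODUCTION))                # (490, 685)
print(total_range(PREVIEW))                   # (70, 110)
print(total_range({**PRODUCTION, **{f"preview {k}": v for k, v in PREVIEW.items()}}))  # (560, 795)
```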
Cost Optimization
Implemented:
- Auto-scaling for Container Apps (scale down when idle)
- Shared database, Redis, storage across environments
- Storage lifecycle policies (move to cool tier after 90 days)
Potential Savings:
- Scale preview to 0 outside business hours: ~$30/month
- Use Burstable PostgreSQL for preview: ~$20/month
- Reserved Instances for production database: ~$30/month (30% discount)
- Optimize N8N worker min replicas (set to 1 instead of 2): ~$50/month
Not Recommended (would impact availability):
- Disable geo-redundancy for production
- Reduce backup retention
- Scale production to fewer replicas
Disaster Recovery
Backup Strategy
Database:
- Automated backups: Every hour
- Retention: 30 days (production), 7 days (preview)
- Geo-redundant: Yes (production)
- Point-in-time restore: Yes (any time within retention)
Blob Storage:
- Soft delete: 14 days (blobs), 7 days (containers)
- Geo-redundant: Yes (production)
- Manual export: Quarterly to offline storage
Container Images:
- Retention: Last 10 images per tag prefix
- Stored in ACR with geo-replication option
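The "last 10 images per tag prefix" rule can be expressed as a small selection function. The sketch below assumes the prefix is everything before the first hyphen (for example `preview` in `preview-{sha}`) and that each image comes with a sortable push timestamp; it only selects candidates and does not call ACR. `tags_to_prune` is a hypothetical name.

```python
from collections import defaultdict


def tags_to_prune(images, keep=10):
    """Return tags beyond the newest `keep` per prefix.

    `images` is an iterable of (tag, pushed_at) pairs, where pushed_at
    is any sortable value such as a datetime.
    """
    by_prefix = defaultdict(list)
    for tag, pushed_at in images:
        prefix = tag.split("-", 1)[0]
        by_prefix[prefix].append((pushed_at, tag))
    prune = []
    for entries in by_prefix.values():
        entries.sort(reverse=True)  # newest first
        prune.extend(tag for _, tag in entries[keep:])
    return sorted(prune)


# Twelve preview images (hypothetical tags) and one production image.
images = [(f"preview-{i:02d}", i) for i in range(12)] + [("prod-aa", 1)]
print(tags_to_prune(images))  # ['preview-00', 'preview-01']
```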
Configuration:
- GitHub repository: All code and configuration
- Container App configuration: Exported via Azure CLI
- Environment variables: Documented + stored in secrets
Recovery Procedures
Database Restore (point-in-time):
az postgres flexible-server restore \
--resource-group rg-slidefactory-prod-001 \
--name slidefactory-postgres-restored \
--source-server slidefactory-postgres \
--restore-time "2025-12-02T10:00:00Z"
Container App Rollback (to previous revision):
az containerapp revision activate \
--name slidefactory-web \
--resource-group rg-slidefactory-prod-001 \
--revision slidefactory-web--previous-revision
Blob Storage Undelete:
Complete Environment Recreation:
1. Restore database from backup
2. Recreate Container Apps from GitHub Actions
3. Restore blob storage from geo-redundant replica
4. Reconfigure environment variables from documentation
5. Redeploy latest code from GitHub
RTO/RPO
Recovery Time Objective (RTO): 1 hour
- Container App rollback: ~1 minute
- Database restore: ~30 minutes
- Complete environment recreation: ~1 hour
Recovery Point Objective (RPO):
- Database: 1 hour (automated backups)
- Blob Storage: 24 hours (geo-replication lag)
- Code: 0 (Git repository)
Troubleshooting
Common Issues
Issue: Container App Not Starting
Symptoms: App shows "Provisioning" or "Failed" status
Diagnosis:
# Check logs
az containerapp logs show \
--name slidefactory-web \
--resource-group rg-slidefactory-prod-001 \
--tail 100
# Check revision status
az containerapp revision list \
--name slidefactory-web \
--resource-group rg-slidefactory-prod-001
Common Causes:
- Database migration failure (check logs for Alembic errors)
- Missing environment variables
- Invalid database connection string
- Image pull failure (check ACR credentials)
Solutions:
- Verify all environment variables are set
- Check database is accessible from Container Apps subnet
- Manually run migrations: az containerapp exec --name slidefactory-web --resource-group rg-slidefactory-prod-001 --command "alembic upgrade head"
- Verify ACR credentials in Container App configuration
Issue: N8N Workers Not Picking Up Jobs
Symptoms: Workflows queued but not executing
Diagnosis:
# Check worker logs
az containerapp logs show \
--name slidefactory-n8n-worker \
--resource-group rg-slidefactory-prod-001 \
--tail 100
# Check Redis queue
redis-cli -h slidefactory-redis.redis.cache.windows.net \
-p 6380 -a "<password>" --tls -n 6 \
LLEN "bull:jobs:wait"   # assumes Bull's default "bull" prefix and n8n's "jobs" queue
Common Causes:
- N8N_ENCRYPTION_KEY mismatch between main and workers
- Redis connection issues
- Worker instances not running
Solutions:
- Verify N8N_ENCRYPTION_KEY is identical in both main and worker
- Check Redis connectivity from worker: az containerapp exec --name slidefactory-n8n-worker --resource-group rg-slidefactory-prod-001 --command "redis-cli ping"
- Restart workers: az containerapp revision restart --name slidefactory-n8n-worker --resource-group rg-slidefactory-prod-001 --revision <revision-name>
Issue: High Database Connection Count
Symptoms: "Too many connections" errors
Diagnosis:
# Check connection count
psql -h 10.0.2.4 -U dbadmin -d slidefactory \
-c "SELECT count(*) FROM pg_stat_activity;"
Solutions:
- Increase max_connections in PostgreSQL configuration
- Reduce Container App replica count temporarily
- Consider adding PgBouncer connection pooler
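Whether the replica limits can exhaust max_connections is a matter of simple arithmetic. The sketch below multiplies an assumed per-replica connection pool size by the documented maximum replica counts; the pool sizes are hypothetical (the application's real pool settings are not given in this guide), and N8N's own connections to the n8n database are not counted.

```python
MAX_CONNECTIONS = 100  # PostgreSQL max_connections from Database Setup

# Maximum replica counts from Container Apps Configuration.
MAX_REPLICAS = {
    "slidefactory-web": 5,
    "slidefactory-worker": 3,
    "slidefactory-web-preview": 3,
    "slidefactory-worker-preview": 2,
}


def worst_case_connections(pool_size_per_replica, replicas=MAX_REPLICAS):
    """Connections opened if every replica fills its pool at full scale-out."""
    return pool_size_per_replica * sum(replicas.values())


for pool_size in (5, 10):  # hypothetical pool sizes
    used = worst_case_connections(pool_size)
    print(pool_size, used, "over limit" if used > MAX_CONNECTIONS else "ok")
```

With 13 replicas at full scale-out, a pool of 10 per replica would already exceed the limit, which is why a pooler such as PgBouncer is listed as an option.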
Issue: Celery Tasks Not Processing
Symptoms: Tasks queued but not executed
Diagnosis:
# Check worker logs
az containerapp logs show \
--name slidefactory-worker \
--resource-group rg-slidefactory-prod-001 \
--tail 100
# Check Redis for queued tasks
redis-cli -h slidefactory-redis.redis.cache.windows.net \
-p 6380 -a "<password>" --tls -n 2 \
LLEN "celery"
Solutions:
- Verify worker is running and healthy
- Check Redis connection (DB 2)
- Restart worker: az containerapp revision restart --name slidefactory-worker --resource-group rg-slidefactory-prod-001 --revision <revision-name>
Useful Commands
View Container App Status:
az containerapp show \
--name slidefactory-web \
--resource-group rg-slidefactory-prod-001 \
--query "{status:properties.provisioningState, replicas:properties.template.scale}"
View Logs (Live):
az containerapp logs show \
--name slidefactory-web \
--resource-group rg-slidefactory-prod-001 \
--follow
Execute Command in Container:
az containerapp exec \
--name slidefactory-web \
--resource-group rg-slidefactory-prod-001 \
--command "bash"
Restart Container App:
Scale Container App:
az containerapp update \
--name slidefactory-web \
--resource-group rg-slidefactory-prod-001 \
--min-replicas 2 \
--max-replicas 5
View Database Connections:
psql -h 10.0.2.4 -U dbadmin -d slidefactory \
-c "SELECT datname, usename, application_name, client_addr, state FROM pg_stat_activity;"
Check Redis Memory:
redis-cli -h slidefactory-redis.redis.cache.windows.net \
-p 6380 -a "<password>" --tls \
INFO memory
Appendix
Related Documentation
- Infrastructure Details - Detailed resource specifications
- Deployment Process - Deployment workflows
- Monitoring - Observability setup
- Troubleshooting - Common issues
- N8N Queue Mode Report - N8N setup details
External Resources
- Azure Container Apps Docs
- PostgreSQL Flexible Server Docs
- Azure Cache for Redis Docs
- N8N Queue Mode Docs
Maintenance Windows
- Database: Sunday 02:00-06:00 UTC
- Deployments: Any time after preview testing
- Major Updates: Scheduled in advance with user notification
Contact & Support
- Repository: https://github.com/cgast/s5-slidefactory
- Core Package: https://github.com/cgast/slidefactory-core
- Issues: GitHub Issues in respective repositories
Document Version: 1.0
Last Review: 2025-12-02
Next Review: 2025-12-15