S5 Slidefactory - Complete Azure Setup Guide

Last Updated: 2025-12-02
Version: Production-ready
Environment: Azure Container Apps


Table of Contents

  1. Architecture Overview
  2. Azure Resources
  3. Network Configuration
  4. Container Apps Configuration
  5. Database Setup
  6. Redis Configuration
  7. Storage Configuration
  8. N8N Queue Mode Setup
  9. CI/CD Pipeline
  10. Environment Variables
  11. Deployment Process
  12. Monitoring & Observability
  13. Security Configuration
  14. Cost Analysis
  15. Disaster Recovery
  16. Troubleshooting

Architecture Overview

High-Level Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                          AZURE CLOUD                                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │            Azure Container Apps Environment                       │  │
│  │            Virtual Network: 10.0.0.0/16                          │  │
│  ├──────────────────────────────────────────────────────────────────┤  │
│  │                                                                   │  │
│  │  ┌─────────────────┐  ┌─────────────────┐  ┌────────────────┐  │  │
│  │  │  Web (Preview)  │  │  Web (Prod)     │  │  N8N Main      │  │  │
│  │  │  slidefactory-  │  │  slidefactory-  │  │  slidefactory- │  │  │
│  │  │  web-preview    │  │  web            │  │  n8n           │  │  │
│  │  │                 │  │                 │  │                │  │  │
│  │  │  FastAPI        │  │  FastAPI        │  │  Queue Mode    │  │  │
│  │  │  Port 8000      │  │  Port 8000      │  │  UI/API/Hooks  │  │  │
│  │  │  1-3 replicas   │  │  2-5 replicas   │  │  1 replica     │  │  │
│  │  └─────────────────┘  └─────────────────┘  └────────────────┘  │  │
│  │                                                                   │  │
│  │  ┌─────────────────┐  ┌─────────────────┐  ┌────────────────┐  │  │
│  │  │ Worker (Preview)│  │ Worker (Prod)   │  │ N8N Workers    │  │  │
│  │  │ slidefactory-   │  │ slidefactory-   │  │ slidefactory-  │  │  │
│  │  │ worker-preview  │  │ worker          │  │ n8n-worker     │  │  │
│  │  │                 │  │                 │  │                │  │  │
│  │  │ Celery          │  │ Celery          │  │ Queue Workers  │  │  │
│  │  │ Background      │  │ Background      │  │ 2-10 replicas  │  │  │
│  │  │ 1-2 replicas    │  │ 1-3 replicas    │  │ Auto-scaling   │  │  │
│  │  └─────────────────┘  └─────────────────┘  └────────────────┘  │  │
│  │                                                                   │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│         │                  │                  │                          │
│         ▼                  ▼                  ▼                          │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │              Private Endpoints (10.0.2.0/24)                      │  │
│  ├──────────────────────────────────────────────────────────────────┤  │
│  │                                                                   │  │
│  │  ┌─────────────────┐  ┌─────────────────┐  ┌────────────────┐  │  │
│  │  │  PostgreSQL     │  │  Redis Cache    │  │  Blob Storage  │  │  │
│  │  │  10.0.2.4:5432  │  │  10.0.2.5:6380  │  │  10.0.2.6      │  │  │
│  │  │                 │  │                 │  │                │  │  │
│  │  │  - slidefactory │  │  DB 2: Celery  │  │  presentations │  │  │
│  │  │  - n8n          │  │  DB 6: N8N     │  │  templates     │  │  │
│  │  │  pgvector ext.  │  │                 │  │  documents     │  │  │
│  │  └─────────────────┘  └─────────────────┘  └────────────────┘  │  │
│  │                                                                   │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                           │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │  Azure Container Registry (ACR)                                  │  │
│  │  slidefactoryacr.azurecr.io                                      │  │
│  │  - slidefactory:preview-{sha}                                    │  │
│  │  - slidefactory:prod-{sha}                                       │  │
│  │  - n8n-custom:1.121.3                                            │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                           │
└─────────────────────────────────────────────────────────────────────────┘
                          ┌───────┴────────┐
                          │  GitHub Actions │
                          │  CI/CD Pipeline │
                          └────────────────┘

Deployment Flow

Developer → GitHub (preview/main) → GitHub Actions → Build Docker Image →
ACR Push → Container App Update → Rolling Deployment

Data Flow

User Request → FastAPI (Web) → PostgreSQL/Redis/Storage
                              → N8N API (for workflows)
                              → Celery (via Worker) → Background Tasks

N8N Workflow → N8N Main (queue mode) → Redis Queue (DB 6) →
N8N Workers (2-10 replicas) → Execute Workflows

Azure Resources

Resource Group Structure

Production Resource Group: rg-slidefactory-prod-001

All resources (both preview and production environments) are in a single resource group.

Complete Resource List

| Resource Name | Type | Environment | Purpose |
|---|---|---|---|
| slidefactory-web-preview | Container App | Preview | FastAPI web application (preview) |
| slidefactory-worker-preview | Container App | Preview | Celery background worker (preview) |
| slidefactory-web | Container App | Production | FastAPI web application (prod) |
| slidefactory-worker | Container App | Production | Celery background worker (prod) |
| slidefactory-n8n | Container App | Shared | N8N main instance (queue mode) |
| slidefactory-n8n-worker | Container App | Shared | N8N worker pool (2-10 replicas) |
| slidefactory-postgres | PostgreSQL Flexible Server | Shared | Database (both envs + n8n) |
| slidefactory-redis | Azure Cache for Redis | Shared | Cache + Celery + N8N queue |
| slidefactoryprod | Storage Account | Shared | Blob storage (all envs) |
| slidefactoryacr | Container Registry | Shared | Docker images |
| log-slidefactory | Log Analytics Workspace | Shared | Centralized logging |
| appi-slidefactory | Application Insights | Shared | Application telemetry |

Azure Subscription

  • Subscription ID: 022ab726-cdb5-4a02-bf2b-bea8d87d8e83
  • Region: West Europe
  • Resource Group: rg-slidefactory-prod-001

Network Configuration

Virtual Network

VNET: vnet-slidefactory (10.0.0.0/16)

Subnets:
  • Apps Subnet: snet-apps (10.0.1.0/24) - Container Apps
  • Data Subnet: snet-data (10.0.2.0/24) - Private endpoints

Private Endpoints

| Service | IP Address | DNS Zone |
|---|---|---|
| PostgreSQL | 10.0.2.4:5432 | privatelink.postgres.database.azure.com |
| Redis | 10.0.2.5:6380 | privatelink.redis.cache.windows.net |
| Blob Storage | 10.0.2.6 | privatelink.blob.core.windows.net |
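
The address plan above can be sanity-checked mechanically. The following is a minimal sketch using Python's standard `ipaddress` module; the `check_layout` helper and the dictionary names are illustrative and not part of the deployment tooling. It verifies that both subnets sit inside the VNET, that they do not overlap, and that every private endpoint IP falls in the data subnet.

```python
import ipaddress

# Values copied from the Virtual Network and Private Endpoints sections.
VNET = ipaddress.ip_network("10.0.0.0/16")
SUBNETS = {
    "snet-apps": ipaddress.ip_network("10.0.1.0/24"),
    "snet-data": ipaddress.ip_network("10.0.2.0/24"),
}
PRIVATE_ENDPOINTS = {
    "PostgreSQL": ipaddress.ip_address("10.0.2.4"),
    "Redis": ipaddress.ip_address("10.0.2.5"),
    "Blob Storage": ipaddress.ip_address("10.0.2.6"),
}


def check_layout() -> list[str]:
    """Return a list of human-readable problems; empty means the plan is consistent."""
    problems = []
    for name, net in SUBNETS.items():
        if not net.subnet_of(VNET):
            problems.append(f"{name} ({net}) is outside the VNET {VNET}")
    if SUBNETS["snet-apps"].overlaps(SUBNETS["snet-data"]):
        problems.append("snet-apps and snet-data overlap")
    for service, ip in PRIVATE_ENDPOINTS.items():
        if ip not in SUBNETS["snet-data"]:
            problems.append(f"{service} endpoint {ip} is not in snet-data")
    return problems


if __name__ == "__main__":
    print(check_layout() or "address plan is consistent")
```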

Firewall Rules

PostgreSQL:
  • Allow Azure services: Yes
  • Allow Container Apps subnet: 10.0.1.0/24

Redis:
  • Private endpoint only
  • TLS required
  • Port: 6380 (SSL)

Storage Account:
  • Public access: Enabled (with SAS tokens)
  • Allow Azure services: Yes
  • Allow Container Apps subnet: 10.0.1.0/24

DNS Configuration

  • Private DNS zones automatically created for private endpoints
  • Azure-provided DNS for Container Apps
  • Custom domain not configured (using Azure-provided URLs)

Container Apps Configuration

Container Apps Environment

  • Environment ID: Shared environment for all container apps
  • Virtual Network Integration: Yes (10.0.1.0/24)
  • Log Analytics: Enabled
  • Dapr: Not used

Web Service (Preview)

Name: slidefactory-web-preview
URL: https://slidefactory-web-preview.thankfulsmoke-fef50a06.westeurope.azurecontainerapps.io

Image: slidefactoryacr.azurecr.io/slidefactory:preview-{sha}

Resources:
  • CPU: 0.5 cores
  • Memory: 1.0 GB
  • Min Replicas: 1
  • Max Replicas: 3

Scaling Rules:
  • CPU Utilization: >75% triggers scale up
  • HTTP Requests: >100 concurrent requests

Health Probes:
  • Liveness: GET /health (30s interval, 3 retries)
  • Readiness: GET /health (10s interval, 3 retries)
  • Startup: GET /health (5s interval, 60s timeout)

Ingress:
  • External ingress enabled
  • Target port: 8000
  • Transport: HTTP/2
  • Allow insecure: No (HTTPS only)

Command: /code/scripts/start-web-azure.sh

Web Service (Production)

Name: slidefactory-web
URL: https://slidefactory-web.thankfulsmoke-fef50a06.westeurope.azurecontainerapps.io

Image: slidefactoryacr.azurecr.io/slidefactory:prod-{sha}

Resources:
  • CPU: 1.0 cores
  • Memory: 2.0 GB
  • Min Replicas: 2
  • Max Replicas: 5

Scaling Rules:
  • CPU Utilization: >70% triggers scale up
  • HTTP Requests: >200 concurrent requests

Health Probes: Same as preview

Ingress: Same as preview

Command: /code/scripts/start-web-azure.sh

Worker Service (Preview)

Name: slidefactory-worker-preview

Image: slidefactoryacr.azurecr.io/slidefactory:preview-{sha}

Resources:
  • CPU: 0.5 cores
  • Memory: 1.0 GB
  • Min Replicas: 1
  • Max Replicas: 2

Scaling Rules:
  • CPU Utilization: >80% triggers scale up

Ingress: Internal only (port 8080 for health checks)

Command: /code/scripts/start-worker-azure-healthcheck.sh

Worker Service (Production)

Name: slidefactory-worker

Image: slidefactoryacr.azurecr.io/slidefactory:prod-{sha}

Resources:
  • CPU: 1.0 cores
  • Memory: 2.0 GB
  • Min Replicas: 1
  • Max Replicas: 3

Scaling Rules:
  • CPU Utilization: >75% triggers scale up

Ingress: Internal only (port 8080 for health checks)

Command: /code/scripts/start-worker-azure-healthcheck.sh


Database Setup

PostgreSQL Flexible Server

Name: slidefactory-postgres (internal reference)
Host: 10.0.2.4
Port: 5432
Version: PostgreSQL 15

Tier:
  • SKU: General Purpose (D2s_v3)
  • vCores: 2
  • Memory: 8 GB
  • Storage: 128 GB (auto-grow enabled)

Databases:
  1. slidefactory - Main application database
  2. n8n - N8N workflow database

Extensions:
  • pgvector - Vector similarity search (for RAG/context system)

Configuration Parameters:
  • max_connections: 100
  • work_mem: 16MB
  • shared_buffers: 2GB
  • effective_cache_size: 6GB

Backup:
  • Automated backups: Enabled
  • Retention: 30 days
  • Geo-redundant: Yes (production)
  • Point-in-time restore: Available

Networking:
  • Public access: Disabled
  • Private endpoint: 10.0.2.4
  • SSL/TLS: Required
  • Firewall: Allow Azure services

Maintenance:
  • Window: Sunday 02:00-06:00 UTC
  • Minor version updates: Automatic
  • Major version updates: Manual

Admin User: dbadmin
Connection String Format: postgresql://dbadmin:{password}@10.0.2.4:5432/{database}?sslmode=require
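
Passwords containing characters such as `@`, `/` or `:` break a URL of this shape unless they are percent-encoded. The sketch below shows one way to assemble the connection string from the format above using only the standard library; `build_database_url` is a hypothetical helper, not a function from the Slidefactory codebase.

```python
from urllib.parse import quote


def build_database_url(
    password: str,
    database: str,
    user: str = "dbadmin",   # admin user named in this guide
    host: str = "10.0.2.4",  # private endpoint from this guide
    port: int = 5432,
) -> str:
    """Assemble the documented connection string, percent-encoding credentials."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}?sslmode=require"
    )


if __name__ == "__main__":
    # Prints a URL whose password part is percent-encoded.
    print(build_database_url("p@ss/w:rd", "slidefactory"))
```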

Database Schema (Slidefactory)

Managed via Alembic migrations. Key tables:

Users & Auth:
  • users - User accounts
  • api_keys - API authentication tokens
  • sessions - User sessions (in-memory/Redis)

Templates & Presentations:
  • templates - PowerPoint template metadata
  • n8n_processes - N8N workflow execution tracking
  • presentations - Generated presentation metadata

Context/RAG System:
  • context_documents - Uploaded documents
  • context_chunks - Document chunks with pgvector embeddings
  • contexts - Context collections for RAG

Migrations:
  • Location: alembic/versions/
  • Applied automatically on web container startup via start-web-azure.sh

Database Initialization

For Fresh Database:

# Run init.sql to create databases and extensions
psql -U postgres -f init.sql

# Mark as Alembic-managed
alembic stamp head

# Apply any pending migrations
alembic upgrade head

For Existing Database (production migration):

# Just mark as Alembic-managed
alembic stamp head

# Apply new migrations
alembic upgrade head


Redis Configuration

Azure Cache for Redis

Name: slidefactory-redis
Host: slidefactory-redis.redis.cache.windows.net
Port: 6380 (SSL)
Version: Redis 6.x

Tier:
  • SKU: Standard C2
  • Memory: 2.5 GB
  • Replicas: 1

Database Allocation:
  • DB 0: Default (unused)
  • DB 1: Application cache (unused)
  • DB 2: Celery broker + result backend (Slidefactory)
  • DB 6: N8N queue mode (Bull queue)

Configuration:
  • TLS: Required
  • Port: 6380 (SSL only, 6379 disabled)
  • Eviction policy: allkeys-lru
  • Persistence: RDB snapshots (hourly)
  • Max memory policy: Evict least recently used

Networking:
  • Public access: Disabled
  • Private endpoint: 10.0.2.5
  • TLS: Required

Connection String Format:
  • Slidefactory (DB 2): redis://:password@slidefactory-redis.redis.cache.windows.net:6380/2?ssl=true
  • N8N Queue (DB 6): redis://:password@slidefactory-redis.redis.cache.windows.net:6380/6?ssl=true
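
Because the Celery and N8N consumers differ only in the database index, a small helper keeps the two URLs from drifting apart. This is a sketch under stated assumptions: `build_redis_url`, `redis_db_index` and the `REDIS_DATABASES` mapping are illustrative names, and the URL shape simply mirrors the format documented above.

```python
from urllib.parse import quote, urlsplit

REDIS_HOST = "slidefactory-redis.redis.cache.windows.net"
REDIS_PORT = 6380
# Database allocation from this guide: DB 2 for Celery, DB 6 for the N8N queue.
REDIS_DATABASES = {"celery": 2, "n8n": 6}


def build_redis_url(password: str, consumer: str) -> str:
    """Build the documented connection string for one consumer."""
    db = REDIS_DATABASES[consumer]  # KeyError for an unknown consumer
    return f"redis://:{quote(password, safe='')}@{REDIS_HOST}:{REDIS_PORT}/{db}?ssl=true"


def redis_db_index(url: str) -> int:
    """Extract the database index from a Redis URL's path component."""
    return int(urlsplit(url).path.lstrip("/"))


if __name__ == "__main__":
    url = build_redis_url("example-password", "celery")
    print(url, "-> DB", redis_db_index(url))
```

Note that redis-py and Celery select TLS through the `rediss://` scheme rather than a query flag. Whether the application translates the documented `redis://...?ssl=true` form is not stated in this guide, so it is worth verifying against the client library in use.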

Monitoring:
  • Metrics: CPU, Memory, Connections, Hit rate
  • Alerts: High memory (>80%), Connection errors


Storage Configuration

Azure Blob Storage

Account Name: slidefactoryprod
Type: StorageV2 (general purpose v2)
Performance: Standard
Replication: GRS (Geo-redundant storage)
Access Tier: Hot

Containers:

| Container | Purpose | Access Level | Lifecycle Policy |
|---|---|---|---|
| presentations | Generated presentations | Private | Cool tier after 90 days, delete after 2 years |
| templates | PowerPoint templates | Private | No auto-deletion |
| documents | RAG/context documents | Private | No auto-deletion |
| static | S5 branding assets | Blob (public read) | No auto-deletion |
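
To make the `presentations` lifecycle policy concrete, the sketch below classifies a blob by age. It is an illustration only: it assumes age is measured from the last modification date, treats "2 years" as 730 days, and uses strict "greater than" thresholds. None of these details are confirmed by this guide, and `presentation_blob_state` is a hypothetical name.

```python
from datetime import date, timedelta

COOL_AFTER = timedelta(days=90)
DELETE_AFTER = timedelta(days=730)  # assumption: "2 years" expressed as 730 days


def presentation_blob_state(last_modified: date, today: date) -> str:
    """Return 'hot', 'cool' or 'deleted' for a blob in the presentations container."""
    age = today - last_modified
    if age > DELETE_AFTER:
        return "deleted"
    if age > COOL_AFTER:
        return "cool"
    return "hot"


if __name__ == "__main__":
    print(presentation_blob_state(date(2025, 1, 1), date(2025, 6, 1)))
```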

Networking:
  • Public access: Enabled (with SAS tokens)
  • Private endpoint: 10.0.2.6
  • Firewall: Allow Azure services + Container Apps subnet

Features:
  • Soft delete: Enabled (14 days for blobs, 7 days for containers)
  • Versioning: Disabled
  • Change feed: Disabled
  • Blob indexing: Disabled

Access Methods:

  1. Connection String (used by app):

    DefaultEndpointsProtocol=https;AccountName=slidefactoryprod;
    AccountKey={key};EndpointSuffix=core.windows.net

  2. Presigned URLs (for downloads):
     • Generated via get_presigned_url() in storage client
     • Expiration: 1-24 hours (configurable)
     • Used for direct downloads (no proxy through FastAPI)

Storage Client Factory:
  • Location: app/filemanager/storage/factory.py
  • Backend selection via STORAGE_PROVIDER env var
  • Supports: Azure Blob, MinIO (local dev)
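
Backend selection through an environment variable usually follows a simple registry pattern. The sketch below illustrates that pattern only; it is not the contents of app/filemanager/storage/factory.py. The class names, the `create_storage_client` function, the `minio` provider value and the MinIO default are all assumptions made for the example.

```python
import os
from typing import Mapping, Optional


class AzureBlobClient:
    """Placeholder for an Azure Blob Storage backend (hypothetical)."""
    provider = "azure"


class MinioClient:
    """Placeholder for a MinIO backend used in local development (hypothetical)."""
    provider = "minio"


# Registry mapping STORAGE_PROVIDER values to backend classes.
_BACKENDS = {"azure": AzureBlobClient, "minio": MinioClient}


def create_storage_client(env: Optional[Mapping[str, str]] = None):
    """Instantiate the backend named by STORAGE_PROVIDER (default assumed: minio)."""
    env = os.environ if env is None else env
    provider = env.get("STORAGE_PROVIDER", "minio").lower()
    try:
        return _BACKENDS[provider]()
    except KeyError:
        raise ValueError(f"Unsupported STORAGE_PROVIDER: {provider!r}") from None


if __name__ == "__main__":
    client = create_storage_client({"STORAGE_PROVIDER": "azure"})
    print(type(client).__name__)
```

Failing loudly on an unknown provider keeps a typo in the Container App configuration from silently falling back to the wrong backend.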


N8N Queue Mode Setup

Architecture

N8N runs in distributed queue mode with:
  • 1 Main Instance: Handles UI, API, webhooks, scheduling (does NOT execute workflows)
  • 2-10 Worker Instances: Execute workflows in parallel (auto-scaling)
  • Redis Queue: Job distribution via Bull queue (DB 6)
  • Shared PostgreSQL: Workflow definitions and execution history (10.0.2.4:5432/n8n)

Main Instance

Name: slidefactory-n8n
URL: https://slidefactory-n8n.thankfulsmoke-fef50a06.westeurope.azurecontainerapps.io

Image: slidefactoryacr.azurecr.io/n8n-custom:1.121.3

Custom image includes pre-installed community nodes:
  • n8n-nodes-document-generator - Document generation
  • Additional nodes as needed

Resources:
  • CPU: 0.5 cores
  • Memory: 1.0 GB
  • Replicas: 1 (fixed, no auto-scaling)

Key Environment Variables:

EXECUTIONS_MODE=queue
QUEUE_BULL_REDIS_HOST=slidefactory-redis.redis.cache.windows.net
QUEUE_BULL_REDIS_PORT=6380
QUEUE_BULL_REDIS_DB=6
QUEUE_BULL_REDIS_TLS=true
N8N_ENCRYPTION_KEY=<must-be-identical-across-all-instances>
DB_TYPE=postgresdb
DB_POSTGRESDB_HOST=10.0.2.4
DB_POSTGRESDB_DATABASE=n8n
N8N_BASIC_AUTH_USER=admin
N8N_BASIC_AUTH_PASSWORD=<secret>
EXECUTIONS_TIMEOUT=3600
EXECUTIONS_TIMEOUT_MAX=7200
QUEUE_WORKER_LOCK_DURATION=7200000
QUEUE_WORKER_MAX_STALLED_COUNT=10
QUEUE_WORKER_STALLED_INTERVAL=60000
N8N_COMMUNITY_PACKAGES_ENABLED=true

Persistent Storage:
  • Azure File Share: n8n-nodes-storage
  • Mount path: /home/node/.n8n
  • Stores: Community nodes, credentials, settings

Ingress:
  • External ingress enabled
  • Target port: 5678 (N8N default)
  • HTTPS only

Worker Instances

Name: slidefactory-n8n-worker

Image: slidefactoryacr.azurecr.io/n8n-custom:1.121.3 (same as main)

Resources:
  • CPU: 1.0 cores
  • Memory: 2.0 GB
  • Min Replicas: 2 (can be 1 for persistent nodes, 0 for cost savings)
  • Max Replicas: 10

Auto-Scaling Rules:
  • CPU Utilization: >70%
  • Memory Utilization: >80%
  • Cool-down period: 5 minutes
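
For intuition about how utilization thresholds translate into replica counts, the sketch below applies the Kubernetes Horizontal Pod Autoscaler formula, desired = ceil(current × observed / target), taking the larger of the CPU and memory results and clamping to the configured range. Container Apps scaling is built on KEDA, which drives that autoscaler, but treating the rules above as exactly this formula is an assumption. The cool-down period and the autoscaler's tolerance band are ignored, and `desired_replicas` is a hypothetical helper.

```python
import math


def desired_replicas(
    current: int,
    cpu_pct: float,
    mem_pct: float,
    cpu_target: float = 70.0,  # CPU threshold from the rules above
    mem_target: float = 80.0,  # memory threshold from the rules above
    min_replicas: int = 2,
    max_replicas: int = 10,
) -> int:
    """Estimate the replica count an HPA-style scaler would aim for."""
    by_cpu = math.ceil(current * cpu_pct / cpu_target)
    by_mem = math.ceil(current * mem_pct / mem_target)
    # With several metrics, the largest proposal wins; then clamp to the range.
    return max(min_replicas, min(max_replicas, max(by_cpu, by_mem)))


if __name__ == "__main__":
    print(desired_replicas(current=2, cpu_pct=105, mem_pct=40))
```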

Command: n8n worker

Environment Variables: Same as main instance (except N8N_HOST, WEBHOOK_URL, etc.)

Critical: N8N_ENCRYPTION_KEY MUST be identical to main instance!
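
A key mismatch is easy to introduce when main and workers are deployed separately, so a pre-deployment comparison can help. The sketch below compares two environment dictionaries. Only `N8N_ENCRYPTION_KEY` is stated above as mandatory to match; the other entries in `MUST_MATCH` are an assumption based on the queue and database settings listed for the main instance, and `find_mismatches` is a hypothetical helper.

```python
from typing import Mapping

# N8N_ENCRYPTION_KEY is required to match per this guide; the rest is assumed.
MUST_MATCH = (
    "N8N_ENCRYPTION_KEY",
    "EXECUTIONS_MODE",
    "QUEUE_BULL_REDIS_HOST",
    "QUEUE_BULL_REDIS_PORT",
    "QUEUE_BULL_REDIS_DB",
    "DB_POSTGRESDB_HOST",
    "DB_POSTGRESDB_DATABASE",
)


def find_mismatches(main_env: Mapping[str, str], worker_env: Mapping[str, str]) -> list[str]:
    """Return the names of variables that differ (or are missing on one side)."""
    return [key for key in MUST_MATCH if main_env.get(key) != worker_env.get(key)]


if __name__ == "__main__":
    main = {"N8N_ENCRYPTION_KEY": "abc", "EXECUTIONS_MODE": "queue"}
    worker = {"N8N_ENCRYPTION_KEY": "abd", "EXECUTIONS_MODE": "queue"}
    print(find_mismatches(main, worker))
```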

Deployment

Automated (via GitHub Actions):

# Trigger workflow
# Go to: Actions → "N8N - Deploy N8N Queue Mode to Azure"
# Select action: deploy-all, deploy-main, deploy-workers, rollback
# Set worker replicas (min/max)
# Set N8N version (default: 1.121.3)

Manual (via script):

# Set required environment variables
export REDIS_PASSWORD="..."
export DB_PASSWORD="..."
export N8N_ENCRYPTION_KEY="..."
export N8N_ADMIN_PASSWORD="..."

# Deploy
./scripts/deploy-n8n-queue-mode.sh deploy-all

# Or deploy components separately
./scripts/deploy-n8n-queue-mode.sh deploy-main
./scripts/deploy-n8n-queue-mode.sh deploy-workers

# Verify
./scripts/deploy-n8n-queue-mode.sh verify

# Rollback
./scripts/deploy-n8n-queue-mode.sh rollback

Monitoring N8N Queue

Check Worker Count:

az containerapp revision list \
  --name slidefactory-n8n-worker \
  --resource-group rg-slidefactory-prod-001 \
  --query "[?properties.active].{name:name, replicas:properties.replicas}"

Check Redis Queue:

# Connect to Redis
redis-cli -h slidefactory-redis.redis.cache.windows.net \
  -p 6380 -a "<password>" --tls -n 6

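# Check queue depth
# (assumes n8n's default Bull prefix "bull" and queue name "jobs";
# run SCAN 0 MATCH "bull:*" to confirm the key names in use)
LLEN "bull:jobs:wait"
LLEN "bull:jobs:active"
# Bull stores completed jobs in a sorted set, so use ZCARD rather than LLEN
ZCARD "bull:jobs:completed"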

View Worker Logs:

az containerapp logs show \
  --name slidefactory-n8n-worker \
  --resource-group rg-slidefactory-prod-001 \
  --follow


CI/CD Pipeline

GitHub Actions Workflows

Location: .github/workflows/

Preview Deployment

File: .github/workflows/preview.yml
Trigger: Push to preview branch
Target: Preview environment

Steps:
  1. Checkout code
  2. Login to Azure (using AZURE_CREDENTIALS secret)
  3. Login to Azure Container Registry
  4. Auto-detect core version from wheel file
  5. Build Docker image with build arg CORE_VERSION
  6. Push image with tag preview-{sha}
  7. Update slidefactory-web-preview container app
  8. Update slidefactory-worker-preview container app

Secrets Required:
  • AZURE_CREDENTIALS - Azure service principal JSON
  • Environment variables are configured in Container App

Duration: ~5-10 minutes

Production Deployment

File: .github/workflows/production.yml
Trigger: Push to main branch
Target: Production environment

Steps: Same as preview, but:
  • Uses AZURE_CREDENTIALS secret (same resource group)
  • Pushes image with tag prod-{sha}
  • Updates slidefactory-web and slidefactory-worker

Duration: ~5-10 minutes
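
The two workflows differ only in the tag prefix and the Container Apps they update. The sketch below captures that mapping as data. It is illustrative and not taken from the workflow files; `plan_deployment` and `BRANCH_TARGETS` are hypothetical names, and the commit SHA is passed through unchanged because this guide does not say whether the full or short form is used.

```python
ACR = "slidefactoryacr.azurecr.io"

# Branch -> (image tag prefix, Container Apps updated), as described in this guide.
BRANCH_TARGETS = {
    "preview": ("preview", ["slidefactory-web-preview", "slidefactory-worker-preview"]),
    "main": ("prod", ["slidefactory-web", "slidefactory-worker"]),
}


def plan_deployment(branch: str, sha: str) -> tuple[str, list[str]]:
    """Return the image reference and the apps to update for a pushed branch."""
    if branch not in BRANCH_TARGETS:
        raise ValueError(f"No deployment is configured for branch {branch!r}")
    prefix, apps = BRANCH_TARGETS[branch]
    return f"{ACR}/slidefactory:{prefix}-{sha}", list(apps)


if __name__ == "__main__":
    image, apps = plan_deployment("preview", "abc123")
    print(image, apps)
```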

N8N Custom Image Build

File: .github/workflows/build-n8n-custom.yml
Trigger: Manual workflow dispatch or push to specific path
Target: Custom N8N image with community nodes

Steps:
  1. Build custom N8N image from n8nio/n8n:{version}
  2. Install community nodes via npm
  3. Push to ACR as n8n-custom:{version}

N8N Queue Mode Deployment

File: .github/workflows/deploy-n8n-queue-mode.yml
Trigger: Manual workflow dispatch
Target: N8N main + worker instances

Inputs:
  • action: deploy-main, deploy-workers, deploy-all, rollback
  • worker_min_replicas: Min worker count (default: 1)
  • worker_max_replicas: Max worker count (default: 10)
  • n8n_version: N8N image version (default: 1.121.3)

Secrets Required:
  • AZURE_CREDENTIALS
  • PROD_REDIS_PASSWORD
  • PROD_DB_PASSWORD
  • N8N_ENCRYPTION_KEY
  • N8N_ADMIN_PASSWORD


Environment Variables

Web & Worker Services

All environment variables are configured in Container App configuration (encrypted at rest).

Database:

DATABASE_URL=postgresql://dbadmin:{password}@10.0.2.4:5432/slidefactory?sslmode=require

Redis:

REDIS_HOST=slidefactory-redis.redis.cache.windows.net
REDIS_PORT=6380
REDIS_PASSWORD={secret}
REDIS_DB=2
REDIS_SSL=true
CELERY_BROKER_URL=redis://:{password}@slidefactory-redis.redis.cache.windows.net:6380/2?ssl=true
CELERY_RESULT_BACKEND=redis://:{password}@slidefactory-redis.redis.cache.windows.net:6380/2?ssl=true

Storage:

STORAGE_PROVIDER=azure
AZURE_STORAGE_CONNECTION_STRING=DefaultEndpointsProtocol=https;AccountName=slidefactoryprod;AccountKey={key};EndpointSuffix=core.windows.net
AZURE_STORAGE_ACCOUNT_NAME=slidefactoryprod
AZURE_STORAGE_ACCOUNT_KEY={secret}

AI Providers (examples):

AI_PROVIDER=openai
AI_MODEL=gpt-4o
OPENAI_API_KEY={secret}
AZURE_OPENAI_API_KEY={secret}
AZURE_OPENAI_ENDPOINT=https://....openai.azure.com/
ANTHROPIC_API_KEY={secret}

N8N Integration:

N8N_API_URL=https://slidefactory-n8n.thankfulsmoke-fef50a06.westeurope.azurecontainerapps.io
N8N_API_KEY={secret}

Context/RAG:

JINA_API_KEY={secret}

Authentication:

AZURE_TENANT_ID={guid}
AZURE_CLIENT_ID={guid}
AZURE_CLIENT_SECRET={secret}

Application:

VERSION=1.1.1
DEBUG_MODE=false
ENFORCE_HTTPS=true
GENERIC_TIMEZONE=Europe/Berlin

N8N Environment Variables

See N8N Queue Mode Setup section above.


Deployment Process

Standard Deployment Flow

Preview:
  1. Developer commits to preview branch
  2. GitHub Actions workflow triggers automatically
  3. Docker image built with core package + S5 branding
  4. Image pushed to ACR with tag preview-{sha}
  5. Container Apps updated with new image
  6. Rolling deployment (zero downtime)
  7. Health checks verify successful deployment

Production:
  1. Merge preview → main (after testing)
  2. GitHub Actions workflow triggers automatically
  3. Same build process as preview
  4. Image tagged as prod-{sha}
  5. Production Container Apps updated
  6. Rolling deployment with health checks

Manual Deployment

Update Container App:

# Login
az login

# Update preview web
az containerapp update \
  --name slidefactory-web-preview \
  --resource-group rg-slidefactory-prod-001 \
  --image slidefactoryacr.azurecr.io/slidefactory:preview-abc123

# Update production web
az containerapp update \
  --name slidefactory-web \
  --resource-group rg-slidefactory-prod-001 \
  --image slidefactoryacr.azurecr.io/slidefactory:prod-abc123

Rollback Procedures

Quick Rollback (switch to previous revision):

# List revisions
az containerapp revision list \
  --name slidefactory-web \
  --resource-group rg-slidefactory-prod-001 \
  --output table

# Activate previous revision
az containerapp revision activate \
  --name slidefactory-web \
  --resource-group rg-slidefactory-prod-001 \
  --revision slidefactory-web--previous-revision

Git Rollback (trigger redeployment):

git revert HEAD
git push origin main
# Wait for GitHub Actions to redeploy

Pre-Deployment Checklist

Before production deployment:
- [ ] Preview deployment successful
- [ ] All tests passing (run python scripts/test_deploy.py)
- [ ] Smoke tests pass on preview
- [ ] Database migrations tested (if any)
- [ ] N8N workflows tested
- [ ] Performance acceptable (<2s response times)
- [ ] No errors in logs
- [ ] S5 branding displays correctly
- [ ] Azure AD login working


Monitoring & Observability

Application Insights

Resource: appi-slidefactory
Instrumentation Key: Available in portal

Telemetry:
  • HTTP requests/responses (timing, status codes)
  • Dependencies (database, Redis, storage, external APIs)
  • Exceptions and errors (with stack traces)
  • Custom events and metrics
  • User sessions and page views

Retention: 90 days

Key Metrics:
  • Request rate (requests/sec)
  • Response time (avg, p95, p99)
  • Failure rate (%)
  • Dependency duration
  • Exception count

Log Analytics

Resource: log-slidefactory
Workspace ID: Available in portal

Data Sources:
  • Container Apps logs (stdout/stderr)
  • Application Insights telemetry
  • Azure resource logs (PostgreSQL, Redis, Storage)

Retention: 90 days

Sample Queries:

// Container App errors
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "slidefactory-web"
| where Log_s contains "ERROR"
| order by TimeGenerated desc
| take 100

// N8N worker activity
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "slidefactory-n8n-worker"
| where Log_s contains "Workflow executed"
| summarize count() by bin(TimeGenerated, 5m)

// Response time trends
requests
| where name contains "/api/"
| summarize avg(duration), percentile(duration, 95) by bin(timestamp, 5m)

Alerts

Configured alerts (email notifications):

Container App Health:
  • Health probe failures (>3 consecutive)
  • App not responding (503 errors)
  • High restart rate (>5 restarts/hour)

Resource Utilization:
  • CPU >80% for 5 minutes
  • Memory >85% for 5 minutes
  • Disk space >90%

Database:
  • Connection failures
  • High connection count (>80% of max)
  • Long-running queries (>30s)

Redis:
  • Connection errors
  • High memory usage (>80%)
  • High eviction rate

Storage:
  • Throttling errors (429)
  • High latency (>1s avg)

Dashboards

Azure Portal:
  • Container App metrics (CPU, memory, requests)
  • Database metrics (connections, queries, storage)
  • Redis metrics (memory, connections, hit rate)

Application Insights:
  • Live metrics stream
  • Application map (dependencies)
  • Performance blade (requests, dependencies)
  • Failures blade (exceptions, failed requests)

Health Endpoints

Web Service: GET /health
  • Returns: {"status": "healthy"} (200 OK)
  • Checks: Database connection, Redis connection

Worker Service: GET /health (port 8080)
  • Returns: {"status": "healthy"} (200 OK)
  • Checks: Redis connection, Celery broker
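
A health endpoint of this kind typically runs each dependency check and reports healthy only if all of them pass. The sketch below shows that aggregation in a framework-independent form. It is an assumption about the shape of the logic, not the application's actual handler: `run_health_checks` is a hypothetical name, and the 503 status and "unhealthy" body for the failure case are not specified in this guide.

```python
from typing import Callable, Mapping


def run_health_checks(checks: Mapping[str, Callable[[], bool]]) -> tuple[int, dict]:
    """Run every check; return (HTTP status, JSON body) for a /health response."""
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            # A check that raises (e.g. connection refused) counts as failed.
            ok = False
        if not ok:
            failed.append(name)
    if failed:
        return 503, {"status": "unhealthy", "failed": failed}  # assumed failure shape
    return 200, {"status": "healthy"}


if __name__ == "__main__":
    # Stand-in checks; real ones would ping the database and Redis.
    print(run_health_checks({"database": lambda: True, "redis": lambda: True}))
```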


Security Configuration

Identity & Access Management

Service Principal (for GitHub Actions):
  • Name: github-actions-slidefactory
  • Role: Contributor on rg-slidefactory-prod-001
  • Used for: Automated deployments

Managed Identity (Container Apps):
  • System-assigned identity for each container app
  • Used for: Accessing ACR (image pull)

API Keys (application-level):
  • Stored in api_keys table
  • Scopes: * (admin), specific endpoints
  • Created via CLI: slidefactory api-key create

Authentication

User Authentication:
  • Azure AD integration (OAuth 2.0)
  • Session-based (stored in Redis)
  • Local accounts (for testing)

API Authentication:
  • Bearer token: Authorization: Bearer sf_xxxxx
  • API key validation via database lookup
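
The bearer-token flow can be sketched in two steps: pull the key out of the Authorization header, then compare it against what the database holds. The example below is illustrative only. `extract_api_key` and `key_matches` are hypothetical helpers, and storing a SHA-256 digest of the key rather than the key itself is an assumption; this guide only says that validation is a database lookup.

```python
import hashlib
import hmac
from typing import Optional


def extract_api_key(authorization: Optional[str]) -> Optional[str]:
    """Return the sf_-prefixed key from an 'Authorization: Bearer ...' value, else None."""
    if not authorization:
        return None
    scheme, _, token = authorization.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token.startswith("sf_"):
        return None
    return token


def key_matches(presented: str, stored_sha256_hex: str) -> bool:
    """Compare a presented key with a stored SHA-256 hex digest in constant time."""
    digest = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return hmac.compare_digest(digest, stored_sha256_hex)


if __name__ == "__main__":
    key = extract_api_key("Bearer sf_example")
    stored = hashlib.sha256(b"sf_example").hexdigest()  # stand-in for a database row
    print(key is not None and key_matches(key, stored))
```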

Network Security

TLS/HTTPS:
  • All external traffic: HTTPS only (enforced by Container Apps)
  • PostgreSQL: TLS required
  • Redis: TLS required (port 6380)
  • Storage: HTTPS required

Firewall:
  • Database: Only accessible from Container Apps subnet
  • Redis: Only accessible via private endpoint
  • Storage: Accessible from Container Apps + management IPs

Private Endpoints:
  • Database, Redis, Storage all use private endpoints
  • No public internet access to PostgreSQL or Redis; the Storage Account keeps public access enabled, gated by SAS tokens and firewall rules (see Storage Configuration)

Secrets Management

GitHub Secrets (for CI/CD):
  • AZURE_CREDENTIALS - Service principal JSON
  • PROD_REDIS_PASSWORD - Redis password
  • PROD_DB_PASSWORD - PostgreSQL password
  • N8N_ENCRYPTION_KEY - N8N encryption key
  • N8N_ADMIN_PASSWORD - N8N admin password

Container App Configuration:
  • Environment variables stored encrypted at rest
  • Not logged in Container App logs
  • Accessed only by application code

Not Using:
  • Azure Key Vault (future consideration)
  • Customer-managed encryption keys

Data Protection

Encryption at Rest:
  • All Azure services use Microsoft-managed keys
  • Blob Storage: Encrypted
  • PostgreSQL: Encrypted
  • Redis: Encrypted

Encryption in Transit:
  • TLS 1.2+ required for all connections
  • HTTPS only for web traffic
  • Database and Redis require TLS

Backup & Recovery:
  • Database: Automated backups (30 days retention)
  • Blob Storage: Soft delete (14 days)
  • Container Images: Retained in ACR


Cost Analysis

Monthly Cost Estimate

Production Environment (~$500-600/month):

| Service | Cost/Month | Notes |
|---|---|---|
| Container Apps (Web) | $100-150 | 2-5 replicas, 1 vCPU, 2GB RAM |
| Container Apps (Worker) | $50-80 | 1-3 replicas, 1 vCPU, 2GB RAM |
| N8N Main | $30-40 | 1 replica, 0.5 vCPU, 1GB RAM |
| N8N Workers | $150-200 | 2-10 replicas, 1 vCPU, 2GB RAM |
| PostgreSQL | $80-100 | General Purpose, 2 vCores, 8GB RAM |
| Redis | $40-50 | Standard C2, 2.5GB |
| Blob Storage | $20-30 | GRS, Hot tier, ~1TB |
| Container Registry | $5 | Standard SKU |
| Log Analytics | $10-20 | 90 day retention |
| Application Insights | $5-10 | Based on telemetry volume |
| Total | ~$500-600 | |

Preview Environment (~$70-110/month):

| Service | Cost/Month | Notes |
|---|---|---|
| Container Apps (Web) | $40-60 | 1-3 replicas, 0.5 vCPU, 1GB RAM |
| Container Apps (Worker) | $30-50 | 1-2 replicas, 0.5 vCPU, 1GB RAM |
| Shared services | - | Uses production DB, Redis, Storage |
| Total | ~$70-110 | |

Combined Total: ~$570-710/month (production ~$500-600 plus preview ~$70-110)
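
The totals can be re-derived from the itemized rows with a few lines of arithmetic. The sketch below sums the low and high ends of each range from the two tables; `total` is an illustrative helper. The itemized production rows add up to a slightly wider range than the rounded headline figure, which is worth keeping in mind when budgeting.

```python
# (low, high) monthly cost in USD, copied from the tables above.
PRODUCTION = {
    "Container Apps (Web)": (100, 150),
    "Container Apps (Worker)": (50, 80),
    "N8N Main": (30, 40),
    "N8N Workers": (150, 200),
    "PostgreSQL": (80, 100),
    "Redis": (40, 50),
    "Blob Storage": (20, 30),
    "Container Registry": (5, 5),
    "Log Analytics": (10, 20),
    "Application Insights": (5, 10),
}
PREVIEW = {
    "Container Apps (Web)": (40, 60),
    "Container Apps (Worker)": (30, 50),
}


def total(rows: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the low ends and the high ends of every cost range."""
    return (sum(lo for lo, _ in rows.values()), sum(hi for _, hi in rows.values()))


if __name__ == "__main__":
    prod, prev = total(PRODUCTION), total(PREVIEW)
    print("production:", prod)  # (490, 685)
    print("preview:", prev)     # (70, 110)
    print("combined:", (prod[0] + prev[0], prod[1] + prev[1]))  # (560, 795)
```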

Cost Optimization

Implemented:
  • Auto-scaling for Container Apps (scale down when idle)
  • Shared database, Redis, storage across environments
  • Storage lifecycle policies (move to cool tier after 90 days)

Potential Savings:
  • Scale preview to 0 outside business hours: ~$30/month
  • Use Burstable PostgreSQL for preview: ~$20/month
  • Reserved Instances for production database: ~$30/month (30% discount)
  • Optimize N8N worker min replicas (set to 1 instead of 2): ~$50/month

Not Recommended (would impact availability):
  • Disable geo-redundancy for production
  • Reduce backup retention
  • Scale production to fewer replicas


Disaster Recovery

Backup Strategy

Database:
  • Automated backups: Every hour
  • Retention: 30 days (production), 7 days (preview)
  • Geo-redundant: Yes (production)
  • Point-in-time restore: Yes (any time within retention)

Blob Storage:
  • Soft delete: 14 days (blobs), 7 days (containers)
  • Geo-redundant: Yes (production)
  • Manual export: Quarterly to offline storage

Container Images:
  • Retention: Last 10 images per tag prefix
  • Stored in ACR with geo-replication option

Configuration:
  • GitHub repository: All code and configuration
  • Container App configuration: Exported via Azure CLI
  • Environment variables: Documented + stored in secrets

Recovery Procedures

Database Restore (point-in-time):

az postgres flexible-server restore \
  --resource-group rg-slidefactory-prod-001 \
  --name slidefactory-postgres-restored \
  --source-server slidefactory-postgres \
  --restore-time "2025-12-02T10:00:00Z"

Container App Rollback (to previous revision):

az containerapp revision activate \
  --name slidefactory-web \
  --resource-group rg-slidefactory-prod-001 \
  --revision slidefactory-web--previous-revision

Blob Storage Undelete:

# Via Azure Portal: Storage Account → Containers → Deleted blobs → Undelete

Complete Environment Recreation:
  1. Restore database from backup
  2. Recreate Container Apps from GitHub Actions
  3. Restore blob storage from geo-redundant replica
  4. Reconfigure environment variables from documentation
  5. Redeploy latest code from GitHub

RTO/RPO

Recovery Time Objective (RTO): 1 hour
  • Container App rollback: ~1 minute
  • Database restore: ~30 minutes
  • Complete environment recreation: ~1 hour

Recovery Point Objective (RPO):
  • Database: 1 hour (automated backups)
  • Blob Storage: 24 hours (geo-replication lag)
  • Code: 0 (Git repository)


Troubleshooting

Common Issues

Issue: Container App Not Starting

Symptoms: App shows "Provisioning" or "Failed" status

Diagnosis:

# Check logs
az containerapp logs show \
  --name slidefactory-web \
  --resource-group rg-slidefactory-prod-001 \
  --tail 100

# Check revision status
az containerapp revision list \
  --name slidefactory-web \
  --resource-group rg-slidefactory-prod-001

Common Causes:
  • Database migration failure (check logs for Alembic errors)
  • Missing environment variables
  • Invalid database connection string
  • Image pull failure (check ACR credentials)

Solutions:
  • Verify all environment variables are set
  • Check database is accessible from Container Apps subnet
  • Manually run migrations: az containerapp exec --name slidefactory-web --resource-group rg-slidefactory-prod-001 --command "alembic upgrade head"
  • Verify ACR credentials in Container App configuration

Issue: N8N Workers Not Picking Up Jobs

Symptoms: Workflows queued but not executing

Diagnosis:

# Check worker logs
az containerapp logs show \
  --name slidefactory-n8n-worker \
  --resource-group rg-slidefactory-prod-001 \
  --tail 100

# Check Redis queue
# (assumes n8n's default Bull prefix "bull" and queue name "jobs")
redis-cli -h slidefactory-redis.redis.cache.windows.net \
  -p 6380 -a "<password>" --tls -n 6 \
  LLEN "bull:jobs:wait"

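Common Causes:
  • N8N_ENCRYPTION_KEY mismatch between main and workers
  • Redis connection issues
  • Worker instances not running

Solutions:
  • Verify N8N_ENCRYPTION_KEY is identical in both main and worker
  • Check Redis connectivity from worker (requires redis-cli in the worker image): az containerapp exec --name slidefactory-n8n-worker --resource-group rg-slidefactory-prod-001 --command "redis-cli ping"
  • Restart workers: az containerapp revision restart --name slidefactory-n8n-worker --resource-group rg-slidefactory-prod-001 --revision <revision-name>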

Issue: High Database Connection Count

Symptoms: "Too many connections" errors

Diagnosis:

# Check connection count
psql -h 10.0.2.4 -U dbadmin -d slidefactory \
  -c "SELECT count(*) FROM pg_stat_activity;"

Solutions:
  • Increase max_connections in PostgreSQL configuration
  • Reduce Container App replica count temporarily
  • Consider adding PgBouncer connection pooler
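
Whether the replica limits can exhaust `max_connections` is a matter of simple arithmetic. The sketch below estimates the worst case from the maximum replica counts in this guide. It assumes one process per replica and a SQLAlchemy-style pool with the library defaults `pool_size=5` and `max_overflow=10`. The application's real pool settings are not documented here, N8N's own connections are ignored, and `worst_case_connections` is a hypothetical helper.

```python
MAX_CONNECTIONS = 100  # PostgreSQL max_connections from this guide

# Maximum replica counts from the Container Apps Configuration section.
MAX_REPLICAS = {
    "slidefactory-web": 5,
    "slidefactory-worker": 3,
    "slidefactory-web-preview": 3,
    "slidefactory-worker-preview": 2,
}


def worst_case_connections(replicas: dict[str, int], pool_size: int, max_overflow: int) -> int:
    """Upper bound on open connections if every replica fills its pool."""
    return sum(replicas.values()) * (pool_size + max_overflow)


if __name__ == "__main__":
    # Assumed pool settings: SQLAlchemy QueuePool defaults.
    needed = worst_case_connections(MAX_REPLICAS, pool_size=5, max_overflow=10)
    print(f"worst case {needed} connections vs limit {MAX_CONNECTIONS}")
```

Under these assumptions the worst case exceeds the limit, which is the situation a pooler such as PgBouncer is meant to address.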

Issue: Celery Tasks Not Processing

Symptoms: Tasks queued but not executed

Diagnosis:

# Check worker logs
az containerapp logs show \
  --name slidefactory-worker \
  --resource-group rg-slidefactory-prod-001 \
  --tail 100

# Check Redis for queued tasks
redis-cli -h slidefactory-redis.redis.cache.windows.net \
  -p 6380 -a "<password>" --tls -n 2 \
  LLEN "celery"

Solutions:
  • Verify worker is running and healthy
  • Check Redis connection (DB 2)
  • Restart worker: az containerapp revision restart --name slidefactory-worker --resource-group rg-slidefactory-prod-001 --revision <revision-name>

Useful Commands

View Container App Status:

az containerapp show \
  --name slidefactory-web \
  --resource-group rg-slidefactory-prod-001 \
  --query "{status:properties.provisioningState, replicas:properties.template.scale}"

View Logs (Live):

az containerapp logs show \
  --name slidefactory-web \
  --resource-group rg-slidefactory-prod-001 \
  --follow

Execute Command in Container:

az containerapp exec \
  --name slidefactory-web \
  --resource-group rg-slidefactory-prod-001 \
  --command "bash"

Restart Container App:

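# Restarts are issued per revision; list revisions first to find the active one
az containerapp revision restart \
  --name slidefactory-web \
  --resource-group rg-slidefactory-prod-001 \
  --revision <revision-name>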

Scale Container App:

az containerapp update \
  --name slidefactory-web \
  --resource-group rg-slidefactory-prod-001 \
  --min-replicas 2 \
  --max-replicas 5

View Database Connections:

psql -h 10.0.2.4 -U dbadmin -d slidefactory \
  -c "SELECT datname, usename, application_name, client_addr, state FROM pg_stat_activity;"

Check Redis Memory:

redis-cli -h slidefactory-redis.redis.cache.windows.net \
  -p 6380 -a "<password>" --tls \
  INFO memory


Appendix

Maintenance Windows

  • Database: Sunday 02:00-06:00 UTC
  • Deployments: Any time after preview testing
  • Major Updates: Scheduled in advance with user notification

Contact & Support

  • Repository: https://github.com/cgast/s5-slidefactory
  • Core Package: https://github.com/cgast/slidefactory-core
  • Issues: GitHub Issues in respective repositories

Document Version: 1.0
Last Review: 2025-12-02
Next Review: 2025-12-15