Troubleshooting S5 on Azure

Common issues and solutions for S5 Slidefactory on Azure Container Apps.

Quick Diagnostics

Is S5 Down?

# Check health endpoint
curl https://slidefactory.sportfive.com/health

# Expected: {"status": "healthy", "database": "connected", "redis": "connected"}
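The expected response can also be checked from a script. A minimal sketch, assuming the endpoint returns the three fields exactly as shown above; `health_ok` is a hypothetical helper name, and the match is a loose pattern check rather than a full JSON parse:

```shell
# Return 0 only if the health body reports all three fields as expected.
# Whitespace around the colon is tolerated; this is not a JSON parser.
health_ok() {
  body="$1"
  for pair in 'status:healthy' 'database:connected' 'redis:connected'; do
    key="${pair%%:*}"
    value="${pair#*:}"
    printf '%s' "$body" \
      | grep -Eq "\"$key\"[[:space:]]*:[[:space:]]*\"$value\"" || return 1
  done
  return 0
}

# Usage (performs a real request):
#   health_ok "$(curl -fsS --max-time 10 https://slidefactory.sportfive.com/health)" \
#     && echo "S5 is up" || echo "S5 is down or degraded"
```

An empty body (for example when curl fails) is treated as unhealthy, so the same check covers a 502/503 response.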

Check Container App Status

  1. Go to Azure Portal → slidefactory-web-prod
  2. Check Revision Management → Active revision should be "Running"
  3. Check Log stream for recent errors

Common Issues

Issue: Website Not Loading (502/503 Error)

Symptoms: Browser shows "Bad Gateway" or "Service Unavailable"

Diagnosis:

# Check if health endpoint responds
curl https://slidefactory.sportfive.com/health

# Check Container App logs
az containerapp logs show \
  --name slidefactory-web-prod \
  --resource-group rg-slidefactory-prod \
  --tail 50

Solutions:

  1. Container App not running:
     • Go to Azure Portal → Container App → Check replica count
     • If 0 replicas: Check if auto-scaling is enabled
     • Manually scale: Set min replicas to 1

  2. Health probe failing:
     • Check logs for startup errors
     • Verify database connection in environment variables
     • Restart Container App: Azure Portal → Restart

  3. Recent deployment failed:
     • Check GitHub Actions workflow logs
     • Rollback to previous revision: See Deployment - Rollback
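The scaling and restart steps can also be done from the CLI instead of the portal. A minimal sketch, assuming the app and resource group names used elsewhere on this page; `set_min_replicas` and `restart_revision` are hypothetical wrapper names, and the `AZ` variable is an illustrative dry-run switch, not an Azure CLI feature:

```shell
# AZ defaults to the real CLI; set AZ=echo to print commands instead of running them.
AZ="${AZ:-az}"

# Keep at least N replicas (default 1) so the app cannot scale to zero.
set_min_replicas() {
  "$AZ" containerapp update \
    --name "$1" \
    --resource-group "$2" \
    --min-replicas "${3:-1}"
}

# Restart a single revision by name.
restart_revision() {
  "$AZ" containerapp revision restart \
    --name "$1" \
    --resource-group "$2" \
    --revision "$3"
}

# Usage (changes production):
#   set_min_replicas slidefactory-web-prod rg-slidefactory-prod
#   az containerapp revision list \
#     --name slidefactory-web-prod --resource-group rg-slidefactory-prod \
#     --query "[?properties.active].name" --output tsv
#   restart_revision slidefactory-web-prod rg-slidefactory-prod <revision-name>
```

Running with `AZ=echo` first shows the exact commands before anything is changed.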

Issue: Cannot Log In (Azure AD Error)

Symptoms: "AADSTS" error code when attempting Azure AD login

Common AADSTS Errors:

Error Code    | Meaning               | Solution
--------------|-----------------------|----------------------------------------------------------
AADSTS50011   | Redirect URI mismatch | Add correct redirect URI to Azure AD app registration
AADSTS700016  | Application not found | Verify AZURE_CLIENT_ID in Container App environment
AADSTS7000215 | Invalid client secret | Generate new client secret and update AZURE_CLIENT_SECRET
AADSTS50020   | User not in tenant    | Add user to Azure AD tenant
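When scanning logs, the lookup can be scripted. A minimal sketch covering only the four codes listed above; `aadsts_hint` is a hypothetical helper name, and the log line in the usage comment is illustrative, not real output:

```shell
# Print the suggested fix for the first AADSTS code found in a message.
# Returns 1 if no code is present and 2 if the code is not one of the four above.
aadsts_hint() {
  code=$(printf '%s\n' "$1" | grep -Eo 'AADSTS[0-9]+' | head -n 1)
  case "$code" in
    AADSTS50011)   echo "$code: add the correct redirect URI to the Azure AD app registration" ;;
    AADSTS700016)  echo "$code: verify AZURE_CLIENT_ID in the Container App environment" ;;
    AADSTS7000215) echo "$code: generate a new client secret and update AZURE_CLIENT_SECRET" ;;
    AADSTS50020)   echo "$code: add the user to the Azure AD tenant" ;;
    "")            echo "no AADSTS code found"; return 1 ;;
    *)             echo "$code: not listed here, check the Microsoft error code reference"; return 2 ;;
  esac
}

# Usage:
#   aadsts_hint "login failed: AADSTS50011 for client ..."
```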

Solutions:

  1. Verify Azure AD configuration:
     • Go to Azure AD → App registrations → Find S5 app
     • Check Redirect URIs includes: https://slidefactory.sportfive.com/auth/azure/callback
     • Check Certificates & secrets → Client secret is not expired

  2. Verify environment variables:

    az containerapp show \
      --name slidefactory-web-prod \
      --resource-group rg-slidefactory-prod \
      --query "properties.template.containers[0].env" \
      --output table

    Ensure AZURE_TENANT_ID, AZURE_CLIENT_ID, and AZURE_CLIENT_SECRET are set correctly. This query assumes the web container is the first container in the template. Variables backed by a Container App secret show a secretRef instead of a value.

  3. Test on the preview environment:
     • Try logging in on preview environment first
     • If preview works, compare environment variables
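To confirm the three Azure AD variables are defined at all, their names can be checked in one pass. A minimal sketch; `missing_env_vars` is a hypothetical helper, and the query in the usage comment assumes the web container is the first container in the app template:

```shell
# Read variable names on stdin (one per line) and print each required name
# that is absent. Exit status is 0 only when nothing is missing.
missing_env_vars() {
  present=$(cat)
  missing=0
  for name in AZURE_TENANT_ID AZURE_CLIENT_ID AZURE_CLIENT_SECRET; do
    if ! printf '%s\n' "$present" | grep -qx "$name"; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (calls Azure):
#   az containerapp show \
#     --name slidefactory-web-prod \
#     --resource-group rg-slidefactory-prod \
#     --query "properties.template.containers[0].env[].name" \
#     --output tsv | missing_env_vars
```

This only checks that the names exist. Whether the values are correct still has to be compared against the app registration.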

Issue: Presentation Generation Fails

Symptoms: Presentations stuck in "running" status or fail with error

Diagnosis:

# Check worker logs
az containerapp logs show \
  --name slidefactory-worker-prod \
  --resource-group rg-slidefactory-prod \
  --tail 100 | grep ERROR

Solutions:

  1. Worker not running:
     • Check worker Container App status
     • Verify worker has min 1 replica
     • Restart worker if needed

  2. N8N workflow error:
     • Go to N8N UI → Executions
     • Find failed execution
     • Check error message
     • Common causes: N8N API key expired, workflow disabled, missing credentials

  3. Storage error:
     • Verify Azure Blob Storage is accessible
     • Check storage account firewall rules
     • Verify AZURE_STORAGE_ACCOUNT_NAME and AZURE_STORAGE_ACCOUNT_KEY are correct

  4. AI provider error:
     • Check OpenAI/Azure OpenAI API key is valid
     • Verify API quota not exceeded
     • Check AI provider status page

Issue: Database Connection Errors

Symptoms: Logs show psycopg2.OperationalError or database connection failures

Solutions:

  1. Check database status:
     • Go to Azure Portal → PostgreSQL Flexible Server
     • Verify status is "Available"
     • Check if maintenance is in progress

  2. Connection string issues:
     • Verify DATABASE_URL environment variable
     • Format: postgresql://user:password@host:5432/database?sslmode=require
     • Ensure sslmode=require is present

  3. Firewall rules:
     • Go to PostgreSQL → Networking
     • Ensure "Allow Azure services" is enabled
     • Add Container Apps subnet if using private endpoint

  4. Connection pool exhausted:
     • Check active connections:
       SELECT count(*) FROM pg_stat_activity WHERE datname = 'slidefactory';
     • If near the limit, increase max_connections in the PostgreSQL server parameters. The stock PostgreSQL default is 100; check the server's actual value with SHOW max_connections;
     • Restart Container App to reset connection pool
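The connection string format can be sanity-checked before it is deployed. A minimal sketch that checks only the scheme and the sslmode parameter; `check_database_url` is a hypothetical helper name:

```shell
# Return 0 if the URL uses a PostgreSQL scheme and requests sslmode=require.
# Prints the reason on failure; never echoes the URL, since it holds a password.
check_database_url() {
  case "$1" in
    postgresql://*|postgres://*) ;;   # libpq accepts both spellings
    *) echo "scheme is not postgresql://"; return 1 ;;
  esac
  case "$1" in
    *\?sslmode=require|*\?sslmode=require\&*|*\&sslmode=require|*\&sslmode=require\&*)
      return 0 ;;
    *) echo "sslmode=require is missing"; return 1 ;;
  esac
}

# Usage:
#   check_database_url "$DATABASE_URL" && echo "format looks right"
```

It does not connect to the database, so a passing check says nothing about credentials or firewall rules.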

Issue: Redis Connection Errors

Symptoms: Logs show redis.exceptions.ConnectionError or timeout

Solutions:

  1. Check Redis status:
     • Go to Azure Portal → Azure Cache for Redis
     • Verify status is "Running"
     • Check if scheduled maintenance is in progress

  2. Connection settings:
     • Verify REDIS_HOST, REDIS_PORT, REDIS_PASSWORD in environment variables
     • Ensure REDIS_SSL=true for Azure Redis
     • Port should be 6380 (not 6379) for TLS

  3. Firewall rules:
     • Go to Redis → Firewall
     • Ensure Container Apps subnet is allowed
     • Check if private endpoint is configured correctly

  4. Connection limits:
     • Check connected clients:
       redis-cli -h <host> -p 6380 -a <password> --tls CLIENT LIST | wc -l
     • If near limit, scale Redis to higher tier
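The connection settings can be checked in a shell where the same variables are exported. A minimal sketch using the four variable names from this page; `check_redis_settings` is a hypothetical helper name:

```shell
# Check the Redis settings the app reads from its environment.
# Prints one line per problem and returns non-zero if any were found.
check_redis_settings() {
  problems=0
  [ -n "${REDIS_HOST:-}" ]        || { echo "REDIS_HOST is empty"; problems=1; }
  [ -n "${REDIS_PASSWORD:-}" ]    || { echo "REDIS_PASSWORD is empty"; problems=1; }
  [ "${REDIS_PORT:-}" = "6380" ]  || { echo "REDIS_PORT should be 6380 for TLS"; problems=1; }
  [ "${REDIS_SSL:-}" = "true" ]   || { echo "REDIS_SSL should be true"; problems=1; }
  return "$problems"
}

# Usage, followed by a live TLS connectivity test:
#   check_redis_settings && \
#     redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" -a "$REDIS_PASSWORD" --tls PING
```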

Issue: Slow Performance

Symptoms: Pages load slowly (> 5 seconds), timeouts

Diagnosis:

# Check Application Insights for slow requests
# Go to Azure Portal → Application Insights → Performance
# Sort by duration to find slow requests

Solutions:

  1. High CPU usage:
     • Check Container App metrics (CPU usage)
     • If consistently > 70%, scale out (add more replicas) or scale up (increase CPU/memory)
     • Auto-scaling should normally handle this if scale rules are configured

  2. Slow database queries:
     • Find slow queries in PostgreSQL (on PostgreSQL 12 and earlier the column is named mean_time):
       SELECT query, mean_exec_time, calls
       FROM pg_stat_statements
       ORDER BY mean_exec_time DESC
       LIMIT 10;
     • Add indexes to frequently queried columns
     • Optimize N+1 queries

  3. Redis memory issues:
     • Check Redis memory usage in metrics
     • If > 80%, scale to higher tier or configure a cache eviction policy

  4. Network latency:
     • Verify Container Apps, database, and Redis are in same Azure region
     • Check if private endpoints are configured (reduces latency)
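Response time can also be measured from the command line to confirm the symptom before digging into metrics. A minimal sketch, assuming the 5-second threshold from the symptoms above; `request_seconds` and `is_slow` are hypothetical helper names:

```shell
# Print the total request time in seconds for a URL (performs a real HTTP request).
request_seconds() {
  curl -o /dev/null -sS --max-time 30 -w '%{time_total}' "$1"
}

# Return 0 if a duration in seconds exceeds the threshold (default 5).
is_slow() {
  awk -v t="$1" -v limit="${2:-5}" 'BEGIN { exit !(t + 0 > limit + 0) }'
}

# Usage:
#   t=$(request_seconds https://slidefactory.sportfive.com/health)
#   if is_slow "$t"; then echo "slow: ${t}s"; else echo "ok: ${t}s"; fi
```

A single request is noisy, so repeat the measurement a few times before drawing conclusions.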

Issue: Storage Errors (Blob Storage)

Symptoms: Cannot upload templates, presentations don't download

Solutions:

  1. Authentication errors:
     • Verify AZURE_STORAGE_ACCOUNT_NAME and AZURE_STORAGE_ACCOUNT_KEY
     • Regenerate access key if needed (will require updating environment variable)

  2. Firewall blocking access:
     • Go to Storage Account → Networking
     • Add Container Apps subnet to allowed networks
     • Or enable "Allow Azure services"

  3. Container not found:
     • Verify containers exist: presentations, templates, documents, static
     • Create missing containers via Azure Portal or CLI

  4. Quota exceeded:
     • Check storage account capacity
     • Implement lifecycle policies to delete old files
     • Clean up large files manually
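Missing containers can be created from the CLI in one loop. A minimal sketch, assuming the four container names listed above; `ensure_containers` is a hypothetical helper name, the `AZ` variable is an illustrative dry-run switch, and authentication is left to whatever the CLI session already has configured:

```shell
# AZ defaults to the real CLI; set AZ=echo to print commands instead of running them.
AZ="${AZ:-az}"

# Create each expected container in the given storage account.
# Creating a container that already exists leaves it unchanged.
ensure_containers() {
  account="$1"
  for container in presentations templates documents static; do
    "$AZ" storage container create \
      --name "$container" \
      --account-name "$account"
  done
}

# Usage (changes the storage account):
#   ensure_containers "$AZURE_STORAGE_ACCOUNT_NAME"
```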

Issue: Deployment Fails

Symptoms: GitHub Actions workflow fails, Container App not updated

Solutions:

  1. Check GitHub Actions logs:
     • Go to repository → Actions → Select failed workflow
     • Look for specific error message

  2. Common deployment errors:
     • Docker build fails: Check Dockerfile syntax, verify base image exists
     • Azure login fails: Verify AZURE_CREDENTIALS secret is valid
     • Image push fails: Check Azure Container Registry credentials
     • Container App update fails: Check if resource group/app exists

  3. Core package download fails:
     • Verify GH_PAT_SLIDEFACTORY_CORE secret has access to private repo
     • Check if slidefactory-core v1.0.8 release exists
     • Verify wheel file is attached to release
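The release and wheel checks can be done with the GitHub CLI. A minimal sketch, assuming `gh` is installed and authenticated with access to the private repo; `has_wheel` is a hypothetical helper, and `OWNER` in the usage comment is a placeholder because the repository owner is not stated on this page:

```shell
# Read release asset names on stdin (one per line) and succeed
# if at least one of them is a wheel file.
has_wheel() {
  grep -Eq '\.whl$'
}

# Usage (calls GitHub; fails earlier if the release does not exist):
#   gh release view v1.0.8 --repo OWNER/slidefactory-core \
#     --json assets --jq '.assets[].name' | has_wheel \
#     && echo "wheel attached" || echo "no wheel on this release"
```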

Diagnostic Commands

Check All Services

# Web app status
az containerapp show \
  --name slidefactory-web-prod \
  --resource-group rg-slidefactory-prod \
  --query "properties.runningStatus"

# Worker status
az containerapp show \
  --name slidefactory-worker-prod \
  --resource-group rg-slidefactory-prod \
  --query "properties.runningStatus"

# Database status
az postgres flexible-server show \
  --name postgres-prod \
  --resource-group rg-slidefactory-prod \
  --query "state"

# Redis status
az redis show \
  --name redis-prod \
  --resource-group rg-slidefactory-prod \
  --query "provisioningState"

View Recent Logs

# Last 50 lines from web app
az containerapp logs show \
  --name slidefactory-web-prod \
  --resource-group rg-slidefactory-prod \
  --tail 50

# Filter for errors only
az containerapp logs show \
  --name slidefactory-web-prod \
  --resource-group rg-slidefactory-prod \
  --tail 100 | grep -E "ERROR|CRITICAL"

Database Connection Test

# From local machine (if firewall allows)
psql "${DATABASE_URL}" -c "SELECT version();"

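# From Container App (requires psql in the container image).
# ${DATABASE_URL} is expanded by your local shell before the command is sent,
# so export the production connection string locally first.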
az containerapp exec \
  --name slidefactory-web-prod \
  --resource-group rg-slidefactory-prod \
  --command "psql ${DATABASE_URL} -c 'SELECT 1;'"

Escalation

When to Escalate

Escalate to Azure support if:

  • Azure service is down (check Azure Status)
  • Issue persists after trying all solutions
  • Data loss or corruption suspected
  • Security incident detected

Information to Provide

When escalating, include:

  • Timestamp when issue started
  • Affected environment (preview/production)
  • Error messages from logs
  • Steps already tried
  • Impact on users

Support Contacts

  • Azure Support: Create support ticket in Azure Portal
  • S5 Repository Owner: Create GitHub issue with urgent label
  • On-Call Engineer: Use PagerDuty (production issues only)