Troubleshooting S5 on Azure
Common issues and solutions for S5 Slidefactory on Azure Container Apps.
Quick Diagnostics
Is S5 Down?
# Check health endpoint
curl https://slidefactory.sportfive.com/health
# Expected: {"status": "healthy", "database": "connected", "redis": "connected"}
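For scripted checks, the expected response can be validated field by field. The following is a minimal sketch, assuming the health endpoint returns the flat JSON shown above; the function name `health_is_ok` is illustrative and not part of S5.

```shell
# Hypothetical helper: succeeds only if every component in the health
# payload reports the expected value. Uses plain string matching so it
# works without jq; assumes the flat JSON shape shown above.
health_is_ok() {
  local compact field
  # Strip whitespace so '"status": "healthy"' and '"status":"healthy"' both match.
  compact=$(printf '%s' "$1" | tr -d ' \t\r\n')
  for field in '"status":"healthy"' '"database":"connected"' '"redis":"connected"'; do
    case "$compact" in
      *"$field"*) ;;      # field present with the expected value
      *) return 1 ;;      # field missing or reporting something else
    esac
  done
  return 0
}

# Usage (performs a real HTTP request):
#   body=$(curl --silent --max-time 10 https://slidefactory.sportfive.com/health)
#   health_is_ok "$body" && echo "S5 is healthy" || echo "S5 is degraded or down"
```

String matching keeps the check dependency-free; if `jq` is available, parsing the JSON properly would be more robust.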
Check Container App Status
- Go to Azure Portal → slidefactory-web-prod
- Check Revision Management → Active revision should be "Running"
- Check Log stream for recent errors
Common Issues
Issue: Website Not Loading (502/503 Error)
Symptoms: Browser shows "Bad Gateway" or "Service Unavailable"
Diagnosis:
# Check if health endpoint responds
curl https://slidefactory.sportfive.com/health
# Check Container App logs
az containerapp logs show \
--name slidefactory-web-prod \
--resource-group rg-slidefactory-prod \
--tail 50
Solutions:
- Container App not running:
  - Go to Azure Portal → Container App → Check replica count
  - If 0 replicas: Check if auto-scaling is enabled
  - Manually scale: Set min replicas to 1
- Health probe failing:
  - Check logs for startup errors
  - Verify database connection in environment variables
  - Restart Container App: Azure Portal → Restart
- Recent deployment failed:
  - Check GitHub Actions workflow logs
  - Rollback to previous revision: See Deployment - Rollback
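The manual scaling and restart steps can also be done from the CLI. This is a minimal sketch, assuming the app and resource group names used elsewhere on this page; the `run` wrapper, the `DRY_RUN` variable, and the function names are illustrative helpers, and the revision name has to be looked up first.

```shell
APP="slidefactory-web-prod"
RG="rg-slidefactory-prod"

# Illustrative helper: with DRY_RUN=1 the command is printed instead of executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Keep at least one replica running so the app cannot scale to zero.
ensure_min_replica() {
  run az containerapp update --name "$APP" --resource-group "$RG" --min-replicas 1
}

# List revisions to find the active one and check its state.
list_revisions() {
  run az containerapp revision list --name "$APP" --resource-group "$RG" --output table
}

# Restart a specific revision; pass a revision name taken from list_revisions.
restart_revision() {
  run az containerapp revision restart --name "$APP" --resource-group "$RG" --revision "$1"
}
```

Setting `DRY_RUN=1` before calling these functions prints each command for review, which is a cheap safeguard before changing production.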
Issue: Cannot Log In (Azure AD Error)
Symptoms: "AADSTS" error code when attempting Azure AD login
Common AADSTS Errors:
| Error Code | Meaning | Solution |
|---|---|---|
| AADSTS50011 | Redirect URI mismatch | Add correct redirect URI to Azure AD app registration |
| AADSTS700016 | Application not found | Verify AZURE_CLIENT_ID in Container App environment |
| AADSTS7000215 | Invalid client secret | Generate new client secret and update AZURE_CLIENT_SECRET |
| AADSTS50020 | User not in tenant | Add user to Azure AD tenant |
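When the error shows up in logs rather than in the browser, the table can be applied mechanically. The sketch below covers only the four codes listed above; `aadsts_hint` is an illustrative name, not an existing tool.

```shell
# Illustrative helper: extract the first AADSTS code from a log line or
# error message and print the matching fix from the table above.
aadsts_hint() {
  local code
  code=$(printf '%s\n' "$1" | grep -oE 'AADSTS[0-9]+' | head -n 1)
  case "$code" in
    AADSTS50011)   echo "Redirect URI mismatch: add the correct redirect URI to the app registration" ;;
    AADSTS700016)  echo "Application not found: verify AZURE_CLIENT_ID" ;;
    AADSTS7000215) echo "Invalid client secret: generate a new secret and update AZURE_CLIENT_SECRET" ;;
    AADSTS50020)   echo "User not in tenant: add the user to the Azure AD tenant" ;;
    "")            echo "No AADSTS code found"; return 1 ;;
    *)             echo "$code is not covered by this table"; return 1 ;;
  esac
}

# Usage:
#   aadsts_hint "AADSTS50011: The redirect URI specified in the request does not match"
```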
Solutions:
- Verify Azure AD configuration:
  - Go to Azure AD → App registrations → Find S5 app
  - Check Redirect URIs includes: https://slidefactory.sportfive.com/auth/azure/callback
  - Check Certificates & secrets → Client secret is not expired
- Verify environment variables:
  az containerapp show \
  --name slidefactory-web-prod \
  --resource-group rg-slidefactory-prod \
  --query "properties.configuration.secrets" \
  --output table
  Ensure AZURE_TENANT_ID, AZURE_CLIENT_ID, AZURE_CLIENT_SECRET are set correctly
- Test credentials locally:
  - Try logging in on preview environment first
  - If preview works, compare environment variables
Issue: Presentation Generation Fails
Symptoms: Presentations stuck in "running" status or fail with error
Diagnosis:
# Check worker logs
az containerapp logs show \
--name slidefactory-worker-prod \
--resource-group rg-slidefactory-prod \
--tail 100 | grep ERROR
Solutions:
- Worker not running:
  - Check worker Container App status
  - Verify worker has min 1 replica
  - Restart worker if needed
- N8N workflow error:
  - Go to N8N UI → Executions
  - Find failed execution
  - Check error message
  - Common: N8N API key expired, workflow disabled, missing credentials
- Storage error:
  - Verify Azure Blob Storage is accessible
  - Check storage account firewall rules
  - Verify AZURE_STORAGE_ACCOUNT_NAME and AZURE_STORAGE_ACCOUNT_KEY are correct
- AI provider error:
  - Check OpenAI/Azure OpenAI API key is valid
  - Verify API quota not exceeded
  - Check AI provider status page
Issue: Database Connection Errors
Symptoms: Logs show psycopg2.OperationalError or database connection failures
Solutions:
- Check database status:
  - Go to Azure Portal → PostgreSQL Flexible Server
  - Verify status is "Available"
  - Check if maintenance is in progress
- Connection string issues:
  - Verify DATABASE_URL environment variable
  - Format: postgresql://user:password@host:5432/database?sslmode=require
  - Ensure sslmode=require is present
- Firewall rules:
  - Go to PostgreSQL → Networking
  - Ensure "Allow Azure services" is enabled
  - Add Container Apps subnet if using private endpoint
- Connection pool exhausted:
  - Check active connections
  - If near limit (default 100), increase max_connections in PostgreSQL config
  - Restart Container App to reset connection pool
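Two of the checks above lend themselves to a quick script: confirming that the connection string requests TLS, and comparing current connections with the configured limit. This is a minimal sketch, assuming `DATABASE_URL` is set in the local shell and `psql` is installed; the function names are illustrative. The query relies on the standard `pg_stat_activity` view and the `max_connections` setting.

```shell
# Illustrative helper: succeeds if the connection string contains
# sslmode=require as a query parameter.
database_url_has_ssl() {
  case "$1" in
    *'?sslmode=require'* | *'&sslmode=require'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Show how many connections are open next to the server limit.
# Requires psql and network access to the database.
show_connection_usage() {
  psql "$DATABASE_URL" --command \
    "SELECT count(*) AS in_use, current_setting('max_connections') AS max_connections FROM pg_stat_activity;"
}

# Usage:
#   database_url_has_ssl "$DATABASE_URL" || echo "sslmode=require is missing"
#   show_connection_usage
```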
Issue: Redis Connection Errors
Symptoms: Logs show redis.exceptions.ConnectionError or timeout
Solutions:
- Check Redis status:
  - Go to Azure Portal → Azure Cache for Redis
  - Verify status is "Running"
  - Check if scheduled maintenance is in progress
- Connection settings:
  - Verify REDIS_HOST, REDIS_PORT, REDIS_PASSWORD in environment variables
  - Ensure REDIS_SSL=true for Azure Redis
  - Port should be 6380 (not 6379) for TLS
- Firewall rules:
  - Go to Redis → Firewall
  - Ensure Container Apps subnet is allowed
  - Check if private endpoint is configured correctly
- Connection limits:
  - Check connected clients
  - If near limit, scale Redis to higher tier
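The TLS and port pairing is easy to get wrong, so it can help to check it before digging into firewall rules. The sketch below assumes the `REDIS_*` variables named above are available in the shell and that `redis-cli` was built with TLS support; both function names are illustrative.

```shell
# Illustrative helper: validate the TLS/port pairing described above.
# Arguments: the values of REDIS_PORT and REDIS_SSL.
redis_settings_ok() {
  local port="$1" ssl="$2"
  if [ "$ssl" != "true" ]; then
    echo "REDIS_SSL should be true for Azure Redis"
    return 1
  fi
  if [ "$port" != "6380" ]; then
    echo "REDIS_PORT should be 6380 for TLS, got $port"
    return 1
  fi
  echo "Redis TLS settings look consistent"
}

# Print client statistics, including connected_clients.
# The password is passed via REDISCLI_AUTH to keep it off the command line.
show_connected_clients() {
  REDISCLI_AUTH="$REDIS_PASSWORD" redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" --tls info clients
}

# Usage:
#   redis_settings_ok "$REDIS_PORT" "$REDIS_SSL" && show_connected_clients
```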
Issue: Slow Performance
Symptoms: Pages load slowly (> 5 seconds), timeouts
Diagnosis:
# Check Application Insights for slow requests
# Go to Azure Portal → Application Insights → Performance
# Sort by duration to find slow requests
Solutions:
- High CPU usage:
  - Check Container App metrics (CPU usage)
  - If consistently > 70%, scale out (add more replicas) or scale up (increase CPU/memory)
  - Auto-scaling should handle this automatically
- Slow database queries:
  - Find slow queries in PostgreSQL
  - Add indexes to frequently queried columns
  - Optimize N+1 queries
- Redis memory issues:
  - Check Redis memory usage in metrics
  - If > 80%, scale to higher tier or implement cache eviction
- Network latency:
  - Verify Container Apps, database, and Redis are in same Azure region
  - Check if private endpoints are configured (reduces latency)
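One way to look for slow queries is to list statements that are currently running longer than a threshold. This is only a sketch: it uses the standard `pg_stat_activity` view, so it shows queries in flight rather than historical statistics, and the function name `long_running_sql` and the 5-second default are assumptions chosen to match the symptom above.

```shell
# Illustrative helper: print SQL that lists statements running longer
# than the given number of seconds (default 5).
long_running_sql() {
  local secs="${1:-5}"
  # Accept only whole numbers so the value is safe to embed in SQL.
  case "$secs" in
    *[!0-9]*) echo "seconds must be a whole number" >&2; return 1 ;;
  esac
  printf "SELECT pid, now() - query_start AS duration, state, left(query, 80) AS query FROM pg_stat_activity WHERE state <> 'idle' AND now() - query_start > interval '%s seconds' ORDER BY duration DESC;\n" "$secs"
}

# Usage (requires psql and network access to the database):
#   psql "$DATABASE_URL" --command "$(long_running_sql 5)"
```

For trends over time, Application Insights or a statistics extension is a better source than a point-in-time snapshot.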
Issue: Storage Errors (Blob Storage)
Symptoms: Cannot upload templates, presentations don't download
Solutions:
- Authentication errors:
  - Verify AZURE_STORAGE_ACCOUNT_NAME and AZURE_STORAGE_ACCOUNT_KEY
  - Regenerate access key if needed (will require updating environment variable)
- Firewall blocking access:
  - Go to Storage Account → Networking
  - Add Container Apps subnet to allowed networks
  - Or enable "Allow Azure services"
- Container not found:
  - Verify containers exist: presentations, templates, documents, static
  - Create missing containers via Azure Portal or CLI
- Quota exceeded:
  - Check storage account capacity
  - Implement lifecycle policies to delete old files
  - Clean up large files manually
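The container check can be scripted by comparing the containers that exist with the four that S5 expects. This is a minimal sketch, assuming the storage variables named above are set in the shell; `missing_containers` is an illustrative helper.

```shell
REQUIRED_CONTAINERS="presentations templates documents static"

# Illustrative helper: given the names of existing containers (separated
# by spaces, tabs, or newlines), print the required ones that are missing.
missing_containers() {
  local existing name
  existing=" $(printf '%s' "$1" | tr '\n\t' '  ') "
  for name in $REQUIRED_CONTAINERS; do
    case "$existing" in
      *" $name "*) ;;        # container is present
      *) echo "$name" ;;     # container is missing
    esac
  done
}

# Usage (requires the Azure CLI and access to the storage account):
#   existing=$(az storage container list \
#     --account-name "$AZURE_STORAGE_ACCOUNT_NAME" \
#     --account-key "$AZURE_STORAGE_ACCOUNT_KEY" \
#     --query "[].name" --output tsv)
#   for c in $(missing_containers "$existing"); do
#     az storage container create --name "$c" \
#       --account-name "$AZURE_STORAGE_ACCOUNT_NAME" \
#       --account-key "$AZURE_STORAGE_ACCOUNT_KEY"
#   done
```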
Issue: Deployment Fails
Symptoms: GitHub Actions workflow fails, Container App not updated
Solutions:
- Check GitHub Actions logs:
  - Go to repository → Actions → Select failed workflow
  - Look for specific error message
- Common deployment errors:
  - Docker build fails: Check Dockerfile syntax, verify base image exists
  - Azure login fails: Verify AZURE_CREDENTIALS secret is valid
  - Image push fails: Check Azure Container Registry credentials
  - Container App update fails: Check if resource group/app exists
- Core package download fails:
  - Verify GH_PAT_SLIDEFACTORY_CORE secret has access to private repo
  - Check if slidefactory-core v1.0.8 release exists
  - Verify wheel file is attached to release
Diagnostic Commands
Check All Services
# Web app status
az containerapp show \
--name slidefactory-web-prod \
--resource-group rg-slidefactory-prod \
--query "properties.runningStatus"
# Worker status
az containerapp show \
--name slidefactory-worker-prod \
--resource-group rg-slidefactory-prod \
--query "properties.runningStatus"
# Database status
az postgres flexible-server show \
--name postgres-prod \
--resource-group rg-slidefactory-prod \
--query "state"
# Redis status
az redis show \
--name redis-prod \
--resource-group rg-slidefactory-prod \
--query "provisioningState"
View Recent Logs
# Last 50 lines from web app
az containerapp logs show \
--name slidefactory-web-prod \
--resource-group rg-slidefactory-prod \
--tail 50
# Filter for errors only
az containerapp logs show \
--name slidefactory-web-prod \
--resource-group rg-slidefactory-prod \
--tail 100 | grep -E "ERROR|CRITICAL"
Database Connection Test
# From local machine (if firewall allows)
psql "${DATABASE_URL}" -c "SELECT version();"
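# From Container App (${DATABASE_URL} is expanded by your local shell, so it must be set locally; psql must be available in the container image)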
az containerapp exec \
--name slidefactory-web-prod \
--resource-group rg-slidefactory-prod \
--command "psql ${DATABASE_URL} -c 'SELECT 1;'"
Escalation
When to Escalate
Escalate to Azure support if:
- Azure service is down (check Azure Status)
- Issue persists after trying all solutions
- Data loss or corruption suspected
- Security incident detected
Information to Provide
When escalating, include:
- Timestamp when issue started
- Affected environment (preview/production)
- Error messages from logs
- Steps already tried
- Impact on users
Support Contacts
- Azure Support: Create support ticket in Azure Portal
- S5 Repository Owner: Create GitHub issue with urgent label
- On-Call Engineer: Use PagerDuty (production issues only)