Quick Reference
| Resource | URL |
|---|---|
| Production | https://app-governance-prod.azurewebsites.net |
| Health Check | https://app-governance-prod.azurewebsites.net/health |
| API Docs | /api/swagger/ (public); /docs on production requires authentication |
| Azure Portal | https://portal.azure.com |
Daily Operations (5 minutes)
Morning Health Check
Run the automated check:
./scripts/daily-ops-check.sh
Manual verification:
# Health endpoint
curl -s https://app-governance-prod.azurewebsites.net/health | jq .
# Response time in seconds, sampled 3 times (each sample should be <1s)
for i in {1..3}; do
curl -s -o /dev/null -w "%{time_total}\n" \
https://app-governance-prod.azurewebsites.net/health
done
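The manual timing loop above can be turned into a pass/fail check. The following is a minimal sketch, assuming the 1-second budget from the comment above; `check_latency` and `sample_health` are hypothetical helper names, not part of `daily-ops-check.sh`.

```shell
# Sketch: flag health-check samples that exceed a latency budget.
HEALTH_URL="https://app-governance-prod.azurewebsites.net/health"

# check_latency SECONDS LIMIT
# Returns 0 when SECONDS (may be fractional) is strictly below LIMIT.
check_latency() {
  awk -v t="$1" -v limit="$2" 'BEGIN { exit !(t + 0 < limit + 0) }'
}

# sample_health [SAMPLES] [LIMIT]
# Prints OK/SLOW per sample; returns non-zero if any sample is slow or fails.
sample_health() {
  local samples limit rc t i
  samples="${1:-3}"
  limit="${2:-1}"
  rc=0
  i=1
  while [ "$i" -le "$samples" ]; do
    if t=$(curl -s -o /dev/null -w '%{time_total}' "$HEALTH_URL"); then
      if check_latency "$t" "$limit"; then
        echo "OK   ${t}s"
      else
        echo "SLOW ${t}s"
        rc=1
      fi
    else
      echo "FAIL request error"
      rc=1
    fi
    i=$((i + 1))
  done
  return "$rc"
}

# Usage: sample_health 3 1 || echo "health endpoint is slow or unreachable"
```

Using `awk` for the comparison avoids the integer-only arithmetic of the shell, since `curl` reports fractional seconds.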
Alert Review
In the Azure Portal, open Monitor → Alerts.
Verify that all 4 alert rules are enabled:
- ✅ Server Errors - Critical
- ✅ High Response Time - Warning
- ✅ Availability Drop - Critical
- ✅ Business Logic Errors - Critical
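The same verification can be sketched from the command line. This example assumes the four rules are metric alerts in `rg-governance-production`; if some are log (scheduled query) alerts they would need a different listing command. `check_alert_rules` is a hypothetical helper that reads tab-separated `name` and `enabled` columns.

```shell
# check_alert_rules [EXPECTED_COUNT]
# Reads "name<TAB>enabled" lines on stdin. Prints each disabled rule and
# returns 1 if any rule is disabled or fewer rules than expected are seen.
check_alert_rules() {
  awk -F '\t' -v expected="${1:-4}" '
    NF >= 2 {
      seen++
      # Compare case-insensitively so both "true" and "True" are accepted.
      if (tolower($2) != "true") { print "DISABLED: " $1; bad++ }
    }
    END {
      if (seen < expected) {
        print "MISSING: expected " expected ", saw " (seen + 0)
        bad++
      }
      exit (bad > 0)
    }'
}

# Usage (assumes metric alerts in this resource group):
# az monitor metrics alert list \
#   --resource-group rg-governance-production \
#   --query '[].[name, enabled]' -o tsv | check_alert_rules 4
```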
Weekly Operations (15 minutes)
Monday Morning Routine
./scripts/weekly-ops-review.sh
Metrics Review
| Metric | Target | Check Location |
|---|---|---|
| Availability | 99.9% | App Insights → Availability |
| Response Time | <500ms | App Insights → Performance |
| Error Rate | <1% | App Insights → Failures |
| Cost | <$15/mo | Portal → Cost Management |
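To make the availability target concrete, it helps to convert it into a downtime budget. The sketch below is plain arithmetic on the 99.9% target; `downtime_budget_minutes` is a hypothetical helper name.

```shell
# downtime_budget_minutes TARGET_PERCENT DAYS
# Prints the allowed downtime in minutes for the given availability target.
downtime_budget_minutes() {
  awk -v target="$1" -v days="$2" \
    'BEGIN { printf "%.1f\n", (100 - target) / 100 * days * 24 * 60 }'
}

downtime_budget_minutes 99.9 30   # prints 43.2 (30-day month)
downtime_budget_minutes 99.9 7    # prints 10.1 (one week)
```

A single outage longer than about 10 minutes therefore consumes the whole weekly budget at 99.9%.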
Deployment Procedures
Blue-Green Deployment
# 1. Deploy to staging
make deploy-staging
# 2. Validate staging
./scripts/verify-and-test-deployment.sh --environment staging
# 3. Swap to production
az webapp deployment slot swap \
--name app-governance-prod \
--resource-group rg-governance-production \
--slot staging \
--target-slot production
# 4. Verify production
./scripts/verify-and-test-deployment.sh --environment production
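The four steps above can be chained so that a failed step stops the deployment before the swap. This is a sketch, not the project's actual tooling: `run` and `deploy_blue_green` are hypothetical names, and the `DRY_RUN` switch is an assumption added so the sequence can be previewed without touching Azure.

```shell
# run CMD [ARGS...]
# Prints the command when DRY_RUN=1, otherwise executes it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Runs the blue-green steps in order; each step runs only if the
# previous one succeeded.
deploy_blue_green() {
  run make deploy-staging &&
  run ./scripts/verify-and-test-deployment.sh --environment staging &&
  run az webapp deployment slot swap \
    --name app-governance-prod \
    --resource-group rg-governance-production \
    --slot staging \
    --target-slot production &&
  run ./scripts/verify-and-test-deployment.sh --environment production
}

# Usage:
#   DRY_RUN=1 deploy_blue_green   # preview the commands
#   deploy_blue_green             # execute for real
```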
Rollback
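# Swap back (if within 5 minutes)
# A swap is symmetric: re-running the same swap restores the previous build.
# --slot names the non-production slot; --target-slot defaults to production.
az webapp deployment slot swap \
--name app-governance-prod \
--resource-group rg-governance-production \
--slot staging \
--target-slot production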
Troubleshooting
Application Down (503/500)
- Check App Service status in Portal
- Review App Insights exceptions
- If unresolved within 5 minutes, execute the rollback described under Rollback
# Emergency restart
az webapp restart \
--name app-governance-prod \
--resource-group rg-governance-production
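After a restart, the 5-minute decision window can be enforced with a bounded polling loop. The sketch below assumes the health endpoint from the Quick Reference; `wait_healthy` and `probe_health` are hypothetical helpers.

```shell
# wait_healthy ATTEMPTS INTERVAL_SECONDS CMD [ARGS...]
# Runs CMD until it succeeds. Returns 0 on the first success,
# or 1 once ATTEMPTS tries have all failed.
wait_healthy() {
  local attempts interval n
  attempts="$1"
  interval="$2"
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$n" -lt "$attempts" ]; then
      sleep "$interval"
    fi
    n=$((n + 1))
  done
  return 1
}

# probe_health succeeds only when the health endpoint returns a
# non-error HTTP status within 5 seconds.
probe_health() {
  curl -fsS -o /dev/null --max-time 5 \
    https://app-governance-prod.azurewebsites.net/health
}

# Usage: 10 probes, 30 s apart, is roughly 5 minutes.
#   wait_healthy 10 30 probe_health || echo "still unhealthy: roll back"
```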
High Response Time (>1s)
- Check App Insights → Performance
- Review SQL Query Store
- Check App Service Plan CPU/Memory
Database Connection Failures
# Check SQL status
az sql db show \
--name governance \
--server sql-governance-prod \
--resource-group rg-governance-production
# Check firewall rules
az sql server firewall-rule list \
--server sql-governance-prod \
--resource-group rg-governance-production
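Each firewall rule is a start/end IP range, so the list is only useful once it is compared against the client's address. The sketch below assumes the failing client's outbound IP is already known; `ip_in_range` is a hypothetical helper.

```shell
# ip_in_range IP START END
# Returns 0 when the IPv4 address IP lies within [START, END] inclusive.
ip_in_range() {
  awk -v ip="$1" -v lo="$2" -v hi="$3" '
    # Convert a dotted quad to a single number for comparison.
    function to_int(addr,   p) {
      split(addr, p, ".")
      return ((p[1] * 256 + p[2]) * 256 + p[3]) * 256 + p[4]
    }
    BEGIN {
      v = to_int(ip)
      exit !(v >= to_int(lo) && v <= to_int(hi))
    }'
}

# Usage (CLIENT_IP is a placeholder for the address being checked):
# az sql server firewall-rule list \
#   --server sql-governance-prod \
#   --resource-group rg-governance-production \
#   --query '[].[name, startIpAddress, endIpAddress]' -o tsv |
# while IFS="$(printf '\t')" read -r name lo hi; do
#   ip_in_range "$CLIENT_IP" "$lo" "$hi" && echo "allowed by rule: $name"
# done
```

A rule whose start and end addresses are both 0.0.0.0 is the special "allow Azure services" rule and will not match a concrete client IP in this check.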
Monitoring & Alerting
Alert Response Matrix
| Alert | Severity | Response | Action |
|---|---|---|---|
| Server Errors | Critical | Immediate | Investigate |
| Availability Drop | Critical | Immediate | Rollback |
| High Response Time | Warning | 30 min | Tune thresholds |
| Business Logic Errors | Critical | Immediate | Check deploy |
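If alert handling is scripted, the matrix above can be encoded as a lookup. This is a sketch only; `alert_response` is a hypothetical helper, and the alert names are assumed to arrive exactly as written in the table.

```shell
# alert_response ALERT_NAME
# Prints "severity|response|action" for a known alert, or returns 1.
alert_response() {
  case "$1" in
    "Server Errors")         echo "Critical|Immediate|Investigate" ;;
    "Availability Drop")     echo "Critical|Immediate|Rollback" ;;
    "High Response Time")    echo "Warning|30 min|Tune thresholds" ;;
    "Business Logic Errors") echo "Critical|Immediate|Check deploy" ;;
    *)
      echo "unknown alert: $1" >&2
      return 1
      ;;
  esac
}

# Usage:
#   IFS='|' read -r severity response action <<EOF
#   $(alert_response "Availability Drop")
#   EOF
```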
Escalation
Severity 1: Production Down
- Page on-call engineer
- Attempt automatic recovery
- Escalate if not resolved in 15 min
- Post-mortem within 24 hours
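The 15-minute escalation rule can be checked mechanically from the incident start time. The sketch below works on epoch seconds; `minutes_open` and `should_escalate` are hypothetical helpers, and how the start time is recorded is left as an assumption.

```shell
# minutes_open START_EPOCH NOW_EPOCH
# Prints whole minutes elapsed since the incident started.
minutes_open() {
  echo $(( ($2 - $1) / 60 ))
}

# should_escalate START_EPOCH NOW_EPOCH [LIMIT_MINUTES]
# Returns 0 once the incident has been open for LIMIT_MINUTES (default 15).
should_escalate() {
  [ "$(minutes_open "$1" "$2")" -ge "${3:-15}" ]
}

# Usage (incident_start is recorded when the page fires):
#   should_escalate "$incident_start" "$(date +%s)" && echo "escalate now"
```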
SLAs
| Metric | Target | Current |
|---|---|---|
| Availability | 99.9% | 99.9%+ ✅ |
| Response Time (p95) | <500ms | ~130ms ✅ |
| Error Rate | <1% | <0.1% ✅ |
| Recovery Time | <15 min | N/A ✅ |
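The error-rate figure is failed requests divided by total requests. As a sketch of that arithmetic against the <1% target, with hypothetical helper names and request counts assumed to come from App Insights:

```shell
# error_rate_pct FAILED TOTAL
# Prints the error rate as a percentage with two decimals.
error_rate_pct() {
  awk -v failed="$1" -v total="$2" 'BEGIN {
    if (total <= 0) print "0.00"
    else printf "%.2f\n", failed / total * 100
  }'
}

# within_error_budget FAILED TOTAL [MAX_PERCENT]
# Returns 0 when the error rate is strictly below MAX_PERCENT (default 1).
within_error_budget() {
  awk -v r="$(error_rate_pct "$1" "$2")" -v max="${3:-1}" \
    'BEGIN { exit !(r + 0 < max + 0) }'
}

error_rate_pct 7 10000   # prints 0.07
```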
Operations Runbook v1.8.1 | Last Updated: March 31, 2026