# ADR-0010: Sync Reliability — Prevent Cascading Failures in Multi-Tenant Data Sync
## Context and Problem Statement
The Azure Governance Platform syncs data from multiple Azure tenants on a
scheduled basis (costs, compliance, resources, identity, DMARC). During
production operations several failure modes were observed:
1. **Column overflow cascade:** Azure Policy names exceeding the
   `policy_name` `VARCHAR(255)` column caused a `DataError`. Because all
   tenants shared a single SQLAlchemy session, the poisoned session killed
   sync for *every remaining tenant* in the batch.
2. **Ghost jobs:** If the sync process was OOM-killed or crashed mid-run,
   jobs remained stuck in `running` status forever, blocking subsequent
   scheduler invocations.
3. **Silent cold start:** `IntervalTrigger` scheduler jobs without an
   explicit `next_run_time` don't fire until their first interval elapses
   (hours later), leaving dashboards empty after a fresh deployment.
4. **Dead code confusion:** A placeholder `SyncService` stub in
   `app/api/services/sync_service.py` returned mock data, causing
   confusion about which code path was actually used.
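Failure mode 3 follows from APScheduler's `IntervalTrigger` semantics: without an explicit first-run time, the job waits a full interval before firing. A minimal sketch of the explicit-offset workaround (the helper name and 15-second spacing are illustrative, not the project's actual code):

```python
from datetime import datetime, timedelta

def staggered_start_times(job_names, offset_seconds=15, now=None):
    """Return an explicit first-run time for each job, spaced sequentially.

    Passing these as next_run_time to scheduler.add_job makes
    IntervalTrigger jobs fire within seconds of startup instead of
    waiting a full interval (hours) before the first run.
    """
    now = now or datetime.now()
    return {
        name: now + timedelta(seconds=offset_seconds * i)
        for i, name in enumerate(job_names, start=1)
    }

# Illustrative APScheduler usage (SYNC_JOBS is a hypothetical mapping):
# for name, first_run in staggered_start_times(SYNC_JOBS).items():
#     scheduler.add_job(SYNC_JOBS[name], IntervalTrigger(hours=6),
#                       next_run_time=first_run, id=name)
```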
How do we make multi-tenant data sync resilient to single-tenant failures
without requiring a complete architectural rewrite?
## Decision Drivers
- Blast radius: A single tenant's bad data must not affect other tenants
- Observability: Truncation and failures must be logged for audit trail
- Startup latency: Dashboards must show data within minutes of deployment
- Simplicity: Prefer targeted fixes over major refactors (YAGNI)
- Testability: Fixes must be enforceable via automated fitness functions
## Considered Options

1. Per-tenant session isolation with safe truncation (targeted fixes)
2. Full async task queue (Celery / Azure Service Bus)
3. Separate worker process per tenant
## Decision Outcome
Chosen option: Per-tenant session isolation with safe truncation, because
it addresses all observed failure modes with minimal architectural change and
is fully enforceable via fitness functions.
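The core of the chosen option can be sketched as a loop that opens a fresh session scope per tenant, so one tenant's `DataError` rolls back only its own session. This is a minimal illustration, not the project's actual sync code; `session_scope` stands in for the project's `get_db_context()` and `sync_one` for a per-tenant sync function:

```python
def sync_all_tenants(tenants, sync_one, session_scope):
    """Sync each tenant inside its own database session scope.

    session_scope is a context-manager factory (like get_db_context):
    a failure inside one tenant's block rolls back only that tenant's
    session, and the loop continues with the next tenant.
    """
    failures = {}
    for tenant in tenants:
        try:
            with session_scope() as session:
                sync_one(session, tenant)
        except Exception as exc:  # noqa: BLE001 - logged in real code
            failures[tenant] = exc  # isolate, record, keep going
    return failures
```

The key design choice is that the `try`/`except` wraps the whole session scope: a poisoned session is discarded with its tenant rather than reused for the rest of the batch.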
## Implementation Details
| Fix | Description | Files |
|---|---|---|
| FF-1: Widen policy_name | `String(255)` → `String(1000)` | `alembic/versions/009_widen_policy_name.py`, `app/models/compliance.py` |
| FF-2: safe_truncate | Audit-logged truncation for oversized fields | `app/core/sync/utils.py`, used in `app/core/sync/compliance.py` |
| FF-3: Per-tenant sessions | Each tenant gets a fresh `get_db_context()` session | All 5 sync modules: `compliance`, `costs`, `resources`, `identity`, `dmarc` |
| FF-4: Staggered startup | `next_run_time` + `timedelta` offsets on all `IntervalTrigger` jobs | `app/core/scheduler.py` |
| FF-5: Remove dead stub | Delete placeholder `SyncService` | `app/api/services/sync_service.py` (deleted) |
| FF-6: Migration guard | Alembic migration 009 must exist and reference `policy_name` | `alembic/versions/009_widen_policy_name.py` |
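FF-2's audit-logged truncation could look roughly like the following. The signature is assumed, not copied from `app/core/sync/utils.py`; the point is that no field is ever silently shortened:

```python
import logging

logger = logging.getLogger("sync.utils")

def safe_truncate(value, max_length, field_name, context=""):
    """Truncate value to max_length, emitting a structured audit log.

    Returns the value unchanged when it fits (or is None); otherwise
    logs a warning with the field name, original length, and caller
    context before truncating, so every data modification is traceable.
    """
    if value is None or len(value) <= max_length:
        return value
    logger.warning(
        "truncating oversized field",
        extra={
            "field": field_name,
            "original_length": len(value),
            "max_length": max_length,
            "context": context,
        },
    )
    return value[:max_length]
```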
## Consequences
- Good: Single-tenant failures are now isolated — one tenant's bad data
cannot cascade to kill other tenants' sync
- Good: Oversized fields are truncated with structured warning logs
(satisfies STRIDE T-1 tampering and R-1 repudiation requirements)
- Good: Dashboards show data within 2 minutes of deployment (staggered
startup with 15-second offsets)
- Good: Ghost jobs are auto-cleaned by `cleanup_ghost_jobs()` (30-minute
  threshold)
- Neutral: Slightly more database connections during sync (one per
tenant instead of one shared), acceptable for current scale (< 10 tenants)
- Bad: If we grow to 50+ tenants, the per-tenant session model may need
batching — but this is a good problem to have (see scaling path below)
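The ghost-job cleanup can be sketched as a staleness scan over job records. The record shape (`status`, `started_at` keys) is assumed for illustration; the real `cleanup_ghost_jobs()` in `app/api/services/monitoring_service.py` would also flip the stale jobs out of `running`:

```python
from datetime import datetime, timedelta

GHOST_THRESHOLD = timedelta(minutes=30)

def find_ghost_jobs(jobs, now=None):
    """Return jobs stuck in 'running' for longer than GHOST_THRESHOLD.

    A job whose process was OOM-killed never updates its status, so
    any 'running' job older than the threshold is treated as a ghost
    and eligible for cleanup.
    """
    now = now or datetime.utcnow()
    return [
        job for job in jobs
        if job["status"] == "running"
        and now - job["started_at"] > GHOST_THRESHOLD
    ]
```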
## Confirmation

All six fixes are enforced by architectural fitness functions in
`tests/architecture/test_sync_data_integrity.py`:
`uv run pytest tests/architecture/test_sync_data_integrity.py -v`
These tests verify structural properties of the codebase (column widths, AST
patterns, file existence) and will fail immediately if any fix regresses.
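A fitness function for FF-1 might check the declared column width directly in the model source. This is a sketch in the spirit of the real tests, not their actual contents; the helper names and regex are assumptions, and in the real test the source would come from reading `app/models/compliance.py`:

```python
import re

def declared_width(model_source: str, column: str):
    """Extract the String(n) width declared for a column, or None."""
    match = re.search(rf"{column}\s*=.*String\((\d+)\)", model_source)
    return int(match.group(1)) if match else None

def check_policy_name_width(model_source: str, minimum: int = 1000) -> bool:
    """FF-1 fitness check: policy_name must be at least `minimum` wide.

    Fails (returns False) if the column is missing or has been
    regressed back below the widened size.
    """
    width = declared_width(model_source, "policy_name")
    return width is not None and width >= minimum
```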
## STRIDE Security Analysis
| Threat Category | Risk Level | Mitigation |
|---|---|---|
| Spoofing | Low | No auth changes; sync uses existing UAMI credentials |
| Tampering | Medium → Low | `safe_truncate` logs all truncations with field name, original length, and context |
| Repudiation | Medium → Low | Structured logging provides audit trail for all data modifications |
| Information Disclosure | Low | No new data exposure; truncation only reduces data |
| Denial of Service | High → Low | Session isolation prevents cascade; ghost job cleanup prevents stuck state |
| Elevation of Privilege | Low | No privilege changes; sync operates with existing service identity |
**Overall Security Posture:** Significantly improved. The primary risk (DoS
via cascading session failure) is eliminated.
## Scaling Path
| Scale | Tenants | Strategy |
|---|---|---|
| Current (Phase 1) | < 10 | Per-tenant sessions, sequential sync |
| Phase 2 | 10–50 | Batched sessions (5 tenants per batch), `asyncio.gather` |
| Phase 3 | 50+ | Azure Service Bus task queue, dedicated sync worker |
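The Phase 2 strategy could be sketched as below. The function and parameter names are hypothetical; the important properties are the batch size cap and `return_exceptions=True`, which preserves per-tenant failure isolation inside a concurrent batch:

```python
import asyncio

async def sync_in_batches(tenants, sync_tenant, batch_size=5):
    """Phase 2 sketch: sync tenants concurrently in capped batches.

    Each tenant still gets its own session inside sync_tenant;
    return_exceptions=True means one tenant's failure is captured as
    a result instead of cancelling the rest of its batch.
    """
    results = {}
    for i in range(0, len(tenants), batch_size):
        batch = tenants[i:i + batch_size]
        outcomes = await asyncio.gather(
            *(sync_tenant(t) for t in batch),
            return_exceptions=True,
        )
        results.update(dict(zip(batch, outcomes)))
    return results
```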
## More Information

- Fitness functions: `tests/architecture/test_sync_data_integrity.py`
- Sync utilities: `app/core/sync/utils.py`
- Scheduler with staggered startup: `app/core/scheduler.py`
- Ghost job cleanup: `app/api/services/monitoring_service.py`
Template Version: MADR 4.0 (September 2024) with STRIDE Security Analysis
Last Updated: 2025-05-25
Maintained By: Code Puppy 🐶 (retroactive documentation)