ADR-0010: Sync Reliability — Prevent Cascading Failures in Multi-Tenant Data Sync

Context and Problem Statement

The Azure Governance Platform syncs data from multiple Azure tenants on a
scheduled basis (costs, compliance, resources, identity, DMARC). During
production operations, several failure modes were observed:

1. Column overflow cascade: Azure Policy names exceeding the
   policy_name VARCHAR(255) column caused a DataError. Because all
   tenants shared a single SQLAlchemy session, the poisoned session killed
   sync for *every remaining tenant* in the batch.

2. Ghost jobs: If the sync process was OOM-killed or crashed mid-run,
   jobs remained stuck in running status forever, blocking subsequent
   scheduler invocations.

3. Silent cold start: IntervalTrigger scheduler jobs without an
   explicit next_run_time don't fire until their first interval elapses
   (hours later), leaving dashboards empty after a fresh deployment.

4. Dead code confusion: A placeholder SyncService stub in
   app/api/services/sync_service.py returned mock data, causing
   confusion about which code path was actually used.

How do we make multi-tenant data sync resilient to single-tenant failures
without requiring a complete architectural rewrite?

Decision Drivers

  • Blast radius: A single tenant's bad data must not affect other tenants
  • Observability: Truncation and failures must be logged for audit trail
  • Startup latency: Dashboards must show data within minutes of deployment
  • Simplicity: Prefer targeted fixes over major refactors (YAGNI)
  • Testability: Fixes must be enforceable via automated fitness functions

Considered Options

1. Per-tenant session isolation with safe truncation
2. Full async task queue (Celery / Azure Service Bus)
3. Separate worker process per tenant

Decision Outcome

Chosen option: Per-tenant session isolation with safe truncation, because
it addresses all observed failure modes with minimal architectural change and
is fully enforceable via fitness functions.
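
The isolation pattern behind the chosen option can be sketched as follows. This is a minimal illustration, not the project's code: `get_db_context()` here is a stand-in built on the standard-library `sqlite3` module (the real helper presumably yields a SQLAlchemy session), and `sync_all_tenants`, `demo_sync`, and the tenant names are hypothetical.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def get_db_context(db_path=":memory:"):
    """Stand-in for the project's helper: one fresh connection per call."""
    conn = sqlite3.connect(db_path)
    try:
        yield conn
        conn.commit()
    except Exception:
        # A failure only rolls back and discards this tenant's session.
        conn.rollback()
        raise
    finally:
        conn.close()


def sync_all_tenants(tenant_ids, sync_one):
    """Run sync_one(session, tenant_id) for each tenant in its own session."""
    results = {}
    for tenant_id in tenant_ids:
        try:
            # The session is opened inside the loop, so it is never shared.
            with get_db_context() as session:
                sync_one(session, tenant_id)
            results[tenant_id] = "ok"
        except Exception as exc:
            # Record the failure and continue with the next tenant.
            results[tenant_id] = f"failed: {exc}"
    return results


def demo_sync(session, tenant_id):
    """Hypothetical sync step that fails for one tenant."""
    if tenant_id == "tenant-b":
        raise ValueError("policy_name too long")
    session.execute("SELECT 1")


results = sync_all_tenants(["tenant-a", "tenant-b", "tenant-c"], demo_sync)
print(results)
```

In this sketch the failure of `tenant-b` is recorded in `results`, while `tenant-a` and `tenant-c` still complete, which is the blast-radius property the decision drivers ask for.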

Implementation Details

| Fix | Description | Files |
|---|---|---|
| FF-1: Widen policy_name | `String(255)` → `String(1000)` | `alembic/versions/009_widen_policy_name.py`, `app/models/compliance.py` |
| FF-2: safe_truncate | Audit-logged truncation for oversized fields | `app/core/sync/utils.py`, used in `app/core/sync/compliance.py` |
| FF-3: Per-tenant sessions | Each tenant gets a fresh `get_db_context()` session | All 5 sync modules: `compliance`, `costs`, `resources`, `identity`, `dmarc` |
| FF-4: Staggered startup | `next_run_time` + `timedelta` offsets on all `IntervalTrigger` jobs | `app/core/scheduler.py` |
| FF-5: Remove dead stub | Delete placeholder `SyncService` | `app/api/services/sync_service.py` (deleted) |
| FF-6: Migration guard | Alembic migration 009 must exist and reference `policy_name` | `alembic/versions/009_widen_policy_name.py` |
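
As an illustration of FF-2, an audit-logged truncation helper might look roughly like the sketch below. The signature, logger name, and message format are assumptions; the actual helper in `app/core/sync/utils.py` may differ.

```python
import logging

# Hypothetical logger name; the project may use a different one.
logger = logging.getLogger("sync.truncation")


def safe_truncate(value, max_length, field_name, context=None):
    """Return value cut to max_length, logging a warning if it was cut.

    None and values that already fit are returned unchanged.
    """
    if value is None or len(value) <= max_length:
        return value
    # Log field name, original length, and context for the audit trail.
    logger.warning(
        "Truncated field %s from %d to %d characters (context=%s)",
        field_name,
        len(value),
        max_length,
        context,
    )
    return value[:max_length]


long_name = "p" * 1200
stored = safe_truncate(long_name, 1000, "policy_name", context="tenant-a")
```

Truncating before the insert keeps an oversized value from raising a DataError in the first place, and the warning records that stored data differs from the source.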

Consequences

  • Good: Single-tenant failures are now isolated: one tenant's bad data
    cannot cascade to kill other tenants' sync
  • Good: Oversized fields are truncated with structured warning logs
    (satisfies STRIDE T-1 tampering and R-1 repudiation requirements)
  • Good: Dashboards show data within 2 minutes of deployment (staggered
    startup with 15-second offsets)
  • Good: Ghost jobs are auto-cleaned by cleanup_ghost_jobs() (30 min
    threshold)
  • Neutral: Slightly more database connections during sync (one per
    tenant instead of one shared), acceptable for current scale (< 10 tenants)
  • Bad: If we grow to 50+ tenants, the per-tenant session model may need
    batching, but this is a good problem to have (see scaling path below)
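
The staggered startup mentioned in the list above comes down to giving each job an explicit first run time. The sketch below computes those times with the standard library only; the job order, the choice to start the first job one offset after startup, and the function name are assumptions.

```python
from datetime import datetime, timedelta, timezone

# The five sync modules named in this ADR; the order is an assumption.
SYNC_JOBS = ["costs", "compliance", "resources", "identity", "dmarc"]


def staggered_start_times(job_ids, now, offset=timedelta(seconds=15)):
    """Map each job id to its first run time: now + 15 s, now + 30 s, ..."""
    return {
        job_id: now + offset * (index + 1)
        for index, job_id in enumerate(job_ids)
    }


startup = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
first_runs = staggered_start_times(SYNC_JOBS, startup)
for job_id, run_at in first_runs.items():
    print(job_id, run_at.isoformat())

# Assuming APScheduler 3.x, each value would be passed as the
# next_run_time argument of scheduler.add_job(...) alongside the
# job's IntervalTrigger.
```

Under these assumptions the last of the five jobs starts 75 seconds after startup, which fits within the 2-minute figure provided each sync completes quickly.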

Confirmation

All six fixes are enforced by architectural fitness functions in
tests/architecture/test_sync_data_integrity.py:


uv run pytest tests/architecture/test_sync_data_integrity.py -v

These tests verify structural properties of the codebase (column widths, AST
patterns, file existence) and will fail immediately if any fix regresses.
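
To give a sense of what an AST-pattern check can look like, the sketch below tests whether `get_db_context()` is called inside a `for` loop body, as per-tenant sessions require. It is an illustration under assumptions: the helper name and the sample sources are hypothetical, and the real tests may check different patterns.

```python
import ast


def calls_inside_for_loops(source: str, func_name: str) -> bool:
    """Return True if func_name(...) is called inside some for-loop body."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.For, ast.AsyncFor)):
            # Only inspect the loop body, not the iterable expression.
            for stmt in node.body:
                for inner in ast.walk(stmt):
                    if (
                        isinstance(inner, ast.Call)
                        and isinstance(inner.func, ast.Name)
                        and inner.func.id == func_name
                    ):
                        return True
    return False


# Session opened per tenant: the pattern the fitness function expects.
ISOLATED = """
def sync(tenants):
    for tenant in tenants:
        with get_db_context() as db:
            db.add(tenant)
"""

# One session shared by all tenants: the pattern that caused the cascade.
SHARED = """
def sync(tenants):
    with get_db_context() as db:
        for tenant in tenants:
            db.add(tenant)
"""

assert calls_inside_for_loops(ISOLATED, "get_db_context")
assert not calls_inside_for_loops(SHARED, "get_db_context")
```

A real test would read each sync module's source from disk and apply a check like this, so that moving the session back outside the loop fails the build.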

STRIDE Security Analysis

| Threat Category | Risk Level | Mitigation |
|---|---|---|
| Spoofing | Low | No auth changes; sync uses existing UAMI credentials |
| Tampering | Medium → Low | `safe_truncate` logs all truncations with field name, original length, and context |
| Repudiation | Medium → Low | Structured logging provides audit trail for all data modifications |
| Information Disclosure | Low | No new data exposure; truncation only reduces data |
| Denial of Service | High → Low | Session isolation prevents cascade; ghost job cleanup prevents stuck state |
| Elevation of Privilege | Low | No privilege changes; sync operates with existing service identity |

Overall Security Posture: Significantly improved. The primary risk (DoS
via cascading session failure) is eliminated.
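
The ghost job cleanup that mitigates the stuck-state risk can be pictured as a threshold check over job records. The sketch below is illustrative only: the `SyncJob` record, the status strings, and the function signature are assumptions, and the real `cleanup_ghost_jobs()` may work against the database directly.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class SyncJob:
    """Hypothetical job record; field names are assumptions."""
    job_id: str
    status: str
    started_at: datetime


# The 30-minute threshold stated in this ADR.
GHOST_THRESHOLD = timedelta(minutes=30)


def cleanup_ghost_jobs(jobs, now, threshold=GHOST_THRESHOLD):
    """Mark jobs stuck in 'running' longer than threshold as failed.

    Returns the ids of the jobs that were cleaned up.
    """
    cleaned = []
    for job in jobs:
        if job.status == "running" and now - job.started_at > threshold:
            job.status = "failed"
            cleaned.append(job.job_id)
    return cleaned


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
jobs = [
    SyncJob("stale", "running", now - timedelta(minutes=45)),
    SyncJob("fresh", "running", now - timedelta(minutes=5)),
    SyncJob("done", "completed", now - timedelta(hours=2)),
]
cleaned_ids = cleanup_ghost_jobs(jobs, now)
```

Only the job that has been running for 45 minutes is marked failed; the recent job and the completed one are left untouched, so a healthy long-running sync under the threshold is not interrupted.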

Scaling Path

| Scale | Tenants | Strategy |
|---|---|---|
| Current (Phase 1) | < 10 | Per-tenant sessions, sequential sync |
| Phase 2 | 10–50 | Batched sessions (5 tenants per batch), `asyncio.gather` |
| Phase 3 | 50+ | Azure Service Bus task queue, dedicated sync worker |
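
Phase 2 has not been built, so the following is only a sketch of how batches of five tenants could be run with `asyncio.gather` while keeping failures isolated. The function names and the demo coroutine are hypothetical.

```python
import asyncio


async def sync_in_batches(tenant_ids, sync_one, batch_size=5):
    """Run sync_one(tenant_id) concurrently, batch_size tenants at a time."""
    results = {}
    for start in range(0, len(tenant_ids), batch_size):
        batch = tenant_ids[start:start + batch_size]
        # return_exceptions=True keeps one tenant's failure from
        # cancelling the rest of the batch.
        outcomes = await asyncio.gather(
            *(sync_one(tenant_id) for tenant_id in batch),
            return_exceptions=True,
        )
        results.update(zip(batch, outcomes))
    return results


async def demo_sync(tenant_id):
    """Hypothetical per-tenant sync that fails for one tenant."""
    await asyncio.sleep(0)
    if tenant_id == "tenant-3":
        raise RuntimeError("bad data")
    return "ok"


tenants = [f"tenant-{i}" for i in range(7)]
results = asyncio.run(sync_in_batches(tenants, demo_sync))
```

Each entry in `results` is either the coroutine's return value or the exception it raised, so the caller can log failures per tenant without losing the successful ones.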

More Information

  • Fitness functions: tests/architecture/test_sync_data_integrity.py
  • Sync utilities: app/core/sync/utils.py
  • Scheduler with staggered startup: app/core/scheduler.py
  • Ghost job cleanup: app/api/services/monitoring_service.py

Template Version: MADR 4.0 (September 2024) with STRIDE Security Analysis

Last Updated: 2025-05-25

Maintained By: Code Puppy 🐶 (retroactive documentation)