ADR-0011: Granular RBAC with Permission Strings

Context and Problem Statement

The Azure Governance Platform currently uses a flat User.roles: list[str] model where "admin" is the only meaningful elevated role. All authorization checks follow the pattern "admin" in user.roles — effectively binary (admin or not-admin). Meanwhile, the UserTenant model has ad-hoc boolean flags (can_manage_resources, can_view_costs, can_manage_compliance) that duplicate authorization logic in the data layer.

As we onboard 5 MSP tenants (HTT, BCC, FN, TLL, DCE) with different staff accessing different modules, we need roles between "god mode admin" and "default user." A customer's finance analyst should see cost data but not trigger sync jobs. A tenant admin should manage their tenant's compliance rules but not create new tenants.

How do we add granular, module-level authorization without breaking the existing auth stack?

Decision Drivers

  • Backward compatibility (knock-out criterion): Existing "admin" in user.roles checks must keep working with zero changes
  • Tenant isolation preserved: Must layer on top of TenantAuthorization, not replace it
  • Minimal complexity: 4 roles × 5 tenants — this is not an enterprise ABAC engine
  • Azure AD alignment: Roles should map cleanly to Entra ID App Roles (JWT roles claim)
  • Testability: Permission sets must be unit-testable without database or HTTP calls
  • Incremental migration: Routes migrate one-at-a-time, not big-bang
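
Considered Options

1. Code-defined roles with permission strings: roles and permission sets defined as enums and frozen sets in application code

2. PyCasbin policy engine: externalized RBAC/ABAC policies with Casbin DSL

3. Database-driven role/permission tables: full DB schema for roles, permissions, and role-permission join table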

Decision Outcome

Chosen option: "Code-defined roles with permission strings", because it's the only option that requires zero new dependencies, zero database migrations, and can be shipped incrementally alongside existing require_roles() checks. For 4 predefined roles serving 5 tenants, code-defined permission sets are simpler, safer, and more testable than a policy engine or database-driven system.

Architecture: How It Layers


Request → Authentication (JWT/Azure AD) → unchanged
        → Tenant Isolation (TenantAuthorization) → unchanged  
        → Permission Check (NEW: require_permissions) → this ADR
        → Persona UI Gating (personas.py) → unchanged

The four authorization concerns remain orthogonal:

| Concern | Question Answered | Mechanism |
| --- | --- | --- |
| Authentication | Who are you? | JWT token validation |
| Tenant access | Which tenants' data can you see? | `TenantAuthorization` + `UserTenant` |
| Permissions | What actions can you perform? | Role → permission set resolution (this ADR) |
| Personas | Which UI sections do you see? | Entra ID groups → `personas.yaml` |

Predefined Roles

| Role | Slug | Description | Key Permissions |
| --- | --- | --- | --- |
| Admin | `admin` | Full system access. Wildcard `*`. | Everything |
| Tenant Admin | `tenant_admin` | Manages a tenant's config, users, compliance, and data. Cannot create tenants or access system settings. | `costs:manage`, `compliance:manage`, `resources:manage`, `identity:manage`, `users:manage`, `sync:trigger` |
| Analyst | `analyst` | Read and export data across accessible modules. Cannot modify configuration. | `costs:read`, `costs:export`, `compliance:read`, `resources:read`, `resources:export`, `identity:read`, `identity:export` |
| Viewer | `viewer` | Read-only dashboard access. No exports, no writes. | `dashboard:read`, `costs:read`, `compliance:read`, `resources:read`, `identity:read` |

Permission String Registry

Format: resource:action (OAuth2 scope convention).


dashboard:read
costs:read        costs:export       costs:manage
compliance:read   compliance:write   compliance:manage
resources:read    resources:export   resources:manage
identity:read     identity:export    identity:manage
audit_logs:read   audit_logs:export
sync:read         sync:trigger       sync:manage
tenants:read      tenants:manage
users:read        users:manage
system:health     system:admin
riverside:read    riverside:manage
dmarc:read        dmarc:manage
preflight:read    preflight:run
budgets:read      budgets:manage
recommendations:read
monitoring:read   monitoring:manage
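
The `resource:action` format lends itself to a simple pattern check. The sketch below assumes lowercase snake_case on both sides of the colon; `PERMISSION_PATTERN` and `is_valid_permission` are illustrative names, not taken from `permissions.py`.

```python
import re

# Assumed shape: lowercase snake_case resource, a colon, lowercase snake_case action.
PERMISSION_PATTERN = re.compile(r"[a-z][a-z_]*:[a-z][a-z_]*")


def is_valid_permission(value: str) -> bool:
    """Return True if value looks like 'resource:action'."""
    return PERMISSION_PATTERN.fullmatch(value) is not None


# Two registry entries and two malformed strings.
print(is_valid_permission("costs:read"))         # True
print(is_valid_permission("audit_logs:export"))  # True
print(is_valid_permission("costs"))              # False (no action part)
print(is_valid_permission("Costs:Read"))         # False (uppercase)
```

A check like this only guards the shape of a string. Membership in the registry itself would still need a lookup against the known permission set.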

Role → Permission Mapping (Containment Hierarchy)


Viewer ⊂ Analyst ⊂ TenantAdmin ⊂ Admin (wildcard)
  • Viewer gets all *:read permissions (no exports, no writes)
  • Analyst gets Viewer + all *:export permissions
  • TenantAdmin gets Analyst + all *:manage, *:write, *:trigger, *:run permissions (except system:admin, tenants:manage)
  • Admin gets wildcard * (all permissions, including system:admin and tenants:manage)
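
The containment rules above could be expressed by deriving each role's set from the registry, roughly as follows. This is a sketch on an abbreviated registry; `with_actions` and `ADMIN_ONLY` are hypothetical helpers, and leaving `system:health` to Admin only is an assumption, since the rules above do not place it.

```python
# Abbreviated registry for illustration; the full list is in the registry section.
REGISTRY = frozenset({
    "dashboard:read",
    "costs:read", "costs:export", "costs:manage",
    "compliance:read", "compliance:write", "compliance:manage",
    "sync:read", "sync:trigger", "sync:manage",
    "tenants:read", "tenants:manage",
    "preflight:read", "preflight:run",
    "system:health", "system:admin",
})


def with_actions(*actions: str) -> frozenset[str]:
    """Select registry entries whose action part is one of `actions`."""
    return frozenset(p for p in REGISTRY if p.split(":", 1)[1] in actions)


# Permissions reserved for Admin, per the exception in the TenantAdmin rule.
ADMIN_ONLY = frozenset({"system:admin", "tenants:manage"})

VIEWER = with_actions("read")
ANALYST = VIEWER | with_actions("export")
TENANT_ADMIN = (ANALYST | with_actions("manage", "write", "trigger", "run")) - ADMIN_ONLY
ADMIN = frozenset({"*"})  # wildcard; handled by the permission check, not by set containment

# The hierarchy is assertable as strict subset relations.
assert VIEWER < ANALYST < TENANT_ADMIN
assert TENANT_ADMIN.isdisjoint(ADMIN_ONLY)
```

Because Admin is a wildcard rather than an enumerated set, the last step of the chain (TenantAdmin ⊂ Admin) holds by the wildcard rule in the permission check, not by a subset comparison.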

Legacy Role Mapping

| Legacy Role | Maps To | Rationale |
| --- | --- | --- |
| `"admin"` | `admin` | Unchanged (wildcard) |
| `"operator"` | `tenant_admin` | Closest equivalent |
| `"reader"` | `viewer` | Closest equivalent |
| `"user"` | `viewer` | Default role |
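
A minimal sketch of how the legacy mapping might feed permission resolution. The `LEGACY_ROLE_MAP` and `resolve_user_permissions` names come from this ADR, but the function body and the abbreviated permission sets are assumptions made for illustration.

```python
# Mapping taken from the table above.
LEGACY_ROLE_MAP = {
    "admin": "admin",
    "operator": "tenant_admin",
    "reader": "viewer",
    "user": "viewer",
}

# Abbreviated permission sets, for illustration only.
ROLE_PERMISSIONS = {
    "admin": frozenset({"*"}),
    "tenant_admin": frozenset({"costs:read", "costs:export", "costs:manage", "sync:trigger"}),
    "analyst": frozenset({"costs:read", "costs:export"}),
    "viewer": frozenset({"costs:read"}),
}


def resolve_user_permissions(roles: list[str]) -> frozenset[str]:
    """Union the permission sets of all roles; unknown roles contribute nothing."""
    resolved: set[str] = set()
    for role in roles:
        slug = LEGACY_ROLE_MAP.get(role, role)  # new slugs pass through unchanged
        resolved |= ROLE_PERMISSIONS.get(slug, frozenset())  # fail closed on unknown slugs
    return frozenset(resolved)


print(sorted(resolve_user_permissions(["reader"])))     # ['costs:read']
print(sorted(resolve_user_permissions(["superuser"])))  # []
```

The `.get(..., frozenset())` fallback is what keeps an unexpected role value from granting anything.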

Key Implementation Components

app/core/permissions.py — Permission enum, Role enum, ROLE_PERMISSIONS frozen set mappings, LEGACY_ROLE_MAP, and resolution functions (resolve_user_permissions, has_permission).

app/core/rbac.py — FastAPI dependency require_permissions(["costs:read"]) that resolves permissions from user.roles and checks against required permissions. Validates permission strings at import time (fail-fast on typos).

User.permissions property — Computed property on the existing User model that resolves the full permission set from user.roles. Cached per request instance.
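
The computed `permissions` property and the wildcard-aware check could look roughly like this. The dataclass is a stand-in for the real `User` model, whose base class and fields are not shown in this ADR, and `functools.cached_property` is one assumed way to get per-instance caching.

```python
from dataclasses import dataclass, field
from functools import cached_property

# Abbreviated permission sets, for illustration only.
ROLE_PERMISSIONS = {
    "admin": frozenset({"*"}),
    "analyst": frozenset({"costs:read", "costs:export"}),
    "viewer": frozenset({"costs:read"}),
}


def has_permission(granted: frozenset[str], required: str) -> bool:
    """True if the wildcard or the exact permission string is granted."""
    return "*" in granted or required in granted


@dataclass
class User:  # stand-in for the existing User model
    id: str
    roles: list[str] = field(default_factory=list)

    @cached_property
    def permissions(self) -> frozenset[str]:
        """Resolved once per instance, then served from the instance dict."""
        resolved: set[str] = set()
        for role in self.roles:
            resolved |= ROLE_PERMISSIONS.get(role, frozenset())
        return frozenset(resolved)


analyst = User(id="u1", roles=["analyst"])
print(has_permission(analyst.permissions, "costs:export"))  # True
print(has_permission(analyst.permissions, "sync:trigger"))  # False
```

Since a `User` object is built per request, caching on the instance means the role lookup runs at most once per request without any shared cache to invalidate.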

Route Migration Pattern

Routes migrate incrementally from require_roles() to require_permissions():


# Before (still works — backward compatible)
@router.get("/costs")
async def get_costs(user: User = Depends(require_roles(["admin", "operator"]))):
    ...

# After (granular)
@router.get("/costs")  
async def get_costs(user: User = Depends(require_permissions(["costs:read"]))):
    ...

Both patterns coexist during migration. require_roles() is never removed — it stays as a convenience wrapper.
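
To illustrate the fail-fast behaviour, here is a framework-free sketch of a `require_permissions` factory. It assumes that validation happens when the factory is called (which, for a route decorator argument, is at import time) and uses a plain `PermissionDenied` exception as a stand-in for the HTTP 403 response; the real dependency in `rbac.py` would be wired through FastAPI's `Depends` and is not reproduced here.

```python
# Abbreviated registry, for illustration only.
KNOWN_PERMISSIONS = frozenset({"costs:read", "costs:export", "sync:trigger"})


class PermissionDenied(Exception):
    """Stand-in for the 403 response the real dependency would produce."""


def require_permissions(required: list[str]):
    """Build a checker for `required`; reject unknown strings immediately."""
    unknown = set(required) - KNOWN_PERMISSIONS
    if unknown:
        # Raised when the route module is imported, so typos never reach production traffic.
        raise ValueError(f"Unknown permission strings: {sorted(unknown)}")

    def checker(granted: frozenset[str]) -> None:
        if "*" in granted:  # admin wildcard satisfies any requirement
            return
        missing = set(required) - granted
        if missing:
            raise PermissionDenied(f"Missing permissions: {sorted(missing)}")

    return checker


check_costs = require_permissions(["costs:read"])
check_costs(frozenset({"costs:read"}))  # passes silently
check_costs(frozenset({"*"}))           # passes silently
```

Calling `require_permissions(["costs:raed"])` in this sketch raises `ValueError` before any request is handled, which is the property the migration relies on.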

Azure AD App Roles Integration

Define 4 App Roles in the Entra ID App Registration manifest:

| Display Name | Value | Allowed Member Types |
| --- | --- | --- |
| Admin | `admin` | Users/Groups |
| Tenant Admin | `tenant_admin` | Users/Groups |
| Analyst | `analyst` | Users/Groups |
| Viewer | `viewer` | Users/Groups |

The JWT roles claim from Entra ID maps directly to the Role enum — replacing the brittle keyword-matching in _map_groups_to_roles(). Group-based mapping remains as a fallback during transition.
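
One way the "prefer the `roles` claim, fall back to groups" logic might look is sketched below. `roles_from_claims` is a hypothetical helper, and `group_fallback` stands in for `_map_groups_to_roles()`, whose actual signature is not shown in this ADR.

```python
from collections.abc import Callable
from enum import Enum


class Role(str, Enum):
    ADMIN = "admin"
    TENANT_ADMIN = "tenant_admin"
    ANALYST = "analyst"
    VIEWER = "viewer"


def roles_from_claims(
    claims: dict,
    group_fallback: Callable[[list[str]], list[Role]],
) -> list[Role]:
    """Use the App Roles claim when present; otherwise map groups."""
    raw = claims.get("roles") or []
    if raw:
        roles = []
        for value in raw:
            try:
                roles.append(Role(value))
            except ValueError:
                continue  # unknown role values are dropped (fail closed)
        return roles
    # Assumed transition behaviour: no roles claim means group-based mapping.
    return group_fallback(claims.get("groups") or [])


def everyone_is_viewer(groups: list[str]) -> list[Role]:
    """Toy fallback used only for this example."""
    return [Role.VIEWER] if groups else []


print(roles_from_claims({"roles": ["analyst", "bogus"]}, everyone_is_viewer))
print(roles_from_claims({"groups": ["some-group-id"]}, everyone_is_viewer))
```

In this sketch a token that carries a `roles` claim never consults the group mapping, even if every role value in it is unknown. Whether that is the desired behaviour during the transition is a design choice for Phase 3.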

Migration Strategy

| Phase | Scope | Breaking Changes | Effort |
| --- | --- | --- | --- |
| 1: Foundation | Create `permissions.py`, `rbac.py`, tests. Add `permissions` property to `User`. | None | 6 hours |
| 2: Route migration | Replace `require_roles()` calls with `require_permissions()` one route file at a time. | None (both coexist) | 8 hours |
| 3: Entra ID | Define App Roles in manifest. Update `validate_token()` to prefer `roles` claim over group matching. | None (fallback preserved) | 4 hours |
| 4: Cleanup | Deprecate `UserTenant.can_*` boolean flags. Update `UserTenant.role` values to new role slugs. | DB migration (additive) | 4 hours |

Consequences

Good:

  • Least-privilege enforcement — viewers can't trigger syncs, analysts can't manage compliance
  • Zero new dependencies — pure Python, standard FastAPI patterns
  • Fully backward compatible — existing "admin" in user.roles checks unchanged
  • Testable — permission sets are frozen sets, role containment hierarchy is assertable
  • Azure AD native — App Roles map 1:1 to our Role enum

Bad:

  • Role changes require code deployment (acceptable for 4 predefined roles)
  • Two authorization patterns coexist during migration (require_roles + require_permissions)
  • No per-tenant role customization in Phase 1 (e.g., a user who is Analyst in tenant A but Viewer in tenant B); this requires Phase 4

Neutral:

  • Token size unchanged: roles stay in the JWT, permissions are resolved server-side
  • Persona system unaffected: personas gate UI visibility, permissions gate actions

Confirmation

  • tests/architecture/test_rbac_permissions.py — validates permission string format, role hierarchy containment, legacy mapping
  • tests/unit/test_permissions.py — unit tests for has_permission, resolve_user_permissions
  • Manual: verify require_roles(["admin"]) still works after Phase 1

STRIDE Security Analysis

| Threat Category | Risk Level | Mitigation |
| --- | --- | --- |
| Spoofing | Low | Permissions resolved server-side from authenticated `user.roles`. Cannot be forged: the role claim is signed in the JWT by Azure AD or the internal key. |
| Tampering | Low | Permission sets are `frozenset` in code, immutable at runtime. No database-stored permissions to tamper with. Role-permission mappings are version-controlled. |
| Repudiation | Medium | `require_permissions()` logs denied attempts with user ID, required permission, and granted permissions. Recommend adding audit log entries for write/manage actions in Phase 2. |
| Information Disclosure | Low | 403 response includes required permission name (acceptable, since it is not sensitive). Permission sets themselves are in source code, not exposed via API. |
| Denial of Service | Low | Permission resolution is a `set` lookup, O(1). No database queries, no external calls. No cache invalidation needed. |
| Elevation of Privilege | Medium | Wildcard `*` only assigned to `admin` role. Legacy role mapping is explicit (`LEGACY_ROLE_MAP`). Fail-closed: unknown roles get zero permissions. Risk: if `UserTenant.role` column contains unexpected values, they'll map to zero permissions (safe default). |
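
The Repudiation mitigation depends on denied attempts being logged with enough context to reconstruct what happened. A minimal sketch using the standard `logging` module follows; the logger name, the message layout, and the `log_denied` helper are assumptions, not the format used in `rbac.py`.

```python
import logging

# Hypothetical logger name for RBAC events.
logger = logging.getLogger("rbac")


def log_denied(user_id: str, required: list[str], granted: frozenset[str]) -> None:
    """Record a denied check with the three fields named in the mitigation."""
    logger.warning(
        "permission denied user_id=%s required=%s granted=%s",
        user_id,
        sorted(required),  # sorted for stable, diffable log lines
        sorted(granted),
    )


log_denied("u1", ["sync:trigger"], frozenset({"costs:read"}))
```

Sorting both collections keeps log lines deterministic, which makes them easier to search and to assert on in tests.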

Pros and Cons of the Options

Option 1: Code-Defined Roles with Permission Strings (Chosen)

  • Good, because zero new dependencies — pure Python, standard FastAPI dependency injection
  • Good, because fully backward compatible — require_roles() untouched
  • Good, because testable — frozen sets are trivially assertable in unit tests
  • Good, because permission strings validated at import time (fail-fast on typos)
  • Good, because maps directly to Entra ID App Roles without translation layer
  • Neutral, because role changes require code deployment (acceptable for 4 roles)
  • Bad, because no runtime role customization without deployment
  • Bad, because no per-tenant role overrides until Phase 4

Option 2: PyCasbin Policy Engine

  • Good, because supports complex policies (RBAC, ABAC, multi-tenancy) via declarative DSL
  • Good, because well-maintained open source (Apache 2.0)
  • Neutral, because supports policy hot-reload without deployment
  • Bad, because massive overkill for 4 roles × 5 tenants — learning curve unjustified
  • Bad, because adds dependency and operational complexity (policy files, adapters)
  • Bad, because team unfamiliar with Casbin DSL — higher onboarding cost
  • Bad, because doesn't integrate with Entra ID App Roles without custom adapter

Option 3: Database-Driven Role/Permission Tables

  • Good, because runtime-modifiable roles via admin UI
  • Good, because supports per-tenant role customization natively
  • Neutral, because familiar RDBMS pattern
  • Bad, because requires DB migration and new tables (Role, Permission, RolePermission)
  • Bad, because cache invalidation complexity — when does a role change take effect?
  • Bad, because misconfiguration risk — wrong DB entry could grant unintended access
  • Bad, because harder to test — needs database fixtures, not just frozen sets
  • Bad, because overkill for 4 predefined roles that rarely change

More Information

  • Research: research/rbac-fastapi/ — full analysis, source evaluation, implementation guide
  • Related ADR: ADR-0007 (Auth Evolution) — covers authentication layer this builds on
  • Personas: app/core/personas.py — separate UI-gating concern, unaffected by this ADR
  • Tenant auth: app/core/authorization.py (TenantAuthorization class), unaffected by this ADR
  • Current roles code: app/core/auth.py (User.has_role(), require_roles(), _map_groups_to_roles())

Template Version: MADR 4.0 (September 2024) with STRIDE Security Analysis

Last Updated: 2025-07-11

Maintained By: Solutions Architect 🏛️