Audit Middleware

AuditMiddleware for the Registry Staff Portal API — design, files changed, configuration, and how to verify audit events land in the Audit Manager Postgres store.

What it does

A single middleware class — AuditMiddleware — registered in the Staff Portal API's main.py, after AuthMiddleware. It captures API calls and emits one CloudEvent to the Audit Manager service per call.

Key properties:

  • Never blocks the response. Emission is fire-and-forget via asyncio.create_task, so the response is never held up waiting for the audit POST to complete.

  • Never raises. If Audit Manager is unreachable, slow, or returning errors, the failure is logged but never propagated. A broken audit pipeline cannot break the Staff Portal API.

  • Disabled by default. Both audit_enabled=true and a non-empty audit_manager_url are required to actually emit events. The default setup is a no-op — safe to ship without configuring Audit Manager at all.

Audit policy

| Request kind | Audited? |
| --- | --- |
| Authenticated (request.state.auth set), any outcome | Yes |
| Anonymous + outcome non-2xx (rejected attempt) | Yes. Captured as actor.type=anonymous (or recovered from JWT on 403, see below). Toggle off via audit_anonymous_failures=false. |
| Anonymous + outcome 2xx (legitimate public endpoint) | No |
| Health probes (/ping) | No |
| OpenAPI surfaces (/docs, /redoc, /openapi.json, /docs/oauth2-redirect) | No |
| OPTIONS preflight | No |

Why audit rejected anonymous calls? They're attempted unauthorized access, which is exactly the signal a security review needs. Automatically skipping legitimate anonymous traffic while capturing rejected anonymous traffic gives you compliance signal without flooding the audit store with bot pings or browser CORS preflights.

Disabling anonymous-failure auditing. Set REGISTRY_STAFF_PORTAL_API_AUDIT_ANONYMOUS_FAILURES=false. The middleware then reverts to the original "audit only authenticated user calls" rule.
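The policy table reduces to a small decision function. The sketch below is an illustration only: `should_audit`, `SKIP_PATHS`, and the parameter names are assumptions, and the real middleware may evaluate these conditions in a different order.

```python
# Paths that are never audited, taken from the policy table above.
SKIP_PATHS = {"/ping", "/docs", "/redoc", "/openapi.json", "/docs/oauth2-redirect"}


def should_audit(
    method: str,
    path: str,
    status_code: int,
    authenticated: bool,
    audit_anonymous_failures: bool = True,
) -> bool:
    """Return True when the call should produce an audit event."""
    if method.upper() == "OPTIONS" or path in SKIP_PATHS:
        return False
    if authenticated:
        return True  # authenticated calls are audited whatever the outcome
    is_success = 200 <= status_code < 300
    # Anonymous calls are audited only when rejected, and only if the toggle is on.
    return audit_anonymous_failures and not is_success
```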

Recovering the real user on a 403

When a user has a valid token but the wrong role, the existing AuthMiddleware raises ForbiddenError before setting request.state.auth — so by default the audit would have no user context. The middleware handles this specially: on a 403 with a bearer token present, it decodes the JWT payload itself (without re-verifying the signature — AuthMiddleware already did that before raising) to recover sub, name, preferred_username, and the client roles. This is safe because we know the signature was validated; we're just reading what the upstream already accepted.

For 401 (no token, invalid signature, expired token), the JWT cannot be trusted, so the actor is recorded as anonymous with only the client IP preserved.
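A minimal sketch of the 403 recovery path, using only the standard library. The function names and the `client_id` parameter are hypothetical; the claim names (sub, name, preferred_username, resource_access) are the ones listed above. Decoding without verification is only acceptable here because, as described, the signature was already checked upstream.

```python
import base64
import json


def decode_jwt_payload_unverified(token: str) -> dict:
    """Return the claims of a JWT WITHOUT checking its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        return {}
    payload_b64 = parts[1]
    # base64url payloads omit padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    try:
        claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    except ValueError:  # covers bad base64, bad UTF-8, and bad JSON
        return {}
    return claims if isinstance(claims, dict) else {}


def actor_from_403(authorization_header: str, client_id: str) -> dict:
    """Build the actor for a 403 response, falling back to anonymous."""
    anonymous = {"type": "anonymous", "id": "anonymous"}
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return anonymous
    claims = decode_jwt_payload_unverified(token.strip())
    if not claims.get("sub"):
        return anonymous
    roles = claims.get("resource_access", {}).get(client_id, {}).get("roles", [])
    return {
        "type": "user",
        "id": claims["sub"],
        "name": claims.get("name"),
        "username": claims.get("preferred_username"),
        "roles": roles,
    }
```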

Where it sits in the middleware stack

The order matters: audit must wrap auth (not the other way around) so that by the time we read request.state.auth after the response, it has been populated.

What gets emitted (per call)

A single CloudEvents 1.0 envelope with the OpenG2P data conventions:

| Field | Source in the request |
| --- | --- |
| id | UUID4 generated by the middleware |
| source | /openg2p/registry-staff-portal-api (configurable via audit_source) |
| type | org.openg2p.staff_portal.<endpoint_function_name> |
| time | UTC timestamp when the response was built |
| data.actor.type | "user" for authenticated callers; "anonymous" for unauthenticated rejected attempts |
| data.actor.id | principal.sub (Keycloak subject id), JWT sub on 403, or "anonymous" |
| data.actor.name | principal.name / JWT name claim (display name, e.g. "Admin User") |
| data.actor.username | JWT preferred_username claim (login handle, e.g. "admin"). Decoded directly from the bearer token. Not in the Actor schema explicitly; preserved via extra="allow" and lands under details.actor.username. |
| data.actor.roles | principal.client_roles[<keycloak_client_id>] (or resource_access.<client>.roles from JWT on 403); roles for this client only |
| data.actor.ip | X-Forwarded-For first hop → X-Real-IP → request.client.host. Picks the real user IP behind Istio / a load balancer rather than the proxy's IP. |
| data.actor.session_id | JWT session_state (or sid) claim; useful for grouping all actions in the same Keycloak login session. |
| data.action | First word of the endpoint function name (e.g. approve_change_request → approve, get_individuals → get). See note below. |
| data.outcome | 2xx → success, 401/403 → denied, other 4xx/5xx → failure |
| data.context.api | "<METHOD> <path>", e.g. "POST /change-requests/approve_change_request" |
| data.context.module | "registry-staff-portal-api" (configurable via audit_module) |
| data.context.http_status | response.status_code |
| data.context.request_id | Value of the X-Request-ID header if present |
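Two of the derived fields, data.actor.ip and data.outcome, follow simple rules that can be sketched as below. The helper names are hypothetical, and the sketch assumes header names have already been lower-cased into a plain mapping. The table does not say how 3xx responses are classified, so lumping them with "failure" here is purely a placeholder.

```python
from typing import Mapping, Optional


def resolve_client_ip(headers: Mapping[str, str], peer_host: Optional[str]) -> Optional[str]:
    """X-Forwarded-For first hop, then X-Real-IP, then the socket peer address."""
    forwarded = headers.get("x-forwarded-for", "")
    first_hop = forwarded.split(",")[0].strip()
    if first_hop:
        return first_hop
    real_ip = headers.get("x-real-ip", "").strip()
    return real_ip or peer_host


def outcome_for(status_code: int) -> str:
    """Map an HTTP status code to the audit outcome."""
    if 200 <= status_code < 300:
        return "success"
    if status_code in (401, 403):
        return "denied"
    return "failure"  # other 4xx/5xx (and, as a placeholder, anything else)
```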

Why action is the first word, not the full function name

Most Staff Portal endpoints are declared as POST (they take JSON bodies for filters / pagination / sort), so the HTTP method tells you nothing about intent. The endpoint function name does — get_individuals, create_change_request, delete_template. The middleware splits on the first _ and stores just the verb in data.action.

This keeps action a low-cardinality dimension (~6 verbs: get / list / create / update / delete / search) so it's useful for cross-service dashboards and filters like "all delete events last week" or "all login failures across the platform". If we stored the full name there, the column would have hundreds of distinct values and be useless for aggregation.

Nothing is lost — the full function name is preserved in two other places on the same row:

  • type → org.openg2p.staff_portal.get_individuals (the full operation name)

  • details.context.api → POST /getIndividual (the wire-level call)

So action is the summary verb, type is the full op, and context.api is the HTTP form. Three layers, three uses.
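The split described above is a one-liner. In this sketch `split_operation` is a hypothetical helper name; the type prefix is the one shown in the field table.

```python
from typing import Tuple


def split_operation(
    endpoint_name: str,
    type_prefix: str = "org.openg2p.staff_portal",
) -> Tuple[str, str]:
    """Return (action, type): the leading verb and the full operation name."""
    action = endpoint_name.split("_", 1)[0]  # text before the first underscore
    event_type = f"{type_prefix}.{endpoint_name}"
    return action, event_type
```

For example, `split_operation("approve_change_request")` yields the action `approve` and the type `org.openg2p.staff_portal.approve_change_request`.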

data.resource is intentionally not populated in this iteration — most Staff Portal endpoints are RPC-shaped POSTs without a clean URL-path entity to extract. We can add it later via per-route hints if needed.

data.actor.id and data.actor.type land in flat indexed columns (actor_id, actor_type); the rest of actor.* (name, username, roles, ip) lives under the details.actor.* JSONB column on the audit-manager side — see Mapping from CloudEvents to Postgres columns.

Files changed

In openg2p-registry-gen2-apis/openg2p-registry-staff-portal-api/:

| File | Change |
| --- | --- |
| src/openg2p_registry_staff_portal_api/audit_middleware.py | New: the middleware class. ~280 lines including JWT-decode helper for the 403 recovery path. |
| src/openg2p_registry_staff_portal_api/config.py | +6 settings: audit_enabled, audit_manager_url, audit_timeout_seconds, audit_source, audit_module, audit_anonymous_failures |
| src/openg2p_registry_staff_portal_api/main.py | +11 lines to register AuditMiddleware after AuthMiddleware |
| .env.example | +8 lines documenting the new env vars |

In openg2p-audit-manager/:

| File | Change |
| --- | --- |
| src/audit_manager/schema/cloud_event.py | Actor model gains extra="allow" so emitter-supplied custom actor fields (e.g. username) flow through to details.actor.* without a schema change here. |

Configuration

Six new environment variables (all prefixed with REGISTRY_STAFF_PORTAL_API_):

| Env var | Default | Purpose |
| --- | --- | --- |
| REGISTRY_STAFF_PORTAL_API_AUDIT_ENABLED | false | Master on/off switch. Must be true AND a URL must be set for emission to happen. |
| REGISTRY_STAFF_PORTAL_API_AUDIT_MANAGER_URL | empty | Base URL of Audit Manager, e.g. http://localhost:8002 or http://audit-manager:80. |
| REGISTRY_STAFF_PORTAL_API_AUDIT_TIMEOUT_SECONDS | 2.0 | Timeout on each POST to Audit Manager. Bounded so a slow audit endpoint can't pile up. |
| REGISTRY_STAFF_PORTAL_API_AUDIT_SOURCE | /openg2p/registry-staff-portal-api | CloudEvents source field. Override only if you run multiple staff-portal deployments. |
| REGISTRY_STAFF_PORTAL_API_AUDIT_MODULE | registry-staff-portal-api | Module name placed in data.context.module. |
| REGISTRY_STAFF_PORTAL_API_AUDIT_ANONYMOUS_FAILURES | true | When true, also audit rejected anonymous calls (401/403). Set to false to revert to the original "audit only authenticated user calls" rule. |

To disable auditing entirely: set AUDIT_ENABLED=false, or omit AUDIT_MANAGER_URL. Either condition makes the middleware a no-op, so there's no need to remove it from main.py. To confirm, look for the startup log line AuditMiddleware disabled (...). No-op.

To disable only anonymous-failure auditing (and keep authenticated auditing): set AUDIT_ANONYMOUS_FAILURES=false. Useful in environments where the service is exposed to bot/scanner traffic and you don't want the audit store to fill with rejected anonymous probes.
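The "both conditions required" rule can be sketched with a standard-library stand-in for the settings class. The real service reads these values through its own config.py; `AuditSettings`, `load_settings`, and `emission_active` are hypothetical names, and the accepted truthy strings are an assumption.

```python
import os
from dataclasses import dataclass

ENV_PREFIX = "REGISTRY_STAFF_PORTAL_API_"


def _env_bool(name: str, default: bool, env) -> bool:
    """Parse a boolean env var; the accepted truthy spellings are an assumption."""
    raw = env.get(ENV_PREFIX + name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")


@dataclass(frozen=True)
class AuditSettings:
    audit_enabled: bool = False
    audit_manager_url: str = ""
    audit_timeout_seconds: float = 2.0
    audit_source: str = "/openg2p/registry-staff-portal-api"
    audit_module: str = "registry-staff-portal-api"
    audit_anonymous_failures: bool = True

    @property
    def emission_active(self) -> bool:
        """Events are emitted only when the switch is on AND a URL is set."""
        return self.audit_enabled and bool(self.audit_manager_url.strip())


def load_settings(env=os.environ) -> AuditSettings:
    """Build settings from a mapping of environment variables."""
    return AuditSettings(
        audit_enabled=_env_bool("AUDIT_ENABLED", False, env),
        audit_manager_url=env.get(ENV_PREFIX + "AUDIT_MANAGER_URL", ""),
        audit_timeout_seconds=float(env.get(ENV_PREFIX + "AUDIT_TIMEOUT_SECONDS", "2.0")),
        audit_source=env.get(ENV_PREFIX + "AUDIT_SOURCE", "/openg2p/registry-staff-portal-api"),
        audit_module=env.get(ENV_PREFIX + "AUDIT_MODULE", "registry-staff-portal-api"),
        audit_anonymous_failures=_env_bool("AUDIT_ANONYMOUS_FAILURES", True, env),
    )
```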

To enable for local dev (Audit Manager port-forwarded from cluster):

Restart uvicorn and the startup log will show:

Wiring through the registry Helm chart

You do not set REGISTRY_STAFF_PORTAL_API_AUDIT_* env vars by hand on each deployment. The registry's base Helm chart already plumbs them through three global.audit* values that flow into the staff-portal-api's envVars block:

To enable auditing for an environment, add to your per-env values file (e.g. values-trial.yaml):

…then helm upgrade the registry release. The Rancher UI also exposes these three under the Audit Manager group via questions.yaml, so operators can flip them without editing YAML.

Cross-namespace deployment — if audit-manager is not in the same namespace as the registry release, set the FQDN:

For the full registry-side documentation (dependency table, version matrix, the 4.1.0 release entry that introduces this feature), see the registry Helm chart 4.x doc — Audit Manager integration.

Enabling without redeploying staff-portal-api code. Because every audit env var is no-op-by-default, you can ship the chart change first, while the staff-portal-api image still lacks the AuditMiddleware: the settings model uses pydantic-settings' extra="allow", so the not-yet-known env vars are tolerated rather than rejected at startup. When you later roll the new image with the middleware, the variables are already in place and emission turns on at restart.

Verification — end-to-end

After enabling:

The response is unchanged — same status code, same body, same latency.

Expected: at least one new row with

  • type = org.openg2p.staff_portal.get_registry_configuration

  • actor_id = <your admin user's sub>

  • action = get

  • outcome = failure (because the empty body returned 400)

The DB user / DB name may differ depending on how the postgres-init chart provisioned them in your cluster — adjust the psql command accordingly. See the Audit Manager deployment notes for naming.

What "no-op" looks like

When auditing is off (default), the middleware adds essentially zero overhead per request:

  1. One attribute access (request.state.auth)

  2. One short-circuit return

No CloudEvent is built, no HTTP client is created, no async task is scheduled. A startup log line confirms the disabled state:

What we deliberately did NOT do (yet)

| Skipped feature | Why | When to add it |
| --- | --- | --- |
| data.resource extraction | Most endpoints are RPC POSTs without clean entity URLs | When we want investigators to filter by entity id |
| Capture response body's error reason into data.reason | Reading the body in middleware needs care with streaming responses | When 4xx/5xx volume warrants quick triage by reason |
| Local disk spool on emission failure | At-least-once is already provided by the Audit Manager itself; staff portal pod crashes mid-emit are rare and tolerable | If volumes/availability targets ever require zero loss on the producer side |
| Sampling | Audit volume from staff portal is small enough to capture every call | If a future caller is too chatty (high-traffic public API) |
| Promotion to a shared library | Keep the integration scoped to one service while we validate the shape | Once a second service wants the same middleware, lift it to openg2p-fastapi-common |

Operational notes

  • Cold start cost: the httpx.AsyncClient is lazy-created on the first emission, not at app import time. The first audited call sees a small extra latency (~ms) for client setup; subsequent calls reuse the connection pool.

  • Restart safety: all in-flight asyncio.create_task(_emit(...)) calls are cancelled on shutdown. Up to a few hundred ms of tail emissions can be lost during a rolling restart. Acceptable per design (audits at this layer are best-effort; durability is provided by Audit Manager itself once the event reaches Kafka).

  • Audit Manager errors are logged at WARN. A spike of WARN lines containing "Audit emission failed" typically means Audit Manager is down, unreachable, or returning 503 backpressure, or the URL is wrong. None of these affect Staff Portal API responses.
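The lazy-client and restart behaviour in the notes above can be illustrated with a small lifecycle helper. This is a sketch under assumptions: `EmitterLifecycle` and `client_factory` are hypothetical names, and the factory stands in for constructing the httpx.AsyncClient mentioned above, so the example runs with the standard library alone.

```python
import asyncio


class EmitterLifecycle:
    """Lazily creates one shared client and tracks in-flight emission tasks."""

    def __init__(self, client_factory):
        self._client_factory = client_factory  # hypothetical stand-in for the HTTP client constructor
        self._client = None
        self._tasks = set()

    def client(self):
        """Create the client on first use, then reuse it for later emissions."""
        if self._client is None:
            self._client = self._client_factory()
        return self._client

    def track(self, coro) -> asyncio.Task:
        """Schedule an emission coroutine and remember it until it finishes."""
        task = asyncio.create_task(coro)
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return task

    async def shutdown(self) -> int:
        """Cancel whatever is still in flight; return how many emissions were dropped."""
        pending = [t for t in self._tasks if not t.done()]
        for task in pending:
            task.cancel()
        await asyncio.gather(*pending, return_exceptions=True)
        return len(pending)
```

The count returned by `shutdown` corresponds to the tail emissions that a rolling restart may lose.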
