Audit Middleware
AuditMiddleware for the Registry Staff Portal API — design, files changed, configuration, and how to verify audit events land in the Audit Manager Postgres store.
What it does
A single middleware class — AuditMiddleware — registered in the Staff Portal API's main.py, after AuthMiddleware. It captures API calls and emits one CloudEvent to the Audit Manager service per call.
Key properties:
Never blocks the response. Emission is asyncio.create_task fire-and-forget. The user's request has already returned by the time the audit POST completes.
Never raises. Audit Manager unreachable, slow, or returning errors are all logged but never propagated. A broken audit pipeline cannot break the Staff Portal API.
Disabled by default. Both audit_enabled=true and a non-empty audit_manager_url are required to actually emit events. The default setup is a no-op, so it is safe to ship without configuring Audit Manager at all.
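The three properties above can be sketched in a few lines. This is a minimal illustration, not the middleware's actual code: the class name FireAndForgetEmitter and the injected send coroutine are hypothetical stand-ins for whatever performs the POST to Audit Manager.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional

logger = logging.getLogger("audit")

# Hypothetical type: any coroutine function that delivers one event.
Sender = Callable[[dict], Awaitable[None]]


class FireAndForgetEmitter:
    """Schedules audit emission without blocking or raising."""

    def __init__(self, enabled: bool, manager_url: str, send: Sender) -> None:
        # Both conditions are required; otherwise every call is a no-op.
        self.active = bool(enabled and manager_url)
        self._send = send
        # Keep strong references so pending tasks are not garbage-collected.
        self._tasks: set[asyncio.Task] = set()

    def emit(self, event: dict) -> Optional[asyncio.Task]:
        if not self.active:
            return None
        # Schedule and return immediately; the caller never awaits the POST.
        task = asyncio.create_task(self._safe_send(event))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return task

    async def _safe_send(self, event: dict) -> None:
        try:
            await self._send(event)
        except Exception as exc:
            # Log and swallow: a broken audit pipeline must not break the API.
            logger.warning("Audit emission failed: %s", exc)
```

Holding the task in a set until it finishes is a general asyncio precaution, since the event loop itself only keeps weak references to tasks.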
Audit policy
Authenticated (request.state.auth set), any outcome: Yes
Anonymous + outcome non-2xx (rejected attempt): Yes, captured as actor.type=anonymous (or recovered from JWT on 403, see below). Toggle off via audit_anonymous_failures=false.
Anonymous + outcome 2xx (legitimate public endpoint): No
Health probes (/ping): No
OpenAPI surfaces (/docs, /redoc, /openapi.json, /docs/oauth2-redirect): No
OPTIONS preflight: No
Why audit rejected anonymous calls? They're attempted unauthorized access — exactly the signal a security review needs. The combination "automatic skip of legitimate anonymous traffic + capture of rejected anonymous traffic" gives you compliance signal without flooding the audit store with bot pings or browser CORS.
Disabling anonymous-failure auditing. Set REGISTRY_STAFF_PORTAL_API_AUDIT_ANONYMOUS_FAILURES=false. The middleware then reverts to the original "audit only authenticated user calls" rule.
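The policy above reduces to a small decision function. The following is a sketch under the rules as listed; the function name should_audit and its parameters are illustrative, and the real middleware may order or express these checks differently.

```python
# Paths that are never audited: the health probe and the OpenAPI surfaces.
SKIPPED_PATHS = frozenset(
    {"/ping", "/docs", "/redoc", "/openapi.json", "/docs/oauth2-redirect"}
)


def should_audit(
    method: str,
    path: str,
    status_code: int,
    authenticated: bool,
    audit_anonymous_failures: bool = True,
) -> bool:
    """Return True when the call should produce an audit event."""
    if method.upper() == "OPTIONS":
        return False  # CORS preflight
    if path in SKIPPED_PATHS:
        return False
    if authenticated:
        return True  # authenticated calls are audited whatever the outcome
    is_2xx = 200 <= status_code < 300
    # Anonymous: only rejected attempts, and only while the toggle is on.
    return audit_anonymous_failures and not is_2xx
```

For example, should_audit("POST", "/get_individuals", 401, authenticated=False) is True with the default toggle and False once audit_anonymous_failures is switched off.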
Recovering the real user on a 403
When a user has a valid token but the wrong role, the existing AuthMiddleware raises ForbiddenError before setting request.state.auth — so by default the audit would have no user context. The middleware handles this specially: on a 403 with a bearer token present, it decodes the JWT payload itself (without re-verifying the signature — AuthMiddleware already did that before raising) to recover sub, name, preferred_username, and the client roles. This is safe because we know the signature was validated; we're just reading what the upstream already accepted.
For 401 (no token, invalid signature, expired token), the JWT cannot be trusted, so the actor is recorded as anonymous with only the client IP preserved.
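Reading the payload of an already-validated token needs only base64 and JSON. The sketch below assumes a compact three-part JWT and Keycloak-style claims; the helper names are hypothetical, and the code must only be used on the 403 path, where the signature has already been checked upstream.

```python
import base64
import json
from typing import Any


def decode_jwt_payload_unverified(token: str) -> dict[str, Any]:
    """Read the claims of a JWT without checking its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a compact JWT")
    payload = parts[1]
    # base64url segments omit padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def actor_from_claims(claims: dict[str, Any], client_id: str) -> dict[str, Any]:
    """Pick the fields the audit event needs from Keycloak-style claims."""
    # Roles for this client only, read from resource_access.<client>.roles.
    roles = claims.get("resource_access", {}).get(client_id, {}).get("roles", [])
    return {
        "type": "user",
        "id": claims.get("sub"),
        "name": claims.get("name"),
        "username": claims.get("preferred_username"),
        "roles": roles,
    }
```

On a 401 none of this applies: the token is untrusted, so the actor stays anonymous.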
Where it sits in the middleware stack
The order matters: audit must wrap auth (not the other way around) so that by the time we read request.state.auth after the response, it has been populated.
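A toy model of the wrapping, using plain coroutines rather than the real middleware classes, shows why the outer position matters. The layer functions and the request object are invented for illustration. The mapping from "registered after AuthMiddleware" to "outermost" assumes Starlette's usual behaviour, where the middleware added last wraps the ones added before it.

```python
import asyncio
from types import SimpleNamespace


async def endpoint(request):
    return 200


def auth_layer(inner):
    async def handler(request):
        if request.token is None:
            return 401  # rejected before the endpoint; state.auth is never set
        request.state.auth = {"sub": request.token}
        return await inner(request)
    return handler


def audit_layer(inner, seen):
    async def handler(request):
        status = await inner(request)
        # Read state only after the inner layers have run.
        seen.append((status, getattr(request.state, "auth", None)))
        return status
    return handler


def make_request(token):
    return SimpleNamespace(token=token, state=SimpleNamespace())


seen: list = []
app = audit_layer(auth_layer(endpoint), seen)  # audit is the outer layer
asyncio.run(app(make_request("user-123")))  # authenticated call
asyncio.run(app(make_request(None)))  # rejected call
```

With audit on the outside, both calls reach the audit layer: the first with auth populated, the second as a rejected attempt with no auth. If the layers were swapped, the rejected call would return from auth_layer without the audit layer ever running.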
What gets emitted (per call)
A single CloudEvents 1.0 envelope with the OpenG2P data conventions:
id: UUID4 generated by the middleware
source: /openg2p/registry-staff-portal-api (configurable via audit_source)
type: org.openg2p.staff_portal.<endpoint_function_name>
time: UTC timestamp when the response was built
data.actor.type: "user" for authenticated callers; "anonymous" for unauthenticated rejected attempts
data.actor.id: principal.sub (Keycloak subject id), JWT sub on 403, or "anonymous"
data.actor.name: principal.name / JWT name claim (display name, e.g. "Admin User")
data.actor.username: JWT preferred_username claim (login handle, e.g. "admin"). Decoded directly from the bearer token. Not in the Actor schema explicitly; preserved via extra="allow" and lands under details.actor.username.
data.actor.roles: principal.client_roles[<keycloak_client_id>] (or resource_access.<client>.roles from JWT on 403), roles for this client only
data.actor.ip: X-Forwarded-For first hop → X-Real-IP → request.client.host. Picks the real user IP behind Istio / a load balancer rather than the proxy's IP.
data.actor.session_id: JWT session_state (or sid) claim, useful for grouping all actions in the same Keycloak login session.
data.action: First word of the endpoint function name (e.g. approve_change_request → approve, get_individuals → get). See note below.
data.outcome: 2xx → success, 401/403 → denied, other 4xx/5xx → failure
data.context.api: "<METHOD> <path>", e.g. "POST /change-requests/approve_change_request"
data.context.module: "registry-staff-portal-api" (configurable via audit_module)
data.context.http_status: response.status_code
data.context.request_id: Value of the X-Request-ID header if present
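Putting the fields together, an envelope could be assembled roughly as follows. This is a sketch: the function names are hypothetical, specversion is the standard CloudEvents 1.0 attribute, header keys are assumed to be lower-cased, and treating 3xx as failure is an assumption because the mapping above does not mention it.

```python
import uuid
from datetime import datetime, timezone
from typing import Mapping, Optional


def outcome_for(status_code: int) -> str:
    if 200 <= status_code < 300:
        return "success"
    if status_code in (401, 403):
        return "denied"
    return "failure"  # other 4xx/5xx; 3xx lands here too (assumption)


def client_ip(headers: Mapping[str, str], peer_host: Optional[str]) -> Optional[str]:
    """X-Forwarded-For first hop, then X-Real-IP, then the socket peer."""
    first_hop = headers.get("x-forwarded-for", "").split(",")[0].strip()
    return first_hop or headers.get("x-real-ip") or peer_host


def build_event(
    function_name: str,
    method: str,
    path: str,
    status_code: int,
    actor: dict,
    request_id: Optional[str] = None,
    source: str = "/openg2p/registry-staff-portal-api",
    module: str = "registry-staff-portal-api",
) -> dict:
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": source,
        "type": f"org.openg2p.staff_portal.{function_name}",
        "time": datetime.now(timezone.utc).isoformat(),
        "data": {
            "actor": actor,
            "action": function_name.partition("_")[0],
            "outcome": outcome_for(status_code),
            "context": {
                "api": f"{method} {path}",
                "module": module,
                "http_status": status_code,
                "request_id": request_id,
            },
        },
    }
```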
Why action is the first word, not the full function name
Most Staff Portal endpoints are declared as POST (they take JSON bodies for filters / pagination / sort), so the HTTP method tells you nothing about intent. The endpoint function name does: get_individuals, create_change_request, delete_template. The middleware splits on the first _ and stores just the verb in data.action.
This keeps action a low-cardinality dimension (~6 verbs: get / list / create / update / delete / search) so it's useful for cross-service dashboards and filters like "all delete events last week" or "all login failures across the platform". If we stored the full name there, the column would have hundreds of distinct values and be useless for aggregation.
Nothing is lost — the full function name is preserved in two other places on the same row:
type → org.openg2p.staff_portal.get_individuals (the full operation name)
details.context.api → POST /getIndividual (the wire-level call)
So action is the summary verb, type is the full op, and context.api is the HTTP form. Three layers, three uses.
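The cardinality argument is easy to see on a handful of function names taken from this page. The helper derive_action is an illustrative name for the "split on the first underscore" rule, not necessarily what the middleware calls it.

```python
from collections import Counter


def derive_action(function_name: str) -> str:
    """Return the part of the name before the first underscore."""
    return function_name.partition("_")[0]


endpoints = [
    "get_individuals",
    "get_registry_configuration",
    "create_change_request",
    "approve_change_request",
    "delete_template",
]

# Group the endpoint names by their summary verb.
by_action = Counter(derive_action(name) for name in endpoints)
```

Five distinct function names collapse into four verbs here, and the gap widens as more endpoints are added, which is what keeps action usable as a filter and aggregation key.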
data.resource is intentionally not populated in this iteration — most Staff Portal endpoints are RPC-shaped POSTs without a clean URL-path entity to extract. We can add it later via per-route hints if needed.
data.actor.id and data.actor.type land in flat indexed columns (actor_id, actor_type); the rest of actor.* (name, username, roles, ip) lives under the details.actor.* JSONB column on the audit-manager side — see Mapping from CloudEvents to Postgres columns.
Files changed
In openg2p-registry-gen2-apis/openg2p-registry-staff-portal-api/:
src/openg2p_registry_staff_portal_api/audit_middleware.py: New, the middleware class. ~280 lines including JWT-decode helper for the 403 recovery path.
src/openg2p_registry_staff_portal_api/config.py: +6 settings: audit_enabled, audit_manager_url, audit_timeout_seconds, audit_source, audit_module, audit_anonymous_failures
src/openg2p_registry_staff_portal_api/main.py: +11 lines to register AuditMiddleware after AuthMiddleware
.env.example: +8 lines documenting the new env vars
In openg2p-audit-manager/:
src/audit_manager/schema/cloud_event.py: Actor model gains extra="allow" so emitter-supplied custom actor fields (e.g. username) flow through to details.actor.* without a schema change here.
Configuration
Six new environment variables (all prefixed with REGISTRY_STAFF_PORTAL_API_):
REGISTRY_STAFF_PORTAL_API_AUDIT_ENABLED (default: false): Master on/off switch. Must be true AND a URL must be set for emission to happen.
REGISTRY_STAFF_PORTAL_API_AUDIT_MANAGER_URL (default: empty): Base URL of Audit Manager, e.g. http://localhost:8002 or http://audit-manager:80.
REGISTRY_STAFF_PORTAL_API_AUDIT_TIMEOUT_SECONDS (default: 2.0): Timeout on each POST to Audit Manager. Bounded so a slow audit endpoint can't pile up.
REGISTRY_STAFF_PORTAL_API_AUDIT_SOURCE (default: /openg2p/registry-staff-portal-api): CloudEvents source field. Override only if you run multiple staff-portal deployments.
REGISTRY_STAFF_PORTAL_API_AUDIT_MODULE (default: registry-staff-portal-api): Module name placed in data.context.module.
REGISTRY_STAFF_PORTAL_API_AUDIT_ANONYMOUS_FAILURES (default: true): When true, also audit rejected anonymous calls (401/403). Set to false to revert to the original "audit only authenticated user calls" rule.
To disable auditing entirely: set AUDIT_ENABLED=false, or omit AUDIT_MANAGER_URL. Either condition makes the middleware a no-op — there's no need to remove the middleware from main.py. The startup log will say AuditMiddleware disabled (...). No-op. so you can confirm.
To disable only anonymous-failure auditing (and keep authenticated auditing): set AUDIT_ANONYMOUS_FAILURES=false. Useful in environments where the service is exposed to bot/scanner traffic and you don't want the audit store to fill with rejected anonymous probes.
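How the six variables combine can be sketched with the standard library alone. The service itself reads them through pydantic-settings; the dataclass, the boolean parsing rule, and the emission_active property below are illustrative assumptions, with only the variable names and defaults taken from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

PREFIX = "REGISTRY_STAFF_PORTAL_API_"


def _as_bool(raw: str) -> bool:
    # Assumed parsing rule for this sketch only.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class AuditSettings:
    audit_enabled: bool = False
    audit_manager_url: str = ""
    audit_timeout_seconds: float = 2.0
    audit_source: str = "/openg2p/registry-staff-portal-api"
    audit_module: str = "registry-staff-portal-api"
    audit_anonymous_failures: bool = True

    @property
    def emission_active(self) -> bool:
        # Both the switch and a non-empty URL are required.
        return self.audit_enabled and bool(self.audit_manager_url)


def load_settings(env: Optional[Mapping[str, str]] = None) -> AuditSettings:
    env = os.environ if env is None else env
    defaults = AuditSettings()

    def raw(field: str) -> Optional[str]:
        return env.get(PREFIX + field.upper())

    return AuditSettings(
        audit_enabled=_as_bool(raw("audit_enabled") or "false"),
        audit_manager_url=raw("audit_manager_url") or defaults.audit_manager_url,
        audit_timeout_seconds=float(
            raw("audit_timeout_seconds") or defaults.audit_timeout_seconds
        ),
        audit_source=raw("audit_source") or defaults.audit_source,
        audit_module=raw("audit_module") or defaults.audit_module,
        audit_anonymous_failures=_as_bool(raw("audit_anonymous_failures") or "true"),
    )
```

With an empty environment, load_settings({}).emission_active is False, which matches the disabled-by-default behaviour described above.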
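To enable for local dev (Audit Manager port-forwarded from cluster), set REGISTRY_STAFF_PORTAL_API_AUDIT_ENABLED=true and point REGISTRY_STAFF_PORTAL_API_AUDIT_MANAGER_URL at the forwarded port (for example http://localhost:8002) in your environment or .env file.
Then restart uvicorn; the startup log reports whether AuditMiddleware is enabled.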
Wiring through the registry Helm chart
You do not set REGISTRY_STAFF_PORTAL_API_AUDIT_* env vars by hand on each deployment. The registry's base Helm chart already plumbs them through three global.audit* values that flow into the staff-portal-api's envVars block.
To enable auditing for an environment, set those three values in your per-env values file (e.g. values-trial.yaml), then helm upgrade the registry release. The Rancher UI also exposes these three under the Audit Manager group via questions.yaml, so operators can flip them without editing YAML.
Cross-namespace deployment: if audit-manager is not in the same namespace as the registry release, set the Audit Manager URL value to the service's FQDN.
For the full registry-side documentation (dependency table, version matrix, the 4.1.0 release entry that introduces this feature), see the registry Helm chart 4.x doc — Audit Manager integration.
Enabling without redeploying staff-portal-api code. Because every audit env var is no-op-by-default, you can ship the chart change first (while the staff-portal-api image still lacks the AuditMiddleware — the unknown env vars are silently ignored by pydantic-settings' extra="allow"). When you later roll the new image with the middleware, the variables are already in place and emission turns on at restart.
Verification — end-to-end
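After enabling, call an authenticated endpoint as an admin user. The expected row below corresponds to calling get_registry_configuration with an empty body, which returns 400.
The response is unchanged: same status code, same body, same latency.
Then query the Audit Manager Postgres store with psql for the most recent audit rows. Expected: at least one new row with
type = org.openg2p.staff_portal.get_registry_configuration
actor_id = <your admin user's sub>
action = get
outcome = failure (because the empty body returned 400)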
The DB user / DB name may differ depending on how the postgres-init chart provisioned them in your cluster — adjust the psql command accordingly. See the Audit Manager deployment notes for naming.
What "no-op" looks like
When auditing is off (default), the middleware adds essentially zero overhead per request:
One attribute access (request.state.auth)
One short-circuit return
No CloudEvent is built, no HTTP client is created, no async task is scheduled. A startup log line of the form AuditMiddleware disabled (...). No-op. confirms the disabled state.
What we deliberately did NOT do (yet)
data.resource extraction. Why not yet: Most endpoints are RPC POSTs without clean entity URLs. Revisit: When we want investigators to filter by entity id.
Capture response body's error reason into data.reason. Why not yet: Reading the body in middleware needs care with streaming responses. Revisit: When 4xx/5xx volume warrants quick triage by reason.
Local disk spool on emission failure. Why not yet: At-least-once is already provided by the Audit Manager itself; staff portal pod crashes mid-emit are rare and tolerable. Revisit: If volumes/availability targets ever require zero loss on the producer side.
Sampling. Why not yet: Audit volume from staff portal is small enough to capture every call. Revisit: If a future caller is too chatty (high-traffic public API).
Promotion to a shared library. Why not yet: Keep the integration scoped to one service while we validate the shape. Revisit: Once a second service wants the same middleware, lift it to openg2p-fastapi-common.
Operational notes
Cold start cost: the httpx.AsyncClient is lazy-created on the first emission, not at app import time. The first audited call sees a small extra latency (~ms) for client setup; subsequent calls reuse the connection pool.
Restart safety: all in-flight asyncio.create_task(_emit(...)) calls are cancelled on shutdown. Up to a few hundred ms of tail emissions can be lost during a rolling restart. Acceptable per design (audits at this layer are best-effort; durability is provided by Audit Manager itself once the event reaches Kafka).
Audit Manager errors are logged at WARN. A spike of WARN lines with Audit emission failed typically means: Audit Manager is down, unreachable, returning 503 backpressure, or the URL is wrong. None of these affect Staff Portal API responses.
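The cold-start and restart behaviour can be modelled without httpx. In this sketch the class LazyEmitter, the injected client_factory, and the client's post coroutine are hypothetical stand-ins; error handling inside the emission is left out to keep the focus on lazy creation and cancellation.

```python
import asyncio
from typing import Any, Callable, Optional


class LazyEmitter:
    """Creates its client on first use and cancels pending work on shutdown."""

    def __init__(self, client_factory: Callable[[], Any]) -> None:
        self._factory = client_factory
        self._client: Optional[Any] = None  # nothing is built at import time
        self._tasks: set[asyncio.Task] = set()

    def emit(self, event: dict) -> asyncio.Task:
        if self._client is None:
            # Paid once, on the first audited call only.
            self._client = self._factory()
        task = asyncio.create_task(self._client.post(event))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return task

    async def shutdown(self) -> int:
        """Cancel in-flight emissions; return how many were dropped."""
        pending = [t for t in self._tasks if not t.done()]
        for task in pending:
            task.cancel()
        # Wait for the cancellations to settle without raising.
        await asyncio.gather(*pending, return_exceptions=True)
        return len(pending)
```

The count returned by shutdown corresponds to the tail emissions that the restart-safety note above accepts as lost.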
Related pages
Local Install — Staff Portal API — set up the service locally before enabling audit.
Audit Manager — Functional Specifications — the CloudEvents schema and DB column mapping this middleware emits to.
Audit Manager — API Reference — the HTTP contract the middleware POSTs to.