# Backups

This page is the entry point for the OpenG2P backup automation that lives at `automation/backups/openg2p-backup.sh` in the deployment repo. It complements the [Production Automation](/operations/deployment/infrastructure-setup/production-automation.md): the production script gets the platform up, and the backup automation keeps it recoverable.

{% hint style="info" %}
**Ongoing operational concern** — not a one-time deployment stage. Configure backups **before go-live** and keep them running throughout the system's lifetime. For the staged Production rollout, see the [Production overview](/operations/deployment/infrastructure-setup.md).
{% endhint %}

{% hint style="info" %}
Backups are **required for production**: the Backup node is the 4th node of the production topology. The platform install and the backup setup are **separate steps**. Bring the cluster up first, then provision the Backup node (`backup_node.enabled: true`) and run `openg2p-backup.sh install`. The platform can run without backups while you set them up, but a production deployment is not complete, and must not go live, until backups are in place.
{% endhint %}

## What this is, in one paragraph

A 4th "backup" node on the same VPC runs cron-driven backups of every part of an OpenG2P production install. PostgreSQL is backed up via pgBackRest with WAL streaming for \~1-minute RPO; etcd via snapshots from RKE2's built-in mechanism; Kubernetes resources via the [rancher-backup operator](https://ranchermanager.docs.rancher.com/integrations-in-rancher/backup-restore-and-disaster-recovery); NFS data via [restic](https://restic.net/), with a sidecar manifest that maps NFS UUID directories back to their PVC/namespace/app; and the small but critical filesystem state (WireGuard config, Nginx config, RKE2 TLS material) via restic over SSH-tar. All repos are encrypted at rest. Drills run weekly. Restores are deliberate: data is staged into temp dirs and never overwrites live data without an operator's runbook step.

## Sub-pages

* [Architecture](/operations/deployment/infrastructure-setup/backups/architecture.md) — the tools, why each is here, what's deliberately not used
* [What gets backed up](/operations/deployment/infrastructure-setup/backups/what-gets-backed-up.md) — the per-component table and the rationale for what's lost vs. recreated on a fresh install
* [Prerequisites](/operations/deployment/infrastructure-setup/backups/prerequisites.md) — backup-node sizing, network, secret custody (p12 keystore model)
* [Configuration](/operations/deployment/infrastructure-setup/backups/configuration.md) — `backup-config.yaml` reference
* [Operations](/operations/deployment/infrastructure-setup/backups/operations.md) — `install`, `run`, `verify`, `list`, `status`, group toggles
* [Drills](/operations/deployment/infrastructure-setup/backups/drills.md) — weekly verify + dry-run-restore harness, interpreting `.status.json`
* [Restoration](/operations/deployment/infrastructure-setup/backups/restoration.md) — index of restore scenarios
  * [Postgres PITR](/operations/deployment/infrastructure-setup/backups/restoration/postgres-pitr.md)
  * [Single PVC](/operations/deployment/infrastructure-setup/backups/restoration/single-pvc.md)
  * [Etcd in-place](/operations/deployment/infrastructure-setup/backups/restoration/etcd-in-place.md)
  * [Full rebuild](/operations/deployment/infrastructure-setup/backups/restoration/full-rebuild.md)
* [Alerting (Phase 2)](/operations/deployment/infrastructure-setup/backups/alerting.md) — candidate mechanisms, deferred from v1

## TL;DR — get backups running

```bash
# 0. (One-time, only if you didn't enable backup_node before) Re-provision
#    AWS to add the 4th instance + EBS volume.
cd automation/production/aws/
# Set backup_node.enabled: true in aws-config.yaml
./openg2p-aws-provision.sh --config aws-config.yaml

# 1. Configure backups.
cd ../../backups/
cp backup-config.example.yaml backup-config.yaml
# Edit backup-config.yaml — passphrase paths in your p12 keystore,
# group toggles (default: all on), retention, schedules.

# 2. Bootstrap.
./openg2p-backup.sh install --config backup-config.yaml

# 3. Smoke-test.
./openg2p-backup.sh run --config backup-config.yaml --component all
./openg2p-backup.sh status --config backup-config.yaml

# 4. (Optional, separate maintenance window) Enable encryption-at-rest for
#    Kubernetes Secrets in etcd. Apiserver restarts (~30-60s).
./openg2p-backup.sh install --config backup-config.yaml --enable-secret-encryption
```

After install, cron on the backup host runs the daily/weekly schedule. Operators interact with the system via the orchestrator from their laptop for ad-hoc runs, status checks, and restores.
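
One quick way to confirm that cron is actually firing is to check how recently the status file was written. The sketch below assumes GNU `date` on the backup host and assumes the status file (`/var/lib/openg2p-backup/.status.json`, named later on this page) is rewritten on every scheduled run. The function names are hypothetical and not part of `openg2p-backup.sh`; the check looks only at the file's modification time and makes no assumption about the JSON schema.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: not shipped with openg2p-backup.sh.

# Print the age of a file in whole seconds.
# Assumes GNU date, where `-r FILE` reads the file's modification time.
file_age_seconds() {
  local file="$1" mtime
  mtime=$(date -r "$file" +%s) || return 1
  echo $(( $(date +%s) - mtime ))
}

# Exit 0 if the file was modified within the last $2 seconds,
# 1 if it is older, 2 if it cannot be read.
status_is_fresh() {
  local file="$1" max_age="$2" age
  age=$(file_age_seconds "$file") || return 2
  [ "$age" -le "$max_age" ]
}
```

On the backup host, something like `status_is_fresh /var/lib/openg2p-backup/.status.json $((26 * 3600)) || echo "backup status is stale"` would flag a missed daily run. The 26-hour threshold is an illustrative choice that leaves slack around a 24-hour schedule.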

## Recovery objectives

| Component                                                | RPO           | RTO                                            |
| -------------------------------------------------------- | ------------- | ---------------------------------------------- |
| PostgreSQL (with WAL streaming)                          | ≈1 min        | minutes (PITR), tens of minutes (full restore) |
| Kubernetes resources (Secrets, CRs, PV/PVCs)             | 24h (nightly) | 5–15 min (rancher-backup `Restore` CR)         |
| NFS data                                                 | 24h           | minutes per PVC, hours for full export         |
| etcd snapshots                                           | 6h            | 5–10 min (cluster-reset restore)               |
| RP/compute filesystem state (WireGuard, Nginx, RKE2 TLS) | 24h           | minutes per subsystem                          |

All of these are configurable via `backup-config.yaml` schedules. The defaults match a 6-month retention window and assume a 1 TB backup volume; smaller volumes work but shorten retention before pruning.
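
To make the RPO column concrete, the following sketch encodes the default targets from the table and checks whether a given "last successful backup" timestamp is still inside its target. The component labels (`postgres`, `k8s`, `nfs`, `etcd`, `fs`) and function names are illustrative assumptions, not keys from `backup-config.yaml`, and where the timestamp comes from is left to the operator.

```bash
#!/usr/bin/env bash
# Hypothetical helpers illustrating the default RPO targets in the table.

# Print the default RPO for a component, in seconds.
rpo_seconds() {
  case "$1" in
    postgres)   echo 60 ;;      # ~1 min with WAL streaming
    etcd)       echo 21600 ;;   # 6h
    k8s|nfs|fs) echo 86400 ;;   # 24h
    *) echo "unknown component: $1" >&2; return 1 ;;
  esac
}

# Exit 0 if the newest backup is no older than the component's RPO,
# 1 if it is older, 2 if the component label is unknown.
#   $1 = component label
#   $2 = epoch seconds of the last successful backup
#   $3 = "now" in epoch seconds (optional, defaults to the current time)
within_rpo() {
  local component="$1" last="$2" now="${3:-$(date +%s)}" limit
  limit=$(rpo_seconds "$component") || return 2
  [ $(( now - last )) -le "$limit" ]
}
```

For example, `within_rpo etcd "$last_snapshot_epoch"` succeeds only while the latest etcd snapshot is at most six hours old. If schedules are changed in `backup-config.yaml`, the values in `rpo_seconds` would need to change with them.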

## What this does not do

* **Multi-site / offsite replication.** v1 keeps one copy on one volume on the backup node. The 3-2-1 rule calls for 3 copies on 2 media with 1 offsite; v1 provides 1 copy on 1 medium with 0 offsite. Plan a second offsite target later via `restic copy` or pgBackRest's secondary repo support.
* **Mass alerting.** Status is exposed as `/var/lib/openg2p-backup/.status.json` on the backup host. A Phase 2 layer wires that into Prometheus/email/Slack (see [Alerting](/operations/deployment/infrastructure-setup/backups/alerting.md)).
* **Full disaster-recovery rehearsal.** Weekly drills do per-component verify + dry-run-restore. Cluster-wide rehearsals into a sandbox VPC are a manual, separately-scheduled operator activity.
* **Restoring to a different cluster topology.** Restore assumes you're rebuilding into the same production shape. Cross-version or cross-architecture restore is out of scope.
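
The offsite gap in the first bullet could later be closed for the restic repos with `restic copy`. The sketch below only assembles and prints the command so it can be reviewed before anything runs. The function name, repository locations, and password-file paths are placeholders, and the `--from-repo` / `--from-password-file` flags assume restic 0.14 or newer (older releases used `--repo2`).

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints a `restic copy` invocation for review.
# All four arguments are placeholders chosen by the operator.
#   $1 = local (source) repo         $2 = password file for the source repo
#   $3 = offsite (destination) repo  $4 = password file for the destination repo
# Note: values are printed as-is, so paths containing spaces would need
# quoting before the printed line is pasted into a shell.
offsite_copy_cmd() {
  local src_repo="$1" src_pw="$2" dst_repo="$3" dst_pw="$4"
  printf 'restic --repo %s --password-file %s copy --from-repo %s --from-password-file %s\n' \
    "$dst_repo" "$dst_pw" "$src_repo" "$src_pw"
}
```

Printing instead of executing keeps this in line with the page's stance that data-moving steps are deliberate operator actions. It does not cover pgBackRest, which would need its own secondary repository configuration.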

## Reference

* [pgBackRest user guide](https://pgbackrest.org/user-guide.html)
* [restic documentation](https://restic.readthedocs.io/)
* [RKE2 backup and restore](https://docs.rke2.io/backup_restore)
* [Rancher Backup Operator](https://ranchermanager.docs.rancher.com/integrations-in-rancher/backup-restore-and-disaster-recovery)
* [Kubernetes encryption at rest](https://kubernetes.io/docs/tasks/administer-cluster/encrypt-data/)

