Production Deployment Guide

This guide offers practical hints for deploying OpenG2P in production. It is not intended to be a comprehensive production deployment handbook: production deployments vary, and implementers of OpenG2P (such as System Integrators) can choose among production configurations, orchestration platforms and components. We encourage our partners to update this guide based on what they learn in the field.

RBAC

Carefully assign roles to Rancher users. Predefined role templates are available in Rancher; refer to the Rancher documentation for details. Specifically, restrict the following actions on resources:

  • Deletion of deployments/statefulsets

  • Viewing of secrets - at all levels - Cluster, Namespace

  • Deletion of configmaps, secrets

  • Access to DB via port forwarding

  • Logging into DB pods
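
One way to verify that an assigned role behaves as intended is to impersonate the user with `kubectl auth can-i`. The following is a minimal sketch, not part of the OpenG2P tooling: the function name `check_rbac` is hypothetical, and impersonation through `--as` assumes the caller has impersonation rights (typically a cluster administrator). For a user who must not perform these actions, every line should end in `no`.

```shell
# check_rbac USER NAMESPACE
# Prints, for each sensitive action listed above, whether USER may perform it
# in NAMESPACE. Port forwarding and pod login map to the "portforward" and
# "exec" subresources of pods.
check_rbac() (
  user="$1"
  ns="$2"
  # One "verb resource [subresource]" entry per line.
  while read -r verb resource sub; do
    [ -n "$verb" ] || continue
    set -- "$verb" "$resource"
    if [ -n "$sub" ]; then
      set -- "$@" --subresource="$sub"
    fi
    # "can-i" exits non-zero when the answer is "no", so ignore the status.
    answer=$(kubectl auth can-i "$@" --as "$user" -n "$ns" 2>/dev/null || true)
    printf '%-8s %-22s %s\n' "$verb" "$resource${sub:+/$sub}" "${answer:-unknown}"
  done <<'EOF'
delete deployments
delete statefulsets
get secrets
delete secrets
delete configmaps
create pods portforward
create pods exec
EOF
)
```

Example call, with a hypothetical user and namespace: `check_rbac alice openg2p`.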

PostgreSQL

  • Number of instances of PostgreSQL pods

  • Cloud native if available

  • Production configuration

  • Master / Slave configuration

High availability of services

Pod replication

  • Replication of pods for high-availability.
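
As an illustration of pod replication, the sketch below scales a workload and waits for the rollout to finish. The function name `scale_workload` and the example arguments are hypothetical. A change made with `kubectl scale` is typically overwritten by the next Helm upgrade, so the durable replica count belongs in the chart values.

```shell
# scale_workload NAMESPACE KIND/NAME REPLICAS
# Sets the replica count of a Deployment or StatefulSet and waits until the
# rollout has completed (or the timeout expires).
scale_workload() (
  ns="$1"
  target="$2"
  replicas="$3"
  kubectl -n "$ns" scale "$target" --replicas="$replicas" &&
    kubectl -n "$ns" rollout status "$target" --timeout=300s
)
```

Example call, with hypothetical names: `scale_workload openg2p deployment/example 3`.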

Node replication

  • Provisioning of VMs across different underlying hardware and subnets for resilience.

  • Minimum of 3 nodes each for the Rancher cluster and the OpenG2P cluster (3 control-plane nodes).
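
A quick way to confirm the control-plane count is sketched below. It assumes the nodes carry the `node-role.kubernetes.io/control-plane` label; the function name `count_ready_control_planes` is hypothetical.

```shell
# count_ready_control_planes
# Prints the number of control-plane nodes whose status is exactly "Ready".
# Nodes shown as "NotReady" or "Ready,SchedulingDisabled" are not counted.
count_ready_control_planes() {
  kubectl get nodes -l node-role.kubernetes.io/control-plane --no-headers |
    awk '$2 == "Ready" { n++ } END { print n + 0 }'
}
```

A result below 3 on either cluster means the minimum above is not met.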

Backup and Restore

ETCD

Back up the etcd data of all clusters to ensure recovery in case of a complete failure. This section describes how to create and restore backups of the etcd data in an RKE2 cluster. Note: /var/lib/rancher/rke2 is the default data directory for RKE2. In RKE2, snapshots are stored on each etcd node. If you have multiple etcd or etcd + control-plane nodes, you will have multiple copies of local etcd snapshots.

You can take a snapshot manually while RKE2 is running with the etcd-snapshot subcommand. For example: rke2 etcd-snapshot save --name pre-upgrade-snapshot.
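
Because snapshots live on the etcd nodes themselves, it can help to copy each one to a separate location. The sketch below takes a snapshot and copies it out. It rests on two assumptions that should be checked against your RKE2 version: that snapshots are written to /var/lib/rancher/rke2/server/db/snapshots, and that the snapshot file name starts with the value passed to `--name`. The function name `save_and_copy_snapshot` is hypothetical.

```shell
# save_and_copy_snapshot NAME DEST_DIR [SNAPSHOT_DIR]
# Takes an on-demand etcd snapshot, then copies the matching files to DEST_DIR.
# SNAPSHOT_DIR defaults to the assumed RKE2 snapshot directory.
save_and_copy_snapshot() (
  name="$1"
  dest="$2"
  src="${3:-/var/lib/rancher/rke2/server/db/snapshots}"
  rke2 etcd-snapshot save --name "$name" || exit 1
  mkdir -p "$dest" || exit 1
  # Assumption: snapshot file names begin with NAME.
  cp -p "$src/$name"* "$dest"/
)
```

Example call on a server node, with a hypothetical destination: `save_and_copy_snapshot pre-upgrade-snapshot /mnt/backup/etcd`.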

Restoring a snapshot to existing nodes

When RKE2 is restored from backup, the old data directory is moved to /var/lib/rancher/rke2/server/db/etcd-old-%date%/. RKE2 then attempts to restore the snapshot by creating a new data directory and starting etcd as a new RKE2 cluster with one etcd member.

  1. You must stop RKE2 service on all server nodes if it is enabled via systemd. Use the following command to do so: systemctl stop rke2-server

  2. Initiate the restore from the snapshot on the first server node with the following command: rke2 server --cluster-reset --cluster-reset-restore-path=<PATH-TO-SNAPSHOT>

  3. Once the restore process is complete, start the rke2-server service on the first server node as follows: systemctl start rke2-server

  4. Remove the rke2 db directory on the other server nodes as follows: rm -rf /var/lib/rancher/rke2/server/db

  5. Start the rke2-server service on other server nodes with the following command: systemctl start rke2-server
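
The steps above can be collected into two helper functions, one for the first server node and one for every other server node. This is a sketch only: the function names are hypothetical, the commands must be run as root, and the order across nodes (stop everywhere, restore on the first node, then rejoin the others) still has to be coordinated by the operator.

```shell
# restore_first_server PATH-TO-SNAPSHOT
# Run on the FIRST server node, after rke2-server has been stopped on ALL
# server nodes (step 1). Covers steps 2 and 3.
restore_first_server() {
  snapshot="$1"
  [ -n "$snapshot" ] || {
    echo "usage: restore_first_server <PATH-TO-SNAPSHOT>" >&2
    return 1
  }
  systemctl stop rke2-server &&
    rke2 server --cluster-reset --cluster-reset-restore-path="$snapshot" &&
    systemctl start rke2-server
}

# rejoin_other_server
# Run on EACH OTHER server node once the first node is up again.
# Covers steps 4 and 5. This deletes the local etcd data on that node.
rejoin_other_server() {
  systemctl stop rke2-server &&
    rm -rf /var/lib/rancher/rke2/server/db &&
    systemctl start rke2-server
}
```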

Restoring a snapshot to new nodes

Note: For all versions of RKE2 v1.20.9 and prior, you will need to back up and restore certificates first, due to a known issue in which bootstrap data might not be saved on restore (Steps 1 - 3 below assume this scenario). See the note below for an additional version-specific restore caveat.

  1. Back up the following: /var/lib/rancher/rke2/server/cred, /var/lib/rancher/rke2/server/tls, /var/lib/rancher/rke2/server/token, /etc/rancher

  2. Restore the certs in Step 1 above to the first new server node.

  3. Install rke2 v1.20.8+rke2r1 on the first new server node as in the following example: curl -sfL https://get.rke2.io | INSTALL_RKE2_VERSION="v1.20.8+rke2r1" sh -

  4. Stop the RKE2 service on all server nodes if it is enabled, then initiate the restore from the snapshot on the first server node. Run systemctl stop rke2-server, followed by: rke2 server --cluster-reset --cluster-reset-restore-path=<PATH-TO-SNAPSHOT>

  5. Once the restore process is complete, start the rke2-server service on the first server node as follows: systemctl start rke2-server

  6. You can then continue to add new server and worker nodes to the cluster.
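
Steps 1 and 2 above amount to archiving four paths on an old server node and unpacking them on the first new one. A minimal sketch using `tar` follows. The function names and the optional ROOT argument are hypothetical; ROOT exists only so that the functions can be tried against a copy of the directory tree rather than the live filesystem.

```shell
# backup_bootstrap_data ARCHIVE [ROOT]
# Archives the credential, TLS, token and config paths listed in step 1.
# Paths are stored relative to ROOT (default "/").
backup_bootstrap_data() (
  archive="$1"
  root="${2:-/}"
  tar -czf "$archive" -C "$root" \
    var/lib/rancher/rke2/server/cred \
    var/lib/rancher/rke2/server/tls \
    var/lib/rancher/rke2/server/token \
    etc/rancher
)

# restore_bootstrap_data ARCHIVE [ROOT]
# Unpacks the archive under ROOT (default "/") on the first new server node.
restore_bootstrap_data() (
  archive="$1"
  root="${2:-/}"
  mkdir -p "$root" && tar -xzf "$archive" -C "$root"
)
```

The archive contains cluster credentials, so it should be transferred and stored with the same care as the cluster access key.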

Backup of Persistent Volume information

The mapping between PVCs and PVs must be saved after installation so that, if the cluster goes down or NFS has issues, the pods can be recreated with their original data. Download the PV and PVC definitions as YAML and keep them securely accessible to system administrators.
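
If the YAML is exported from the command line rather than from a console, a sketch such as the following could be used. The function name `export_volume_mappings` and the output file names are hypothetical.

```shell
# export_volume_mappings OUT_DIR
# Writes all PersistentVolumes and all PersistentVolumeClaims (every namespace)
# to OUT_DIR as YAML, readable only by the owner.
export_volume_mappings() (
  out="$1"
  mkdir -p "$out" || exit 1
  kubectl get pv -o yaml > "$out/pv.yaml" &&
    kubectl get pvc --all-namespaces -o yaml > "$out/pvc.yaml" &&
    chmod 600 "$out/pv.yaml" "$out/pvc.yaml"
)
```

Each PV records its claim under `spec.claimRef`, which is the mapping needed to rebind volumes later.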

NFS

Cluster access key

Download the user's cluster access key (kubeconfig) so that the OpenG2P cluster can be operated directly with kubectl if Rancher is not accessible. System administrators can download this key from the Rancher console and must store it safely and keep it protected.
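
After downloading, it is worth confirming that the key works without Rancher and that only its owner can read it. A minimal sketch, with a hypothetical function name:

```shell
# use_kubeconfig FILE
# Restricts the downloaded kubeconfig to its owner, then verifies direct
# cluster access by listing the nodes.
use_kubeconfig() {
  chmod 600 "$1" || return 1
  kubectl --kubeconfig "$1" get nodes
}
```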

Image pull policy

Generally, Helm charts set the Docker image pull policy to Always. This is not advisable in production, because a restarted pod will pick up a different image if the image behind the same tag has changed. Set imagePullPolicy: IfNotPresent or imagePullPolicy: Never in the Helm chart and upgrade the Helm chart on production. Note that Never requires the image to be present on the node already.
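
To find workloads that still pull on every start, the running pods can be inspected. The following sketch lists the images whose containers use the Always policy in a namespace; the function name `list_always_pull` is hypothetical.

```shell
# list_always_pull NAMESPACE
# Prints each distinct container image in NAMESPACE that runs with
# imagePullPolicy: Always. Init containers are not inspected.
list_always_pull() {
  kubectl -n "$1" get pods \
    -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\t"}{.imagePullPolicy}{"\n"}{end}{end}' |
    awk -F '\t' '$2 == "Always" { print $1 }' | sort -u
}
```

How the policy is overridden depends on the chart. Many charts expose a value such as `image.pullPolicy`, but that key name is an assumption and should be checked in the chart's values file before upgrading.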

Data cleanup

Make sure any test or stray data in Postgres, OpenSearch or any other persistent store is cleaned up completely before rollout. For a fresh install from scratch, make sure PVCs and PVs from previous versions are deleted.
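
Leftover volumes from an earlier install usually show up as PVs that are no longer bound. A sketch for listing them, with a hypothetical function name:

```shell
# list_leftover_volumes
# Prints name, phase and former claim of every PV that is not Bound
# (for example Released or Available), as candidates for cleanup.
list_leftover_volumes() {
  kubectl get pv --no-headers \
    -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,CLAIM:.spec.claimRef.name |
    awk '$2 != "Bound" { print $1, $2, $3 }'
}
```

Review the list before deleting anything, since a Released volume may still hold data that someone needs.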

Security

Nginx

You may need to run the Nginx load balancers in HA mode by setting up an Nginx cluster (available with Nginx Plus, which comes with commercial terms). HA for Nginx is critical if user-facing portal traffic lands on it. For back-office administration tasks, HA may not be critical.

You must adjust the maximum request body size according to the number and size of the files being uploaded. The general limit is set at 50MiB per request. This can be updated by modifying the client_max_body_size parameter in nginx.conf.
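
The change can be scripted as sketched below. The function name `set_max_body_size` is hypothetical, and the sketch only rewrites directives that already exist in the given file; it does not add one.

```shell
# set_max_body_size FILE SIZE
# Rewrites every existing client_max_body_size directive in FILE to SIZE
# (for example 100m). Fails if FILE contains no such directive.
set_max_body_size() (
  file="$1"
  size="$2"
  grep -q 'client_max_body_size' "$file" || {
    echo "no client_max_body_size directive in $file" >&2
    exit 1
  }
  tmp=$(mktemp) || exit 1
  status=0
  # Write back with cat so the original file keeps its owner and mode.
  sed "s/\(client_max_body_size[[:space:]]*\)[^;]*;/\1$size;/" "$file" > "$tmp" &&
    cat "$tmp" > "$file" || status=1
  rm -f "$tmp"
  exit "$status"
)
```

After editing, validate and reload the configuration, for example with `nginx -t && nginx -s reload`.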

OpenSearch

  • Enable data nodes in OpenSearch so that backups can be taken of the data node.

  • The data node may be enabled while installing OpenSearch. (TBD).

CEPH Storage

If the Kubernetes clusters are also used for other critical applications with large volumes of critical data, CEPH storage may be considered. CEPH is a highly scalable, distributed data storage system that provides high performance and reliability. The storage system is installed on a separate cluster, and Kubernetes communicates with it via the available CSI drivers. CEPH automatically replicates data across multiple nodes, ensuring data redundancy and protection against node failures. However, CEPH is considerably more complex to set up and manage than, say, NFS, and it has a steep learning curve. It also requires substantial resources (CPU, memory, network).
