Production Deployment Guide
This guide offers practical hints for production deployment; it is not intended to be a comprehensive production deployment handbook. Production deployments vary, and implementers of OpenG2P (such as System Integrators) can choose among production configurations, orchestration platforms and components. We encourage our partners to update this guide based on what they learn in the field.
RBAC
Carefully assign roles to Rancher users. Pre-defined role templates are available on Rancher. Follow this guide. Specifically, protect the following actions on resources:
Deletion of deployments/statefulsets
Viewing of secrets - at all levels - Cluster, Namespace
Deletion of configmaps, secrets
Access to DB via port forwarding
Logging into DB pods
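One way to confirm that a role assignment really blocks the actions listed above is to ask the Kubernetes API server on the user's behalf with kubectl auth can-i. The following is a minimal sketch, not a complete audit: the function name is hypothetical, the caller needs impersonation rights for --as to work, and the check covers a single namespace per call.

```shell
# Prints, for one user and one namespace, whether each sensitive action
# from the list above is allowed ("yes") or denied ("no").
# Usage: check_sensitive_access <user> <namespace>
check_sensitive_access() {
  user="$1"
  namespace="$2"
  for action in \
    "delete deployments" \
    "delete statefulsets" \
    "get secrets" \
    "delete secrets" \
    "delete configmaps" \
    "create pods --subresource=portforward" \
    "create pods --subresource=exec"
  do
    # $action is intentionally unquoted so that it splits into verb,
    # resource and optional flag.
    answer=$(kubectl auth can-i $action --as "$user" -n "$namespace" 2>/dev/null)
    echo "$action: ${answer:-unknown}"
  done
}
```

For example, check_sensitive_access alice openg2p (both names are placeholders) should print "no" on every line for a user who is not a system administrator.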
PostgreSQL
Number of instances of PostgreSQL pods
Cloud native if available
Production configuration
Master / Slave configuration
High availability of services
Pod replication
Replication of pods for high-availability.
Node replication
Provisioning of VMs across different underlying hardware and subnets for resilience.
Minimum 3 nodes for Rancher and OpenG2P cluster (3 control planes).
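The node count can be checked from the command line. The sketch below assumes kubectl is already configured for the cluster being checked; the function names are hypothetical, and the threshold of 3 comes from the recommendation above.

```shell
# Prints the number of nodes whose STATUS column is exactly "Ready".
# Nodes shown as "Ready,SchedulingDisabled" or "NotReady" are not counted.
count_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Fails with a warning if fewer than 3 nodes are Ready.
check_min_nodes() {
  ready=$(count_ready_nodes)
  if [ "$ready" -lt 3 ]; then
    echo "WARNING: only $ready Ready node(s); at least 3 are recommended"
    return 1
  fi
  echo "OK: $ready Ready node(s)"
}
```

Run check_min_nodes once against the Rancher cluster and once against the OpenG2P cluster, since each has its own minimum.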
Backup and Restore
ETCD
Back up the etcd data of all clusters to ensure recovery in case of a complete failure. The following describes how to create and restore backups of the etcd data in an RKE2 cluster.
Note: /var/lib/rancher/rke2 is the default data directory for RKE2. In RKE2, snapshots are stored on each etcd node. If you have multiple etcd or etcd + control-plane nodes, you will have multiple copies of local etcd snapshots.
You can take a snapshot manually while RKE2 is running with the etcd-snapshot subcommand. For example:
rke2 etcd-snapshot save --name pre-upgrade-snapshot
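Because snapshots live on the etcd nodes themselves, a complete node failure can take the snapshots with it. The sketch below takes a snapshot and then copies the local snapshot files elsewhere. It assumes the default snapshot directory under the default data directory (no custom snapshot directory configured); the function name and the destination are hypothetical.

```shell
# Default RKE2 snapshot location; an assumption that holds only when the
# default data directory and snapshot directory are in use.
SNAPSHOT_DIR="${SNAPSHOT_DIR:-/var/lib/rancher/rke2/server/db/snapshots}"

# Takes a named snapshot, then copies all local snapshots to another location
# (for example a mounted backup volume). Run as root on a server node.
# Usage: snapshot_and_copy <snapshot-name> <destination-dir>
snapshot_and_copy() {
  name="$1"
  dest="$2"
  rke2 etcd-snapshot save --name "$name" || return 1
  mkdir -p "$dest" || return 1
  # "dir/." copies the directory contents rather than the directory itself.
  cp -a "$SNAPSHOT_DIR"/. "$dest"/
}
```

A call such as snapshot_and_copy pre-upgrade-snapshot /mnt/backup/etcd (the destination is a placeholder) leaves a second copy outside the node's data directory.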
Restoring a snapshot to existing nodes
When RKE2 is restored from backup, the old data directory is moved to /var/lib/rancher/rke2/server/db/etcd-old-%date%/. RKE2 then attempts to restore the snapshot by creating a new data directory and starting etcd as a new RKE2 cluster with one etcd member.
You must stop the RKE2 service on all server nodes if it is enabled via systemd. Use the following command to do so:
systemctl stop rke2-server
Initiate the restore from the snapshot on the first server node with the following commands:
rke2 server \
  --cluster-reset \
  --cluster-reset-restore-path=<PATH-TO-SNAPSHOT>
Once the restore process is complete, start the rke2-server service on the first server node as follows:
systemctl start rke2-server
Remove the rke2 db directory on the other server nodes as follows:
rm -rf /var/lib/rancher/rke2/server/db
Start the rke2-server service on other server nodes with the following command:
systemctl start rke2-server
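The steps above can be collected into two small helpers, one for the first server node and one for every other server node. This is a sketch under the assumption that the default data directory is in use; the function names and the RKE2_DB_DIR variable are hypothetical, and the commands themselves are the ones listed above.

```shell
# Location of the rke2 db directory (default data directory assumed).
RKE2_DB_DIR="${RKE2_DB_DIR:-/var/lib/rancher/rke2/server/db}"

# Run on the FIRST server node, after rke2-server has been stopped on ALL
# server nodes. Usage: restore_first_server <PATH-TO-SNAPSHOT>
restore_first_server() {
  snapshot="$1"
  if [ -z "$snapshot" ]; then
    echo "usage: restore_first_server <PATH-TO-SNAPSHOT>" >&2
    return 2
  fi
  systemctl stop rke2-server || return 1
  rke2 server --cluster-reset --cluster-reset-restore-path="$snapshot" || return 1
  systemctl start rke2-server
}

# Run on EACH OTHER server node once the first server node is running again.
rejoin_other_server() {
  systemctl stop rke2-server || return 1
  rm -rf "$RKE2_DB_DIR" || return 1
  systemctl start rke2-server
}
```

Keeping the two roles in separate functions makes it harder to delete the db directory on the node that holds the restored data.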
Restoring a snapshot to new nodes
Note: For all versions of rke2 v1.20.9 and prior, you will need to back up and restore certificates first due to a known issue in which bootstrap data might not be saved on restore (Steps 1 - 3 below assume this scenario). See the note below for an additional version-specific restore caveat.
Back up the following: /var/lib/rancher/rke2/server/cred, /var/lib/rancher/rke2/server/tls, /var/lib/rancher/rke2/server/token, /etc/rancher
Restore the certs in Step 1 above to the first new server node.
Install rke2 v1.20.8+rke2r1 on the first new server node as in the following example:
curl -sfL https://get.rke2.io | INSTALL_RKE2_VERSION="v1.20.8+rke2r1" sh -
Stop the RKE2 service on all server nodes if it is enabled, and initiate the restore from the snapshot on the first server node with the following commands:
systemctl stop rke2-server
rke2 server \
  --cluster-reset \
  --cluster-reset-restore-path=<PATH-TO-SNAPSHOT>
Once the restore process is complete, start the rke2-server service on the first server node as follows:
systemctl start rke2-server
You can then continue to add new server and worker nodes to the cluster.
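Steps 1 and 2 above (backing up the certificates and restoring them onto the first new server node) can be sketched with tar. The function names and the archive path are hypothetical; the optional second argument exists only so the sketch can be tried against a scratch directory and should be left at its default of / in real use.

```shell
# Step 1: archives the credential, TLS, token and config paths.
# Usage: backup_rke2_certs <archive.tgz> [root-dir]
backup_rke2_certs() {
  archive="$1"
  root="${2:-/}"
  # Paths are stored relative to the root so they can be restored anywhere.
  tar -czf "$archive" -C "$root" \
    var/lib/rancher/rke2/server/cred \
    var/lib/rancher/rke2/server/tls \
    var/lib/rancher/rke2/server/token \
    etc/rancher
}

# Step 2: unpacks the archive onto the first new server node.
# Usage: restore_rke2_certs <archive.tgz> [root-dir]
restore_rke2_certs() {
  archive="$1"
  root="${2:-/}"
  tar -xzf "$archive" -C "$root"
}
```

The archive contains cluster credentials, so it should be stored and transferred with the same care as the snapshots themselves.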
Backup of Persistent Volume information
The mapping between PVCs and PVs must be saved after installation so that, if the cluster goes down or NFS has issues, the pods can be recreated with their original data. Download the YAML and keep it securely accessible to system administrators.
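The same information can also be exported with kubectl. The sketch below assumes kubectl access to the cluster; the function name and the output file names are hypothetical.

```shell
# Saves all PersistentVolume and PersistentVolumeClaim definitions, plus a
# compact table showing which PVC each PV is bound to.
# Usage: backup_pv_mapping <output-dir>
backup_pv_mapping() {
  outdir="$1"
  mkdir -p "$outdir" || return 1
  kubectl get pv -o yaml > "$outdir/pv.yaml" || return 1
  kubectl get pvc --all-namespaces -o yaml > "$outdir/pvc.yaml" || return 1
  kubectl get pv \
    -o custom-columns=PV:.metadata.name,NAMESPACE:.spec.claimRef.namespace,PVC:.spec.claimRef.name \
    > "$outdir/pv-pvc-mapping.txt"
}
```

Re-running the export after any change to storage keeps the saved mapping in step with the cluster.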
NFS
Cluster access key
Download each user's cluster access key so that the OpenG2P cluster can be operated directly using kubectl in case Rancher is not accessible. System administrators may download this key from the Rancher console and must keep it safe and protected.
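After downloading, it is worth storing the key with restrictive permissions and confirming that kubectl can actually use it. A minimal sketch follows; the function name and both file paths are hypothetical.

```shell
# Stores a downloaded access key with owner-only permissions and checks
# that kubectl can reach the cluster with it.
# Usage: install_kubeconfig <downloaded-file> <target-file>
install_kubeconfig() {
  downloaded="$1"
  target="$2"
  mkdir -p "$(dirname "$target")" || return 1
  cp "$downloaded" "$target" || return 1
  chmod 600 "$target" || return 1
  kubectl --kubeconfig "$target" get nodes
}
```

Running this check while Rancher is still healthy, and again during a planned Rancher outage, shows whether the key really provides the fallback access it is kept for.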
Image pull policy
Generally, Helm charts set the Docker image pull policy to Always. This is not advisable in production, because the running image will change whenever the Docker image behind the same tag changes. Set imagePullPolicy: IfNotPresent or imagePullPolicy: Never in the Helm chart and upgrade the Helm release on production.
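To find workloads that still use Always after the upgrade, the pull policy of every Deployment and StatefulSet in a namespace can be listed with kubectl. This is a sketch with hypothetical function names; the Helm values key shown in the comment is an assumption and differs from chart to chart.

```shell
# Lists kind, name and the pull policy of each container for all
# Deployments and StatefulSets in a namespace.
# Usage: list_pull_policies <namespace>
list_pull_policies() {
  kubectl get deployments,statefulsets -n "$1" \
    -o custom-columns='KIND:.kind,NAME:.metadata.name,PULL_POLICY:.spec.template.spec.containers[*].imagePullPolicy'
}

# Succeeds only if no listed workload still uses "Always".
no_always_pull_policy() {
  ! list_pull_policies "$1" | grep -qw "Always"
}

# A chart that exposes the policy as a value might be upgraded like this
# (release, chart and the image.pullPolicy key are placeholders):
#   helm upgrade <release> <chart> -n <namespace> --reuse-values \
#     --set image.pullPolicy=IfNotPresent
```

no_always_pull_policy returns a non-zero status when any workload needs attention, so it can be used as a gate in a rollout checklist.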
Data cleanup
Make sure any test or stray data in Postgres, OpenSearch or any other persistent store is cleaned up completely before rollout. For a fresh install from scratch, make sure the PVCs and PVs from previous versions are deleted.
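Leftover volumes are easy to miss. The sketch below lists PVs that are not bound to any claim, which is one typical sign of a previous installation; the function name is hypothetical, and the output should be reviewed by hand before anything is deleted.

```shell
# Prints the names of PVs whose phase is anything other than "Bound"
# (for example "Released" or "Available").
list_unbound_pvs() {
  kubectl get pv --no-headers \
    -o custom-columns=NAME:.metadata.name,STATUS:.status.phase \
    | awk '$2 != "Bound" { print $1 }'
}
```

Each reviewed entry can then be removed with kubectl delete pv followed by its name. PVs that are still bound to PVCs from an older release will not appear here and need to be checked separately.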
Security
Creation of private access channels.
Nginx
You may need to run the Nginx load balancers in HA mode by setting up an Nginx cluster (available with Nginx Plus, which comes with commercial terms). HA for Nginx is critical if user-facing portal traffic lands on it. For back-office administration tasks, HA may not be critical.
You must adjust the maximum request body size according to the volume of files/data being uploaded. The general limit is set at 50 MiB per request. This can be updated by modifying the client_max_body_size parameter in nginx.conf.
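After editing the parameter, the configuration should be validated before Nginx is reloaded. The sketch below assumes the configuration lives under /etc/nginx and that the nginx binary is on the PATH; the function names are hypothetical.

```shell
# Shows every place where client_max_body_size is currently set,
# for example a line such as "client_max_body_size 50m;".
# Usage: show_body_size_limits [config-dir]
show_body_size_limits() {
  grep -rn "client_max_body_size" "${1:-/etc/nginx}"
}

# Validates the configuration and reloads Nginx only if validation passes.
reload_nginx() {
  nginx -t && nginx -s reload
}
```

Checking all occurrences first matters because a value set in a server or location block overrides the one set at the http level.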
OpenSearch
Enable data nodes in OpenSearch so that backups of the data nodes can be taken.
The data nodes may be enabled while installing OpenSearch. (TBD).
CEPH Storage
If the Kubernetes clusters are also used for other critical applications with large volumes of critical data, CEPH storage may be considered. CEPH is a distributed data storage system that provides high performance, reliability and scalability. The storage system is installed on a separate cluster, and Kubernetes communicates with it via the available CSI drivers. CEPH automatically replicates data across multiple nodes, ensuring data redundancy and protection against node failures. However, CEPH is considerably more complex to set up and manage than, say, NFS, and it has a steep learning curve. It also requires substantial resources (CPU, memory, network).