Kubernetes in production - snapshotting cluster state

Are you one of the lucky ones whose Kubernetes workload is 100% stateless? What about the cluster’s own state, though? What if you accidentally delete the whole k8s cluster or a namespace? What if a new deployment breaks the system? What if a Kubernetes API upgrade fails and you need to roll back? What if an app bug wipes a persistent volume and you lose the data? What if you need to migrate the workload to a new cluster or replicate prod to dev?

Problem statement

Kubernetes stores its state in etcd. Managed Kubernetes offerings like Google Kubernetes Engine run etcd for you behind the scenes, so you can’t access the database directly.

Persistent volumes (and their APIs) are vendor-specific, and snapshotting and restoring them can be a pain.

Problem solution

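You can dump a Kubernetes API object with the --export flag, which strips cluster-specific information from the output (the flag was deprecated in kubectl 1.14 and removed in 1.18, so this only works with older clients):

$ kubectl get po/nginx --export -o yaml > nginx.yaml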

and “recreate” it

$ kubectl apply -f nginx.yaml

If that’s all you need, you are good to go.
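If you want a little more than a single object, the same idea stretches to a small script. The following is only a sketch: the directory layout, the list of resource kinds, the function names and the KUBECTL override are assumptions made for this example, and without --export the dumped YAML still contains cluster-specific fields.

```shell
# Sketch only: dump a few resource kinds from one namespace into a
# timestamped directory. Layout, kinds and names are assumptions.
KUBECTL="${KUBECTL:-kubectl}"   # set KUBECTL=echo for a dry run

# Build the target directory, e.g. backups/nginx-example/20170727T200524Z
dump_dir() {
  printf 'backups/%s/%s\n' "$1" "$2"
}

# Write one YAML file per resource kind and print the directory used.
dump_namespace() (
  namespace="$1"
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  dir="$(dump_dir "$namespace" "$stamp")"
  mkdir -p "$dir" || exit 1
  for kind in deployments services configmaps; do
    "$KUBECTL" get "$kind" --namespace "$namespace" -o yaml > "$dir/$kind.yaml" || exit 1
  done
  printf '%s\n' "$dir"
)
```

Calling dump_namespace nginx-example writes the files and prints the directory, which you can later feed to kubectl apply -f. It does nothing about retention, off-cluster storage or volume data.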

What I want is:

  • cron scheduled automatic backups/snapshots
  • event-based backups/snapshots
  • timestamps for the backups and configurable retention
  • persistent and cheap storage for the backups (integration with my cloud provider)
  • persistent volume snapshots
  • restores!

Luckily, there’s a tool that covers most of the above and more.

Introducing Ark

Heptio Ark gives you tools to back up and restore your Kubernetes cluster resources and persistent volumes. Ark lets you:

  • Take backups of your cluster and restore in case of loss.
  • Copy cluster resources across cloud providers. NOTE: Cloud volume migrations are not yet supported.
  • Replicate your production environment for development and testing environments.

Ark consists of:

  • A server that runs on your cluster
  • A command-line client that runs locally

I’m not going to go through installing the Ark client on your local machine; let’s skip that part and assume it’s already installed.

Backup

Create a backup for any object that matches the app=nginx label selector:

ark backup create nginx-backup --selector app=nginx

Simulate a disaster:

kubectl delete namespace nginx-example

Restore from backup

ark restore create --from-backup nginx-backup

After the restore finishes, the output of ark restore get looks like the following:

NAME                          BACKUP         STATUS      WARNINGS   ERRORS    CREATED                         SELECTOR
nginx-backup-20170727200524   nginx-backup   Completed   0          0         2017-07-27 20:05:24 +0000 UTC   <none>
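If you want to check the result from a script rather than by eye, you can pull the STATUS column out of that listing. This is a minimal sketch, assuming the output is a header line followed by one row per restore with the status in the third column, as in the sample above; restore_status is a made-up helper name and the column layout may differ between Ark versions.

```shell
# Sketch: print the STATUS column for one restore, reading the table text
# from stdin. Column positions are an assumption based on the sample output.
restore_status() {
  awk -v name="$1" 'NR > 1 && $1 == name { print $3 }'
}
```

Used as ark restore get | restore_status nginx-backup-20170727200524, it prints the status of that restore, or nothing if no row has that name.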

Set up Ark for your cloud provider

Store Ark’s backups in object storage such as Google Cloud Storage, and back up your vendor’s persistent disks, such as GCE Persistent Disks. Ark also lets you use a locally deployed Minio as the storage backend.

Disaster recovery

Using Schedules and Restore-Only Mode

If you periodically back up your cluster’s resources, you are able to return to a previous state in case of some unexpected mishap, such as a service outage. Doing so with Heptio Ark looks like the following:

After you first run the Ark server on your cluster, set up a daily backup (replacing <SCHEDULE NAME> in the command as desired):

ark schedule create <SCHEDULE NAME> --schedule "0 7 * * *"

Each scheduled run creates a Backup object with the name <SCHEDULE NAME>-<TIMESTAMP>.
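The --schedule value is a standard five-field cron expression. As a quick reminder of what each position means, here is a small sketch that labels the fields; describe_cron is a hypothetical helper written for this post, not part of Ark.

```shell
# Sketch: label the five fields of a standard cron expression.
describe_cron() (
  read -r minute hour dom month dow <<EOF
$1
EOF
  printf 'minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n' \
    "$minute" "$hour" "$dom" "$month" "$dow"
)

describe_cron "0 7 * * *"
# prints: minute=0 hour=7 day-of-month=* month=* day-of-week=*
```

So "0 7 * * *" fires once a day at 07:00, while something like "0 */6 * * *" would fire every six hours.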

A disaster happens and you need to recreate your resources.

Update the Ark server Config, setting restoreOnlyMode to true. This prevents Backup objects from being created or deleted during your Restore process.

Create a restore with your most recent Ark Backup:

ark restore create --from-backup <SCHEDULE NAME>-<TIMESTAMP>
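Finding the most recent backup by hand gets tedious once a schedule has been running for a while. The sketch below picks the newest name from a list, assuming backup names follow the <SCHEDULE NAME>-<TIMESTAMP> pattern with a 14-digit timestamp (year first), so that sorting the names alphabetically also sorts them by time. Both that format and the latest_backup helper are assumptions made for this example.

```shell
# Sketch: pick the newest backup of a schedule from backup names on stdin.
# Assumes names look like <schedule>-<14-digit timestamp>, so that lexical
# order equals chronological order.
latest_backup() {
  grep -E "^$1-[0-9]{14}\$" | sort | tail -n 1
}
```

For a schedule called daily, you could feed it the first column of ark backup get, for example ark backup get | awk 'NR > 1 { print $1 }' | latest_backup daily, and pass the result to ark restore create --from-backup.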

Cluster migration

Using Backups and Restores

Heptio Ark can help you port your resources from one cluster to another, as long as you point each Ark Config to the same cloud object storage. In this scenario, we are also assuming that your clusters are hosted by the same cloud provider. Note that Heptio Ark does not support the migration of persistent volumes across cloud providers.

(Cluster 1) Assuming you haven’t already been checkpointing your data with the Ark schedule operation, you need to first back up your entire cluster (replacing <BACKUP-NAME> as desired):

ark backup create <BACKUP-NAME>

The default TTL is 30 days (720 hours); you can use the --ttl flag to change this as necessary.

(Cluster 2) Make sure that the persistentVolumeProvider and backupStorageProvider fields in the Ark Config match the ones from Cluster 1, so that your new Ark server instance is pointing to the same bucket.

(Cluster 2) Make sure that the Ark Backup object has been created. Ark resources are synced with the backup files available in cloud storage.

(Cluster 2) Once you have confirmed that the right Backup (<BACKUP-NAME>) is now present, you can restore everything with:

ark restore create --from-backup <BACKUP-NAME>
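A note on the TTL from the first step: the default of 30 days is written as 720 hours, so if you think in days you need to convert. A tiny sketch, assuming --ttl accepts an hour-based duration string such as 720h; ttl_for_days is a made-up helper name.

```shell
# Sketch: convert a retention period in days into an hour-based duration
# string for --ttl. The helper name is made up for this example.
ttl_for_days() {
  printf '%dh\n' "$(( $1 * 24 ))"
}

ttl_for_days 30
# prints: 720h
```

A one-week retention would then be ark backup create <BACKUP-NAME> --ttl "$(ttl_for_days 7)".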