Restoring service after power outage

What happens to an RKE cluster after a power outage?

RKE/Kubernetes generally recovers from a full cluster shutdown with little intervention, but powering the nodes back on in a specific order minimizes errors. Etcd is the primary concern because it holds the cluster state, while the rest of the services are stateless. Etcd uses a write-ahead log (WAL) to record updates before applying them. If a member crashes and restarts between snapshots, it can locally recover the transactions made since the last snapshot by replaying the contents of the WAL. NOTE: Etcd uses fdatasync to flush writes from cache to disk.

Reproducing in a lab

  • Prerequisites
  • Edit the cluster.yml to include your node IPs and S3 settings
    vi ./cluster.yml
    
  • Stand up the cluster
    bash ./build.sh
    
  • Verify the cluster is up and healthy
    bash ./verify.sh
    
  • Break the cluster
    bash ./break.sh
    

Restoring/Recovering

  • Power on any storage devices if applicable. Check with your storage vendor to properly power on your storage devices and verify that they are ready.

  • For each etcd node:

    • Power on the system/start the instance.
    • Log in to the system via SSH.
    • Ensure Docker has started (use whichever command your init system supports):
      sudo service docker status
      sudo systemctl status docker
    • Ensure the etcd and kubelet containers show a status of Up in Docker:
      sudo docker ps
  • For each control plane node:

    • Power on the system/start the instance.
    • Log in to the system via SSH.
    • Ensure Docker has started (use whichever command your init system supports):
      sudo service docker status
      sudo systemctl status docker
    • Ensure the kube-apiserver, kube-scheduler, kube-controller-manager, and kubelet containers show a status of Up in Docker:
      sudo docker ps
  • For each worker node:

    • Power on the system/start the instance.
    • Log in to the system via SSH.
    • Ensure Docker has started (use whichever command your init system supports):
      sudo service docker status
      sudo systemctl status docker
    • Ensure the kubelet container shows a status of Up in Docker:
      sudo docker ps
  • Once all nodes are up, log in to the Rancher UI (or use kubectl) and check your projects to ensure workloads have started as expected. This may take a few minutes, depending on the number of workloads and your server capacity.
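
The per-node container checks above can be scripted rather than read off the `docker ps` output by eye. The following is a minimal sketch, not part of this repository: the function names (`missing_containers`, `expected_for_role`, `check_node`) and the role labels (`etcd`, `controlplane`, `worker`) are hypothetical, and it assumes the containers are named exactly as in the steps above.

```shell
#!/bin/sh
# Sketch only: helper names and role labels are illustrative assumptions.

# Read "<name> <status...>" lines on stdin and print every expected
# container name (passed as arguments) that has no line starting "<name> Up".
missing_containers() (
  listing=$(cat)
  for name in "$@"; do
    printf '%s\n' "$listing" | grep -Eq "^${name} Up( |\$)" || printf '%s\n' "$name"
  done
)

# Map a node role to the containers the recovery steps ask you to verify.
expected_for_role() {
  case "$1" in
    etcd)         echo "etcd kubelet" ;;
    controlplane) echo "kube-apiserver kube-scheduler kube-controller-manager kubelet" ;;
    worker)       echo "kubelet" ;;
    *)            return 1 ;;
  esac
}

# Check the local node for the given role using the Docker CLI.
check_node() (
  role=$1
  expected=$(expected_for_role "$role") || { echo "unknown role: $role" >&2; exit 2; }
  # Word splitting of $expected is intentional: one argument per container name.
  missing=$(sudo docker ps --format '{{.Names}} {{.Status}}' | missing_containers $expected)
  if [ -n "$missing" ]; then
    echo "not Up on this node:" $missing
    exit 1
  fi
  echo "all expected containers are Up for role: $role"
)
```

After sourcing the file on a node, run `check_node etcd`, `check_node controlplane`, or `check_node worker` to match that node's role. A container that is restarting or absent is reported as not Up, which is the cue to inspect it with `sudo docker logs` before moving on to the next group of nodes.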

Preventive tasks