etcd has built-in automated data corruption detection to prevent member state from diverging.

Data corruption detection can be done in two ways:

- On etcd startup, using the `--experimental-initial-corrupt-check` flag.
- Periodically at runtime, using the `--experimental-corrupt-check-time` flag.

The initial check is executed during bootstrap of an etcd member. The member compares its persistent state against the other members and exits if there is a mismatch.
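For example, a member could be started with the initial check enabled; the member name and data directory below are placeholders, not part of the original procedure:

```bash
# Illustrative sketch: member name and data directory are placeholders.
etcd --name infra1 \
  --data-dir /var/lib/etcd \
  --experimental-initial-corrupt-check=true
```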
The periodic check is executed by the cluster leader in a cluster that is already running. The leader compares its persistent state against the other members and raises a CORRUPT alarm if there is a mismatch.

The check period is configured as a duration, for example `1m` for every minute or `1h` for every hour. A period of a couple of hours is recommended, because the check has a high performance cost: it requires computing a checksum by scanning the entire etcd content at a given revision.
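As a sketch, the periodic check could be enabled with a period of a few hours, and any CORRUPT alarm it raises can be inspected with `etcdctl`; the member name, data directory, and 4h period below are illustrative:

```bash
# Illustrative sketch: member settings and the 4h period are placeholders.
etcd --name infra1 \
  --data-dir /var/lib/etcd \
  --experimental-corrupt-check-time=4h

# List active alarms; a detected mismatch appears as a CORRUPT alarm.
etcdctl alarm list
```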
There are three ways to restore a corrupted member: purging the member's state, replacing the member, or restoring the whole cluster from a snapshot. After the corrupted member is restored, the CORRUPT alarm can be removed.
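Once the member is healthy again, the alarm can be cleared with `etcdctl`; the endpoint below is a placeholder for your cluster:

```bash
# Clears all active alarms (including CORRUPT) across the cluster.
etcdctl --endpoints=http://10.0.0.1:2379 alarm disarm
```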
A member's state can be purged by:

1. Removing the `snap` subdirectory from the etcd data directory.
2. Starting `etcd` with `--initial-cluster-state=existing` and the cluster members listed in `--initial-cluster`.

The etcd member is then expected to download an up-to-date snapshot from the leader.
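A minimal sketch of this procedure, assuming the default data directory layout (where `snap` lives under `member/`) and placeholder names and URLs:

```bash
# Stop the corrupted member, then remove its snapshot state.
# Under the default layout the snap directory is <data-dir>/member/snap.
rm -rf /var/lib/etcd/member/snap

# Restart the member so it rejoins the existing cluster and
# downloads a fresh snapshot from the leader.
etcd --name infra1 \
  --data-dir /var/lib/etcd \
  --initial-cluster-state=existing \
  --initial-cluster infra1=http://10.0.0.1:2380,infra2=http://10.0.0.2:2380,infra3=http://10.0.0.3:2380
```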
A member can be replaced by:

1. Removing the corrupted member with `etcdctl member remove`.
2. Adding a new member with `etcdctl member add`.
3. Starting `etcd` with `--initial-cluster-state=existing` and the cluster members listed in `--initial-cluster`.
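A sketch of the replacement steps, using a hypothetical member ID, member name, and peer URLs:

```bash
# Remove the corrupted member; the ID comes from `etcdctl member list`.
etcdctl member remove 8211f1d0f64f3269

# Add a replacement member under a new name and peer URL.
etcdctl member add infra4 --peer-urls=http://10.0.0.4:2380

# Start the new member with an empty data directory, joining the existing cluster.
etcd --name infra4 \
  --data-dir /var/lib/etcd \
  --initial-cluster-state=existing \
  --initial-cluster infra1=http://10.0.0.1:2380,infra2=http://10.0.0.2:2380,infra3=http://10.0.0.3:2380,infra4=http://10.0.0.4:2380
```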
The cluster can be restored by saving a snapshot from the current leader and restoring it on all members. Run `etcdctl snapshot save` against the leader and follow the restoring a cluster procedure.
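A sketch of the snapshot-based recovery, with placeholder endpoints and paths:

```bash
# Save a snapshot from the current leader.
etcdctl --endpoints=http://10.0.0.1:2379 snapshot save backup.db

# On each member, restore the snapshot into a fresh data directory,
# then continue with the full restoring-a-cluster procedure.
etcdctl snapshot restore backup.db --data-dir /var/lib/etcd-restored
```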