etcd has built-in automated data corruption detection to prevent member state from diverging.

Data corruption detection can be done using:

- Initial check, enabled using the `--experimental-initial-corrupt-check` flag.
- Periodic check over compacted revisions, enabled using the `--experimental-compact-hash-check-enabled` flag.
- Periodic check over the latest revision, enabled using the `--experimental-corrupt-check-time` flag.

The initial check is executed during bootstrap of an etcd member. The member compares its persistent state against the other members and exits if there is a mismatch.

Both periodic checks are executed by the cluster leader in a cluster that is already running. The leader compares its persistent state against the other members and raises a CORRUPT alarm if there is a mismatch. Both checks serve the same purpose, but they are worth enabling together to balance performance against time to detection.
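For illustration, a minimal sketch of enabling the initial check and inspecting alarms; the endpoint is an assumption for the example, and the remaining etcd flags are omitted:

```bash
# Bootstrap-time check: the member verifies its state against peers
# and exits on mismatch (other required etcd flags omitted here).
etcd --experimental-initial-corrupt-check=true

# On a running cluster, a failed periodic check surfaces as a CORRUPT
# alarm, which can be listed with (endpoint is illustrative):
etcdctl --endpoints=http://127.0.0.1:2379 alarm list
```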
When the `--experimental-compact-hash-check-enabled` flag is set, the compacted-revision check is executed once every minute. This can be adjusted using the `--experimental-compact-hash-check-time` flag, which takes a period such as `1m` (every minute) or `1h` (every hour). The check extends compaction to also calculate a checksum that can be compared between cluster members. It does not cause an additional database scan, making it very cheap, but it requires regular compaction in the cluster.
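As a sketch, adjusting the period might look like the following (other flags omitted; the 5-minute period is an arbitrary example):

```bash
# Compaction-hash check every 5 minutes instead of the default 1 minute:
etcd --experimental-compact-hash-check-enabled=true \
     --experimental-compact-hash-check-time=5m
```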
The latest-revision check is enabled using the `--experimental-corrupt-check-time` flag, which requires an execution period such as `1m` (every minute) or `1h` (every hour). A period of a couple of hours is recommended due to the high performance cost: each run computes a checksum by scanning the entire etcd content at a given revision.
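For example, enabling the check with a multi-hour period (the 6-hour value is an arbitrary illustration):

```bash
# Full-scan corruption check over the latest revision every 6 hours:
etcd --experimental-corrupt-check-time=6h
```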
There are three ways to restore a corrupted member: purging its state, replacing the member, or restoring the whole cluster from a snapshot. After the corrupted member is restored, the CORRUPT alarm can be removed.
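The alarm can be cleared with `etcdctl alarm disarm`; note that this disarms all active alarms on the cluster:

```bash
# Clear the CORRUPT alarm after the member has been repaired
# (this disarms all active alarms):
etcdctl alarm disarm
```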
A member's state can be purged by:

1. Removing the `snap` subdirectory from the etcd data directory.
2. Starting `etcd` with `--initial-cluster-state=existing` and cluster members listed in `--initial-cluster`.

The etcd member is expected to download an up-to-date snapshot from the leader.
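A minimal sketch of the purge procedure, assuming a data directory of `/var/lib/etcd` (where the `snap` subdirectory typically lives under `member/` in the v3 layout) and illustrative member names and peer URLs:

```bash
# Stop the corrupted member first (mechanism depends on your deployment),
# then remove its snapshot state:
rm -rf /var/lib/etcd/member/snap

# Restart the member against the existing cluster so it fetches an
# up-to-date snapshot from the leader:
etcd --name infra1 \
     --data-dir /var/lib/etcd \
     --initial-cluster-state=existing \
     --initial-cluster 'infra0=http://10.0.0.10:2380,infra1=http://10.0.0.11:2380,infra2=http://10.0.0.12:2380'
```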
A member can be replaced by:

1. Removing it from the cluster using `etcdctl member remove`.
2. Adding it back using `etcdctl member add`.
3. Starting `etcd` with `--initial-cluster-state=existing` and cluster members listed in `--initial-cluster`.
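A sketch of the replacement flow; the member ID, names, URLs, and data directory below are placeholders to substitute with your own values:

```bash
# Remove the corrupted member (find its ID via `etcdctl member list`):
etcdctl member remove 8e9e05c52164694d

# Re-add it under its name and peer URL:
etcdctl member add infra1 --peer-urls=http://10.0.0.11:2380

# Start it with an empty data directory, pointed at the existing cluster:
etcd --name infra1 \
     --data-dir /var/lib/etcd-new \
     --initial-cluster-state=existing \
     --initial-cluster 'infra0=http://10.0.0.10:2380,infra1=http://10.0.0.11:2380,infra2=http://10.0.0.12:2380'
```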
The whole cluster can be restored by saving a snapshot from the current leader and restoring it to all members. Run `etcdctl snapshot save` against the leader, then follow the restoring a cluster procedure.
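A sketch under assumed endpoints and member names; on v3.5+ the restore step lives in the `etcdutl` tool (older releases used `etcdctl snapshot restore`):

```bash
# Save a snapshot from the current leader:
etcdctl --endpoints=http://10.0.0.10:2379 snapshot save backup.db

# Restore it on each member with that member's own name and peer URL:
etcdutl snapshot restore backup.db \
  --name infra0 \
  --data-dir /var/lib/etcd-restored \
  --initial-cluster 'infra0=http://10.0.0.10:2380,infra1=http://10.0.0.11:2380,infra2=http://10.0.0.12:2380' \
  --initial-advertise-peer-urls http://10.0.0.10:2380
```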