Version v3.6-DRAFT of the documentation is in DRAFT status. For the latest stable documentation, see v3.5.
This document explains the etcd persistent storage format: naming, content and tools that allow developers to inspect them. Going forward the document should be extended with changes to the storage model. This document is targeted at etcd developers to help with their data recovery needs.
The following articles provide helpful background information for this document:
File name | High level purpose |
---|---|
./member/snap/db | bbolt b+tree that stores all the applied data, membership authorization information & metadata. It’s aware of what's the last applied WAL log index ("consistent_index"). |
./member/snap/0000000000000002-0000000000049425.snap ./member/snap/0000000000000002-0000000000061ace.snap | Periodic snapshots of legacy v2 store, containing:
As of etcd v3, the content is redundant to the content of /snap/db files. Periodically (30s) these files are purged, and the last |
/member/snap/000000000007a178.snap.db | A complete bbolt snapshot downloaded from the etcd leader if the replica was lagging too much. Has the same type of content as ( The file is used in 2 scenarios:
The file is not being deleted when the recovery is over (so whole content is populated to ./member/snap/db file). Periodically (30s) the files are purged.
Here also |
./member/wal/000000000000000f-00000000000b38c7.wal ./member/wal/000000000000000e-00000000000a7fe3.wal ./member/wal/000000000000000d-000000000009c70c.wal | Raft’s Write Ahead Logs, containing recent transactions accepted by Raft, periodic snapshots or CRC records. Recent If the snapshots are too infrequent, there can be more than |
./member/wal/0.tmp (or .../1.tmp) | Preallocated space for the next write ahead log file. Used to avoid Raft being stuck by a lack of WAL logs capacity without the possibility to raise an alarm. |
During etcd internal processing, it is possible that several short living files might be encountered:
File | High level purpose |
---|---|
./member/snap/0000000000000002-000000000007a178.snap.broken | Snapshot files are renamed as ‘broken’ when they cannot be loaded. The attempt to load the newest file happens when etcd is being started. Or during backup/migrate commands of etcdctl. |
./member/snap/tmp071677638 (random suffix) | Temporary (bbolt) file created on replicas in response to the msgSnap leaders request, so to the demand from the leader to recover storage from the given snapshot. After successful (complete) retrieval of content the file is renamed to: See etcd/issues/12837. Fixed in etcd 3.5. |
/member/snap/db.tmp.071677638 (random suffix) | A temporary file that contains a copy of the backend content (/member/snap/db), during the process of defragmentation. After the successful process the file is renamed to /member/snap/db, replacing the original backend. On etcd server startup these files get pruned. |
This file contains the main etcd content, applied to a specific point of the Raft log (see consistent_index).
The better bolt storage is physically organized as a b+tree. The physical pages of b-tree are never modified in-place1. Instead, the content is copied to a new page (reclaimed from the freepages list) and the old page is added to the free-pages list as soon as there is no open transaction that might access it. Thanks to this process, an open RO transaction sees a consistent historical state of the storage. The RW transaction is exclusive and blocking all other RW transactions.
Big values are stored on multiple continuous pages. The process of page reclamation combined with a need to allocate contiguous areas of pages of different sizes might lead to growing fragmentation of the bbolt storage.
The bbolt file never shrinks on its own. Only in the defragmentation process, the file can be rewritten to a new one that has some buffer of free pages on its end and has truncated size.
The bbolt storage is divided into buckets. In each bucket there are stored keys (byte[]->value byte[] pairs), in lexicographical order. The list below represents buckets used by etcd (as of version 3.5) and the keys in use.
Bucket | Key | Exemplar value | Description |
---|---|---|---|
alarm | rpcpb.Alarm:
{MemberID, Alarm: NONE|NOSPACE|CORRUPT} | nil | Indicates problems have been diagnosed in one of the members. |
auth | "authRevision" | "" (empty) or BigEndian.PutUint64 | Any change of Roles or Users increments this field on transaction commit. The value is used only for optimistic locking during the authorization process. |
authRoles | [roleName] as string | authpb.Role marshalled | |
authUsers | [userName] as string | authpb.User marshalled | |
cluster | "clusterVersion" | "3.5.0" (string) | minor version of consensus-agreed common storage version. |
"downgrade" | JSON:{ "target-version": "3.4.0" "enabled": true/false } | Persists intent configured by the most recent: Since v3.5 | |
key | [revisionId] encoded using bytesToRev{main,sub} The key-value deletes are marshalled with 't' at the end (as a "Thumbstone") | mvccpb.KeyValue marshalled proto (key, create_rev, mod_rev, version, value, lease id ) | |
lease | leasepb.Lease marshalled proto (ID, TTL, RemainingTTL) | Note: LeaseCheckpoint is extending only RemainingTTL. Just TTL is from the original Grant. Note2: We persist TTLs in seconds (from the undefined 'now'). Crash-looping server does not release leases!!! | |
members | [memberId] in hex as string: "8e9e05c52164694d" | JSON as string serialized Member structure:{ "id":10276657743932975437, "peerURLs":[ "http://localhost:2380"], "name":"default", "clientURLs": ["http://localhost:2379"] } | Agreed cluster membership information. |
members_removed | [memberId] in hex as string: "8e9e05c52164694d" | []byte("removed") | Ids of all removed members. Used to validate that a removed member is never added again under the same id. The field is currently (3.4) read from store V2 and never from V3. See https://github.com/etcd-io/etcd/pull/12820 |
meta | "consistent_index" | uint64 bytes (BigEndian) | Represents the offset of the last applied WAL entry to the bolt DB storage. |
"scheduledCompactRev" | bytesToRev{main,sub} encoded. (16 bytes) | Used to reinitialize compaction if a crash happened after a compaction request. | |
"finishedCompactRev" | bytesToRev{main,sub} encoded. (16 bytes) | Revision at which store was recently successfully compacted (https://github.com/etcd-io/etcd/blob/ae7862e8bc8007eb396099db4e0e04ac026c8df5/server/mvcc/kvstore_compaction.go#L54) | |
"confState" | Since etcd 3.5 | ||
"term" | Since etcd 3.5 | ||
"storage-version" |
bbolt has a command line tool that enables inspecting the file content.
Examples of use:
% go run go.etcd.io/bbolt/cmd/bbolt buckets ./default.etcd/member/snap/db
% go run go.etcd.io/bbolt/cmd/bbolt get ./default.etcd/member/snap/db cluster clusterVersion
etcd-dump-db can be used to list content of v3 etcd backend (bbolt).
% go run go.etcd.io/etcd/v3/tools/etcd-dump-db list-bucket default.etcd
alarm
auth
...
See more examples in: https://github.com/etcd-io/etcd/tree/master/tools/etcd-dump-db
Write ahead log is a Raft persistent storage that is used to store proposals. First the leader stores the proposal in its log and then (concurrently) replicates it using Raft protocol to followers. Each follower persists the proposal in its WAL before confirming back replication to the leader.
The WAL log used in etcd differs from canonical Raft model 2-fold:
The WAL log files are named using following pattern:
"%016x-%016x.wal", seq, index
Example: ./member/wal/0000000000000010-00000000000bf1e6.wal
So the file names contains hex-encoded:
The WAL log file contains a sequence of “Frames”. Each frame contains:
The files are “cut” (new file is started) when the current file is exceeding 64*10^6
bytes.
Write ahead log files in the logical layer contains:
Raftpb.Entry:
recent proposals replicated by Raft leader. Some of these proposals are considered ‘committed’ and the others are subject to be logically overridden.Raftpb.HardState(term,commit,vote):
periodic (very frequent) information about the index of a log entry that is ‘committed’ (replicated to the majority of servers), so guaranteed to be not changed/overridden and that can be applied to the backends (v2, v3). It also contains a “term” (indicator whether there were any election related changes) and a vote - a member the current replica voted for in the current term.walpb.Snapshot(term, index):
periodic snapshots of Raft state (no DB content, just snapshot log index and Raft term)etcdserverpb.Metadata(node_id, cluster_id)
- identifying the cluster & replica the log represents.Each WAL-log file is build from (in order):
CRC-32 frame (running crc from all previous files, 0 for the first file).
Metadata frame (cluster & replica IDs)
For the initial WAL file only:
For not initial (2nd+) WAL file:
Mix of entry, hard-state & snapshot records
The WAL log can contain multiple entries for the same index. Such a situation can happen in cases described in figure 7. of the Raft paper. The etcd WAL log is appended only, so the entries are getting overridden, by appending a new entry with the same index.
In particular during the WAL reading, the logic is overriding old entries with newer entries. Thus only the last version of entries with entry.index <= HardState.commit can be considered as final. Entries with index > HardState.commit are subject to change.
The “terms” in the WAL log are expected to be monotonic.
The “indexes” in the WAL log are expected to:
etcd WAL logs can be read using etcd-dump-logs tool:
% go install go.etcd.io/etcd/v3/tools/etcd-dump-logs@latest
% go run go.etcd.io/etcd/v3/tools/etcd-dump-logs --start-index=0 aname.etcd
Be aware that:
member/snap/{term}-{index}.snap
The filenames are generated here ("%016x-%016x.snap")
and are using 2 hex-encoded compounds:
The *.snap files are created by Snapshotter.SaveSnap method.
There are 2 triggers controlling creation of these files:
Currently the files are roughly3 associated 1-1 with WAL logs Snapshot entries. With store v2 decommissioning we expect the files to stop being written at all (opt-in: 3.5.x, mandatory 3.6.x).
The file contains marshalled snapdb.snapshot proto (uint32 crc, bytes data)
,
that in the ‘data’ field holds Raftpb.Snapshot:
(bytes data, SnapshotMetadata{index, term, conf} metadata),
Finally the nested data holds a JSON serialized store v2 content.
In particular there is:
/0/members/8e9e05c52164694d/attributes -> {"name":"default","clientURLs":["[http://localhost:2379](http://localhost:2379)"]}
/0/members/8e9e05c52164694d/RaftAttributes -> "{"peerURLs":["http://localhost:2380"]}"
Following command allows you to see the file content when executed from etcd root directory:
cat default.etcd/member/snap/0000000000000002-0000000000049425.snap |
protoc --decode=snappb.snapshot \
server/etcdserver/api/snap/snappb/snap.proto \
-I $(go list -f '{{.Dir}}' github.com/gogo/protobuf/proto)/.. \
-I . \
-I $(go list -m -f '{{.Dir}}' github.com/gogo/protobuf)/protobuf
Analogously you can extract ‘data’ field and decode as ‘Raftpb.Snapshot'
{
"Root":{
"Path":"/",
"CreatedIndex":0,
"ModifiedIndex":0,
"ExpireTime":"0001-01-01T00:00:00Z",
"Value":"",
"Children":{
"0":{
"Path":"/0",
"CreatedIndex":0,
"ModifiedIndex":0,
"ExpireTime":"0001-01-01T00:00:00Z",
"Value":"",
"Children":{
"members":{
"Path":"/0/members",
"CreatedIndex":1,
"ModifiedIndex":1,
"ExpireTime":"0001-01-01T00:00:00Z",
"Value":"",
"Children":{
"8e9e05c52164694d":{
"Path":"/0/members/8e9e05c52164694d",
"CreatedIndex":1,
"ModifiedIndex":1,
"ExpireTime":"0001-01-01T00:00:00Z",
"Value":"",
"Children":{
"attributes":{
"Path":"/0/members/8e9e05c52164694d/attributes",
"CreatedIndex":2,
"ModifiedIndex":2,
"ExpireTime":"0001-01-01T00:00:00Z",
"Value":"{\"name\":\"default\",\"clientURLs\":[\"http://localhost:2379\"]}",
"Children":null
},
"RaftAttributes":{
"Path":"/0/members/8e9e05c52164694d/RaftAttributes",
"CreatedIndex":1,
"ModifiedIndex":1,
"ExpireTime":"0001-01-01T00:00:00Z",
"Value":"{\"peerURLs\":[\"http://localhost:2380\"]}",
"Children":null
}
}
}
}
},
"version":{
"Path":"/0/version",
"CreatedIndex":3,
"ModifiedIndex":3,
"ExpireTime":"0001-01-01T00:00:00Z",
"Value":"3.5.0",
"Children":null
}
}
},
"1":{
"Path":"/1",
"CreatedIndex":0,
"ModifiedIndex":0,
"ExpireTime":"0001-01-01T00:00:00Z",
"Value":"",
"Children":{
}
}
}
},
"WatcherHub":{
"EventHistory":{
"Queue":{
"Events":[
{
"action":"create",
"node":{
"key":"/0/members/8e9e05c52164694d/RaftAttributes",
"value":"{\"peerURLs\":[\"http://localhost:2380\"]}",
"modifiedIndex":1,
"createdIndex":1
}
},
{
"action":"set",
"node":{
"key":"/0/members/8e9e05c52164694d/attributes",
"value":"{\"name\":\"default\",\"clientURLs\":[\"http://localhost:2379\"]}",
"modifiedIndex":2,
"createdIndex":2
}
},
{
"action":"set",
"node":{
"key":"/0/version",
"value":"3.5.0",
"modifiedIndex":3,
"createdIndex":3
}
}
]
}
}
}
}
This section is reserved to describe changes to the file formats introduces between different etcd versions.
Was this page helpful?
Glad to hear it! Please tell us how we can improve.
Sorry to hear that. Please tell us how we can improve.