Version 3.3.12 home Download and build Libraries and tools Branch management Demo Discovery service protocol Frequently Asked Questions (FAQ) Logging conventions Metrics Production users Reporting bugs Tuning etcd release guide Benchmarks Benchmarking etcd v2.1.0 Benchmarking etcd v2.2.0 Benchmarking etcd v2.2.0-rc Benchmarking etcd v2.2.0-rc-memory Benchmarking etcd v3 Storage Memory Usage Benchmark Watch Memory Usage Benchmark Developer guide Experimental APIs and features Interacting with etcd Set up a local cluster System limits Why gRPC gateway etcd API Reference etcd concurrency API Reference gRPC naming and discovery Learning etcd client architecture Client feature matrix Data model Glossary KV API guarantees Learner etcd v3 authentication design etcd versus other key-value stores etcd3 API Operations guide Clustering Guide Configuration flags Design of runtime reconfiguration Disaster recovery Failure modes Hardware recommendations Maintenance Migrate applications from using API v2 to API v3 Monitoring etcd Performance Role-based access control Run etcd clusters inside containers Runtime reconfiguration Supported systems Transport security model Versioning etcd gateway gRPC proxy Platforms Amazon Web Services Container Linux with systemd FreeBSD Upgrading Upgrade etcd from 2.3 to 3.0 Upgrade etcd from 3.0 to 3.1 Upgrade etcd from 3.1 to 3.2 Upgrade etcd from 3.2 to 3.3 Upgrade etcd from 3.3 to 3.4 Upgrade etcd from 3.4 to 3.5 Upgrading etcd clusters and applications etcd v3 API

Data model



etcd is designed to reliably store infrequently updated data and provide reliable watch queries. etcd exposes previous versions of key-value pairs to support inexpensive snapshots and watch history events (“time travel queries”). A persistent, multi-version, concurrency-control data model is a good fit for these use cases.

etcd stores data in a multiversion persistent key-value store. The persistent key-value store preserves the previous version of a key-value pair when its value is superseded with new data. The key-value store is effectively immutable; its operations do not update the structure in-place, but instead always generate a new updated structure. All past versions of keys are still accessible and watchable after modification. To prevent the data store from growing indefinitely over time and from maintaining old versions, the store may be compacted to shed the oldest versions of superseded data.

Logical view

The store’s logical view is a flat binary key space. The key space has a lexically sorted index on byte string keys so range queries are inexpensive.

The key space maintains multiple revisions. Each atomic mutative operation (e.g., a transaction operation may contain multiple operations) creates a new revision on the key space. All data held by previous revisions remains unchanged. Old versions of key can still be accessed through previous revisions. Likewise, revisions are indexed as well; ranging over revisions with watchers is efficient. If the store is compacted to save space, revisions before the compact revision will be removed. Revisions are monotonically increasing over the lifetime of a cluster.

A key’s life spans a generation, from creation to deletion. Each key may have one or multiple generations. Creating a key increments the version of that key, starting at 1 if the key does not exist at the current revision. Deleting a key generates a key tombstone, concluding the key’s current generation by resetting its version to 0. Each modification of a key increments its version; so, versions are monotonically increasing within a key’s generation. Once a compaction happens, any generation ended before the compaction revision will be removed, and values set before the compaction revision except the latest one will be removed.

Physical view

etcd stores the physical data as key-value pairs in a persistent b+tree. Each revision of the store’s state only contains the delta from its previous revision to be efficient. A single revision may correspond to multiple keys in the tree.

The key of key-value pair is a 3-tuple (major, sub, type). Major is the store revision holding the key. Sub differentiates among keys within the same revision. Type is an optional suffix for special value (e.g., t if the value contains a tombstone). The value of the key-value pair contains the modification from previous revision, thus one delta from previous revision. The b+tree is ordered by key in lexical byte-order. Ranged lookups over revision deltas are fast; this enables quickly finding modifications from one specific revision to another. Compaction removes out-of-date keys-value pairs.

etcd also keeps a secondary in-memory btree index to speed up range queries over keys. The keys in the btree index are the keys of the store exposed to user. The value is a pointer to the modification of the persistent b+tree. Compaction removes dead pointers.

© 2019 The etcd authors

Data model