Announcing etcd 3.5

When we launched etcd 3.4 back in August 2019, our focus was on storage backend improvements and the non-voting member and pre-vote features. Since then, etcd has become more widely used for various mission-critical clustering and database applications, and as a result its feature set has grown broader and more complex. Thus, improving stability and reliability has been a top priority in recent development.

Today, we are releasing etcd 3.5. The past two years allowed for extensive iteration: fixing numerous bugs, identifying optimization opportunities at scale, and evolving the surrounding ecosystem. The etcd project also became a CNCF graduated project during this time frame. This release is the result of continuous evolution and of the grungy, thankless tasks taken on by the etcd community.

In this blog post, we review the most notable changes to etcd 3.5, and present a project road map for future releases. For a complete list of changes, see CHANGELOG 3.5. For more updates, follow us on Twitter @etcdio. To get etcd, see Install.

Security

Given that etcd often handles sensitive data, improving and maintaining its security posture is our highest priority. In order to have a comprehensive understanding of the etcd security landscape, we completed third-party security audits: the first report was published in February 2020, in which we identified (and fixed) various edge cases and high-severity issues. For more information, see the security audit report.

To adhere to the highest levels of security best practices, etcd now has a security release process, and runs automated tests with static analysis tools, such as errcheck, ineffassign, and others.

Features

The migration to structured logging is complete. etcd now defaults to the zap logger, which has a reflection-free, zero-allocation JSON encoder. We have deprecated capnslog, which relied on reflection-based serialization.

etcd now supports built-in log rotation with configurable rotation thresholds, compression algorithms, and so on. For more information, see hexfusion@ of Red Hat’s code change.
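
As a minimal sketch, log rotation can be enabled as follows; the JSON values are illustrative lumberjack-style settings (maxsize in MB, maxage in days), not recommendations:

etcd --enable-log-rotation --log-rotation-config-json='{"maxsize": 100, "maxage": 7, "maxbackups": 3, "localtime": false, "compress": true}'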

etcd now emits more detailed tracing information for expensive requests, such as:

{
"caller":"traceutil/trace.go:116",
"msg":"trace[123] range",
"detail":"{
    range_begin:foo;
    range_end:fooo; response_count:100000; response_revision:191496;}",
"duration":"132.449773ms",
"start":"...:32.611-0700",
"end":"...:32.744-0700",
"steps":[
    "trace[123] step 'range keys from bolt db' (duration: 92.521911ms)",
    "trace[123] step 'filter and sort the key-value pairs' (duration: 22.789099ms)"]}

This provides a very useful signal for understanding the lifetime of a request spanning multiple etcd server components. See YoyinZyc@ of Google’s code change.

Each etcd cluster maintains its own cluster version, a value agreed on by the quorum of the cluster. Previously, downgrading such cluster versions (e.g., etcd minor version from 3.5 to 3.4) was not supported in order to protect against incompatible changes.

Let’s say we allow a 3.3 node to join a 3.4 cluster, and a lease checkpoint request, which was introduced only in etcd 3.4, is sent to the leader. When the 3.3 node receives the replicated lease checkpoint request, it fails to handle the unknown apply request (see the etcd server apply code). However, one might not use the lease checkpoint feature and be willing to risk incompatible changes in order to perform emergency rollbacks (e.g., after a major defect in a new etcd version). To ensure such rollbacks are easy and reliable, we’ve added a downgrade API that validates, enables, and cancels etcd version downgrades. For more information, see YoyinZyc@ of Google’s code change.

By requiring quorum agreement, etcd applies cluster membership changes with the same level of consistency as writes. Previously, however, the member list call was served directly from the server’s local data, which could be stale. Now, etcd serves the member list with a linearizable guarantee: if the server is disconnected from the quorum, the member list call fails. See jingyih@ of Google’s code change.
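
For example, here is a minimal sketch of retrieving the member list with the Go client (the endpoint is an assumption about a local test cluster):

package main

import (
    "context"
    "fmt"
    "log"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
    // Connect to a local etcd server; the endpoint is illustrative.
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"localhost:2379"},
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        log.Fatal(err)
    }
    defer cli.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    // In 3.5, the member list is served with a linearizable guarantee,
    // so this call fails if the server is disconnected from the quorum.
    resp, err := cli.MemberList(ctx)
    if err != nil {
        log.Fatal(err)
    }
    for _, m := range resp.Members {
        fmt.Printf("member %x: %s %v\n", m.ID, m.Name, m.ClientURLs)
    }
}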

The gRPC gateway endpoint is now stable under /v3/*. The gRPC gateway generates an HTTP API that makes etcd’s gRPC-based HTTP/2 protocol accessible via HTTP/1; for example (note that keys and values are base64-encoded in the JSON payload):

curl -X POST -L http://localhost:2379/v3/kv/put -d '{"key": "Zm9v", "value": "YmFy"}'

The etcd client now uses the latest gRPC, v1.32.0, requires a new import path, "go.etcd.io/etcd/client/v3", and moves its balancer implementation upstream. For more information, see ptabor@ of Google’s code change.
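
For example, a downstream Go project can pick up the new module with something like the following (the version pin simply matches this release) and then import the package as clientv3 "go.etcd.io/etcd/client/v3":

go get go.etcd.io/etcd/client/v3@v3.5.0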

Bug fixes

etcd reliability and correctness are of utmost importance. That is why we backport all critical bug fixes to previous etcd releases. The following are the notable bugs we identified and fixed during etcd 3.5 development:

  1. Lease objects piling up caused memory leaks, and the solution was to clear the expired lease queue in the old leader. For details, see tangcong@ of Tencent’s fix.
  2. An ongoing compaction operation could cause a deadlock in the mvcc storage layer. For details, see tangcong@ of Tencent’s fix.
  3. etcd server restart performed redundant backend database open operations; as a result, reloading 40 million keys took over 5 minutes. The fix reduced the restart time by half. For details, see tangcong@ of Tencent’s fix.
  4. If etcd crashed before completing defragmentation, the next defragment operation might have read the corrupted file. The solution was to ignore and overwrite the existing file. See jpbetz@ of Google’s fix.
  5. Cancelling a watch on the client side did not signal the server, which created leaky watchers. The solution was to explicitly send a cancel request to the server. See jackkleeman@ of Apple’s fix.

Performance

Kubernetes, the most prominent user of etcd, queries the entire keyspace to list and watch its cluster resources. This range query happens whenever a resource is not found in the kube-apiserver reflector cache (e.g., the requested etcd revision has been compacted; see the kube-apiserver v1.21 code), often causing slow reads on overloaded etcd servers (see GitHub issue). In such cases, kube-apiserver tracing warns as below:

"List" url:/api/v1/pods,user-agent... (started: ...) (total time: 1.208s): Trace[...]: [1.208s] [1.204s] Writing http response done count:4346

And etcd warns as follows:

etcdserver: read-only range request key:"/registry/pods/" range_end:"/registry/pods0" revision:... range_response_count:500 size:291984 took too long (723.099118ms) to execute
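
The following is a minimal sketch of the kind of prefix range query behind the warning above, written with the Go client; the endpoint is illustrative and the limit of 500 mirrors the page size in the log:

package main

import (
    "context"
    "fmt"
    "log"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"localhost:2379"},
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        log.Fatal(err)
    }
    defer cli.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // List one 500-key page of the /registry/pods/ keyspace, sorted by key.
    // Even when paginated, a full relist still walks the entire prefix.
    resp, err := cli.Get(ctx, "/registry/pods/",
        clientv3.WithPrefix(),
        clientv3.WithSort(clientv3.SortByKey, clientv3.SortAscend),
        clientv3.WithLimit(500))
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("got %d of %d keys (more=%v)\n", len(resp.Kvs), resp.Count, resp.More)
}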

Our deep dive into the etcd heap profile uncovered a major inefficiency in the server’s warning logger, which performed a redundant encoding operation only to compute the size of range responses with a proto.Size call. As a result, a large range query had up to 60% heap allocation overhead, causing out-of-memory (OOM) crashes in overloaded etcd servers (see Figure 1). We optimized the protocol buffer message size operation and, as a result, reduced etcd memory consumption by up to 50% during peak usage (see Figure 2). It was a set of small code changes, but for years such performance gains were invisible without extensive testing and workload simulation. See chaochn47@ of Amazon Web Services’s investigation and patch to replace proto.Size calls.
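
As a rough, self-contained illustration of the inefficiency (this is not the actual etcd patch, and the message contents are made up), compare computing a response’s wire size by fully encoding it against using the gogo-generated Size() method, which sums field sizes without materializing the encoded bytes:

package main

import (
    "fmt"

    "go.etcd.io/etcd/api/v3/etcdserverpb"
    "go.etcd.io/etcd/api/v3/mvccpb"
)

func main() {
    // Build an artificially large range response.
    resp := &etcdserverpb.RangeResponse{}
    for i := 0; i < 100000; i++ {
        resp.Kvs = append(resp.Kvs, &mvccpb.KeyValue{
            Key:   []byte(fmt.Sprintf("/registry/pods/pod-%06d", i)),
            Value: make([]byte, 256),
        })
    }

    // Expensive: encoding every key-value pair just to learn the length
    // allocates a buffer proportional to the whole response.
    b, err := resp.Marshal()
    if err != nil {
        panic(err)
    }
    fmt.Println("size via Marshal:", len(b))

    // Cheap: the generated Size() computes the same number field by field.
    fmt.Println("size via Size():", resp.Size())
}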

figure-1

Figure 1: etcd heap usage while warn-logging a slow request, where the response size of a protocol buffer message is computed. 61% of the heap was allocated in the proto.Size call path, which encodes all key-value pairs in the message to compute the size.

figure-2

Figure 2: etcd heap usage during range query before and after replacing the proto.Size call. Optimizing the proto.Size call in the etcd server reduced memory usage up to 50%.

The etcd 3.4 release made backend read transactions fully concurrent by copying transaction buffers rather than sharing them between writes and concurrent reads (see the code change from the 3.4 release). However, this buffering mechanism comes with unavoidable copy overhead and negatively impacts write-heavy transaction performance, because creating a concurrent read transaction acquires a mutex lock that blocks incoming write transactions.

etcd 3.5 improvements further increase transaction concurrency.

  1. If a transaction includes a PUT (update) operation, the transaction instead shares the transaction buffer between reads and writes (same behavior as 3.3) in order to avoid copying buffers. This transaction mode can be disabled via etcd --experimental-txn-mode-write-with-shared-buffer=false.

    The benchmark results show that the transaction throughput with a high write ratio has increased up to 2.7 times by avoiding copying buffers when creating a write transaction (see Figures 3 and 4). This benefits all kube-apiserver create and update calls that use etcd transactions (see etcd3 store v1.21 code). For an illustration of the kind of read-write transaction that benefits, see the sketch after this list. For more information, see wilsonwang371@ of ByteDance’s code change and benchmark results.

    figure-3

    Figure 3: etcd transaction throughput with a high write ratio. The value at the top is the ratio of reads to writes. The first ratio, 0.125, is 1 read per 8 writes. The second ratio, 0.25, is 1 read per 4 writes. The value at the right bar represents the inverse ratio of transaction throughput before and after etcd/pull/12896. With the shared buffer approach for writes, the transaction throughput is increased up to 2.7 times.

    figure-4

    Figure 4: etcd transaction throughput with a high read ratio. The value at the top is the ratio of reads to writes. The first ratio, 4.0, is 4 reads per 1 write. The second ratio, 8.0, is 8 reads per 1 write. The value at the right bar represents the inverse ratio of transaction throughput before and after etcd/pull/12896. With the shared buffer approach for writes, the transaction throughput is increased up to 25%.

  2. etcd now caches the transaction buffer to avoid unnecessary copy operations. This speeds up concurrent read transaction creation, and as a result transaction throughput with a high read ratio has increased up to 2.4 times (see Figures 5 and 6). See wilsonwang371@ of ByteDance’s code change and benchmark results.

    figure-5

    Figure 5: etcd transaction throughput with a high write ratio. The value at the top is the ratio of reads to writes. The first ratio, 0.125, is 1 read per 8 writes. The second ratio, 0.25, is 1 read per 4 writes. The value at the right bar represents the inverse ratio of transaction throughput before and after etcd/pull/12933. With the caching mechanism for read transactions, the transaction throughput is increased up to 1.4 times.

    figure-6

    Figure 6: etcd transaction throughput with a high read ratio. The value at the top is the ratio of reads to writes. The first ratio, 4.0, is 4 reads per 1 write. The second ratio, 8.0, is 8 reads per 1 write. The value at the right bar represents the inverse ratio of transaction throughput before and after etcd/pull/12933. With the caching mechanism for read transactions, the transaction throughput is increased up to 2.5 times.
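
The sketch below (endpoint and keys are illustrative) shows the kind of compare-and-swap transaction that kube-apiserver issues for create and update calls: a read (the compare, plus the else-branch get) mixed with a write in a single transaction, which is exactly the shape that benefits from the shared write buffer in 3.5:

package main

import (
    "context"
    "fmt"
    "log"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"localhost:2379"},
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        log.Fatal(err)
    }
    defer cli.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    key := "/registry/pods/default/nginx"

    // Create-if-absent: write the key only if it has never been modified,
    // otherwise read back the current value.
    resp, err := cli.Txn(ctx).
        If(clientv3.Compare(clientv3.ModRevision(key), "=", 0)).
        Then(clientv3.OpPut(key, "pod manifest...")).
        Else(clientv3.OpGet(key)).
        Commit()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("created:", resp.Succeeded)
}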

Monitoring

Long-running load tests revealed that the etcd server misrepresented its real memory usage by masking the impact of Go garbage collection. We discovered that Go 1.12 changed the runtime to use MADV_FREE on Linux, and as a result reclaimed memory was not reflected in the resident set size (RSS) metric. This made the etcd memory usage metric appear inaccurately static, showing no sign of Go garbage collection. To fix this monitoring problem, etcd 3.5 is compiled with Go 1.16, which defaults to MADV_DONTNEED on Linux. For more information, see Figure 7 and GitHub Go issue 42330.

figure-7

Figure 7: etcd memory usage during range queries before and after setting MADV_DONTNEED in the Go runtime. When run with GODEBUG=madvdontneed=1, or with Go v1.16 or later, the etcd server accurately reports its memory usage in the CloudWatch mem_used metric or other monitoring tools such as top.

Monitoring is fundamental to reliability and observability: it enables individual service owners to understand the current state of a system and identify possible causes of a problem report. The goal of such telemetry is to detect early warning signs and diagnose potential issues. etcd creates server logs with tracing information and publishes Prometheus metrics.

This information helps us determine possible service impacts and causes. However, when a request call chain spans multiple external components (e.g., from kube-apiserver to etcd), it is harder to identify the issue. In order to efficiently identify the root cause, we added distributed tracing support using OpenTelemetry: when distributed tracing is enabled, etcd produces a trace across the RPC call chain and thus integrates easily with the surrounding ecosystem. See Figure 8, lilic@ of Red Hat’s proposal, and GitHub issue 12460.
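
As a minimal sketch, distributed tracing is turned on through the experimental 3.5 flags; the OTLP collector address and service name below are assumptions about a particular OpenTelemetry setup:

etcd --experimental-enable-distributed-tracing \
  --experimental-distributed-tracing-address=localhost:4317 \
  --experimental-distributed-tracing-service-name=etcd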

figure-8

Figure 8: Example Jaeger tracing UI for etcd.

Testing

Distributed systems are full of subtle edge cases. The fact that a bug may only emerge under special circumstances warrants exhaustive test coverage beyond simple unit testing. etcd runs integration, end-to-end, and failure injection tests, which provide a reliable and faster way to validate each change. However, as development continued with an extended feature set, flaky tests quickly piled up, draining our productivity. So, to improve test quality, we took on a series of arduous tasks that often required multiple hours of debugging to root-cause failures. Some notable changes include the following:

Platforms

etcd presubmit tests are now fast and reliable, but until recently they ran mostly on x86 processors, and there have been numerous requests to support other architectures such as ARM (see GitHub issue 12852) and s390x (see GitHub issue 11163). A self-hosted GitHub Actions runner provides a consistent way of hosting various external test workers (see GitHub issue 12856). Using GitHub Actions, etcd now runs tests on ARM-based AWS EC2 instances (Graviton), thereby officially supporting the ARM64 (aarch64) platform. In addition, we introduced a mechanism to support other platforms and categorized support tiers based on testing coverage. For more information, see documentation PR 273 and the Supported platforms documentation.

Developer experience

To better support integration with external projects, etcd now fully adopts Go modules (with Go 1.16). This brought challenges, as the existing codebase made the migration arduous and caused concerns about adoption in the community. Tooling is an important part of etcd development, and as a result we needed a better solution to support our contributor experience.

Using Go modules enables clear separation between server and client code, ease of change management for dependency updates, and a verifiable build system without a codebase convoluted by vendoring. With reproducible builds in place, we eliminated the need to vendor dependencies and, as a result, reduced the etcd codebase size by half. See ptabor@ of Google’s modularization proposal and code change to remove the vendor directory.

To better isolate dependency trees, the etcd command-line interface now has a new administrative tool, etcdutl (note the spelling: etcdutl, not etcdutil), whose sub-commands include etcdutl snapshot and etcdutl defrag. The etcdctl snapshot and etcdctl defrag commands from previous releases are deprecated. This change aligns well with the new Go module layout: etcdctl relies solely on client v3 libraries, whereas etcdutl may depend on etcd server-side packages, such as bolt and the backend database code. See ptabor@ of Google’s code change.
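
For example, offline maintenance now goes through etcdutl instead of etcdctl; the data directory paths below are illustrative:

etcdutl defrag --data-dir /var/lib/etcd
etcdutl snapshot restore snapshot.db --data-dir /var/lib/etcd-restored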

To support inclusive naming initiatives, the etcd project renamed the default branch from master to main (see the etcd-io/etcd change and etcd-io/website change). The migration was seamless, as GitHub handles the necessary redirects (see GitHub renaming).

Each etcd write incurs a Raft append message that is fsynced to disk. However, such persistence might not be desirable for testing. To work around this, we’ve added the etcd --unsafe-no-fsync flag to bypass disk writes for Raft WAL entries. See crawshaw@ of Tailscale’s code change and GitHub issue 11930.

Community

The diversity of etcd end users keeps expanding: Cloudflare relies on etcd for managing its data center, Grafana Cortex stores its configuration data in etcd, Netflix Titus uses etcd for managing its container workloads, and Tailscale runs its control plane on top of etcd.

We have also extended our team of vendor contributors. In the etcd 3.5 release, we’ve added two core maintainers: Wenjia Zhang of Google, who has been leading etcd community meetings and Kubernetes integration, and Piotr Tabor of Google, who has been leading numerous bug fixes and codebase modularization work.

The diversity of contributors is key to building a sustainable, welcoming open source project and fostering manageable work environments.

See the 2021 CNCF etcd project journey report for more information.

New etcd.io

Since etcd joined the Cloud Native Computing Foundation (CNCF) in December 2018, we have refactored all user-facing documentation into a dedicated repository, etcd-io/website, and modernized the website hosting with Hugo. The migration was a huge undertaking that required multiple months of engineering effort and communication between maintainers. The very blog post you are reading now is hosted on the new etcd.io, thanks to lucperkins@, chalin@ of CNCF, nate-double-u@ of CNCF, and many other community contributors.

Future roadmaps

Traffic overload can cause cascading node failures; as a result, scaling such clusters becomes challenging, and the ability to recover from quorum loss can be impaired. With so many mission-critical systems built on top of etcd, defending etcd against overload is paramount. We will revisit the etcd throttle feature to shed excessive load gracefully. Currently, the etcd project has two pending rate-limiter proposals: vivekpatani@ of Apple’s proposal and tangcong@ of Tencent’s proposal.

Large range queries from kube-apiserver are still the most challenging source of process crashes, as their load is relatively unpredictable. Our heap profile of such workloads found that the etcd range request handler decodes and holds the entire response before sending it out to the gRPC server, accounting for up to 37% of heap allocations. See Figure 9 and chaochn47@ of Amazon Web Services’s investigation.

Paginating range calls in client code does not fully address the issue, because it entails additional consistency considerations and still requires full relists for expired resources (for more information, see the kube-apiserver v1.21 code). To work around this inefficiency, etcd needs to support range streams. We will revisit yangxuanjia@ of JD’s range stream proposal, as it requires a significant level of effort to introduce such semantic changes in etcd and in downstream projects.

figure-9

Figure 9: etcd heap usage during a range query that lists Kubernetes pods. 37% of the heap was allocated in etcd mvcc rangeKeys to hold key-value pairs for creating a range query response.

In order to reduce maintenance overhead, we are fully deprecating the etcd v2 API in favor of the more performant and widely adopted v3 API. The v2 storage translation layer via etcd --experimental-enable-v2v3 remains experimental in 3.5 and will be removed in the next release. For details, see ptabor@ of Google’s proposal.

Historically, etcd releases have been a large undertaking: they were infrequent, which made each release delta large, and the process lacked release automation. We will develop an automated release system that is more accessible to the community.
