Run etcd clusters as a Kubernetes StatefulSet

Running etcd as a Kubernetes StatefulSet

The following demonstrates how to perform the static bootstrap process for an etcd cluster as a Kubernetes StatefulSet.

Example Manifest

This manifest contains a Service and StatefulSet for deploying a static etcd cluster in Kubernetes.

If you copy the contents of the manifest into a file named etcd.yaml, it can be applied to a cluster with this command.

$ kubectl apply --filename etcd.yaml

Once the manifest is applied, wait for the pods to become ready.

$ kubectl get pods
NAME     READY   STATUS    RESTARTS   AGE
etcd-0   1/1     Running   0          24m
etcd-1   1/1     Running   0          24m
etcd-2   1/1     Running   0          24m
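
Alternatively, kubectl can block until the pods report ready. This uses the app=etcd label from the manifest below; the 10 minute timeout is just an example value.

$ kubectl wait pod --selector app=etcd --for=condition=Ready --timeout=10m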

The container image used in the example includes etcdctl, which can be invoked directly inside the pods.

$ kubectl exec -it etcd-0 -- etcdctl member list -wtable
+------------------+---------+--------+-------------------------+-------------------------+------------+
|        ID        | STATUS  |  NAME  |       PEER ADDRS        |      CLIENT ADDRS       | IS LEARNER |
+------------------+---------+--------+-------------------------+-------------------------+------------+
| 4f98c3545405a0b0 | started | etcd-2 | http://etcd-2.etcd:2380 | http://etcd-2.etcd:2379 |      false |
| a394e0ee91773643 | started | etcd-0 | http://etcd-0.etcd:2380 | http://etcd-0.etcd:2379 |      false |
| d10297b8d2f01265 | started | etcd-1 | http://etcd-1.etcd:2380 | http://etcd-1.etcd:2379 |      false |
+------------------+---------+--------+-------------------------+-------------------------+------------+
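
The same pattern works for other etcdctl commands, for example a quick health check of every member (the --cluster flag targets all endpoints from the member list):

$ kubectl exec -it etcd-0 -- etcdctl endpoint health --cluster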

To deploy with a self-signed certificate, refer to the commented configuration headings starting with ## TLS to find values that you can uncomment. Additional instructions for generating certificates with cert-manager are included in a section below.

# file: etcd.yaml
---
apiVersion: v1
kind: Service
metadata:
  name: etcd
  namespace: default
spec:
  type: ClusterIP
  clusterIP: None
  selector:
    app: etcd
  ##
  ## Ideally we would use SRV records to do peer discovery for initialization.
  ## Unfortunately discovery will not work without logic to wait for these to
  ## populate in the container. This problem is relatively easy to overcome by
  ## making changes to prevent the etcd process from starting until the records
  ## have populated. The documentation on StatefulSets briefly talks about it.
  ##   https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#stable-network-id
  publishNotReadyAddresses: true
  ##
  ## The naming scheme of the client and server ports match the scheme that etcd
  ## uses when doing discovery with SRV records.
  ports:
  - name: etcd-client
    port: 2379
  - name: etcd-server
    port: 2380
  - name: etcd-metrics
    port: 8080
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: default
  name: etcd
spec:
  ##
  ## The service name is set so the StatefulSet uses the headless service above.
  ## https://kubernetes.io/docs/concepts/services-networking/service/#headless-services
  serviceName: etcd
  ##
  ## If you are increasing the replica count of an existing cluster, you should
  ## also update the --initial-cluster-state flag as noted further down in the
  ## container configuration.
  replicas: 3
  ##
  ## For initialization, the etcd pods must be able to reach each other before
  ## they are "ready" for traffic. The "Parallel" policy makes this possible.
  podManagementPolicy: Parallel
  ##
  ## To ensure availability of the etcd cluster, the rolling update strategy
  ## is used. For availability, a quorum (a majority) of the etcd members must
  ## be online at any given time.
  updateStrategy:
    type: RollingUpdate
  ##
  ## This is a label query over pods that should match the replica count.
  ## It must match the pod template's labels. For more information, see the
  ## following documentation:
  ##   https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors
  selector:
    matchLabels:
      app: etcd
  ##
  ## Pod configuration template.
  template:
    metadata:
      ##
      ## The labeling here is tied to the "matchLabels" of this StatefulSet and
      ## "affinity" configuration of the pod that will be created.
      ##
      ## This example's labeling scheme is fine for one etcd cluster per
      ## namespace, but should you desire multiple clusters per namespace, you
      ## will need to update the labeling scheme to be unique per etcd cluster.
      labels:
        app: etcd
      annotations:
        ##
        ## This gets referenced in the etcd container's configuration as part of
        ## the DNS name. It must match the service name created for the etcd
        ## cluster. The choice to place it in an annotation instead of the env
        ## settings is because there should only be 1 service per etcd cluster.
        serviceName: etcd
    spec:
      ##
      ## Configuring pod anti-affinity is necessary to prevent etcd servers from
      ## ending up together on the same node.
      ##
      ## See the scheduling documentation for more information about this:
      ##   https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity
      affinity:
        ## The podAntiAffinity is a set of rules for scheduling that describe
        ## when NOT to place a pod from this StatefulSet on a node.
        podAntiAffinity:
          ##
          ## When preparing to place the pod on a node, the scheduler will check
          ## for other pods matching the rules described by the labelSelector
          ## within the scope of the chosen topology key.
          requiredDuringSchedulingIgnoredDuringExecution:
          ## This label selector is looking for app=etcd
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - etcd
            ## This topology key denotes a common label used on nodes in the
            ## cluster. The podAntiAffinity configuration essentially states
            ## that if another pod has a label of app=etcd on the node, the
            ## scheduler should not place another pod on the node.
            ##   https://kubernetes.io/docs/reference/labels-annotations-taints/#kubernetesiohostname
            topologyKey: "kubernetes.io/hostname"
      ##
      ## Containers in the pod
      containers:
      ## This example only has this etcd container.
      - name: etcd
        image: quay.io/coreos/etcd:v3.5.15
        imagePullPolicy: IfNotPresent
        ports:
        - name: etcd-client
          containerPort: 2379
        - name: etcd-server
          containerPort: 2380
        - name: etcd-metrics
          containerPort: 8080
        ##
        ## These probes would fail over TLS with a self-signed certificate, so
        ## etcd is configured further down to serve metrics over plain HTTP on
        ## port 8080.
        ##
        ## As mentioned in the "Monitoring etcd" page, /readyz and /livez were
        ## added in v3.5.12. Prior to this, monitoring required extra tooling
        ## inside the container to make these probes work.
        ##
        ## The values in this readiness probe should be further validated; it
        ## is only an example configuration.
        readinessProbe:
          httpGet:
            path: /readyz
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 30
        ## The values in this liveness probe should be further validated; it
        ## is only an example configuration.
        livenessProbe:
          httpGet:
            path: /livez
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        env:
        ##
        ## Environment variables defined here can be used by other parts of the
        ## container configuration. They are interpreted by Kubernetes, instead
        ## of in the container environment.
        ##
        ## These env vars pass along information about the pod.
        - name: K8S_NAMESPACE
          valueFrom:
            fieldRef:
             fieldPath: metadata.namespace
        - name: HOSTNAME
          valueFrom:
            fieldRef:
             fieldPath: metadata.name
        - name: SERVICE_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['serviceName']
        ##
        ## Configuring etcdctl inside the container to connect to the local etcd
        ## node reduces confusion when debugging.
        - name: ETCDCTL_ENDPOINTS
          value: $(HOSTNAME).$(SERVICE_NAME):2379
        ##
        ## TLS client configuration for etcdctl in the container.
        ## These file paths are part of the "etcd-client-tls" volume mount.
        # - name: ETCDCTL_KEY
        #   value: /etc/etcd/certs/client/tls.key
        # - name: ETCDCTL_CERT
        #   value: /etc/etcd/certs/client/tls.crt
        # - name: ETCDCTL_CACERT
        #   value: /etc/etcd/certs/client/ca.crt
        ##
        ## Use this URI_SCHEME value for non-TLS clusters.
        - name: URI_SCHEME
          value: "http"
        ## TLS: Use this URI_SCHEME for TLS clusters.
        # - name: URI_SCHEME
        #   value: "https"
        ##
        ## If you're using a different container, the executable may be in a
        ## different location. This example uses the full path to help remove
        ## ambiguity for you, the reader.
        ## Often you can just use "etcd" instead of "/usr/local/bin/etcd" and it
        ## will work because the $PATH includes a directory containing "etcd".
        command:
        - /usr/local/bin/etcd
        ##
        ## Arguments used with the etcd command inside the container.
        args:
        ##
        ## Configure the name of the etcd server.
        - --name=$(HOSTNAME)
        ##
        ## Configure etcd to use the persistent storage configured below.
        - --data-dir=/data
        ##
        ## In this example, the WAL shares space with the data directory. This
        ## is not ideal in production environments; the WAL should be placed in
        ## its own volume.
        - --wal-dir=/data/wal
        ##
        ## URL configurations are parameterized here and you shouldn't need to
        ## do anything with these.
        - --listen-peer-urls=$(URI_SCHEME)://0.0.0.0:2380
        - --listen-client-urls=$(URI_SCHEME)://0.0.0.0:2379
        - --advertise-client-urls=$(URI_SCHEME)://$(HOSTNAME).$(SERVICE_NAME):2379
        ##
        ## This must be set to "new" for initial cluster bootstrapping. To scale
        ## the cluster up, this should be changed to "existing" when the replica
        ## count is increased. If set incorrectly, etcd will attempt to start
        ## but fail safely.
        - --initial-cluster-state=new
        ##
        ## Token used for cluster initialization. The recommendation for this is
        ## to use a unique token for every cluster. This example is parameterized
        ## to be unique to the namespace, but if you are deploying multiple etcd
        ## clusters in the same namespace, you should do something extra to
        ## ensure uniqueness amongst clusters.
        - --initial-cluster-token=etcd-$(K8S_NAMESPACE)
        ##
        ## The initial cluster flag needs to be updated to match the number of
        ## replicas configured. When combined, these are a little hard to read.
        ## Here is what a single parameterized peer looks like:
        ##   etcd-0=$(URI_SCHEME)://etcd-0.$(SERVICE_NAME):2380
        - --initial-cluster=etcd-0=$(URI_SCHEME)://etcd-0.$(SERVICE_NAME):2380,etcd-1=$(URI_SCHEME)://etcd-1.$(SERVICE_NAME):2380,etcd-2=$(URI_SCHEME)://etcd-2.$(SERVICE_NAME):2380
        ##
        ## The peer urls flag should be fine as-is.
        - --initial-advertise-peer-urls=$(URI_SCHEME)://$(HOSTNAME).$(SERVICE_NAME):2380
        ##
        ## This avoids probe failure if you opt to configure TLS.
        - --listen-metrics-urls=http://0.0.0.0:8080
        ##
        ## These are some settings you may want to consider enabling, but you
        ## should look into them further to identify which values are best for you.
        # - --auto-compaction-mode=periodic
        # - --auto-compaction-retention=10m
        ##
        ## TLS client configuration for etcd, reusing the etcdctl env vars.
        # - --client-cert-auth
        # - --trusted-ca-file=$(ETCDCTL_CACERT)
        # - --cert-file=$(ETCDCTL_CERT)
        # - --key-file=$(ETCDCTL_KEY)
        ##
        ## TLS server configuration for etcd peer-to-peer communication.
        ## These file paths are part of the "etcd-server-tls" volume mount.
        # - --peer-client-cert-auth
        # - --peer-trusted-ca-file=/etc/etcd/certs/server/ca.crt
        # - --peer-cert-file=/etc/etcd/certs/server/tls.crt
        # - --peer-key-file=/etc/etcd/certs/server/tls.key
        ##
        ## This is the mount configuration.
        volumeMounts:
        - name: etcd-data
          mountPath: /data
        ##
        ## TLS client configuration for etcdctl
        # - name: etcd-client-tls
        #   mountPath: "/etc/etcd/certs/client"
        #   readOnly: true
        ##
        ## TLS server configuration
        # - name: etcd-server-tls
        #   mountPath: "/etc/etcd/certs/server"
        #   readOnly: true
      volumes:
      ##
      ## TLS client configuration
      # - name: etcd-client-tls
      #   secret:
      #     secretName: etcd-client-tls
      #     optional: false
      ##
      ## TLS server configuration
      # - name: etcd-server-tls
      #   secret:
      #     secretName: etcd-server-tls
      #     optional: false
  ##
  ## This StatefulSet uses the volumeClaimTemplates field to create a PVC in
  ## the cluster for each replica. These PVCs cannot be easily resized later.
  volumeClaimTemplates:
  - metadata:
      name: etcd-data
    spec:
      accessModes: ["ReadWriteOnce"]
      ##
      ## In some clusters, it is necessary to explicitly set the storage class.
      ## This example will end up using the default storage class.
      # storageClassName: ""
      resources:
        requests:
          storage: 1Gi

Generating Certificates

In this section, we use Helm to install an operator called cert-manager.

With cert-manager installed, self-signed certificates can be generated in the cluster. The generated certificates are placed in Secret objects that can be mounted as files in containers.

This is the helm command to install cert-manager.

$ helm upgrade --install --create-namespace --namespace cert-manager cert-manager cert-manager --repo https://charts.jetstack.io --set crds.enabled=true
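
After the chart installs, confirm the cert-manager pods are running before creating any issuers or certificates.

$ kubectl get pods --namespace cert-manager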

This is an example ClusterIssuer configuration for generating self-signed certificates.

# file: issuer.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned
spec:
  selfSigned: {}
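
Assuming the manifest above is saved as issuer.yaml, apply it the same way as the other manifests.

$ kubectl apply --filename issuer.yaml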

This manifest creates Certificate objects for the client and server certs, referencing the ClusterIssuer “selfsigned”. The dnsNames should be an exhaustive list of valid hostnames for the certificates that cert-manager creates.

# file: certificates.yaml
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: etcd-server
  namespace: default
spec:
  secretName: etcd-server-tls
  issuerRef:
    name: selfsigned
    kind: ClusterIssuer
  commonName: etcd
  dnsNames:
  - etcd
  - etcd.default
  - etcd.default.svc.cluster.local
  - etcd-0
  - etcd-0.etcd
  - etcd-0.etcd.default
  - etcd-0.etcd.default.svc
  - etcd-0.etcd.default.svc.cluster.local
  - etcd-1
  - etcd-1.etcd
  - etcd-1.etcd.default
  - etcd-1.etcd.default.svc
  - etcd-1.etcd.default.svc.cluster.local
  - etcd-2
  - etcd-2.etcd
  - etcd-2.etcd.default
  - etcd-2.etcd.default.svc
  - etcd-2.etcd.default.svc.cluster.local
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: etcd-client
  namespace: default
spec:
  secretName: etcd-client-tls
  issuerRef:
    name: selfsigned
    kind: ClusterIssuer
  commonName: etcd
  dnsNames:
  - etcd
  - etcd.default
  - etcd.default.svc.cluster.local
  - etcd-0
  - etcd-0.etcd
  - etcd-0.etcd.default
  - etcd-0.etcd.default.svc
  - etcd-0.etcd.default.svc.cluster.local
  - etcd-1
  - etcd-1.etcd
  - etcd-1.etcd.default
  - etcd-1.etcd.default.svc
  - etcd-1.etcd.default.svc.cluster.local
  - etcd-2
  - etcd-2.etcd
  - etcd-2.etcd.default
  - etcd-2.etcd.default.svc
  - etcd-2.etcd.default.svc.cluster.local
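
Assuming the manifest above is saved as certificates.yaml, apply it and confirm that cert-manager has issued both certificates. Once they are ready, the etcd-server-tls and etcd-client-tls secrets referenced by the commented-out volumes in the StatefulSet manifest will exist in the namespace.

$ kubectl apply --filename certificates.yaml
$ kubectl get certificates --namespace default
$ kubectl get secrets etcd-server-tls etcd-client-tls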