Layer-4 Load Balancer & Zero-downtime Autoscaling and Upgrade

Your Kubernetes cluster probably has a shared ingress for north-south traffic: requests come in through a cloud load balancer and land on your favorite proxies, such as Envoy, Istio gateways, or Nginx.

If you

  • use a LoadBalancer-type Service to create a Layer-4 Load Balancer fronting your Kubernetes ingress
  • retain source IP address by setting externalTrafficPolicy: Local

Then horizontal autoscaling (scale-in) and rolling upgrades will incur some downtime for you.

This post

  • explains why there is partial disruption, and how much disruption to expect
  • discusses several options to achieve zero downtime upgrade and autoscaling

For simplicity, the rest of the doc assumes Envoy as the ingress gateway.

Background

Layer-4 cloud load balancer

The routing of traffic to Envoy is facilitated by a layer-4 (L4) cloud load balancer, known as a Network Load Balancer (NLB) in AWS terminology. The aws-load-balancer-controller provisions such a load balancer (LB) by watching LoadBalancer-type Service objects in Kubernetes. Each Service object opens a dedicated NodePort on all Nodes in the selected Envoy node pools. Traffic to Envoy is first routed to the NodePort on the Node hosting an Envoy Pod, then DNAT-ed (via iptables) to the Pod on the same Node, as shown in the following diagram.

(image source)
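
For reference, here is a minimal sketch of such a LoadBalancer-type Service. The name, selector, ports, and annotation values are illustrative, not taken from a real deployment; check the aws-load-balancer-controller documentation for the exact annotations your version supports.

apiVersion: v1
kind: Service
metadata:
  name: envoy-ingress    # illustrative name
  annotations:
    # Let the aws-load-balancer-controller (not the legacy in-tree provider) manage this Service.
    service.beta.kubernetes.io/aws-load-balancer-type: external
    # Register worker Nodes (instance targets); IP-mode is discussed later.
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance
spec:
  type: LoadBalancer
  selector:
    app: envoy
  ports:
    - name: https
      port: 443          # NLB frontend port
      targetPort: 8443   # Envoy listener
      # nodePort is allocated automatically (30000-32767 by default) if omitted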

The LB periodically checks the HealthCheck NodePort (see the probe example after this list). The HealthCheck NodePort fails if

  • the Node does not host any target Pods, or
  • none of the target Pods on this Node is ready, determined by the Pod’s readiness probe
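
You can probe this behavior yourself from a shell. This is a sketch: the Service name and the node address are placeholders, and the real port must be read from spec.healthCheckNodePort. kube-proxy answers 200 when the Node has at least one ready local Envoy Pod, and 503 otherwise.

# Find the allocated health check port on the Service (hypothetical name "envoy-ingress")
kubectl get svc envoy-ingress -o jsonpath='{.spec.healthCheckNodePort}'

# Probe a Node: 200 = ready local Pods exist, 503 = none
curl -i http://<node-ip>:<health-check-node-port>/healthz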

externalTrafficPolicy: Local

Suppose the Kubernetes Service object is configured with externalTrafficPolicy: Local. Then kube-proxy directs packets exclusively to Envoy Pods residing on the same Node, even if there are other Nodes running Envoy. This setup has two benefits: one less hop (lower latency) and preservation of the client source IP address (for allowlisting or rate limiting).
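
In manifest form this is a one-line change to the Service spec; once set, Kubernetes also allocates spec.healthCheckNodePort automatically (the port number below is only an example):

 spec:
   type: LoadBalancer
+  externalTrafficPolicy: Local
   # healthCheckNodePort is allocated automatically once the policy is Local,
   # e.g. 32456; this is the port the LB health-checks.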

But externalTrafficPolicy: Local is problematic during rolling upgrades or scale-in. The reason is that traffic arriving at the NodePort will be dropped by kube-proxy if the Node has no ready Envoy Pods. The LB will keep forwarding traffic to this Node until it detects that the HealthCheck NodePort is failing, at which point it marks the Node as unhealthy.

There is a certain delay between two key events in this setup:

  • An Envoy Pod becoming NotReady (for example, if it enters the “Terminating” state during a rolling upgrade).
  • The subsequent periodic health check carried out by the load balancer.

During this delay, client traffic to this Node is blackholed.

(image source)

Partial downtime during upgrade and autoscale-in

Why is there some downtime

As discussed in the previous section, client traffic to an Envoy Node is blackholed between the moment the envoy Pod on that Node enters the Terminating state and the moment the LB performs its next health check. kube-proxy removes the forwarding rules from the NodePort to the Pod as soon as the Pod enters the Terminating state; Kubernetes 1.24 and 1.25 consider the Terminating state as not ready.

For the same reason, horizontal scale-in will also cause downtime. For a while, I was just running Envoy as a DaemonSet on a node pool that does not autoscale.

Why is the downtime partial

This downtime only affects one Node at a time, because the Envoy DaemonSet currently has the following update strategy:

  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 0

Thus, Kubernetes will terminate one Pod at a time, then create a new Pod on the same Node. There are 6 Pods in the DaemonSet (one per Node), so not all envoy Pods are down at the same time.

The reason for maxSurge: 0 is that envoy-ingress Pods run on host networking, which means we cannot have two envoy Pods on the same Node: they would both try to bind the same host ports. Thus, the current update strategy is to kill a Pod, then start a new one.
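
For context, the relevant part of the DaemonSet Pod spec looks roughly like this; the container ports follow the 8080/8443 mapping used later in the NLB table, and the rest is illustrative:

spec:
  hostNetwork: true    # the Pod shares the Node's network namespace
  containers:
    - name: envoy
      ports:
        - containerPort: 8080   # with hostNetwork, these bind directly on the Node,
        - containerPort: 8443   # so a second envoy Pod on the Node would conflict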

Why host networking

Running Envoy with host networking means traffic bypasses the Pod overlay network (normally, each Pod runs in its own network namespace). Thus, host networking avoids the extra network hops and encapsulation overhead of the overlay, which results in lower latency and higher throughput.

But how much performance gain exactly? It depends on many factors like hardware and bandwidth. Cilium published a benchmark (take this marketing with a grain of salt) suggesting host networking could improve throughput by 20% and latency by 25%. They didn't say how many iptables rules (whose cost scales linearly) were on the hosts.

How much downtime

After the NLB detects an unhealthy instance in its target group, it stops creating new connections to that target. However, existing connections are not terminated immediately: they persist until the default 300 s draining (deregistration) timeout expires, or until the client or Envoy sends an RST. Thus, in the worst case, the blackhole period per Pod is 310 seconds.

In practice, the startup time of a new Envoy Pod on the same Node will be shorter than 300 s. The NLB continues health-checking the unhealthy Node and will mark it as healthy again once the new Pod is ready. But for the worst-case analysis, let's assume the blackhole period per Node is 310 seconds.

Given 6 Nodes, the Envoy DaemonSet will exhibit a 16.7% error rate for a total of 310 * 6 = 1860 seconds, or 31 minutes, in the worst case.

The 16.7% error rate comes from the fact that 1 of the 6 Pods is in the Terminating state. Still, 16.7% is an approximation, because another downside of externalTrafficPolicy: Local is that connections may not be distributed evenly, especially if there are long-running connections on the Terminating Pod. NLB does not support a least-connections load-balancing scheme.

Solutions

Use Pod IPs as LB backends

In this case, the NLB sends traffic directly to the Pods selected by the k8s Service. The benefits are:

  • Eliminate the extra network hop (NodePort) through the worker Nodes
  • Allow NLB to keep sending traffic to Pods in the Terminating state but mark the target as Draining

The AWS load balancer controller supports this feature natively as "NLB IP-mode". On other clouds, you can implement such a controller yourself by watching Pod events and reconciling them with the L4 LB target groups.

To enable IP-mode, we just need to update the Service annotations:

-service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance
+service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip

# Health check the Pods directly
+service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: http
+service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "9901"
+service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: /ready

# NLB with IP targets by default does not pass the client source IP address,
# unless we specifically configure the target group attributes.
+service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: preserve_client_ip.enabled=true

To achieve a zero-downtime upgrade, we additionally need to configure a preStop hook on the envoy Pod like the one below. When the Pod enters the Terminating state, k8s executes the preStop hook and keeps the Pod in Terminating until the preStop hook completes.

# We must define a longer terminationGracePeriodSeconds, which by default
# is 30s, upon which the Pod is killed even if preStop has not completed.
terminationGracePeriodSeconds: 305

containers:
  - name: envoy
    lifecycle:
      preStop:
        exec:
          # The default target group attribute
          # “deregistration_delay.timeout_seconds” is 300s, configurable
          # through Service annotation.
          command:
            - /bin/sh
            - -c
            - curl -X POST http://localhost:9901/healthcheck/fail && sleep 300

By failing the envoy health check but keeping envoy running in the Terminating state, envoy can still process traffic. Once the NLB deems the Envoy Pod unhealthy, it stops routing new requests to the Pod but maintains existing connections. Consequently, active TCP connections persist, with client requests continuing to flow to the now-unhealthy NLB target (the Envoy Pod), until either the client or envoy closes the connection or the idle timeout expires (300 seconds by default for NLB).
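
The 300 s figure above is the target group's deregistration delay. If you want a shorter (or longer) drain window, the aws-load-balancer-controller exposes it through the same target-group-attributes annotation used earlier; the 120 s below is just an example, and the preStop sleep and terminationGracePeriodSeconds should be adjusted to match.

+service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: preserve_client_ip.enabled=true,deregistration_delay.timeout_seconds=120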

ProxyTerminatingEndpoints

ProxyTerminatingEndpoints is a new beta feature in Kubernetes version 1.26. It is enabled by default.

When there is a rolling update and a Node only contains terminating Pods, kube-proxy will route traffic to the terminating Pods based on their readiness. At the same time, kube-proxy will actively fail the HealthCheck NodePort if only terminating Pods are available. By doing so, kube-proxy signals to the external load balancer that new connections should not be sent to that Node, while requests on existing connections are still handled gracefully.
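
On clusters older than 1.26, where the gate is not on by default, it can be turned on through the kube-proxy configuration. This is a minimal sketch; how the configuration is delivered (ConfigMap, command-line flags) depends on how your cluster runs kube-proxy.

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
featureGates:
  ProxyTerminatingEndpoints: true

With the feature enabled, the only Pod-side change needed is to keep Envoy alive while the NLB drains, as shown below.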

# We must define a longer terminationGracePeriodSeconds, which by default
# is 30s, upon which the Pod is killed even if preStop has not completed.
terminationGracePeriodSeconds: 305

containers:
  - name: envoy
    lifecycle:
      preStop:
        exec:
          # The default target group attribute
          # “deregistration_delay.timeout_seconds” is 300s, configurable
          # through Service annotation.
          command:
            - /bin/sh
            - -c
            - sleep 300

Note that here we must NOT call POST http://localhost:9901/healthcheck/fail on Envoy, unlike in the NLB IP-mode setup. The reason is that Terminating Pods need to pass the readiness probe to continue receiving traffic, so we cannot fail the envoy health check. Since kube-proxy actively fails the HealthCheck NodePort once only terminating Pods are available on the Node, the NLB will start the draining process on its own.

Customize NLB, keep host networking

Forget about the NodePort and HealthCheck NodePort opened by kube-proxy. We can create the NLB not through a k8s Service object but with infra-as-code tools such as Pulumi, bypassing kube-proxy entirely. The NLB will look like this:

NLB frontend port    Target port (= NodePort = Pod port, because of host networking)
443                  8443
80                   8080

The NLB finds all Nodes running envoy Pods through the autoscaling group of the envoy-ingress node pool, so yes, we can still autoscale with this solution. The setup is similar to NLB IP-mode above, except the NLB is not created by Kubernetes. We need the following Pod spec change.

# We must define a longer terminationGracePeriodSeconds, which by default
# is 30s, upon which the Pod is killed even if preStop has not completed.
terminationGracePeriodSeconds: 305

containers:
  - name: envoy
    lifecycle:
      preStop:
        exec:
          # The default target group attribute
          # “deregistration_delay.timeout_seconds” is 300s, configurable
          # through Service annotation.
          command:
            - /bin/sh
            - -c
            - curl -X POST http://localhost:9901/healthcheck/fail && sleep 300

We also need to expose the “/ready” endpoint from envoy to the host. Then, we need to update the Service annotations like the following.

# Health check the Pods directly through NodePort 9901
+service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: http
+service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "9901"
+service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: /ready
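
To make "/ready" reachable for that health check, the envoy container listens on its admin port on the host (which host networking already provides, as long as the admin interface binds to a reachable address) and can use the same endpoint for its readiness probe. A minimal sketch, with the port and path matching the annotations above; the rest of the container spec is omitted:

containers:
  - name: envoy
    ports:
      - name: admin
        containerPort: 9901   # with hostNetwork, this is the Node port the NLB health-checks
    readinessProbe:
      httpGet:
        path: /ready
        port: 9901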