Scaling Istio

In a large, busy cluster, how do you scale Istio to address two recurring failure modes: the istio-proxy container getting OOM-killed, and istiod crashing when too many istio-proxies connect at once?

Istio-proxy Container OOM-Killed

Problem

If istio-proxy dies, the Pod is cut off from the network, because Istio routes the Pod's ingress and egress traffic through the istio-proxy container. The main application container cannot communicate with other services, and clients cannot reach the application either. This disrupts existing connections and risks cascading failures as load shifts to other replicas.

An out-of-memory (OOM) kill is the number one cause of istio-proxy death. The istio-proxy container is configured with CPU and memory limits to avoid starving other workloads sharing the Kubernetes Node, and it is killed once it exceeds its memory limit.

Restarting istio-proxy won't help: by default, Kubernetes uses the restart policy "Always" for Pods, so an OOM-killed istio-proxy container is restarted automatically. But because the usage pattern has not changed, istio-proxy gets OOM-killed again, producing a crash loop and continued disruption to applications.

Bumping the memory limit is expensive whack-a-mole. Over time, you may have raised the limit from 256Mi to 2Gi per istio-proxy container. With tens of thousands of Pods in the mesh across hundreds of clusters, continually raising the limit gets expensive. Worse, the limit is often only raised after the on-call gets paged about crash-looping Pods, by which point customer traffic is already impacted.
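
For context, these per-workload limit bumps are typically applied through Istio's injection-time resource annotations on the Pod template. A minimal sketch with illustrative values (the Deployment name and numbers are hypothetical):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        # Overrides the injected istio-proxy container's resource requests/limits
        sidecar.istio.io/proxyMemory: "512Mi"
        sidecar.istio.io/proxyMemoryLimit: "2Gi"
        sidecar.istio.io/proxyCPU: "200m"
        sidecar.istio.io/proxyCPULimit: "2"
    spec:
      containers:
      - name: myapp
        image: myapp:latest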

Solution

Use Sidecar object to trim unused xDS config

By default, Istio programs every sidecar proxy with the configuration needed to reach every workload in the mesh and to accept traffic on all the ports associated with its workload.

But if you run a locked-down Istio mesh, where a tenant must explicitly request allow-listing of source namespaces through some onboarding config, then the istio-proxy container does not need the full mesh config.

The Sidecar API object restricts the set of services the proxy can reach. Adopting Sidecar objects reduces both the number of xDS pushes and the overall xDS config size. You could templatize the Sidecar objects and render them from the per-namespace onboarding configs, as sketched after the example below.

Below is an example Sidecar, which allows istio-proxies in the namespace "myapp" to egress only to four other namespaces.

apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: myapp
spec:
  egress:
  - hosts:
    - "istio-system/*"
    - "my-upstream-ns/*"
    - "kube-system/*"
    - "observability/*"

Use Telemetry object to reduce metrics generation

Istio collects and exports a wide range of Prometheus metrics, and metrics generation contributes to istio-proxy's memory usage. The proxy doesn't need to generate every metric, only the ones you actually use, so consider customizing which metrics and tags Istio collects and exports. The Telemetry object below removes unused tags from all metrics and disables several metrics entirely.

---
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: drop-unused-metrics-and-tags
  namespace: istio-system
spec:
  # no selector specified; since this is the root namespace (istio-system), it applies to all workloads in the mesh
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: ALL_METRICS
          tagOverrides:
            connection_security_policy:
              operation: REMOVE
            destination_app:
              operation: REMOVE
            destination_canonical_service:
              operation: REMOVE
            destination_canonical_revision:
              operation: REMOVE
            destination_principal:
              operation: REMOVE
            ...
        - match:
            metric: REQUEST_DURATION
          disabled: true
        - match:
            metric: REQUEST_SIZE
          disabled: true
        - match:
            metric: RESPONSE_SIZE
          disabled: true
        - match:
            metric: TCP_CLOSED_CONNECTIONS
          disabled: true

Istio Ambient Mesh

We can avoid sidecar problems entirely by not running sidecars at all. Istio ambient mesh is a sidecar-less approach to service mesh, replacing sidecar proxies with a per-node proxy and, only where needed, per-namespace waypoint proxies. With far fewer proxies, it saves a lot of CPU and memory and can shorten latency.

The general problems with sidecars and benefits of ambient mesh:

  • Kubernetes did not have first-class support for sidecars until native sidecar containers arrived in k8s 1.28. The app container might start before the proxy is ready, consider itself unhealthy, and get stuck in a restart loop. Short-lived Pods (Jobs) need to explicitly kill the proxy for the Pod to complete.
  • Upgrading Istio requires restarting every Pod to inject the newer-version proxy.
  • Sidecar resource reservations are often underutilized.
  • Namespace quotas (ResourceQuotas) are hard to plan because sidecars are transparent to tenants yet consume the namespace's quota.

If you use Calico to enforce L4 NetworkPolicy for Pods, you might hit a blocker when adopting ambient mesh because of conflicts with the iptables rules that Calico owns (the GitHub issue is still open). I still encourage a proof of concept, though: others (see the GitHub issue) have used eBPF instead of iptables to redirect traffic to the ambient-mode proxies, working around the conflicting Calico iptables rules.
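
If you do evaluate ambient mode, note that enrolling a namespace is a label change rather than a sidecar injection plus rolling restart. A minimal sketch, assuming Istio is installed with the ambient profile and using a hypothetical namespace name:

apiVersion: v1
kind: Namespace
metadata:
  name: myapp
  labels:
    # Capture this namespace's traffic with the ambient data plane (no sidecars)
    istio.io/dataplane-mode: ambient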

Istiod crashes if too many istio-proxies are connected

Problem

Istiod is Istio's control plane, and all istio-proxies connect to it. Istiod may crash when too many istio-proxies are connected, especially when many of them are added at the same time by a tenant workload scaling out.

Most people run istiod as a Deployment with a HorizontalPodAutoscaler (HPA). You could mitigate the scaling issue by setting a high HPA minimum replica count, but that leads to low resource utilization at night and on weekends, at odds with the very purpose of autoscaling. Moreover, istiod is still at risk when tenants scale out aggressively.

Solution

Use discoverySelectors to watch in-mesh Namespaces only

The discoverySelectors mesh configuration dynamically restricts the set of namespaces that are part of the mesh; it declares which namespaces the Istio control plane watches and processes. Since not every tenant namespace enables Istio, istiod benefits from processing fewer Kubernetes events.
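
A minimal sketch via IstioOperator, assuming in-mesh namespaces carry the istio-injection: enabled label (adjust the selector to whatever label your onboarding config applies):

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio
  namespace: istio-system
spec:
  meshConfig:
    # istiod only watches namespaces matching at least one of these selectors
    discoverySelectors:
      - matchLabels:
          istio-injection: enabled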

Fine-tune HPA

Fine-tune the HPA's scaling behavior: keep the scale-down stabilization window at 300 seconds (the Kubernetes default for scale-down) to avoid thrashing, and give scale-up a short stabilization window, for example 10 seconds, so istiod reacts quickly to bursts of newly connected proxies.

 apiVersion: autoscaling/v2beta2
 kind: HorizontalPodAutoscaler
 metadata:
   name: istiod
   namespace: istio-system
   labels:
     app: istiod
     release: istio
     istio.io/rev: system
     install.operator.istio.io/owning-resource: unknown
     operator.istio.io/component: "Pilot"
 spec:
   maxReplicas: 48
-  minReplicas: 32
+  minReplicas: 3
   scaleTargetRef:
     apiVersion: apps/v1
     kind: Deployment
     name: istiod
+  behavior:
+    scaleUp:
+      stabilizationWindowSeconds: 10   # integer seconds; react quickly to scale-out bursts
+    scaleDown:
+      stabilizationWindowSeconds: 300  # Kubernetes default for scale-down
   metrics:
   - type: Resource
     resource:
       name: cpu
       target:
         type: Utilization
         averageUtilization: 65

Distribute istio-proxy connections across Istiod Pods

Istio doesn't explicitly set a default maximum connection time between istio-proxy sidecars and istiod. Typically, the connections from the sidecars to istiod are long-lived gRPC connections used for service discovery, configuration updates, and certificate rotation, and they are expected to stay open for as long as istiod and the sidecars are running. Over time, this creates an uneven distribution of load across istiod Pods.

One idea is to set a maximum idle timeout on the istio-proxy-to-istiod connections, so each proxy reconnects over time and hopefully lands on a different istiod Pod.

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: istio-proxy-to-istiod-timeouts
  namespace: istio-system
spec:
  workloadSelector:
    labels: {}
  configPatches:
    - applyTo: HTTP_ROUTE
      match:
        context: SIDECAR_OUTBOUND
        routeConfiguration:
          vhost:
            name: istiod.istio-system.svc.cluster.local:443
      patch:
        operation: MERGE
        value:
          typed_config:
            '@type': type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
            common_http_protocol_options:
              idle_timeout: 300s