Scaling Istio
In a large, busy cluster, how do you scale Istio to address two recurring problems: the istio-proxy container getting OOM-killed, and istiod crashing when too many istio-proxies connect?
Istio-proxy Container OOM-Killed
Problem
If istio-proxy dies, the Pod is cut off from the world, because Istio routes the Pod's ingress and egress through the istio-proxy container. The main application container can no longer reach other services, and clients cannot reach the application either. This disrupts existing connections and risks cascading failures as load shifts to other replicas.
An out-of-memory (OOM) kill is the number-one cause of istio-proxy deaths. The istio-proxy container is configured with CPU and memory limits so it does not starve other workloads sharing the Kubernetes Node, and it is killed once it exceeds its memory limit.
Restarting istio-proxy won't help. Kubernetes uses the restart policy "Always" for Pods by default, so an OOM-killed istio-proxy container is restarted automatically. But because the usage pattern has not changed, it gets OOM-killed again, forming a crash loop and continued disruption to applications.
Repeatedly bumping the memory limit is expensive whack-a-mole. Over time you might raise the limit from 256Mi to 2Gi, per istio-proxy container; with tens of thousands of Pods in the mesh across hundreds of clusters, that adds up quickly. Worse, the limit is often only raised after the on-call gets paged about crash-looping Pods, which have already impacted customer traffic.
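As a rough sketch of what that whack-a-mole looks like in practice, the per-workload bump is often done with the sidecar.istio.io resource annotations on the Pod template. The Deployment below is hypothetical and only the annotations matter; it assumes your sidecar injection honors these standard annotations.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        # Override the injected istio-proxy resources for this workload only.
        # Multiply overrides like this across thousands of Deployments and the cost adds up.
        sidecar.istio.io/proxyMemory: "256Mi"
        sidecar.istio.io/proxyMemoryLimit: "2Gi"
    spec:
      containers:
      - name: myapp
        image: myapp:latest  # placeholder image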
Solution
Use Sidecar object to trim unused xDS config
By default, Istio programs every sidecar proxy with the configuration needed to reach every workload in the mesh and to accept traffic on all of the workload's ports.
But if you run a locked-down Istio mesh, where a tenant must request allow-listing of each source namespace through some onboarding config, then the istio-proxy container does not need the full mesh config.
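For context, such a lock-down is often expressed with an AuthorizationPolicy that only admits allow-listed source namespaces. The sketch below is hypothetical (namespace and policy names are placeholders), but it shows why the set of services each tenant needs to reach is already known from onboarding.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-listed-sources
  namespace: my-upstream-ns
spec:
  action: ALLOW
  rules:
  - from:
    - source:
        # Only the namespaces a tenant requested during onboarding may call in.
        namespaces: ["myapp"]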
The Sidecar API object can restrict the set of services that the proxy can reach. Adopting Sidecar objects reduces both the number of xDS pushes and the overall xDS config size. You could templatize the Sidecar objects and render them from the per-namespace onboarding configs (a template sketch follows the example below).
Below is an example Sidecar that allows istio-proxies in the "myapp" namespace to egress only to four other namespaces.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: myapp
spec:
  egress:
  - hosts:
    - "istio-system/*"
    - "my-upstream-ns/*"
    - "kube-system/*"
    - "observability/*"
Use Telemetry object to reduce metrics generation
Istio collects and exports a wide range of Prometheus metrics, and generating them contributes to istio-proxy memory usage. The proxy only needs to produce the metrics you actually use, so consider customizing which metrics and tags Istio collects and exports.
---
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: drop-unused-metrics-and-tags
  namespace: istio-system
spec:
  # no selector specified, applies to all workloads in the namespace
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - match:
        metric: ALL_METRICS
      tagOverrides:
        connection_security_policy:
          operation: REMOVE
        destination_app:
          operation: REMOVE
        destination_canonical_service:
          operation: REMOVE
        destination_canonical_revision:
          operation: REMOVE
        destination_principal:
          operation: REMOVE
        # ...
    - match:
        metric: REQUEST_DURATION
      disabled: true
    - match:
        metric: REQUEST_SIZE
      disabled: true
    - match:
        metric: RESPONSE_SIZE
      disabled: true
    - match:
        metric: TCP_CLOSED_CONNECTIONS
      disabled: true
Istio Ambient Mesh
We can avoid sidecar problems entirely by not running sidecars at all. Istio ambient mesh is a sidecar-less approach to service mesh that replaces sidecar proxies with per-node proxies and, only where necessary, per-namespace waypoint proxies. With far fewer proxies, it saves a lot of CPU and memory and shortens latency.
The general problems with sidecars and benefits of ambient mesh:
- Kubernetes did not have first-class support for sidecar containers until 1.28. The app container might start before the proxy is ready, decide it is unhealthy, and enter a restart loop. Short-lived Pods (Jobs) must explicitly kill the proxy for the Pod to complete.
- Upgrading Istio requires restarting every Pod to inject the newer-version proxy.
- Sidecar resources are often underutilized.
- Namespace quotas (ResourceQuotas) are hard to calculate, because sidecars are transparent to tenants yet count against the namespace's quota.
If you use Calico to enforce L4 NetworkPolicy for Pods, you might hit a blocker to adopting ambient mesh because of conflicts with the iptables rules that Calico owns (the GitHub issue is still open). I still encourage you to run another proof of concept, because someone (in another GitHub issue) used eBPF instead of iptables to redirect traffic to the ambient-mode proxies, working around the conflicting Calico iptables rules.
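If you do run that proof of concept, enrolling a namespace in ambient mode is just a namespace label rather than sidecar injection. A minimal sketch, assuming the istio.io/dataplane-mode label used by ambient mesh:
apiVersion: v1
kind: Namespace
metadata:
  name: myapp
  labels:
    # Pods in this namespace are captured by the per-node ambient proxies;
    # no istio-injection label and no Pod restarts for sidecar injection are needed.
    istio.io/dataplane-mode: ambient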
Istiod crashes if too many istio-proxies connect
Problem
Istiod is the control plane of Istio, and every istio-proxy connects to it. Istiod may crash when too many istio-proxies are connected, especially when many of them are added at the same time by a tenant workload scaling out.
Most people run istiod as a Deployment with a HorizontalPodAutoscaler (HPA). You could mitigate the scaling issue by setting a high HPA minimum, but that leads to low resource utilization at night and on weekends, at odds with the very purpose of autoscaling. Moreover, istiod is still at risk when tenants scale out aggressively.
Solution
Use discoverySelectors to watch in-mesh Namespaces only
The discoverySelectors configuration dynamically restricts the set of namespaces that are part of the mesh; it declares what the Istio control plane watches and processes. Since not every tenant namespace enables Istio, istiod benefits from having fewer Kubernetes events to process.
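A minimal sketch of wiring this up through MeshConfig, assuming in-mesh namespaces carry a label such as istio-discovery: enabled (the label name is a placeholder your onboarding automation would apply):
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio
  namespace: istio-system
spec:
  meshConfig:
    # istiod only watches and processes namespaces matching these selectors,
    # instead of every namespace in the cluster.
    discoverySelectors:
    - matchLabels:
        istio-discovery: enabled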
Fine-tune HPA
Tune the HPA behavior so istiod scales up quickly: set a short scale-up stabilization window (for example, 10 seconds) so replicas are added as soon as load spikes, and keep the scale-down stabilization window at 300 seconds, the Kubernetes default for scale-down, to avoid thrashing.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: istiod
  namespace: istio-system
  labels:
    app: istiod
    release: istio
    istio.io/rev: system
    install.operator.istio.io/owning-resource: unknown
    operator.istio.io/component: "Pilot"
spec:
  maxReplicas: 48
- minReplicas: 32
+ minReplicas: 3
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: istiod
+ behavior:
+   scaleUp:
+     stabilizationWindowSeconds: 10   # react to scale-out events quickly
+   scaleDown:
+     stabilizationWindowSeconds: 300  # the Kubernetes default for scale-down; avoids thrashing
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 65
Distribute istio-proxy connections across Istiod Pods
Istio doesn't explicitly set a default maximum lifetime for the connections between istio-proxy sidecars and istiod. These are long-lived gRPC connections used for service discovery, configuration updates, and certificate rotation, and they are expected to stay up for as long as istiod and the sidecars are running. Over time this produces an uneven distribution of load across istiod Pods: proxies pile up on the older Pods while newly scaled-up Pods sit mostly idle.
One idea is to set an idle timeout on the istio-proxy-to-istiod traffic so that proxies reconnect over time, hopefully landing on different istiod Pods.
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: istio-proxy-to-istiod-timeouts
  namespace: istio-system
spec:
  workloadSelector:
    labels: {}
  configPatches:
  - applyTo: HTTP_ROUTE
    match:
      context: SIDECAR_OUTBOUND
      routeConfiguration:
        vhost:
          name: istiod.istio-system.svc.cluster.local:443
    patch:
      operation: MERGE
      value:
        # HTTP_ROUTE patches merge into the Route proto, so set the route-level
        # idle timeout here rather than HttpConnectionManager options.
        route:
          idle_timeout: 300s