Kube-proxy and mysterious DNS timeouts
This post reviews how iptables-mode kube-proxy works, why some DNS requests to kube-dns
were blackholed, and how to mitigate the issue.
Background: How kube-proxy works
The kube-dns Service uses a label selector to select all CoreDNS Pods. The Service has a ClusterIP. Requests to the ClusterIP are DNAT-ed to one of the CoreDNS Pod IPs. The DNAT is performed by kube-proxy, which runs as a DaemonSet. Kube-proxy is not a real proxy (data plane) but configures the iptables rules and conntrack tables on the Node to implement the DNAT.
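As a sketch, you can inspect these DNAT rules on any Node. The chain names are illustrative (kube-proxy derives them from a hash of the Service and endpoint names), so substitute the ones from your own cluster:

# List the rules kube-proxy programmed for the kube-dns Service.
sudo iptables -t nat -S KUBE-SERVICES | grep kube-dns
# Follow the per-Service chain (name is an example) to the per-endpoint
# chains, which hold the actual "-j DNAT --to-destination <PodIP>:53" rules.
sudo iptables -t nat -S KUBE-SVC-TCOU7JCQXEZGVUNU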
DNS is primarily over UDP. Although UDP is a connectionless protocol, kube-proxy still uses conntrack for UDP to remember the NAT translation applied to each pair of source and destination IP addresses and ports, so that responses can be routed back to the originating Pod.
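For example, you can list the UDP conntrack entries for the kube-dns ClusterIP on a Node (172.20.0.10 is the ClusterIP that appears in the logs later in this post; substitute your own):

# UDP entries whose original destination is the kube-dns ClusterIP; the reply
# source shows which CoreDNS Pod IP the DNAT picked for each flow.
sudo conntrack -L -p udp --orig-dst 172.20.0.10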
When the CoreDNS Deployment had a rolling restart, new CoreDNS Pods came up with new IPs, and old CoreDNS Pods were removed, so their IPs became stale. Thus, kube-proxy had to update the Node's iptables rules and conntrack tables.
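Besides rewriting the iptables rules, kube-proxy deletes the UDP conntrack entries that still point at deleted Pod IPs. This is roughly equivalent to the command below (a sketch; the IPs are taken from the logs later in this post):

# Delete UDP entries destined for the kube-dns ClusterIP that were DNAT-ed to
# a CoreDNS Pod IP that no longer exists. conntrack exits non-zero when no
# matching entry is found.
sudo conntrack -D -p udp --orig-dst 172.20.0.10 --dst-nat 10.6.30.154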
Why Some DNS Requests Were Blackholed
The concurrent restart of the kube-proxy DaemonSet and the CoreDNS Deployment created a race condition. A DaemonSet, including kube-proxy, cannot perform a surge upgrade (i.e., bring up a new Pod on the same Node and then remove the old one), because a DaemonSet guarantees at most one of its Pods per Node. Thus, while kube-proxy was being upgraded, the kubelet had to terminate the existing kube-proxy Pod and then start a new one. In between the delete and the create, new CoreDNS Pods may have come up and old CoreDNS Pods may have been removed. This creates two problems:
- Until the new kube-proxy Pod was up and had ensured the iptables rules and conntrack tables were up to date, traffic to CoreDNS could be routed to a stale Pod IP that no longer exists (the IP of a deleted CoreDNS Pod). Pod IPs are allocated from secondary (alias) IP ranges on the Node, so the destination Node's iptables will simply drop the packet if there is no matching Pod IP.
- In some cases, the new kube-proxy Pod could not remove the stale rules. Some kube-proxy Pods' logs showed (reformatted for readability):
"Failed to delete endpoint connections"
error deleting conntrack entries for udp peer {172.20.0.10, 10.6.30.154},
conntrack command returned:
conntrack v1.4.4 (conntrack-tools):
Operation failed: such conntrack doesn't exist
udp 17 2 src=10.6.12.242 dst=172.20.0.10 sport=42451 dport=53 src=10.6.30.154 dst=10.6.12.242 sport=53 dport=42451 mark=0 use=1
udp 17 28 src=10.6.5.121 dst=172.20.0.10 sport=53669 dport=53 src=10.6.30.154 dst=10.6.5.121 sport=53 dport=53669 mark=0 use=1
udp 17 1 src=10.6.10.175 dst=172.20.0.10 sport=36264 dport=53 src=10.6.30.154 dst=10.6.10.175 sport=53 dport=36264 mark=0 use=1
error message: exit status 1
servicePortName="kube-system/kube-dns:dns"
The source code that produces such an error message is here and has the comment below. The “TODO” is still on the main branch of Kubernetes.
// TODO: Better handling for deletion failure.
// When failure occur, stale udp connection may not get flushed.
// These stale udp connection will keep black hole traffic.
// Making this a best effort operation for now, since it
// is expensive to baby sit all udp connections to kubernetes services.
return fmt.Errorf("error deleting conntrack entries for udp peer {%s, %s}, error: %v", origin, dest, err)
Unfortunately, conntrack has no log files, and the kube-proxy log is not verbose enough to provide more insight into conntrack.
Once kube-proxy gets an error from conntrack, it does not retry, as shown in the source code (1, 2, 3).
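Since conntrack has no logs and kube-proxy does not retry, one way to investigate an affected Node is to re-run the same kind of deletion by hand and inspect the exit status (a sketch; the peer IPs come from the log above):

# Re-run the deletion kube-proxy attempted and capture the exit status;
# "Operation failed: such conntrack doesn't exist" comes with a non-zero exit.
sudo conntrack -D -p udp --orig-dst 172.20.0.10 --dst-nat 10.6.30.154
echo "conntrack exit status: $?"
# Cross-check whether matching entries are actually still present.
sudo conntrack -L -p udp --orig-dst 172.20.0.10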
Mitigations
Cordon and drain all affected Nodes. Find affected Nodes by searching the kube-proxy logs for Failed to delete endpoint connections. To assist forensic analysis, prevent a Node from being deleted by the cluster-autoscaler by annotating it with cluster-autoscaler.kubernetes.io/scale-down-disabled=true (doc).
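A sketch of these triage commands (the kube-proxy label, log window, and node name are assumptions; adjust them to your cluster):

# Find affected Nodes by searching kube-proxy logs for the deletion failure.
kubectl -n kube-system logs -l k8s-app=kube-proxy --prefix --tail=200 \
  | grep "Failed to delete endpoint connections"

# Keep the Node around for forensics, then cordon and drain it.
kubectl annotate node <node-name> cluster-autoscaler.kubernetes.io/scale-down-disabled=true
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data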
Run kube-proxy with verbosity level 4 to get more details about UDP connections, such as why conntrack exited with status 1. See https://github.com/kubernetes/kubernetes/pull/95694
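One way to do this is to append --v=4 to the kube-proxy container's command in the DaemonSet. The patch below is a sketch; it assumes the DaemonSet is named kube-proxy and that verbosity is set via a command-line flag, which varies by distribution:

# Append --v=4 to the first container's command array in the kube-proxy DaemonSet.
kubectl -n kube-system patch ds kube-proxy --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/command/-", "value": "--v=4"}]'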
Per-node DNS monitoring. Deploy node-problem-detector as a DaemonSet in every cluster and build a custom plugin for DNS monitoring. A bonus is that this agent becomes a generalized framework for node-local issue detection; for example, it can also cover machine-learning use cases such as detecting bad TensorCore silicon and ECC errors.
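A node-problem-detector custom plugin is just a script that reports problems via its exit code. The probe below is a minimal sketch; the ClusterIP, timeout, and record name are assumptions to adapt to your cluster:

#!/usr/bin/env bash
# check_cluster_dns.sh - probe the kube-dns ClusterIP from the Node.
# Exit 0 = healthy, exit 1 = problem, following node-problem-detector's
# custom-plugin convention; the stdout line is surfaced in the condition.
CLUSTER_DNS="172.20.0.10"   # kube-dns ClusterIP; substitute your cluster's value.

if dig +time=2 +tries=1 "@${CLUSTER_DNS}" kubernetes.default.svc.cluster.local. >/dev/null 2>&1; then
  echo "in-cluster DNS lookup via ${CLUSTER_DNS} succeeded"
  exit 0
else
  echo "in-cluster DNS lookup via ${CLUSTER_DNS} failed or timed out"
  exit 1
fi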
Deploy node-local-dns. Node-local-dns lets DNS lookups skip iptables DNAT and connection tracking. Connections from the node-local caching agent to the kube-dns Service are upgraded to TCP; TCP conntrack entries are removed on connection close, which reduces the tail latency attributed to dropped UDP packets. It also provides observability of DNS requests at the node level.
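After deploying node-local-dns, a quick sanity check (a sketch; 169.254.20.10 is the link-local address commonly used in the node-local-dns manifest, and the label is an assumption):

# Query the node-local cache directly from a Node.
dig +time=2 @169.254.20.10 kubernetes.default.svc.cluster.local.

# Confirm the caching agent is running on every Node.
kubectl -n kube-system get pods -l k8s-app=node-local-dns -o wide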
If you run EKS, you may be self-managing kube-proxy or using the EKS-managed one. Either way, you are responsible for the lifecycle of kube-proxy. Consider hardening the kube-proxy lifecycle management process with:
Log-based metrics for kube-proxy errors. Alert the person on-call about such errors.
Upgrade kube-proxy only on node pools running a new Kubernetes version. Don't do in-place upgrades of kube-proxy. GKE and AKS treat kube-proxy as a managed component similar to the kubelet and upgrade it only when the node version is upgraded. We should do the same (a quick consistency check is sketched after this list).
Don't upgrade CoreDNS and kube-proxy together. Define a strong dependency and ordering between the two.
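To support the version-alignment point above, a quick consistency check (a sketch): compare each Node's kubelet version with the kube-proxy image tag before and after any rollout, so the two only ever move together.

# Node versions vs. the kube-proxy image currently rolled out.
kubectl get nodes -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion
kubectl -n kube-system get ds kube-proxy \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'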