Photo by Javier Allegue Barros on Unsplash
Reduce cross-AZ traffic costs on EKS using topology aware hints
From a high availability standpoint, it's considered a best practice to spread workloads across multiple nodes in an EKS cluster. In addition to having multiple replicas of the application, one should also consider spreading the workload across multiple Availability Zones to attain high availability and improve reliability. This ensures fault tolerance and avoids application downtime in the event of a worker node failure. One way to achieve this kind of deployment in EKS is to use podAntiAffinity. For example, the below manifest tells the Kubernetes scheduler to deploy each replica of the Pod on a node that's in a separate Availability Zone (AZ).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spread-az
  labels:
    app: web-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-server
  template:
    metadata:
      labels:
        app: web-server
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-server
            topologyKey: topology.kubernetes.io/zone
      containers:
      - name: web-app
        image: nginx:1.16-alpine
The above manifest makes use of the topologyKey topology.kubernetes.io/zone. It tells the Kubernetes scheduler not to schedule two of these Pods in the same AZ.
Another approach that can be used to spread Pods across AZs is Pod Topology Spread Constraints, which went GA in Kubernetes 1.19. This mechanism aims to spread Pods evenly across node topology domains such as zones.
While both of these approaches provide high availability and resiliency for application workloads, customers incur data transfer costs for inter-AZ traffic within an EKS cluster. For large EKS clusters running hundreds of nodes and thousands of pods, the data transfer costs for cross-AZ traffic can be significant.
Enter Topology Aware Hints
To address cross-AZ data transfer costs (which come up in many EKS conversations on cost optimization), pods running in a cluster must be able to perform topology-aware routing based on Availability Zone. And this is precisely what Topology Aware Hints helps achieve: it provides a mechanism to help keep traffic within the zone it originated from. Prior to Topology Aware Hints, Service topologyKeys could be used for similar functionality; that feature was deprecated in Kubernetes 1.21 in favor of Topology Aware Hints, which was introduced as alpha in Kubernetes 1.21 and graduated to beta in Kubernetes 1.23. With EKS 1.24, the feature is enabled by default, and EKS customers can leverage it to keep Kubernetes Service traffic within the same AZ.
Let's dive in further and see this in action!
For the purposes of this blogpost, let's create a three-node EKS cluster.
Type the following command in your Cloud9 terminal.
cat <<EOF>> eks-config.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: topology-demo-cluster
  region: us-west-2
  version: "1.24"
managedNodeGroups:
- name: appservers
  instanceType: t3.xlarge
  desiredCapacity: 3
  minSize: 1
  maxSize: 4
  labels: { role: appservers }
  volumeSize: 8
  iam:
    withAddonPolicies:
      imageBuilder: true
      autoScaler: true
      xRay: true
      cloudWatch: true
      albIngress: true
  ssh:
    enableSsm: true
EOF
eksctl create cluster -f eks-config.yaml
Once the cluster is created, check the status of the worker nodes and their distribution across AZs.
kubectl get nodes -L topology.kubernetes.io/zone
NAME STATUS ROLES AGE VERSION ZONE
ip-192-168-4-149.us-west-2.compute.internal Ready <none> 36h v1.24.7-eks-fb459a0 us-west-2b
ip-192-168-48-125.us-west-2.compute.internal Ready <none> 36h v1.24.7-eks-fb459a0 us-west-2c
ip-192-168-75-68.us-west-2.compute.internal Ready <none> 36h v1.24.7-eks-fb459a0 us-west-2d
Each worker node in our EKS cluster is deployed in a separate AZ. Let's now run a sample application in this cluster.
Use the below application manifest to deploy three replicas of our sample application in the newly created EKS cluster. The sample container (getazcontainer:latest here) simply responds with the Availability Zone of the node it's running on; if you're following along, substitute your own image reference.
cat <<EOF>> app-manifest.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: topology-demo-ns
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: getaz
  namespace: topology-demo-ns
  labels:
    app: getaz
spec:
  replicas: 3
  selector:
    matchLabels:
      app: getaz
  template:
    metadata:
      labels:
        app: getaz
      namespace: topology-demo-ns
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: getaz
      containers:
      - name: getaz-container
        image: getazcontainer:latest
        imagePullPolicy: Always
        ports:
        - containerPort: 3000
          name: web-port
        resources:
          requests:
            cpu: "256m"
---
apiVersion: v1
kind: Service
metadata:
  name: getazservice
  namespace: topology-demo-ns
spec:
  selector:
    app: getaz
  ports:
  - port: 80
    targetPort: web-port
    protocol: TCP
EOF
kubectl apply -f app-manifest.yaml
The application manifest creates -
a Namespace named "topology-demo-ns"
a Deployment named "getaz" with three Pods. Each Pod runs a container named "getaz-container".
a Service named "getazservice".
The "getaz" Pods and Service "getazservice" all run in the "topology-demo-ns" namespace.
In the above example, we're using Pod topologySpreadConstraints with maxSkew set to 1 and whenUnsatisfiable set to DoNotSchedule to deploy each replica of our sample application in a separate AZ. The example leverages the well-known node label topology.kubernetes.io/zone, which is assigned to worker nodes in an EKS cluster by default, as the topologyKey in the Pod topology spread. To list the labels on a worker node in the EKS cluster that we spun up, use the below command:
$ kubectl describe node ip-192-168-48-125.us-west-2.compute.internal
Name: ip-192-168-48-125.us-west-2.compute.internal
Roles: <none>
Labels: alpha.eksctl.io/cluster-name=topology-demo-cluster
alpha.eksctl.io/nodegroup-name=appservers
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=t3.xlarge
beta.kubernetes.io/os=linux
eks.amazonaws.com/capacityType=ON_DEMAND
eks.amazonaws.com/nodegroup=appservers
eks.amazonaws.com/nodegroup-image=ami-0b149b4c68ab69dce
eks.amazonaws.com/sourceLaunchTemplateId=lt-0a47ee5069d44e8d4
eks.amazonaws.com/sourceLaunchTemplateVersion=1
failure-domain.beta.kubernetes.io/region=us-west-2
failure-domain.beta.kubernetes.io/zone=us-west-2c
k8s.io/cloud-provider-aws=8d60a23f89f8b00a31bfef5d05edc662
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-192-168-48-125.us-west-2.compute.internal
kubernetes.io/os=linux
node.kubernetes.io/instance-type=t3.xlarge
role=appservers
topology.kubernetes.io/region=us-west-2
topology.kubernetes.io/zone=us-west-2c
In the topologySpreadConstraints section of the example manifest:
maxSkew defines the degree to which Pods may be distributed unevenly. This field must be filled out, and the value must be greater than zero. Its semantics vary depending on the value of whenUnsatisfiable field.
whenUnsatisfiable specifies how to handle a Pod placement that does not satisfy the spread constraint:
- DoNotSchedule (the default value) instructs the scheduler not to schedule it.
- ScheduleAnyway instructs the scheduler to continue scheduling it while prioritizing Nodes with the lowest skew (a sketch of this variant follows).
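For illustration only (this sketch is not part of the demo manifest), a looser variant of the same constraint could use ScheduleAnyway so that Pods still get scheduled when the zones are imbalanced, merely preferring the placement with the lowest skew:
# Hypothetical variant, shown only to contrast the two whenUnsatisfiable settings
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway   # prefer, but do not require, an even spread across AZs
  labelSelector:
    matchLabels:
      app: getaz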
Let's check the status and spread of our application pods.
kubectl get po -n topology-demo-ns -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
getaz-9685bbd44-65wcn 1/1 Running 0 2m 192.168.63.154 ip-192-168-48-125.us-west-2.compute.internal <none> <none>
getaz-9685bbd44-kf7gs 1/1 Running 0 2m 192.168.69.57 ip-192-168-75-68.us-west-2.compute.internal <none> <none>
getaz-9685bbd44-tjqkd 1/1 Running 0 2m 192.168.24.149 ip-192-168-4-149.us-west-2.compute.internal <none> <none>
We see from the above output that each replica is running on a separate node, and since each node is in a separate AZ, we effectively have three Pods, each running in a different AZ of the EKS cluster.
For detailed information about topologySpreadConstraints, you can use the kubectl explain Pod.spec.topologySpreadConstraints command. You can mix and match these attributes to achieve different spread topologies.
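For example, to pull up the in-cluster API documentation for a specific field:
kubectl explain pod.spec.topologySpreadConstraints.whenUnsatisfiable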
Let us now check the Service that got created by deploying the app-manifest.yaml file.
kubectl -n topology-demo-ns get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
getazservice ClusterIP 10.100.9.165 <none> 80/TCP 53m
kubectl -n topology-demo-ns describe svc getazservice
Name: getazservice
Namespace: topology-demo-ns
Labels: <none>
Annotations: <none>
Selector: app=getaz
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.100.9.165
IPs: 10.100.9.165
Port: <unset> 80/TCP
TargetPort: web-port/TCP
Endpoints: 192.168.24.149:3000,192.168.63.154:3000,192.168.69.57:3000
Session Affinity: None
Events: <none>
We have a Service named "getazservice" of type ClusterIP deployed. The Service doesn't have any Annotations set on it.
As a next step, let's deploy a test container that we'll use to call "getazservice" and check whether we can spot any inter-AZ calls. Use the below command to deploy a curl container and verify that curl is installed.
kubectl run curl-debug --image=radial/busyboxplus:curl -l "type=debug" -n topology-demo-ns -it -- sh
# check if curl is installed
curl --version
#exit the container
exit
Once the debug container is running, exec back into it and create a small shell script that calls "getazservice" in a loop and prints the Availability Zone of the Pod that responded to each call.
kubectl exec -it -n topology-demo-ns $(kubectl get pod -l "type=debug" -n topology-demo-ns -o jsonpath='{.items[0].metadata.name}') -- sh
#create a test script and call service
cat <<EOF>> test.sh
n=1
while [ \$n -le 5 ]
do
  curl -s getazservice.topology-demo-ns
  sleep 1
  echo "---"
  n=\$(( n+1 ))
done
EOF
chmod +x test.sh
clear
./test.sh
#exit the test container
exit
Running the test script in the debug container should produce output like the below, which shows that calls to the "getazservice" Service are being answered by its backing Pods across different AZs.
us-west-2d---
us-west-2b---
us-west-2d---
us-west-2c---
us-west-2d---
The load-balancing and forwarding logic of the service call in this case depends on the kube-proxy mode. EKS by default runs kube-proxy in "iptables" mode. When the curl-debug container sends the curl request to the "getazservice" virtual IP, the packet is processed by the iptables rules on that worker node, which are configured by kube-proxy, and a Pod backing the "getazservice" Service is then chosen at random by default. For detailed documentation on the different kube-proxy modes (iptables, ipvs), please refer to the Kubernetes documentation.
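If you're curious, you can see this yourself. The commands below are a sketch that assumes you can open a shell on a worker node (for example via SSM, which we enabled on the nodegroup); the exact chain and rule format varies with the kube-proxy version:
# On a worker node: list the NAT rules kube-proxy programmed for the getazservice
# ClusterIP (10.100.9.165 in this walkthrough)
sudo iptables -t nat -L KUBE-SERVICES -n | grep 10.100.9.165
# The matching KUBE-SVC-* chain load-balances across KUBE-SEP-* (one per endpoint)
# chains using "statistic mode random" rules -- the randomness described above.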
To avoid this "randomness" of routing and reduce both the cost of inter-AZ traffic and network latency, topology aware hints can be activated for the Service to ensure that the service call is routed to a Pod that resides in the same AZ as the Pod the request originated from.
To enable topology-aware routing, simply set the service.kubernetes.io/topology-aware-hints annotation to "auto" on the "getazservice" Service as below and re-deploy the manifest.
apiVersion: v1
kind: Service
metadata:
  name: getazservice
  namespace: topology-demo-ns
  annotations:
    service.kubernetes.io/topology-aware-hints: auto
spec:
  selector:
    app: getaz
  ports:
  - port: 80
    targetPort: web-port
    protocol: TCP
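Assuming the updated Service definition was saved back into app-manifest.yaml, re-apply it:
kubectl apply -f app-manifest.yaml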
When we describe the Service, we see the Annotation associated with it.
kubectl -n topology-demo-ns describe svc getazservice
Name: getazservice
Namespace: topology-demo-ns
Labels: <none>
Annotations: service.kubernetes.io/topology-aware-hints: auto
Selector: app=getaz
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.100.9.165
IPs: 10.100.9.165
Port: <unset> 80/TCP
TargetPort: web-port/TCP
Endpoints: 192.168.24.149:3000,192.168.63.154:3000,192.168.69.57:3000
Session Affinity: None
Events: <none>
If we run the same test as before from the debug container, this time we should see output similar to the below:
us-west-2b---
us-west-2b---
us-west-2b---
us-west-2b---
us-west-2b---
This shows that calls to "getazservice" are consistently being picked up by the backing Pod that resides in the same AZ as the requester Pod. Topology aware routing in this case is enabled by the EndpointSlice controller and the kube-proxy components. The EndpointSlice API in Kubernetes provides a way to track network endpoints within a cluster. EndpointSlices offer a more scalable and extensible alternative to Endpoints and have been generally available since Kubernetes 1.21. When calculating the endpoints for a Service that's annotated with service.kubernetes.io/topology-aware-hints: auto, the EndpointSlice controller considers the topology (region and zone) of each Service endpoint and populates the hints field to allocate it to a zone. Once the hints are populated, kube-proxy consumes them and uses them to influence how traffic is routed (favoring topologically closer endpoints).
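To see the hints the controller populated, you can inspect the EndpointSlice backing the Service; the commented excerpt below is an illustrative sketch of a single endpoint entry (addresses and zones will differ in your cluster):
kubectl -n topology-demo-ns get endpointslice -l kubernetes.io/service-name=getazservice -o yaml
# Illustrative excerpt of one endpoint entry with a populated hint:
# endpoints:
# - addresses:
#   - 192.168.63.154
#   conditions:
#     ready: true
#   hints:
#     forZones:
#     - name: us-west-2c
#   zone: us-west-2c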
This solution reduces inter-AZ traffic routing and in turn lowers cross-AZ data transfer costs in an EKS cluster. By enabling "intelligent" routing, it also helps reduce network latency. While this approach works well in most cases, the EndpointSlice controller sometimes allocates endpoints from a different zone to ensure a more even distribution of endpoints between zones, which results in some traffic being routed to other zones. Thus, when using topology aware hints, it's important to keep application Pods balanced across AZs using Topology Spread Constraints to avoid imbalances in the amount of traffic handled by each Pod. Additionally, there are some other safeguards and constraints that one should be aware of before using this approach. As alternative solutions, one can use service mesh technologies like Istio or Linkerd to achieve topology-aware routing; however, service mesh based solutions present additional complexity for cluster operators to manage. In comparison, topology aware hints are much simpler to implement, are supported out-of-the-box in EKS 1.24, and work great for reducing cross-AZ traffic costs within an EKS cluster.
controller allocates endpoints from a different zone to ensure more even distribution of endpoints between zones. This results in some traffic being routed to other zones. Thus, when using Topology-Aware-hints, its important to have application pods balanced across AZs using Topology Spread Constraints to avoid imbalances in the amount of traffic handled by each pod. Additionally, there are some other safeguards and constraints that one should be aware of before using this approach. As alternative solutions, one can use Service Mesh technologies like Istio or Linkerd to achieve topology-aware routing; however service mesh based solutions present additional complexities for the cluster operators to manage. In comparison, using topology-aware-hints is much simpler to implement, is supported out-of-the-box in EKS 1.24 and works great in reducing cross-AZ traffic costs within an EKS cluster.