Iptables basics for container networking

This article is a small dive into iptables. It is useful for anyone just starting out and wanting to understand the basics of iptables, but it is aimed primarily at readers who want to understand kube-proxy and how it uses iptables. I will mainly use the nat table and the iptables rules from a local minikube cluster.

Understanding the iptables flow

To start, the diagram below is all one needs to get started with iptables.

It shows the basic packet flow: which tables and chains a packet traversing the Linux kernel encounters, and in what order.

Iptables has four tables in common use. These four tables are:

  1. raw

  2. mangle

  3. filter

  4. nat

Each of these tables has chains associated with it. The default chains are the ones shown in the diagram above. We are not limited to them, however, and can define our own chains too. Each rule in a chain can carry matches (filters), and a packet that satisfies a rule can then "jump" to the chain named by that rule.

As seen in the diagram, the flow has two splits. Any traffic coming into the system from external sources travels down the top lane. It is processed by the PREROUTING chains of the raw, mangle, and nat tables, which get to operate on the packet before any routing decision has been made. The kernel then decides whether the packet is destined for our device by comparing the destination IP in the packet against the IP addresses assigned to the network interfaces on our device. If the packet is destined for us, it takes the path on the left. If it is not, then depending on whether IP forwarding (the net.ipv4.ip_forward sysctl) is enabled, the packet is either discarded or travels down the right lane instead.
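The routing decision described above can be sketched in a few lines of Python. This is only a simplified model for illustration: the function name and the host addresses are made up, and the real kernel consults its routing tables rather than a plain list of addresses.

```python
import ipaddress

def routing_decision(dst_ip, local_addresses, ip_forward_enabled):
    """Return the lane a packet takes after PREROUTING (simplified model).

    "INPUT"   -> destined for this host, delivered to a local process
    "FORWARD" -> not for this host, and IP forwarding is enabled
    "DROP"    -> not for this host, and IP forwarding is disabled
    """
    dst = ipaddress.ip_address(dst_ip)
    local = {ipaddress.ip_address(addr) for addr in local_addresses}
    if dst in local:
        return "INPUT"
    return "FORWARD" if ip_forward_enabled else "DROP"

# Hypothetical addresses assigned to the host's interfaces.
host_addresses = ["127.0.0.1", "192.168.49.2"]

for dst, forwarding in [("192.168.49.2", True), ("10.244.0.7", True), ("10.244.0.7", False)]:
    print(dst, forwarding, routing_decision(dst, host_addresses, forwarding))
```

Note that the decision is made on the destination address as it looks after PREROUTING, which is why a DNAT rule in that chain can change which lane the packet takes.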

Forwarded packets pass through the FORWARD chains of the mangle and filter tables. The packet can still be modified here, including its destination IP, which is taken into account by the routing decision that follows, as can be seen above. However, the packet has already been marked to leave the device and will no longer reach any application running on our system. This routing decision only determines which interface the packet leaves through. After that, the POSTROUTING chains of the mangle and nat tables get their chance to change the packet. They allow changes to the packet without affecting the path it is going to take.

Packets that were initially routed to our device pass through the INPUT chains of the mangle and filter tables. This is the final chance to modify a packet before it is consumed by a local process. The local process is also where locally generated packets start. They immediately go through a routing decision, after which the OUTPUT chains of the raw, mangle, nat, and filter tables each get a chance to process the packet. Another routing decision follows, and then, as with forwarded packets, the POSTROUTING chains of the mangle and nat tables get a chance to modify the packet.

Half of understanding iptables comes from understanding the above flow of packets. Each table and chain combination operates on the packet at a different stage.
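As a compact summary of the flow described above, the three paths can be written down as data. This is a sketch that only encodes the order given in this article; the path names ("incoming", "forwarded", "outgoing") are labels chosen here for illustration and are not iptables terms.

```python
# Tables consulted inside each built-in chain, in the order described above.
CHAIN_TABLES = {
    "PREROUTING": ["raw", "mangle", "nat"],
    "INPUT": ["mangle", "filter"],
    "FORWARD": ["mangle", "filter"],
    "OUTPUT": ["raw", "mangle", "nat", "filter"],
    "POSTROUTING": ["mangle", "nat"],
}

# Built-in chains visited on each of the three paths through the diagram.
PATHS = {
    "incoming": ["PREROUTING", "INPUT"],                  # external -> local process
    "forwarded": ["PREROUTING", "FORWARD", "POSTROUTING"],  # external -> another host
    "outgoing": ["OUTPUT", "POSTROUTING"],                # local process -> network
}

def traversal(path):
    """List the (chain, table) stages a packet passes on the given path."""
    return [(chain, table) for chain in PATHS[path] for table in CHAIN_TABLES[chain]]

for name in PATHS:
    print(name, traversal(name))
```

One thing this makes easy to see is that locally generated packets never pass through PREROUTING, so any nat rule meant to apply to them has to be hooked into OUTPUT as well.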

The other half comes from understanding the filters and iptables-extensions.

Example rules

Enough theory. I could go on for another few paragraphs explaining things, but honestly, I believe looking at some examples is always the best way to understand.

Below are some rules I took from a local minikube node that has Kubernetes deployed with kube-proxy in its default iptables mode.

  • iptables -t nat -A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING

    The above rule comes from the nat table and is appended to the POSTROUTING chain. Looking at the diagram above, this is the final point before the packet leaves our system. The rule jumps to KUBE-POSTROUTING, where further processing is done. It also carries a comment, which is there purely for humans and does no filtering. Comments are provided by the comment module, which is part of iptables-extensions.

Note: Any match that starts with -m belongs to iptables-extensions. The -m option names the match module to use, and further information about each module can be found in the iptables-extensions man page.
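When reading long rules, it can help to pull out just the table, chain, match modules, and jump target. The helper below is a small reading aid written for this article, not part of iptables, and it only understands the four options used in the rules shown here (-t, -A, -m, -j). It assumes that no option value happens to look like one of those flags.

```python
import shlex

def summarize_rule(rule):
    """Extract the table, chain, match modules and target from a rule string."""
    tokens = shlex.split(rule)  # keeps the quoted comment as a single token
    # iptables uses the filter table when -t is not given.
    summary = {"table": "filter", "chain": None, "modules": [], "target": None}
    for flag, value in zip(tokens, tokens[1:]):
        if flag == "-t":
            summary["table"] = value
        elif flag == "-A":
            summary["chain"] = value
        elif flag == "-m":
            summary["modules"].append(value)
        elif flag == "-j":
            summary["target"] = value
    return summary

rule = ('iptables -t nat -A POSTROUTING -m comment '
        '--comment "kubernetes postrouting rules" -j KUBE-POSTROUTING')
print(summarize_rule(rule))
```

For the rule above, the summary shows a single match module (comment), which confirms that the rule applies to every packet reaching nat POSTROUTING.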

Tracing a rule

External packets

Let's trace a packet and see where the rules lead it. I will be tracing the rules for a deployment with 2 replicas. The deployment is an Nginx web server listening on port 80, and it is exposed to the cluster using a service of type ClusterIP.

  • iptables -t nat -A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES

    The above rule comes from the nat table and is appended to the PREROUTING chain, so it gets to transform packets before any routing decision has been made. It also carries a comment and jumps to the KUBE-SERVICES chain.

  • iptables -t nat -A KUBE-SERVICES -d 10.100.29.239/32 -p tcp -m comment --comment "default/hello-nginx cluster IP" -m tcp --dport 80 -j KUBE-SVC-QV4NJDQ2TL4A6VRO

    Following the previous rule, we end up at this rule. Looking at all the rules in the KUBE-SERVICES chain, it is clear that this is a general chain with an entry for every service. For now, I have selected the rule that is of interest for the deployment we are watching. The rule matches any packet whose destination is 10.100.29.239/32, which is the service IP for the deployment. It also makes sure that the packet is a TCP packet directed to port 80.

  • Here are all three rules in the chain that the above rule jumps to:

    iptables -t nat -A KUBE-SVC-QV4NJDQ2TL4A6VRO ! -s 10.244.0.0/16 -d 10.100.29.239/32 -p tcp -m comment --comment "default/hello-nginx cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ

    This rule marks every packet destined for 10.100.29.239/32 whose source IP is not in the 10.244.0.0/16 CIDR range, which is the pod CIDR range for the cluster. So it is essentially marking every packet coming to the service that did not originate from a pod inside the cluster.

    iptables -t nat -A KUBE-SVC-QV4NJDQ2TL4A6VRO -m comment --comment "default/hello-nginx -> 10.244.0.6:80" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-BF7GKAYFIIMLRAW5

    This rule shows how iptables does load balancing between pods. The statistic match gives the rule a 50% chance of matching. When it does match, as can be seen from the comment, the traffic is sent on to the pod at 10.244.0.6:80.

    iptables -t nat -A KUBE-SVC-QV4NJDQ2TL4A6VRO -m comment --comment "default/hello-nginx -> 10.244.0.7:80" -j KUBE-SEP-5GV5FJNBRYEOQ7QB

    This is the final rule in the chain for our packet. By now we are sure the packet should end up in one of our pods, and only one pod is left. The rule sends the packet on to the pod listening at 10.244.0.7:80. Note how it doesn't have a random match applied like the previous rule. If it did and the match failed, the packet would leave the chain without being directed to any pod, so we would effectively be dropping a packet destined for our service.

  • iptables -t nat -A KUBE-SEP-5GV5FJNBRYEOQ7QB -p tcp -m comment --comment "default/hello-nginx" -m tcp -j DNAT --to-destination 10.244.0.7:80

    Following the final rule, we end up at this rule. It does a DNAT (Destination NAT) and changes the destination to 10.244.0.7:80, which is where our pod is listening. Note that we came here through the PREROUTING chain, so no routing decision had been made for the packet yet. The routing decision that follows sees the destination we have already rewritten to our pod.
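The load-balancing rules above work because each rule is only evaluated if all earlier ones did not match. The sketch below models that with exact fractions. It assumes the pattern seen here for 2 replicas (0.5, then an unconditional rule) extends to n replicas as 1/n, 1/(n-1), and so on down to 1; the function names are made up for this illustration.

```python
from fractions import Fraction

def rule_probabilities(n_endpoints):
    """Probability to put on each rule, in order, for n endpoints."""
    # Rule i is reached only if the i earlier rules did not match, so it
    # has to choose fairly among the (n - i) endpoints that remain.
    return [Fraction(1, n_endpoints - i) for i in range(n_endpoints)]

def overall_chances(probabilities):
    """Chance that a packet ends up at each endpoint, given the rule order."""
    chances = []
    reach = Fraction(1)  # chance that the packet reaches the current rule
    for p in probabilities:
        chances.append(reach * p)
        reach *= 1 - p
    return chances

def pick_endpoint(endpoints, draws):
    """Walk the rules in order; draws are numbers in [0, 1) standing in for randomness."""
    for endpoint, p, draw in zip(endpoints, rule_probabilities(len(endpoints)), draws):
        if draw < p:
            return endpoint
    return None  # unreachable while the last rule has probability 1

print(rule_probabilities(2))                   # the two rules traced above
print(overall_chances(rule_probabilities(3)))  # three replicas, still an even split
print(pick_endpoint(["10.244.0.6:80", "10.244.0.7:80"], [0.7, 0.0]))
```

The last rule always has probability 1, which is the same observation made above: the final rule in the chain must not be allowed to miss.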

Internal Packets

While this is all that's needed to route packets originating from external sources, if you look at our diagram above you will notice that packets originating from local sources never pass through PREROUTING. They pass through OUTPUT instead. And you will indeed find a rule that diverts them to KUBE-SERVICES as well:

iptables -t nat -A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES

The same flow as above will occur for these packets. Since they originate from the node itself and not from a pod, their source address is not in the pod CIDR and they get marked, which makes them a good example for looking at how the mark is used.

For this, we trace the following rule,

  • iptables -t nat -A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING

    This rule comes into play after all the routing decisions have been made and our packet is about to leave the system. We pass the packet to KUBE-POSTROUTING for further processing.

  • iptables -t nat -A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN

    This rule returns early for every packet that does not carry the mark applied earlier. So unmarked packets, such as those originating from pods within the cluster, leave the chain here and skip the rest of the kube-proxy postrouting processing.

    iptables -t nat -A KUBE-POSTROUTING -j MARK --set-xmark 0x4000/0x0

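    This rule clears the mark again. The --set-xmark value/mask option zeroes the bits selected by the mask and then XORs the value into the packet mark. With a mask of 0x0 nothing is zeroed, and XORing 0x4000 into a mark that already has that bit set (only such packets get past the previous rule) flips the bit back to 0. The initial mark applied was 0x4000/0x4000.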

    iptables -t nat -A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE --random-fully

    This rule masquerades every packet that reaches it. This essentially means that any service packet not originating from a pod in the cluster has its source address rewritten to that of the host node, as host nodes act as routers for the cluster within.
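The mark handling above is plain bit arithmetic, and a small model makes it easier to follow. The functions below are illustrative stand-ins written for this article, not iptables APIs. They assume, as noted above, that the mark was first applied with 0x4000/0x4000, and they use the pod CIDR of this particular cluster.

```python
import ipaddress

MASQ_BIT = 0x4000

def set_xmark(mark, value, mask):
    """Model of -j MARK --set-xmark value/mask: zero the mask bits, then XOR value."""
    return (mark & ~mask & 0xFFFFFFFF) ^ value

def mark_matches(mark, value, mask):
    """Model of -m mark --mark value/mask."""
    return (mark & mask) == value

def outside_pod_cidr(src_ip, pod_cidr="10.244.0.0/16"):
    """Model of the '! -s 10.244.0.0/16' match on the KUBE-MARK-MASQ rule."""
    return ipaddress.ip_address(src_ip) not in ipaddress.ip_network(pod_cidr)

def needs_masquerade(src_ip):
    """Follow one service packet from marking to the MASQUERADE rule."""
    mark = 0
    if outside_pod_cidr(src_ip):
        mark = set_xmark(mark, MASQ_BIT, MASQ_BIT)   # mark applied as 0x4000/0x4000
    if not mark_matches(mark, MASQ_BIT, MASQ_BIT):
        return False                                 # "! --mark ... -j RETURN" fires
    mark = set_xmark(mark, MASQ_BIT, 0x0)            # XOR clears the bit again
    assert mark == 0
    return True                                      # packet reaches MASQUERADE

print(needs_masquerade("192.168.49.2"))  # hypothetical node address
print(needs_masquerade("10.244.0.9"))    # hypothetical pod address
```

Because the first --set-xmark uses a mask equal to the value and the second uses a mask of 0x0, only the 0x4000 bit is ever touched, so other bits that may be set in the packet mark are left alone.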

This was but a small primer on iptables. With this knowledge, you are well-equipped to understand most of the rules in iptables now. Of course, there is still the filter table, which is used for Kubernetes load balancer firewall capabilities, which I leave as an exercise for my dear readers.