Linux Network Performance Ultimate Guide

The following content is from my #til github.

Source

Linux Networking stack

Source:

NOTE: The following sections will heavily use sysctl. If you aren't familiar with this command, take a look at the HOWTO#sysctl section.

Linux network packet reception

NOTE: Some NICs are “multiple queues” NICs. The diagram above shows just a single ring buffer for simplicity, but depending on the NIC you are using and your hardware settings you may have multiple queues in the system. Check the Share the load of packet processing among CPUs section for details.

  1. Packet arrives at the NIC

  2. NIC verifies the MAC address (if not in promiscuous mode) and the FCS, and decides to drop the frame or to continue

  3. NIC DMAs (Direct Memory Access) the packets into RAM - into a kernel data structure called an sk_buff or skb (Socket Kernel Buffer - SKB).

  4. NIC enqueues references to the packets at the receive ring buffer queue rx until the rx-usecs timeout or rx-frames is reached. Let's talk about the RX ring buffer:

    • It is a circular buffer, where an overflow simply overwrites existing data.
    • It does not contain packet data. Instead, it consists of descriptors which point to the skbs that were DMA'd into RAM (step 3).

    • Fixed size, FIFO and located in RAM (of course).
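    • The ring size can be checked and, within the hardware limits, resized with ethtool. A minimal sketch, assuming the interface is named eth0 (adjust to your device) and using an illustrative value:
    ethtool -g eth0             # show maximum and current RX/TX ring sizes
    ethtool -G eth0 rx 4096     # illustrative: grow the RX ring toward its hardware maximum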
  5. NIC raises a HardIRQ - Hard Interrupt.

    • HardIRQ: an interrupt from the hardware, known as the “top-half” interrupt.
    • When a NIC receives incoming data, it copies the data into kernel buffers using DMA. The NIC notifies the kernel of this data by raising a HardIRQ. These interrupts are processed by interrupt handlers which do minimal work, as they have already interrupted another task and cannot be interrupted themselves.
    • HardIRQs can be expensive in terms of CPU usage, especially when holding kernel locks. If they take too long to execute, they will leave the CPU unable to respond to other HardIRQs, so the kernel introduces SoftIRQs (Soft Interrupts): the time-consuming part of the HardIRQ handler is moved to a SoftIRQ handler to be processed later. We will talk about SoftIRQs in the next steps.
    • HardIRQs can be seen in /proc/interrupts where each queue has an interrupt vector in the 1st column assigned to it. These are initialized when the system boots or when the NIC device driver module is loaded. Each RX and TX queue is assigned a unique vector, which informs the interrupt handler as to which NIC/queue the interrupt is coming from. The columns represent the number of incoming interrupts as a counter value:
    egrep "CPU0|eth3" /proc/interrupts
        CPU0 CPU1 CPU2 CPU3 CPU4 CPU5
    110:    0    0    0    0    0    0   IR-PCI-MSI-edge   eth3-rx-0
    111:    0    0    0    0    0    0   IR-PCI-MSI-edge   eth3-rx-1
    112:    0    0    0    0    0    0   IR-PCI-MSI-edge   eth3-rx-2
    113:    2    0    0    0    0    0   IR-PCI-MSI-edge   eth3-rx-3
    114:    0    0    0    0    0    0   IR-PCI-MSI-edge   eth3-tx
    
  6. CPU runs the IRQ handler that runs the driver’s code.

  7. The driver schedules NAPI and clears the HardIRQ on the NIC, so that it can generate IRQs for new packet arrivals.

  8. The driver raises a SoftIRQ (NET_RX_SOFTIRQ).

    • Let's talk about the SoftIRQ, also known as the “bottom-half” interrupt. It is a kernel routine which is scheduled to run at a time when other tasks will not be interrupted.
    • Purpose: drain the network adapter's receive (RX) ring buffer.
    • These routines run in the form of ksoftirqd/cpu-number processes and call driver-specific code functions.
    • Check command:
    ps aux | grep ksoftirq
                                                                      # ksoftirqd/<cpu-number>
    root          13  0.0  0.0      0     0 ?        S    Dec13   0:00 [ksoftirqd/0]
    root          22  0.0  0.0      0     0 ?        S    Dec13   0:00 [ksoftirqd/1]
    root          28  0.0  0.0      0     0 ?        S    Dec13   0:00 [ksoftirqd/2]
    root          34  0.0  0.0      0     0 ?        S    Dec13   0:00 [ksoftirqd/3]
    root          40  0.0  0.0      0     0 ?        S    Dec13   0:00 [ksoftirqd/4]
    root          46  0.0  0.0      0     0 ?        S    Dec13   0:00 [ksoftirqd/5]
    
    • Monitor command:
    watch -n1 grep RX /proc/softirqs
    watch -n1 grep TX /proc/softirqs
    
  9. NAPI polls data from the rx ring buffer.

    • NAPI was written to make processing packets from incoming cards more efficient. HardIRQs are expensive because they can't be interrupted, as we noted above. Even with interrupt coalescence (described later in more detail), the interrupt handler would monopolize a CPU core completely. The design of NAPI allows the driver to go into a polling mode instead of raising a HardIRQ for every packet received.
    • Step 1->9 in brief:

    • The polling routine has a budget which determines the CPU time the code is allowed, controlled by the netdev_budget_usecs timeout and the netdev_budget and dev_weight packet counts. This is required to prevent SoftIRQs from monopolizing the CPU. On completion, the kernel will exit the polling routine and re-arm interrupts, then the entire procedure will repeat itself.

    • Let’s talk about netdev_budget_usecs timeout or netdev_budget and dev_weight packets:

      • If the SoftIRQs do not run for long enough, the rate of incoming data could exceed the kernel's capability to drain the buffer fast enough. As a result, the NIC buffers will overflow and traffic will be lost. Occasionally, it is necessary to increase the time that SoftIRQs are allowed to run on the CPU. This is known as the netdev_budget.

        • Check command, the default value is 300. It means the SoftIRQ process can drain up to 300 packets from the NIC before getting off the CPU:
        sysctl net.core.netdev_budget
        net.core.netdev_budget = 300
        
      • netdev_budget_usecs: The maximum number of microseconds in 1 NAPI polling cycle. Polling will exit when either netdev_budget_usecs have elapsed during the poll cycle or the number of packets processed reaches netdev_budget.

        • Check command:
        sysctl net.core.netdev_budget_usecs
        
        net.core.netdev_budget_usecs = 8000
        
      • dev_weight: the maximum number of packets that the kernel can handle on a NAPI interrupt; it is a PER-CPU variable. For drivers that support LRO or GRO_HW, a hardware-aggregated packet is counted as one packet here.

      sysctl net.core.dev_weight
      
      net.core.dev_weight = 64
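
    • Putting the three knobs together: a minimal check-and-tune sketch, with illustrative values only (not recommendations). The third column of /proc/net/softnet_stat (time_squeeze, in hex) counts polls that were cut short by the budget; if it keeps growing, the budget may be too small.
    awk '{print $3}' /proc/net/softnet_stat     # per-CPU time_squeeze counter (hex)
    sysctl -w net.core.netdev_budget=600        # illustrative value
    sysctl -w net.core.netdev_budget_usecs=8000
    sysctl -w net.core.dev_weight=64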
      
  10. Linux also allocates memory to sk_buff.

  11. Linux fills in the metadata: protocol, interface, sets the MAC header, removes the Ethernet header

  12. Linux passes the skb to the kernel stack (netif_receive_skb)

  13. It sets the network header, clones the skb to taps (i.e. tcpdump) and passes it to tc ingress

  14. Packets are handed to a qdisc sized netdev_max_backlog, with its algorithm defined by default_qdisc:

    • netdev_max_backlog: a queue within the Linux kernel where traffic is stored after reception from the NIC, but before processing by the protocol stacks (IP, TCP, etc). There is one backlog queue per CPU core. A given core's queue can grow automatically, containing a number of packets up to the maximum specified by the netdev_max_backlog setting (a check-and-tune sketch follows at the end of this step).
    • In other words, this is the maximum number of packets queued on the INPUT side (the ingress qdisc) when the interface receives packets faster than the kernel can process them.
    • Check command, the default value is 1000.
    sysctl net.core.netdev_max_backlog
    
    net.core.netdev_max_backlog = 1000
    
    • rxqueuelen: Receive Queue Length, a TCP/IP stack network interface value that sets the number of packets allowed per kernel receive queue of a network interface device.
      • By default, the value is 1000 (depending on the network interface driver): ifconfig <interface> | grep rxqueuelen
    • default_qdisc: the default queuing discipline to use for network devices. This allows overriding the default of pfifo_fast with an alternative. Since the default queuing discipline is created without additional parameters, it is best suited to queuing disciplines that work well without configuration, like stochastic fair queue (sfq), CoDel (codel) or fair queue CoDel (fq_codel). Full details for each qdisc are in man tc <qdisc-name> (for example, man tc fq_codel).
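    • A combined check-and-tune sketch for this step, with illustrative values only. The second column of /proc/net/softnet_stat (hex) counts packets dropped because a per-CPU backlog was full; if it grows, consider raising netdev_max_backlog.
    awk '{print $2}' /proc/net/softnet_stat       # per-CPU backlog drop counter (hex)
    sysctl -w net.core.netdev_max_backlog=2000    # illustrative value
    sysctl -w net.core.default_qdisc=fq_codel     # applies to qdiscs created afterwards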
  15. It calls ip_rcv and packets are handled to IP

  16. It calls netfilter (PREROUTING)

  17. It looks at the routing table, if forwarding or local

  18. If it’s local it calls netfilter (LOCAL_IN)

  19. It calls the L4 protocol (for instance tcp_v4_rcv)

  20. It finds the right socket

  21. It goes to the tcp finite state machine

  22. Enqueues the packet to the receive buffer, sized according to the tcp_rmem rules

    • If tcp_moderate_rcvbuf is enabled, the kernel will auto-tune the receive buffer
    • tcp_rmem: Contains 3 values that represent the minimum, default and maximum size of the TCP socket receive buffer.
      • min: minimal size of receive buffer used by TCP sockets. It is guaranteed to each TCP socket, even under moderate memory pressure. Default: 4 KB.
      • default: initial size of receive buffer used by TCP sockets. This value overrides net.core.rmem_default used by other protocols. Default: 131072 bytes. This value results in initial window of 65535.
      • max: maximal size of receive buffer allowed for automatically selected receiver buffers for TCP sockets. This value does not override net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables automatic tuning of that socket's receive buffer size, in which case this value is ignored. SO_RCVBUF sets a fixed size for the TCP receive buffer; it overrides tcp_rmem, and the kernel will no longer dynamically adjust the buffer. The maximum value set by SO_RCVBUF cannot exceed net.core.rmem_max. Normally, we will not use it. Default: between 131072 bytes and 6 MB, depending on RAM size.
    • net.core.rmem_max: the upper limit of the TCP receive buffer size.
      • Between net.core.rmem_max and net.ipv4.tcp_rmem's max value, the bigger value takes precedence.
      • Increase this buffer to enable scaling to a larger window size. Larger windows increase the amount of data that can be transferred before an acknowledgement (ACK) is required. This reduces overall latencies and results in increased throughput.
      • This setting is typically set to a very conservative value of 262,144 bytes. It is recommended this value be set as large as the kernel allows. 4.x kernels accept values over 16 MB. A small sketch follows below.
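    • A small sketch that ties these settings together; the values are examples only, not recommendations, and assume you have verified that receive buffers are actually a bottleneck.
    sysctl net.ipv4.tcp_rmem net.core.rmem_max       # current values
    sysctl net.ipv4.tcp_moderate_rcvbuf              # 1 = kernel auto-tunes between min and max
    sysctl -w net.core.rmem_max=16777216             # illustrative: allow up to 16 MB
    sysctl -w net.ipv4.tcp_rmem="4096 131072 16777216"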
  23. Kernel signals that there is data available to apps (epoll or any polling system)

  24. Application wakes up and reads the data

Linux kernel network transmission

Although simpler than the ingress logic, the egress is still worth acknowledging.

  1. Application sends message (sendmsg or other)

  2. TCP send message allocates an sk_buff

  3. It enqueues skb to the socket write buffer of tcp_wmem size

    • tcp_wmem: Contains 3 values that represent the minimum, default and maximum size of the TCP socket send buffer.
      • min: amount of memory reserved for send buffers for TCP sockets. Each TCP socket has rights to use it due to fact of its birth. Default: 4K
      • default: initial size of send buffer used by TCP sockets. This value overrides net.core.wmem_default used by other protocols. It is usually lower than net.core.wmem_default. Default: 16K
      • max: maximal amount of memory allowed for automatically tuned send buffers for TCP sockets. This value does not override net.core.wmem_max. Calling setsockopt() with SO_SNDBUF disables automatic tuning of that socket’s send buffer size, in which case this value is ignored. SO_SNDBUF sets the fixed size of the send buffer, it will override tcp_wmem, and the kernel will no longer dynamically adjust the buffer. The maximum value set by SO_SNDBUF cannot exceed net.core.wmem_max. Normally, we will not use it. Default: between 64K and 4MB, depending on RAM size.
    • Check command:
    sysctl net.ipv4.tcp_wmem
    net.ipv4.tcp_wmem = 4096        16384   262144
    
    • The size of the TCP send buffer will be dynamically adjusted between min and max by the kernel. The initial size is default.
    • net.core.wmem_max: the upper limit of the TCP send buffer size. Similar to net.core.rmem_max (but for transmission). A matching sketch follows below.
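    • A matching sketch for the send side, again with illustrative values only; to persist them, put the same keys in /etc/sysctl.conf and reload with sysctl -p (see the HOWTO#sysctl section).
    sysctl net.ipv4.tcp_wmem net.core.wmem_max       # current values
    sysctl -w net.core.wmem_max=16777216             # illustrative: allow up to 16 MB
    sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"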
  4. Builds the TCP header (src and dst port, checksum)

  5. Calls L3 handler (in this case ipv4 on tcp_write_xmit and tcp_transmit_skb)

  6. L3 (ip_queue_xmit) does its work: build ip header and call netfilter (LOCAL_OUT)

  7. Calls output route action

  8. Calls netfilter (POST_ROUTING)

  9. Fragment the packet (ip_output)

  10. Calls L2 send function (dev_queue_xmit)

  11. Feeds the output queue (qdisc) of txqueuelen length, with its algorithm defined by default_qdisc:

    • txqueuelen: Transmit Queue Length, is a TCP/IP stack network interface value that sets the number of packets allowed per kernel transmit queue of a network interface device.
      • By default, the value is 1000 (depending on the network interface driver): ifconfig <interface> | grep txqueuelen
    • default_qdisc: the default queuing discipline to use for network devices. This allows overriding the default of pfifo_fast with an alternative. Since the default queuing discipline is created without additional parameters, it is best suited to queuing disciplines that work well without configuration, like stochastic fair queue (sfq), CoDel (codel) or fair queue CoDel (fq_codel). Full details for each qdisc are in man tc <qdisc-name> (for example, man tc fq_codel). A short sketch follows below.
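    • A short sketch, assuming the interface is named eth0 and using illustrative values; note that a new default_qdisc only applies to qdiscs created afterwards (e.g. after a reboot or after replacing the qdisc with tc).
    ip link show dev eth0 | grep -o 'qlen [0-9]*'    # current txqueuelen
    ip link set dev eth0 txqueuelen 2000             # illustrative value
    sysctl -w net.core.default_qdisc=fq_codel
    tc qdisc show dev eth0                           # verify which qdisc is in use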
  12. The driver code enqueues the packets at the tx ring buffer

  13. The driver will raise a soft IRQ (NET_TX_SOFTIRQ) after the tx-usecs timeout or tx-frames is reached (see the coalescing check below)
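    • These coalescing parameters can be inspected and changed with ethtool. A minimal sketch, assuming an interface named eth0 and an illustrative value (support varies by driver):
    ethtool -c eth0                  # show current coalescing settings
    ethtool -C eth0 tx-usecs 50      # illustrative value only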

  14. Re-enable hard IRQ to NIC

  15. Driver will map all the packets (to be sent) to some DMA’ed region

  16. NIC fetches the packets (via DMA) from RAM to transmit

  17. After the transmission NIC will raise a hard IRQ to signal its completion

  18. The driver will handle this IRQ (turn it off)

  19. And schedule (soft IRQ) the NAPI poll system

  20. NAPI will handle the receive packets signaling and free the RAM

Network Performance tuning

Tuning a NIC for optimum throughput and latency is a complex process with many factors to consider. There is no generic configuration that can be broadly applied to every system.

There are several factors that should be considered for network performance tuning. Note that the interface card name may be different on your device; change it to the appropriate value.

Ok, let's follow the packet reception (and transmission) path and do some tuning.

Quick HOWTO

/proc/net/softnet_stat & /proc/net/sockstat

Before we continue, let's discuss /proc/net/softnet_stat & /proc/net/sockstat, as these files will be used a lot from here on.

cat /proc/net/softnet_stat

0000272d 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
000034d9 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000001
00002c83 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000002
0000313d 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000003
00003015 00000000 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000004
000362d2 00000000 000000d2 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000005
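
Each row of /proc/net/softnet_stat is one CPU and the values are hexadecimal. On common kernels the first column is packets processed, the second is packets dropped because the backlog queue was full, and the third is time_squeeze (polls cut short by the NAPI budget); newer kernels append extra columns, so double-check against your kernel's documentation. A small helper, assuming that layout:

awk '{print "cpu", NR-1, "processed", $1, "dropped", $2, "squeezed", $3}' /proc/net/softnet_stat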
cat /proc/net/sockstat

sockets: used 937
TCP: inuse 21 orphan 0 tw 0 alloc 22 mem 5
UDP: inuse 9 mem 5
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0

ss

ss -tm

#  -m, --memory
#         Show socket memory usage. The output format is:

#         skmem:(r<rmem_alloc>,rb<rcv_buf>,t<wmem_alloc>,tb<snd_buf>,
#                       f<fwd_alloc>,w<wmem_queued>,o<opt_mem>,
#                       bl<back_log>,d<sock_drop>)

#         <rmem_alloc>
#                the memory allocated for receiving packet

#         <rcv_buf>
#                the total memory can be allocated for receiving packet

#         <wmem_alloc>
#                the memory used for sending packet (which has been sent to layer 3)

#         <snd_buf>
#                the total memory can be allocated for sending packet

#         <fwd_alloc>
#                the  memory  allocated  by  the  socket as cache, but not used for receiving/sending packet yet. If need memory to send/receive packet, the memory in this
#                cache will be used before allocate additional memory.

#         <wmem_queued>
#                The memory allocated for sending packet (which has not been sent to layer 3)

#         <ropt_mem>
#                The memory used for storing socket option, e.g., the key for TCP MD5 signature

#         <back_log>
#                The memory used for the sk backlog queue. On a process context, if the process is receiving packet, and a new packet is received, it will be put into  the
#                sk backlog queue, so it can be received by the process immediately

#         <sock_drop>
#                the number of packets dropped before they are de-multiplexed into the socket

#  -t, --tcp
#         Display TCP sockets.

State       Recv-Q Send-Q        Local Address:Port        Peer Address:Port
ESTAB       0      0             192.168.56.102:ssh        192.168.56.1:56328
skmem:(r0,rb369280,t0,tb87040,f0,w0,o0,bl0,d0)

# rcv_buf: 369280 bytes
# snd_buf: 87040 bytes

netstat

sysctl

echo "value" > /proc/sys/location/variable
# To display a list of available sysctl variables
sysctl -a | less
# To only list specific variables use
sysctl variable1 [variable2] [...]
# To change a value temporarily use the sysctl command with the -w option:
sysctl -w variable=value
# To override the value persistently, the /etc/sysctl.conf file must be changed. This is the recommend method. Edit the /etc/sysctl.conf file.
vi /etc/sysctl.conf
# Then add/change the value of the variable
variable = value
# Save the changes and close the file. Then use the -p option of the sysctl command to load the updated sysctl.conf settings:
sysctl -p              # or: sysctl -p /etc/sysctl.conf
# The updated sysctl.conf values will now be applied when the system restarts.

The NIC Ring Buffer

Interrupt Coalescence (IC) - rx-usecs, tx-usecs, rx-frames, tx-frames (hardware IRQ)

IRQ Affinity

systemctl status irqbalance.service
systemctl stop irqbalance.service
# Determine the IRQ number associated with the Ethernet driver
grep eth0 /proc/interrupts

32:   0     140      45       850264      PCI-MSI-edge      eth0

# IRQ 32
# Check the current value
# The default value is 'f', meaning that the IRQ can be serviced
# on any of the CPUs
cat /proc/irq/32/smp_affinity

f

# CPU0 is the only CPU used
echo 1 > /proc/irq/32/smp_affinity
cat /proc/irq/32/smp_affinity

1

# Commas can be used to delimit smp_affinity values for discrete 32-bit groups
# This is required on systems with more than 32 cores
# For example, IRQ  40 is serviced on all cores of a 64-core system
cat /proc/irq/40/smp_affinity

ffffffff,ffffffff

# To service IRQ 40 on only the upper 32 cores
echo 0xffffffff,00000000 > /proc/irq/40/smp_affinity
cat /proc/irq/40/smp_affinity

ffffffff,00000000

Share the load of packet processing among CPUs

Source:

Once upon a time, everything was so simple. The network card was slow and had only one queue. When packets arrive, the network card copies packets through DMA and sends an interrupt, and the Linux kernel harvests those packets and completes interrupt processing. As network cards became faster, the interrupt-based model could cause an IRQ storm due to the massive number of incoming packets. This would consume most of the CPU power and freeze the system. To solve this problem, NAPI (interrupt and polling) was proposed. When the kernel receives an interrupt from the network card, it starts to poll the device and harvest packets in the queues as fast as possible. NAPI works nicely with the 1Gbps network cards which are common nowadays. However, when it comes to 10Gbps, 20Gbps, or even 40Gbps network cards, NAPI may not be sufficient. Those cards would demand much faster CPUs if we still used one CPU and one queue to receive packets. Fortunately, multi-core CPUs are popular now, so why not process packets in parallel?

Receive-side scaling (RSS)

Receive Packet Steering (RPS)

/sys/class/net/<dev>/queues/rx-<n>/rps_cpus

# This file implements a bitmap of CPUs
# 0 (default): disabled

Receive Flow Steering (RFS)

sysctl -w net.core.rps_sock_flow_entries=32768
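
RFS also needs a per-queue flow table alongside rps_sock_flow_entries. A minimal sketch, assuming an interface named eth0; a common convention is to set each queue's rps_flow_cnt to rps_sock_flow_entries divided by the number of RX queues:

echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt    # repeat for each rx-<n> queue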

Accelerated Receive Flow Steering (aRFS)

Interrupt Coalescing (soft IRQ)

Ingress QDisc

Egress Disc - txqueuelen and default_qdisc

TCP Read and Write Buffers/Queues

TCP FSM and congestion algorithm

Accept and SYN queues are governed by net.core.somaxconn and net.ipv4.tcp_max_syn_backlog. Nowadays net.core.somaxconn caps both queue sizes.
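
A minimal check-and-tune sketch, with illustrative values only; for a listening socket, ss reports the accept queue as Recv-Q (current) and Send-Q (limit). Remember that the application must also pass a large enough backlog to listen() for a bigger somaxconn to matter:

sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog
sysctl -w net.core.somaxconn=4096            # illustrative value
sysctl -w net.ipv4.tcp_max_syn_backlog=4096  # illustrative value
ss -lnt                                      # watch Recv-Q approach Send-Q on listeners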

NUMA

ls -ld /sys/devices/system/node/node*

drwxr-xr-x. 3 root root 0 Aug 15 19:44 /sys/devices/system/node/node0
drwxr-xr-x. 3 root root 0 Aug 15 19:44 /sys/devices/system/node/node1
cat /sys/devices/system/node/node0/cpulist

0-5

cat /sys/devices/system/node/node1/cpulist
# empty
systemctl stop irqbalance
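
To keep packet processing local to the NIC's NUMA node, first find which node the device sits on and which CPUs belong to it; a small sketch, assuming an interface named eth0 (a value of -1 means the device reports no NUMA affinity):

cat /sys/class/net/eth0/device/numa_node
cat /sys/devices/system/node/node0/cpulist   # CPUs to pin the NIC's IRQs to (see IRQ Affinity)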

Further more - Packet processing

This section is an advanced one. It introduces some advanced modules/frameworks for achieving high performance.

AF_PACKET v4

Source:

PACKET_MMAP

Source:

Kernel bypass: Data Plane Development Kit (DPDK)

Source:

PF_RING

Source:

Programmable packet processing: eXpress Data Path (XDP)

Source: