A Network Latency Problem Caused by Interrupts Being Disabled for Too Long

Recently we ran into a problem in production: one machine was showing intermittent network latency:

# ping -i 0.05 192.168.0.5
#...
64 bytes from 192.168.0.5: icmp_seq=673 ttl=63 time=0.203 ms
64 bytes from 192.168.0.5: icmp_seq=674 ttl=63 time=0.210 ms
64 bytes from 192.168.0.5: icmp_seq=675 ttl=63 time=0.218 ms
64 bytes from 192.168.0.5: icmp_seq=676 ttl=63 time=0.233 ms
64 bytes from 192.168.0.5: icmp_seq=677 ttl=63 time=406 ms
64 bytes from 192.168.0.5: icmp_seq=678 ttl=63 time=354 ms
64 bytes from 192.168.0.5: icmp_seq=679 ttl=63 time=302 ms
64 bytes from 192.168.0.5: icmp_seq=680 ttl=63 time=251 ms
64 bytes from 192.168.0.5: icmp_seq=681 ttl=63 time=199 ms
64 bytes from 192.168.0.5: icmp_seq=682 ttl=63 time=147 ms
64 bytes from 192.168.0.5: icmp_seq=683 ttl=63 time=94.8 ms
64 bytes from 192.168.0.5: icmp_seq=684 ttl=63 time=43.0 ms
64 bytes from 192.168.0.5: icmp_seq=685 ttl=63 time=0.216 ms
64 bytes from 192.168.0.5: icmp_seq=686 ttl=63 time=0.248 ms
#...

Pinging at a 50 ms interval, we saw occasional latency spikes of more than 400 ms, but no packet loss at all.
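
Before diving into the kernel, it helps to quantify how often the spikes occur. A small filter over the saved ping output does the job (a quick sketch; ping.log, the 100 ms threshold, and the regex are assumptions based on the output format shown above):

import re
import sys

# Count latency spikes in saved ping output, e.g. produced with:
#   ping -i 0.05 192.168.0.5 | tee ping.log
THRESHOLD_MS = 100.0

total = spikes = 0
worst = 0.0
with open(sys.argv[1] if len(sys.argv) > 1 else "ping.log") as f:
    for line in f:
        m = re.search(r"time=([\d.]+) ms", line)
        if not m:
            continue
        rtt = float(m.group(1))
        total += 1
        worst = max(worst, rtt)
        if rtt > THRESHOLD_MS:
            spikes += 1

print(f"{spikes}/{total} replies above {THRESHOLD_MS} ms, worst {worst:.1f} ms")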

The first thought was to analyze the problem with nettrace, an eBPF-based network tracing tool open-sourced by Tencent that can trace the function calls a packet goes through in the kernel network stack, in the hope of seeing whether the latency was being spent inside the kernel's network protocol stack. According to the ping output, seq=677 had the longest response time, so we filtered out all of that packet's function calls, from the stack receiving the request to it sending the reply:

# nettrace -p icmp
#...
***************** d8061b00 ***************
[24583464.629102] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629114] [dev_gro_receive     ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629248] [__netif_receive_skb_core] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629253] [packet_rcv          ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629256] [tcf_classify        ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629262] [__netif_receive_skb_core] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629268] [ip_rcv              ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629271] [ip_rcv_core         ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629276] [nf_hook_slow        ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392 *ipv4 in chain: PRE_ROUTING*
[24583464.629284] [ip_route_input_slow ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629293] [fib_validate_source ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629299] [ip_local_deliver    ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629301] [nf_hook_slow        ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392 *ipv4 in chain: INPUT*
[24583464.629304] [nft_do_chain        ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392 *iptables table:filter, chain:INPUT*
[24583464.629311] [ip_local_deliver_finish] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629316] [icmp_rcv            ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629321] [icmp_echo           ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629325] [icmp_reply          ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629413] [consume_skb         ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392 *packet is freed (normally)*
#...
***************** a3faf500,a3faec00,a3faf800 ***************
[24583464.629343] [__ip_local_out      ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629347] [nf_hook_slow        ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392 *ipv4 in chain: OUTPUT*
[24583464.629350] [nft_do_chain        ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392 *iptables table:filter, chain:OUTPUT*
[24583464.629354] [ip_output           ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629358] [nf_hook_slow        ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392 *ipv4 in chain: POST_ROUTING*
[24583464.629361] [ip_finish_output    ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629365] [ip_finish_output2   ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629368] [__dev_queue_xmit    ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629372] [tcf_classify        ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629376] [skb_clone           ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629381] [__dev_queue_xmit    ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629385] [dev_hard_start_xmit ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392 *skb is successfully sent to the NIC driver*
[24583464.629389] [bond_dev_queue_xmit ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629393] [__dev_queue_xmit    ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629396] [dev_hard_start_xmit ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392 *skb is successfully sent to the NIC driver*
[24583464.629398] [skb_clone           ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629401] [packet_rcv          ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629404] [consume_skb         ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392 *packet is freed (normally)*
[24583464.629409] [consume_skb         ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392 *packet is freed (normally)*
[24583464.630375] [consume_skb         ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392 *packet is freed (normally)*
#...

Looking at the timestamps in the leftmost column of the filtered output, the whole protocol stack spent less than 1 ms from receiving the ICMP request to handing the reply off to the NIC, which pretty much rules out the kernel network stack. So what was going on?
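
Rather than eyeballing a single packet, the same check can be applied to every sequence number by pairing each request's napi_gro_receive_entry timestamp with the matching reply's __ip_local_out timestamp (a minimal sketch, assuming the nettrace output was saved to trace.log in exactly the format shown above):

import re

# Per ICMP seq: time spent inside the stack, from the request hitting
# napi_gro_receive_entry to the reply leaving via __ip_local_out.
req_re = re.compile(r"\[([\d.]+)\] \[napi_gro_receive_entry\].*ping request, seq: (\d+)")
rep_re = re.compile(r"\[([\d.]+)\] \[__ip_local_out\s*\].*ping reply, seq: (\d+)")

recv = {}
with open("trace.log") as f:
    for line in f:
        if (m := req_re.search(line)):
            recv.setdefault(int(m.group(2)), float(m.group(1)))
        elif (m := rep_re.search(line)):
            seq, ts = int(m.group(2)), float(m.group(1))
            if seq in recv:
                print(f"seq {seq}: in-stack {1000 * (ts - recv[seq]):.3f} ms")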

Let's keep digging into the nettrace output:

# grep -E 'napi_gro_receive_entry|__ip_local_out' trace.log
#...
[24583464.120360] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 675, id: 392
[24583464.120449] [__ip_local_out      ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 675, id: 392
[24583464.172360] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 676, id: 392
[24583464.172456] [__ip_local_out      ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 676, id: 392
# About 450 ms passed after the previous reply before the next ping request arrived, and then 8 requests showed up within roughly 1 ms
[24583464.629102] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629118] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 678, id: 392
[24583464.629503] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 679, id: 392
[24583464.629515] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 680, id: 392
[24583464.629521] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 681, id: 392
[24583464.629526] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 682, id: 392
[24583464.629531] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 683, id: 392
[24583464.630165] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 684, id: 392
[24583464.629343] [__ip_local_out      ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629452] [__ip_local_out      ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 678, id: 392
[24583464.629578] [__ip_local_out      ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 679, id: 392
[24583464.629781] [__ip_local_out      ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 680, id: 392
[24583464.629886] [__ip_local_out      ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 681, id: 392
[24583464.629989] [__ip_local_out      ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 682, id: 392
[24583464.630081] [__ip_local_out      ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 683, id: 392
[24583464.630282] [__ip_local_out      ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 684, id: 392
[24583464.637354] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 685, id: 392
[24583464.637439] [__ip_local_out      ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 685, id: 392
[24583464.689368] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 686, id: 392
[24583464.689469] [__ip_local_out      ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 686, id: 392
[24583464.741353] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 687, id: 392
[24583464.741462] [__ip_local_out      ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 687, id: 392
[24583464.793361] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 688, id: 392
[24583464.793477] [__ip_local_out      ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 688, id: 392
#...

From the log we can see that after the reply to seq=676 went out, about 450 ms passed before seq=677 was received, and then 8 requests arrived within roughly 1 ms; each of them was handled quickly, with almost no extra delay. This also lines up with the ping output. On the sender side, with a 50 ms interval, seq=677 was sent 50 ms after seq=676 but was delayed by about 400 ms before the target received and answered it. During those 400 ms a total of 8 requests had been sent, and they were all received and processed at practically the same time as seq=677, so from the sender's point of view the response times of seq=678 through seq=684 shrink in steps of roughly 50 ms.
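
To make that arithmetic concrete, a tiny model of the sender-side timeline reproduces the staircase (a back-of-the-envelope sketch: the ~406 ms stall is taken from the ping output, everything else is idealized, which is why the real steps come out closer to 52 ms):

# The sender emits a request every 50 ms; the receiver cannot process
# anything until the ~406 ms stall ends, then answers everything that
# has piled up almost at once.
INTERVAL_MS = 50
STALL_END_MS = 406              # roughly when interrupts came back on

for k in range(8):              # seq 677..684
    sent_at = k * INTERVAL_MS
    rtt = STALL_END_MS - sent_at    # reply arrives right after the stall
    print(f"seq {677 + k}: sent at {sent_at:3d} ms -> rtt ~{rtt} ms")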

This felt like slow interrupt handling; after all, no packets were lost, they were only delayed. Using the hardirqs and softirqs tools from bcc-tools, let's look at how interrupts were being handled:

# /usr/share/bcc/tools/hardirqs 1 100
Tracing hard irq event time... Hit Ctrl-C to end.

HARDIRQ                    TOTAL_usecs
nvme0q52                             5
nvme0q49                            31
nvme0q73                            44
mlx5_async17@pci:0000:1a:00.0          68
acpi                                73
mlx5_async17@pci:0000:1a:00.1          73
nvme0q51                            90
nvme0q76                           120
nvme0q75                           135
ens33f0-0                          782
ens33f1-0                          913

HARDIRQ                    TOTAL_usecs
mlx5_async17@pci:0000:1a:00.0           4
mlx5_async17@pci:0000:1a:00.1           7
megasas0-msix51                     48
megasas0-msix75                     58
nvme0q52                            60
nvme0q76                            61
acpi                                70
nvme0q75                            83
nvme0q49                           104
nvme0q51                           123
nvme0q73                           133
ens33f0-0                          340
ens33f1-0                          598

HARDIRQ                    TOTAL_usecs
acpi                                37
mlx5_async17@pci:0000:1a:00.1          38
mlx5_async17@pci:0000:1a:00.0          72
nvme0q73                           113
nvme0q75                           144
nvme0q51                           148
nvme0q52                           196
ens33f0-0                          374
ens33f1-0                          581
#...
# /usr/share/bcc/tools/softirqs 1 100
Tracing soft irq event time... Hit Ctrl-C to end.

SOFTIRQ          TOTAL_usecs
net_tx                     6
tasklet                  483
sched                   3923
rcu                     6545
net_rx                  9657
timer                  24720

SOFTIRQ          TOTAL_usecs
tasklet                  221
sched                   3256
net_rx                  4408
rcu                     5641
timer                  25195

SOFTIRQ          TOTAL_usecs
tasklet                  287
sched                   4734
net_rx                  6355
rcu                     8406
timer                  25333
#...

Neither tool showed interrupts being handled slowly. Could interrupts have been masked instead? If interrupts were disabled for 400 ms, packets arriving during that window would sit in the NIC's buffers, but the NIC would have no way to notify the kernel via an interrupt. Only after interrupts were re-enabled would the kernel get the interrupt and start processing the data queued up in the NIC, which would produce exactly the behavior described above. The hypothesis sounded plausible, but how to verify it?

I remembered that ByteDance open-sourced a tool called trace-irqoff a few years ago, which can trace and pinpoint periods during which hard or soft interrupts are disabled.
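
Roughly speaking, trace-irqoff arms a per-CPU hrtimer whose handler runs in hard-interrupt context and checks how late it fired each time. If interrupts had been masked, the timer could not fire on time, so the measured delay is approximately the irq-off duration, and the handler records the stack of whatever it interrupted; that is why the stacks below show the timer interrupt landing right in _raw_spin_unlock_irq, the point where interrupts are re-enabled. A userspace analogue of the same idea, purely to illustrate the principle (it cannot tell irq-off time apart from ordinary scheduling delay, which is exactly why the real tool does this from a hardirq-context handler inside the kernel):

import time

# Expect to wake up every INTERVAL seconds; waking up much later means
# something kept this thread from running for that long.
INTERVAL = 0.01          # 10 ms
THRESHOLD = 0.05         # report gaps longer than 50 ms

last = time.monotonic()
while True:
    time.sleep(INTERVAL)
    now = time.monotonic()
    gap = now - last - INTERVAL
    if gap > THRESHOLD:
        print(f"blocked for ~{gap * 1000:.0f} ms")
    last = now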

Let's give this tool a try:

# cat /proc/trace_irqoff/distribute
hardirq-off:
         msecs           : count     distribution
        10 -> 19         : 1        |**********                              |
        20 -> 39         : 0        |                                        |
        40 -> 79         : 0        |                                        |
        80 -> 159        : 1        |**********                              |
       160 -> 319        : 2        |********************                    |
       320 -> 639        : 4        |****************************************|

# cat /proc/trace_irqoff/trace_latency
trace_irqoff_latency: 50ms

 hardirq:
 cpu: 0
     COMMAND: kubelet PID: 1162942 LATENCY: 425ms
     save_trace.isra.0+0x18c/0x1b0 [trace_irqoff]
     trace_irqoff_record+0xa0/0xd0 [trace_irqoff]
     trace_irqoff_hrtimer_handler+0x47/0xd6 [trace_irqoff]
     __hrtimer_run_queues+0x12a/0x2c0
     hrtimer_interrupt+0xfc/0x210
     __sysvec_apic_timer_interrupt+0x5f/0x110
     sysvec_apic_timer_interrupt+0x6d/0x90
     asm_sysvec_apic_timer_interrupt+0x16/0x20
     _raw_spin_unlock_irq+0x14/0x25
     blkcg_print_blkgs+0x72/0xe0
     tg_print_rwstat_recursive+0x40/0x50
     seq_read_iter+0x124/0x4b0
     new_sync_read+0x118/0x1a0
     vfs_read+0xf5/0x190
     ksys_read+0x5f/0xe0
     do_syscall_64+0x5c/0x90
     entry_SYSCALL_64_after_hwframe+0x63/0xcd

     COMMAND: kubelet PID: 1162942 LATENCY: 462ms
     save_trace.isra.0+0x18c/0x1b0 [trace_irqoff]
     trace_irqoff_record+0xa0/0xd0 [trace_irqoff]
     trace_irqoff_hrtimer_handler+0x47/0xd6 [trace_irqoff]
     __hrtimer_run_queues+0x12a/0x2c0
     hrtimer_interrupt+0xfc/0x210
     __sysvec_apic_timer_interrupt+0x5f/0x110
     sysvec_apic_timer_interrupt+0x6d/0x90
     asm_sysvec_apic_timer_interrupt+0x16/0x20
     _raw_spin_unlock_irq+0x14/0x25
     blkcg_print_blkgs+0x72/0xe0
     tg_print_rwstat_recursive+0x40/0x50
     seq_read_iter+0x124/0x4b0
     new_sync_read+0x118/0x1a0
     vfs_read+0xf5/0x190
     ksys_read+0x5f/0xe0
     do_syscall_64+0x5c/0x90
     entry_SYSCALL_64_after_hwframe+0x63/0xcd
#...

It did indeed catch interrupts being masked for long stretches. From these stack traces, the kubelet process was reading cgroup files, and while the kernel was in blkcg_print_blkgs it kept interrupts masked for 425 ms and 462 ms. Those numbers are very close to the ~400 ms ping latency, so this was almost certainly the cause.

As for why interrupts stayed masked for so long, I won't paste the full analysis here; in the end it came down to a cgroup memory-leak bug that left blkcg_print_blkgs with an enormous number of items to walk, which is what kept interrupts disabled for so long.
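
As a quick first-pass check for this class of problem, the per-controller cgroup counts in /proc/cgroups can hint at whether some controller has accumulated an unusually large number of groups (an illustrative sketch, not the original analysis; note that cgroups which have been removed but are still pinned inside the kernel, i.e. the actual leak here, will not necessarily show up in these counts):

# /proc/cgroups columns: subsys_name  hierarchy  num_cgroups  enabled
with open("/proc/cgroups") as f:
    next(f)                          # skip the "#subsys_name ..." header
    for line in f:
        name, _hier, num, _enabled = line.split()
        if int(num) > 10000:         # arbitrary "suspiciously large" bound
            print(f"{name}: {num} cgroups")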

Big thanks to Tencent and ByteDance for open-sourcing these two tools; they sped up root-causing this problem enormously 😁. Incidentally, Alibaba's open-source ops toolkit SysAK also ships an irqoff tool, implemented on eBPF, which may be even more flexible than ByteDance's; that one will have to wait for next time 😊.