A Network Latency Problem Caused by Interrupts Being Disabled for Too Long
Recently we hit a problem in production: one machine showed intermittent network latency:
# ping -i 0.05 192.168.0.5
#...
64 bytes from 192.168.0.5: icmp_seq=673 ttl=63 time=0.203 ms
64 bytes from 192.168.0.5: icmp_seq=674 ttl=63 time=0.210 ms
64 bytes from 192.168.0.5: icmp_seq=675 ttl=63 time=0.218 ms
64 bytes from 192.168.0.5: icmp_seq=676 ttl=63 time=0.233 ms
64 bytes from 192.168.0.5: icmp_seq=677 ttl=63 time=406 ms
64 bytes from 192.168.0.5: icmp_seq=678 ttl=63 time=354 ms
64 bytes from 192.168.0.5: icmp_seq=679 ttl=63 time=302 ms
64 bytes from 192.168.0.5: icmp_seq=680 ttl=63 time=251 ms
64 bytes from 192.168.0.5: icmp_seq=681 ttl=63 time=199 ms
64 bytes from 192.168.0.5: icmp_seq=682 ttl=63 time=147 ms
64 bytes from 192.168.0.5: icmp_seq=683 ttl=63 time=94.8 ms
64 bytes from 192.168.0.5: icmp_seq=684 ttl=63 time=43.0 ms
64 bytes from 192.168.0.5: icmp_seq=685 ttl=63 time=0.216 ms
64 bytes from 192.168.0.5: icmp_seq=686 ttl=63 time=0.248 ms
#...
Pinging at a 50 ms interval, latency spikes of more than 400 ms showed up intermittently, yet no packets were lost.
The first idea was to analyze the problem with nettrace, an eBPF-based network packet tracing tool open-sourced by Tencent that can trace every function a packet passes through in the kernel network stack. The hope was to see whether the latency was being spent inside the kernel network stack. Taking seq=677, the packet with the longest RTT in the ping output, we filter out every function call from the moment the stack receives the request to the moment the reply is sent:
# nettrace -p icmp
#...
***************** d8061b00 ***************
[24583464.629102] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629114] [dev_gro_receive ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629248] [__netif_receive_skb_core] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629253] [packet_rcv ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629256] [tcf_classify ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629262] [__netif_receive_skb_core] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629268] [ip_rcv ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629271] [ip_rcv_core ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629276] [nf_hook_slow ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392 *ipv4 in chain: PRE_ROUTING*
[24583464.629284] [ip_route_input_slow ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629293] [fib_validate_source ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629299] [ip_local_deliver ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629301] [nf_hook_slow ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392 *ipv4 in chain: INPUT*
[24583464.629304] [nft_do_chain ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392 *iptables table:filter, chain:INPUT*
[24583464.629311] [ip_local_deliver_finish] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629316] [icmp_rcv ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629321] [icmp_echo ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629325] [icmp_reply ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629413] [consume_skb ] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392 *packet is freed (normally)*
#...
***************** a3faf500,a3faec00,a3faf800 ***************
[24583464.629343] [__ip_local_out ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629347] [nf_hook_slow ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392 *ipv4 in chain: OUTPUT*
[24583464.629350] [nft_do_chain ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392 *iptables table:filter, chain:OUTPUT*
[24583464.629354] [ip_output ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629358] [nf_hook_slow ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392 *ipv4 in chain: POST_ROUTING*
[24583464.629361] [ip_finish_output ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629365] [ip_finish_output2 ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629368] [__dev_queue_xmit ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629372] [tcf_classify ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629376] [skb_clone ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629381] [__dev_queue_xmit ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629385] [dev_hard_start_xmit ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392 *skb is successfully sent to the NIC driver*
[24583464.629389] [bond_dev_queue_xmit ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629393] [__dev_queue_xmit ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629396] [dev_hard_start_xmit ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392 *skb is successfully sent to the NIC driver*
[24583464.629398] [skb_clone ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629401] [packet_rcv ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629404] [consume_skb ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392 *packet is freed (normally)*
[24583464.629409] [consume_skb ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392 *packet is freed (normally)*
[24583464.630375] [consume_skb ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392 *packet is freed (normally)*
#...
Looking at the timestamps in the leftmost column of the filtered output, the whole trip through the stack, from receiving the ICMP request to handing the reply to the NIC, took less than 1 ms, which all but rules out the kernel network stack. So what is going on? Let's dig further into the nettrace output:
# grep -E 'napi_gro_receive_entry|__ip_local_out' trace.log
#...
[24583464.120360] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 675, id: 392
[24583464.120449] [__ip_local_out ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 675, id: 392
[24583464.172360] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 676, id: 392
[24583464.172456] [__ip_local_out ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 676, id: 392
# ~450 ms after the previous reply, the next ping request finally arrives, and 8 requests show up within about 1 ms
[24583464.629102] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 677, id: 392
[24583464.629118] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 678, id: 392
[24583464.629503] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 679, id: 392
[24583464.629515] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 680, id: 392
[24583464.629521] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 681, id: 392
[24583464.629526] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 682, id: 392
[24583464.629531] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 683, id: 392
[24583464.630165] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 684, id: 392
[24583464.629343] [__ip_local_out ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 677, id: 392
[24583464.629452] [__ip_local_out ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 678, id: 392
[24583464.629578] [__ip_local_out ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 679, id: 392
[24583464.629781] [__ip_local_out ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 680, id: 392
[24583464.629886] [__ip_local_out ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 681, id: 392
[24583464.629989] [__ip_local_out ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 682, id: 392
[24583464.630081] [__ip_local_out ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 683, id: 392
[24583464.630282] [__ip_local_out ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 684, id: 392
[24583464.637354] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 685, id: 392
[24583464.637439] [__ip_local_out ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 685, id: 392
[24583464.689368] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 686, id: 392
[24583464.689469] [__ip_local_out ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 686, id: 392
[24583464.741353] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 687, id: 392
[24583464.741462] [__ip_local_out ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 687, id: 392
[24583464.793361] [napi_gro_receive_entry] ICMP: 192.168.0.4 -> 192.168.0.5 ping request, seq: 688, id: 392
[24583464.793477] [__ip_local_out ] ICMP: 192.168.0.5 -> 192.168.0.4 ping reply, seq: 688, id: 392
#...
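Rather than eyeballing the timestamps, the same numbers can be pulled out with a short script. This is only a sketch: it assumes the nettrace output has been saved to trace.log, as in the grep command above, and that the log lines keep the format shown.

import re
from collections import defaultdict

# For each ICMP sequence number, record when the request entered the
# stack (napi_gro_receive_entry) and when the reply left it
# (__ip_local_out), then print the in-stack time and the gap since the
# previous request arrived.
pat = re.compile(r"\[(\d+\.\d+)\] \[(\S+)\s*\].*seq: (\d+)")

events = defaultdict(dict)          # seq -> {"rx": ts, "tx": ts}
with open("trace.log") as f:
    for line in f:
        m = pat.search(line)
        if not m:
            continue
        ts, func, seq = float(m.group(1)), m.group(2), int(m.group(3))
        if func == "napi_gro_receive_entry":
            events[seq].setdefault("rx", ts)
        elif func == "__ip_local_out":
            events[seq].setdefault("tx", ts)

prev_rx = None
for seq in sorted(events):
    rx, tx = events[seq].get("rx"), events[seq].get("tx")
    if rx is None or tx is None:
        continue
    gap_ms = (rx - prev_rx) * 1000 if prev_rx is not None else 0.0
    print(f"seq={seq}  in-stack={(tx - rx) * 1000:.3f} ms  "
          f"since-previous-request={gap_ms:.1f} ms")
    prev_rx = rx

On the log above this would show roughly 0.24 ms of in-stack time for seq=677, but a ~457 ms gap between the arrival of seq=676 and seq=677.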
The log shows that after the reply to seq=676 went out, roughly 450 ms passed before the next request, seq=677, arrived, and then 8 requests came in within about 1 ms; each of them was handled quickly, with essentially no extra delay. This also matches the ping output. Seen from the source, ping sends a request every 50 ms: seq=676 goes out, 50 ms later seq=677 goes out, but that request only reaches the destination and gets answered about 400 ms late. During those 400 ms the source keeps sending, so 8 requests (seq 677 through 684) are outstanding; they are all received and answered in the same burst as seq=677. From the source's point of view, the RTTs of seq 678 through 684 therefore shrink in steps of roughly 50 ms.
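To make the arithmetic concrete, a few lines are enough to reproduce the expected pattern. The 406 ms stall and the nominal 50 ms interval are taken from the ping output above; this is only a model of the behavior, not a measurement.

# Model of the stall: requests for seq 677..684 leave the source 50 ms
# apart, but are all delivered to the kernel in one burst once the stall
# ends ~406 ms after seq 677 was sent, so the RTT seen by ping shrinks
# by one send interval per sequence number.
STALL_MS = 406     # delay experienced by the first buffered request
INTERVAL_MS = 50   # ping -i 0.05

for k, seq in enumerate(range(677, 685)):
    print(f"seq={seq}  expected rtt ~ {STALL_MS - k * INTERVAL_MS} ms")

The printed values (406, 356, ..., 56 ms) are close to the measured 406, 354, ..., 43 ms; the small drift comes from the actual send spacing being closer to 52 ms than 50 ms.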
This smells like slow interrupt handling; after all, nothing was dropped, only delayed. Let's look at interrupt processing with the hardirqs and softirqs tools from bcc-tools:
# /usr/share/bcc/tools/hardirqs 1 100
Tracing hard irq event time... Hit Ctrl-C to end.
HARDIRQ TOTAL_usecs
nvme0q52 5
nvme0q49 31
nvme0q73 44
mlx5_async17@pci:0000:1a:00.0 68
acpi 73
mlx5_async17@pci:0000:1a:00.1 73
nvme0q51 90
nvme0q76 120
nvme0q75 135
ens33f0-0 782
ens33f1-0 913
HARDIRQ TOTAL_usecs
mlx5_async17@pci:0000:1a:00.0 4
mlx5_async17@pci:0000:1a:00.1 7
megasas0-msix51 48
megasas0-msix75 58
nvme0q52 60
nvme0q76 61
acpi 70
nvme0q75 83
nvme0q49 104
nvme0q51 123
nvme0q73 133
ens33f0-0 340
ens33f1-0 598
HARDIRQ TOTAL_usecs
acpi 37
mlx5_async17@pci:0000:1a:00.1 38
mlx5_async17@pci:0000:1a:00.0 72
nvme0q73 113
nvme0q75 144
nvme0q51 148
nvme0q52 196
ens33f0-0 374
ens33f1-0 581
#...
# /usr/share/bcc/tools/softirqs 1 100
Tracing soft irq event time... Hit Ctrl-C to end.
SOFTIRQ TOTAL_usecs
net_tx 6
tasklet 483
sched 3923
rcu 6545
net_rx 9657
timer 24720
SOFTIRQ TOTAL_usecs
tasklet 221
sched 3256
net_rx 4408
rcu 5641
timer 25195
SOFTIRQ TOTAL_usecs
tasklet 287
sched 4734
net_rx 6355
rcu 8406
timer 25333
#...
Nothing here suggests interrupt handling itself is slow. Could interrupts have been masked instead? If interrupts were disabled for 400 ms, packets arriving during that window would sit in the NIC's ring buffer; the NIC would have no way to notify the kernel via an interrupt, and only once interrupts were re-enabled would the kernel receive the interrupt and start draining the buffered packets, which is exactly the behavior observed above. The hypothesis sounds plausible, but how do we verify it?
I remembered that ByteDance open-sourced a tool called trace-irqoff a few years ago, which can track and pinpoint how long interrupts or softirqs stay disabled.
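As I understand it (the trace_irqoff_hrtimer_handler frames in the stacks below point the same way), the core trick of such tools is to fire a periodic per-CPU timer and record the current stack whenever two consecutive firings end up much farther apart than the period, which for a kernel hrtimer generally means interrupts were disabled on that CPU during the gap. A rough user-space analogue of the sampling idea, not the tool itself (PERIOD and THRESHOLD are arbitrary values for the sketch):

import time

PERIOD = 0.01      # sample every 10 ms
THRESHOLD = 0.05   # report gaps longer than 50 ms

last = time.monotonic()
while True:
    time.sleep(PERIOD)
    now = time.monotonic()
    gap = now - last
    if gap > THRESHOLD:
        # the kernel tool would save the current stack trace here;
        # in user space a long gap just means this loop could not run
        print(f"sampler blocked for {gap * 1000:.0f} ms")
    last = now

Giving the real tool a try on the affected machine: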
# cat /proc/trace_irqoff/distribute
hardirq-off:
msecs : count distribution
10 -> 19 : 1 |********** |
20 -> 39 : 0 | |
40 -> 79 : 0 | |
80 -> 159 : 1 |********** |
160 -> 319 : 2 |******************** |
320 -> 639 : 4 |****************************************|
# cat /proc/trace_irqoff/trace_latency
trace_irqoff_latency: 50ms
hardirq:
cpu: 0
COMMAND: kubelet PID: 1162942 LATENCY: 425ms
save_trace.isra.0+0x18c/0x1b0 [trace_irqoff]
trace_irqoff_record+0xa0/0xd0 [trace_irqoff]
trace_irqoff_hrtimer_handler+0x47/0xd6 [trace_irqoff]
__hrtimer_run_queues+0x12a/0x2c0
hrtimer_interrupt+0xfc/0x210
__sysvec_apic_timer_interrupt+0x5f/0x110
sysvec_apic_timer_interrupt+0x6d/0x90
asm_sysvec_apic_timer_interrupt+0x16/0x20
_raw_spin_unlock_irq+0x14/0x25
blkcg_print_blkgs+0x72/0xe0
tg_print_rwstat_recursive+0x40/0x50
seq_read_iter+0x124/0x4b0
new_sync_read+0x118/0x1a0
vfs_read+0xf5/0x190
ksys_read+0x5f/0xe0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
COMMAND: kubelet PID: 1162942 LATENCY: 462ms
save_trace.isra.0+0x18c/0x1b0 [trace_irqoff]
trace_irqoff_record+0xa0/0xd0 [trace_irqoff]
trace_irqoff_hrtimer_handler+0x47/0xd6 [trace_irqoff]
__hrtimer_run_queues+0x12a/0x2c0
hrtimer_interrupt+0xfc/0x210
__sysvec_apic_timer_interrupt+0x5f/0x110
sysvec_apic_timer_interrupt+0x6d/0x90
asm_sysvec_apic_timer_interrupt+0x16/0x20
_raw_spin_unlock_irq+0x14/0x25
blkcg_print_blkgs+0x72/0xe0
tg_print_rwstat_recursive+0x40/0x50
seq_read_iter+0x124/0x4b0
new_sync_read+0x118/0x1a0
vfs_read+0xf5/0x190
ksys_read+0x5f/0xe0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
#...
And indeed, interrupts had been masked for long stretches. The call stacks show the kubelet process reading cgroup files; while the kernel was executing blkcg_print_blkgs on its behalf, interrupts stayed disabled for 425 ms and 462 ms. That is very close to the ~400 ms ping latency, so this is almost certainly the cause.
As for why interrupts ended up disabled for that long, I will spare you the detailed analysis: the root cause turned out to be a memory-leak bug in cgroups, which left blkcg_print_blkgs with an enormous number of items to walk and therefore an excessively long interrupts-off window.
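The chase for the leak itself is omitted here, but purely as an illustration of a first sanity check one could run (an assumption on my part, not the steps actually used): on cgroup v1, /proc/cgroups reports how many cgroups the kernel is still holding for each controller, and a count that keeps growing, or that is far larger than the number of directories visible under /sys/fs/cgroup, is a classic sign of leaked (dying) cgroups.

# Quick sanity check for a cgroup leak: /proc/cgroups (cgroup v1
# accounting) lists the number of live cgroups per controller. A huge
# num_cgroups value for blkio would mean blkcg_print_blkgs has a lot
# of blkgs to iterate over.
with open("/proc/cgroups") as f:
    next(f)  # skip the "#subsys_name hierarchy num_cgroups enabled" header
    for line in f:
        name, hierarchy, num_cgroups, enabled = line.split()
        print(f"{name:12s} num_cgroups={num_cgroups}")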
Credit where it is due: the two open-source tools from Tencent and ByteDance greatly sped up the diagnosis 😁. Incidentally, SysAK, the operations toolkit open-sourced by Alibaba, also ships an irqoff tool, implemented on top of eBPF and possibly even more flexible than ByteDance's, but trying that one out will have to wait for next time 😊.