[除錯] Fluentd UDP Log 掉包問題分析

發表於 2020-07-19 更新於 2020-12-13 分類於 Cloud-Native 閱讀次數： Disqus：文章字數： 2k 所需閱讀時間 ≈ 4 分鐘

前言

前陣子與 Backend Team 最終討論出要使用 UDP 方式直接將 JSON format 的 log message 送到 fluentd，實際接上後觀察發現會掉 log，以下是除錯過程的紀錄。

除錯紀錄

一開始不確定問題是 fluentd 丟棄太大的 UDP 封包還是 MTU 導致，於是建立了幾個特定大小的 JSON 檔案來測試：

$ ls -la
-rw-rw-r--  1 relk relk    1400 Jul 16 10:33 size-1400
-rw-rw-r--  1 relk relk    1410 Jul 16 10:39 size-1410
-rw-rw-r--  1 relk relk    1420 Jul 16 10:41 size-1420
-rw-rw-r--  1 relk relk    1430 Jul 16 10:42 size-1430
-rw-rw-r--  1 relk relk    1432 Jul 16 10:54 size-1432
-rw-rw-r--  1 relk relk    1440 Jul 16 10:42 size-1440
-rw-rw-r--  1 relk relk    2000 Jul 16 10:30 size-2000

一開始先從 2000 bytes 的開始丟，檢查 fluentd log 沒發現有收到 2000 bytes 大小的 log，轉而測試 1400 bytes 到 1440 bytes 訊息後，發現超過 1432 bytes 的 log 像是被丟到黑洞一樣，不知去向

1
2
3

$ cat size-2000 > /dev/udp/192.168.100.200/5160
$ cat size-1400 > /dev/udp/192.168.100.200/5160
$ cat size-1432 > /dev/udp/192.168.100.200/5160

使用 tcpdump 來檢查 5160 port 的封包狀況，確認的確是封包過大會有 UDP, bad length 的錯誤訊息：

1
2
3

$ sudo tcpdump -i ens4 -nn -v 'port 5160'
08:15:07.295252 IP (tos 0x0, ttl 64, id 798, offset 0, flags [+], proto UDP (17), length 1460)
    192.168.100.1.5160 > 192.168.100.200.5160: UDP, bad length 2064 > 1432

在 Google 上查了一下，看到有人提到 udp-fragmentation-offload 會導致這個問題，檢查了一下確實沒有開啟這功能

1 2	$ sudo ethtool -k ens4 \| grep udp-fragmentation-offload udp-fragmentation-offload: off

針對這功能研究了下，udp-fragmentation-offload (UFO) 主要是網卡提供硬體 IP Fragmentation 的功能，使得上層 Layer 3 不需要透過軟體 Fragmentation，減少 CPU 資源消耗，因此我認為 UFO 的啟用與否不影響 UDP Packet 的發送。

根據這個思維，利用 tcpdump 來觀察所有發送到目標 fluentd 的封包：

$sudo tcpdump -i ens4 -nn -v -X 'dst host 192.168.100.200'

16:43:18.375508 IP (tos 0x0, ttl 64, id 30236, offset 0, flags [+], proto UDP (17), length 1460)
    192.168.100.1.33543 > 192.168.100.200.5160: UDP, bad length 1440 > 1432
	0x0000:  4500 05b4 761c 2000 4011 c94c 0aa0 0002  [email protected]....
	0x0010:  0ac8 0067 8307 1428 05a8 090f 7b22 6c61  ...g...(....{"la
	0x0020:  6265 6c22 3a22 7265 6c6b 5f61 6363 6573  bel":"relk_acces
	0x0030:  7322 2c22 6d65 7373 6167 6522 3a22 3634  s","message":"64
	0x0040:  3634 3634 3634 3634 3634 3634 3634 3634  6464646464646464
	... 略
	0x0550:  3132 3232 3232 3232 3232 3233 3333 3333  1222222222233333
	0x0560:  3333 3333 3334 3434 3434 3434 3434 3422  333334444444444"
	0x0570:  2c22 686f 7374 6e61 6d65 223a 2273 6973  ,"hostname":"my-
	0x0580:  7970 6875 7322 2c22 7069 6422 3a32 3233  host1","pid":123
	0x0590:  3336 2c22 7469 6d65 7374 616d 7022 3a22  45,"timestamp":"
	0x05a0:  3230 3230 2d30 372d 3136 5431 303a 3131  2020-07-16T10:11
	0x05b0:  3a32 392e                                :29.
16:43:18.375526 IP (tos 0x0, ttl 64, id 30236, offset 1440, flags [none], proto UDP (17), length 28)
    10.160.0.2 > 192.168.100.200: ip-proto-17
	0x0000:  4500 001c 761c 00b4 4011 ee30 0aa0 0002  [email protected]....
	0x0010:  0ac8 0067 3238 3630 3022 7d0a            ...g28600"}.

可以觀察到確實 IP Layer 會將大於 MTU 的封包做切割，而切割後會喪失 UDP Header，但接收端應該要能夠在 IP Layer 重組（IP Header 有 ID、offset），而我們目前測試的架構為 VM > GCP Internal Loadbalancer > Kubernetes Worker Node > Fluentd Pod (Kubernetes)，為了定位問題點，我選擇先從 Kubernetes 內部測試被 IP 分片的 UDP Datagram 是否能組回且被 Fluentd 正確處理：

1
2
3

$ ls -la
-rw-r--r--    1 root     root          2312 Jul 19 12:37 size2300.txt
$ cat size2300.txt | nc -u -w1 fluentd.monitoring 5160

同時間在 Fluentd Pod 上進行 tcpdump：

$ tcpdump -nn -v -A 'src host 10.203.0.20'

12:37:50.159977 IP (tos 0x0, ttl 62, id 42754, offset 0, flags [+], proto UDP (17), length 1460)
    10.203.0.20.46424 > 10.203.3.189.5160: UDP, bad length 2312 > 1432
E..... .>...
...
....X.(	..(
{"label":"relk_access","timestamp":"2020-07-19T12:11:29.286Z","message":"sEgYYUOdiyVk...略

12:37:50.159993 IP (tos 0x0, ttl 62, id 42754, offset 1440, flags [none], proto UDP (17), length 900)
    10.203.0.20 > 10.203.3.189: ip-proto-17
E.......>..L
...
...VW0iQRjLz9QqVBzkfmNThUaOgC3HA4RkkhcoTYBJOS1rw8bwt9SD4xWMhLIa5X1uCrCdfgFnmp3A4I1NSeMYow2mbsa6...略"}

可以看到接收端是有收到被分片過的 Packet，Fluentd 也有正確處理並送到 elasticsearch，因此可以確認說問題是出在 GCP Internal Loadbalancer，根據文檔指出：

Google Cloud does not wait for all fragments; it forwards each fragment as soon as it arrives.

Because subsequent UDP fragments do not contain the destination port, problems can occur in these situations:

If the target pools session affinity is set to NONE (5-tuple affinity), the subsequent fragments may be dropped because the load balancer cannot calculate the 5-tuple hash.

If there is more than one UDP forwarding rule for the same load-balanced IP address, subsequent fragments may arrive at the wrong forwarding rule.

可以看到根據 session affinity 的設定，可能會導致後續 fragments 被送到不正確的目的地或是被丟棄，基本上可以確定這次掉 Log 是 Loadbalancer 導致，查了一下沒有找到能夠調整 session affinity 的 Kubernetes service annotations，暫時擱置著先，再想想要怎麼解決這問題…

2020/07/31 Update:
後續測試時發現部份 log message 會超過 UDP datagram 大小限制 65,535 bytes，只好轉回用 TCP 做為連線協定，回去面對 Long-Lived TCP 連線的負載平衡問題，以及 TCP 連線開銷等等性能挑戰。
可以參考 Before Installation - Fluentd，有提到關於 File Descriptors 及 Network Kernel Parameters 等等的調整

前言

除錯紀錄

參考資料