[Pitfall] fluentd daemonset failed to flush the buffer
Debugging notes
Recently a colleague complained that Elasticsearch kept losing data and asked me to check whether ES was misbehaving. The cluster health looked fine, the nodes still had plenty of free disk space, and manual queries did return data, so I was stumped. Going back to the data source, `kubectl logs <application_pod>` showed the application was writing logs to stdout as expected. Moving one step up the pipeline to fluentd, something looked off: `kubectl logs <fluentd_pod> | grep -v info` reported a warning:
```
2019-11-17 05:07:05 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:block
```
At first glance the buffer was overflowing, so as a first pass I bumped the buffer chunk size and lengthened the timeout to see how it behaved:
```
buffer_chunk_limit "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_CHUNK_LIMIT_SIZE'] || '8M'}"
```
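For context, here is a minimal sketch of how that line sits inside the elasticsearch output section. The surrounding `<match>` block, the `request_timeout` value, and the queue/flush settings are illustrative assumptions based on the fluentd-kubernetes-daemonset style config, not copied from my actual file:

```
<match **>
  @type elasticsearch
  host "#{ENV['FLUENT_ELASTICSEARCH_HOST']}"
  port "#{ENV['FLUENT_ELASTICSEARCH_PORT']}"
  # Give ES more time to answer bulk requests before fluentd treats them as failed
  # (fluent-plugin-elasticsearch defaults to 5s).
  request_timeout 30s
  # Legacy buffer options: a larger chunk and a longer queue give the output more
  # headroom before the overflow action (:block) kicks in.
  buffer_chunk_limit "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_CHUNK_LIMIT_SIZE'] || '8M'}"
  buffer_queue_limit "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_QUEUE_LIMIT_LENGTH'] || '32'}"
  flush_interval 5s
</match>
```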
The next day the logs were still blowing up. After digging a bit deeper I found that someone had already reported this issue: Fluentd stopped sending data to ES for somewhile. #525. And it turns out the official FAQ already covers it XD
Stopped to send events on k8s, why?
fluent-plugin-elasticsearch reloads connection after 10000 requests. (Not correspond to events counts because ES plugin uses bulk API.)
This functionality which is originated from elasticsearch-ruby gem is enabled by default.
Sometimes this reloading functionality bothers users to send events with ES plugin.
On k8s platform, users sometimes shall specify the following settings:
```
reload_connections false
```
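Placed in context, the fix ends up in the same `<match>` block. This is a sketch, again assuming the daemonset-style config; the same FAQ entry also recommends `reconnect_on_error` and `reload_on_failure` for k8s, shown here for completeness rather than taken from my actual file:

```
<match **>
  @type elasticsearch
  # ... host/port/buffer settings as above ...
  # Disable the periodic connection reload that the elasticsearch-ruby transport
  # performs after 10000 requests; on k8s this reload can leave the plugin stuck
  # and stop events from being shipped.
  reload_connections false
  # Additional settings the FAQ suggests for k8s: recover the connection on errors
  # instead of relying on the periodic reload.
  reconnect_on_error true
  reload_on_failure true
</match>
```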
After making that change there were no more timeouts, and the problem was finally sorted out.