[Pitfall] fluentd daemonset failed to flush the buffer

Debugging notes

Recently a colleague complained that Elasticsearch kept losing data and asked me to check whether ES had a problem. The cluster health status looked fine, the nodes still had plenty of free disk space, and manual queries returned data, so I was stumped. Tracing back toward the data source, kubectl logs <application_pod> showed the application writing logs to stdout normally. Moving one step up the pipeline to fluentd, something looked off: kubectl logs <fluentd_pod> | grep -v info showed these warnings:

2019-11-17 05:07:05 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:block
2019-11-17 05:24:05 +0000 [warn]: #0 failed to flush the buffer. retry_time=3 next_retry_seconds=2019-11-17 05:24:32 +0000 chunk="59778cb47a5c5dcf401f4d1c5b2cc88f" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"<elastic_cluster>\", :port=>9200, :scheme=>\"http\", :user=>\"elastic\", :password=>\"obfuscated\", :path=>\"\"}): connect_write timeout reached"

At first glance it looked like the buffer had blown up, so as a first step I increased the buffer size and also lengthened the request timeout to see what would happen:

buffer_chunk_limit "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_CHUNK_LIMIT_SIZE'] || '8M'}"
buffer_queue_limit "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_QUEUE_LIMIT_LENGTH'] || '256'}"
request_timeout 15s
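
For context, these directives go inside the elasticsearch <match> section of the daemonset's fluent.conf. A minimal sketch of that placement, assuming the stock fluent-plugin-elasticsearch output; the host/port values and surrounding options are placeholders, not the exact config I was running:

<match **>
  @type elasticsearch
  host <elastic_cluster>
  port 9200
  # enlarged chunk/queue limits and a longer request timeout
  buffer_chunk_limit "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_CHUNK_LIMIT_SIZE'] || '8M'}"
  buffer_queue_limit "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_QUEUE_LIMIT_LENGTH'] || '256'}"
  request_timeout 15s
</match>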

The next day the logs showed it was still blowing up. After digging around more carefully, I found someone had already reported this issue: Fluentd stopped sending data to ES for somewhile. #525

It turns out the official FAQ already covers this XD

Stopped to send events on k8s, why?
fluent-plugin-elasticsearch reloads connection after 10000 requests. (Not correspond to events counts because ES plugin uses bulk API.)
This functionality which is originated from elasticsearch-ruby gem is enabled by default.
Sometimes this reloading functionality bothers users to send events with ES plugin.
On k8s platform, users sometimes shall specify the following settings:

reload_connections false
reconnect_on_error true
reload_on_failure true
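
Putting both changes together, the output section ended up looking roughly like the sketch below (again an illustration rather than the verbatim config); the reload/reconnect flags are the part that actually mattered:

<match **>
  @type elasticsearch
  host <elastic_cluster>
  port 9200
  # stop the elasticsearch-ruby transport from reloading connections
  # every 10000 requests, and reconnect/reload when a request fails instead
  reload_connections false
  reconnect_on_error true
  reload_on_failure true
  # ... buffer settings and request_timeout from the previous step ...
</match>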

After applying these settings there were no more timeouts, and the problem was finally solved.