[Pitfall] fluentd daemonset failed to flush the buffer

Debugging notes

Recently a colleague complained that Elasticsearch kept losing data and asked me to check whether ES had a problem. The cluster health status looked fine, the nodes still had plenty of free disk space, and manual queries returned data, so I was stumped. Tracing back toward the data source, kubectl logs <application_pod> showed the application writing logs to stdout normally. Moving one step up the pipeline to fluentd, something looked off: kubectl logs <fluentd_pod> | grep -v info showed these warnings:

2019-11-17 05:07:05 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:block
2019-11-17 05:24:05 +0000 [warn]: #0 failed to flush the buffer. retry_time=3 next_retry_seconds=2019-11-17 05:24:32 +0000 chunk="59778cb47a5c5dcf401f4d1c5b2cc88f" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"<elastic_cluster>\", :port=>9200, :scheme=>\"http\", :user=>\"elastic\", :password=>\"obfuscated\", :path=>\"\"}): connect_write timeout reached"

At first glance it looked like the buffer had blown up, so as a first step I increased the buffer size and also lengthened the request timeout to see what would happen:

buffer_chunk_limit "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_CHUNK_LIMIT_SIZE'] || '8M'}"
buffer_queue_limit "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_QUEUE_LIMIT_LENGTH'] || '256'}"
request_timeout 15s
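
For context, these directives go inside the elasticsearch <match> section of the daemonset's fluent.conf. A minimal sketch of that placement, assuming the stock fluent-plugin-elasticsearch output; the host/port values and surrounding options are placeholders, not the exact config I was running:

<match **>
  @type elasticsearch
  host <elastic_cluster>
  port 9200
  # enlarged chunk/queue limits and a longer request timeout
  buffer_chunk_limit "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_CHUNK_LIMIT_SIZE'] || '8M'}"
  buffer_queue_limit "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_QUEUE_LIMIT_LENGTH'] || '256'}"
  request_timeout 15s
</match>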

The next day the logs showed it was still blowing up. After digging around more carefully, I found someone had already reported this issue: Fluentd stopped sending data to ES for somewhile. #525

It turns out the official FAQ already covers this XD

Stopped to send events on k8s, why?
fluent-plugin-elasticsearch reloads connection after 10000 requests. (Not correspond to events counts because ES plugin uses bulk API.)
This functionality which is originated from elasticsearch-ruby gem is enabled by default.
Sometimes this reloading functionality bothers users to send events with ES plugin.
On k8s platform, users sometimes shall specify the following settings:

reload_connections false
reconnect_on_error true
reload_on_failure true
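
Putting both changes together, the output section ended up looking roughly like the sketch below (again an illustration rather than the verbatim config); the reload/reconnect flags are the part that actually mattered:

<match **>
  @type elasticsearch
  host <elastic_cluster>
  port 9200
  # stop the elasticsearch-ruby transport from reloading connections
  # every 10000 requests, and reconnect/reload when a request fails instead
  reload_connections false
  reconnect_on_error true
  reload_on_failure true
  # ... buffer settings and request_timeout from the previous step ...
</match>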

After applying these settings there were no more timeouts, and the problem was finally solved.