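[Debugging] Analyzing Dropped Fluentd UDP Logs

Posted 2020-07-19 | Updated 2020-12-13 | Category: Cloud-Native | Word count: 2k | Reading time ≈ 4 min

Introduction

A while ago, after discussion with the backend team, we settled on sending JSON-formatted log messages directly to fluentd over UDP. Once it was hooked up, we observed that logs were being dropped. This is a record of the debugging process.

Debugging Notes

At first I wasn't sure whether fluentd was discarding oversized UDP packets or whether the MTU was the cause, so I created several JSON files of specific sizes to test with: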

$ ls -la
-rw-rw-r--  1 relk relk    1400 Jul 16 10:33 size-1400
-rw-rw-r--  1 relk relk    1410 Jul 16 10:39 size-1410
-rw-rw-r--  1 relk relk    1420 Jul 16 10:41 size-1420
-rw-rw-r--  1 relk relk    1430 Jul 16 10:42 size-1430
-rw-rw-r--  1 relk relk    1432 Jul 16 10:54 size-1432
-rw-rw-r--  1 relk relk    1440 Jul 16 10:42 size-1440
-rw-rw-r--  1 relk relk    2000 Jul 16 10:30 size-2000

I started by sending the 2000-byte file; the fluentd log showed no sign of a 2000-byte message arriving. I then tested messages from 1400 to 1440 bytes and found that any log larger than 1432 bytes vanished as if into a black hole.

$ cat size-2000 > /dev/udp/192.168.100.200/5160
$ cat size-1400 > /dev/udp/192.168.100.200/5160
$ cat size-1432 > /dev/udp/192.168.100.200/5160
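
The same test can be scripted. The following is a minimal sketch in Python, assuming filler bytes in place of the JSON test files; `send_udp_payload` is a hypothetical helper name, not something from the original setup.

```python
import socket


def send_udp_payload(host: str, port: int, size: int) -> int:
    """Send one UDP datagram of `size` filler bytes.

    Returns the number of bytes handed to the kernel, which says nothing
    about whether the datagram actually reaches the receiver.
    """
    payload = b"x" * size  # filler stands in for a JSON log line of that size
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))
```

For example, `send_udp_payload("192.168.100.200", 5160, 1440)` would mirror the `size-1440` test, using the address and port from the commands above. Because UDP gives the sender no delivery feedback, the result still has to be checked on the fluentd side.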

Using tcpdump to inspect traffic on port 5160 confirmed that oversized packets do produce a UDP, bad length message:

$ sudo tcpdump -i ens4 -nn -v 'port 5160'
08:15:07.295252 IP (tos 0x0, ttl 64, id 798, offset 0, flags [+], proto UDP (17), length 1460)
    192.168.100.1.5160 > 192.168.100.200.5160: UDP, bad length 2064 > 1432
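
The 1432 in this output matches the usual header arithmetic: an IP packet of length 1460, minus a 20-byte IPv4 header (no options) and an 8-byte UDP header, leaves 1432 bytes of payload. The `flags [+]` field means more fragments follow, which suggests the larger datagram was fragmented. Below is a small sketch of that calculation, assuming the 1460 shown as the IP length is the MTU on this path; the helper names are hypothetical.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
UDP_HEADER_BYTES = 8


def max_udp_payload(mtu: int) -> int:
    """Largest UDP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - UDP_HEADER_BYTES


def needs_fragmentation(payload_size: int, mtu: int) -> bool:
    """True when a datagram with payload_size bytes cannot fit in one packet."""
    return payload_size > max_udp_payload(mtu)


print(max_udp_payload(1460))            # 1432
print(needs_fragmentation(1432, 1460))  # False
print(needs_fragmentation(2000, 1460))  # True
```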

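[Elastic] Collecting and Parsing Kubernetes nginx ingress Logs with Filebeat

Posted 2020-06-14 | Updated 2020-11-19 | Category: Elastic | Word count: 465 | Reading time ≈ 1 min

Introduction

While configuring Filebeat on Kubernetes to parse nginx-ingress logs, I recently ran into some trouble, mainly because autodiscover and hints behave somewhat differently between older and newer versions. Here I record the configuration that finally worked in my tests.

Environment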

GKE Container-Optimized OS
Filebeat: 7.7.1
Elasticsearch: 7.7.1
Kubernetes/ingress-nginx: 0.32.0

Configuration

helm chart: `elastic/filebeat`

filebeatConfig:
  filebeat.yml: |
    filebeat.autodiscover:
      providers:
        - type: kubernetes
          hints.enabled: true
          hints.default_config.enabled: false

    output.elasticsearch:
      host: '${NODE_NAME}'
      hosts: '${ELASTICSEARCH_HOSTS:elasticsearch-master:9200}'
      protocol: http
      username: '${ELASTICSEARCH_USERNAME}'
      password: '${ELASTICSEARCH_PASSWORD}'

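[Raspberry Pi] Auto-Mounting an exFAT External Hard Drive at Boot

Posted 2020-05-30 | Updated 2020-11-19 | Category: Linux | Word count: 136 | Reading time ≈ 1 min

1. Plug in the external drive and check which device name it was assigned

# Show the partition tables
sudo fdisk -l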

Disk /dev/mmcblk0: 29.7 GiB, 31914983424 bytes, 62333952 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xda84cd12

Device         Boot  Start      End  Sectors  Size Id Type
/dev/mmcblk0p1 *      2048   526335   524288  256M  c W95 FAT32 (LBA)
/dev/mmcblk0p2      526336 62333918 61807583 29.5G 83 Linux


Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x2ac1c561

Device     Boot Start        End    Sectors  Size Id Type
/dev/sda1  *     2048 3907026943 3907024896  1.8T  c W95 FAT32 (LBA)

[Kubernetes] Running Both Internal and External nginx Ingress Controllers on GKE

Posted 2020-05-17 | Updated 2020-11-19 | Category: Cloud-Native | Word count: 787 | Reading time ≈ 1 min

Introduction

The native Ingress Controller on GKE has many restrictions and only works with ServiceType=NodePort, so I chose kubernetes/ingress-nginx as my ingress. Don't do what I did and install the nginxinc/kubernetes-ingress Ingress Controller by mistake at first XD

This diagram shows the architecture of the Kubernetes Ingress Controller on GKE. Besides the external ingress, I also need an internal one.

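[Debugging] Fixing Blocked Outgoing Mail on G Suite

Posted 2020-05-13 | Updated 2020-11-19 | Category: MIS | Word count: 250 | Reading time ≈ 1 min

Debugging Notes

Recently colleagues complained that mail sent to certain organizations often went missing. Checking with Google CheckMX turned up:

SPF must allow Google servers to send mail on behalf of your domain.

According to the documentation, an SPF TXT record needs to be added:

v=spf1 include:_spf.google.com ~all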
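
A quick way to sanity-check a record string is to look for the include mechanism. The sketch below is illustrative only: `spf_allows_google` is a hypothetical helper, it inspects a single TXT string rather than querying DNS, and it does not follow nested includes or redirects.

```python
GOOGLE_INCLUDE = "include:_spf.google.com"


def spf_allows_google(txt_record: str) -> bool:
    """Return True if an SPF TXT record string includes Google's servers."""
    terms = txt_record.strip().strip('"').lower().split()
    if not terms or terms[0] != "v=spf1":
        return False  # not an SPF record at all
    # Accept the mechanism with the default (pass) qualifier, written or implied.
    return any(t in (GOOGLE_INCLUDE, "+" + GOOGLE_INCLUDE) for t in terms[1:])


print(spf_allows_google("v=spf1 include:_spf.google.com ~all"))  # True
print(spf_allows_google("v=spf1 ~all"))                          # False
```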

[Debugging] SSL Certificate Troubleshooting Notes

Posted 2020-05-13 | Updated 2020-11-19 | Category: Linux | Word count: 232 | Reading time ≈ 1 min

Use the openssl CLI to test the target domain:

openssl s_client -connect example.com:443

Inspect the output, which shows the certificate chain, the server certificate, the verification result, and SSL handshake details:

---
Certificate chain
 0 s:OU = Domain Control Validated, CN = *.example.com
   i:C = US, ST = Arizona, L = Scottsdale, O = "GoDaddy.com, Inc.", OU = http://certs.godaddy.com/repository/, CN = Go Daddy Secure Certificate Authority - G2
---
SSL handshake has read 2283 bytes and written 420 bytes
Verification error: unable to verify the first certificate
---

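[DevOps] Building a CI/CD Pipeline with Jenkins and Ansible

Posted 2020-01-31 | Updated 2020-11-19 | Category: DevOps | Word count: 1.5k | Reading time ≈ 3 min

Design Principles for a Simple CI/CD Pipeline

Example GitHub repository

Follow GitHub Flow: the master branch is always verified and deployable
When a new Pull Request is opened, it is automatically deployed to the Staging environment and Smoke Testing runs; it can be merged into master only after other team members or a supervisor finish code review and testing
Only tagged commits can be deployed to the Production environment (manually)
For easier testing, other branches can be deployed to the Staging environment manually

Pipeline Setup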
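
The deployment rules above can be sketched as a small decision function. This is only an illustration of the policy, not the Jenkins pipeline itself; `DeployRequest` and `deploy_allowed` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class DeployRequest:
    environment: str                 # "staging" or "production"
    is_pull_request: bool = False    # triggered by a newly opened Pull Request
    is_tagged_commit: bool = False   # the commit carries a tag
    manual: bool = False             # triggered by a person, not automatically


def deploy_allowed(req: DeployRequest) -> bool:
    """Apply the pipeline rules listed above to one deploy request."""
    if req.environment == "staging":
        # Pull Requests deploy automatically; any branch may be deployed by hand.
        return req.is_pull_request or req.manual
    if req.environment == "production":
        # Only tagged commits, and only when triggered manually.
        return req.is_tagged_commit and req.manual
    return False


print(deploy_allowed(DeployRequest("production", is_tagged_commit=True, manual=True)))  # True
print(deploy_allowed(DeployRequest("production", manual=True)))                         # False
```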

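[Pitfall] fluentd daemonset failed to flush the buffer

Posted 2019-11-24 | Updated 2020-11-19 | Category: Cloud-Native | Word count: 768 | Reading time ≈ 1 min

Debugging Notes

Recently a colleague complained that elasticsearch kept losing data and asked me to check whether something was wrong with ES. The cluster health looked normal, the nodes still had plenty of free disk space, and a manual request also returned data, so I was stumped. I went back to trace from the data source: kubectl logs <application_pod> showed logs being written to stdout normally. Moving up the chain to fluentd, something looked off: kubectl logs <fluentd_pod> | grep -v info showed these warnings: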

2019-11-17 05:07:05 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:block
2019-11-17 05:24:05 +0000 [warn]: #0 failed to flush the buffer. retry_time=3 next_retry_seconds=2019-11-17 05:24:32 +0000 chunk="59778cb47a5c5dcf401f4d1c5b2cc88f" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"<elastic_cluster>\", :port=>9200, :scheme=>\"http\", :user=>\"elastic\", :password=>\"obfuscated\", :path=>\"\"}): connect_write timeout reached"

At first glance the buffer seemed to have blown up, so I adjusted the buffer size first and lengthened the timeout while I was at it, to see what would happen:

buffer_chunk_limit "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_CHUNK_LIMIT_SIZE'] || '8M'}"
buffer_queue_limit "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_QUEUE_LIMIT_LENGTH'] || '256'}"
request_timeout 15s
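
For a sense of scale, the buffer can hold at most the chunk limit times the queue limit before the overflow action kicks in. The sketch below works that out for the default values above, assuming size suffixes are powers of 1024 as in Fluentd's size type; `parse_size` and `max_buffer_bytes` are hypothetical helpers.

```python
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


def parse_size(value: str) -> int:
    """Convert a size string such as '8M' to bytes (suffixes are powers of 1024)."""
    value = value.strip().lower()
    if value and value[-1] in UNITS:
        return int(value[:-1]) * UNITS[value[-1]]
    return int(value)


def max_buffer_bytes(chunk_limit: str, queue_limit: int) -> int:
    """Upper bound on buffered data: chunk size limit times queue length limit."""
    return parse_size(chunk_limit) * queue_limit


print(max_buffer_bytes("8M", 256))  # 2147483648 bytes, i.e. 2 GiB
```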

The next day the logs showed it had blown up again. After digging more carefully, I found that someone had already reported this issue: Fluentd stopped sending data to ES for somewhile. #525

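[Notes] Install Linux Mint 19.2 on Dell Inspiron 7375

Posted 2019-11-10 | Updated 2020-11-19 | Category: Linux | Word count: 1.6k | Reading time ≈ 3 min

Introduction

I recently got a Ryzen laptop to run Linux on; these are notes on the installation details.

Installation Steps

I'll skip the installation itself. After booting, at the GRUB menu press e to edit the boot options, find the line starting with linux, remove quiet, and add noapic noacpi irqpoll
After logging in, open Update Manager > View > Linux Kernels, then select and install the latest stable version (currently 5.3.0-19)

Open a terminal and edit /etc/default/grub as follows: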

sudo vi /etc/default/grub

GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=on ivrs_ioapic[4]=00:14.0 ivrs_ioapic[5]=00:00.2 splash"

2020-05-15 Update: switched to the line below. I forget where I copied it from, but the system freezes less often with it
kernel: 5.3.0-40-generic

GRUB_CMDLINE_LINUX_DEFAULT="vga=current ivrs_ioapic[4]=00:14.0 ivrs_ioapic[5]=00:00.2 iommu=pt idle=nomwait acpi_backlight=vendor acpi_enforce_resources=lax scsi_mod.use_blk_mq=1"

sudo update-grub
[CentOS] Installing Zabbix + Grafana

Posted 2019-05-29 | Updated 2020-11-19 | Category: Linux | Word count: 720 | Reading time ≈ 1 min

Zabbix Installation Steps

Environment: CentOS 7

1. Disable SELinux

vim /etc/selinux/config

## Set SELINUX to disabled ##
SELINUX=disabled

reboot

2. Install Zabbix 4.0 LTS

rpm -Uvh https://repo.zabbix.com/zabbix/4.0/rhel/7/x86_64/zabbix-release-4.0-1.el7.noarch.rpm
yum clean all
yum install zabbix-server-mysql zabbix-web-mysql zabbix-agent

Relk's Work Notes
