### Core metrics for monitoring a production etcd cluster
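etcd service liveness

    up{job=~"kubernetes-etcd.*"} == 0

Note: `up == 0` means the scrape target is down.

etcd members detached from the cluster

    etcd_server_has_leader{job=~"kubernetes-etcd.*"} == 0

Note: this value should be 1 on every instance. A value of 0 means the member sees no leader and may have dropped out of the cluster; step in before more than half of the members reach this state.

etcd leader change count

    increase(etcd_server_leader_changes_seen_total{job=~"kubernetes-etcd.*"}[1h]) > 3

Note: this metric is a counter, so it only ever increases, and what you watch is how fast it grows. A high rate of change suggests the cluster is overloaded or the network between members is unstable.

Leader election failures

    rate(etcd_server_proposals_failed_total{job=~"kubernetes-etcd.*"}[15m]) != 0

Note: this metric is also a counter. A client write can be thought of as a proposal that the etcd members in the cluster have to vote on. A non-zero value means some proposals were not committed. If this happens often, leader elections are failing or more than half of the members are offline.

Percentage of failed HTTP requests over 5 minutes (tentative)

    sum by(method) (rate(etcd_http_failed_total{job=~"kubernetes-etcd.*"}[5m])) / sum by(method) (rate(etcd_http_received_total{job=~"kubernetes-etcd.*"}[5m])) > 0.05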
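The per-method failure ratio above can be worked through by hand. The following is a minimal Python sketch of the arithmetic only, not of how Prometheus evaluates the expression internally; the function names and the sample counter values are made up for illustration.

```python
def per_second_rate(first, last, window_seconds):
    """Average per-second growth of a counter between two samples (no reset handling)."""
    return (last - first) / window_seconds


def failed_ratio_by_method(failed, received, window_seconds=300):
    """Map each HTTP method to failed-rate / received-rate.

    `failed` and `received` map method -> (first_sample, last_sample) of the
    corresponding counter over the window. Methods with no matching series or
    no traffic are skipped, so they can never cross the threshold.
    """
    ratios = {}
    for method, (f0, f1) in failed.items():
        if method not in received:
            continue
        r0, r1 = received[method]
        received_rate = per_second_rate(r0, r1, window_seconds)
        if received_rate == 0:
            continue  # no requests in the window, nothing to compare against
        ratios[method] = per_second_rate(f0, f1, window_seconds) / received_rate
    return ratios


# Hypothetical counter samples taken 5 minutes apart.
samples_failed = {"GET": (10, 13), "PUT": (4, 34)}
samples_received = {"GET": (1000, 1300), "PUT": (500, 800)}

ratios = failed_ratio_by_method(samples_failed, samples_received)
alerting = {method: ratio for method, ratio in ratios.items() if ratio > 0.05}
print(alerting)  # only PUT exceeds the 5% threshold
```

With these numbers GET fails 1% of the time and PUT 10%, so only PUT would trip the `> 0.05` condition.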
etcd cluster leader switch count

    changes(etcd_server_leader_changes_seen_total{job=~".*"}[1d]) > 1
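Both leader checks read the same counter, once through `increase()` and once through `changes()`, and the two functions do not count the same thing. A simplified Python sketch of the difference follows; the helper names and sample values are illustrative, and real PromQL `increase()` additionally extrapolates to the edges of the window, which is ignored here.

```python
def counter_increase(samples):
    """Total growth across samples; a drop is treated as a counter reset to 0."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def value_changes(samples):
    """Number of times consecutive samples differ, which is what changes() counts."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


# Hypothetical scrapes of etcd_server_leader_changes_seen_total.
leader_changes_seen = [2, 2, 3, 3, 5, 5]
print(counter_increase(leader_changes_seen))  # 3.0
print(value_changes(leader_changes_seen))     # 2
```

Two leader changes that land between the same pair of scrapes raise the increase by 2 but register as a single change, so with a 120s scrape interval `changes()` can undercount.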
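Disk persistence latency (WAL and backend writes)

    histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*"}[5m])) > 0.5

Note: etcd's durability guarantees rest on the WAL and snapshot mechanisms, and both depend entirely on disk I/O. A slow disk will severely drag down etcd under heavy load, so SSDs are recommended over spinning disks in production. The expression above uses `etcd_disk_backend_commit_duration_seconds_bucket`, which measures how long backend commits take; WAL fsync latency itself is exposed separately as `etcd_disk_wal_fsync_duration_seconds_bucket`. Watching the 0.99 quantile gives a picture of how the disk is performing. If the value is only a few milliseconds, your etcd is in good shape.

Disk usage (database size as a percentage of the backend quota)

    (etcd_mvcc_db_total_size_in_bytes{} / etcd_server_quota_backend_bytes{}) * 100 > 80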
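To make the 0.99 quantile check on commit duration more concrete, here is a simplified Python sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation. It follows the general approach of PromQL's `histogram_quantile` for classic histograms but skips several edge cases, and the bucket counts are invented for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Interpolates linearly inside the bucket that contains the target rank.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Target falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts of commit durations, bounds in seconds.
commit_buckets = [
    (0.001, 0), (0.002, 10), (0.004, 60), (0.008, 90),
    (0.016, 99), (0.032, 100), (math.inf, 100),
]
p99 = histogram_quantile(0.99, commit_buckets)
print(f"p99 = {p99 * 1000:.1f} ms, alert = {p99 > 0.5}")
```

In this made-up distribution the 99th percentile comes out at about 16 ms, far below the 0.5 s threshold. Because the result is interpolated within a bucket, its precision is limited by the bucket boundaries.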
Prometheus YAML configuration

    - job_name: kubernetes-etcd-19
      scheme: https
      tls_config:
        cert_file: /usr/local/prometheus/ssl/kube-etcd-19.pem
        key_file: /usr/local/prometheus/ssl/kube-etcd-19-key.pem
        insecure_skip_verify: true
      scrape_interval: 120s
      static_configs:
        - targets: ['110.152.117.19:2379']
    - job_name: kubernetes-etcd-20
      scheme: https
      tls_config:
        cert_file: /usr/local/prometheus/ssl/kube-etcd-20.pem
        key_file: /usr/local/prometheus/ssl/kube-etcd-20-key.pem
        insecure_skip_verify: true
      scrape_interval: 120s
      static_configs:
        - targets: ['110.152.117.20:2379']
    - job_name: kubernetes-etcd-21
      scheme: https
      tls_config:
        cert_file: /usr/local/prometheus/ssl/kube-etcd-21.pem
        key_file: /usr/local/prometheus/ssl/kube-etcd-21-key.pem
        insecure_skip_verify: true
      scrape_interval: 120s
      static_configs:
        - targets: ['110.152.117.21:2379']

Prometheus rules configuration file
    groups:
    - name: Public Utilities Division etcd cluster monitoring  # project name; use the company name
      rules:
      - alert: etcd service liveness monitoring
        expr: up{job=~"kubernetes-etcd.*"} == 0
        for: 30s
        labels:
          severity: major
          team: ops-gt-monitor
          alert_type: etcd alert
          alert_host: "{{ $labels.service }}"
          alert_value: "{{ $value }}"
          alert_subject: etcd alert
        annotations:
          summary: etcd cluster monitoring
          description: "etcd instance is down or offline (resource: {{ $labels.instance }}), please handle it as soon as possible"
      - alert: etcd member detached monitoring
        expr: etcd_server_has_leader{job=~"kubernetes-etcd.*"} == 0
        for: 30s
        labels:
          severity: major
          team: ops-gt-monitor
          alert_type: etcd alert
          alert_host: "{{ $labels.service }}"
          alert_value: "{{ $value }}"
          alert_subject: etcd alert
        annotations:
          summary: etcd cluster monitoring
          description: "etcd member has no leader and may have left the cluster (resource: {{ $labels.instance }}), please handle it as soon as possible"
      - alert: etcd leader change count monitoring
        expr: increase(etcd_server_leader_changes_seen_total{job=~"kubernetes-etcd.*"}[1h]) > 3
        for: 30s
        labels:
          severity: major
          team: ops-gt-monitor
          alert_type: etcd alert
          alert_host: "{{ $labels.service }}"
          alert_value: "{{ $value }}"
          alert_subject: etcd alert
        annotations:
          summary: etcd cluster monitoring
          description: "etcd cluster is overloaded or the network connection is unstable (resource: {{ $labels.instance }}), please handle it as soon as possible"
      - alert: etcd leader election monitoring
        expr: rate(etcd_server_proposals_failed_total{job=~"kubernetes-etcd.*"}[15m]) != 0
        for: 30s
        labels:
          severity: major
          team: ops-gt-monitor
          alert_type: etcd alert
          alert_host: "{{ $labels.service }}"
          alert_value: "{{ $value }}"
          alert_subject: etcd alert
        annotations:
          summary: etcd cluster monitoring
          description: "etcd cluster leader election failed, value {{ $value }} (resource: {{ $labels.instance }}), please handle it as soon as possible"
      - alert: etcd leader switch count monitoring
        expr: changes(etcd_server_leader_changes_seen_total{job=~".*"}[1d]) > 1
        for: 30s
        labels:
          severity: major
          team: ops-gt-monitor
          alert_type: etcd alert
          alert_host: "{{ $labels.service }}"
          alert_value: "{{ $value }}"
          alert_subject: etcd alert
        annotations:
          summary: etcd cluster monitoring
          description: "etcd cluster leader switch count {{ $value }} (resource: {{ $labels.instance }}), please handle it as soon as possible"
      - alert: etcd cluster disk persistence latency
        expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*"}[5m])) > 0.5
        for: 30s
        labels:
          severity: major
          team: ops-gt-monitor
          alert_type: etcd alert
          alert_host: "{{ $labels.service }}"
          alert_value: "{{ $value }}"
          alert_subject: etcd alert
        annotations:
          summary: etcd cluster monitoring
          description: "etcd cluster backend commit duration (0.99 quantile) {{ $value }} (resource: {{ $labels.instance }}), please handle it as soon as possible"
      - alert: etcd cluster disk usage
        expr: (etcd_mvcc_db_total_size_in_bytes{} / etcd_server_quota_backend_bytes{}) * 100 > 80
        for: 30s
        labels:
          severity: major
          team: ops-gt-monitor
          alert_type: etcd alert
          alert_host: "{{ $labels.service }}"
          alert_value: "{{ $value }}"
          alert_subject: etcd alert
        annotations:
          summary: etcd cluster monitoring
          description: "etcd cluster disk usage {{ $value }} (resource: {{ $labels.instance }}), please handle it as soon as possible"
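Before loading rules like these, it can help to run each `expr` as an instant query and see which instances it would match. The Python sketch below builds the request URL for the Prometheus HTTP API endpoint `/api/v1/query` and parses a response body; the base URL `http://localhost:9090` and the sample response are assumptions for illustration, and no request is actually sent.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, expr):
    """URL for the Prometheus instant-query endpoint /api/v1/query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def matching_instances(payload):
    """Return (instance, value) pairs from an instant-vector query response body."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error')}")
    return [
        (series["metric"].get("instance", ""), float(series["value"][1]))
        for series in body["data"]["result"]
    ]


# Hypothetical Prometheus address; adjust to your own server.
url = build_query_url("http://localhost:9090", 'up{job=~"kubernetes-etcd.*"} == 0')
print(url)

# Hand-written response in the shape the API returns for an instant vector.
sample_response = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"job": "kubernetes-etcd-19", "instance": "110.152.117.19:2379"},
                "value": [1700000000, "0"],
            }
        ],
    },
})
print(matching_instances(sample_response))
```

An empty result list means the expression currently matches nothing, so the corresponding alert would not fire.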