Observability of KubeSphere Edge Nodes

Author: Zhu Yaguang, engineer at Zhejiang Lab, cloud-native/open-source enthusiast.
In edge computing scenarios, KubeSphere builds on KubeEdge to unify the distribution and management of applications and workloads across cloud and edge nodes, addressing the need to deliver, operate, and manage applications on massive fleets of edge and end devices.
According to the KubeSphere support matrix, only K8s 1.23.x supports edge computing, and the KubeSphere UI does not display monitoring information such as resource usage for edge nodes. This article builds an integrated cloud-edge computing platform based on KubeSphere and KubeEdge, and uses Prometheus to monitor the status of an NVIDIA Jetson edge device, bringing observability of edge nodes to KubeSphere.
Component versions:

- KubeSphere: 3.4.1
- containerd: 1.7.2
- K8s: 1.26.0
- KubeEdge: 1.15.1
- Jetson model: NVIDIA Jetson Xavier NX (16 GB RAM)
- Jtop: 4.2.7
- JetPack: 5.1.3-b29
- Docker: 24.0.5
Deploy the K8s Environment
Refer to the KubeSphere deployment documentation. With KubeKey you can quickly deploy a K8s cluster:
```bash
# All-in-one deployment of a single-master K8s cluster
./kk create cluster --with-kubernetes v1.26.0 --with-kubesphere v3.4.1 --container-manager containerd
```
Deploy the KubeEdge Environment
Refer to "Deploying the Latest KubeEdge on KubeSphere" to deploy KubeEdge.
Enable Edge Node Log Query
In `/etc/kubeedge/config/edgecore.yaml`, set `enable: true` for the stream module (`edgeStream`), which serves log queries from the cloud side:

```yaml
# vim /etc/kubeedge/config/edgecore.yaml
modules:
  edgeStream:
    enable: true
```
Once enabled, you can query pod logs on edge nodes, which makes troubleshooting much easier.
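As a quick check, once this is enabled you should be able to pull the logs of a pod running on an edge node from the cloud side (the names below are placeholders):

```bash
kubectl logs -n <namespace> <pod-on-edge-node>
```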
Modify the KubeSphere Configuration
Enable the KubeEdge Edge Node Plugin
Edit the ClusterConfiguration ConfigMap and set `advertiseAddress` to the address of the physical machine where cloudhub runs. See the KubeSphere documentation on enabling edge nodes: https://www.kubesphere.io/zh/docs/v3.3/pluggable-components/kubeedge/.

After this change, edge nodes show up in the UI, but without CPU and memory information; it turns out there is no node-exporter pod on the edge nodes.
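A minimal sketch of the change, assuming the ClusterConfiguration layout from the KubeSphere 3.x docs (field names may differ slightly across versions):

```yaml
# kubectl edit cc ks-installer -n kubesphere-system
spec:
  edgeruntime:
    enabled: true
    kubeedge:
      enabled: true
      cloudCore:
        cloudHub:
          advertiseAddress:
          - "<cloudhub-host-ip>"   # physical machine where cloudhub runs
```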
Modify node-exporter Affinity

`kubectl get ds -n kubesphere-monitoring-system` shows that the node-exporter DaemonSet will not be scheduled onto edge nodes. Modify its affinity as follows:

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-role.kubernetes.io/edgetest  # change the key here so the affinity exclusion no longer matches
            operator: DoesNotExist
```
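The same change can also be applied non-interactively with a JSON patch (a sketch; the path assumes the affinity layout shown above):

```bash
kubectl -n kubesphere-monitoring-system patch ds node-exporter --type='json' \
  -p='[{"op":"replace","path":"/spec/template/spec/affinity/nodeAffinity/requiredDuringSchedulingIgnoredDuringExecution/nodeSelectorTerms/0/matchExpressions/0/key","value":"node-role.kubernetes.io/edgetest"}]'
```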
node-exporter is now scheduled onto the edge node, but the pod fails to start.
Inspecting the failed pod with `kubectl edit`, we find that the node-exporter pod contains two containers, and the kube-rbac-proxy container fails to start. Its logs show that kube-rbac-proxy tries to read the KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT environment variables but cannot, so the container exits.
In a K8s cluster, these two environment variables are injected into every pod at creation time so that processes inside the pod can reach kube-apiserver. In pods created on KubeEdge edge nodes, however, the two variables exist but are empty.
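This can be confirmed from inside the pod (pod name as in our cluster; substitute your own):

```bash
# both variables print as empty on a KubeEdge edge node
kubectl exec -n kubesphere-monitoring-system node-exporter-hcmfg -c node-exporter -- \
  env | grep KUBERNETES_SERVICE
```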
We asked the KubeEdge developers, who said that setting these two environment variables will be added in KubeEdge 1.17. See: https://github.com/wackxu/kubeedge/blob/4a7c00783de9b11e56e56968b2cc950a7d32a403/docs/proposals/edge-pod-list-watch-natively.md.
In the meantime, they recommended installing EdgeMesh; once it is installed, pods on the edge can access kubernetes.default.svc.cluster.local:443.
Deploy EdgeMesh
Configure the cloudcore ConfigMap with `kubectl edit cm cloudcore -n kubeedge` and set `dynamicController: true`, then restart cloudcore:

```bash
kubectl delete pod cloudcore-776ffcbbb9-s6ff8 -n kubeedge
```

Configure the edgecore modules, enabling `metaServer` and setting `clusterDNS`:

```yaml
# vim /etc/kubeedge/config/edgecore.yaml
modules:
  ...
  metaManager:
    metaServer:
      enable: true        # configure this
  ...
  edged:
    ...
    tailoredKubeletConfig:
      ...
      clusterDNS:         # configure this
      - 169.254.96.16
  ...
```

Restart edgecore:

```bash
$ systemctl restart edgecore
```
After the changes, verify that they took effect:
```bash
$ curl 127.0.0.1:10550/api/v1/services
{"apiVersion":"v1","items":[{"apiVersion":"v1","kind":"Service","metadata":{"creationTimestamp":"2021-04-14T06:30:05Z","labels":{"component":"apiserver","provider":"kubernetes"},"name":"kubernetes","namespace":"default","resourceVersion":"147","selfLink":"default/services/kubernetes","uid":"55eeebea-08cf-4d1a-8b04-e85f8ae112a9"},"spec":{"clusterIP":"10.96.0.1","ports":[{"name":"https","port":443,"protocol":"TCP","targetPort":6443}],"sessionAffinity":"None","type":"ClusterIP"},"status":{"loadBalancer":{}}},{"apiVersion":"v1","kind":"Service","metadata":{"annotations":{"prometheus.io/port":"9153","prometheus.io/scrape":"true"},"creationTimestamp":"2021-04-14T06:30:07Z","labels":{"k8s-app":"kube-dns","kubernetes.io/cluster-service":"true","kubernetes.io/name":"KubeDNS"},"name":"kube-dns","namespace":"kube-system","resourceVersion":"203","selfLink":"kube-system/services/kube-dns","uid":"c221ac20-cbfa-406b-812a-c44b9d82d6dc"},"spec":{"clusterIP":"10.96.0.10","ports":[{"name":"dns","port":53,"protocol":"UDP","targetPort":53},{"name":"dns-tcp","port":53,"protocol":"TCP","targetPort":53},{"name":"metrics","port":9153,"protocol":"TCP","targetPort":9153}],"selector":{"k8s-app":"kube-dns"},"sessionAffinity":"None","type":"ClusterIP"},"status":{"loadBalancer":{}}}],"kind":"ServiceList","metadata":{"resourceVersion":"377360","selfLink":"/api/v1/services"}}
```

Install EdgeMesh:

```bash
git clone https://github.com/kubeedge/edgemesh.git
cd edgemesh
kubectl apply -f build/crds/istio/
kubectl apply -f build/agent/resources/
```
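To confirm the deployment (assuming the default manifests, which install the agent into the kubeedge namespace), check that an edgemesh-agent pod is running on each node:

```bash
kubectl get pods -n kubeedge -o wide | grep edgemesh-agent
```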
dnsPolicy
After EdgeMesh is deployed, the two environment variables in node-exporter on the edge node are still empty, and kubernetes.default.svc.cluster.local:443 is still unreachable. The cause is a wrong DNS server configuration in the pod: it should be 169.254.96.16, but instead it matches the host's DNS configuration:
```bash
kubectl exec -it node-exporter-hcmfg -n kubesphere-monitoring-system -- sh
Defaulted container "node-exporter" out of: node-exporter, kube-rbac-proxy
$ cat /etc/resolv.conf
nameserver 127.0.0.53
```
After changing dnsPolicy to ClusterFirstWithHostNet and restarting node-exporter, the DNS configuration is correct:
```bash
kubectl edit ds node-exporter -n kubesphere-monitoring-system
```

```yaml
dnsPolicy: ClusterFirstWithHostNet
hostNetwork: true
```
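As a sanity check, /etc/resolv.conf inside the pod should now point at the cluster DNS configured earlier (169.254.96.16):

```bash
kubectl exec -n kubesphere-monitoring-system node-exporter-hcmfg -c node-exporter -- cat /etc/resolv.conf
```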
Add Environment Variables
```bash
vim /etc/systemd/system/edgecore.service
```

```ini
Environment=METASERVER_DUMMY_IP=kubernetes.default.svc.cluster.local
Environment=METASERVER_DUMMY_PORT=443
```
After the change, reload systemd and restart edgecore:

```bash
systemctl daemon-reload
systemctl restart edgecore
```
node-exporter finally goes Running!
On the edge node, `curl http://127.0.0.1:9100/metrics` shows that metrics are now being collected from the edge node.
Finally, we can expose KubeSphere's Prometheus (prometheus-k8s) service through NodePort so the data can be viewed in a browser:
```yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.39.1
  name: prometheus-k8s-nodeport
  namespace: kubesphere-monitoring-system
spec:
  ports:
  - port: 9090
    targetPort: 9090
    protocol: TCP
    nodePort: 32143
  selector:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
  type: NodePort
```
By visiting the master IP on port 32143, you can access the edge node's node-exporter data, and CPU and memory information now appears in the UI as well. With CPU and memory done, next up is the GPU.
Monitor Jetson GPU Status
Install Jtop
First, the Jetson is an ARM device and cannot run nvidia-smi, so we install Jtop instead:
```bash
sudo apt-get install python3-pip python3-dev -y
sudo -H pip3 install jetson-stats
sudo systemctl restart jtop.service
```
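jetson-stats also exposes the same data through a Python API (jtop), which the exporter below builds on. A minimal sketch of reading GPU utilization:

```python
from jtop import jtop

# jtop talks to the jtop.service daemon over /run/jtop.sock
with jtop() as jetson:
    if jetson.ok():
        print("GPU %:", jetson.stats["GPU"])        # instantaneous GPU utilization
        print("Power mode:", jetson.nvpmodel.name)  # current NV power model
```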
Install the Jetson GPU Exporter
Following the reference blog, we build the Jetson GPU Exporter image; the blog also provides a matching Grafana dashboard.

Dockerfile:

```dockerfile
FROM python:3-buster
RUN pip install --upgrade pip && pip install -U jetson-stats prometheus-client
RUN mkdir -p /root
COPY jetson_stats_prometheus_collector.py /root/jetson_stats_prometheus_collector.py
WORKDIR /root
USER root
RUN chmod +x /root/jetson_stats_prometheus_collector.py
ENTRYPOINT ["python3", "/root/jetson_stats_prometheus_collector.py"]
```

jetson_stats_prometheus_collector.py:

```python
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import atexit
import os

from jtop import jtop, JtopException
from prometheus_client.core import InfoMetricFamily, GaugeMetricFamily, REGISTRY, CounterMetricFamily
from prometheus_client import make_wsgi_app
from wsgiref.simple_server import make_server


class CustomCollector(object):
    def __init__(self):
        atexit.register(self.cleanup)
        self._jetson = jtop()
        self._jetson.start()

    def cleanup(self):
        print("Closing jetson-stats connection...")
        self._jetson.close()

    def collect(self):
        # spin=True: do not block waiting for the next data read to complete
        if self._jetson.ok(spin=True):
            # Board info
            i = InfoMetricFamily('gpu_info_board', 'Board sys info', labels=['board_info'])
            i.add_metric(['info'], {
                'machine': self._jetson.board['info']['machine'] if 'machine' in self._jetson.board.get('info', {}) else self._jetson.board['hardware']['Module'],
                'jetpack': self._jetson.board['info']['jetpack'] if 'jetpack' in self._jetson.board.get('info', {}) else self._jetson.board['hardware']['Jetpack'],
                'l4t': self._jetson.board['info']['L4T'] if 'L4T' in self._jetson.board.get('info', {}) else self._jetson.board['hardware']['L4T'],
            })
            yield i

            # Board hardware info
            i = InfoMetricFamily('gpu_info_hardware', 'Board hardware info', labels=['board_hw'])
            i.add_metric(['hardware'], {
                'codename': self._jetson.board['hardware'].get('Codename', self._jetson.board['hardware'].get('CODENAME', 'unknown')),
                'soc': self._jetson.board['hardware'].get('SoC', self._jetson.board['hardware'].get('SOC', 'unknown')),
                'module': self._jetson.board['hardware'].get('P-Number', self._jetson.board['hardware'].get('MODULE', 'unknown')),
                'board': self._jetson.board['hardware'].get('699-level Part Number', self._jetson.board['hardware'].get('BOARD', 'unknown')),
                'cuda_arch_bin': self._jetson.board['hardware'].get('CUDA Arch BIN', self._jetson.board['hardware'].get('CUDA_ARCH_BIN', 'unknown')),
                'serial_number': self._jetson.board['hardware'].get('Serial Number', self._jetson.board['hardware'].get('SERIAL_NUMBER', 'unknown')),
            })
            yield i

            # NV power mode
            i = InfoMetricFamily('gpu_nvpmode', 'NV power mode', labels=['nvpmode'])
            i.add_metric(['mode'], {'mode': self._jetson.nvpmodel.name})
            yield i

            # System uptime
            g = GaugeMetricFamily('gpu_uptime', 'System uptime', labels=['uptime'])
            days = self._jetson.uptime.days
            seconds = self._jetson.uptime.seconds
            hours = seconds // 3600
            minutes = (seconds // 60) % 60
            g.add_metric(['days'], days)
            g.add_metric(['hours'], hours)
            g.add_metric(['minutes'], minutes)
            yield g

            # CPU usage (CPU1..CPU8, collapsed into a loop; equivalent to the original per-core lines)
            g = GaugeMetricFamily('gpu_usage_cpu', 'CPU % schedutil', labels=['cpu'])
            for n in range(1, 9):
                key = 'CPU%d' % n
                g.add_metric(['cpu_%d' % n], self._jetson.stats[key] if (key in self._jetson.stats and isinstance(self._jetson.stats[key], int)) else 0)
            yield g

            # GPU usage
            g = GaugeMetricFamily('gpu_usage_gpu', 'GPU % schedutil', labels=['gpu'])
            g.add_metric(['val'], self._jetson.stats['GPU'])
            yield g

            # Fan usage
            g = GaugeMetricFamily('gpu_usage_fan', 'Fan usage', labels=['fan'])
            g.add_metric(['speed'], self._jetson.fan.get('speed', self._jetson.fan.get('pwmfan', {'speed': [0]})['speed'][0]))
            yield g

            # Sensor temperatures
            g = GaugeMetricFamily('gpu_temperatures', 'Sensor temperatures', labels=['temperature'])
            keys = ['AO', 'GPU', 'Tdiode', 'AUX', 'CPU', 'thermal', 'Tboard']
            for key in keys:
                if key in self._jetson.temperature:
                    g.add_metric([key.lower()], self._jetson.temperature[key]['temp'] if isinstance(self._jetson.temperature[key], dict) else self._jetson.temperature.get(key, 0))
            yield g

            # Power (rail names differ between jtop versions, hence the two branches)
            g = GaugeMetricFamily('gpu_usage_power', 'Power usage', labels=['power'])
            if isinstance(self._jetson.power, dict):
                rail = self._jetson.power['rail']
                g.add_metric(['cv'], rail['VDD_CPU_CV']['avg'] if 'VDD_CPU_CV' in rail else rail.get('CV', {'avg': 0}).get('avg'))
                g.add_metric(['gpu'], rail['VDD_GPU_SOC']['avg'] if 'VDD_GPU_SOC' in rail else rail.get('GPU', {'avg': 0}).get('avg'))
                g.add_metric(['sys5v'], rail['VIN_SYS_5V0']['avg'] if 'VIN_SYS_5V0' in rail else rail.get('SYS5V', {'avg': 0}).get('avg'))
            if isinstance(self._jetson.power, tuple):
                g.add_metric(['cv'], self._jetson.power[1]['CV']['cur'] if 'CV' in self._jetson.power[1] else 0)
                g.add_metric(['gpu'], self._jetson.power[1]['GPU']['cur'] if 'GPU' in self._jetson.power[1] else 0)
                g.add_metric(['sys5v'], self._jetson.power[1]['SYS5V']['cur'] if 'SYS5V' in self._jetson.power[1] else 0)
            yield g

            # Processes
            try:
                processes = self._jetson.processes
                i = InfoMetricFamily('gpu_processes', 'Process usage', labels=['process'])
                for index in range(len(processes)):
                    i.add_metric(['info'], {
                        'pid': str(processes[index][0]),
                        'user': processes[index][1],
                        'gpu': processes[index][2],
                        'type': processes[index][3],
                        'priority': str(processes[index][4]),
                        'state': processes[index][5],
                        'cpu': str(processes[index][6]),
                        'memory': str(processes[index][7]),
                        'gpu_memory': str(processes[index][8]),
                        'name': processes[index][9],
                    })
                yield i
            except AttributeError:
                # this jtop version does not expose a processes attribute
                pass


if __name__ == '__main__':
    port = os.environ.get('PORT', 9998)
    REGISTRY.register(CustomCollector())
    app = make_wsgi_app()
    httpd = make_server('', int(port), app)
    print('Serving on port: ', port)
    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        print('Goodbye!')
```

Remember to label the Jetson node so that the GPU exporter runs only on the Jetson; on other nodes it would error out because no data can be collected:

```bash
kubectl label node edge-wpx machine.type=jetson
```
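Before deploying to the cluster, the image can be sanity-checked directly on the Jetson (Docker is available there per the version list; the tag matches the DaemonSet manifest below):

```bash
docker build -t jetson-status-exporter:v1 .
docker run --rm -p 9998:9998 -v /run/jtop.sock:/run/jtop.sock jetson-status-exporter:v1
# in another shell:
curl http://127.0.0.1:9998/metrics
```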
Create KubeSphere Resources

Create a ServiceAccount, DaemonSet, Service, and ServiceMonitor so that the data collected by jetson-exporter is fed to KubeSphere's Prometheus.
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: jetson-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 1.0.0
  name: jetson-exporter
  namespace: kubesphere-monitoring-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: jetson-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 1.0.0
  name: jetson-exporter
  namespace: kubesphere-monitoring-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: jetson-exporter
      app.kubernetes.io/part-of: kube-prometheus
  template:
    metadata:
      labels:
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: jetson-exporter
        app.kubernetes.io/part-of: kube-prometheus
        app.kubernetes.io/version: 1.0.0
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-role.kubernetes.io/edge
                operator: Exists
      containers:
      - image: jetson-status-exporter:v1
        imagePullPolicy: IfNotPresent
        name: jetson-exporter
        resources:
          limits:
            cpu: 1
            memory: 500Mi
          requests:
            cpu: 102m
            memory: 180Mi
        ports:
        - containerPort: 9998
          hostPort: 9998
          name: http
          protocol: TCP
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /run/jtop.sock
          name: jtop-sock
          readOnly: true
      dnsPolicy: ClusterFirstWithHostNet
      hostNetwork: true
      hostPID: true
      nodeSelector:
        kubernetes.io/os: linux
        machine.type: jetson
      restartPolicy: Always
      schedulerName: default-scheduler
      serviceAccount: jetson-exporter
      terminationGracePeriodSeconds: 30
      tolerations:
      - operator: Exists
      volumes:
      - hostPath:
          path: /run/jtop.sock
          type: Socket
        name: jtop-sock
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: jetson-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 1.0.0
  name: jetson-exporter
  namespace: kubesphere-monitoring-system
spec:
  clusterIP: None
  clusterIPs:
  - None
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: http
    port: 9998
    protocol: TCP
    targetPort: http
  selector:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: jetson-exporter
    app.kubernetes.io/part-of: kube-prometheus
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: jetson-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/vendor: kubesphere
    app.kubernetes.io/version: 1.0.0
  name: jetson-exporter
  namespace: kubesphere-monitoring-system
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 1m
    port: http
    relabelings:
    - action: replace
      regex: (.*)
      replacement: $1
      sourceLabels:
      - __meta_kubernetes_pod_node_name
      targetLabel: instance
    - action: labeldrop
      regex: (service|endpoint|container)
    scheme: http
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: app.kubernetes.io/name
  selector:
    matchLabels:
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: jetson-exporter
      app.kubernetes.io/part-of: kube-prometheus
```

After deployment, the jetson-exporter pod is Running. Restart the Prometheus pod to reload the configuration, after which the newly added GPU exporter target appears in the Prometheus UI:
```bash
kubectl delete pod prometheus-k8s-0 -n kubesphere-monitoring-system
```
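To confirm the scrape target without opening the UI, Prometheus's HTTP API can also be queried through the NodePort created earlier (replace <master-ip> with your master's address):

```bash
curl -s 'http://<master-ip>:32143/api/v1/targets' | grep jetson
```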
View GPU Monitoring Data in the KubeSphere Frontend

The frontend requires changes to KubeSphere's console code; that is frontend work and is not covered in detail here.
Next, the Prometheus SVC port needs to be exposed via NodePort so that the frontend can query GPU status through Prometheus's HTTP API; the prometheus-k8s-nodeport Service created earlier already does exactly this, so the same manifest applies here.

HTTP API

Query instantaneous values:
```text
GET http://masterip:32143/api/v1/query?query=gpu_info_board_info&time=1711431293.686
GET http://masterip:32143/api/v1/query?query=gpu_info_hardware_info&time=1711431590.574
GET http://masterip:32143/api/v1/query?query=gpu_usage_gpu&time=1711431590.574
```
Here `query` is the name of the metric being queried and `time` is the query timestamp. To query collected values over a time range:
```text
GET http://10.11.140.87:32143/api/v1/query_range?query=gpu_usage_gpu&start=1711428221.998&end=1711431821.998&step=14
```
Here `query` is the metric name, `start` and `end` are the start and end times, and `step` is the sampling interval.
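For instance, a minimal Python sketch of calling the query_range endpoint and walking the response (host and port are the NodePort exposed above):

```python
import requests

resp = requests.get(
    "http://10.11.140.87:32143/api/v1/query_range",
    params={
        "query": "gpu_usage_gpu",
        "start": 1711428221.998,
        "end": 1711431821.998,
        "step": 14,
    },
)
resp.raise_for_status()
# Prometheus returns {"status": "success", "data": {"result": [...]}}
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["values"][:3])  # first few [timestamp, value] pairs
```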
With that, KubeSphere successfully monitors the GPU status of the Jetson KubeEdge edge node.

Summary
Based on KubeEdge, we implemented observability of edge devices, including GPU information, in the KubeSphere frontend.
For edge node CPU and memory monitoring, we first modified the affinity so that KubeSphere's built-in node-exporter could collect metrics on edge nodes, then used KubeEdge's EdgeMesh to deliver the collected data to KubeSphere's Prometheus. This gave us CPU and memory monitoring.
For edge node GPU monitoring, we installed jtop to obtain GPU utilization, temperature, and other data, developed a Jetson GPU Exporter to feed the jtop data to KubeSphere's Prometheus, and modified the KubeSphere frontend (ks-console) to fetch the Prometheus data over the HTTP API and display it in the UI. This gave us monitoring of GPU utilization and related metrics.