Prometheus Operator
This article covers deploying prometheus-operator, collecting metrics, persisting data, and using PrometheusAlert + Alertmanager to deliver alerts to Feishu, DingTalk, email, and other channels.
1. Introduction to Prometheus
Prometheus is an open-source systems monitoring and alerting framework that is itself a time-series database (TSDB).
1.1 Prometheus features
- A multi-dimensional data model, with time-series data identified by metric name and key/value pairs
- PromQL for querying and aggregating data, making retrieval very flexible
- No dependency on external data stores: Prometheus is itself a time-series database offering local and distributed storage, and every Prometheus server is autonomous
- Applications expose a Metrics endpoint; Prometheus collects data over an HTTP pull model, and a PushGateway is available for pushing data
- Targets can be found through both dynamic service discovery and static configuration
- Support for many graphing and dashboard options, pairing perfectly with Grafana
1.2 Prometheus architecture
- Prometheus Server: the core component, responsible for the following:
  - Scraping: Prometheus periodically pulls metrics over HTTP from configured targets (applications, servers, databases, and so on). These targets expose metrics in the Prometheus format, typically through "exporters".
  - Storage: scraped metrics are stored in a time-series database on local disk. Prometheus uses an efficient storage format, the TSDB (Time Series Database), which handles large volumes of time-series data well.
  - Querying: Prometheus provides a powerful query language, PromQL (Prometheus Query Language), which lets users query and analyze the stored time series in real time.
- Exporters: collect monitoring data and expose it on an endpoint such as /metrics for Prometheus to scrape. For example, host metrics come from node_exporter and MySQL metrics from mysql_exporter.
  - Node Exporter: host-level metrics such as CPU, memory, and disk
  - Blackbox Exporter: probes the availability of network services such as HTTP, HTTPS, and DNS
  - MySQL Exporter: MySQL database metrics
  - Redis, MongoDB, and many others: https://prometheus.io/docs/instrumenting/exporters/
- Pushgateway: an optional component that accepts metrics pushed by short-lived jobs (such as batch jobs). Because these jobs may not live long enough for Prometheus to scrape them directly, they push their metrics to the Pushgateway, which Prometheus then scrapes periodically.
- Alertmanager: a standalone component that handles alerts sent by Prometheus, supporting:
  - Grouping: groups similar alerts to reduce the number of notifications
  - Inhibition: suppresses notifications for some alerts while certain other alerts are firing
  - Silencing: mutes notifications for given alerts during specific time windows
  - Notification: delivers alerts through various channels (email, WeChat, Feishu, and so on)
- Web UI: Prometheus provides a simple web UI for running queries and viewing metric data. It also integrates with third-party visualization tools such as Grafana for richer visualization and dashboards.
- Service Discovery: Prometheus supports several mechanisms for automatically discovering scrape targets, including:
  - Static configuration: targets listed manually in the Prometheus configuration file
  - DNS service discovery: targets discovered through DNS records
  - File-based discovery: targets read from configuration files
  - Consul service discovery
  - Kubernetes service discovery: Pods and Services discovered through the Kubernetes API
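For reference, a plain prometheus.yml scrape job using Kubernetes service discovery might look like the sketch below. This is illustrative only (the job name and annotation convention are assumptions); with the Operator, the same intent is normally expressed through a ServiceMonitor instead:

```yaml
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod                # discover every Pod through the Kubernetes API
  relabel_configs:
  # keep only Pods annotated with prometheus.io/scrape: "true"
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: "true"
```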
The architecture diagram is shown below.
2. Deploying Prometheus on Kubernetes
Prometheus Operator
The Prometheus Operator uses Kubernetes Custom Resource Definitions (CRDs) to orchestrate Prometheus, Alertmanager, and other monitoring resources. The currently supported CRDs are:
- Prometheus: defines how to deploy Prometheus
- Alertmanager: defines how to deploy Alertmanager
- ServiceMonitor: specifies how to implement service discovery through Services (e.g. ./nodeExporter-serviceMonitor.yaml)
- PodMonitor: declaratively specifies how to monitor groups of Kubernetes Pods
- Probe: specifies static probe configuration
- ScrapeConfig: specifies scrape configurations to append to Prometheus; this CRD helps scrape resources outside the Kubernetes cluster
- PrometheusRule: defines Prometheus alerting rules (e.g. ./nodeExporter-prometheusRule.yaml)
- AlertmanagerConfig: specifies Alertmanager configuration, allowing alerts to be routed to custom receivers and inhibition rules to be set
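As a taste of the PrometheusRule CRD, here is a minimal sketch. The alert name and threshold are illustrative (kube-prometheus ships its own rule files under manifests/); the prometheus: k8s and role: alert-rules labels match the default ruleSelector of the kube-prometheus Prometheus resource:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules
  namespace: monitoring
  labels:
    prometheus: k8s      # must match the ruleSelector of the Prometheus CR
    role: alert-rules
spec:
  groups:
  - name: node.rules
    rules:
    - alert: HostDown
      expr: up == 0      # the target failed its last scrape
      for: 2m
      labels:
        severity: critical
      annotations:
        description: "{{ $labels.instance }} has been unreachable for 2 minutes"
```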
2.1 Installation
kube-prometheus project: https://github.com/prometheus-operator/kube-prometheus/
First, use the project's compatibility matrix to find the kube-prometheus release that matches your Kubernetes version:
[root@k8s-master01 prometheus]# git clone -b release-0.13 https://github.com/prometheus-operator/kube-prometheus.git
[root@k8s-master01 prometheus]# cd kube-prometheus/manifests/
Install the Prometheus Operator:
[root@k8s-master01 manifests]# kubectl create -f setup/
[root@k8s-master01 manifests]# kubectl get pods -n monitoring
NAME READY STATUS RESTARTS AGE
prometheus-operator-74cb44ccf7-t8m25 2/2 Running 0 25h
Once the Operator container is up, install the Prometheus stack. You may need to repoint the images at a local registry if the defaults cannot be pulled:
[root@k8s-master01 manifests]# kubectl create -f .
[root@k8s-master01 manifests]# kubectl get pods -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 2/2 Running 0 25h
alertmanager-main-1 2/2 Running 0 25h
alertmanager-main-2 2/2 Running 0 25h
blackbox-exporter-597d86cf5c-xd2m7 3/3 Running 0 25h
grafana-674557f4bd-sgnbb 1/1 Running 0 24h
kube-state-metrics-56f84757db-nn66j 3/3 Running 0 24h
mysql-exporter-65bdf76bb9-klq8f 1/1 Running 0 16h
node-exporter-4qbgp 2/2 Running 0 25h
node-exporter-bwqcs 2/2 Running 0 25h
node-exporter-bz9pk 2/2 Running 0 25h
node-exporter-cdzm4 2/2 Running 0 25h
node-exporter-ds4zd 2/2 Running 0 25h
node-exporter-j6kb8 2/2 Running 0 25h
node-exporter-vgdjw 2/2 Running 0 25h
node-exporter-zcfb6 2/2 Running 0 25h
node-exporter-zjjb9 2/2 Running 0 25h
prometheus-adapter-7786cd46-6cfhd 1/1 Running 0 25h
prometheus-adapter-7786cd46-qwrmb 1/1 Running 0 25h
prometheus-k8s-0 2/2 Running 0 19h
prometheus-k8s-1 2/2 Running 0 19h
prometheus-operator-74cb44ccf7-t8m25 2/2 Running 0 25h
Change the grafana and prometheus-k8s Services to NodePort:
[root@k8s-master01 manifests]# kubectl get svc -n monitoring
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
alertmanager-main ClusterIP 10.96.41.127 <none> 9093/TCP,8080/TCP 25h
alertmanager-operated ClusterIP None <none> 9093/TCP,9094/TCP,9094/UDP 25h
blackbox-exporter ClusterIP 10.96.145.255 <none> 9115/TCP,19115/TCP 25h
grafana NodePort 10.96.20.246 <none> 3000:30336/TCP 25h
kube-state-metrics ClusterIP None <none> 8443/TCP,9443/TCP 25h
mysql-exporter ClusterIP 10.96.74.237 <none> 9104/TCP 17h
node-exporter ClusterIP None <none> 9100/TCP 25h
prometheus-adapter ClusterIP 10.96.122.86 <none> 443/TCP 25h
prometheus-k8s NodePort 10.96.206.87 <none> 9090:32159/TCP,8080:32744/TCP 25h
prometheus-operated ClusterIP None <none> 9090/TCP 25h
prometheus-operator ClusterIP None <none> 8443/TCP 25h
2.1 (continued) Data persistence
When deploying the Prometheus Operator on a Kubernetes cluster, some components need persistent storage so that data survives Pod restarts and rescheduling. This setup uses rook-ceph storage; see the rook-ceph setup article for deployment and usage.
- Prometheus
Prometheus is the time-series database that stores the monitoring data. Persisting it is critical: the data drives historical queries, alerting, and analysis.
[root@k8s-master01 manifests]# cat prometheus-prometheus.yaml
# add the following
  storage:
    ephemeral:
      volumeClaimTemplate:
        spec:
          storageClassName: rook-ceph-block
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
- Alertmanager
Alertmanager handles the alerts sent by Prometheus and routes them to the appropriate receivers (email, Slack, and so on). Persisting Alertmanager data preserves alert history and silence rules.
[root@k8s-master01 manifests]# cat alertmanager-alertmanager.yaml
# add the following
  storage:
    ephemeral:
      volumeClaimTemplate:
        spec:
          storageClassName: rook-ceph-block
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
- Grafana
If Grafana is integrated for visualization, its configuration and dashboard data need persistence as well.
[root@k8s-master01 manifests]# cat grafana-deployment.yaml
# modify the following
      volumes:
      - name: grafana-storage
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes: ["ReadWriteOnce"]
              storageClassName: "rook-ceph-block"
              resources:
                requests:
                  storage: 1Gi
2.2 Sources of monitoring data
The monitoring flow with Prometheus on Kubernetes is: create a ServiceMonitor to register a scrape target; its selector matches a Service; the Operator then automatically discovers the ServiceMonitor and renders it into Prometheus configuration.
Commonly used exporters include:
| Type | Exporter |
| --- | --- |
| Database | MySQL Exporter, Redis Exporter, MongoDB Exporter, MSSQL Exporter |
| Hardware | Apcupsd Exporter, IoT Edison Exporter, IPMI Exporter, Node Exporter |
| Message queue | Beanstalkd Exporter, Kafka Exporter, NSQ Exporter, RabbitMQ Exporter |
| Storage | Ceph Exporter, Gluster Exporter, HDFS Exporter, ScaleIO Exporter |
| HTTP service | Apache Exporter, HAProxy Exporter, Nginx Exporter |
| API service | AWS ECS Exporter, Docker Cloud Exporter, Docker Hub Exporter, GitHub Exporter |
| Logging | Fluentd Exporter, Grok Exporter |
| Monitoring systems | Collectd Exporter, Graphite Exporter, InfluxDB Exporter, Nagios Exporter, SNMP Exporter |
| Other | Blackbox Exporter, JIRA Exporter, Jenkins Exporter, Confluence Exporter |
2.3 Monitoring Etcd (a cloud-native application)
Test access to the Etcd metrics endpoint.
Point the command at your own etcd certificates and address. This cluster was installed with kubeadm and has three masters; any one of them works for the test.
[root@k8s-master01 manifests]# curl -s --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key https://192.168.31.20:2379/metrics -l -k|tail -5
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 4728
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
The certificate paths can be found in the Etcd configuration file (the location varies by cluster: a kubeadm install typically uses /etc/kubernetes/manifests/etcd.yaml, a binary install /usr/lib/systemd/system/etcd.service):
2.3.1 Creating the Etcd Service
First, configure the Service and Endpoints for etcd:
[root@k8s-master01 prometheus]# vim etcd-svc.yaml
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app: etcd-prom
  name: etcd-prom
  namespace: kube-system
subsets:
- addresses:
  - ip: 192.168.31.20
  - ip: 192.168.31.21
  - ip: 192.168.31.22
  ports:
  - name: https-metrics
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: etcd-prom
  name: etcd-prom
  namespace: kube-system
spec:
  ports:
  - name: https-metrics
    port: 2379
    protocol: TCP
    targetPort: 2379
  type: ClusterIP
Change the addresses to your own etcd cluster members. Note that the port name https-metrics must stay consistent with the one used in the ServiceMonitor created later. Create the resources, then check the Service's ClusterIP:
[root@k8s-master01 prometheus]# kubectl get svc -n kube-system etcd-prom
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
etcd-prom ClusterIP 10.96.94.176 <none> 2379/TCP 20h
Testing through the ClusterIP shows the etcd metrics are reachable via the Service:
[root@k8s-master01 prometheus]# curl -s --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key https://10.96.94.176:2379/metrics -l -k|tail -5
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 4760
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
Because etcd must be accessed over https, the etcd certificate files have to be mounted into the Prometheus container (this deployment uses the Operator, so only the Prometheus resource needs to change).
Create a Secret from the Etcd certificates (adjust the paths to your environment):
[root@k8s-master01 manifests]# kubectl create secret generic etcd-ssl --from-file=/etc/kubernetes/pki/etcd/ca.crt --from-file=/etc/kubernetes/pki/etcd/server.crt --from-file=/etc/kubernetes/pki/etcd/server.key -n monitoring
[root@k8s-master01 manifests]# kubectl get secret -n monitoring
NAME TYPE DATA AGE
etcd-ssl Opaque 3 20h
Mount the secret by editing the Prometheus resource:
[root@k8s-master01 manifests]# kubectl edit prometheus -n monitoring k8s
...
...
  secrets:
  - etcd-ssl
...
After saving and exiting, the Prometheus Pods restart automatically. Once they are back, verify the certificates are mounted (any Prometheus Pod will do):
[root@k8s-master01 manifests]# kubectl get pods -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 2/2 Running 0 25h
alertmanager-main-1 2/2 Running 0 25h
alertmanager-main-2 2/2 Running 0 25h
blackbox-exporter-597d86cf5c-xd2m7 3/3 Running 0 25h
grafana-674557f4bd-sgnbb 1/1 Running 0 25h
kube-state-metrics-56f84757db-nn66j 3/3 Running 0 24h
mysql-exporter-65bdf76bb9-klq8f 1/1 Running 0 17h
node-exporter-4qbgp 2/2 Running 0 25h
node-exporter-bwqcs 2/2 Running 0 25h
node-exporter-bz9pk 2/2 Running 0 25h
node-exporter-cdzm4 2/2 Running 0 25h
node-exporter-ds4zd 2/2 Running 0 25h
node-exporter-j6kb8 2/2 Running 0 25h
node-exporter-vgdjw 2/2 Running 0 25h
node-exporter-zcfb6 2/2 Running 0 25h
node-exporter-zjjb9 2/2 Running 0 25h
prometheus-adapter-7786cd46-6cfhd 1/1 Running 0 25h
prometheus-adapter-7786cd46-qwrmb 1/1 Running 0 25h
prometheus-k8s-0 2/2 Running 0 20h
prometheus-k8s-1 2/2 Running 0 20h
prometheus-operator-74cb44ccf7-t8m25 2/2 Running 0 25h
kubectl describe shows the Prometheus configuration directory, which is under /etc/prometheus/:
[root@k8s-master01 manifests]# kubectl describe pods -n monitoring prometheus-k8s-0
...
Args:
--watch-interval=0
--listen-address=:8081
--config-file=/etc/prometheus/config/prometheus.yaml.gz
--config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
--watched-dir=/etc/prometheus/rules/prometheus-k8s-rulefiles-0
...
Entering the Prometheus container confirms the secret has been mounted:
[root@k8s-master01 manifests]# kubectl exec -it prometheus-k8s-0 -n monitoring sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
/prometheus $
/prometheus $ ls /etc/prometheus/
certs config_out console_libraries consoles prometheus.yml rules secrets web_config
/prometheus $ ls /etc/prometheus/secrets/
etcd-ssl
/prometheus $ ls /etc/prometheus/secrets/etcd-ssl/
ca.crt server.crt server.key
2.3.2 Creating the Etcd ServiceMonitor
Next, create the ServiceMonitor for Etcd:
[root@k8s-master01 prometheus]# cat etcd-servicemonitors.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd
  namespace: monitoring
  labels:
    app: etcd
spec:
  jobLabel: k8s-app
  endpoints:
  - interval: 30s
    port: https-metrics # matches Service.spec.ports.name
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-ssl/ca.crt # certificate paths
      certFile: /etc/prometheus/secrets/etcd-ssl/server.crt
      keyFile: /etc/prometheus/secrets/etcd-ssl/server.key
      insecureSkipVerify: true # skip certificate verification
  selector:
    matchLabels:
      app: etcd-prom # must match the Service labels
  namespaceSelector:
    matchNames:
    - kube-system
[root@k8s-master01 prometheus]# kubectl create -f etcd-servicemonitors.yaml
After creation, the configuration shows up in the Prometheus UI.
Grafana setup is skipped here: find a suitable dashboard template and point it at the data source to get the graphs.
Grafana templates: https://grafana.com/grafana/dashboards/?dataSource=prometheus&search=schedu
2.4 Monitoring non-cloud-native applications (MySQL)
Taking MySQL as the example, the flow splits into in-cluster and out-of-cluster cases:
- If MySQL runs inside the Kubernetes cluster: create a Service for MySQL, deploy a MySQL exporter configured with the MySQL connection details, then create a ServiceMonitor whose selector matches the exporter so Prometheus discovers it automatically.
- If MySQL runs outside the cluster: deploy the MySQL exporter with the external MySQL address in its configuration, then create a ServiceMonitor whose selector matches the exporter.
The demonstration below uses the out-of-cluster case.
2.4.1 Deploying the MySQL exporter
This uses the latest official exporter: https://github.com/prometheus/mysqld_exporter
For exporter versions 0.15.0 and later:
the DATA_SOURCE_NAME environment variable is no longer supported. With these versions you must mount a .my.cnf file into the exporter container, otherwise it fails with:
mysqld_exporter.go:225 level=info msg="Error parsing host config" file=.my.cnf err="no configuration
[root@k8s-master01 prometheus]# cat .my.cnf
[client]
user=exporter
password=exporter
host=192.168.1.185
port=3306
Create a ConfigMap from this file (the Deployment below mounts it as mysqlcnf):
[root@k8s-master01 prometheus]# kubectl create configmap mysqlcnf --from-file=.my.cnf -n monitoring
[root@k8s-master01 prometheus]# cat mysql-exporter.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: mysql-exporter
  name: mysql-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mysql-exporter
  template:
    metadata:
      labels:
        app: mysql-exporter
    spec:
      containers:
      - image: harbor.dujie.com/dycloud/mysqld-exporter
        name: mysqld-exporter
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        ports:
        - containerPort: 9104
        volumeMounts:
        - name: mysql
          mountPath: "/.my.cnf" # mountPath must be absolute; the exporter looks for ~/.my.cnf
          subPath: ".my.cnf"
      volumes:
      - name: mysql
        configMap:
          name: mysqlcnf
---
apiVersion: v1
kind: Service
metadata:
  name: mysql-exporter
  namespace: monitoring
  labels:
    app: mysql-exporter
spec:
  type: ClusterIP
  selector:
    app: mysql-exporter
  ports:
  - name: api
    port: 9104
    protocol: TCP
Once created, verify that the MySQL metrics are reachable through the Service:
[root@k8s-master01 prometheus]# kubectl get svc -n monitoring
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
mysql-exporter ClusterIP 10.96.74.237 <none> 9104/TCP 17h
[root@k8s-master01 prometheus]# curl -s 10.96.74.237:9104/metrics |tail -5
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 4195
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
For exporter versions before 0.15.0:
[root@k8s-master01 prometheus]# cat mysql-exporter.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mysql-exporter
  template:
    metadata:
      labels:
        app: mysql-exporter
    spec:
      containers:
      - name: mysql-exporter
        image: registry.cn-beijing.aliyuncs.com/dotbalo/mysqld-exporter
        env:
        - name: DATA_SOURCE_NAME
          value: "exporter:exporter@(192.168.1.185:3306)/"
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 9104
---
apiVersion: v1
kind: Service
metadata:
  name: mysql-exporter
  namespace: monitoring
  labels:
    app: mysql-exporter
spec:
  type: ClusterIP
  selector:
    app: mysql-exporter
  ports:
  - name: api
    port: 9104
    protocol: TCP
2.4.2 Deploying the MySQL ServiceMonitor
Note that the selector must match the Service above:
[root@k8s-master01 prometheus]# vim mysql-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mysql-exporter
  namespace: monitoring
  labels:
    app: mysql-exporter
spec:
  jobLabel: k8s-app
  endpoints:
  - port: api
    interval: 30s
    scheme: http
  selector:
    matchLabels:
      app: mysql-exporter # must match the Service labels
  namespaceSelector:
    matchNames:
    - monitoring
A short while after creation, the target shows up in the Prometheus UI.
Troubleshooting a failing ServiceMonitor
- Confirm the ServiceMonitor was created successfully:
[root@k8s-master01 prometheus]# kubectl get servicemonitors.monitoring.coreos.com -n monitoring
NAME AGE
alertmanager-main 26h
blackbox-exporter 26h
coredns 26h
etcd 20h
grafana 26h
kube-apiserver 26h
kube-controller-manager 145m
kube-scheduler 96m
kube-state-metrics 26h
kubelet 26h
mysql-exporter 17h
node-exporter 26h
prometheus-adapter 26h
prometheus-k8s 26h
prometheus-operator 26h
- Confirm the ServiceMonitor labels are configured correctly:
[root@k8s-master01 prometheus]# kubectl get servicemonitors.monitoring.coreos.com -n monitoring mysql-exporter -o yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"monitoring.coreos.com/v1","kind":"ServiceMonitor","metadata":{"annotations":{},"labels":{"app":"mysql-exporter"},"name":"mysql-exporter","namespace":"monitoring"},"spec":{"endpoints":[{"interval":"30s","port":"api","scheme":"http"}],"jobLabel":"k8s-app","namespaceSelector":{"matchNames":["monitoring"]},"selector":{"matchLabels":{"app":"mysql-exporter"}}}}
  creationTimestamp: "2024-07-10T09:33:07Z"
  generation: 1
  labels:
    app: mysql-exporter
  name: mysql-exporter
  namespace: monitoring
  resourceVersion: "323737"
  uid: bbfba7ef-a9f7-46ba-b60a-c3e2b1d31ebc
spec:
  endpoints:
  - interval: 30s
    port: api
    scheme: http
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - monitoring
  selector:
    matchLabels:
      app: mysql-exporter
- Confirm that Prometheus generated the corresponding configuration
- Confirm a Service matching the ServiceMonitor exists, and that the ServiceMonitor and Service selector labels agree
- Confirm the application's Metrics endpoint is reachable through the Service (curl the Service IP)
- Confirm the Service port name and scheme match the ServiceMonitor
These must stay consistent. A ServiceMonitor can also reference a port number directly, but matching by name is recommended: if the Service port changes, a name-based ServiceMonitor needs no update.
2.5 Monitoring controller-manager and scheduler
After deploying the operator, controller-manager and scheduler show no metrics. Following the troubleshooting steps above reveals that neither component has a Service deployed by default, so the Services need to be created:
[root@k8s-master01 prometheus]# cat controller-manager-svc.yml
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app: controllermanager-prom
  name: controllermanager-prom
  namespace: kube-system
subsets:
- addresses:
  - ip: 192.168.31.20
  - ip: 192.168.31.21
  - ip: 192.168.31.22
  ports:
  - name: controllermanager-metrics
    port: 10257
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: controllermanager-prom
  name: controllermanager-prom
  namespace: kube-system
spec:
  ports:
  - name: controllermanager-metrics
    port: 10257
    protocol: TCP
    targetPort: 10257
  type: ClusterIP
[root@k8s-master01 prometheus]# cat scheduler-svc.yml
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app: scheduler-prom
  name: scheduler-prom
  namespace: kube-system
subsets:
- addresses:
  - ip: 192.168.31.20
  - ip: 192.168.31.21
  - ip: 192.168.31.22
  ports:
  - name: scheduler-metrics
    port: 10259
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: scheduler-prom
  name: scheduler-prom
  namespace: kube-system
spec:
  ports:
  - name: scheduler-metrics
    port: 10259
    protocol: TCP
    targetPort: 10259
  type: ClusterIP
After creating the Services, requests to the metrics endpoints return 403. Creating a ClusterRole and ClusterRoleBinding grants access:
[root@k8s-master01 prometheus]# cat role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: metrics-reader
rules:
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
[root@k8s-master01 prometheus]# cat rolebind.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: metrics-reader-binding
subjects:
- kind: User
  name: kubernetes
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: metrics-reader
  apiGroup: rbac.authorization.k8s.io
Next, change the bind address of both components: kubeadm defaults it to 127.0.0.1, which Prometheus cannot reach. This change is needed on every master.
Then update the two components' ServiceMonitor selectors to match the Services just created:
[root@k8s-master01 prometheus]# kubectl edit servicemonitors.monitoring.coreos.com -n monitoring kube-controller-manager
...
  selector:
    matchLabels:
      app: controllermanager-prom
After this change, Prometheus discovers their metrics automatically.
2.6 Monitoring domains with Blackbox Exporter
https://github.com/prometheus/blackbox_exporter
Recent versions of the Prometheus stack install Blackbox Exporter by default; verify with:
[root@k8s-master01 manifests]# kubectl get pods -n monitoring -l app.kubernetes.io/name=blackbox-exporter
NAME READY STATUS RESTARTS AGE
blackbox-exporter-597d86cf5c-xd2m7 3/3 Running 0 5d3h
# A Service is also created; Blackbox Exporter can be accessed through it with some query parameters:
[root@k8s-master01 manifests]# kubectl get svc -n monitoring -l app.kubernetes.io/name=blackbox-exporter
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
blackbox-exporter ClusterIP 10.96.145.255 <none> 9115/TCP,19115/TCP 5d3h
For example, to check the status of dycloud.fun (any public domain or an internal company domain will do):
[root@k8s-master01 manifests]# curl -s "http://10.96.145.255:19115/probe?target=dycloud.fun&module=http_2xx" |tail -5
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
probe is the endpoint path, target is the destination being checked, and module selects which probing module to use. If Blackbox Exporter is not configured in the cluster, see https://github.com/prometheus/blackbox_exporter for installation.
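The http_2xx module used above is defined in the Blackbox Exporter configuration (in kube-prometheus it lives in the blackbox-exporter ConfigMap). A sketch of what a custom module could look like; the field names are from the blackbox_exporter config format, while the module name http_2xx_ipv4 is made up for illustration:

```yaml
modules:
  http_2xx_ipv4:
    prober: http
    timeout: 5s
    http:
      preferred_ip_protocol: ip4     # force IPv4 lookups
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []         # empty defaults to 2xx
```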
2.7 Prometheus static configuration
First create an empty file and build a Secret from it; that Secret will serve as Prometheus's static configuration:
[root@k8s-master01 prometheus]# touch prometheus-additional.yaml
[root@k8s-master01 prometheus]# kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
Then reference it in the Prometheus resource:
[root@k8s-master01 prometheus]# kubectl edit prometheus -n monitoring k8s
  additionalScrapeConfigs:
    key: prometheus-additional.yaml
    name: additional-configs
    optional: true
No Prometheus restart is needed after this change. Next, write some static scrape configuration into the prometheus-additional.yaml file:
[root@k8s-master01 prometheus]# cat prometheus-additional.yaml
- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]  # look for an HTTP 200 response
  static_configs:
  - targets:
    - https://dycloud.fun  # target to probe with http
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: blackbox-exporter:19115  # the Blackbox Exporter's real hostname:port
targets: what to probe; params: which module to use; replacement: the Blackbox Exporter address.
The content here is identical to a traditional Prometheus configuration; just add the jobs you need. Then update the Secret from the file:
[root@k8s-master01 prometheus]# kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml --dry-run=client -o yaml | kubectl apply -n monitoring -f -
secret/additional-configs configured
About a minute after the update, the configuration appears in the Prometheus Web UI.
Then import the blackbox monitoring dashboard into Grafana (template id: 13659).
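Building on the probe metrics above, an alert can fire when a probed site stays down. A hedged sketch (the rule name, severity, and timing are illustrative; the prometheus: k8s and role: alert-rules labels match the default kube-prometheus ruleSelector):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: blackbox-rules
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  groups:
  - name: blackbox.rules
    rules:
    - alert: EndpointDown
      expr: probe_success == 0   # blackbox probe failed
      for: 3m
      labels:
        severity: critical
      annotations:
        description: "probe of {{ $labels.instance }} has been failing for 3 minutes"
```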
2.8 Monitoring external Windows hosts with Prometheus
The exporter for Linux hosts is https://github.com/prometheus/node_exporter ; the exporter for Windows hosts is https://github.com/prometheus-community/windows_exporter
First download the exporter to the Windows host (MSI downloads: https://github.com/prometheus-community/windows_exporter/releases ).
Double-click the MSI to install it; the process then appears in Task Manager.
Windows Exporter exposes port 9182, which serves the Windows monitoring data.
Next, add the following to the static configuration file:
- job_name: 'WindowsServerMonitor'
  static_configs:
  - targets:
    - "1.1.1.1:9182"
    labels:
      server_type: 'windows'
  relabel_configs:
  - source_labels: [__address__]
    target_label: instance
targets lists the monitored hosts; for multiple Windows hosts, add one line per host, with each host running its own exporter. The data then shows up in the Prometheus Web UI.
Finally, import the dashboard (https://grafana.com/grafana/dashboards/12566).
3. DingTalk alerting with PrometheusAlert
官网:https://github.com/feiyu563/PrometheusAlert
PrometheusAlert is an open-source alert-forwarding hub for operations. It accepts alerts from the mainstream monitoring systems Prometheus and Zabbix, the log systems Graylog 2 and 3, the visualization system Grafana, SonarQube, Alibaba Cloud CloudMonitor, and any other system that can emit a WebHook, and forwards them to DingTalk, WeChat, email, Feishu, Tencent SMS and voice calls, Alibaba Cloud SMS and voice calls, Huawei SMS, Baidu Cloud SMS, Ronglian voice calls, 7moor SMS and voice, Telegram, Baidu Hi (Infoflow), Kafka, and more.
Project layout:
The PrometheusAlert backend uses the beego framework; the frontend uses the AdminLTE template (based on Bootstrap and jQuery).
├── cmd: scripts
├── conf: configuration
├── controllers: controllers
├── db: default sqlite data
├── doc: documentation
├── docker-entrypoint.sh: container entrypoint
├── Dockerfile
├── example: examples
├── go.mod
├── go.sum
├── LICENSE
├── main.go
├── Makefile
├── models: models
├── PrometheusAlert: the compiled binary
├── PrometheusAlertVoicePlugin
├── README.MD
├── routers: routes
├── static: static assets
├── swagger
├── tests: tests
├── views: frontend templates
└── zabbixclient
Installing on Kubernetes
# On Kubernetes, simply run the command below. (Note: the default manifest does not mount the template database file db/PrometheusAlertDB.db; add a mount yourself to avoid losing template data.)
kubectl apply -n monitoring -f https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/example/kubernetes/PrometheusAlert-Deployment.yaml
[root@k8s-master01 prometheus]# cat PrometheusAlert-Deployment.yaml
# apiVersion: v1
# kind: Namespace
# metadata:
#   name: monitoring
---
apiVersion: v1
data:
  app.conf: |
    #---------------------↓ global configuration -----------------------
    appname = PrometheusAlert
    #login username
    login_user=prometheusalert
    #login password
    login_password=prometheusalert
    #listen address
    httpaddr = "0.0.0.0"
    #listen port
    httpport = 8080
    runmode = dev
    #proxy setting, e.g. proxy = http://123.123.123.123:8080
    proxy =
    #enable JSON request bodies
    copyrequestbody = true
    #alert message title
    title=PrometheusAlert
    #link back to the alerting platform
    GraylogAlerturl=http://graylog.org
    #DingTalk alert logo URL
    logourl=https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png
    #DingTalk recovery logo URL
    rlogourl=https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png
    #SMS alert level (level 3 triggers SMS); levels: 0 info, 1 warning, 2 moderate, 3 severe, 4 disaster
    messagelevel=3
    #phone-call alert level (level 4 triggers a voice call); levels: 0 info, 1 warning, 2 moderate, 3 severe, 4 disaster
    phonecalllevel=4
    #default phone number (required for testing SMS and calls from the UI)
    defaultphone=xxxxxxxx
    #whether a recovery triggers a phone call: 0 off, 1 on
    phonecallresolved=0
    #log output target: file or console
    logtype=file
    #log file path
    logpath=logs/prometheusalertcenter.log
    #convert Prometheus/graylog alert timestamps to the CST timezone (do not enable if already CST)
    prometheus_cst_time=0
    #database driver: sqlite3, mysql or postgres; for mysql/postgres, uncomment db_host, db_port, db_user, db_password, db_name
    db_driver=sqlite3
    #db_host=127.0.0.1
    #db_port=3306
    #db_user=root
    #db_password=root
    #db_name=prometheusalert
    #enable alert records: 0 off, 1 on
    AlertRecord=0
    #enable scheduled deletion of alert records: 0 off, 1 on
    RecordLive=0
    #alert record retention period, in days
    RecordLiveDay=7
    # write alert records to es7: 0 off, 1 on
    alert_to_es=0
    # es addresses, a []string
    # beego.Appconfig.Strings reads []string values separated by ";" rather than ","
    to_es_url=http://localhost:9200
    # to_es_url=http://es1:9200;http://es2:9200;http://es3:9200
    # es user and password
    # to_es_user=username
    # to_es_pwd=password
    # maximum idle keep-alive connections
    maxIdleConns=100
    # hot-reload of the config file
    open-hotreload=0
    #---------------------↓ webhook -----------------------
    #enable the DingTalk channel (multiple channels may be enabled at once): 0 off, 1 on
    open-dingding=1
    #default DingTalk robot URL
    #ddurl=https://oapi.dingtalk.com/robot/send?access_token=cf7a7becfb3889c1d75b31be7fa100e14576d6730a9634752692c098b81a93b1&secret=SEC30cfb8f6cfc8565251a90ec14afbed005c3b84b1a75ccfa16cf1fcf736a4a48f&at=15811047166
    ddurl=https://oapi.dingtalk.com/robot/send?access_token=cf7a7becfb3889c1d75b31be7fa100e14576d6730a9634752692c098b81a93b1
    #@ everyone: 0 off, 1 on
    dd_isatall=1
    #enable DingTalk robot signing: 0 off, 1 on
    # usage: https://oapi.dingtalk.com/robot/send?access_token=XXXXXX&secret=mysecret
    open-dingding-secret=0
    #enable the WeChat channel (multiple channels may be enabled at once): 0 off, 1 on
    open-weixin=1
    #default WeChat Work robot URL
    wxurl=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxx
    #enable the Feishu channel (multiple channels may be enabled at once): 0 off, 1 on
    open-feishu=1
    #default Feishu robot URL
    fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/61132219-e82a-4aae-9186-3664efa9d8c1
    # contentType for webhook http requests, e.g. application/json, application/x-www-form-urlencoded; defaults to application/json
    wh_contenttype=application/json
    #---------------------↓ Tencent Cloud API -----------------------
    #enable Tencent Cloud SMS alerts: 0 off, 1 on
    open-txdx=0
    #Tencent Cloud SMS API key
    TXY_DX_appkey=xxxxx
    #Tencent Cloud SMS template ID; the template can be e.g. "prometheus alert: {1}"
    TXY_DX_tpl_id=xxxxx
    #Tencent Cloud SMS sdk app id
    TXY_DX_sdkappid=xxxxx
    #Tencent Cloud SMS signature, as approved in your account
    TXY_DX_sign=腾讯云
    #enable Tencent Cloud voice-call alerts: 0 off, 1 on
    open-txdh=0
    #Tencent Cloud voice API key
    TXY_DH_phonecallappkey=xxxxx
    #Tencent Cloud voice template ID
    TXY_DH_phonecalltpl_id=xxxxx
    #Tencent Cloud voice sdk app id
    TXY_DH_phonecallsdkappid=xxxxx
    #---------------------↓ Huawei Cloud API -----------------------
    #enable Huawei Cloud SMS alerts: 0 off, 1 on
    open-hwdx=0
    #Huawei Cloud SMS API key
    HWY_DX_APP_Key=xxxxxxxxxxxxxxxxxxxxxx
    #Huawei Cloud SMS API secret
    HWY_DX_APP_Secret=xxxxxxxxxxxxxxxxxxxxxx
    #Huawei Cloud APP access URL (with port)
    HWY_DX_APP_Url=https://rtcsms.cn-north-1.myhuaweicloud.com:10743
    #Huawei Cloud SMS template ID
    HWY_DX_Templateid=xxxxxxxxxxxxxxxxxxxxxx
    #Huawei Cloud signature name; must be approved and match the template type
    HWY_DX_Signature=华为云
    #Huawei Cloud signature channel number
    HWY_DX_Sender=xxxxxxxxxx
    #---------------------↓ Alibaba Cloud API -----------------------
    #enable Alibaba Cloud SMS alerts: 0 off, 1 on
    open-alydx=0
    #Alibaba Cloud SMS primary-account AccessKey ID
    ALY_DX_AccessKeyId=xxxxxxxxxxxxxxxxxxxxxx
    #Alibaba Cloud SMS API secret
    ALY_DX_AccessSecret=xxxxxxxxxxxxxxxxxxxxxx
    #Alibaba Cloud SMS signature name
    ALY_DX_SignName=阿里云
    #Alibaba Cloud SMS template ID
    ALY_DX_Template=xxxxxxxxxxxxxxxxxxxxxx
    #enable Alibaba Cloud voice-call alerts: 0 off, 1 on
    open-alydh=0
    #Alibaba Cloud voice primary-account AccessKey ID
    ALY_DH_AccessKeyId=xxxxxxxxxxxxxxxxxxxxxx
    #Alibaba Cloud voice API secret
    ALY_DH_AccessSecret=xxxxxxxxxxxxxxxxxxxxxx
    #Alibaba Cloud caller display number; must be a purchased number
    ALY_DX_CalledShowNumber=xxxxxxxxx
    #Alibaba Cloud text-to-speech (TTS) template ID
    ALY_DH_TtsCode=xxxxxxxx
    #---------------------↓ Ronglian Cloud API -----------------------
    #enable Ronglian Cloud voice-call alerts: 0 off, 1 on
    open-rlydh=0
    #Ronglian Cloud base API URL
    RLY_URL=https://app.cloopen.com:8883/2013-12-26/Accounts/
    #Ronglian Cloud account SID
    RLY_ACCOUNT_SID=xxxxxxxxxxx
    #Ronglian Cloud api-token
    RLY_ACCOUNT_TOKEN=xxxxxxxxxx
    #Ronglian Cloud app_id
    RLY_APP_ID=xxxxxxxxxxxxx
    #---------------------↓ email -----------------------
    #enable email
    open-email=0
    #SMTP server address
    Email_host=smtp.qq.com
    #SMTP server port
    Email_port=465
    #email account
    Email_user=xxxxxxx@qq.com
    #email password
    Email_password=xxxxxx
    #email subject
    Email_title=Ops alert
    #default recipients
    Default_emails=xxxxx@qq.com,xxxxx@qq.com
    #---------------------↓ 7moor API -----------------------
    #enable 7moor SMS alerts: 0 off, 1 on
    open-7moordx=0
    #7moor account ID
    7MOOR_ACCOUNT_ID=Nxxx
    #7moor account APISecret
    7MOOR_ACCOUNT_APISECRET=xxx
    #7moor SMS template number
    7MOOR_DX_TEMPLATENUM=n
    #note: only the single variable var1 is used for 7moor SMS; it is hard-coded
    #-----------
    #enable 7moor webcall voice alerts: 0 off, 1 on
    open-7moordh=0
    #add a virtual service number and a text node on the 7moor platform first
    #7moor webcall virtual service number
    7MOOR_WEBCALL_SERVICENO=xxx
    # the variable replaced in the text node; configured as "text" here, change if yours differs
    7MOOR_WEBCALL_VOICE_VAR=text
    #---------------------↓ telegram API -----------------------
    #enable the telegram channel: 0 off, 1 on
    open-tg=0
    #telegram bot token
    TG_TOKEN=xxxxx
    #telegram message mode: 0 direct message to a user, 1 post to a channel
    TG_MODE_CHAN=0
    #telegram user ID
    TG_USERID=xxxxx
    #telegram channel name or id; names must start with @
    TG_CHANNAME=xxxxx
    #telegram api address, may be set to a proxy
    #TG_API_PROXY="https://api.telegram.org/bot%s/%s"
    #---------------------↓ WeChat Work app API -----------------------
    #enable the WeChat Work app channel: 0 off, 1 on
    open-workwechat=0
    # corp ID
    WorkWechat_CropID=xxxxx
    # agent ID
    WorkWechat_AgentID=xxxx
    # agent secret
    WorkWechat_AgentSecret=xxxx
    # recipient users
    WorkWechat_ToUser="zhangsan|lisi"
    # recipient departments
    WorkWechat_ToParty="ops|dev"
    # recipient tags
    WorkWechat_ToTag=""
    # message type; only markdown is supported for now
    # WorkWechat_Msgtype = "markdown"
    #---------------------↓ Baidu Cloud API -----------------------
    #enable Baidu Cloud SMS alerts: 0 off, 1 on
    open-baidudx=0
    #Baidu Cloud SMS AK (ACCESS_KEY_ID)
    BDY_DX_AK=xxxxx
    #Baidu Cloud SMS SK (SECRET_ACCESS_KEY)
    BDY_DX_SK=xxxxx
    #Baidu Cloud SMS ENDPOINT (a region-specific domain; for Beijing it is the following)
    BDY_DX_ENDPOINT=http://smsv3.bj.baidubce.com
    #Baidu Cloud SMS template ID, as approved (the template takes one parameter code, e.g. "prometheus alert: {code}")
    BDY_DX_TEMPLATE_ID=xxxxx
    #Baidu Cloud SMS signature ID, as approved
    TXY_DX_SIGNATURE_ID=xxxxx
    #---------------------↓ Baidu Hi (Infoflow) -----------------------
    #enable the Baidu Hi (Infoflow) channel: 0 off, 1 on
    open-ruliu=0
    #default Baidu Hi (Infoflow) robot URL
    BDRL_URL=https://api.im.baidu.com/api/msg/groupmsgsend?access_token=xxxxxxxxxxxxxx
    #Baidu Hi (Infoflow) group ID
    BDRL_ID=123456
    #---------------------↓ bark API -----------------------
    #enable the bark channel: 0 off, 1 on
    open-bark=0
    #default bark address; self-hosting bark-server is recommended
    BARK_URL=https://api.day.app
    #bark keys, separated when there are several
    BARK_KEYS=xxxxx
    # copy to clipboard, recommended on
    BARK_COPY=1
    # keep history, recommended on
    BARK_ARCHIVE=1
    # message group
    BARK_GROUP=PrometheusAlert
    #---------------------↓ voice broadcast -----------------------
    #voice broadcast requires the voice broadcast plugin
    #enable the voice broadcast channel: 0 off, 1 on
    open-voice=1
    VOICE_IP=127.0.0.1
    VOICE_PORT=9999
    #---------------------↓ Feishu robot app -----------------------
    #enable the feishuapp channel: 0 off, 1 on
    open-feishuapp=1
    # APPID
    FEISHU_APPID=cli_xxxxxxxxxxxxx
    # APPSECRET
    FEISHU_APPSECRET=xxxxxxxxxxxxxxxxxxxxxx
    # a Feishu user open_id, user_id, union_ids, or a department open_department_id
    AT_USER_ID="xxxxxxxx"
    #---------------------↓ alert groups -----------------------
    # put any newly added config sections above the alert-group section
    # currently only applies to the /prometheus/alert route in PrometheusContronller
    # if the alert-group section sits above the wx, dd... sections, the beego section url values come out wrong,
    # so use include to pull in a separate alert-group config
    # enable the alert-group feature
    open-alertgroup=0
    # custom alert groups can be written here or in a separate file;
    # a separate alert-group config file is easier to maintain
    # include "alertgroup.conf"
    #---------------------↓ kafka -----------------------
    # kafka server address
    open-kafka=1
    kafka_server = 127.0.0.1:9092
    # kafka topic to write messages to
    kafka_topic = devops
    # marks messages as coming from PrometheusAlert; normally no need to change
    kafka_key = PrometheusAlert
  user.csv: |
    2019年4月10日,15888888881,小张,15999999999,备用联系人小陈,15999999998,备用联系人小赵
    2019年4月11日,15888888882,小李,15999999999,备用联系人小陈,15999999998,备用联系人小赵
    2019年4月12日,15888888883,小王,15999999999,备用联系人小陈,15999999998,备用联系人小赵
    2019年4月13日,15888888884,小宋,15999999999,备用联系人小陈,15999999998,备用联系人小赵
kind: ConfigMap
metadata:
  name: prometheus-alert-center-conf
  namespace: monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: prometheus-alert-center
alertname: prometheus-alert-center
name: prometheus-alert-center
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: prometheus-alert-center
alertname: prometheus-alert-center
template:
metadata:
labels:
app: prometheus-alert-center
alertname: prometheus-alert-center
spec:
initContainers:
- name: init-time-sync
image: harbor.dujie.com/dycloud/ntpdate
command:
- sh
- -c
- "ntpdate ntp.aliyun.com"
securityContext:
privileged: true
containers:
- image: harbor.dujie.com/dycloud/prometheus-alert:v4.9.1
name: prometheus-alert-center
env:
- name: TZ
value: "Asia/Shanghai"
ports:
- containerPort: 8080
name: http
resources:
limits:
cpu: 200m
memory: 200Mi
requests:
cpu: 100m
memory: 100Mi
volumeMounts:
- name: prometheus-alert-center-conf-map
mountPath: /app/conf/app.conf
subPath: app.conf
- name: prometheus-alert-center-conf-map
mountPath: /app/user.csv
subPath: user.csv
- name: prometheus-alert-volume
mountPath: /app/db
volumes:
- name: prometheus-alert-volume
ephemeral:
volumeClaimTemplate:
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: "rook-ceph-block"
resources:
requests:
storage: 1Gi
- name: prometheus-alert-center-conf-map
configMap:
name: prometheus-alert-center-conf
items:
- key: app.conf
path: app.conf
- key: user.csv
path: user.csv
---
apiVersion: v1
kind: Service
metadata:
labels:
alertname: prometheus-alert-center
name: prometheus-alert-center
namespace: monitoring
annotations:
prometheus.io/scrape: 'true'
prometheus.io/port: '8080'
spec:
ports:
- name: http
port: 8080
targetPort: http
selector:
app: prometheus-alert-center
type: NodePort
---
# apiVersion: networking.k8s.io/v1beta1
# kind: Ingress
# metadata:
# annotations:
# kubernetes.io/ingress.class: nginx
# name: prometheus-alert-center
# namespace: monitoring
# spec:
# rules:
# - host: alert-center.local
# http:
# paths:
# - backend:
# serviceName: prometheus-alert-center
# servicePort: 8080
# path: /
#After startup, open http://[YOUR-PrometheusAlert-URL]:8080 in a browser
#The default login account and password are configured in app.conf
Open DingTalk, go into your DingTalk group, and choose Group Settings -> Smart Group Assistant -> Add Robot -> Custom, as shown below:
Newer DingTalk versions added security settings; just pick Custom Keywords under security settings and set the keyword to Prometheus, or to the title value configured in app.conf, as shown below
Copy the Webhook address from the screenshot and fill it into the corresponding item in the PrometheusAlert config file app.conf.
PS: the DingTalk robot now supports @someone; to use it you need the phone number bound to that user's DingTalk account, as shown below:
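Before wiring Alertmanager in, you can sanity-check the robot directly. A minimal sketch (the webhook URL is a placeholder; the markdown title deliberately contains the keyword "PrometheusAlert" so the robot's custom-keyword security check passes):

```python
import json

# Placeholder -- replace with the Webhook address copied from the robot's settings.
WEBHOOK = "https://oapi.dingtalk.com/robot/send?access_token=xxxxx"

def build_test_message(title="PrometheusAlert", text="test message"):
    """Build a DingTalk markdown payload.

    The title must contain the custom keyword configured in the robot's
    security settings (here: the app.conf title value), otherwise DingTalk
    rejects the message.
    """
    return {
        "msgtype": "markdown",
        "markdown": {"title": title, "text": f"## {title}\n{text}"},
    }

payload = json.dumps(build_test_message(), ensure_ascii=False)
# POST `payload` to WEBHOOK with Content-Type: application/json
# (e.g. via curl or urllib.request) to verify the robot works.
```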
DingTalk-related configuration
**PrometheusAlert-Deployment.yaml**
#---------------------↓Global-----------------------
#Alert message title
title=PrometheusAlert
#DingTalk alert logo URL
logourl=https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png
#DingTalk recovery logo URL
rlogourl=https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png
#---------------------↓webhook-----------------------
#Enable the DingTalk alert channel; multiple channels can be enabled at once; 0 = off, 1 = on
open-dingding=1
#Default DingTalk robot address
ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxx
#Whether to @everyone (0 = off, 1 = on)
dd_isatall=1
#Enable DingTalk robot request signing; 0 = off, 1 = on
# Usage: https://oapi.dingtalk.com/robot/send?access_token=XXXXXX&secret=mysecret
open-dingding-secret=0
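When signing is enabled, the robot expects every request URL to carry a timestamp and an HMAC-SHA256 signature. A sketch of DingTalk's documented signing scheme (the webhook and secret values below are placeholders):

```python
import base64
import hashlib
import hmac
import time
import urllib.parse

def sign_dingtalk_url(webhook: str, secret: str) -> str:
    """Append timestamp and sign parameters per DingTalk's signing scheme:
    sign = urlencode(base64(HMAC-SHA256(secret, "{timestamp}\n{secret}")))."""
    timestamp = str(round(time.time() * 1000))  # milliseconds since epoch
    string_to_sign = f"{timestamp}\n{secret}".encode()
    digest = hmac.new(secret.encode(), string_to_sign, hashlib.sha256).digest()
    sign = urllib.parse.quote_plus(base64.b64encode(digest))
    return f"{webhook}&timestamp={timestamp}&sign={sign}"

# e.g. sign_dingtalk_url("https://oapi.dingtalk.com/robot/send?access_token=xxx", "SECxxxx")
```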
After making these changes, apply the file, then modify alertmanager-secret.yaml.
First check the prometheusalert svc address and port
[root@k8s-master01 manifests]# kubectl get svc -n monitoring
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
prometheus-alert-center NodePort 10.96.167.90 <none> 8080:30621/TCP 3d20h
[root@k8s-master01 manifests]# cat alertmanager-secret.yaml
apiVersion: v1
kind: Secret
metadata:
labels:
app.kubernetes.io/component: alert-router
app.kubernetes.io/instance: main
app.kubernetes.io/name: alertmanager
app.kubernetes.io/part-of: kube-prometheus
app.kubernetes.io/version: 0.27.0
name: alertmanager-main
namespace: monitoring
stringData:
alertmanager.yaml: |-
"global":
"resolve_timeout": "5m"
smtp_from: "15811047166@163.com"
smtp_smarthost: "smtp.163.com:465"
smtp_hello: "163.com"
smtp_auth_username: "15811047166@163.com"
smtp_auth_password: "KWLDKEYTJSMWYDRV"
smtp_require_tls: false
"inhibit_rules":
- "equal":
- "alertname"
"source_matchers":
- "severity = critical"
"target_matchers":
- "severity =~ warning|info"
- "equal":
- "alertname"
"source_matchers":
- "severity = warning"
"target_matchers":
- "severity = info"
- "equal":
- "namespace"
"source_matchers":
- "alertname = InfoInhibitor"
"target_matchers":
- "severity = info"
"receivers":
- "name": "Default"
"email_configs":
- "to": "15811047166@163.com"
"send_resolved": true
- "name": "webhook"
webhook_configs: # Fill in the prometheusalert svc address and port here; the rest is a fixed format: type is the channel (fs = Feishu, dd = DingTalk), tpl is the template name, and the url at the end is the Feishu robot webhook
- url: "http://10.96.167.90:8080/prometheusalert?type=fs&tpl=prometheus-fs&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxx"
- "name": "Watchdog"
- "name": "dingding"
webhook_configs: # Fill in the prometheusalert svc address and port here; the rest is a fixed format: type is the channel (fs = Feishu, dd = DingTalk), tpl is the template name, and the url at the end is the DingTalk robot webhook
- url: "http://10.96.167.90:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxx"
- "name": "Critical"
- "name": "null"
"route":
"group_by":
- "alertname"
"group_interval": "2m"
"group_wait": "10s"
"receiver": "Default"
"repeat_interval": "2m"
"routes":
- "matchers":
- "alertname = Watchdog"
"receiver": "Watchdog"
- "matchers":
- "alertname = InfoInhibitor"
"receiver": "null"
- "matchers":
- "severity = critical"
"receiver": "dingding"
- "matchers":
- "severity = warning"
"receiver": "dingding"
type: Opaque
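One subtlety in the webhook_configs urls above: the ddurl/fsurl value is itself a URL with a query string, and it is passed through unencoded. PrometheusAlert tolerates this, but a safer construction percent-encodes it. A sketch (the service address comes from the kubectl get svc output above; the other values are placeholders):

```python
from urllib.parse import urlencode

def prometheusalert_url(svc: str, channel: str, tpl: str,
                        target_key: str, target_url: str) -> str:
    """Build the Alertmanager webhook_configs url for PrometheusAlert.

    channel: fs = Feishu, dd = DingTalk; tpl: template name;
    target_key: fsurl or ddurl depending on the channel.
    urlencode() percent-encodes the nested robot URL so its own
    query string cannot be confused with the outer one.
    """
    query = urlencode({"type": channel, "tpl": tpl, target_key: target_url})
    return f"http://{svc}/prometheusalert?{query}"

print(prometheusalert_url(
    "10.96.167.90:8080", "dd", "prometheus-dd", "ddurl",
    "https://oapi.dingtalk.com/robot/send?access_token=xxxxx"))
```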
After the changes, check the alertmanager and prometheusalert logs
[root@k8s-master01 manifests]# kubectl logs -f -n monitoring alertmanager-main-0
ts=2024-07-15T02:25:59.706Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2024-07-15T02:25:59.707Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
This log output means the configuration loaded successfully.
You can then go to the PrometheusAlert page
and add a custom template; the one below works well and looks good:
{{ $var := .externalURL}}{{ $status := .status}}{{ range $k,$v:=.alerts }} {{if eq $status "resolved"}}
## [Alert Resolved - Notification]({{$var}})
#### Metric: {{$v.labels.alertname}}
{{ if eq $v.labels.severity "warning" }}
#### Severity: **<font color="#E6A23C">{{$v.labels.severity}}</font>**
{{ else if eq $v.labels.severity "critical" }}
#### Severity: **<font color="#F56C6C">{{$v.labels.severity}}</font>**
{{ end }}
#### Status: **<font color="#67C23A" size=4>Resolved</font>**
#### Host: {{$v.labels.instance}}
* ###### Threshold: {{$v.labels.threshold}}
* ###### Started: {{GetCSTtime $v.startsAt}}
* ###### Resolved: {{GetCSTtime $v.endsAt}}
#### Recovery: <font color="#67C23A">Resolved. {{$v.annotations.description}}</font>
{{ else }}
## [Monitoring Alert - Notification]({{$var}})
#### Metric: {{$v.labels.alertname}}
{{ if eq $v.labels.severity "warning" }}
#### Severity: **<font color="#E6A23C" size=4>{{$v.labels.severity}}</font>**
#### Status: **<font color="#E6A23C">Action required</font>**
{{ else if eq $v.labels.severity "critical" }}
#### Severity: **<font color="#F56C6C" size=4>{{$v.labels.severity}}</font>**
#### Status: **<font color="#F56C6C">Action required</font>**
{{ end }}
#### Host: {{$v.labels.instance}}
* ###### Threshold: {{$v.labels.threshold}}
* ###### Duration: {{$v.labels.for_time}}
* ###### Triggered: {{GetCSTtime $v.startsAt}}
{{ if eq $v.labels.severity "warning" }}
#### Alert: <font color="#E6A23C">{{$v.annotations.description}}</font>
{{ else if eq $v.labels.severity "critical" }}
#### Alert: <font color="#F56C6C">{{$v.annotations.description}}</font>
{{ end }}
{{ end }}
{{ end }}
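To exercise a template without waiting for a real alert, you can also POST a hand-built Alertmanager-style payload to the /prometheusalert route yourself. A minimal sample payload covering the fields the template reads (all values are made up):

```python
import json

# Minimal Alertmanager webhook-format payload matching the template's fields:
# status, externalURL, labels.{alertname,severity,instance,threshold},
# annotations.description, startsAt / endsAt.
sample = {
    "status": "firing",
    "externalURL": "http://alertmanager.example:9093",
    "alerts": [{
        "status": "firing",
        "labels": {
            "alertname": "TestAlert",
            "severity": "warning",
            "instance": "10.0.0.1:9100",
            "threshold": "80%",
        },
        "annotations": {"description": "test alert for template rendering"},
        "startsAt": "2024-07-15T03:48:13Z",
        "endsAt": "0001-01-01T00:00:00Z",
    }],
}
body = json.dumps(sample)
# POST `body` to
# http://<svc>:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=...
# with Content-Type: application/json.
```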
Fill in the DingTalk robot address, click Save Template, then click Template Test.
Check the logs
2024/07/15 03:48:13.583 [D] [value.go:586] [1721015293583397817] sss
2024/07/15 03:48:13.583 [D] [server.go:2936] | 172.16.32.128| 200 | 456.082µs| match| POST /prometheusalert r:/prometheusalert
If there are no errors, the test succeeded.
Now verify that a monitoring metric can actually trigger an alert; I'll use kubelet node status as the demo. First list all the rule files
[root@k8s-master01 manifests]# pwd
/root/yaml/prometheus/kube-prometheus-main/manifests
[root@k8s-master01 manifests]# ll *Rule*
-rw-r--r-- 1 root root 6979 7月 9 01:29 alertmanager-prometheusRule.yaml
-rw-r--r-- 1 root root 1418 7月 9 01:29 grafana-prometheusRule.yaml
-rw-r--r-- 1 root root 4301 7月 9 01:29 kubePrometheus-prometheusRule.yaml
-rw-r--r-- 1 root root 73239 7月 9 01:29 kubernetesControlPlane-prometheusRule.yaml
-rw-r--r-- 1 root root 3830 7月 12 16:39 kubeStateMetrics-prometheusRule.yaml
-rw-r--r-- 1 root root 19720 7月 9 01:29 nodeExporter-prometheusRule.yaml
-rw-r--r-- 1 root root 6591 7月 9 01:29 prometheusOperator-prometheusRule.yaml
-rw-r--r-- 1 root root 17256 7月 9 01:29 prometheus-prometheusRule.yaml
# Or query them this way
[root@k8s-master01 prometheus]# kubectl get prometheusrule -n monitoring
NAME AGE
alertmanager-main-rules 5d3h
grafana-rules 5d3h
kube-prometheus-rules 5d3h
kube-state-metrics-rules 2d20h
kubernetes-monitoring-rules 5d3h
node-exporter-rules 5d3h
prometheus-k8s-prometheus-rules 5d3h
prometheus-operator-rules 5d3h
labels: the alert's labels, used for alert routing
Then pick a file by name and add the rules you need. I added mine to kubeStateMetrics-prometheusRule.yaml
(the choice isn't fixed; any file works, though ideally the metric should correspond to the file name). Append at the end of the file:
[root@k8s-master01 manifests]# vim kubeStateMetrics-prometheusRule.yaml
- alert: K8S Node节点状态NotReady
expr: kube_node_status_condition{condition="Ready",status="true"} == 0
for: 1m
labels:
severity: critical
level: "high"
annotations:
summary: K8s node status abnormal (instance {{ $labels.instance }})
description: "Node {{ $labels.node }} is NotReady, please handle it as soon as possible! \n "
owner: "Ops Engineer Du Jie"
panelURL: "http://192.168.31.21:32159/alerts?search="
alertURL: "http://192.168.31.20:30336/d/3138fa155d5915769fbded898ac09fd9/kubernetes-kubelet?orgId=1&refresh=10s&var-datasource=default&var-cluster=&var-instance=All"
[root@k8s-master01 manifests]# kubectl apply -f kubeStateMetrics-prometheusRule.yaml
alert: the name of the alerting rule
annotations: annotation info, usually used for the alert message
expr: the alerting expression
for: evaluation wait time; the expression must stay true this long before the alert fires
All of these values can be referenced in templates.
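As a quick pre-apply check of those fields, a small sketch (illustration only; promtool and the Operator's validation are authoritative, and the rule dict below is a hypothetical example):

```python
# Fields this workflow relies on (`for` is technically optional in Prometheus,
# but the rules in this guide always set it).
REQUIRED = ("alert", "expr", "for", "labels", "annotations")

def check_rule(rule: dict) -> list:
    """Return a list of problems found in an alerting-rule dict."""
    problems = [f"missing field: {f}" for f in REQUIRED if f not in rule]
    if "labels" in rule and "severity" not in rule.get("labels", {}):
        # The Alertmanager routes above match on the severity label.
        problems.append("labels.severity missing (routing matches on it)")
    return problems

rule = {
    "alert": "K8sNodeNotReady",  # illustrative name
    "expr": 'kube_node_status_condition{condition="Ready",status="true"} == 0',
    "for": "1m",
    "labels": {"severity": "critical"},
    "annotations": {"summary": "node not ready"},
}
assert check_rule(rule) == []
```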
After applying the update you can watch the logs again; once the new config is loaded you can start testing
[root@k8s-master01 manifests]# kubectl logs -f -n monitoring alertmanager-main-0
ts=2024-07-15T02:25:59.706Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2024-07-15T02:25:59.707Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
This log output means the configuration loaded successfully.
Alert test
[root@k8s-master01 ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master01 Ready control-plane 5d23h v1.30.2
k8s-master02 Ready control-plane 5d23h v1.30.2
k8s-master03 Ready control-plane 5d23h v1.30.2
k8s-node01 Ready <none> 5d23h v1.30.2
k8s-node02 Ready <none> 5d23h v1.30.2
k8s-node03 Ready <none> 5d23h v1.30.2
k8s-node04 Ready <none> 5d23h v1.30.2
k8s-node05 Ready <none> 5d23h v1.30.2
k8s-node06 Ready <none> 5d23h v1.30.2
# Stop kubelet on one node; I stop it on node06, then wait a moment for the status to become NotReady
[root@k8s-node06 ~]# systemctl stop kubelet
[root@k8s-master01 ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master01 Ready control-plane 5d23h v1.30.2
k8s-master02 Ready control-plane 5d23h v1.30.2
k8s-master03 Ready control-plane 5d23h v1.30.2
k8s-node01 Ready <none> 5d23h v1.30.2
k8s-node02 Ready <none> 5d23h v1.30.2
k8s-node03 Ready <none> 5d23h v1.30.2
k8s-node04 Ready <none> 5d23h v1.30.2
k8s-node05 Ready <none> 5d23h v1.30.2
k8s-node06 NotReady <none> 5d23h v1.30.2
After a short wait, DingTalk receives the alert. The timing depends on your Alertmanager settings and the rule's for value; I set for to 1 minute above, meaning the expression must stay true for a full minute before the alert fires, which prevents alerts from brief transient issues.
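The pending-to-firing behavior of for can be illustrated with a tiny simulation (assuming one rule evaluation per 30-second interval, so a 1-minute for corresponds to two consecutive true evaluations):

```python
def is_firing(true_samples, for_intervals):
    """true_samples: expression truth value at each evaluation interval.
    The alert fires only after `for_intervals` consecutive true evaluations,
    mirroring how `for` keeps an alert pending until the condition has held
    for the whole window."""
    streak = 0
    for ok in true_samples:
        streak = streak + 1 if ok else 0
        if streak >= for_intervals:
            return True
    return False

# A brief blip does not fire a 1m rule:
assert is_firing([True, False, True], 2) is False
# A full minute of the condition holding does:
assert is_firing([True, True, True], 2) is True
```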
四、Using PrometheusAlert for Feishu alerts
The setup mirrors DingTalk, so only the changes are shown here.
4.1 Modify the PrometheusAlert configuration
Fill in fsurl with your own Feishu robot webhook address
[root@k8s-master01 prometheus]# vim PrometheusAlert-Deployment.yaml
#Enable the Feishu alert channel; multiple channels can be enabled at once; 0 = off, 1 = on
open-feishu=1
#Default Feishu robot address
fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/61xxxxx
# contentType for webhook HTTP requests, e.g. application/json or application/x-www-form-urlencoded; defaults to application/json
wh_contenttype=application/json
4.2 Modify the Alertmanager configuration
[root@k8s-master01 manifests]# vim alertmanager-secret.yaml
...
"receivers":
- "name": "Default"
"email_configs":
- "to": "15811047166@163.com"
"send_resolved": true
- "name": "webhook"
webhook_configs:
- url: "http://10.96.167.90:8080/prometheusalert?type=fs&tpl=prometheus-fs&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/61132219-e82a-4aae-9186-3664efa9d8c1"
...
"route":
"group_by":
- "alertname"
"group_interval": "2m"
"group_wait": "10s"
"receiver": "Default"
"repeat_interval": "2m"
"routes":
- "matchers":
- "alertname = Watchdog"
"receiver": "Watchdog"
- "matchers":
- "alertname = InfoInhibitor"
"receiver": "null"
- "matchers":
- "severity = critical"
"receiver": "webhook"
...
Restart both components and alerts will be delivered.
五、Domain access latency alerts
Suppose you need to monitor domain access latency and alert when it exceeds 0.01 seconds (0.01 is only used here for testing; in production set a threshold that fits your service). Create a PrometheusRule like this:
[root@k8s-master01 manifests]# cat blackbox.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
labels:
app.kubernetes.io/component: exporter
app.kubernetes.io/name: blackbox-exporter
prometheus: k8s
role: alert-rules
name: blackbox
namespace: monitoring
spec:
groups:
- name: blackbox-exporter
rules:
- alert: DomainAccessDelayExceeds1s
annotations:
description: Domain {{ $labels.instance }} probe latency exceeds 0.01 seconds; current latency {{ $value }}
summary: Domain probe access latency over 0.01 seconds
expr: sum(probe_http_duration_seconds{job=~"blackbox"}) by (instance) > 0.01
for: 1m
labels:
severity: warning
type: blackbox
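For reference, probe_http_duration_seconds is split by a phase label (resolve, connect, tls, processing, transfer), so sum(...) by (instance) in the expression above yields the total request latency per probed domain. A rough Python equivalent with made-up sample values:

```python
from collections import defaultdict

# (instance, phase, seconds) -- made-up scrape samples for one probe target.
samples = [
    ("https://example.com", "resolve", 0.002),
    ("https://example.com", "connect", 0.004),
    ("https://example.com", "processing", 0.006),
]

# Equivalent of: sum(probe_http_duration_seconds{job=~"blackbox"}) by (instance)
total = defaultdict(float)
for instance, _phase, seconds in samples:
    total[instance] += seconds

threshold = 0.01
alerting = [i for i, v in total.items() if v > threshold]
```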
Create and view the PrometheusRule:
[root@k8s-master01 manifests]# kubectl create -f blackbox.yaml
prometheusrule.monitoring.coreos.com/blackbox created
[root@k8s-master01 manifests]# kubectl get -f blackbox.yaml
NAME AGE
blackbox 65s
The rule also appears in the Prometheus web UI:
And the DingTalk group configured earlier receives the alert.