Prometheus Operator

This article covers deploying prometheus-operator, collecting metrics, persisting data, and using PrometheusAlert + Alertmanager to deliver alerts to Feishu, DingTalk, email, and other channels.

1. Introduction to Prometheus

Prometheus is an open-source system monitoring and alerting framework, and is itself a time series database (TSDB).

1.1 Prometheus features
  • A multi-dimensional data model with time series identified by metric name and key/value pairs
  • PromQL for querying and aggregating data, allowing very flexible retrieval
  • No dependency on external data stores: Prometheus is itself a time series database, offering local storage as well as distributed/remote storage integrations, and each Prometheus server is autonomous
  • Applications expose a metrics endpoint; Prometheus collects data over an HTTP-based pull model, and short-lived jobs can push data through the Pushgateway
  • Both dynamic service discovery and static configuration of targets are supported
  • Multiple graphing and dashboard options, pairing perfectly with Grafana
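As a concrete picture of the pull model above, here is a small Python sketch (not the official client library) of the text exposition format that an application's /metrics endpoint returns for Prometheus to scrape:

```python
# Minimal sketch of the Prometheus text exposition format.
def render_metric(name, help_text, mtype, samples):
    """samples: list of (label_dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {mtype}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines)

print(render_metric(
    "http_requests_total", "Total HTTP requests.", "counter",
    [({"code": "200"}, 4728), ({"code": "500"}, 0)],
))
```

This is the same format the curl outputs later in this article show (`# HELP`, `# TYPE`, then one sample per line).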
1.2 Prometheus architecture
  1. Prometheus Server

The core component, mainly responsible for:

  • Scraping: Prometheus periodically pulls metrics over HTTP from configured targets (applications, servers, databases, and so on). These targets expose metrics in the Prometheus format, usually through "exporters".
  • Storage: scraped metrics are stored in a time series database on local disk. Prometheus uses an efficient storage engine, TSDB (Time Series Database), that can handle large volumes of time series data.
  • Querying: Prometheus provides a powerful query language, PromQL (Prometheus Query Language), that lets users query and analyze the stored time series data in real time.
  2. Exporters: used to collect monitoring data. For example, host metrics can be collected with node_exporter and MySQL metrics with mysqld_exporter. An exporter exposes an endpoint such as /metrics, from which Prometheus scrapes the data. Common exporters:
  • Node Exporter: host-level metrics such as CPU, memory, and disk
  • Blackbox Exporter: probes the availability of network services such as HTTP, HTTPS, and DNS
  • MySQL Exporter: MySQL database metrics
  • Redis, MongoDB, ...

https://prometheus.io/docs/instrumenting/exporters/

  3. Pushgateway: an optional component that accepts metrics pushed by short-lived jobs (such as batch jobs). Because these jobs may not live long enough to be scraped directly, they push their metrics to the Pushgateway, which Prometheus then scrapes periodically.
  4. Alertmanager: a standalone component that handles alerts sent by Prometheus. It supports:
  • Grouping: bundle similar alerts together to reduce the number of notifications
  • Inhibition: suppress notifications for some alerts while certain other alerts are firing
  • Silencing: mute notifications for selected alerts during a given time window
  • Notification: deliver alerts through various channels (email, WeChat, Feishu, and so on)
  5. Web UI: Prometheus ships with a simple web UI for running queries and inspecting metrics. It also integrates with third-party visualization tools such as Grafana for richer dashboards.
  6. Service Discovery: Prometheus supports several service discovery mechanisms for finding scrape targets automatically. Common ones include:
  • Static configuration: targets listed by hand in the Prometheus config file
  • DNS service discovery: targets discovered from DNS records
  • File-based discovery: targets read from configuration files
  • Consul service discovery
  • Kubernetes service discovery: Pods and Services discovered through the Kubernetes API
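To make the difference concrete, here is a minimal scrape_configs fragment combining static targets with Kubernetes endpoints discovery (job names and target addresses are illustrative):

```yaml
scrape_configs:
  # static configuration: targets listed by hand
  - job_name: 'static-nodes'
    static_configs:
      - targets: ['192.168.31.20:9100', '192.168.31.21:9100']
  # Kubernetes service discovery: endpoints looked up through the API server
  - job_name: 'kubernetes-endpoints'
    kubernetes_sd_configs:
      - role: endpoints
```

With the Operator-based deployment used below, such configuration is generated from ServiceMonitor resources rather than written by hand.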

The architecture diagram:

image
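Since PromQL comes up repeatedly below, here is a simplified sketch of what its most common function, rate(), computes over a counter's samples (the real implementation additionally corrects for counter resets and extrapolates to the window boundaries):

```python
def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first.
    Per-second increase over the window; a simplified model of PromQL rate()."""
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    return (vn - v0) / (tn - t0)

# A counter scraped every 30s that grows by 100 per scrape
# increases at 100/30 ≈ 3.33 per second.
print(simple_rate([(0, 100), (30, 200), (60, 300), (90, 400)]))
```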

2. Deploying Prometheus on Kubernetes

Prometheus Operator uses Kubernetes custom resource definitions (CRDs) to orchestrate Prometheus, Alertmanager, and other monitoring resources. The currently supported CRDs are:

  • Prometheus: defines how to deploy Prometheus
  • Alertmanager: defines how to deploy Alertmanager
  • ServiceMonitor: declares how to discover scrape targets through Services

Example: ./nodeExporter-serviceMonitor.yaml

  • PodMonitor: declaratively specifies how to monitor a group of Kubernetes pods
  • Probe: specifies static probe configuration

Example: ./setup/0alertmanagerCustomResourceDefinition.yaml

  • ScrapeConfig: specifies scrape configuration to append to Prometheus; this CRD helps scrape resources outside the Kubernetes cluster
  • PrometheusRule: defines Prometheus alerting rules

Example: ./nodeExporter-prometheusRule.yaml

  • AlertmanagerConfig: specifies Alertmanager configuration, allowing alerts to be routed to custom receivers and inhibition rules to be set

Example: ./alertmanager-prometheusRule.yaml

2.1 Installation

kube-prometheus project: https://github.com/prometheus-operator/kube-prometheus/

First, use the compatibility matrix in the project README to find the kube-prometheus release that matches your Kubernetes version:

image

[root@k8s-master01 prometheus]# git clone -b release-0.13 https://github.com/prometheus-operator/kube-prometheus.git
[root@k8s-master01 prometheus]# cd kube-prometheus/manifests/

Install the Prometheus Operator:

[root@k8s-master01 manifests]# kubectl create -f setup/
[root@k8s-master01 manifests]# kubectl get pods -n monitoring 
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-74cb44ccf7-t8m25   2/2     Running   0          25h

Once the Operator container is running, install the rest of the Prometheus stack. If the default images cannot be pulled in your environment, point them at a local registry first:

[root@k8s-master01 manifests]#  kubectl create -f .
[root@k8s-master01 manifests]# kubectl get pods -n monitoring 
NAME                                   READY   STATUS    RESTARTS   AGE
alertmanager-main-0                    2/2     Running   0          25h
alertmanager-main-1                    2/2     Running   0          25h
alertmanager-main-2                    2/2     Running   0          25h
blackbox-exporter-597d86cf5c-xd2m7     3/3     Running   0          25h
grafana-674557f4bd-sgnbb               1/1     Running   0          24h
kube-state-metrics-56f84757db-nn66j    3/3     Running   0          24h
mysql-exporter-65bdf76bb9-klq8f        1/1     Running   0          16h
node-exporter-4qbgp                    2/2     Running   0          25h
node-exporter-bwqcs                    2/2     Running   0          25h
node-exporter-bz9pk                    2/2     Running   0          25h
node-exporter-cdzm4                    2/2     Running   0          25h
node-exporter-ds4zd                    2/2     Running   0          25h
node-exporter-j6kb8                    2/2     Running   0          25h
node-exporter-vgdjw                    2/2     Running   0          25h
node-exporter-zcfb6                    2/2     Running   0          25h
node-exporter-zjjb9                    2/2     Running   0          25h
prometheus-adapter-7786cd46-6cfhd      1/1     Running   0          25h
prometheus-adapter-7786cd46-qwrmb      1/1     Running   0          25h
prometheus-k8s-0                       2/2     Running   0          19h
prometheus-k8s-1                       2/2     Running   0          19h
prometheus-operator-74cb44ccf7-t8m25   2/2     Running   0          25h

Change the grafana and prometheus-k8s Services to NodePort type:

[root@k8s-master01 manifests]# kubectl get svc -n monitoring 
NAME                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                         AGE
alertmanager-main       ClusterIP   10.96.41.127    <none>        9093/TCP,8080/TCP               25h
alertmanager-operated   ClusterIP   None            <none>        9093/TCP,9094/TCP,9094/UDP      25h
blackbox-exporter       ClusterIP   10.96.145.255   <none>        9115/TCP,19115/TCP              25h
grafana                 NodePort    10.96.20.246    <none>        3000:30336/TCP                  25h
kube-state-metrics      ClusterIP   None            <none>        8443/TCP,9443/TCP               25h
mysql-exporter          ClusterIP   10.96.74.237    <none>        9104/TCP                        17h
node-exporter           ClusterIP   None            <none>        9100/TCP                        25h
prometheus-adapter      ClusterIP   10.96.122.86    <none>        443/TCP                         25h
prometheus-k8s          NodePort    10.96.206.87    <none>        9090:32159/TCP,8080:32744/TCP   25h
prometheus-operated     ClusterIP   None            <none>        9090/TCP                        25h
prometheus-operator     ClusterIP   None            <none>        8443/TCP                        25h

image

2.1 (cont.) Data persistence

When deploying the Prometheus Operator in a Kubernetes cluster, some components need persistent storage so that data survives Pod restarts and rescheduling. I use rook-ceph storage here; see this article for deployment and usage: Setting up rook-ceph

  1. Prometheus
    Prometheus is a time series database that stores the monitoring data. Persisting it is critical, because this data backs historical queries, alerting, and analysis.
    [root@k8s-master01 manifests]# cat prometheus-prometheus.yaml
    # Add the following (storage.volumeClaimTemplate keeps the PVC after Pod
    # deletion; an `ephemeral` volume would be removed together with the Pod)
      storage:
        volumeClaimTemplate:
          spec:
            storageClassName: rook-ceph-block
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 10Gi
    
  2. Alertmanager
    Alertmanager handles alerts sent by Prometheus and routes them to the appropriate receivers (email, Slack, and so on). Persisting its data preserves alert history and silences.
    [root@k8s-master01 manifests]# cat alertmanager-alertmanager.yaml 
    # Add the following (again a persistent volumeClaimTemplate, not an ephemeral volume)
      storage:
        volumeClaimTemplate:
          spec:
            storageClassName: rook-ceph-block
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 10Gi
    
  3. Grafana
    If Grafana is used for visualization, its configuration and dashboard data should be persisted as well. An ephemeral volume's PVC is deleted together with the Pod, so reference a standalone PVC instead:
    [root@k8s-master01 manifests]# cat grafana-deployment.yaml 
    # Change the grafana-storage volume to reference a PVC
          volumes:
          - name: grafana-storage
            persistentVolumeClaim:
              claimName: grafana-storage
    # And create the PVC itself:
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: grafana-storage
      namespace: monitoring
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "rook-ceph-block"
      resources:
        requests:
          storage: 1Gi
    
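To choose PVC sizes for the volumes above, a rough rule of thumb from the Prometheus documentation is needed_disk ≈ retention_time × ingested_samples_per_second × bytes_per_sample, with roughly 1-2 bytes per sample after compression. A quick sketch with illustrative numbers:

```python
def estimate_disk_bytes(samples_per_second, retention_days, bytes_per_sample=2):
    """needed_disk ≈ retention_time * ingestion_rate * bytes_per_sample."""
    return samples_per_second * retention_days * 24 * 3600 * bytes_per_sample

# e.g. 10k samples/s kept for 15 days at ~2 bytes/sample:
print(estimate_disk_bytes(10_000, 15) / 1e9, "GB")  # ≈ 25.92 GB
```

Leave generous headroom; actual usage depends on churn, label cardinality, and compression.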
2.2 Sources of monitoring data

The Prometheus monitoring flow on Kubernetes is:

Create a ServiceMonitor to register a scrape target; its selector matches a Service. The Operator automatically discovers the ServiceMonitor and renders it into Prometheus scrape configuration.
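The selector step in this flow is plain Kubernetes label matching; a small sketch of the rule (label values are illustrative):

```python
def selector_matches(match_labels, service_labels):
    """A ServiceMonitor selects a Service when every key/value pair in
    spec.selector.matchLabels appears in the Service's metadata.labels
    (extra labels on the Service do not matter)."""
    return all(service_labels.get(k) == v for k, v in match_labels.items())

print(selector_matches({"app": "mysql-exporter"},
                       {"app": "mysql-exporter", "env": "prod"}))  # True
print(selector_matches({"app": "mysql-exporter"}, {"app": "mysql"}))  # False
```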

image

Commonly used exporters:

Type                Exporters
Databases           MySQL Exporter, Redis Exporter, MongoDB Exporter, MSSQL Exporter
Hardware            Apcupsd Exporter, IoT Edison Exporter, IPMI Exporter, Node Exporter
Message queues      Beanstalkd Exporter, Kafka Exporter, NSQ Exporter, RabbitMQ Exporter
Storage             Ceph Exporter, Gluster Exporter, HDFS Exporter, ScaleIO Exporter
HTTP services       Apache Exporter, HAProxy Exporter, Nginx Exporter
API services        AWS ECS Exporter, Docker Cloud Exporter, Docker Hub Exporter, GitHub Exporter
Logging             Fluentd Exporter, Grok Exporter
Monitoring systems  Collectd Exporter, Graphite Exporter, InfluxDB Exporter, Nagios Exporter, SNMP Exporter
Other               Blackbox Exporter, JIRA Exporter, Jenkins Exporter, Confluence Exporter
2.3 Monitoring a cloud-native application: Etcd

Test access to the Etcd metrics endpoint:

Use your own etcd certificates and address here. My cluster was installed with kubeadm and has 3 masters; any one of them works for this test:

[root@k8s-master01 manifests]# curl -s  --cert /etc/kubernetes/pki/etcd/server.crt  --key /etc/kubernetes/pki/etcd/server.key  https://192.168.31.20:2379/metrics -l -k|tail -5
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 4728
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0

The certificate paths can be found in the Etcd configuration (note that the location differs between clusters: with kubeadm it is usually /etc/kubernetes/manifests/etcd.yaml, with a binary install it may be /usr/lib/systemd/system/etcd.service):

image

2.3.1 Creating the Etcd Service

First, configure the Service and Endpoints for etcd:

[root@k8s-master01 prometheus]# vim etcd-svc.yaml 
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app: etcd-prom
  name: etcd-prom
  namespace: kube-system
subsets:
- addresses:
  - ip: 192.168.31.20   
  - ip: 192.168.31.21
  - ip: 192.168.31.22
  ports:
  - name: https-metrics
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: etcd-prom
  name: etcd-prom
  namespace: kube-system
spec:
  ports:
  - name: https-metrics
    port: 2379
    protocol: TCP
    targetPort: 2379
  type: ClusterIP

Change the addresses to your own etcd cluster members. Also note that the port name is https-metrics; it must match the name used in the ServiceMonitor created later. Create the resource and check the Service ClusterIP:

[root@k8s-master01 prometheus]# kubectl get svc -n kube-system etcd-prom 
NAME        TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
etcd-prom   ClusterIP   10.96.94.176   <none>        2379/TCP   20h

Testing through the ClusterIP shows that the etcd metrics are reachable via the Service:

[root@k8s-master01 prometheus]# curl -s  --cert /etc/kubernetes/pki/etcd/server.crt  --key /etc/kubernetes/pki/etcd/server.key  https://10.96.94.176:2379/metrics -l -k|tail -5
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 4760
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0

Because etcd is accessed over HTTPS, the etcd certificate files must be mounted into the Prometheus container (this deployment uses the Operator, so only the Prometheus resource needs to be modified).

Create a Secret from the Etcd certificates (adjust the paths to your environment):

[root@k8s-master01 manifests]# kubectl create secret generic  etcd-ssl --from-file=/etc/kubernetes/pki/etcd/ca.crt  --from-file=/etc/kubernetes/pki/etcd/server.crt  --from-file=/etc/kubernetes/pki/etcd/server.key  -n monitoring 

[root@k8s-master01 manifests]# kubectl get secret -n monitoring 
NAME                             TYPE     DATA   AGE
etcd-ssl                         Opaque   3      20h

Mount the Secret by adding it to the Prometheus resource:

[root@k8s-master01 manifests]# kubectl edit  prometheus -n monitoring  k8s 
...
  secrets:
  - etcd-ssl
...

image

After saving and exiting, the Prometheus Pods restart automatically. Once they are back up, verify that the certificates are mounted (any one of the Prometheus Pods will do):

[root@k8s-master01 manifests]# kubectl get pods -n monitoring 
NAME                                   READY   STATUS    RESTARTS   AGE
alertmanager-main-0                    2/2     Running   0          25h
alertmanager-main-1                    2/2     Running   0          25h
alertmanager-main-2                    2/2     Running   0          25h
blackbox-exporter-597d86cf5c-xd2m7     3/3     Running   0          25h
grafana-674557f4bd-sgnbb               1/1     Running   0          25h
kube-state-metrics-56f84757db-nn66j    3/3     Running   0          24h
mysql-exporter-65bdf76bb9-klq8f        1/1     Running   0          17h
node-exporter-4qbgp                    2/2     Running   0          25h
node-exporter-bwqcs                    2/2     Running   0          25h
node-exporter-bz9pk                    2/2     Running   0          25h
node-exporter-cdzm4                    2/2     Running   0          25h
node-exporter-ds4zd                    2/2     Running   0          25h
node-exporter-j6kb8                    2/2     Running   0          25h
node-exporter-vgdjw                    2/2     Running   0          25h
node-exporter-zcfb6                    2/2     Running   0          25h
node-exporter-zjjb9                    2/2     Running   0          25h
prometheus-adapter-7786cd46-6cfhd      1/1     Running   0          25h
prometheus-adapter-7786cd46-qwrmb      1/1     Running   0          25h
prometheus-k8s-0                       2/2     Running   0          20h
prometheus-k8s-1                       2/2     Running   0          20h
prometheus-operator-74cb44ccf7-t8m25   2/2     Running   0          25h

The Prometheus configuration directory can be confirmed with describe; it is /etc/prometheus/:

[root@k8s-master01 manifests]# kubectl describe pods -n monitoring  prometheus-k8s-0 
...
    Args:
      --watch-interval=0
      --listen-address=:8081
      --config-file=/etc/prometheus/config/prometheus.yaml.gz
      --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
      --watched-dir=/etc/prometheus/rules/prometheus-k8s-rulefiles-0

...

Inside the Prometheus container, the Secret is indeed mounted:

[root@k8s-master01 manifests]# kubectl exec -it prometheus-k8s-0 -n monitoring  sh 
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
/prometheus $ 
/prometheus $ ls /etc/prometheus/
certs              config_out         console_libraries  consoles           prometheus.yml     rules              secrets            web_config
/prometheus $ ls /etc/prometheus/secrets/
etcd-ssl
/prometheus $ ls /etc/prometheus/secrets/etcd-ssl/
ca.crt      server.crt  server.key

2.3.2 Creating the Etcd ServiceMonitor

Next, create the ServiceMonitor for Etcd:

[root@k8s-master01 prometheus]# cat etcd-servicemonitors.yaml 
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
 name: etcd
 namespace: monitoring
 labels:
   app: etcd
spec:
 jobLabel: k8s-app
 endpoints:
   - interval: 30s
     port: https-metrics # this port name must match Service.spec.ports.name
     scheme: https
     tlsConfig:
       caFile: /etc/prometheus/secrets/etcd-ssl/ca.crt # certificate paths inside the Prometheus Pod
       certFile: /etc/prometheus/secrets/etcd-ssl/server.crt
       keyFile: /etc/prometheus/secrets/etcd-ssl/server.key
       insecureSkipVerify: true # disable server certificate verification
 selector:
   matchLabels:
     app: etcd-prom # must match the Service's labels
 namespaceSelector:
   matchNames:
    - kube-system

[root@k8s-master01 prometheus]# kubectl create -f etcd-servicemonitors.yaml 

Once created, the target's configuration shows up in the Prometheus UI.

image

Grafana setup is omitted here: find a suitable dashboard template and point it at the data source.

Grafana dashboards: https://grafana.com/grafana/dashboards/?dataSource=prometheus&search=schedu

2.4 Monitoring a non-cloud-native application: MySQL

Using MySQL as the example, there are two cases, in-cluster and out-of-cluster:

  • If MySQL runs inside the Kubernetes cluster: create a Service for MySQL, deploy a MySQL exporter configured with the MySQL connection details, then create a ServiceMonitor whose selector matches the exporter's labels, so Prometheus discovers it automatically.
  • If MySQL runs outside the cluster: deploy the MySQL exporter with the external MySQL address in its configuration, then create a ServiceMonitor whose selector matches the exporter's labels.

This demo covers only the out-of-cluster case.

2.4.1 Deploying the MySQL exporter

I deploy the latest official Prometheus exporter here:

https://github.com/prometheus/mysqld_exporter

With exporter versions 0.15.0 and later:

image

the DATA_SOURCE_NAME environment variable is no longer supported. With the latest version you must mount a .my.cnf file into the exporter container, otherwise it fails with:

mysqld_exporter.go:225 level=info msg="Error parsing host config" file=.my.cnf err="no configuration

image

[root@k8s-master01 prometheus]# cat .my.cnf 
[client]
user=exporter
password=exporter
host=192.168.1.185
port=3306
[root@k8s-master01 prometheus]# cat mysql-exporter.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: mysql-exporter
  name: mysql-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mysql-exporter
  strategy: {}
  template:
    metadata:
      labels:
        app: mysql-exporter
    spec:
      containers:
      - image: harbor.dujie.com/dycloud/mysqld-exporter
        name: mysqld-exporter
        args:
        - "--config.my-cnf=/.my.cnf"  # point the exporter at the mounted config file
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        ports:
        - containerPort: 9104
        volumeMounts:
        - name: mysql
          mountPath: "/.my.cnf"       # mountPath must be an absolute path
          subPath: ".my.cnf"
      volumes:
      - name: mysql
        configMap:
          name: mysqlcnf
---
apiVersion: v1
kind: Service
metadata:
 name: mysql-exporter
 namespace: monitoring
 labels:
   app: mysql-exporter
spec:
 type: ClusterIP
 selector:
   app: mysql-exporter
 ports:
 - name: api
   port: 9104
   protocol: TCP

After creating these resources, verify that the MySQL metrics endpoint is reachable through the Service:

[root@k8s-master01 prometheus]# kubectl get svc -n monitoring 
NAME                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                         AGE
mysql-exporter          ClusterIP   10.96.74.237    <none>        9104/TCP                        17h

[root@k8s-master01 prometheus]# curl -s  10.96.74.237:9104/metrics |tail -5
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 4195
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0

With exporter versions before 0.15.0, the DATA_SOURCE_NAME environment variable still works:

[root@k8s-master01 prometheus]# cat mysql-exporter.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
 name: mysql-exporter
 namespace: monitoring
spec:
 replicas: 1
 selector:
   matchLabels:
     app: mysql-exporter
 template:
   metadata:
     labels:
       app: mysql-exporter
   spec:
     containers:
     - name: mysql-exporter
       image: registry.cn-beijing.aliyuncs.com/dotbalo/mysqld-exporter 
       env:
       - name: DATA_SOURCE_NAME
         value: "exporter:exporter@(192.168.1.185:3306)/"
       imagePullPolicy: IfNotPresent
       ports:
       - containerPort: 9104
---
apiVersion: v1
kind: Service
metadata:
 name: mysql-exporter
 namespace: monitoring
 labels:
   app: mysql-exporter
spec:
 type: ClusterIP
 selector:
   app: mysql-exporter
 ports:
 - name: api
   port: 9104
   protocol: TCP
2.4.2 Deploying the MySQL ServiceMonitor

Note that the selector must match the Service above.

[root@k8s-master01 prometheus]# vim mysql-servicemonitor.yaml 
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
 name: mysql-exporter
 namespace: monitoring
 labels:
   app: mysql-exporter
spec:
 jobLabel: k8s-app
 endpoints:
 - port: api
   interval: 30s
   scheme: http
 selector:
   matchLabels:
     app: mysql-exporter   # must match the Service's labels
 namespaceSelector:
   matchNames:
   - monitoring

Shortly after creation, the target shows up in the Prometheus UI.

image

Troubleshooting steps when ServiceMonitor-based monitoring fails:

  1. Confirm the ServiceMonitor was created successfully
[root@k8s-master01 prometheus]# kubectl get servicemonitors.monitoring.coreos.com  -n monitoring 
NAME                      AGE
alertmanager-main         26h
blackbox-exporter         26h
coredns                   26h
etcd                      20h
grafana                   26h
kube-apiserver            26h
kube-controller-manager   145m
kube-scheduler            96m
kube-state-metrics        26h
kubelet                   26h
mysql-exporter            17h
node-exporter             26h
prometheus-adapter        26h
prometheus-k8s            26h
prometheus-operator       26h
  2. Confirm the ServiceMonitor's labels and selector are configured correctly
[root@k8s-master01 prometheus]# kubectl get servicemonitors.monitoring.coreos.com  -n monitoring  mysql-exporter  -o yaml 
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"monitoring.coreos.com/v1","kind":"ServiceMonitor","metadata":{"annotations":{},"labels":{"app":"mysql-exporter"},"name":"mysql-exporter","namespace":"monitoring"},"spec":{"endpoints":[{"interval":"30s","port":"api","scheme":"http"}],"jobLabel":"k8s-app","namespaceSelector":{"matchNames":["monitoring"]},"selector":{"matchLabels":{"app":"mysql-exporter"}}}}
  creationTimestamp: "2024-07-10T09:33:07Z"
  generation: 1
  labels:
    app: mysql-exporter
  name: mysql-exporter
  namespace: monitoring
  resourceVersion: "323737"
  uid: bbfba7ef-a9f7-46ba-b60a-c3e2b1d31ebc
spec:
  endpoints:
  - interval: 30s
    port: api
    scheme: http
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - monitoring
  selector:
    matchLabels:
      app: mysql-exporter
  3. Confirm Prometheus generated the corresponding scrape configuration

image

  4. Confirm a Service matching the ServiceMonitor exists
    The ServiceMonitor's selector labels must match the Service's labels.

image

  5. Confirm the application's metrics endpoint is reachable through the Service

The metrics endpoint can be fetched successfully via the Service IP:

image

  6. Confirm the Service's port and scheme match the ServiceMonitor

These must agree. A ServiceMonitor can also reference a port number directly, but matching by name is recommended: if the Service's port number changes later, a name-based ServiceMonitor needs no modification.

image

2.5 Monitoring controller-manager and scheduler

After deploying the Operator, the controller-manager and scheduler have no metrics. Walking through the troubleshooting steps above shows that no Service is deployed for these two components by default, so the corresponding Services must be created:

[root@k8s-master01 prometheus]# cat controller-manager-svc.yml 
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app: controllermanager-prom
  name: controllermanager-prom
  namespace: kube-system
subsets:
- addresses:
  - ip: 192.168.31.20
  - ip: 192.168.31.21
  - ip: 192.168.31.22
  ports:
  - name: https-metrics   # port names are limited to 15 characters and must match the ServiceMonitor
    port: 10257
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: controllermanager-prom
  name: controllermanager-prom
  namespace: kube-system
spec:
  ports:
  - name: https-metrics
    port: 10257
    protocol: TCP
    targetPort: 10257
  type: ClusterIP
[root@k8s-master01 prometheus]# cat scheduler-svc.yml 
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app: scheduler-prom
  name: scheduler-prom
  namespace: kube-system
subsets:
- addresses:
  - ip: 192.168.31.20
  - ip: 192.168.31.21
  - ip: 192.168.31.22
  ports:
  - name: https-metrics   # matches the default kube-scheduler ServiceMonitor port name
    port: 10259
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: scheduler-prom
  name: scheduler-prom
  namespace: kube-system
spec:
  ports:
  - name: https-metrics
    port: 10259
    protocol: TCP
    targetPort: 10259
  type: ClusterIP

After creating the Services, testing the metrics endpoints returns 403:

image

After creating the ClusterRole and ClusterRoleBinding below, the endpoints become accessible:

[root@k8s-master01 prometheus]# cat role.yaml 
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: metrics-reader
rules:
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]

[root@k8s-master01 prometheus]# cat rolebind.yaml 
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: metrics-reader-binding
subjects:
- kind: User
  name: kubernetes
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: metrics-reader
  apiGroup: rbac.authorization.k8s.io

Then change the bind address of both components. kubeadm defaults it to 127.0.0.1, which Prometheus cannot reach; this must be changed on every master:

image

image

Then edit both components' ServiceMonitor selectors so that they match the labels of the Services just created:

image

[root@k8s-master01 prometheus]# kubectl edit servicemonitors.monitoring.coreos.com  -n monitoring kube-controller-manager 
...
  selector:
    matchLabels:
      app: controllermanager-prom

After this change, Prometheus automatically discovers their metrics.

image

2.6 Monitoring domains with Blackbox Exporter

https://github.com/prometheus/blackbox_exporter

Recent versions of the kube-prometheus stack install Blackbox Exporter by default, which can be verified with:

[root@k8s-master01 manifests]# kubectl get pods -n monitoring  -l app.kubernetes.io/name=blackbox-exporter
NAME                                 READY   STATUS    RESTARTS   AGE
blackbox-exporter-597d86cf5c-xd2m7   3/3     Running   0          5d3h
# A Service is created as well; Blackbox Exporter can be reached through it with probe parameters:
[root@k8s-master01 manifests]# kubectl get svc -n monitoring  -l app.kubernetes.io/name=blackbox-exporter
NAME                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)              AGE
blackbox-exporter   ClusterIP   10.96.145.255   <none>        9115/TCP,19115/TCP   5d3h

For example, to check the status of dycloud.fun (any public domain or internal company domain works):

[root@k8s-master01 manifests]# curl -s "http://10.96.145.255:19115/probe?target=dycloud.fun&module=http_2xx" |tail -5
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1

/probe is the endpoint path, target is the probe target, and module selects the probing module. If Blackbox Exporter is not deployed in the cluster, see:

https://github.com/prometheus/blackbox_exporter for installation instructions.
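The probe URL used above is just /probe plus two query parameters; a small sketch of how it is assembled:

```python
from urllib.parse import urlencode

def probe_url(exporter_addr, target, module):
    """Build a Blackbox Exporter probe URL: /probe is the endpoint,
    target is what to check, module picks the probe type."""
    return f"http://{exporter_addr}/probe?{urlencode({'target': target, 'module': module})}"

print(probe_url("10.96.145.255:19115", "dycloud.fun", "http_2xx"))
# http://10.96.145.255:19115/probe?target=dycloud.fun&module=http_2xx
```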

2.7 Prometheus static configuration

First create an empty file and build a Secret from it; that Secret will serve as Prometheus's additional static configuration:

[root@k8s-master01 prometheus]# touch  prometheus-additional.yaml 
[root@k8s-master01 prometheus]# kubectl create secret generic  additional-configs --from-file=prometheus-additional.yaml  -n monitoring 

After creating it, add the reference in the Prometheus resource:

[root@k8s-master01 prometheus]# kubectl edit prometheus -n monitoring  k8s 
  additionalScrapeConfigs:
    key: prometheus-additional.yaml
    name: additional-configs
    optional: true

image

No Prometheus restart is needed after this. Next, edit some static configuration into the prometheus-additional.yaml file:

[root@k8s-master01 prometheus]# cat prometheus-additional.yaml 
- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx] # Look for a HTTP 200 response.
  static_configs:
    - targets:
      - https://dycloud.fun # Target to probe with http.
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox-exporter:19115 # The blackbox exporter's real hostname:port.
  • targets: the targets to probe
  • params: which module to probe with
  • replacement: the address of the Blackbox Exporter
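The three relabel rules above can be traced by hand; a simplified sketch that applies them to one target (real relabeling supports regexes; these particular rules are plain copies and one static replacement):

```python
def apply_blackbox_relabel(labels, exporter_addr):
    """Apply the three relabel rules from the blackbox job above."""
    out = dict(labels)
    out["__param_target"] = out["__address__"]  # becomes the ?target= parameter
    out["instance"] = out["__param_target"]     # keep the probed URL as the instance label
    out["__address__"] = exporter_addr          # actually scrape the exporter instead
    return out

print(apply_blackbox_relabel({"__address__": "https://dycloud.fun"},
                             "blackbox-exporter:19115"))
```

The net effect: Prometheus scrapes the Blackbox Exporter, passing the original target as a query parameter, while the instance label still identifies the probed URL.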

The content is identical to a traditional Prometheus configuration; just add the required job. Then update the Secret from this file:

[root@k8s-master01 prometheus]# kubectl create secret generic  additional-configs --from-file=prometheus-additional.yaml  --dry-run=client -o yaml |kubectl  apply -f  - -nmonitoring 
secret/additional-configs configured

About a minute after the update, the configuration appears in the Prometheus Web UI:

image

Then import the blackbox monitoring dashboard in Grafana (template id: 13659):

image

2.8 Monitoring external Windows hosts

The exporter for Linux hosts is https://github.com/prometheus/node_exporter , and the exporter for Windows hosts is https://github.com/prometheus-community/windows_exporter

First, download the exporter onto the Windows host (MSI download: https://github.com/prometheus-community/windows_exporter/releases ):

image

After downloading, double-click the installer to complete installation; the process then shows up in Task Manager:

image

Windows Exporter exposes port 9182, through which the Windows monitoring data can be accessed.

Next, add the following to the static configuration file:

- job_name: 'WindowsServerMonitor'
  static_configs:
    - targets:
        - "1.1.1.1:9182"
      labels:
        server_type: 'windows'
  relabel_configs:
    - source_labels: [__address__]
      target_label: instance

targets lists the hosts to monitor; for multiple Windows hosts, add more entries (each host needs its own exporter). The monitoring data then appears in the Prometheus Web UI:

image

Finally, import the dashboard template (https://grafana.com/grafana/dashboards/12566):

image

3. DingTalk alerting with PrometheusAlert

Project: https://github.com/feiyu563/PrometheusAlert

PrometheusAlert is an open-source alert-forwarding hub for operations. It accepts alert messages from mainstream monitoring systems (Prometheus, Zabbix), log systems (Graylog 2/3), the Grafana visualization platform, SonarQube, Alibaba Cloud CloudMonitor, and anything that can call a WebHook, and forwards them to DingTalk, WeChat, email, Feishu, Tencent SMS, Tencent voice calls, Alibaba Cloud SMS, Alibaba Cloud voice calls, Huawei SMS, Baidu Cloud SMS, Ronglian voice calls, 7moor SMS, 7moor voice, Telegram, Baidu Hi (Infoflow), Kafka, and more.
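At its core, PrometheusAlert receives an Alertmanager webhook payload and reformats it for each channel. A much-simplified sketch of that transformation for Feishu (field names follow the public Alertmanager webhook and Feishu custom-bot formats; the real project does far more, including configurable templates):

```python
import json

def alertmanager_to_feishu_text(payload):
    """Turn an Alertmanager webhook payload into a Feishu bot text message."""
    lines = []
    for alert in payload.get("alerts", []):
        name = alert.get("labels", {}).get("alertname", "unknown")
        summary = alert.get("annotations", {}).get("summary", "")
        lines.append(f"[{payload.get('status', '?')}] {name}: {summary}")
    return {"msg_type": "text", "content": {"text": "\n".join(lines)}}

payload = {"status": "firing",
           "alerts": [{"labels": {"alertname": "HighCPU"},
                       "annotations": {"summary": "node cpu > 90%"}}]}
print(json.dumps(alertmanager_to_feishu_text(payload), ensure_ascii=False))
```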

image

Project architecture:

The PrometheusAlert backend uses the beego framework; the frontend uses the AdminLTE template (based on Bootstrap and jQuery).

├── cmd: scripts
├── conf: configuration
├── controllers: controllers
├── db: default sqlite data
├── doc: documentation
├── docker-entrypoint.sh: container entrypoint
├── Dockerfile
├── example: examples
├── go.mod
├── go.sum
├── LICENSE
├── main.go
├── Makefile
├── models: models
├── PrometheusAlert: binary
├── PrometheusAlertVoicePlugin
├── README.MD
├── routers: routes
├── static: static assets
├── swagger
├── tests: tests
├── views: frontend templates
└── zabbixclient

Installing on Kubernetes

# On Kubernetes a single command is enough. Note that the default deployment template
# does not mount the template database file db/PrometheusAlertDB.db; add your own
# volume mount for it to avoid losing template data.
kubectl apply -n monitoring -f https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/example/kubernetes/PrometheusAlert-Deployment.yaml

[root@k8s-master01 prometheus]# cat PrometheusAlert-Deployment.yaml 
# apiVersion: v1
# kind: Namespace
# metadata:
#   name: monitoring
---  
apiVersion: v1
data:
  app.conf: |
    #---------------------↓ global settings -----------------------
    appname = PrometheusAlert
    #login username
    login_user=prometheusalert
    #login password
    login_password=prometheusalert
    #listen address
    httpaddr = "0.0.0.0"
    #listen port
    httpport = 8080
    runmode = dev
    #proxy setting, e.g. proxy = http://123.123.123.123:8080
    proxy =
    #enable JSON request bodies
    copyrequestbody = true
    #alert message title
    title=PrometheusAlert
    #link back to the alerting platform
    GraylogAlerturl=http://graylog.org
    #DingTalk alert logo URL
    logourl=https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png
    #DingTalk recovery logo URL
    rlogourl=https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png
    #SMS alert level (level 3 triggers SMS). Levels: 0 info, 1 warning, 2 major, 3 critical, 4 disaster
    messagelevel=3
    #voice-call alert level (level 4 triggers a call). Levels: 0 info, 1 warning, 2 major, 3 critical, 4 disaster
    phonecalllevel=4
    #default phone number (required for testing SMS and calls from the web UI)
    defaultphone=xxxxxxxx
    #call on recovery: 0 disabled, 1 enabled
    phonecallresolved=0
    #log output target: file or console
    logtype=file
    #log file path
    logpath=logs/prometheusalertcenter.log
    #convert Prometheus/Graylog alert timestamps to CST (do not enable if already CST)
    prometheus_cst_time=0
    #database driver: sqlite3, mysql or postgres; for mysql/postgres uncomment db_host,db_port,db_user,db_password,db_name
    db_driver=sqlite3
    #db_host=127.0.0.1
    #db_port=3306
    #db_user=root
    #db_password=root
    #db_name=prometheusalert
    #alert history recording: 0 disabled, 1 enabled
    AlertRecord=0
    #scheduled deletion of alert history: 0 disabled, 1 enabled
    RecordLive=0
    #alert history retention period, in days
    RecordLiveDay=7
    #write alert records to es7: 0 disabled, 1 enabled
    alert_to_es=0
    #es addresses, a []string
    #beego.Appconfig.Strings reads []string values separated by ";" rather than ","
    to_es_url=http://localhost:9200
    #to_es_url=http://es1:9200;http://es2:9200;http://es3:9200
    #es user and password
    #to_es_user=username
    #to_es_pwd=password
    #max idle keep-alive connections
    maxIdleConns=100
    #hot reload of the config file
    open-hotreload=0
    
    #---------------------↓ webhook -----------------------
    #enable the DingTalk alert channel (multiple channels may be enabled at once): 0 disabled, 1 enabled
    open-dingding=1
    #default DingTalk robot URL
    #ddurl=https://oapi.dingtalk.com/robot/send?access_token=cf7a7becfb3889c1d75b31be7fa100e14576d6730a9634752692c098b81a93b1&secret=SEC30cfb8f6cfc8565251a90ec14afbed005c3b84b1a75ccfa16cf1fcf736a4a48f&at=15811047166
    ddurl=https://oapi.dingtalk.com/robot/send?access_token=cf7a7becfb3889c1d75b31be7fa100e14576d6730a9634752692c098b81a93b1
    #@everyone: 0 disabled, 1 enabled
    dd_isatall=1
    #DingTalk robot signing: 0 disabled, 1 enabled
    #usage: https://oapi.dingtalk.com/robot/send?access_token=XXXXXX&secret=mysecret
    open-dingding-secret=0
    
    #enable the WeChat alert channel: 0 disabled, 1 enabled
    open-weixin=1
    #default WeChat Work robot URL
    wxurl=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxx
    
    #enable the Feishu alert channel: 0 disabled, 1 enabled
    open-feishu=1
    #default Feishu robot URL
    fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/61132219-e82a-4aae-9186-3664efa9d8c1
    #contentType of outgoing webhook HTTP requests, e.g. application/json, application/x-www-form-urlencoded; defaults to application/json
    wh_contenttype=application/json
    
    #---------------------↓ Tencent Cloud API -----------------------
    #enable the Tencent Cloud SMS channel: 0 disabled, 1 enabled
    open-txdx=0
    #Tencent Cloud SMS API key
    TXY_DX_appkey=xxxxx
    #Tencent Cloud SMS template ID; the template can be e.g. "prometheus alert: {1}"
    TXY_DX_tpl_id=xxxxx
    #Tencent Cloud SMS sdk app id
    TXY_DX_sdkappid=xxxxx
    #Tencent Cloud SMS signature; must be an approved signature
    TXY_DX_sign=腾讯云
    
    #enable the Tencent Cloud voice-call channel: 0 disabled, 1 enabled
    open-txdh=0
    #Tencent Cloud voice-call API key
    TXY_DH_phonecallappkey=xxxxx
    #Tencent Cloud voice-call template ID
    TXY_DH_phonecalltpl_id=xxxxx
    #Tencent Cloud voice-call sdk app id
    TXY_DH_phonecallsdkappid=xxxxx
    
    #---------------------↓ Huawei Cloud API -----------------------
    #enable the Huawei Cloud SMS channel: 0 disabled, 1 enabled
    open-hwdx=0
    #Huawei Cloud SMS API key
    HWY_DX_APP_Key=xxxxxxxxxxxxxxxxxxxxxx
    #Huawei Cloud SMS API secret
    HWY_DX_APP_Secret=xxxxxxxxxxxxxxxxxxxxxx
    #Huawei Cloud APP access endpoint
    HWY_DX_APP_Url=https://rtcsms.cn-north-1.myhuaweicloud.com:10743
    #Huawei Cloud SMS template ID
    HWY_DX_Templateid=xxxxxxxxxxxxxxxxxxxxxx
    #Huawei Cloud signature name; must be an approved signature matching the template type
    HWY_DX_Signature=华为云
    #Huawei Cloud signature channel number
    HWY_DX_Sender=xxxxxxxxxx
    
    #---------------------↓ Alibaba Cloud API -----------------------
    #enable the Alibaba Cloud SMS channel: 0 disabled, 1 enabled
    open-alydx=0
    #Alibaba Cloud SMS main-account AccessKey ID
    ALY_DX_AccessKeyId=xxxxxxxxxxxxxxxxxxxxxx
    #Alibaba Cloud SMS API secret
    ALY_DX_AccessSecret=xxxxxxxxxxxxxxxxxxxxxx
    #Alibaba Cloud SMS signature name
    ALY_DX_SignName=阿里云
    #Alibaba Cloud SMS template ID
    ALY_DX_Template=xxxxxxxxxxxxxxxxxxxxxx
    
    #enable the Alibaba Cloud voice-call channel: 0 disabled, 1 enabled
    open-alydh=0
    #Alibaba Cloud voice-call main-account AccessKey ID
    ALY_DH_AccessKeyId=xxxxxxxxxxxxxxxxxxxxxx
    #Alibaba Cloud voice-call API secret
    ALY_DH_AccessSecret=xxxxxxxxxxxxxxxxxxxxxx
    #Alibaba Cloud caller display number; must be a purchased number
    ALY_DX_CalledShowNumber=xxxxxxxxx
    #Alibaba Cloud text-to-speech (TTS) template ID
    ALY_DH_TtsCode=xxxxxxxx
    
    #---------------------↓ Ronglian Cloud API -----------------------
    #enable the Ronglian Cloud voice-call channel: 0 disabled, 1 enabled
    open-rlydh=0
    #Ronglian Cloud base API URL
    RLY_URL=https://app.cloopen.com:8883/2013-12-26/Accounts/
    #Ronglian Cloud console SID
    RLY_ACCOUNT_SID=xxxxxxxxxxx
    #Ronglian Cloud api-token
    RLY_ACCOUNT_TOKEN=xxxxxxxxxx
    #Ronglian Cloud app_id
    RLY_APP_ID=xxxxxxxxxxxxx
    
    #---------------------↓ email settings -----------------------
    #enable email
    open-email=0
    #SMTP server address
    Email_host=smtp.qq.com
    #SMTP server port
    Email_port=465
    #email account
    Email_user=xxxxxxx@qq.com
    #email password
    Email_password=xxxxxx
    #email subject
    Email_title=运维告警
    #default recipients
    Default_emails=xxxxx@qq.com,xxxxx@qq.com
    
    #---------------------↓ 7moor Cloud API -----------------------
    #enable the 7moor SMS channel: 0 disabled, 1 enabled
    open-7moordx=0
    #7moor account ID
    7MOOR_ACCOUNT_ID=Nxxx
    #7moor account APISecret
    7MOOR_ACCOUNT_APISECRET=xxx
    #7moor SMS template number
    7MOOR_DX_TEMPLATENUM=n
    #note: the 7moor SMS variable is hard-coded to var1 in the code
    #-----------
    #enable the 7moor webcall voice channel: 0 disabled, 1 enabled
    open-7moordh=0
    #add a virtual service number and a text node in the 7moor console first
    #7moor webcall virtual service number
    7MOOR_WEBCALL_SERVICENO=xxx
    #variable substituted in the text node; mine is text, change this if yours differs
    7MOOR_WEBCALL_VOICE_VAR=text
    
    #---------------------↓ telegram API -----------------------
    #enable the telegram channel: 0 disabled, 1 enabled
    open-tg=0
    #tg robot token
    TG_TOKEN=xxxxx
    #tg消息模式 个人消息或者频道消息 0为关闭(推送给个人),1为开启(推送给频道)
    TG_MODE_CHAN=0
    #tg用户ID
    TG_USERID=xxxxx
    #tg频道name或者id, 频道name需要以@开始
    TG_CHANNAME=xxxxx
    #tg api地址, 可以配置为代理地址
    #TG_API_PROXY="https://api.telegram.org/bot%s/%s"
    
    #---------------------↓workwechat接口-----------------------
    #是否开启workwechat告警通道,可同时开始多个通道0为关闭,1为开启
    open-workwechat=0
    # 企业ID
    WorkWechat_CropID=xxxxx
    # 应用ID
    WorkWechat_AgentID=xxxx
    # 应用secret
    WorkWechat_AgentSecret=xxxx
    # 接受用户
    WorkWechat_ToUser="zhangsan|lisi"
    # 接受部门
    WorkWechat_ToParty="ops|dev"
    # 接受标签
    WorkWechat_ToTag=""
    # 消息类型, 暂时只支持markdown
    # WorkWechat_Msgtype = "markdown"
    
    #---------------------↓百度云接口-----------------------
    #是否开启百度云短信告警通道,可同时开始多个通道0为关闭,1为开启
    open-baidudx=0
    #百度云短信接口AK(ACCESS_KEY_ID)
    BDY_DX_AK=xxxxx
    #百度云短信接口SK(SECRET_ACCESS_KEY)
    BDY_DX_SK=xxxxx
    #百度云短信ENDPOINT(ENDPOINT参数需要用指定区域的域名来进行定义,如服务所在区域为北京,则为)
    BDY_DX_ENDPOINT=http://smsv3.bj.baidubce.com
    #百度云短信模版ID,根据自己审核通过的模版来填写(模版支持一个参数code:如prometheus告警:{code})
    BDY_DX_TEMPLATE_ID=xxxxx
    #百度云短信签名ID,根据自己审核通过的签名来填写
    TXY_DX_SIGNATURE_ID=xxxxx
    
    #---------------------↓百度Hi(如流)-----------------------
    #是否开启百度Hi(如流)告警通道,可同时开始多个通道0为关闭,1为开启
    open-ruliu=0
    #默认百度Hi(如流)机器人地址
    BDRL_URL=https://api.im.baidu.com/api/msg/groupmsgsend?access_token=xxxxxxxxxxxxxx
    #百度Hi(如流)群ID
    BDRL_ID=123456
    #---------------------↓bark接口-----------------------
    #是否开启telegram告警通道,可同时开始多个通道0为关闭,1为开启
    open-bark=0
    #bark默认地址, 建议自行部署bark-server
    BARK_URL=https://api.day.app
    #bark key, 多个key使用分割
    BARK_KEYS=xxxxx
    # 复制, 推荐开启
    BARK_COPY=1
    # 历史记录保存,推荐开启
    BARK_ARCHIVE=1
    # 消息分组
    BARK_GROUP=PrometheusAlert
    
    #---------------------↓语音播报-----------------------
    #语音播报需要配合语音播报插件才能使用
    #是否开启语音播报通道,0为关闭,1为开启
    open-voice=1
    VOICE_IP=127.0.0.1
    VOICE_PORT=9999
    
    #---------------------↓飞书机器人应用-----------------------
    #是否开启feishuapp告警通道,可同时开始多个通道0为关闭,1为开启
    open-feishuapp=1
    # APPID
    FEISHU_APPID=cli_xxxxxxxxxxxxx
    # APPSECRET
    FEISHU_APPSECRET=xxxxxxxxxxxxxxxxxxxxxx
    # 可填飞书 用户open_id、user_id、union_ids、部门open_department_id
    AT_USER_ID="xxxxxxxx"
    
    
    #---------------------↓告警组-----------------------
    # 有其他新增的配置段,请放在告警组的上面
    # 暂时仅针对 PrometheusContronller 中的 /prometheus/alert 路由
    # 告警组如果放在了 wx, dd... 那部分的上分,beego section 取 url 值不太对。
    # 所以这里使用 include 来包含另告警组配置
    
    # 是否启用告警组功能
    open-alertgroup=0
    
    # 自定义的告警组既可以写在这里,也可以写在单独的文件里。
    # 写在单独的告警组配置里更便于修改。
    # include "alertgroup.conf"
    
    #---------------------↓kafka地址-----------------------
    # kafka服务器的地址
    open-kafka=1
    kafka_server = 127.0.0.1:9092
    # 写入消息的kafka topic
    kafka_topic = devops
    # 用户标记该消息是来自PrometheusAlert,一般无需修改
    kafka_key = PrometheusAlert
  user.csv: |
    2019年4月10日,15888888881,小张,15999999999,备用联系人小陈,15999999998,备用联系人小赵
    2019年4月11日,15888888882,小李,15999999999,备用联系人小陈,15999999998,备用联系人小赵
    2019年4月12日,15888888883,小王,15999999999,备用联系人小陈,15999999998,备用联系人小赵
    2019年4月13日,15888888884,小宋,15999999999,备用联系人小陈,15999999998,备用联系人小赵
kind: ConfigMap
metadata:
  name: prometheus-alert-center-conf
  namespace: monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: prometheus-alert-center
    alertname: prometheus-alert-center
  name: prometheus-alert-center
  namespace: monitoring  
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-alert-center
      alertname: prometheus-alert-center
  template:
    metadata:
      labels:
        app: prometheus-alert-center
        alertname: prometheus-alert-center
    spec:
      initContainers:
      - name: init-time-sync
        image: harbor.dujie.com/dycloud/ntpdate
        command:
        - sh
        - -c
        - "ntpdate ntp.aliyun.com"
        securityContext:
          privileged: true
      containers:
      - image: harbor.dujie.com/dycloud/prometheus-alert:v4.9.1
        name: prometheus-alert-center
        env:
        - name: TZ
          value: "Asia/Shanghai"
        ports:
        - containerPort: 8080
          name: http
        resources:
          limits:
            cpu: 200m
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        volumeMounts:
        - name: prometheus-alert-center-conf-map
          mountPath: /app/conf/app.conf
          subPath: app.conf
        - name: prometheus-alert-center-conf-map
          mountPath: /app/user.csv
          subPath: user.csv
        - name: prometheus-alert-volume
          mountPath: /app/db
      volumes:
      - name: prometheus-alert-volume
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes: ["ReadWriteOnce"]
              storageClassName: "rook-ceph-block"
              resources:
                requests:
                  storage: 1Gi
      - name: prometheus-alert-center-conf-map
        configMap:
          name: prometheus-alert-center-conf
          items:
          - key: app.conf
            path: app.conf
          - key: user.csv
            path: user.csv
---
apiVersion: v1
kind: Service
metadata:
  labels:
    alertname: prometheus-alert-center
  name: prometheus-alert-center
  namespace: monitoring  
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '8080'  
spec:
  ports:
  - name: http
    port: 8080
    targetPort: http
  selector:
    app: prometheus-alert-center
  type: NodePort
---
# apiVersion: networking.k8s.io/v1beta1
# kind: Ingress
# metadata:
#   annotations:
#     kubernetes.io/ingress.class: nginx
#   name: prometheus-alert-center
#   namespace: monitoring
# spec:
#   rules:
#     - host: alert-center.local
#       http:
#         paths:
#           - backend:
#               serviceName: prometheus-alert-center
#               servicePort: 8080
#             path: /    

#After startup, open the following address in a browser: http://[YOUR-PrometheusAlert-URL]:8080
#The default login account and password are configured in app.conf

Open DingTalk, go into the target group, and choose Group Settings -> Group Assistant -> Add Robot -> Custom, as shown below:

image

image

Newer DingTalk versions added security settings; just pick Custom Keywords under the security settings and set the keyword to Prometheus, or to the title value configured in app.conf, as shown below:

image

image

Copy the Webhook URL shown above and fill it into the corresponding option in the PrometheusAlert configuration file app.conf.

PS: DingTalk robots now support @mentions; to use this you need the phone number bound to the target user's DingTalk account, as shown below:

image

DingTalk-related configuration

**PrometheusAlert-Deployment.yaml**

#---------------------↓Global configuration-----------------------
#Alert message title
title=PrometheusAlert
#DingTalk alert logo URL
logourl=https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png
#DingTalk recovery logo URL
rlogourl=https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png

#---------------------↓webhook-----------------------
#Whether to enable the DingTalk alert channel; multiple channels can be enabled at once (0 = off, 1 = on)
open-dingding=1
#Default DingTalk robot webhook URL
ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxx
#Whether to @ everyone (0 = off, 1 = on)
dd_isatall=1
#Whether to enable DingTalk robot signing (0 = off, 1 = on)
# Usage: https://oapi.dingtalk.com/robot/send?access_token=XXXXXX&secret=mysecret
open-dingding-secret=0
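When open-dingding-secret is enabled, DingTalk expects each webhook request to carry a millisecond timestamp plus an HMAC-SHA256 signature derived from the robot's secret, per DingTalk's documented signing scheme. A minimal Python sketch (the token and secret values here are placeholders, not real credentials):

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def dingtalk_sign(secret: str, timestamp_ms: int) -> str:
    """DingTalk robot signature: urlencode(base64(HMAC-SHA256(secret, "{ts}\n{secret}")))."""
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(secret.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha256).digest()
    return urllib.parse.quote_plus(base64.b64encode(digest))


def signed_webhook(base_url: str, secret: str) -> str:
    """Append timestamp and sign query parameters to the robot webhook URL."""
    ts = int(time.time() * 1000)
    return f"{base_url}&timestamp={ts}&sign={dingtalk_sign(secret, ts)}"


# Placeholder token and secret for illustration only:
url = signed_webhook("https://oapi.dingtalk.com/robot/send?access_token=XXXX",
                     "SECxxxxxxxx")
```

The signature changes with every timestamp, which is why the robot rejects requests whose timestamp is too far from the server clock (the init container above syncing time with ntpdate helps avoid exactly that).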

After editing, apply this file, then modify alertmanager-secret.yaml.

First check the prometheusalert service address and port:

[root@k8s-master01 manifests]# kubectl get svc -n monitoring 
NAME                          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                         AGE
prometheus-alert-center       NodePort    10.96.167.90    <none>        8080:30621/TCP                  3d20h
[root@k8s-master01 manifests]# cat alertmanager-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/instance: main
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.27.0
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    "global":
      "resolve_timeout": "5m"
      smtp_from: "15811047166@163.com"
      smtp_smarthost: "smtp.163.com:465"
      smtp_hello: "163.com"
      smtp_auth_username: "15811047166@163.com"
      smtp_auth_password: "KWLDKEYTJSMWYDRV"
      smtp_require_tls: false   
    "inhibit_rules":
    - "equal":
      - "alertname"
      "source_matchers":
      - "severity = critical"
      "target_matchers":
      - "severity =~ warning|info"
    - "equal":
      - "alertname"
      "source_matchers":
      - "severity = warning"
      "target_matchers":
      - "severity = info"
    - "equal":
      - "namespace"
      "source_matchers":
      - "alertname = InfoInhibitor"
      "target_matchers":
      - "severity = info"
    "receivers":
    - "name": "Default"
      "email_configs": 
      - "to": "15811047166@163.com"
        "send_resolved": true
    - "name": "webhook"
      webhook_configs: # The url is the prometheusalert svc address and port, followed by a fixed format: type is the media type (fs = Feishu, dd = DingTalk), tpl is the template name, and the trailing fsurl is the Feishu robot webhook URL
      - url: "http://10.96.167.90:8080/prometheusalert?type=fs&tpl=prometheus-fs&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxx"
    - "name": "Watchdog"
    - "name": "dingding"
      webhook_configs: # The url is the prometheusalert svc address and port, followed by a fixed format: type is the media type (fs = Feishu, dd = DingTalk), tpl is the template name, and the trailing ddurl is the DingTalk robot webhook URL
      - url: "http://10.96.167.90:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxx"
    - "name": "Critical"
    - "name": "null"
    "route":
      "group_by":
      - "alertname"
      "group_interval": "2m"
      "group_wait": "10s"
      "receiver": "Default"
      "repeat_interval": "2m"
      "routes":
      - "matchers":
        - "alertname = Watchdog"
        "receiver": "Watchdog"
      - "matchers":
        - "alertname = InfoInhibitor"
        "receiver": "null"
      - "matchers":
        - "severity = critical"
        "receiver": "dingding"
      - "matchers":
        - "severity = warning"
        "receiver": "dingding"
type: Opaque
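The query string that Alertmanager passes to PrometheusAlert can also be assembled programmatically. A small sketch (the service address and robot URL are placeholders) of the type/tpl/target parameters the /prometheusalert route expects; note that percent-encoding the nested robot URL keeps its own query string from being misparsed:

```python
from urllib.parse import urlencode


def prometheusalert_url(svc: str, media_type: str, tpl: str, **targets: str) -> str:
    """Build a PrometheusAlert webhook URL.

    media_type: "fs" (Feishu), "dd" (DingTalk), etc.
    targets: e.g. fsurl=... or ddurl=... pointing at the robot webhook.
    """
    params = {"type": media_type, "tpl": tpl, **targets}
    return f"http://{svc}/prometheusalert?{urlencode(params)}"


# Placeholder cluster IP and robot token:
url = prometheusalert_url(
    "10.96.167.90:8080", "dd", "prometheus-dd",
    ddurl="https://oapi.dingtalk.com/robot/send?access_token=xxxxx")
```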

After the changes, check the alertmanager and prometheusalert logs:

[root@k8s-master01 manifests]# kubectl logs -f -n monitoring  alertmanager-main-0
ts=2024-07-15T02:25:59.706Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2024-07-15T02:25:59.707Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
This log output indicates the configuration was loaded successfully.

Then go to the PrometheusAlert page.

Fill in a custom template; the one below renders nicely:

{{ $var := .externalURL}}{{ $status := .status}}{{ range $k,$v:=.alerts }} {{if eq $status "resolved"}}
## [Alert Resolved - Notification]({{$var}})
#### Metric: {{$v.labels.alertname}}
{{ if eq $v.labels.severity "warning" }}
#### Severity: **<font color="#E6A23C">{{$v.labels.severity}}</font>**
{{ else if eq $v.labels.severity "critical"  }}
#### Severity: **<font color="#F56C6C">{{$v.labels.severity}}</font>**
{{ end }}
#### Status: **<font color="#67C23A" size=4>Resolved</font>**
#### Host: {{$v.labels.instance}}
* ###### Threshold: {{$v.labels.threshold}}
* ###### Started at: {{GetCSTtime $v.startsAt}}
* ###### Resolved at: {{GetCSTtime $v.endsAt}}

#### Recovery: <font color="#67C23A">Resolved. {{$v.annotations.description}}</font>
{{ else }}
## [Monitoring Alert - Notification]({{$var}})
#### Metric: {{$v.labels.alertname}}
{{ if eq $v.labels.severity "warning" }}
#### Severity: **<font color="#E6A23C" size=4>{{$v.labels.severity}}</font>**
#### Status: **<font color="#E6A23C">Needs attention</font>**
{{ else if eq $v.labels.severity "critical"  }}
#### Severity: **<font color="#F56C6C" size=4>{{$v.labels.severity}}</font>**
#### Status: **<font color="#F56C6C">Needs attention</font>**
{{ end }}
#### Host: {{$v.labels.instance}}
* ###### Threshold: {{$v.labels.threshold}}
* ###### Duration: {{$v.labels.for_time}}
* ###### Triggered at: {{GetCSTtime $v.startsAt}}
{{ if eq $v.labels.severity "warning" }}
#### Trigger: <font color="#E6A23C">{{$v.annotations.description}}</font>
{{ else if eq $v.labels.severity "critical" }}
#### Trigger: <font color="#F56C6C">{{$v.annotations.description}}</font>
{{ end }}
{{ end }}
{{ end }}
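The template iterates over the alerts array of Alertmanager's webhook payload; each $v exposes labels, annotations, startsAt, and endsAt. This sketch mimics the severity-to-color choice the template makes (the payload fragment follows the Alertmanager webhook format; alert names and values are made up, and the color codes are the ones used above):

```python
# Font colors used by the template for each severity, plus the resolved color.
SEVERITY_COLORS = {"warning": "#E6A23C", "critical": "#F56C6C"}
RESOLVED_COLOR = "#67C23A"


def alert_color(alert: dict, status: str) -> str:
    """Pick the font color the template would render for one alert."""
    if status == "resolved":
        return RESOLVED_COLOR
    return SEVERITY_COLORS.get(alert["labels"].get("severity", ""), "#000000")


# A minimal Alertmanager-style webhook payload fragment (illustrative values):
payload = {
    "status": "firing",
    "alerts": [{
        "labels": {"alertname": "KubeNodeNotReady", "severity": "critical",
                   "instance": "k8s-node06"},
        "annotations": {"description": "node k8s-node06 NotReady"},
        "startsAt": "2024-07-15T03:48:13Z",
    }],
}
colors = [alert_color(a, payload["status"]) for a in payload["alerts"]]
```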

Fill in the DingTalk robot URL, click Save Template, then click Template Test.

image

Check the logs:

2024/07/15 03:48:13.583 [D] [value.go:586]  [1721015293583397817] sss
2024/07/15 03:48:13.583 [D] [server.go:2936]  |  172.16.32.128| 200 |    456.082µs|   match| POST     /prometheusalert   r:/prometheusalert
If there are no errors, it worked.

Now verify that an actual metric can trigger an alert; I use kubelet here to demonstrate. First, list all the rule files:

[root@k8s-master01 manifests]# pwd 
/root/yaml/prometheus/kube-prometheus-main/manifests
[root@k8s-master01 manifests]# ll *Rule*
-rw-r--r-- 1 root root  6979 7月  9 01:29 alertmanager-prometheusRule.yaml
-rw-r--r-- 1 root root  1418 7月  9 01:29 grafana-prometheusRule.yaml
-rw-r--r-- 1 root root  4301 7月  9 01:29 kubePrometheus-prometheusRule.yaml
-rw-r--r-- 1 root root 73239 7月  9 01:29 kubernetesControlPlane-prometheusRule.yaml
-rw-r--r-- 1 root root  3830 7月 12 16:39 kubeStateMetrics-prometheusRule.yaml
-rw-r--r-- 1 root root 19720 7月  9 01:29 nodeExporter-prometheusRule.yaml
-rw-r--r-- 1 root root  6591 7月  9 01:29 prometheusOperator-prometheusRule.yaml
-rw-r--r-- 1 root root 17256 7月  9 01:29 prometheus-prometheusRule.yaml
# Or query them this way
[root@k8s-master01 prometheus]# kubectl get prometheusrule -n monitoring 
NAME                              AGE
alertmanager-main-rules           5d3h
grafana-rules                     5d3h
kube-prometheus-rules             5d3h
kube-state-metrics-rules          2d20h
kubernetes-monitoring-rules       5d3h
node-exporter-rules               5d3h
prometheus-k8s-prometheus-rules   5d3h
prometheus-operator-rules         5d3h
labels: the labels attached to the alert, used for alert routing

Then add the metric you need to the matching file. I added mine at the end of kubeStateMetrics-prometheusRule.yaml (the choice is not fixed; any file works, though ideally the rule goes in the file whose name matches the metric):

[root@k8s-master01 manifests]# vim kubeStateMetrics-prometheusRule.yaml
    - alert: K8SNodeNotReady
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 1m
      labels:
        severity: critical
        level: "high"
      annotations:
        summary: K8s node status abnormal (instance {{ $labels.instance }})
        description: "Node {{ $labels.node }} is NotReady, please handle it as soon as possible! \n "
        owner: "Ops engineer - 杜杰"
        panelURL: "http://192.168.31.21:32159/alerts?search="
        alertURL: "http://192.168.31.20:30336/d/3138fa155d5915769fbded898ac09fd9/kubernetes-kubelet?orgId=1&refresh=10s&var-datasource=default&var-cluster=&var-instance=All"

[root@k8s-master01 manifests]# kubectl apply -f kubeStateMetrics-prometheusRule.yaml
  • alert: the name of the alerting rule
  • annotations: annotation info for the alert, usually the alert message
  • expr: the alerting expression
  • for: evaluation wait time; the condition must stay true this long before the alert fires

All of the above values can be referenced in templates.

After applying the update, check the logs again; once the configuration has reloaded you can test:

[root@k8s-master01 manifests]# kubectl logs -f -n monitoring  alertmanager-main-0
ts=2024-07-15T02:25:59.706Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2024-07-15T02:25:59.707Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
This log output indicates the configuration was loaded successfully.

Alert test

[root@k8s-master01 ~]# kubectl get nodes 
NAME           STATUS   ROLES           AGE     VERSION
k8s-master01   Ready    control-plane   5d23h   v1.30.2
k8s-master02   Ready    control-plane   5d23h   v1.30.2
k8s-master03   Ready    control-plane   5d23h   v1.30.2
k8s-node01     Ready    <none>          5d23h   v1.30.2
k8s-node02     Ready    <none>          5d23h   v1.30.2
k8s-node03     Ready    <none>          5d23h   v1.30.2
k8s-node04     Ready    <none>          5d23h   v1.30.2
k8s-node05     Ready    <none>          5d23h   v1.30.2
k8s-node06     Ready    <none>          5d23h   v1.30.2

# Stop kubelet on one of the nodes; I stop node06 here, then wait a moment for its status to become NotReady
[root@k8s-node06 ~]# systemctl stop kubelet 

[root@k8s-master01 ~]# kubectl get nodes 
NAME           STATUS     ROLES           AGE     VERSION
k8s-master01   Ready      control-plane   5d23h   v1.30.2
k8s-master02   Ready      control-plane   5d23h   v1.30.2
k8s-master03   Ready      control-plane   5d23h   v1.30.2
k8s-node01     Ready      <none>          5d23h   v1.30.2
k8s-node02     Ready      <none>          5d23h   v1.30.2
k8s-node03     Ready      <none>          5d23h   v1.30.2
k8s-node04     Ready      <none>          5d23h   v1.30.2
k8s-node05     Ready      <none>          5d23h   v1.30.2
k8s-node06     NotReady   <none>          5d23h   v1.30.2

After a short wait, DingTalk receives the alert. The timing depends on your Alertmanager settings and the for duration in the rule; I set for to 1 minute above, meaning the query expression must stay true for a full minute before the alert fires, which prevents brief, transient problems from triggering alerts.

image
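The effect of for can be sketched as an inactive → pending → firing transition: the expression must stay true for the whole window before the alert fires. The timestamps and the 60-second window below are illustrative and mirror the for: 1m set above:

```python
from typing import Optional


def alert_state(true_since: Optional[float], now: float, for_seconds: float) -> str:
    """Return the rule state: inactive (condition false), pending (true but not
    long enough), or firing (true for at least for_seconds)."""
    if true_since is None:
        return "inactive"
    if now - true_since >= for_seconds:
        return "firing"
    return "pending"


# Condition becomes true at t=100; with for=60s the alert fires only from t=160 on.
states = [
    alert_state(None, 90, 60),   # condition not yet true
    alert_state(100, 130, 60),   # true for 30s -> pending
    alert_state(100, 160, 60),   # true for 60s -> firing
    alert_state(100, 200, 60),   # still firing
]
```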

4. Using PrometheusAlert for Feishu alerts

This works like the DingTalk setup, so only the changed parts are shown here.

4.1 Modify the PrometheusAlert configuration

Set fsurl to your own Feishu robot webhook URL.

image

[root@k8s-master01 prometheus]# vim PrometheusAlert-Deployment.yaml 
    #Whether to enable the Feishu alert channel; multiple channels can be enabled at once (0 = off, 1 = on)
    open-feishu=1
    #Default Feishu robot webhook URL
    fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/61xxxxx
    # contentType for webhook HTTP requests, e.g. application/json or application/x-www-form-urlencoded; defaults to application/json
    wh_contenttype=application/json
4.2 Modify the Alertmanager configuration
[root@k8s-master01 manifests]# vim alertmanager-secret.yaml
...
    "receivers":
    - "name": "Default"
      "email_configs":
      - "to": "15811047166@163.com"
        "send_resolved": true
    - "name": "webhook"
      webhook_configs:
      - url: "http://10.96.167.90:8080/prometheusalert?type=fs&tpl=prometheus-fs&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/61132219-e82a-4aae-9186-3664efa9d8c1"
...
    "route":
      "group_by":
      - "alertname"
      "group_interval": "2m"
      "group_wait": "10s"
      "receiver": "Default"
      "repeat_interval": "2m"
      "routes":
      - "matchers":
        - "alertname = Watchdog"
        "receiver": "Watchdog"
      - "matchers":
        - "alertname = InfoInhibitor"
        "receiver": "null"
      - "matchers":
        - "severity = critical"
        "receiver": "webhook"
...

After restarting both components, alerts can be delivered.

5. Domain access latency alerts

Suppose we need to monitor domain access latency and alert when it exceeds 0.01 seconds (0.01 is only for testing; in production, set the threshold to fit your own workload). Create a PrometheusRule like this:

[root@k8s-master01 manifests]# cat blackbox.yaml 
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
 labels:
   app.kubernetes.io/component: exporter
   app.kubernetes.io/name: blackbox-exporter
   prometheus: k8s
   role: alert-rules
 name: blackbox 
 namespace: monitoring
spec:
 groups:
 - name: blackbox-exporter
   rules:
   - alert: DomainAccessDelayExceeds1s
     annotations:
       description: 'Domain {{ $labels.instance }} probe latency exceeds 0.01 seconds; current latency: {{ $value }}'
       summary: Domain probe access latency exceeds 0.01 seconds
     expr: sum(probe_http_duration_seconds{job=~"blackbox"}) by (instance) > 0.01
     for: 1m
     labels:
       severity: warning
       type: blackbox
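What the expression does — sum probe_http_duration_seconds per instance (blackbox_exporter splits the probe into phases such as resolve and connect) and compare the total against the 0.01 s threshold — can be sketched over hypothetical samples:

```python
from collections import defaultdict

# Hypothetical samples: (instance, phase) -> seconds, as blackbox_exporter exports them.
samples = {
    ("https://example.com", "resolve"): 0.002,
    ("https://example.com", "connect"): 0.004,
    ("https://example.com", "processing"): 0.006,
    ("https://fast.local", "resolve"): 0.001,
    ("https://fast.local", "connect"): 0.002,
}

# sum(probe_http_duration_seconds{job=~"blackbox"}) by (instance)
per_instance = defaultdict(float)
for (instance, _phase), value in samples.items():
    per_instance[instance] += value

# ... > 0.01 -> instances that would fire the alert once the `for` window elapses
slow = sorted(i for i, total in per_instance.items() if total > 0.01)
```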

Create and inspect the PrometheusRule:

[root@k8s-master01 manifests]# kubectl create -f blackbox.yaml 
prometheusrule.monitoring.coreos.com/blackbox created
[root@k8s-master01 manifests]# kubectl get -f blackbox.yaml 
NAME       AGE
blackbox   65s

Afterwards, the rule also shows up in the Prometheus web UI:

image

And the DingTalk group configured earlier receives the alert:

image