Taints and Tolerations

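Design idea: a taint is applied to a class of nodes, and a Pod that is not configured to tolerate the taint cannot be scheduled onto those nodes. A toleration lets a Pod tolerate the taints configured on a node, so that Pods needing special configuration can be scheduled onto nodes that carry the taint and provide that configuration.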

https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/

Create a taint (a node can have multiple taints)

kubectl taint node k8s-node01 ssd=true:NoSchedule

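There are three taint effects:

NoExecute: new Pods are not scheduled onto the node, and Pods already running there that do not tolerate the taint are evicted, either immediately or after the tolerationSeconds set in their toleration
NoSchedule: new Pods are not scheduled onto the node; existing Pods are not evicted
PreferNoSchedule: the scheduler tries to avoid the node, but may still place a Pod there if no better node is available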
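The taint strings passed to kubectl in these notes follow the shape key=value:effect, or key:effect when there is no value. As a minimal sketch of that shape, the hypothetical helper below (parse_taint is an illustrative name, not part of kubectl or any client library) splits such a string and rejects a misspelt effect name:

```python
from typing import NamedTuple

# The three effect names Kubernetes accepts; anything else is treated as an error here.
VALID_EFFECTS = {"NoSchedule", "PreferNoSchedule", "NoExecute"}


class Taint(NamedTuple):
    key: str
    value: str
    effect: str


def parse_taint(spec: str) -> Taint:
    """Parse 'key=value:effect' or 'key:effect' (hypothetical helper for illustration)."""
    key_value, sep, effect = spec.rpartition(":")
    if not sep or effect not in VALID_EFFECTS:
        raise ValueError(f"invalid taint spec: {spec!r}")
    key, _, value = key_value.partition("=")
    if not key:
        raise ValueError(f"missing key in taint spec: {spec!r}")
    return Taint(key, value, effect)


print(parse_taint("ssd=true:NoSchedule"))
print(parse_taint("node-role.kubernetes.io/control-plane:NoSchedule"))
```

A spec with a typo in the effect, such as ssd=true:PerferNoSchedule, raises ValueError in this sketch.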

View the taints on a node:

root@VM-26-198-ubuntu:~# kubectl describe nodes 10.122.26.198  |grep -A 10 Taints
Taints:             node-role.kubernetes.io/master:NoSchedule
Unschedulable:      false

Remove a taint

# Remove by key
kubectl taint node k8s-node01 ssd-
# Remove by key and effect
kubectl taint node k8s-node01 ssd:PreferNoSchedule-
# Remove using the full key=value:effect form
kubectl taint node k8s-node01 ssd=true:PreferNoSchedule-
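To make the three removal forms concrete, here is a small model of how a removal spec selects taints. It assumes that removal matches on the key and, when given, the effect, and that any value in the spec is ignored; that is my reading of kubectl's behaviour rather than something these notes confirm. remove_taints and the tuple layout (key, value, effect) are hypothetical.

```python
def remove_taints(taints, spec):
    """Return the taints left after applying a removal spec ending in '-'.

    taints: list of (key, value, effect) tuples.
    spec:   'key-', 'key:effect-' or 'key=value:effect-'.
    Assumption: only key and effect are compared; a value in the spec is ignored.
    """
    if not spec.endswith("-"):
        raise ValueError("a removal spec must end with '-'")
    body = spec[:-1]
    key_value, sep, effect = body.rpartition(":")
    if not sep:
        # No ':' present: key-only form, which removes the key for every effect.
        key_value, effect = body, None
    key = key_value.partition("=")[0]
    return [
        t for t in taints
        if not (t[0] == key and (effect is None or t[2] == effect))
    ]


node_taints = [
    ("ssd", "true", "NoSchedule"),
    ("ssd", "true", "NoExecute"),
    ("gpu", "true", "NoSchedule"),
]
print(remove_taints(node_taints, "ssd-"))
print(remove_taints(node_taints, "ssd:NoExecute-"))
```

The key-only form is the broadest: in this model it drops both ssd taints, while the key-and-effect form drops only one.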

Modify a taint (same key and effect):

root@VM-26-130-ubuntu:~# kubectl taint  node 10.122.26.130  ssd=false:PreferNoSchedule --overwrite 
node/10.122.26.130 modified

Tolerations:

1. Exact match

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: test
spec:
  replicas: 6
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      tolerations:
      - key: "gpu"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      - key: "fuck"
        operator: "Equal"
        value: "true"
        effect: "NoExecute"
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/dyclouds/nginx:1.15.12
        name: nginx

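Each toleration above matches only a taint with the same key, value, and effect: gpu=true with effect NoSchedule, and fuck=true with effect NoExecute. The Pod can be scheduled onto a tainted node only if every taint on that node is matched by one of its tolerations.

2. Partial match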

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: test
spec:
  replicas: 6
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      tolerations:
      - key: "gpu"
        operator: "Exists"
        effect: "NoSchedule"
      - key: "fuck"
        operator: "Exists"
        effect: "NoExecute"
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/dyclouds/nginx:1.15.12
        name: nginx

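A partial match uses the Exists operator: the toleration matches any taint with the given key and effect, whatever its value. This is useful when one key carries many values, for example ssd=ssd1 and ssd=ssd2.

3. Broad match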

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: test
spec:
  replicas: 6
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      tolerations:
      - key: "gpu"
        operator: "Exists"
      - key: "fuck"
        operator: "Exists"
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/dyclouds/nginx:1.15.12
        name: nginx

With effect omitted, the toleration matches every taint that has the given key, regardless of effect.
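The three matching modes above can be summarized in one predicate. The sketch below follows the matching rules as documented by Kubernetes for the Equal and Exists operators (Equal is the default, an empty effect matches every effect, and an empty key with Exists matches every key). The class and function names are hypothetical, and this is a simplified model, not the scheduler's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # Kubernetes defaults to Equal when operator is omitted
    value: str = ""
    effect: str = ""         # an empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.operator == "Exists":
        # Exists ignores the value; an empty key matches every taint key.
        return tol.key == "" or tol.key == taint.key
    # Equal: key and value must both match.
    return tol.key == taint.key and tol.value == taint.value


gpu = Taint("gpu", "true", "NoSchedule")
print(tolerates(Toleration("gpu", "Equal", "true", "NoSchedule"), gpu))   # exact match
print(tolerates(Toleration("gpu", "Exists", effect="NoSchedule"), gpu))   # partial match
print(tolerates(Toleration("gpu", "Exists"), gpu))                        # broad match
```

All three calls print True for this taint. The modes differ in which other taints they would also accept.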

Setting the eviction delay for NoExecute

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600
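How tolerationSeconds affects a Pod when a NoExecute taint appears on its node can be sketched as a small function. eviction_delay is a hypothetical helper covering the single-taint case only; it encodes the documented field semantics (unset means tolerate forever, zero or negative values mean evict immediately).

```python
from typing import Optional


def eviction_delay(tolerated: bool, toleration_seconds: Optional[int]) -> Optional[int]:
    """Seconds until eviction after a NoExecute taint is added; None means never evicted."""
    if not tolerated:
        return 0                       # no matching toleration: evicted right away
    if toleration_seconds is None:
        return None                    # tolerated with no time limit: the Pod stays
    return max(toleration_seconds, 0)  # zero and negative values are treated as 0


print(eviction_delay(False, None))
print(eviction_delay(True, None))
print(eviction_delay(True, 3600))
```

With the toleration shown above, a Pod would stay bound to the node for 3600 seconds after the taint key1=value1:NoExecute is added, and then be evicted.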

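Taint and toleration use case

One node has SSD-only storage, and only Pods that need high-performance storage should be scheduled onto it.

1. First, taint and label the node

# This evicts Pods already on the node that do not tolerate the taint
kubectl taint nodes k8s-node01 ssd=true:NoExecute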
kubectl taint nodes k8s-node01 ssd=true:NoSchedule
kubectl label node k8s-node01 ssd=true

2. Then create a Deployment with a matching node selector and toleration

[root@k8s-master01 yaml]# vim deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: test
spec:
  replicas: 6
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      nodeSelector:
        ssd: "true"
      tolerations:
      # No effect and no tolerationSeconds: this tolerates both ssd=true taints
      # added above (NoSchedule and NoExecute), with no delayed eviction
      - key: "ssd"
        operator: "Equal"
        value: "true"
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/dyclouds/nginx:1.15.12
        name: nginx

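Built-in taints

node.kubernetes.io/not-ready: the node is not ready; corresponds to the node condition Ready being False.
node.kubernetes.io/unreachable: the node controller cannot reach the node; corresponds to the node condition Ready being Unknown.
node.kubernetes.io/out-of-disk: the node has run out of disk space (a legacy taint that is no longer listed in the current Kubernetes documentation).
node.kubernetes.io/memory-pressure: the node is under memory pressure.
node.kubernetes.io/disk-pressure: the node is under disk pressure.
node.kubernetes.io/network-unavailable: the node's network is unavailable.
node.kubernetes.io/unschedulable: the node is unschedulable.
node.cloudprovider.kubernetes.io/uninitialized: when the kubelet starts with an external cloud provider, it adds this taint to mark the node as unusable. After a controller in cloud-controller-manager initializes the node, the kubelet removes the taint.

Wait 6000 seconds before evicting from an unreachable node (the default is 300 seconds):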

tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 6000
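The 300-second default comes from tolerations that Kubernetes adds to Pods that do not declare their own for the not-ready and unreachable taints (the DefaultTolerationSeconds admission plugin). A simplified model of that defaulting is sketched below; with_default_tolerations is a hypothetical name, and the real plugin's matching is somewhat more detailed.

```python
DEFAULT_SECONDS = 300
DEFAULT_KEYS = (
    "node.kubernetes.io/not-ready",
    "node.kubernetes.io/unreachable",
)


def with_default_tolerations(tolerations):
    """Return tolerations plus 300 s NoExecute defaults for keys the Pod does not cover."""
    result = list(tolerations)
    for key in DEFAULT_KEYS:
        covered = any(
            t.get("key") == key and t.get("effect") in ("NoExecute", "", None)
            for t in result
        )
        if not covered:
            result.append({
                "key": key,
                "operator": "Exists",
                "effect": "NoExecute",
                "tolerationSeconds": DEFAULT_SECONDS,
            })
    return result


custom = [{
    "key": "node.kubernetes.io/unreachable",
    "operator": "Exists",
    "effect": "NoExecute",
    "tolerationSeconds": 6000,
}]
for toleration in with_default_tolerations(custom):
    print(toleration["key"], toleration["tolerationSeconds"])
```

In this model, the Pod's own 6000-second toleration for unreachable is kept, and only the missing not-ready toleration is filled in with the default.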

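1. Prevent scheduling on control-plane nodes

In production, control-plane nodes should run only system components, and deploying other services on them is not recommended. A taint can be added to prevent scheduling: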

root@VM-26-130-ubuntu:~# kubectl taint node 10.122.26.130 10.122.26.132 10.122.26.138  node-role.kubernetes.io/control-plane:NoSchedule
node/10.122.26.130 tainted
node/10.122.26.132 tainted
node/10.122.26.138 tainted

A NoExecute taint can be added as well; Pods that do not tolerate it are then evicted and recreated elsewhere:

root@VM-26-130-ubuntu:~# kubectl taint node k8s-master01 node-role.kubernetes.io/control-plane:NoExecute

The following command lists the Pods that are being evicted and recreated:

root@VM-26-130-ubuntu:~# kubectl get po -A -owide | grep k8s-master01 | grep -v Running

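2. Prevent scheduling on new nodes

When new nodes join a cluster, Pods are usually not scheduled onto them right away; the nodes should first pass a full availability test. A taint can temporarily block scheduling in the meantime: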

root@VM-26-130-ubuntu:~# kubectl taint node  10.122.26.143  new-node=true:NoSchedule
node/10.122.26.143 tainted

Likewise, if Pods were already deployed on the node before scheduling was disabled, they can be evicted:

root@VM-26-130-ubuntu:~# kubectl taint node  10.122.26.143  new-node=true:NoExecute
node/10.122.26.143 tainted

Once the new node has been tested, remove the taint to allow scheduling again:

root@VM-26-130-ubuntu:~# kubectl taint node  10.122.26.143  new-node-
node/10.122.26.143 untainted

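3. Node maintenance workflow

When a node must be taken offline for maintenance, its workloads first need to be evicted and rescheduled.

Whether to evict directly or to redeploy first depends on the situation. If a Pod has only one replica, or a service is critical, do not evict it directly; disable scheduling on the node first, then redeploy the service.

Disable scheduling on the node under maintenance: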

root@VM-26-130-ubuntu:~# kubectl taint node 10.122.26.143 maintain:NoSchedule
node/10.122.26.143 tainted

Re-trigger the deployment of a service:

root@VM-26-130-ubuntu:~# kubectl get po -A -owide | grep 10.122.26.143
basic-component-dev redis-6b74d4b679-trmwh 1/1 Running 21 (84m ago) 60d 172.16.58.228 k8s-node02 <none> <none>
root@VM-26-130-ubuntu:~# kubectl rollout restart deploy redis -n basic-component-dev
deployment.apps/redis restarted

Check the service again:

root@VM-26-130-ubuntu:~# kubectl get po -n basic-component-dev -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
redis-6c8cf46df6-crj54 1/1 Running 0 47s 172.16.85.224 k8s-node01 <none> <none>

Once no important services remain on the node, evict its Pods:

# kubectl taint node 10.122.26.143 maintain:NoExecute

After eviction, maintain the node as planned. When maintenance is finished, remove the taint to restore scheduling:

# kubectl taint node 10.122.26.143 maintain-
node/k8s-node02 untainted

Besides custom taints, kubectl has shortcut commands that put a node into maintenance mode:

root@VM-26-130-ubuntu:~# kubectl cordon  10.122.26.143 
node/10.122.26.143 cordoned

The node is now marked SchedulingDisabled, but Pods already running on it are not affected:

root@VM-26-130-ubuntu:~# kubectl get nodes 
NAME            STATUS                     ROLES    AGE   VERSION
10.122.26.130   Ready                      master   9d    v1.22.5-tke.19
10.122.26.132   Ready                      master   9d    v1.22.5-tke.19
10.122.26.134   Ready                      <none>   9d    v1.22.5-tke.19
10.122.26.136   Ready                      <none>   9d    v1.22.5-tke.19
10.122.26.138   Ready                      master   9d    v1.22.5-tke.19
10.122.26.140   Ready                      <none>   9d    v1.22.5-tke.19
10.122.26.141   Ready                      <none>   9d    v1.22.5-tke.19
10.122.26.142   Ready                      <none>   9d    v1.22.5-tke.19
10.122.26.143   Ready,SchedulingDisabled   <none>   9d    v1.22.5-tke.19

Evict the workloads on 10.122.26.143:

root@VM-26-130-ubuntu:~# kubectl drain  10.122.26.143 --ignore-daemonsets --delete-emptydir-data 
node/10.122.26.143 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/csi-cbs-node-hjzjq, kube-system/csi-coslauncher-84tw5, kube-system/csi-cosplugin-59hdx, kube-system/csi-nodeplugin-cfsplugin-qp5wl, kube-system/ip-masq-agent-wtl5q, kube-system/kube-proxy-6z9mg, kube-system/networkpolicy-26bg9, kube-system/p2p-agent-ntgsz, kube-system/tke-bridge-agent-9qqgx, kube-system/tke-cni-agent-zkdbf
evicting pod kube-system/csi-cbs-controller-58c9656974-fnsh8
evicting pod kube-system/cluster-monitor-7c8c4c5b4-k8shq
evicting pod kube-system/coredns-5c5578f87d-2rb4h
evicting pod default/example-deployment-6c5fbb76f9-mjwlg
pod/example-deployment-6c5fbb76f9-mjwlg evicted
pod/cluster-monitor-7c8c4c5b4-k8shq evicted
pod/csi-cbs-controller-58c9656974-fnsh8 evicted

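4. Reserve nodes with special resources

When a cluster has nodes with special resources, Pods that do not need those resources should be kept off them as far as possible. Taints can enforce this.

For example, nodes with GPUs must not accept arbitrary Pods: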

# kubectl taint node k8s-node02 gpu=true:NoSchedule
node/k8s-node02 tainted

For nodes with other special resources, prefer not to schedule onto them:

# kubectl taint node k8s-node02 ssd=true:PreferNoSchedule
node/k8s-node02 tainted

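5. Use tolerations to schedule onto tainted nodes

In production, nodes are often tainted as circumstances require: nodes with special resources and control-plane nodes must not accept arbitrary Pods. Services that need the special resources still have to run on those nodes, and some monitoring and log-collection services still have to run on control-plane nodes. Such services need suitable tolerations before they can be deployed there.

For example, the GPU taint added above: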

root@VM-26-130-ubuntu:~# kubectl taint  node 10.122.26.143  gpu=true:NoSchedule
node/10.122.26.143 tainted

A service that needs GPU resources must carry a toleration to be deployed on that node. Add a toleration such as the following to the Pod:

tolerations:
- key: "gpu"
  operator: "Exists"
  effect: "NoSchedule"

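Full configuration. Note: a toleration only allows the Pod to be scheduled onto nodes with the taint; it does not pin the Pod to those nodes. The nodeSelector below does the pinning, and it assumes the node is also labeled gpu=true.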

root@VM-26-130-ubuntu:/usr/local/src/k8s/taint# cat toleraions.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
 name: gpu-example
spec:
 replicas: 1
 selector:
   matchLabels:
     app: gpu-example
 template:
   metadata:
     labels:
       app: gpu-example
   spec:
     nodeSelector:
       gpu: "true"
     tolerations:
     - key: "gpu"
       operator: "Exists"
       effect: "NoSchedule"
     containers:
     - name: gpu-example
       image: registry.cn-beijing.aliyuncs.com/dotbalo/nginx:1.15.12
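The difference between "allowed on the tainted node" and "pinned to the tainted node" can be illustrated with a small feasibility filter. This is a simplified model under stated assumptions: the node names and the gpu=true label are made up for the example, only NoSchedule and NoExecute are treated as hard constraints, and feasible and tolerates are hypothetical helpers rather than scheduler APIs.

```python
def tolerates(tol: dict, taint: dict) -> bool:
    """Simplified taint/toleration match (Equal and Exists operators only)."""
    if tol.get("effect") and tol["effect"] != taint["effect"]:
        return False
    if tol.get("key") and tol["key"] != taint["key"]:
        return False
    if tol.get("operator", "Equal") == "Exists":
        return True
    return tol.get("value", "") == taint.get("value", "")


def feasible(node: dict, node_selector: dict, tolerations: list) -> bool:
    """True if the node passes the label selector and all of its hard taints are tolerated."""
    labels = node.get("labels", {})
    if any(labels.get(k) != v for k, v in node_selector.items()):
        return False
    hard = [t for t in node.get("taints", [])
            if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


# Hypothetical cluster: one tainted and labeled GPU node, one plain node.
nodes = {
    "gpu-node": {
        "labels": {"gpu": "true"},
        "taints": [{"key": "gpu", "value": "true", "effect": "NoSchedule"}],
    },
    "plain-node": {"labels": {}, "taints": []},
}
tolerations = [{"key": "gpu", "operator": "Exists", "effect": "NoSchedule"}]

# Toleration only: both nodes remain candidates.
print([name for name, spec in nodes.items() if feasible(spec, {}, tolerations)])
# Toleration plus nodeSelector: only the labeled GPU node remains.
print([name for name, spec in nodes.items() if feasible(spec, {"gpu": "true"}, tolerations)])
```

This is why the Deployment above combines both fields: the toleration opens the tainted node to the Pod, and the nodeSelector rules out every other node.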

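6. Isolate dedicated nodes

A cluster commonly has dedicated nodes, for example for ingress, gateways, storage, or multi-tenant environments. Sharing these nodes with other services is usually not recommended, so taints and tolerations are used to isolate them.

For example, select a group of nodes as ingress entry points: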

# kubectl label node k8s-node02 ingress=true
# Add a taint so that other services are not deployed on this node
# kubectl taint node k8s-node02 ingress=true:NoSchedule
node/k8s-node02 tainted

Update the Ingress controller's workload to add the toleration and node selector:

# kubectl edit ds -n ingress-nginx
 nodeSelector:
   ingress: "true"
   kubernetes.io/os: linux
 
 tolerations:
 - effect: NoSchedule
   key: ingress
   operator: Exists

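7. Fast service recovery after node failure

When a node fails, Kubernetes automatically recovers the services that ran on it, but by default it waits five minutes before rescheduling them. Setting tolerationSeconds on the relevant tolerations shortens this delay: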

apiVersion: apps/v1
kind: Deployment
metadata:
 labels:
   app: tolerations-second
 name: tolerations-second
spec:
 replicas: 1
 selector:
   matchLabels:
     app: tolerations-second
 strategy: {}
 template:
   metadata:
     labels:
       app: tolerations-second
   spec:
     containers:
     - image: registry.cn-beijing.aliyuncs.com/dotbalo/nginx:1.15.12
       name: nginx
     tolerations:
     - effect: NoExecute
       key: node.kubernetes.io/unreachable
       operator: Exists
       tolerationSeconds: 10
     - effect: NoExecute
       key: node.kubernetes.io/not-ready
       operator: Exists
       tolerationSeconds: 10