Node Affinity and Pod Affinity

Taints and tolerations cover simple scheduling needs, but some problems cannot be handled with them alone.

Relationships between Pods and nodes:

Some Pods should preferentially be scheduled to nodes labeled ssd=true, and only be placed on other nodes if none are available;

Some Pods need to be deployed on nodes labeled ssd=true and type=physical, with the ssd=true nodes being preferred;

Relationships between Pods:

Replicas of the same application, or applications belonging to the same project, should preferably (or must) not be placed on the same node, on the same class of labeled nodes, or in the same topology domain;

Two Pods that depend on each other should preferably (or must) be deployed on the same node or within the same topology domain.

Affinity types:

NodeAffinity: node affinity / anti-affinity

PodAffinity: Pod affinity

PodAntiAffinity: Pod anti-affinity

image

Node affinity configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      # affinity configuration
      affinity:
        nodeAffinity:  # node affinity
          requiredDuringSchedulingIgnoredDuringExecution:   # hard (required) node affinity
            nodeSelectorTerms:      # node selector terms; multiple terms may be listed
            - matchExpressions:
              - key: region        # label key
                operator: In       # match operator
                values:            # label values
                - "chaoyang"
          preferredDuringSchedulingIgnoredDuringExecution:  # soft (preferred) node affinity
          - weight: 1
            preference:
              matchExpressions:
              - key: another-node-label-key
                operator: In
                values:
                - another-node-label-value
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/dyclouds/nginx:1.15.12
        name: nginx
  • requiredDuringSchedulingIgnoredDuringExecution: hard (required) affinity
    • nodeSelectorTerms: node selector terms; multiple matchExpressions entries can be listed (satisfying any one of them is enough), each matchExpressions can contain multiple key/value selectors (all of which must be satisfied), and values can list multiple values (matching any one of them is enough)
  • preferredDuringSchedulingIgnoredDuringExecution: soft (preferred) affinity
    • weight: weight of the soft-affinity rule; the higher the weight, the higher the priority; range 1-100
    • preference: the soft-affinity term itself, paired with its weight; multiple matchExpressions can be configured
  • operator (a short sketch of the less common operators follows this list)
    • In: the node label value equals one of the listed values (key=value)
    • NotIn: the node label value is none of the listed values (key!=value)
    • Exists: the node merely needs to have a label with this key; the values field must not be set
    • DoesNotExist: the node must not have a label with this key; the values field must not be set
    • Gt: the node label value is greater than the specified value (integer comparison)
    • Lt: the node label value is less than the specified value (integer comparison)
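A minimal sketch of the less common operators (a fragment that would slot into spec.template.spec.affinity of a Deployment like the one above; the gpu and cpu-cores label keys are purely illustrative, and Gt/Lt compare the label value as an integer):

        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: gpu            # illustrative key: the node only needs to carry this label
                operator: Exists    # Exists / DoesNotExist must not set values
              - key: cpu-cores      # illustrative key: its value is compared as an integer
                operator: Gt
                values:
                - "16"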

Case 1

Two nodes in the k8s cluster are located in Chaoyang district. Deploy an application onto these two Chaoyang nodes, with at most one Pod replica per node.

1. First, label the two Chaoyang nodes

[root@k8s-master01 affinity]# kubectl label node k8s-node03 k8s-node06 region=chaoyang
[root@k8s-master01 affinity]# kubectl get nodes --show-labels  |grep region=chaoyang
k8s-node03     Ready    <none>          4d2h   v1.28.2   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node03,kubernetes.io/os=linux,region=chaoyang,system=Ansion
k8s-node06     Ready    <none>          19h    v1.28.2   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,environment=test,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node06,kubernetes.io/os=linux,region=chaoyang

2. Create the Deployment

[root@k8s-master01 affinity]# cat deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      # affinity configuration
      affinity:
        nodeAffinity:  # node affinity
          requiredDuringSchedulingIgnoredDuringExecution:   # hard (required) node affinity
            nodeSelectorTerms:      # node selector terms; multiple terms may be listed
            - matchExpressions:
              - key: region        # label key
                operator: In       # match operator
                values:            # label values
                - "chaoyang"
        podAntiAffinity:  # pod anti-affinity
          requiredDuringSchedulingIgnoredDuringExecution:   # hard (required) pod anti-affinity
          - labelSelector:   # selects Pods by label
              matchExpressions:
              - key: app    # pod label key
                operator: In   # match operator
                values:
                - nginx  # pod label value
            topologyKey: kubernetes.io/hostname
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/dyclouds/nginx:1.15.12
        name: nginx

The Pods are now scheduled onto node3 and node6. If the replica count is raised above 2, the extra Pods stay in Pending, because the pod anti-affinity rule forbids Pods with the same label from sharing a topology domain, and with topologyKey: kubernetes.io/hostname every node (each with its own hostname) is a separate domain.
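A quick way to observe this (the Pod name in the last command is a placeholder; output varies by cluster):

kubectl scale deployment test --replicas=3
kubectl get pods -l app=nginx -o wide     # the extra replica stays Pending
kubectl describe pod <pending-pod-name>   # Events show the anti-affinity rule could not be satisfied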

image

Case 2

Different replicas of the same application must run on the same node. For example, the application needs GPU servers, so all replicas must be deployed onto the node that has a GPU.

1. Label the GPU server

[root@k8s-master01 affinity]# kubectl label node k8s-node07 gpu=true

2. Create the Deployment with pod affinity configured

[root@k8s-master01 affinity]# cat deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: test
spec:
  replicas: 6
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: gpu
                operator: In
                values:
                - "true"
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx
            topologyKey: kubernetes.io/hostname
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/dyclouds/nginx:1.15.12
        name: nginx

All replicas now run on node07.
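This can be checked with:

kubectl get pods -l app=nginx -o wide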

image

Case 3

Schedule onto high-spec servers where possible. In this example the high-spec servers are assumed to be the nodes labeled region=haidian, so a soft (preferred) node affinity with weight 100 is used:

[root@k8s-master01 affinity]# cat deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: test
spec:
  replicas: 6
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: region
                operator: In
                values:
                - haidian
       # podAffinity:
       #   requiredDuringSchedulingIgnoredDuringExecution:
       #   - labelSelector:
       #       matchExpressions:
       #       - key: app
       #         operator: In
       #         values:
       #         - nginx
       #     topologyKey: kubernetes.io/hostname
       # podAntiAffinity:
       #   preferredDuringSchedulingIgnoredDuringExecution:
       #   - weight: 100
       #     podAffinityTerm:
       #       labelSelector:
       #         matchExpressions:
       #         - key: security
       #           operator: In
       #           values:
       #           - S2
       #       namespaces:
       #         - default
       #       topologyKey: failure-domain.beta.kubernetes.io/zone
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/dyclouds/nginx:1.15.12
        name: nginx
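This assumes the high-spec nodes already carry the region=haidian label, for example (the node names are the two haidian nodes visible in the node listing in Case 4):

kubectl label node k8s-node02 k8s-node04 region=haidian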

Case 4

Deploy the same application across multiple regions. There are three regions, Haidian, Chaoyang, and Daxing (whose nodes are labeled environment=prod here), each with its own nodes. The requirement is to run the application on every node in these regions, and no two replicas may land on the same node.

Labeling the nodes in the corresponding regions works the same way as before and is omitted; the output below shows the five nodes across the three regions:

[root@k8s-master01 affinity]# kubectl get nodes --show-labels |egrep "region=haidian|region=chaoyang|environment=prod"
k8s-node01     Ready    <none>          4d4h   v1.28.2   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,environment=prod,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node01,kubernetes.io/os=linux,region=daxing,ssd=true
k8s-node02     Ready    <none>          4d4h   v1.28.2   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,gpu=false,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node02,kubernetes.io/os=linux,region=haidian,ssd=true
k8s-node03     Ready    <none>          4d4h   v1.28.2   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node03,kubernetes.io/os=linux,region=chaoyang,system=Ansion
k8s-node04     Ready    <none>          22h    v1.28.2   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,environment=prod,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node04,kubernetes.io/os=linux,region=haidian,system=Ansion
k8s-node06     Ready    <none>          21h    v1.28.2   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,environment=test,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node06,kubernetes.io/os=linux,region=chaoyang

Create the Deployment with the affinity configuration:

[root@k8s-master01 affinity]# cat deployment.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: test
spec:
  replicas: 6
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:  # hard node affinity: must be deployed in one of the three regions
          requiredDuringSchedulingIgnoredDuringExecution:
             nodeSelectorTerms:
             - matchExpressions:
                - key: region
                  operator: In
                  values: 
                  - chaoyang
                  - haidian
             - matchExpressions:
                - key: environment
                  operator: In
                  values:
                  - prod
        podAntiAffinity: # hard pod anti-affinity: each replica must land on a different node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx
            topologyKey: kubernetes.io/hostname
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/dyclouds/nginx:1.15.12
        name: nginx

image

As shown above, five replicas were scheduled onto the five nodes in the target regions. Because the replica count is 6, greater than the five eligible nodes, the hard pod anti-affinity leaves the remaining replica in Pending.

Topology domains (topologyKey) in detail

topologyKey: the topology domain key. It effectively partitions the hosts into domains based on a node label: nodes that carry the same value for the chosen label key belong to the same topology domain, while a different key or a different value means a different domain.
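A minimal sketch (the zone key and its values are illustrative labels applied to nodes from this cluster):

kubectl label node k8s-node01 k8s-node02 zone=az-1   # node01 and node02 form one topology domain
kubectl label node k8s-node03 k8s-node04 zone=az-2   # node03 and node04 form another domain

With topologyKey: zone, a required pod anti-affinity spreads replicas across az-1 and az-2; with topologyKey: kubernetes.io/hostname, every node is its own domain.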

image

image

image

2.1 The same application must be deployed on different hosts

root@VM-26-130-ubuntu:/usr/local/src/k8s/affinity# cat test1.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
 name: diff-nodes
 labels:
   app: diff-nodes
spec:
 selector:
   matchLabels:
     app: diff-nodes
 replicas: 2
 template:
   metadata:
     labels:
       app: diff-nodes
   spec:
     affinity:
       podAntiAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
         - labelSelector:
             matchExpressions:
             - key: app
               operator: In
               values:
               - diff-nodes
           topologyKey: kubernetes.io/hostname
           namespaces: []
     containers:
     - name: diff-nodes
       image: registry.cn-beijing.aliyuncs.com/dotbalo/nginx:1.15.12
       imagePullPolicy: IfNotPresent

2.2 The same application should preferably be deployed on different hosts

root@VM-26-130-ubuntu:/usr/local/src/k8s/affinity# cat test2.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
 name: diff-nodes
 labels:
   app: diff-nodes
spec:
 selector:
   matchLabels:
     app: diff-nodes
 replicas: 2
 template:
   metadata:
     labels:
       app: diff-nodes
   spec:
     affinity:
       podAntiAffinity:
         preferredDuringSchedulingIgnoredDuringExecution:
         - podAffinityTerm:
             labelSelector:
               matchExpressions:
               - key: app
                 operator: In
                 values:
                 - diff-nodes
             topologyKey: kubernetes.io/hostname
             namespaces: []
           weight: 100
     containers:
     - name: diff-nodes
       image: registry.cn-beijing.aliyuncs.com/dotbalo/nginx:1.15.12
       imagePullPolicy: IfNotPresent

2.3 Spread the same application across different availability zones

If the cluster spans multiple availability zones, the application's replicas can be spread across them to improve availability:

root@VM-26-130-ubuntu:/usr/local/src/k8s/affinity# cat test3.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
 name: diff-zone
 labels:
   app: diff-zone
spec:
 selector:
   matchLabels:
     app: diff-zone
 replicas: 2
 template:
   metadata:
     labels:
       app: diff-zone
   spec:
     affinity:
       podAntiAffinity:
         preferredDuringSchedulingIgnoredDuringExecution:
         - podAffinityTerm:
             labelSelector:
               matchExpressions:
               - key: app
                 operator: In
                 values:
                 - diff-zone
             topologyKey: zone
             namespaces: []
           weight: 100
     containers:
     - name: diff-zone
       image: registry.cn-beijing.aliyuncs.com/dotbalo/nginx:1.15.12
       imagePullPolicy: IfNotPresent
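This relies on every node carrying the label used as the topologyKey (here zone). To check the result after the nodes have been labeled:

kubectl get nodes -L zone                  # -L adds a ZONE column so the domains are visible
kubectl get pods -l app=diff-zone -o wide  # replicas should land in different zones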

2.4 Deploy the application in the same availability zone as the cache service where possible

If the cluster spans multiple availability zones, an application can be placed close to the infrastructure components it depends on, such as a cache, to improve performance.

First, a cache service needs to exist. A minimal sketch of such a Deployment follows (the name, image, and replica count are illustrative; only the app=cache label matters, since that is what the affinity term below matches):
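apiVersion: apps/v1
kind: Deployment
metadata:
 name: cache
 labels:
   app: cache
spec:
 selector:
   matchLabels:
     app: cache
 replicas: 1
 template:
   metadata:
     labels:
       app: cache
   spec:
     containers:
     - name: cache
       image: registry.cn-beijing.aliyuncs.com/dotbalo/nginx:1.15.12   # stand-in image for a real cache
       imagePullPolicy: IfNotPresent

Then deploy the application with a preferred (soft) pod affinity toward the cache Pods. Note that the manifest below uses topologyKey: kubernetes.io/hostname, which prefers co-locating with the cache on the same node; to prefer the same availability zone instead, use a zone label (for example topology.kubernetes.io/zone) as the topologyKey: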

root@VM-26-130-ubuntu:/usr/local/src/k8s/affinity# cat test4.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
 name: my-app
 labels:
   app: my-app
spec:
 selector:
   matchLabels:
     app: my-app
 replicas: 2
 template:
   metadata:
     labels:
       app: my-app
   spec:
     affinity:
       podAffinity:
         preferredDuringSchedulingIgnoredDuringExecution:
         - podAffinityTerm:
             labelSelector:
               matchExpressions:
               - key: app
                 operator: In
                 values:
                 - cache
             topologyKey: kubernetes.io/hostname
           weight: 100
     containers:
     - name: my-app
       image: registry.cn-beijing.aliyuncs.com/dotbalo/nginx:1.15.12
       imagePullPolicy: IfNotPresent

2.5 Compute services must be deployed on high-performance machines

Suppose a group of machines in the cluster are high-performance machines, and some compute-intensive services need to be deployed onto them to improve compute performance. Node affinity can be used to make the Pods prefer, or require, those nodes.

For example, a compute service that may only run on nodes with ssd or nvme disks:

root@VM-26-130-ubuntu:/usr/local/src/k8s/affinity# cat test5.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
 name: compute
 labels:
   app: compute
spec:
 selector:
   matchLabels:
     app: compute
 replicas: 2
 template:
   metadata:
     labels:
       app: compute
   spec:
     affinity:
       nodeAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
           nodeSelectorTerms:
           - matchExpressions:
             - key: disktype
               operator: In
               values:
               - ssd
               - nvme
     containers:
     - name: compute
       image: registry.cn-beijing.aliyuncs.com/dotbalo/nginx:1.15.12
       imagePullPolicy: IfNotPresent
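The matching nodes are assumed to have been labeled with disktype beforehand, for example (the node name is a placeholder):

kubectl label node <high-performance-node> disktype=ssd   # <high-performance-node> is a placeholder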

2.6 Keep the application off low-performance machines where possible

If some machines in the cluster are known to perform poorly, or are undesirable for other reasons, a service can be steered away from them simply by changing the operator to NotIn:

root@VM-26-130-ubuntu:/usr/local/src/k8s/affinity# cat test6.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
 name: compute-intensive
 labels:
   app: compute-intensive
spec:
 selector:
   matchLabels:
     app: compute-intensive
 replicas: 2
 template:
   metadata:
     labels:
       app: compute-intensive
   spec:
     affinity:
       nodeAffinity:
         preferredDuringSchedulingIgnoredDuringExecution:
         - weight: 100
           preference:
             matchExpressions:
             - key: disktype
               operator: NotIn
               values:
               - low
     containers:
     - name: compute-intensive
       image: registry.cn-beijing.aliyuncs.com/dotbalo/nginx:1.15.12
       imagePullPolicy: IfNotPresent

2.7 Spread the application evenly across different data centers

Kubernetes topologySpreadConstraints are an advanced scheduling mechanism that ensures the replicas of a workload are distributed evenly across different topology domains in the cluster (such as nodes, availability zones, or regions).

root@VM-26-130-ubuntu:/usr/local/src/k8s/affinity# cat test7.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
 name: example-deployment
 labels:
   app: example
spec:
 selector:
   matchLabels:
     app: example
 replicas: 2
 template:
   metadata:
     labels:
       app: example
   spec:
     topologySpreadConstraints:
     - maxSkew: 1
       whenUnsatisfiable: DoNotSchedule
       topologyKey: kubernetes.io/hostname
       labelSelector:
         matchLabels:
           app: example
     containers:
     - name: example
       image: registry.cn-beijing.aliyuncs.com/dotbalo/nginx:1.15.12
       imagePullPolicy: IfNotPresent

  • topologySpreadConstraints: topology spread constraint configuration; it spreads the replicas evenly across different domains. Multiple constraints can be configured, and all of them must be satisfied (see the sketch after this list)
  • maxSkew: the maximum allowed skew. For example, with maxSkew set to 1, the replica counts in any two topology domains may differ by at most 1
  • whenUnsatisfiable: what to do when the constraint cannot be satisfied
    • DoNotSchedule: do not schedule the new Pod until the constraint can be satisfied
    • ScheduleAnyway: schedule the new Pod even if the constraint is not satisfied
  • topologyKey: the label key that defines the topology domains
  • labelSelector: the label selector for the Pods the constraint counts, usually set to the labels of the current Pod
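A hedged sketch of configuring two constraints at once, as mentioned above (a hard spread across zones plus a soft spread across nodes within each zone; the well-known topology.kubernetes.io/zone label is assumed to be present on the nodes). The fragment replaces the topologySpreadConstraints section of the manifest above:

      topologySpreadConstraints:
      - maxSkew: 1
        whenUnsatisfiable: DoNotSchedule           # hard: zone replica counts may differ by at most 1
        topologyKey: topology.kubernetes.io/zone
        labelSelector:
          matchLabels:
            app: example
      - maxSkew: 1
        whenUnsatisfiable: ScheduleAnyway          # soft: also prefer an even spread across nodes
        topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app: example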