
K8s Scheduling and Resource Management

Course Objectives

After completing this lesson, you will be able to:

  • Understand how the K8s scheduler works and the scheduling flow
  • Configure resource requests and limits
  • Set up HPA for automatic application scaling
  • Understand QoS classes and their impact
  • Apply node affinity and Pod affinity/anti-affinity
  • Control Pod placement with taints and tolerations

Prerequisites: completion of "K8s Storage and Configuration Management"; familiarity with ConfigMap, Secret, and persistent storage

1. Kubernetes Scheduler Overview

1.1 Scheduling Flow

┌──────────────────────────────────────────────────────────────────────┐
│                      Kubernetes Scheduling Flow                      │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────┐   ┌───────────┐   ┌───────────┐   ┌───────────┐        │
│  │   Pod    │──▶│ Scheduling│──▶│ Filtering │──▶│  Scoring  │        │
│  │ created  │   │   queue   │   │   phase   │   │   phase   │        │
│  └──────────┘   └───────────┘   └───────────┘   └─────┬─────┘        │
│                                                       ▼              │
│                                                ┌───────────┐         │
│                                                │  Select   │         │
│                                                │   node    │         │
│                                                └─────┬─────┘         │
│                                                      ▼               │
│  ┌──────────┐   ┌───────────┐   ┌───────────┐   ┌───────────┐        │
│  │   Pod    │◀──│ Containers│◀──│  kubelet  │◀──│  Bind to  │        │
│  │ running  │   │  created  │   │ starts Pod│   │   node    │        │
│  └──────────┘   └───────────┘   └───────────┘   └───────────┘        │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Scheduling Phases

| Phase                  | Description                               | Examples                                              |
|------------------------|-------------------------------------------|-------------------------------------------------------|
| Filtering (Predicates) | Screens out nodes that cannot run the Pod | Sufficient resources, node healthy, affinity matched  |
| Scoring (Priorities)   | Scores and ranks the candidate nodes      | Resource utilization, Pod spreading, affinity weights |
| Binding                | Binds the Pod to the chosen node          | Updates the Pod's nodeName field                      |
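
The filter-then-score selection above can be sketched in a few lines of Python. The node data, field names, and the single "most free CPU" scoring rule are illustrative assumptions, not the scheduler's actual implementation (the real scheduler runs many filter and score plugins):

```python
# Hypothetical node data: free CPU (millicores) after subtracting the
# requests of Pods already scheduled on each node.
nodes = [
    {"name": "node-1", "free_cpu": 300},
    {"name": "node-2", "free_cpu": 2500},
    {"name": "node-3", "free_cpu": 1000},
]

def schedule(pod_request_cpu, nodes):
    # Filtering phase: keep only nodes with enough free CPU for the request.
    feasible = [n for n in nodes if n["free_cpu"] >= pod_request_cpu]
    if not feasible:
        return None  # no feasible node: the Pod stays Pending
    # Scoring phase: prefer the node with the most headroom after placement
    # (a "least allocated" strategy, one of several real scoring approaches).
    return max(feasible, key=lambda n: n["free_cpu"] - pod_request_cpu)["name"]

print(schedule(500, nodes))   # node-2 (most free CPU)
print(schedule(9000, nodes))  # None (no node fits)
```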

1.2 Inspecting Scheduling Events

bash
# View a Pod's scheduling events
kubectl describe pod <pod-name> | grep -A 10 Events

# View scheduler logs
kubectl logs -n kube-system <scheduler-pod-name>

# View a node's resource allocation
kubectl describe node <node-name> | grep -A 5 "Allocated resources"

2. Resource Management

2.1 Resource Types

Kubernetes manages two primary compute resources:

| Resource | Unit               | Notes                                      |
|----------|--------------------|--------------------------------------------|
| CPU      | millicores (m)     | 1 core = 1000m; fractional amounts allowed |
| Memory   | bytes (Ki, Mi, Gi) | Binary units; 1Gi = 1024Mi                 |
yaml
resources:
  requests:
    cpu: "250m"      # 0.25 cores
    memory: "512Mi"  # 512 MiB
  limits:
    cpu: "1000m"     # 1 core
    memory: "1Gi"    # 1 GiB
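
As a rough illustration of these units, here is a small Python sketch that converts quantity strings. The helper names are hypothetical, and the real Kubernetes quantity format accepts more suffixes (n, u, k, M, G, Ti, ...) than shown here:

```python
def cpu_to_cores(q: str) -> float:
    """'250m' -> 0.25, '2' -> 2.0 (millicores suffix only in this sketch)."""
    return int(q[:-1]) / 1000 if q.endswith("m") else float(q)

def memory_to_bytes(q: str) -> int:
    """'512Mi' -> 536870912 (binary suffixes only in this sketch)."""
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}
    for suffix, factor in units.items():
        if q.endswith(suffix):
            return int(q[:-2]) * factor
    return int(q)  # plain byte count

print(cpu_to_cores("250m"))                                     # 0.25
print(memory_to_bytes("1Gi") == 1024 * memory_to_bytes("1Mi"))  # True
```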

2.2 Requests vs Limits

┌─────────────────────────────────────────────────────────────────┐
│                  Resource Allocation at a Glance                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Pod A: requests=250m, limits=500m                              │
│  ├─ Guaranteed: 250m CPU                                        │
│  ├─ May use: up to 500m CPU (when capacity is free)             │
│  └─ Beyond limits: throttled                                    │
│                                                                 │
│  Pod B: requests=512Mi, limits=1Gi                              │
│  ├─ Guaranteed: 512Mi of memory                                 │
│  ├─ May use: up to 1Gi of memory                                │
│  └─ Beyond limits: OOM killed                                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Key Differences

| Aspect        | Requests                                   | Limits                                              |
|---------------|--------------------------------------------|-----------------------------------------------------|
| Scheduling    | Used to select a node with enough capacity | Not part of the scheduling decision                 |
| Guarantee     | The Pod is guaranteed this amount          | No guarantee; only an upper bound                   |
| Burst usage   | -                                          | The Pod may use more than its requests, up to limits |
| When exceeded | -                                          | CPU: throttled; Memory: OOM killed                  |

2.3 Resource Configuration Example

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: app
        image: nginx:1.25
        resources:
          # Requests: resource demand used at scheduling time
          requests:
            cpu: "100m"      # 0.1 cores
            memory: "128Mi"  # 128 MiB
          # Limits: runtime resource ceiling
          limits:
            cpu: "500m"      # 0.5 cores
            memory: "512Mi"  # 512 MiB

Common Scenarios

yaml
# Scenario 1: CPU-intensive workload (e.g., video processing)
resources:
  requests:
    cpu: "2000m"     # needs plenty of CPU
    memory: "1Gi"
  limits:
    cpu: "4000m"     # allows bursting
    memory: "2Gi"

# Scenario 2: memory-intensive workload (e.g., a Redis cache)
resources:
  requests:
    cpu: "100m"
    memory: "4Gi"    # needs a large amount of memory
  limits:
    cpu: "500m"
    memory: "4Gi"    # strict memory limit (equal to requests) to avoid OOM surprises

2.4 Namespace Resource Quotas

Limit the total resource consumption of a namespace:

yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: production
spec:
  hard:
    # Compute resource limits
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    
    # Object count limits
    pods: "50"
    services: "20"
    persistentvolumeclaims: "20"
    secrets: "20"
    configmaps: "20"
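
The admission check a ResourceQuota performs boils down to: reject a new Pod if namespace usage plus the Pod's requests would exceed any hard limit. A minimal Python sketch, with quantities pre-converted to millicores and bytes and all data hypothetical:

```python
GIB = 1024**3

# Hypothetical namespace state: hard limits and current usage.
hard = {"requests.cpu": 20_000, "requests.memory": 40 * GIB}  # 20 cores, 40Gi
used = {"requests.cpu": 19_800, "requests.memory": 10 * GIB}

def admit(pod_requests, used, hard):
    # Reject if any tracked resource would exceed its hard limit.
    for resource, amount in pod_requests.items():
        if used.get(resource, 0) + amount > hard[resource]:
            return False
    return True

print(admit({"requests.cpu": 100, "requests.memory": 1 * GIB}, used, hard))  # True
print(admit({"requests.cpu": 500, "requests.memory": 1 * GIB}, used, hard))  # False: 20300m > 20000m
```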

2.5 Default Resource Limits (LimitRange)

Set default resource configuration for a namespace:

yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: default
spec:
  limits:
  - default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    type: Container
  - max:
      cpu: "2000m"
      memory: "4Gi"
    min:
      cpu: "50m"
      memory: "64Mi"
    type: Container
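
The defaulting part of LimitRange can be sketched as: fill missing limits from `default` and missing requests from `defaultRequest`. A simplified Python illustration (the real admission plugin also validates against min/max and handles more cases):

```python
def apply_limit_range(container_resources, limit_range):
    """Fill in missing requests/limits for one container from a LimitRange."""
    res = {"requests": dict(container_resources.get("requests", {})),
           "limits": dict(container_resources.get("limits", {}))}
    for r, v in limit_range["default"].items():
        res["limits"].setdefault(r, v)          # default limits
    for r, v in limit_range["defaultRequest"].items():
        res["requests"].setdefault(r, v)        # default requests
    return res

lr = {"default": {"cpu": "500m", "memory": "512Mi"},
      "defaultRequest": {"cpu": "100m", "memory": "128Mi"}}

# A container with no resources section gets the namespace defaults;
# explicitly-set values are left untouched.
print(apply_limit_range({}, lr))
print(apply_limit_range({"limits": {"cpu": "1"}}, lr)["limits"]["cpu"])  # 1
```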

3. QoS Classes

3.1 QoS Classification

Kubernetes assigns each Pod one of three QoS classes based on its resource configuration:

┌─────────────────────────────────────────────────────────────────┐
│                         QoS Classes                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Guaranteed                                                     │
│  ├─ Condition: every container sets requests == limits          │
│  │             (for both CPU and memory)                        │
│  └─ Priority: highest; evicted last                             │
│                                                                 │
│  Burstable                                                      │
│  ├─ Condition: at least one container sets requests, but the    │
│  │             Guaranteed conditions are not met                │
│  └─ Priority: medium; evicted in order of usage above requests  │
│                                                                 │
│  BestEffort                                                     │
│  ├─ Condition: no requests or limits set at all                 │
│  └─ Priority: lowest; evicted first                             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
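
The classification rules above can be expressed as a small function. This is a simplified sketch: the real rules also cover init containers and the defaulting of requests from limits when only limits are set:

```python
def qos_class(containers):
    """Infer a Pod's QoS class from a list of
    {'requests': {...}, 'limits': {...}} container dicts (simplified)."""
    if not any(c.get("requests") or c.get("limits") for c in containers):
        return "BestEffort"

    def is_guaranteed(c):
        req, lim = c.get("requests", {}), c.get("limits", {})
        # Both cpu and memory must be set, with requests equal to limits.
        return ({"cpu", "memory"} <= set(req)
                and all(req.get(r) == lim.get(r) for r in ("cpu", "memory")))

    return "Guaranteed" if all(is_guaranteed(c) for c in containers) else "Burstable"

print(qos_class([{"requests": {"cpu": "500m", "memory": "1Gi"},
                  "limits":   {"cpu": "500m", "memory": "1Gi"}}]))  # Guaranteed
print(qos_class([{"requests": {"cpu": "200m", "memory": "512Mi"},
                  "limits":   {"cpu": "1000m", "memory": "1Gi"}}]))  # Burstable
print(qos_class([{}]))                                               # BestEffort
```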

3.2 QoS Configuration Examples

yaml
# Guaranteed QoS
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "500m"      # must equal requests
        memory: "1Gi"    # must equal requests

---
# Burstable QoS
apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: "200m"
        memory: "512Mi"
      limits:
        cpu: "1000m"     # greater than requests
        memory: "1Gi"    # greater than requests

---
# BestEffort QoS
apiVersion: v1
kind: Pod
metadata:
  name: besteffort-pod
spec:
  containers:
  - name: nginx
    image: nginx
    # no resources section at all

3.3 Viewing Pod QoS

bash
# View a Pod's QoS class
kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'

# View the QoS of all Pods
kubectl get pods --all-namespaces -o custom-columns=\
"NAME:.metadata.name,QOS:.status.qosClass,STATUS:.status.phase"

# View evictions caused by node resource pressure
kubectl get events --field-selector reason=Evicted

4. HPA Autoscaling

4.1 How HPA Works

┌───────────────────────────────────────────────────────────────────────────┐
│                               HPA Workflow                                │
├───────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                    │
│  │  Metrics    │───▶│  HPA        │───▶│ Deployment  │                    │
│  │  Server     │    │  Controller │    │  Scale      │                    │
│  └─────────────┘    └─────────────┘    └─────────────┘                    │
│         │                  │                                              │
│         ▼                  ▼                                              │
│  ┌──────────────────┐   ┌────────────────────────┐                        │
│  │ CPU utilization  │   │ Compute target replica │                        │
│  │ Memory usage     │   │ count; adjust the      │                        │
│  │ Custom metrics   │   │ Deployment's replicas  │                        │
│  └──────────────────┘   └────────────────────────┘                        │
│                                                                           │
│  Scaling formula:                                                         │
│  desiredReplicas = ceil[currentReplicas * (currentMetric / targetMetric)] │
│                                                                           │
└───────────────────────────────────────────────────────────────────────────┘
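
The scaling formula from the diagram, clamped to the min/max replica bounds, looks like this in Python (the real controller also applies a tolerance band and the stabilization windows configured below, which are omitted here):

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas):
    # desired = ceil(current * currentMetric / targetMetric),
    # then clamp into [minReplicas, maxReplicas].
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(4, 90, 70, 2, 20))  # CPU at 90% vs 70% target -> 6
print(desired_replicas(4, 35, 70, 2, 20))  # half the target -> scale down to 2
```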

4.2 HPA Configuration Example

yaml
# Autoscaling on CPU and memory
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2          # minimum replica count
  maxReplicas: 20         # maximum replica count
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # target average CPU utilization: 70%
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80   # target average memory utilization: 80%
  behavior:               # scaling behavior configuration
    scaleUp:
      stabilizationWindowSeconds: 60   # scale-up stabilization window
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15    # at most +100% every 15 seconds
    scaleDown:
      stabilizationWindowSeconds: 300  # scale-down stabilization window (5 minutes)
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60    # at most -10% every 60 seconds

4.3 HPA with Custom Metrics

yaml
# Custom metrics served by, e.g., the Prometheus Adapter
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
  # Custom metric: requests per second
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"   # 1000 RPS per Pod
  # External metric: message queue depth
  - type: External
    external:
      metric:
        name: rabbitmq_queue_messages
        selector:
          matchLabels:
            queue: task-queue
      target:
        type: AverageValue
        averageValue: "100"    # messages in the queue

4.4 HPA Prerequisites

bash
# 1. Install the Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# 2. Verify the Metrics Server
kubectl top nodes
kubectl top pods

# 3. Create an HPA
kubectl autoscale deployment web-app --cpu-percent=70 --min=2 --max=10

# 4. Check HPA status
kubectl get hpa
kubectl describe hpa web-app-hpa

4.5 Load-Test Verification

bash
# Load-test with ab or hey
# Install hey
go install github.com/rakyll/hey@latest

# Run the load test
hey -z 5m -c 100 http://<service-ip>/

# Watch the HPA scale out automatically
watch kubectl get pods,hpa

5. Scheduling Policies

5.1 Node Affinity

Steer Pods onto specific nodes:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-app
spec:
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      affinity:
        nodeAffinity:
          # Hard constraint: must be satisfied
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: accelerator
                operator: In
                values:
                - nvidia-tesla-v100
                - nvidia-tesla-p100
          
          # Soft constraint: preferred when possible
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
          - weight: 50
            preference:
              matchExpressions:
              - key: zone
                operator: In
                values:
                - zone-a
      containers:
      - name: app
        image: gpu-app:1.0
        resources:
          limits:
            nvidia.com/gpu: 1

Operators

| Operator     | Meaning                               |
|--------------|---------------------------------------|
| In           | Value is in the list                  |
| NotIn        | Value is not in the list              |
| Exists       | Key exists (value is not checked)     |
| DoesNotExist | Key does not exist                    |
| Gt           | Value is greater than the given value |
| Lt           | Value is less than the given value    |
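
One way to see how these operators behave is a toy evaluator. This is a simplified sketch of the matching semantics, not the real implementation; note that in label-selector semantics a missing key satisfies NotIn, and Gt/Lt compare integer values:

```python
def match_expression(labels, expr):
    """Evaluate one matchExpression against a node's labels (simplified)."""
    key, op, values = expr["key"], expr["operator"], expr.get("values", [])
    present = key in labels
    if op == "In":
        return present and labels[key] in values
    if op == "NotIn":
        return not present or labels[key] not in values
    if op == "Exists":
        return present
    if op == "DoesNotExist":
        return not present
    if op == "Gt":
        return present and int(labels[key]) > int(values[0])
    if op == "Lt":
        return present and int(labels[key]) < int(values[0])
    raise ValueError(f"unknown operator {op}")

labels = {"disktype": "ssd", "cpu-count": "16"}
print(match_expression(labels, {"key": "disktype", "operator": "In", "values": ["ssd"]}))  # True
print(match_expression(labels, {"key": "cpu-count", "operator": "Gt", "values": ["8"]}))   # True
print(match_expression(labels, {"key": "gpu", "operator": "Exists"}))                      # False
```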

5.2 Pod Affinity and Anti-Affinity

Control where Pods are placed relative to each other:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      affinity:
        # Pod affinity: co-locate with certain Pods on the same node
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - cache
            topologyKey: kubernetes.io/hostname
        
        # Pod anti-affinity: avoid nodes already running certain Pods
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - web-app
              topologyKey: kubernetes.io/hostname
      containers:
      - name: app
        image: web-app:1.0

Topology Domains (topologyKey)

| topologyKey                   | Scope                 |
|-------------------------------|-----------------------|
| kubernetes.io/hostname        | Per node              |
| topology.kubernetes.io/zone   | Per availability zone |
| topology.kubernetes.io/region | Per region            |

5.3 Taints and Tolerations

┌─────────────────────────────────────────────────────────────────┐
│                     Taints and Tolerations                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Node (Taint)                     Pod (Toleration)              │
│  ┌─────────────────────────────┐  ┌────────────────────┐        │
│  │ key=value:NoSchedule        │  │ key: value         │        │
│  │ key=value:NoExecute         │  │ operator: Equal    │        │
│  │ key=value:PreferNoSchedule  │  │ effect: NoSchedule │        │
│  └─────────────────────────────┘  └────────────────────┘        │
│                                                                 │
│  Effects:                                                       │
│  ├─ NoSchedule: do not schedule new Pods                        │
│  ├─ PreferNoSchedule: avoid scheduling if possible              │
│  └─ NoExecute: also evict Pods already running                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
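
The matching rule between a single toleration and a taint can be sketched as follows (simplified; the real matcher also handles tolerationSeconds for NoExecute taints):

```python
def tolerates(toleration, taint):
    """Return True if one toleration tolerates one taint (simplified sketch).
    An empty effect matches all effects; operator 'Exists' with no key
    tolerates every taint."""
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return toleration.get("key") in (None, "", taint["key"])
    return (toleration.get("key") == taint["key"]
            and toleration.get("value") == taint["value"])

taint = {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}
print(tolerates({"key": "dedicated", "operator": "Equal",
                 "value": "gpu", "effect": "NoSchedule"}, taint))   # True
print(tolerates({"operator": "Exists"}, taint))                     # True
print(tolerates({"key": "other", "operator": "Equal",
                 "value": "x"}, taint))                             # False
```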

Adding Taints

bash
# Add a taint to a node
kubectl taint nodes node1 dedicated=gpu:NoSchedule

# View a node's taints
kubectl describe node node1 | grep Taints

# Remove the taint (note the trailing "-")
kubectl taint nodes node1 dedicated=gpu:NoSchedule-

Tolerating Taints in a Pod

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-app
spec:
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      tolerations:
      # Exact match on the taint
      - key: "dedicated"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      
      # Match the key only (any value)
      - key: "dedicated"
        operator: "Exists"
        effect: "NoSchedule"
      
      # Tolerate every taint (not recommended in production)
      - operator: "Exists"
      
      containers:
      - name: app
        image: gpu-app:1.0

Common Use Cases

bash
# Use case 1: dedicated nodes
kubectl taint nodes node-gpu-1 dedicated=gpu:NoSchedule

# Use case 2: maintenance mode
kubectl taint nodes node-1 maintenance=true:NoExecute

# Use case 3: controlling Pod distribution
kubectl taint nodes node-1 zone=a:NoSchedule
kubectl taint nodes node-2 zone=b:NoSchedule

5.4 Node Selector

The simplest way to pick nodes:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ssd-app
spec:
  selector:
    matchLabels:
      app: ssd-app
  template:
    metadata:
      labels:
        app: ssd-app
    spec:
      nodeSelector:
        disktype: ssd
        environment: production
      containers:
      - name: app
        image: myapp:1.0
bash
# Label the nodes
kubectl label nodes node-1 disktype=ssd
kubectl label nodes node-1 environment=production

# View node labels
kubectl get nodes --show-labels

6. Hands-On: Highly Available Application Deployment

6.1 Deploying a Multi-Zone HA Application

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ha-web-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: ha-web
  template:
    metadata:
      labels:
        app: ha-web
    spec:
      affinity:
        # Pod anti-affinity: spread replicas of the app across nodes
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - ha-web
            topologyKey: kubernetes.io/hostname
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - ha-web
              topologyKey: topology.kubernetes.io/zone
      
        # Prefer scheduling onto SSD nodes
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
      
      containers:
      - name: web
        image: nginx:1.25
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"
        ports:
        - containerPort: 80
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ha-web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ha-web-app
  minReplicas: 6
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

6.2 Verifying the HA Setup

bash
# Check the Pod distribution
kubectl get pods -o wide -l app=ha-web

# Check node labels
kubectl get nodes --show-labels

# Simulate a node failure and watch the Pods reschedule
kubectl drain node-1 --ignore-daemonsets
kubectl get pods -o wide -l app=ha-web -w

7. Summary

Core Concepts Recap

Resource management
├── Requests vs Limits
│   ├── Requests: scheduling basis, guaranteed resources
│   └── Limits: runtime ceiling, enforced on overuse

├── QoS classes
│   ├── Guaranteed: requests == limits
│   ├── Burstable: has requests but not Guaranteed
│   └── BestEffort: no resources configured

└── Resource quotas
    ├── ResourceQuota: namespace-level limits
    └── LimitRange: default resource configuration

Autoscaling
└── HPA
    ├── Based on CPU/memory utilization
    ├── Based on custom metrics
    └── Scaling behavior configuration

Scheduling policies
├── Node selection
│   ├── nodeSelector: simple label matching
│   └── nodeAffinity: expressive node selection

├── Pod distribution
│   ├── podAffinity: Pod affinity
│   └── podAntiAffinity: Pod anti-affinity

└── Taints and tolerations
    ├── Taint: node repels Pods
    └── Toleration: Pod tolerates the taint

Best-Practice Checklist

yaml
# ✅ Recommended production configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: production-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: prod-app
  template:
    metadata:
      labels:
        app: prod-app
    spec:
      # 1. Set resource requests and limits
      containers:
      - name: app
        image: myapp:1.0
        resources:
          requests:
            cpu: "100m"
            memory: "256Mi"
          limits:
            cpu: "1000m"
            memory: "1Gi"
      
      # 2. HA scheduling policy
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - prod-app
              topologyKey: kubernetes.io/hostname
      
      # 3. Graceful termination
      terminationGracePeriodSeconds: 60
---
# 4. Configure an HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prod-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: production-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # 5-minute stabilization window

Next Lesson Preview

In the next lesson, "Helm Chart Usage and Management", we will cover:

  • Helm concepts and how it works
  • Installing and managing applications with Helm
  • Using Chart repositories
  • Customizing values
  • Helm release lifecycle management

💡 Study Suggestions

  1. In a test environment, create Pods of each QoS class and observe how they behave under node pressure
  2. Configure an HPA for an application, run a load test, and watch it scale
  3. Use affinity and anti-affinity to deploy an application for high availability
