
K8s Scheduling and Resource Management

Course Objectives

After completing this lesson, you will be able to:

  • Understand how the K8s scheduler works and the scheduling flow
  • Configure resource requests and limits
  • Set up HPA for automatic application scaling
  • Understand QoS classes and their impact
  • Apply node affinity and Pod affinity/anti-affinity
  • Control Pod placement with taints and tolerations

Prerequisites: completion of "K8s Storage and Configuration Management"; familiarity with ConfigMap, Secret, and persistent storage

1. Kubernetes Scheduler Overview

1.1 Scheduling Flow

┌──────────────────────────────────────────────────────────────────────┐
│                      Kubernetes Scheduling Flow                      │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────┐   ┌───────────┐   ┌───────────┐   ┌───────────┐        │
│  │   Pod    │──▶│ Scheduling│──▶│ Filtering │──▶│  Scoring  │        │
│  │ created  │   │   queue   │   │   phase   │   │   phase   │        │
│  └──────────┘   └───────────┘   └───────────┘   └─────┬─────┘        │
│                                                       ▼              │
│                                                ┌───────────┐         │
│                                                │  Select   │         │
│                                                │   node    │         │
│                                                └─────┬─────┘         │
│                                                      ▼               │
│  ┌──────────┐   ┌───────────┐   ┌───────────┐   ┌───────────┐        │
│  │   Pod    │◀──│ Containers│◀──│  kubelet  │◀──│  Bind to  │        │
│  │ running  │   │  created  │   │ starts Pod│   │   node    │        │
│  └──────────┘   └───────────┘   └───────────┘   └───────────┘        │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Scheduling Phases

| Phase                  | Description                               | Examples                                              |
|------------------------|-------------------------------------------|-------------------------------------------------------|
| Filtering (Predicates) | Screens out nodes that cannot run the Pod | Sufficient resources, node healthy, affinity matched  |
| Scoring (Priorities)   | Scores and ranks the candidate nodes      | Resource utilization, Pod spreading, affinity weights |
| Binding                | Binds the Pod to the chosen node          | Updates the Pod's nodeName field                      |
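
The filter-then-score selection above can be sketched in a few lines of Python. The node data, field names, and the single "most free CPU" scoring rule are illustrative assumptions, not the scheduler's actual implementation (the real scheduler runs many filter and score plugins):

```python
# Hypothetical node data: free CPU (millicores) after subtracting the
# requests of Pods already scheduled on each node.
nodes = [
    {"name": "node-1", "free_cpu": 300},
    {"name": "node-2", "free_cpu": 2500},
    {"name": "node-3", "free_cpu": 1000},
]

def schedule(pod_request_cpu, nodes):
    # Filtering phase: keep only nodes with enough free CPU for the request.
    feasible = [n for n in nodes if n["free_cpu"] >= pod_request_cpu]
    if not feasible:
        return None  # no feasible node: the Pod stays Pending
    # Scoring phase: prefer the node with the most headroom after placement
    # (a "least allocated" strategy, one of several real scoring approaches).
    return max(feasible, key=lambda n: n["free_cpu"] - pod_request_cpu)["name"]

print(schedule(500, nodes))   # node-2 (most free CPU)
print(schedule(9000, nodes))  # None (no node fits)
```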

1.2 Inspecting Scheduling Events

bash
# View a Pod's scheduling events
kubectl describe pod <pod-name> | grep -A 10 Events

# View scheduler logs
kubectl logs -n kube-system <scheduler-pod-name>

# View a node's resource allocation
kubectl describe node <node-name> | grep -A 5 "Allocated resources"

2. Resource Management

2.1 Resource Types

Kubernetes manages two primary compute resources:

| Resource | Unit               | Notes                                      |
|----------|--------------------|--------------------------------------------|
| CPU      | millicores (m)     | 1 core = 1000m; fractional amounts allowed |
| Memory   | bytes (Ki, Mi, Gi) | Binary units; 1Gi = 1024Mi                 |
yaml
resources:
  requests:
    cpu: "250m"      # 0.25 cores
    memory: "512Mi"  # 512 MiB
  limits:
    cpu: "1000m"     # 1 core
    memory: "1Gi"    # 1 GiB
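
As a rough illustration of these units, here is a small Python sketch that converts quantity strings. The helper names are hypothetical, and the real Kubernetes quantity format accepts more suffixes (n, u, k, M, G, Ti, ...) than shown here:

```python
def cpu_to_cores(q: str) -> float:
    """'250m' -> 0.25, '2' -> 2.0 (millicores suffix only in this sketch)."""
    return int(q[:-1]) / 1000 if q.endswith("m") else float(q)

def memory_to_bytes(q: str) -> int:
    """'512Mi' -> 536870912 (binary suffixes only in this sketch)."""
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}
    for suffix, factor in units.items():
        if q.endswith(suffix):
            return int(q[:-2]) * factor
    return int(q)  # plain byte count

print(cpu_to_cores("250m"))                                     # 0.25
print(memory_to_bytes("1Gi") == 1024 * memory_to_bytes("1Mi"))  # True
```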

2.2 Requests vs Limits

┌─────────────────────────────────────────────────────────────────┐
│                  Resource Allocation at a Glance                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Pod A: requests=250m, limits=500m                              │
│  ├─ Guaranteed: 250m CPU                                        │
│  ├─ May use: up to 500m CPU (when capacity is free)             │
│  └─ Beyond limits: throttled                                    │
│                                                                 │
│  Pod B: requests=512Mi, limits=1Gi                              │
│  ├─ Guaranteed: 512Mi of memory                                 │
│  ├─ May use: up to 1Gi of memory                                │
│  └─ Beyond limits: OOM killed                                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Key Differences

| Aspect        | Requests                                   | Limits                                              |
|---------------|--------------------------------------------|-----------------------------------------------------|
| Scheduling    | Used to select a node with enough capacity | Not part of the scheduling decision                 |
| Guarantee     | The Pod is guaranteed this amount          | No guarantee; only an upper bound                   |
| Burst usage   | -                                          | The Pod may use more than its requests, up to limits |
| When exceeded | -                                          | CPU: throttled; Memory: OOM killed                  |

2.3 Resource Configuration Example

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: app
        image: nginx:1.25
        resources:
          # Requests: resource demand used at scheduling time
          requests:
            cpu: "100m"      # 0.1 cores
            memory: "128Mi"  # 128 MiB
          # Limits: runtime resource ceiling
          limits:
            cpu: "500m"      # 0.5 cores
            memory: "512Mi"  # 512 MiB

Common Scenarios

yaml
# Scenario 1: CPU-intensive workload (e.g., video processing)
resources:
  requests:
    cpu: "2000m"     # needs plenty of CPU
    memory: "1Gi"
  limits:
    cpu: "4000m"     # allows bursting
    memory: "2Gi"

# Scenario 2: memory-intensive workload (e.g., a Redis cache)
resources:
  requests:
    cpu: "100m"
    memory: "4Gi"    # needs a large amount of memory
  limits:
    cpu: "500m"
    memory: "4Gi"    # strict memory limit (equal to requests) to avoid OOM surprises

2.4 Namespace Resource Quotas

Limit the total resource consumption of a namespace:

yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: production
spec:
  hard:
    # Compute resource limits
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    
    # Object count limits
    pods: "50"
    services: "20"
    persistentvolumeclaims: "20"
    secrets: "20"
    configmaps: "20"
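
The admission check a ResourceQuota performs boils down to: reject a new Pod if namespace usage plus the Pod's requests would exceed any hard limit. A minimal Python sketch, with quantities pre-converted to millicores and bytes and all data hypothetical:

```python
GIB = 1024**3

# Hypothetical namespace state: hard limits and current usage.
hard = {"requests.cpu": 20_000, "requests.memory": 40 * GIB}  # 20 cores, 40Gi
used = {"requests.cpu": 19_800, "requests.memory": 10 * GIB}

def admit(pod_requests, used, hard):
    # Reject if any tracked resource would exceed its hard limit.
    for resource, amount in pod_requests.items():
        if used.get(resource, 0) + amount > hard[resource]:
            return False
    return True

print(admit({"requests.cpu": 100, "requests.memory": 1 * GIB}, used, hard))  # True
print(admit({"requests.cpu": 500, "requests.memory": 1 * GIB}, used, hard))  # False: 20300m > 20000m
```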

2.5 Default Resource Limits (LimitRange)

Set default resource configuration for a namespace:

yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: default
spec:
  limits:
  - default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    type: Container
  - max:
      cpu: "2000m"
      memory: "4Gi"
    min:
      cpu: "50m"
      memory: "64Mi"
    type: Container
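
The defaulting part of LimitRange can be sketched as: fill missing limits from `default` and missing requests from `defaultRequest`. A simplified Python illustration (the real admission plugin also validates against min/max and handles more cases):

```python
def apply_limit_range(container_resources, limit_range):
    """Fill in missing requests/limits for one container from a LimitRange."""
    res = {"requests": dict(container_resources.get("requests", {})),
           "limits": dict(container_resources.get("limits", {}))}
    for r, v in limit_range["default"].items():
        res["limits"].setdefault(r, v)          # default limits
    for r, v in limit_range["defaultRequest"].items():
        res["requests"].setdefault(r, v)        # default requests
    return res

lr = {"default": {"cpu": "500m", "memory": "512Mi"},
      "defaultRequest": {"cpu": "100m", "memory": "128Mi"}}

# A container with no resources section gets the namespace defaults;
# explicitly-set values are left untouched.
print(apply_limit_range({}, lr))
print(apply_limit_range({"limits": {"cpu": "1"}}, lr)["limits"]["cpu"])  # 1
```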

3. QoS Classes

3.1 QoS Classification

Kubernetes assigns each Pod one of three QoS classes based on its resource configuration:

┌─────────────────────────────────────────────────────────────────┐
│                         QoS Classes                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Guaranteed                                                     │
│  ├─ Condition: every container sets requests == limits          │
│  │             (for both CPU and memory)                        │
│  └─ Priority: highest; evicted last                             │
│                                                                 │
│  Burstable                                                      │
│  ├─ Condition: at least one container sets requests, but the    │
│  │             Guaranteed conditions are not met                │
│  └─ Priority: medium; evicted in order of usage above requests  │
│                                                                 │
│  BestEffort                                                     │
│  ├─ Condition: no requests or limits set at all                 │
│  └─ Priority: lowest; evicted first                             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
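
The classification rules above can be expressed as a small function. This is a simplified sketch: the real rules also cover init containers and the defaulting of requests from limits when only limits are set:

```python
def qos_class(containers):
    """Infer a Pod's QoS class from a list of
    {'requests': {...}, 'limits': {...}} container dicts (simplified)."""
    if not any(c.get("requests") or c.get("limits") for c in containers):
        return "BestEffort"

    def is_guaranteed(c):
        req, lim = c.get("requests", {}), c.get("limits", {})
        # Both cpu and memory must be set, with requests equal to limits.
        return ({"cpu", "memory"} <= set(req)
                and all(req.get(r) == lim.get(r) for r in ("cpu", "memory")))

    return "Guaranteed" if all(is_guaranteed(c) for c in containers) else "Burstable"

print(qos_class([{"requests": {"cpu": "500m", "memory": "1Gi"},
                  "limits":   {"cpu": "500m", "memory": "1Gi"}}]))  # Guaranteed
print(qos_class([{"requests": {"cpu": "200m", "memory": "512Mi"},
                  "limits":   {"cpu": "1000m", "memory": "1Gi"}}]))  # Burstable
print(qos_class([{}]))                                               # BestEffort
```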

3.2 QoS Configuration Examples

yaml
# Guaranteed QoS
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "500m"      # must equal requests
        memory: "1Gi"    # must equal requests

---
# Burstable QoS
apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: "200m"
        memory: "512Mi"
      limits:
        cpu: "1000m"     # greater than requests
        memory: "1Gi"    # greater than requests

---
# BestEffort QoS
apiVersion: v1
kind: Pod
metadata:
  name: besteffort-pod
spec:
  containers:
  - name: nginx
    image: nginx
    # no resources section at all

3.3 Viewing Pod QoS

bash
# View a Pod's QoS class
kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'

# View the QoS of all Pods
kubectl get pods --all-namespaces -o custom-columns=\
"NAME:.metadata.name,QOS:.status.qosClass,STATUS:.status.phase"

# View evictions caused by node resource pressure
kubectl get events --field-selector reason=Evicted

4. HPA Autoscaling

4.1 How HPA Works

┌───────────────────────────────────────────────────────────────────────────┐
│                               HPA Workflow                                │
├───────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                    │
│  │  Metrics    │───▶│  HPA        │───▶│ Deployment  │                    │
│  │  Server     │    │  Controller │    │  Scale      │                    │
│  └─────────────┘    └─────────────┘    └─────────────┘                    │
│         │                  │                                              │
│         ▼                  ▼                                              │
│  ┌──────────────────┐   ┌────────────────────────┐                        │
│  │ CPU utilization  │   │ Compute target replica │                        │
│  │ Memory usage     │   │ count; adjust the      │                        │
│  │ Custom metrics   │   │ Deployment's replicas  │                        │
│  └──────────────────┘   └────────────────────────┘                        │
│                                                                           │
│  Scaling formula:                                                         │
│  desiredReplicas = ceil[currentReplicas * (currentMetric / targetMetric)] │
│                                                                           │
└───────────────────────────────────────────────────────────────────────────┘
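
The scaling formula from the diagram, clamped to the min/max replica bounds, looks like this in Python (the real controller also applies a tolerance band and the stabilization windows configured below, which are omitted here):

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas):
    # desired = ceil(current * currentMetric / targetMetric),
    # then clamp into [minReplicas, maxReplicas].
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(4, 90, 70, 2, 20))  # CPU at 90% vs 70% target -> 6
print(desired_replicas(4, 35, 70, 2, 20))  # half the target -> scale down to 2
```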

4.2 HPA Configuration Example

yaml
# Autoscaling on CPU and memory
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2          # minimum replica count
  maxReplicas: 20         # maximum replica count
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # target average CPU utilization: 70%
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80   # target average memory utilization: 80%
  behavior:               # scaling behavior configuration
    scaleUp:
      stabilizationWindowSeconds: 60   # scale-up stabilization window
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15    # at most +100% every 15 seconds
    scaleDown:
      stabilizationWindowSeconds: 300  # scale-down stabilization window (5 minutes)
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60    # at most -10% every 60 seconds

4.3 HPA with Custom Metrics

yaml
# Custom metrics served by, e.g., the Prometheus Adapter
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
  # Custom metric: requests per second
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"   # 1000 RPS per Pod
  # External metric: message queue depth
  - type: External
    external:
      metric:
        name: rabbitmq_queue_messages
        selector:
          matchLabels:
            queue: task-queue
      target:
        type: AverageValue
        averageValue: "100"    # messages in the queue

4.4 HPA Prerequisites

bash
# 1. Install the Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# 2. Verify the Metrics Server
kubectl top nodes
kubectl top pods

# 3. Create an HPA
kubectl autoscale deployment web-app --cpu-percent=70 --min=2 --max=10

# 4. Check HPA status
kubectl get hpa
kubectl describe hpa web-app-hpa

4.5 Load-Test Verification

bash
# Load-test with ab or hey
# Install hey
go install github.com/rakyll/hey@latest

# Run the load test
hey -z 5m -c 100 http://<service-ip>/

# Watch the HPA scale out automatically
watch kubectl get pods,hpa

5. Scheduling Policies

5.1 Node Affinity

Steer Pods onto specific nodes:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-app
spec:
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      affinity:
        nodeAffinity:
          # Hard constraint: must be satisfied
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: accelerator
                operator: In
                values:
                - nvidia-tesla-v100
                - nvidia-tesla-p100
          
          # Soft constraint: preferred when possible
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
          - weight: 50
            preference:
              matchExpressions:
              - key: zone
                operator: In
                values:
                - zone-a
      containers:
      - name: app
        image: gpu-app:1.0
        resources:
          limits:
            nvidia.com/gpu: 1

Operators

| Operator     | Meaning                               |
|--------------|---------------------------------------|
| In           | Value is in the list                  |
| NotIn        | Value is not in the list              |
| Exists       | Key exists (value is not checked)     |
| DoesNotExist | Key does not exist                    |
| Gt           | Value is greater than the given value |
| Lt           | Value is less than the given value    |
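
One way to see how these operators behave is a toy evaluator. This is a simplified sketch of the matching semantics, not the real implementation; note that in label-selector semantics a missing key satisfies NotIn, and Gt/Lt compare integer values:

```python
def match_expression(labels, expr):
    """Evaluate one matchExpression against a node's labels (simplified)."""
    key, op, values = expr["key"], expr["operator"], expr.get("values", [])
    present = key in labels
    if op == "In":
        return present and labels[key] in values
    if op == "NotIn":
        return not present or labels[key] not in values
    if op == "Exists":
        return present
    if op == "DoesNotExist":
        return not present
    if op == "Gt":
        return present and int(labels[key]) > int(values[0])
    if op == "Lt":
        return present and int(labels[key]) < int(values[0])
    raise ValueError(f"unknown operator {op}")

labels = {"disktype": "ssd", "cpu-count": "16"}
print(match_expression(labels, {"key": "disktype", "operator": "In", "values": ["ssd"]}))  # True
print(match_expression(labels, {"key": "cpu-count", "operator": "Gt", "values": ["8"]}))   # True
print(match_expression(labels, {"key": "gpu", "operator": "Exists"}))                      # False
```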

5.2 Pod Affinity and Anti-Affinity

Control where Pods are placed relative to each other:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      affinity:
        # Pod affinity: co-locate with certain Pods on the same node
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - cache
            topologyKey: kubernetes.io/hostname
        
        # Pod anti-affinity: avoid nodes already running certain Pods
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - web-app
              topologyKey: kubernetes.io/hostname
      containers:
      - name: app
        image: web-app:1.0

Topology Domains (topologyKey)

| topologyKey                   | Scope                 |
|-------------------------------|-----------------------|
| kubernetes.io/hostname        | Per node              |
| topology.kubernetes.io/zone   | Per availability zone |
| topology.kubernetes.io/region | Per region            |

5.3 Taints and Tolerations

┌─────────────────────────────────────────────────────────────────┐
│                     Taints and Tolerations                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Node (Taint)                     Pod (Toleration)              │
│  ┌─────────────────────────────┐  ┌────────────────────┐        │
│  │ key=value:NoSchedule        │  │ key: value         │        │
│  │ key=value:NoExecute         │  │ operator: Equal    │        │
│  │ key=value:PreferNoSchedule  │  │ effect: NoSchedule │        │
│  └─────────────────────────────┘  └────────────────────┘        │
│                                                                 │
│  Effects:                                                       │
│  ├─ NoSchedule: do not schedule new Pods                        │
│  ├─ PreferNoSchedule: avoid scheduling if possible              │
│  └─ NoExecute: also evict Pods already running                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
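
The matching rule between a single toleration and a taint can be sketched as follows (simplified; the real matcher also handles tolerationSeconds for NoExecute taints):

```python
def tolerates(toleration, taint):
    """Return True if one toleration tolerates one taint (simplified sketch).
    An empty effect matches all effects; operator 'Exists' with no key
    tolerates every taint."""
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return toleration.get("key") in (None, "", taint["key"])
    return (toleration.get("key") == taint["key"]
            and toleration.get("value") == taint["value"])

taint = {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}
print(tolerates({"key": "dedicated", "operator": "Equal",
                 "value": "gpu", "effect": "NoSchedule"}, taint))   # True
print(tolerates({"operator": "Exists"}, taint))                     # True
print(tolerates({"key": "other", "operator": "Equal",
                 "value": "x"}, taint))                             # False
```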

Adding Taints

bash
# Add a taint to a node
kubectl taint nodes node1 dedicated=gpu:NoSchedule

# View a node's taints
kubectl describe node node1 | grep Taints

# Remove the taint (note the trailing "-")
kubectl taint nodes node1 dedicated=gpu:NoSchedule-

Tolerating Taints in a Pod

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-app
spec:
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      tolerations:
      # Exact match on the taint
      - key: "dedicated"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      
      # Match the key only (any value)
      - key: "dedicated"
        operator: "Exists"
        effect: "NoSchedule"
      
      # Tolerate every taint (not recommended in production)
      - operator: "Exists"
      
      containers:
      - name: app
        image: gpu-app:1.0

Common Use Cases

bash
# Use case 1: dedicated nodes
kubectl taint nodes node-gpu-1 dedicated=gpu:NoSchedule

# Use case 2: maintenance mode
kubectl taint nodes node-1 maintenance=true:NoExecute

# Use case 3: controlling Pod distribution
kubectl taint nodes node-1 zone=a:NoSchedule
kubectl taint nodes node-2 zone=b:NoSchedule

5.4 Node Selector

The simplest way to pick nodes:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ssd-app
spec:
  selector:
    matchLabels:
      app: ssd-app
  template:
    metadata:
      labels:
        app: ssd-app
    spec:
      nodeSelector:
        disktype: ssd
        environment: production
      containers:
      - name: app
        image: myapp:1.0
bash
# Label the nodes
kubectl label nodes node-1 disktype=ssd
kubectl label nodes node-1 environment=production

# View node labels
kubectl get nodes --show-labels

6. Hands-On: Highly Available Application Deployment

6.1 Deploying a Multi-Zone HA Application

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ha-web-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: ha-web
  template:
    metadata:
      labels:
        app: ha-web
    spec:
      affinity:
        # Pod anti-affinity: spread replicas of the app across nodes
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - ha-web
            topologyKey: kubernetes.io/hostname
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - ha-web
              topologyKey: topology.kubernetes.io/zone
      
        # Prefer scheduling onto SSD nodes
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
      
      containers:
      - name: web
        image: nginx:1.25
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"
        ports:
        - containerPort: 80
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ha-web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ha-web-app
  minReplicas: 6
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

6.2 Verifying the HA Setup

bash
# Check the Pod distribution
kubectl get pods -o wide -l app=ha-web

# Check node labels
kubectl get nodes --show-labels

# Simulate a node failure and watch the Pods reschedule
kubectl drain node-1 --ignore-daemonsets
kubectl get pods -o wide -l app=ha-web -w

7. Summary

Core Concepts Recap

Resource management
├── Requests vs Limits
│   ├── Requests: scheduling basis, guaranteed resources
│   └── Limits: runtime ceiling, enforced on overuse

├── QoS classes
│   ├── Guaranteed: requests == limits
│   ├── Burstable: has requests but not Guaranteed
│   └── BestEffort: no resources configured

└── Resource quotas
    ├── ResourceQuota: namespace-level limits
    └── LimitRange: default resource configuration

Autoscaling
└── HPA
    ├── Based on CPU/memory utilization
    ├── Based on custom metrics
    └── Scaling behavior configuration

Scheduling policies
├── Node selection
│   ├── nodeSelector: simple label matching
│   └── nodeAffinity: expressive node selection

├── Pod distribution
│   ├── podAffinity: Pod affinity
│   └── podAntiAffinity: Pod anti-affinity

└── Taints and tolerations
    ├── Taint: node repels Pods
    └── Toleration: Pod tolerates the taint

Best-Practice Checklist

yaml
# ✅ Recommended production configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: production-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: prod-app
  template:
    metadata:
      labels:
        app: prod-app
    spec:
      # 1. Set resource requests and limits
      containers:
      - name: app
        image: myapp:1.0
        resources:
          requests:
            cpu: "100m"
            memory: "256Mi"
          limits:
            cpu: "1000m"
            memory: "1Gi"
      
      # 2. HA scheduling policy
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - prod-app
              topologyKey: kubernetes.io/hostname
      
      # 3. Graceful termination
      terminationGracePeriodSeconds: 60
---
# 4. Configure an HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prod-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: production-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # 5-minute stabilization window

Next Lesson Preview

In the next lesson, "Helm Chart Usage and Management", we will cover:

  • Helm concepts and how it works
  • Installing and managing applications with Helm
  • Using Chart repositories
  • Customizing values
  • Helm release lifecycle management

💡 Study Suggestions

  1. In a test environment, create Pods of each QoS class and observe how they behave under node pressure
  2. Configure an HPA for an application, run a load test, and watch it scale
  3. Use affinity and anti-affinity to deploy an application for high availability
