Topic
K8s Scheduling and Resource Management
Course Objectives
After completing this course, you will be able to:
- Understand how the K8s scheduler works and the phases of its scheduling flow
- Configure resource requests and limits
- Configure an HPA for automatic application scaling
- Understand the QoS classes and their impact
- Apply node affinity and Pod affinity/anti-affinity
- Control Pod placement with taints and tolerations
Prerequisites: completion of the "K8s Storage and Configuration Management" course; familiarity with ConfigMap, Secret, and persistent storage
1. Overview of the K8s Scheduler
1.1 The Scheduling Flow
┌──────────────────────────────────────────────────────────────────────────┐
│                        Kubernetes scheduling flow                        │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌────────────┐    ┌────────────┐    ┌────────────┐    ┌────────────┐    │
│  │ Pod created│───▶│Sched. queue│───▶│ Filtering  │───▶│  Scoring   │    │
│  └────────────┘    └────────────┘    └────────────┘    └─────┬──────┘    │
│                                                              │           │
│                                                              ▼           │
│                                                       ┌────────────┐     │
│                                                       │Select node │     │
│                                                       └─────┬──────┘     │
│                                                             │            │
│                                                             ▼            │
│  ┌────────────┐    ┌────────────┐    ┌────────────┐    ┌────────────┐    │
│  │Pod running │◀───│ Create Pod │◀───│  kubelet   │◀───│ Bind node  │    │
│  └────────────┘    └────────────┘    └────────────┘    └────────────┘    │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘
Scheduling phases:
| Phase | Description | Examples |
|---|---|---|
| Filtering (Predicates) | screen out nodes that cannot run the Pod | enough free resources, node healthy, affinity matches |
| Scoring (Priorities) | rank the remaining candidate nodes | resource utilization, Pod spreading, affinity weights |
| Binding | bind the Pod to the chosen node | sets the Pod's nodeName field |
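The filter-then-score loop above can be sketched as a toy model. This is an illustration only, not kube-scheduler's actual code: the node data and the least-allocated scoring rule are simplified assumptions.

```python
# Toy model of the filter -> score -> select flow (not the real kube-scheduler).
# Capacities/allocations are in millicores and MiB; node names are made up.

nodes = {
    "node-1": {"cpu": 2000, "mem": 4096, "cpu_alloc": 1500, "mem_alloc": 1024},
    "node-2": {"cpu": 4000, "mem": 8192, "cpu_alloc": 1000, "mem_alloc": 2048},
}

def filter_nodes(pod_req, nodes):
    """Predicates: keep nodes with enough unallocated CPU and memory."""
    return [
        name for name, n in nodes.items()
        if n["cpu"] - n["cpu_alloc"] >= pod_req["cpu"]
        and n["mem"] - n["mem_alloc"] >= pod_req["mem"]
    ]

def score(name, pod_req, nodes):
    """Priorities: prefer the least-allocated node (higher = more headroom)."""
    n = nodes[name]
    cpu_free = (n["cpu"] - n["cpu_alloc"] - pod_req["cpu"]) / n["cpu"]
    mem_free = (n["mem"] - n["mem_alloc"] - pod_req["mem"]) / n["mem"]
    return (cpu_free + mem_free) / 2

def schedule(pod_req, nodes):
    candidates = filter_nodes(pod_req, nodes)
    if not candidates:
        return None  # no node fits: the Pod stays Pending
    return max(candidates, key=lambda n: score(n, pod_req, nodes))

print(schedule({"cpu": 500, "mem": 512}, nodes))  # -> node-2 (more headroom)
```

A real scheduler runs many predicates and priorities (affinity, taints, topology spread); this sketch keeps only the resource-fit predicate and one scoring rule to show the shape of the loop.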
1.2 Inspecting Scheduling Events
bash
# View a Pod's scheduling events
kubectl describe pod <pod-name> | grep -A 10 Events
# View the scheduler logs
kubectl logs -n kube-system <scheduler-pod-name>
# View a node's resource allocation
kubectl describe node <node-name> | grep -A 5 "Allocated resources"
2. Resource Management
2.1 Resource Types
Kubernetes manages two main resource types:
| Resource | Units | Notes |
|---|---|---|
| CPU | millicores (m) | 1 core = 1000m; fractional values allowed |
| Memory | bytes (Ki, Mi, Gi) | binary units, 1Gi = 1024Mi |
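A minimal parser for these quantity strings, sketched in Python. This is for illustration only; the authoritative parsing rules live in Kubernetes' `resource.Quantity` type, which supports more forms (decimal suffixes like `M`/`G`, exponents) than this sketch covers.

```python
# Illustrative parser for the two unit systems in the table above.

def parse_cpu(q: str) -> int:
    """Return CPU in millicores: '250m' -> 250, '1' -> 1000, '0.5' -> 500."""
    if q.endswith("m"):
        return int(q[:-1])
    return int(float(q) * 1000)

def parse_memory(q: str) -> int:
    """Return bytes for the binary suffixes: '512Mi' -> 512 * 1024**2."""
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}
    for suffix, factor in units.items():
        if q.endswith(suffix):
            return int(q[: -len(suffix)]) * factor
    return int(q)  # plain integer means bytes

print(parse_cpu("250m"), parse_memory("1Gi") // parse_memory("1Mi"))  # 250 1024
```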
yaml
resources:
  requests:
    cpu: "250m"       # 0.25 core
    memory: "512Mi"   # 512 MiB
  limits:
    cpu: "1000m"      # 1 core
    memory: "1Gi"     # 1 GiB
2.2 Requests vs. Limits
┌─────────────────────────────────────────────────────────────────┐
│                      Resource allocation                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Pod A: requests=250m, limits=500m                              │
│  ├─ Guaranteed:    250m CPU                                     │
│  ├─ May use:       up to 500m CPU (when spare capacity exists)  │
│  └─ Above limits:  CPU is throttled                             │
│                                                                 │
│  Pod B: requests=512Mi, limits=1Gi                              │
│  ├─ Guaranteed:    512 MiB of memory                            │
│  ├─ May use:       up to 1 GiB of memory                        │
│  └─ Above limits:  the container is OOM-killed                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Key differences:
| Aspect | Requests | Limits |
|---|---|---|
| Role in scheduling | used to pick a node with enough free resources | not considered by the scheduler |
| Guarantee | the Pod is guaranteed this amount | no guarantee, only an upper bound |
| Bursting | a floor, not a cap | usage may exceed requests, up to limits |
| When exceeded | - | CPU: throttled; memory: OOM kill |
2.3 Resource Configuration Examples
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: app
        image: nginx:1.25
        resources:
          # requests: resource demand used at scheduling time
          requests:
            cpu: "100m"       # 0.1 core
            memory: "128Mi"   # 128 MiB
          # limits: runtime upper bound
          limits:
            cpu: "500m"       # 0.5 core
            memory: "512Mi"   # 512 MiB
Common scenarios:
yaml
# Scenario 1: CPU-bound workload (e.g. video processing)
resources:
  requests:
    cpu: "2000m"      # needs plenty of CPU
    memory: "1Gi"
  limits:
    cpu: "4000m"      # allow bursts
    memory: "2Gi"
# Scenario 2: memory-bound workload (e.g. a Redis cache)
resources:
  requests:
    cpu: "100m"
    memory: "4Gi"     # needs a lot of memory
  limits:
    cpu: "500m"
    memory: "4Gi"     # strict memory cap to avoid surprise OOM kills
# Scenario 3: lightweight microservice
resources:
  requests:
    cpu: "50m"
    memory: "64Mi"
  limits:
    cpu: "200m"
    memory: "256Mi"
2.4 Namespace Resource Quotas
Cap the total resource usage of a namespace:
yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: production
spec:
  hard:
    # compute resources
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    # object counts
    pods: "50"
    services: "20"
    persistentvolumeclaims: "20"
    secrets: "20"
    configmaps: "20"
2.5 Default Resource Limits (LimitRange)
Set default resource configuration for a namespace:
yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: default
spec:
  limits:
  - default:            # default limits for containers that set none
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:     # default requests for containers that set none
      cpu: "100m"
      memory: "128Mi"
    type: Container
  - max:
      cpu: "2000m"
      memory: "4Gi"
    min:
      cpu: "50m"
      memory: "64Mi"
    type: Container
3. QoS Classes
3.1 QoS Classification
Kubernetes assigns each Pod one of three QoS classes based on its resource configuration:
┌─────────────────────────────────────────────────────────────────┐
│                          QoS classes                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Guaranteed                                                     │
│  ├─ Condition: every container sets requests=limits for both    │
│  │             CPU and memory                                   │
│  └─ Priority:  highest; evicted last                            │
│                                                                 │
│  Burstable                                                      │
│  ├─ Condition: at least one container sets requests, but the    │
│  │             Guaranteed condition is not met                  │
│  └─ Priority:  medium; evicted in order of usage above requests │
│                                                                 │
│  BestEffort                                                     │
│  ├─ Condition: no requests or limits set at all                 │
│  └─ Priority:  lowest; evicted first                            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
3.2 QoS Configuration Examples
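Before looking at the YAML, the classification rules from 3.1 can be sketched as a small function. This is a simplification: it ignores API-server defaulting (a container that sets only limits gets matching requests filled in, which can still yield Guaranteed), and it only considers CPU and memory.

```python
# Simplified QoS classification: each container is a dict with optional
# 'requests' and 'limits' sub-dicts keyed by resource name.

def qos_class(containers):
    all_guaranteed = True
    any_requests_or_limits = False
    for c in containers:
        req, lim = c.get("requests", {}), c.get("limits", {})
        if req or lim:
            any_requests_or_limits = True
        for res in ("cpu", "memory"):
            # Guaranteed needs requests == limits, set for both resources
            if res not in req or res not in lim or req[res] != lim[res]:
                all_guaranteed = False
    if all_guaranteed:
        return "Guaranteed"
    if any_requests_or_limits:
        return "Burstable"
    return "BestEffort"

print(qos_class([{"requests": {"cpu": "500m", "memory": "1Gi"},
                  "limits":   {"cpu": "500m", "memory": "1Gi"}}]))  # Guaranteed
```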
yaml
# Guaranteed QoS
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "500m"       # must equal requests
        memory: "1Gi"     # must equal requests
---
# Burstable QoS
apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: "200m"
        memory: "512Mi"
      limits:
        cpu: "1000m"      # greater than requests
        memory: "1Gi"     # greater than requests
---
# BestEffort QoS
apiVersion: v1
kind: Pod
metadata:
  name: besteffort-pod
spec:
  containers:
  - name: nginx
    image: nginx
    # no resources section at all
3.3 Checking a Pod's QoS
bash
# Show a Pod's QoS class
kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'
# Show the QoS class of every Pod
kubectl get pods --all-namespaces -o custom-columns=\
"NAME:.metadata.name,QOS:.status.qosClass,STATUS:.status.phase"
# Show evictions caused by node resource pressure
kubectl get events --field-selector reason=Evicted
4. HPA Autoscaling
4.1 How the HPA Works
┌────────────────────────────────────────────────────────────────────────────┐
│                              HPA control loop                              │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                            │
│   ┌─────────────┐      ┌─────────────┐      ┌─────────────┐                │
│   │   Metrics   │─────▶│     HPA     │─────▶│ Deployment  │                │
│   │   Server    │      │ Controller  │      │   Scale     │                │
│   └─────────────┘      └─────────────┘      └─────────────┘                │
│          │                    │                                            │
│          ▼                    ▼                                            │
│   ┌─────────────┐      ┌──────────────────┐                                │
│   │ CPU usage   │      │ compute desired  │                                │
│   │ memory usage│      │ replica count,   │                                │
│   │ custom      │      │ patch Deployment │                                │
│   │ metrics     │      │ replicas         │                                │
│   └─────────────┘      └──────────────────┘                                │
│                                                                            │
│   Scaling formula:                                                         │
│   desiredReplicas = ceil[currentReplicas * (currentMetric / targetMetric)] │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
4.2 HPA Configuration Example
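Before the full manifest, the scaling formula above can be sanity-checked with a quick sketch. The 10% tolerance default corresponds to the controller's `--horizontal-pod-autoscaler-tolerance` flag; everything else the real controller does (readiness handling, stabilization windows, behavior policies) is omitted here.

```python
import math

def desired_replicas(current, current_metric, target_metric, tolerance=0.1):
    """Core HPA formula: ceil(current * currentMetric / targetMetric).
    Within the tolerance band around 1.0 no scaling happens."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current
    return math.ceil(current * ratio)

# 4 replicas at 90% average CPU against a 70% target -> scale out to 6
print(desired_replicas(4, 90, 70))  # -> 6
```

Note how the ceiling biases the controller toward scaling out: 4 × (90/70) ≈ 5.14 becomes 6 replicas, not 5.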
yaml
# Autoscaling on CPU and memory
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2        # lower bound on replicas
  maxReplicas: 20       # upper bound on replicas
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70     # target average CPU utilization: 70%
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80     # target average memory utilization: 80%
  behavior:                        # scaling behavior tuning
    scaleUp:
      stabilizationWindowSeconds: 60    # scale-up stabilization window
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15               # at most +100% every 15 seconds
    scaleDown:
      stabilizationWindowSeconds: 300   # scale-down stabilization window (5 minutes)
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60               # at most -10% every 60 seconds
4.3 HPA with Custom Metrics
yaml
# Custom metrics served by e.g. the Prometheus Adapter
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
  # custom per-Pod metric: requests per second
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"      # 1000 RPS per Pod
  # external metric: message queue depth
  - type: External
    external:
      metric:
        name: rabbitmq_queue_messages
        selector:
          matchLabels:
            queue: task-queue
      target:
        type: AverageValue
        averageValue: "100"       # messages in the queue
4.4 Prerequisites for HPA
bash
# 1. Install the Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# 2. Verify that it works
kubectl top nodes
kubectl top pods
# 3. Create an HPA
kubectl autoscale deployment web-app --cpu-percent=70 --min=2 --max=10
# 4. Check HPA status
kubectl get hpa
kubectl describe hpa web-app-hpa
4.5 Load Testing the HPA
bash
# Load-test with ab or hey
# Install hey
go install github.com/rakyll/hey@latest
# Run the load test
hey -z 5m -c 100 http://<service-ip>/
# Watch the HPA scale out
watch kubectl get pods,hpa
5. Scheduling Policies
5.1 Node Affinity
Steer Pods onto specific nodes:
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-app
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          # hard constraint: must be satisfied
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: accelerator
                operator: In
                values:
                - nvidia-tesla-v100
                - nvidia-tesla-p100
          # soft constraints: preferred, weighted
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
          - weight: 50
            preference:
              matchExpressions:
              - key: zone
                operator: In
                values:
                - zone-a
      containers:
      - name: app
        image: gpu-app:1.0
        resources:
          limits:
            nvidia.com/gpu: 1
Operators:
| Operator | Meaning |
|---|---|
| In | label value is in the list |
| NotIn | label value is not in the list |
| Exists | label key exists (value ignored) |
| DoesNotExist | label key does not exist |
| Gt | label value is greater than the given value |
| Lt | label value is less than the given value |
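These operators can be modeled with a toy evaluator. Edge-case behavior for missing keys (here: NotIn matching when the key is absent, Gt/Lt failing) follows the usual label-selector reading and is a simplification; check the API reference for the exact rules.

```python
# Toy evaluation of one matchExpressions requirement against node labels.

def match(labels: dict, expr: dict) -> bool:
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return labels.get(key) in values
    if op == "NotIn":
        # also matches when the key is absent (label-selector semantics)
        return labels.get(key) not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    if op == "Gt":
        return key in labels and int(labels[key]) > int(values[0])
    if op == "Lt":
        return key in labels and int(labels[key]) < int(values[0])
    raise ValueError(f"unknown operator: {op}")

node = {"disktype": "ssd", "cpu-count": "16"}
print(match(node, {"key": "disktype", "operator": "In", "values": ["ssd"]}))  # True
print(match(node, {"key": "cpu-count", "operator": "Gt", "values": ["8"]}))   # True
```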
5.2 Pod Affinity and Anti-Affinity
Control where Pods run relative to other Pods:
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  template:
    spec:
      affinity:
        # Pod affinity: co-locate with certain Pods
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - cache
            topologyKey: kubernetes.io/hostname
        # Pod anti-affinity: avoid nodes running certain Pods
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - web-app
              topologyKey: kubernetes.io/hostname
      containers:
      - name: app
        image: web-app:1.0
Topology domains (topologyKey):
| Topology key | Scope |
|---|---|
| kubernetes.io/hostname | per node |
| topology.kubernetes.io/zone | per availability zone |
| topology.kubernetes.io/region | per region |
5.3 Taints and Tolerations
┌─────────────────────────────────────────────────────────────────┐
│                    Taints and tolerations                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Node (Taint)                      Pod (Toleration)            │
│   ┌─────────────────────────────┐   ┌──────────────────────┐    │
│   │ key=value:NoSchedule        │   │ key: key             │    │
│   │ key=value:NoExecute         │   │ operator: Equal      │    │
│   │ key=value:PreferNoSchedule  │   │ value: value         │    │
│   └─────────────────────────────┘   │ effect: NoSchedule   │    │
│                                     └──────────────────────┘    │
│                                                                 │
│   Effects:                                                      │
│   ├─ NoSchedule:       do not schedule new Pods                 │
│   ├─ PreferNoSchedule: avoid scheduling if possible             │
│   └─ NoExecute:        also evict already-running Pods          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Adding a taint:
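The matching rules sketched in the diagram can be modeled as a small function. This is a simplification; the Kubernetes docs cover the complete rules, e.g. `NoExecute` eviction timing via `tolerationSeconds`.

```python
# Simplified check: does one toleration tolerate one taint?

def tolerates(taint: dict, tol: dict) -> bool:
    # An empty effect in the toleration matches any taint effect.
    if tol.get("effect") and tol["effect"] != taint["effect"]:
        return False
    if tol.get("operator", "Equal") == "Exists":
        # Exists with an empty key tolerates every taint.
        return not tol.get("key") or tol["key"] == taint["key"]
    # Equal: key and value must both match.
    return tol.get("key") == taint["key"] and tol.get("value") == taint["value"]

taint = {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}
print(tolerates(taint, {"key": "dedicated", "operator": "Equal",
                        "value": "gpu", "effect": "NoSchedule"}))  # True
print(tolerates(taint, {"operator": "Exists"}))                    # True
```

A Pod is repelled by a node only if some taint on the node is tolerated by none of the Pod's tolerations, which is why the bare `operator: Exists` toleration (shown below) effectively disables the mechanism.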
bash
# Add a taint to a node
kubectl taint nodes node1 dedicated=gpu:NoSchedule
# Show a node's taints
kubectl describe node node1 | grep Taints
# Remove the taint (note the trailing "-")
kubectl taint nodes node1 dedicated=gpu:NoSchedule-
Tolerating a taint in a Pod:
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-app
spec:
  template:
    spec:
      tolerations:
      # exact match on key, value, and effect
      - key: "dedicated"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      # match on key only (any value)
      - key: "dedicated"
        operator: "Exists"
        effect: "NoSchedule"
      # tolerate every taint (not recommended in production)
      - operator: "Exists"
      containers:
      - name: app
        image: gpu-app:1.0
Common use cases:
bash
# Use case 1: dedicated nodes
kubectl taint nodes node-gpu-1 dedicated=gpu:NoSchedule
# Use case 2: maintenance mode
kubectl taint nodes node-1 maintenance=true:NoExecute
# Use case 3: controlling Pod distribution
kubectl taint nodes node-1 zone=a:NoSchedule
kubectl taint nodes node-2 zone=b:NoSchedule
5.4 Node Selector
The simplest way to pick nodes:
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ssd-app
spec:
  template:
    spec:
      nodeSelector:
        disktype: ssd
        environment: production
      containers:
      - name: app
        image: myapp:1.0
bash
# Label the nodes
kubectl label nodes node-1 disktype=ssd
kubectl label nodes node-1 environment=production
# Show node labels
kubectl get nodes --show-labels
6. Hands-On: A Highly Available Deployment
6.1 Deploying a Multi-Zone Highly Available App
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ha-web-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: ha-web
  template:
    metadata:
      labels:
        app: ha-web
    spec:
      affinity:
        # anti-affinity: spread replicas of this app across nodes
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - ha-web
            topologyKey: kubernetes.io/hostname
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - ha-web
              topologyKey: topology.kubernetes.io/zone
        # prefer SSD nodes
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
      containers:
      - name: web
        image: nginx:1.25
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"
        ports:
        - containerPort: 80
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ha-web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ha-web-app
  minReplicas: 6
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
6.2 Verifying the HA Setup
bash
# Check how the Pods are spread
kubectl get pods -o wide -l app=ha-web
# Check node labels
kubectl get nodes --show-labels
# Simulate a node failure and watch the Pods reschedule
kubectl drain node-1 --ignore-daemonsets
kubectl get pods -o wide -l app=ha-web -w
7. Summary
Core Concepts Recap
Resource management
├── Requests vs. Limits
│   ├── Requests: scheduling input, guaranteed resources
│   └── Limits: runtime ceiling, enforced by throttling / OOM kill
│
├── QoS classes
│   ├── Guaranteed: requests=limits
│   ├── Burstable: has requests but not Guaranteed
│   └── BestEffort: no resources configured
│
└── Quotas
    ├── ResourceQuota: namespace-wide caps
    └── LimitRange: per-container defaults and bounds
Autoscaling
└── HPA
    ├── based on CPU/memory utilization
    ├── based on custom metrics
    └── scale-up/scale-down behavior tuning
Scheduling policies
├── Node selection
│   ├── nodeSelector: simple label match
│   └── nodeAffinity: expressive node selection
│
├── Pod placement
│   ├── podAffinity: co-locate Pods
│   └── podAntiAffinity: spread Pods apart
│
└── Taints and tolerations
    ├── Taint: node repels Pods
    └── Toleration: Pod tolerates the taint
Best-Practice Checklist
yaml
# ✅ Recommended production baseline
apiVersion: apps/v1
kind: Deployment
metadata:
  name: production-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: prod-app
  template:
    metadata:
      labels:
        app: prod-app
    spec:
      # 1. Set resource requests and limits
      containers:
      - name: app
        image: myapp:1.0
        resources:
          requests:
            cpu: "100m"
            memory: "256Mi"
          limits:
            cpu: "1000m"
            memory: "1Gi"
      # 2. Spread replicas for high availability
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - prod-app
              topologyKey: kubernetes.io/hostname
      # 3. Graceful shutdown
      terminationGracePeriodSeconds: 60
---
# 4. Attach an HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prod-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: production-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # 5-minute stabilization window
Coming Up Next
In the next session, "Managing Applications with Helm Charts", we will cover:
- Helm concepts and how it works
- Installing and managing applications with Helm
- Working with chart repositories
- Customizing values
- The Helm release lifecycle
💡 Study tips:
- Create Pods in each QoS class in a test environment and compare how they behave under node pressure
- Configure an HPA for an application and load-test it to watch the autoscaling in action
- Use affinity and anti-affinity to build a highly available deployment