
Monitoring and Logging

Business Scenario

Cloud Cafe's system is up and running, but we need to monitor its runtime state in real time so we can locate and resolve problems quickly. To make the system observable, we will deploy monitoring and logging infrastructure.

Requirements:

  • Deploy the Prometheus monitoring system
  • Deploy Grafana dashboards
  • Collect application logs
  • Configure alerting rules

Learning Objectives

After completing this lesson, you will be able to:

  • Deploy and configure Prometheus
  • Use Grafana and build dashboards
  • Collect and analyze logs
  • Configure alerting rules
  • Implement system observability

Prerequisites

1. Verify the environment

bash
# Check the namespace
kubectl get namespace cloud-cafe

# Check existing resources
kubectl get all -n cloud-cafe

# Check metrics-server
kubectl get pods -n kube-system | grep metrics

2. Create the monitoring namespace

bash
# Create the monitoring namespace
kubectl create namespace monitoring

# List namespaces
kubectl get namespaces

Hands-on Steps

Step 1: Deploy Prometheus

Concept: Prometheus is an open-source monitoring and alerting toolkit that collects and stores time-series data.
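Before configuring Prometheus, it helps to picture its data model. The following Python sketch (an illustration only, not Prometheus internals) shows the core idea: each time series is identified by a metric name plus a set of labels, and stores timestamped samples appended at every scrape.

```python
import time

# Illustrative sketch of Prometheus's data model: a time series is
# (metric name + label set) -> append-only list of (timestamp, value) samples.
class TimeSeries:
    def __init__(self, name, labels):
        self.name = name
        self.labels = dict(labels)
        self.samples = []  # list of (unix_timestamp, float_value)

    def append(self, value, timestamp=None):
        self.samples.append((timestamp or time.time(), value))

    def identity(self):
        # Series identity is the name plus sorted label pairs, e.g.
        # order_requests_total{method="GET",status="200"}
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(self.labels.items()))
        return f"{self.name}{{{label_str}}}"

series = TimeSeries("order_requests_total", {"method": "GET", "status": "200"})
series.append(1.0, timestamp=1700000000)
series.append(2.0, timestamp=1700000015)  # next scrape, 15s later (scrape_interval)
print(series.identity())  # order_requests_total{method="GET",status="200"}
```

A new label value (say `status="500"`) creates a brand-new series, which is why high-cardinality labels are expensive.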

1.1 Create the Prometheus ConfigMap

bash
# Create the Prometheus configuration
# Note: the literal is wrapped in single quotes, so the YAML inside must use
# double quotes (nested single quotes would terminate the shell string early)
kubectl create configmap prometheus-config \
  --from-literal=prometheus.yml='
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

  - job_name: "kubernetes-nodes"
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        regex: "(.*):10250"
        replacement: "${1}:9100"
        target_label: __address__
' \
  -n monitoring

# View the ConfigMap
kubectl get configmap prometheus-config -n monitoring
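The trickiest relabel rule above rewrites the scrape address: Prometheus joins `__address__` and the `prometheus.io/port` annotation with `;` (the default separator), then the regex swaps in the annotated port. A small Python sketch (an approximation, not Prometheus code) of that rewrite:

```python
import re

# Same regex as the relabel rule: host, optional existing :port, then the
# annotation port after the ';' separator.
pattern = re.compile(r"([^:]+)(?::\d+)?;(\d+)")

def relabel_address(address, annotation_port):
    joined = f"{address};{annotation_port}"  # Prometheus joins source_labels with ';'
    m = pattern.fullmatch(joined)
    return f"{m.group(1)}:{m.group(2)}" if m else address

print(relabel_address("10.244.1.7", "5000"))       # 10.244.1.7:5000
print(relabel_address("10.244.1.7:8080", "5000"))  # the annotated port wins
```

Both calls yield `10.244.1.7:5000`, so the annotation always decides which port gets scraped.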

1.2 Deploy Prometheus

bash
# Create the Prometheus Deployment YAML file
cat > prometheus-deployment.yaml << 'EOF'
# Prometheus monitoring Deployment
# Purpose: deploy the Prometheus monitoring server
# Features:
#   - collect and store time-series metrics
#   - serve the PromQL query API
#   - evaluate alerting rules
# Configuration notes:
#   - config.file: path to the main configuration file
#   - storage.tsdb.path: time-series data storage path
#   - web.enable-lifecycle: enable config hot reload (via /-/reload)
# Prerequisite: the prometheus-config ConfigMap must already exist
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  replicas: 1                   # single-node Prometheus (HA requires Thanos or Cortex)
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus:latest    # official Prometheus image
        args:
          # Prometheus startup flags
          - '--config.file=/etc/prometheus/prometheus.yml'           # main config file
          - '--storage.tsdb.path=/prometheus'                        # TSDB storage path
          - '--web.console.libraries=/usr/share/prometheus/console_libraries'    # console libraries
          - '--web.console.templates=/usr/share/prometheus/consoles'             # console templates
          - '--web.enable-lifecycle'                                 # enable lifecycle API (hot reload)
        ports:
        - containerPort: 9090
          name: web                 # web UI and API port
        # volume mounts
        volumeMounts:
        - name: prometheus-config
          mountPath: /etc/prometheus    # configuration directory
        - name: prometheus-data
          mountPath: /prometheus        # data directory
        # resource limits (Prometheus needs plenty of memory)
        resources:
          requests:
            memory: "512Mi"       # minimum memory
            cpu: "250m"           # minimum CPU (0.25 cores)
          limits:
            memory: "1Gi"         # memory cap
            cpu: "500m"           # CPU cap (0.5 cores)
        # health checks
        livenessProbe:
          httpGet:
            path: /-/healthy      # Prometheus health endpoint
            port: 9090
          initialDelaySeconds: 30    # initial delay (Prometheus needs time to replay its WAL)
          periodSeconds: 10          # check interval
        readinessProbe:
          httpGet:
            path: /-/ready        # Prometheus readiness endpoint
            port: 9090
          initialDelaySeconds: 10    # initial delay
          periodSeconds: 5           # check interval
      # volumes
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config     # references the Prometheus config ConfigMap
      - name: prometheus-data
        emptyDir: {}                 # emptyDir (use a PVC in production)
EOF

# Apply the Deployment
kubectl apply -f prometheus-deployment.yaml

# Wait for Prometheus to become ready
kubectl rollout status deployment/prometheus -n monitoring

# View the Pods
kubectl get pods -n monitoring

1.3 Create the Prometheus Service

bash
# Create the Prometheus Service
kubectl expose deployment prometheus \
  --port=9090 \
  --target-port=9090 \
  --name=prometheus-svc \
  -n monitoring

# View the Service
kubectl get svc prometheus-svc -n monitoring

1.4 Access Prometheus

bash
# Create a NodePort Service
kubectl expose deployment prometheus \
  --port=9090 \
  --target-port=9090 \
  --name=prometheus-nodeport \
  --type=NodePort \
  -n monitoring

# Get the NodePort
PROMETHEUS_PORT=$(kubectl get svc prometheus-nodeport -n monitoring -o jsonpath='{.spec.ports[0].nodePort}')
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')

echo "Prometheus URL: http://$NODE_IP:$PROMETHEUS_PORT"

Open http://192.168.56.10:xxxxx in your browser.

You should see the Prometheus web UI.


Step 2: Instrument the Application with Prometheus Metrics

We need to add Prometheus metrics to the application so that Prometheus has data to collect.

2.1 Update the backend service to expose Prometheus metrics

First, create/edit the order-backend-deployment.yaml file:

bash
# Create/edit the backend Deployment file
vim order-backend-deployment.yaml

Contents of order-backend-deployment.yaml (with Prometheus metrics added):
yaml
# Order backend service Deployment
# This change: expose Prometheus metrics
# What changed:
#   1. metadata.annotations gains prometheus.io/scrape annotations for auto-discovery
#   2. pip install adds the prometheus-client dependency
#   3. the Python code gains metric definitions, decorators, and an exposition endpoint
# Prerequisites: Redis, MySQL, and the related ConfigMaps/Secrets must already exist
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-backend
  namespace: cloud-cafe
  labels:
    app: order-backend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: order-backend
  template:
    metadata:
      labels:
        app: order-backend
      # [added: start] Prometheus auto-discovery annotations
      # These annotations tell Prometheus how to scrape this application's metrics
      annotations:
        prometheus.io/scrape: "true"      # enable scraping
        prometheus.io/port: "5000"        # metrics port
        prometheus.io/path: "/metrics"    # metrics path
      # [added: end]
    spec:
      containers:
      - name: order-backend
        image: python:3.9-slim
        command: ["/bin/sh", "-c"]
        args:
          - |
            # [changed: start] add the prometheus-client dependency
            # prometheus-client is the official Prometheus client library for Python
            pip install flask pymysql flask-cors redis prometheus-client
            # [changed: end]
            cat > /app/app.py << 'PYEOF'
            from flask import Flask, request, jsonify
            from flask_cors import CORS
            import pymysql
            import redis
            import os
            import json
            # [added: start] import the Prometheus client library
            from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
            # Counter: monotonically increasing counter, good for request totals
            # Histogram: distribution of observations, good for request latency
            # [added: end]
            from datetime import timedelta

            app = Flask(__name__)
            CORS(app)

            # [added: start] define Prometheus metrics
            # order_requests_total: total HTTP requests, labeled by method, endpoint, and status
            order_requests_total = Counter('order_requests_total', 'Total number of order requests', ['method', 'endpoint', 'status'])
            # order_duration_seconds: distribution of request handling time
            order_duration_seconds = Histogram('order_request_duration_seconds', 'Order request duration')
            # db_query_duration_seconds: distribution of database query time
            db_query_duration_seconds = Histogram('db_query_duration_seconds', 'Database query duration')
            # cache_operations_total: cache operation counts (hit/miss)
            cache_operations_total = Counter('cache_operations_total', 'Total number of cache operations', ['operation', 'status'])
            # [added: end]

            # database configuration
            db_config = {
                'host': os.getenv('DB_HOST', 'mysql-service'),
                'port': int(os.getenv('DB_PORT', 3306)),
                'user': os.getenv('DB_USER', 'cafeadmin'),
                'password': os.getenv('DB_PASSWORD', 'userpassword123'),
                'database': os.getenv('DB_NAME', 'cloudcafe')
            }

            # Redis configuration
            redis_client = redis.Redis(
                host=os.getenv('REDIS_HOST', 'redis-svc'),
                port=int(os.getenv('REDIS_PORT', 6379)),
                decode_responses=True
            )

            def get_db_connection():
                return pymysql.connect(**db_config)

            # [added: start] Prometheus metrics exposition endpoint
            # Prometheus scrapes this endpoint to collect metric samples
            @app.route('/metrics')
            def metrics():
                return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}
            # [added: end]

            @app.route('/health')
            @order_duration_seconds.time()  # [added] decorator: records the function's execution time
            def health():
                order_requests_total.labels(method='GET', endpoint='/health', status='200').inc()  # [added] increment counter
                return jsonify({'status': 'healthy', 'redis': 'connected' if redis_client.ping() else 'disconnected'})

            @app.route('/orders', methods=['GET'])
            @order_duration_seconds.time()  # [added] record request duration
            def get_orders():
                try:
                    # try the cache first
                    cache_key = 'orders:all'
                    cached_orders = redis_client.get(cache_key)
                    
                    if cached_orders:
                        cache_operations_total.labels(operation='get', status='hit').inc()  # [added] cache hit counter
                        app.logger.info('Orders retrieved from cache')
                        order_requests_total.labels(method='GET', endpoint='/orders', status='200').inc()  # [added]
                        return jsonify(json.loads(cached_orders))
                    
                    # cache miss: read from the database
                    with db_query_duration_seconds.time():  # [added] record database query duration
                        conn = get_db_connection()
                        cursor = conn.cursor(pymysql.cursors.DictCursor)
                        cursor.execute('SELECT * FROM orders ORDER BY order_time DESC LIMIT 20')
                        orders = cursor.fetchall()
                        conn.close()
                    
                    # cache the result with a 60-second TTL
                    redis_client.setex(cache_key, 60, json.dumps(orders))
                    cache_operations_total.labels(operation='get', status='miss').inc()  # [added] cache miss counter
                    app.logger.info('Orders retrieved from database and cached')
                    
                    order_requests_total.labels(method='GET', endpoint='/orders', status='200').inc()  # [added]
                    return jsonify(orders)
                except Exception as e:
                    app.logger.error(f'Error getting orders: {str(e)}')
                    order_requests_total.labels(method='GET', endpoint='/orders', status='500').inc()  # [added] error counter
                    return jsonify({'error': str(e)}), 500

            @app.route('/orders', methods=['POST'])
            @order_duration_seconds.time()  # [added] record request duration
            def create_order():
                try:
                    data = request.json
                    with db_query_duration_seconds.time():  # [added] record database query duration
                        conn = get_db_connection()
                        cursor = conn.cursor()
                        cursor.execute(
                            'INSERT INTO orders (customer_name, coffee_type, quantity, total_price) VALUES (%s, %s, %s, %s)',
                            (data['customer_name'], data['coffee_type'], data['quantity'], data['total_price'])
                        )
                        conn.commit()
                        order_id = cursor.lastrowid
                        conn.close()
                    
                    # invalidate the cache
                    redis_client.delete('orders:all')
                    cache_operations_total.labels(operation='delete', status='success').inc()  # [added] cache delete counter
                    app.logger.info(f'Order {order_id} created, cache cleared')
                    
                    order_requests_total.labels(method='POST', endpoint='/orders', status='201').inc()  # [added]
                    return jsonify({'order_id': order_id, 'message': 'Order created successfully'}), 201
                except Exception as e:
                    app.logger.error(f'Error creating order: {str(e)}')
                    order_requests_total.labels(method='POST', endpoint='/orders', status='500').inc()  # [added] error counter
                    return jsonify({'error': str(e)}), 500

            @app.route('/cache/stats', methods=['GET'])
            @order_duration_seconds.time()  # [added] record request duration
            def cache_stats():
                try:
                    info = redis_client.info('stats')
                    order_requests_total.labels(method='GET', endpoint='/cache/stats', status='200').inc()  # [added]
                    return jsonify({
                        'total_commands_processed': info.get('total_commands_processed', 0),
                        'total_connections_received': info.get('total_connections_received', 0),
                        'keyspace_hits': info.get('keyspace_hits', 0),
                        'keyspace_misses': info.get('keyspace_misses', 0)
                    })
                except Exception as e:
                    order_requests_total.labels(method='GET', endpoint='/cache/stats', status='500').inc()  # [added] error counter
                    return jsonify({'error': str(e)}), 500

            if __name__ == '__main__':
                app.run(host='0.0.0.0', port=5000)
            PYEOF
            python /app/app.py
        ports:
        - containerPort: 5000
        env:
        - name: DB_HOST
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: DB_HOST
        - name: DB_PORT
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: DB_PORT
        - name: DB_USER
          valueFrom:
            configMapKeyRef:
              name: mysql-config
              key: MYSQL_USER
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mysql-secret
              key: MYSQL_PASSWORD
        - name: DB_NAME
          valueFrom:
            configMapKeyRef:
              name: mysql-config
              key: MYSQL_DATABASE
        - name: REDIS_HOST
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: REDIS_HOST
        - name: REDIS_PORT
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: REDIS_PORT
        - name: FLASK_ENV
          valueFrom:
            configMapKeyRef:
              name: order-backend-config
              key: FLASK_ENV
        - name: FLASK_DEBUG
          valueFrom:
            configMapKeyRef:
              name: order-backend-config
              key: FLASK_DEBUG
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "200m"
        livenessProbe:
          httpGet:
            path: /health
            port: 5000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 5000
          initialDelaySeconds: 10
          periodSeconds: 5
        volumeMounts:
        - name: app-logs
          mountPath: /app/logs
      volumes:
      - name: app-logs
        persistentVolumeClaim:
          claimName: app-log-pvc

After saving the file, apply it:

bash
# Apply the backend Deployment
kubectl apply -f order-backend-deployment.yaml

# Wait for the rollout to finish
kubectl rollout status deployment/order-backend -n cloud-cafe

# View the Pods
kubectl get pods -n cloud-cafe

2.2 Test the Prometheus metrics

bash
# Get the backend Pod name
BACKEND_POD=$(kubectl get pod -l app=order-backend -n cloud-cafe -o jsonpath='{.items[0].metadata.name}')

# Test the metrics endpoint
kubectl exec -it $BACKEND_POD -n cloud-cafe -- curl http://localhost:5000/metrics

# Inspect the metrics in the Prometheus UI
# Open: http://$NODE_IP:$PROMETHEUS_PORT
# Enter in the query box: order_requests_total
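The curl above prints the Prometheus text exposition format: `# HELP`/`# TYPE` comment lines followed by `name{labels} value` samples. A rough parser sketch (simplified; the real format has more cases such as timestamps and escaping) shows the shape of that output:

```python
# Minimal sketch of parsing the text format that /metrics returns.
def parse_metrics(text):
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip HELP/TYPE comment lines
            continue
        name_part, _, value = line.rpartition(" ")  # value is the last field
        samples[name_part] = float(value)
    return samples

example = """\
# HELP order_requests_total Total number of order requests
# TYPE order_requests_total counter
order_requests_total{method="GET",endpoint="/orders",status="200"} 42.0
order_requests_total{method="POST",endpoint="/orders",status="201"} 7.0
"""
parsed = parse_metrics(example)
print(parsed['order_requests_total{method="GET",endpoint="/orders",status="200"}'])  # 42.0
```

Each labeled combination is its own sample line, which is exactly what you will see as separate series in the Prometheus query UI.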

Step 3: Deploy Grafana

Concept: Grafana is an open-source visualization tool that builds dashboards on top of the data Prometheus collects.

3.1 Deploy Grafana

bash
# Create the Grafana Deployment YAML file
cat > grafana-deployment.yaml << 'EOF'
# Grafana visualization platform Deployment
# Purpose: deploy the Grafana dashboard service
# Features:
#   - connect to Prometheus as a data source
#   - create and display monitoring dashboards
#   - support alert notifications
# Environment variables:
#   - GF_SECURITY_ADMIN_USER: admin username
#   - GF_SECURITY_ADMIN_PASSWORD: admin password (use a Secret in production)
#   - GF_INSTALL_PLUGINS: plugins to preinstall
# Prerequisite: deploy Prometheus first
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
  labels:
    app: grafana
spec:
  replicas: 1                   # single-node Grafana (HA requires shared storage)
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:latest    # official Grafana image
        ports:
        - containerPort: 3000
          name: web                 # web UI port
        # environment variables
        env:
        - name: GF_SECURITY_ADMIN_USER
          value: "admin"            # admin username
        - name: GF_SECURITY_ADMIN_PASSWORD
          value: "admin"            # admin password (default; change it)
        - name: GF_INSTALL_PLUGINS
          value: ""                 # plugins to preinstall (comma-separated)
        # resource limits
        resources:
          requests:
            memory: "256Mi"         # minimum memory
            cpu: "100m"             # minimum CPU (0.1 cores)
          limits:
            memory: "512Mi"         # memory cap
            cpu: "200m"             # CPU cap (0.2 cores)
        # health checks
        livenessProbe:
          httpGet:
            path: /api/health       # Grafana health API
            port: 3000
          initialDelaySeconds: 30    # initial delay (Grafana takes a while to start)
          periodSeconds: 10          # check interval
        readinessProbe:
          httpGet:
            path: /api/health       # Grafana readiness API
            port: 3000
          initialDelaySeconds: 10    # initial delay
          periodSeconds: 5           # check interval
        # data persistence
        volumeMounts:
        - name: grafana-data
          mountPath: /var/lib/grafana    # Grafana data directory (dashboards, settings)
      volumes:
      - name: grafana-data
        emptyDir: {}                 # emptyDir (use a PVC in production)
EOF

# Apply the Deployment
kubectl apply -f grafana-deployment.yaml

# Wait for Grafana to become ready
kubectl rollout status deployment/grafana -n monitoring

# View the Pods
kubectl get pods -n monitoring

3.2 Create the Grafana Service

bash
# Create the Grafana Service
kubectl expose deployment grafana \
  --port=3000 \
  --target-port=3000 \
  --name=grafana-svc \
  -n monitoring

# Create a NodePort Service
kubectl expose deployment grafana \
  --port=3000 \
  --target-port=3000 \
  --name=grafana-nodeport \
  --type=NodePort \
  -n monitoring

# Get the NodePort
GRAFANA_PORT=$(kubectl get svc grafana-nodeport -n monitoring -o jsonpath='{.spec.ports[0].nodePort}')

echo "Grafana URL: http://$NODE_IP:$GRAFANA_PORT"
echo "Username: admin"
echo "Password: admin"

Open http://192.168.56.10:xxxxx in your browser.

Log in with username admin and password admin.

3.3 Configure the Grafana data source

  1. Log in to Grafana
  2. In the left menu, click "Configuration" → "Data Sources"
  3. Click "Add data source"
  4. Select "Prometheus"
  5. Configure the data source:
    • Name: Prometheus
    • URL: http://prometheus-svc.monitoring.svc.cluster.local:9090
  6. Click "Save & Test"

3.4 Create a dashboard

  1. In the left menu, click "+" → "Dashboard"
  2. Click "Add new panel"
  3. Configure the panel:
    • Title: Order Requests Total
    • Query: sum(rate(order_requests_total[5m])) by (endpoint, status)
    • Visualization: Time series
  4. Click "Apply"
  5. Add more panels:
    • Order Request Duration: histogram_quantile(0.95, rate(order_request_duration_seconds_bucket[5m]))
    • Database Query Duration: histogram_quantile(0.95, rate(db_query_duration_seconds_bucket[5m]))
    • Cache Operations: sum(rate(cache_operations_total[5m])) by (operation, status)
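The `histogram_quantile(0.95, ...)` panels above estimate a latency percentile from cumulative bucket counts. A Python sketch of the estimation (same linear-interpolation idea as PromQL, simplified and without the +Inf edge cases) makes the math concrete:

```python
# buckets: sorted (upper_bound, cumulative_count) pairs, last bound = +Inf,
# as exposed by a Prometheus Histogram's *_bucket series.
def histogram_quantile(q, buckets):
    total = buckets[-1][1]        # the +Inf bucket counts every observation
    rank = q * total              # the observation we are looking for
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile falls in the open-ended bucket
            # linearly interpolate inside the bucket containing the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# 100 observations: 50 under 0.1s, 90 under 0.5s, 99 under 1s
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # ≈ 0.778 (95th percentile)
```

Because the answer is interpolated within a bucket, the accuracy of a histogram quantile depends entirely on how well the bucket boundaries match your real latency range.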

Step 4: Configure Alerting Rules

4.1 Create the alerting rules ConfigMap

bash
# Create the alerting rules
kubectl create configmap prometheus-rules \
  --from-literal=alerts.yml='
groups:
  - name: cloud-cafe-alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(order_requests_total{status="500"}[5m])) /
          sum(rate(order_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for the last 5 minutes"

      - alert: HighResponseTime
        expr: |
          histogram_quantile(0.95, rate(order_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time detected"
          description: "95th percentile response time is above 1 second"

      - alert: HighCacheMissRate
        expr: |
          sum(rate(cache_operations_total{operation="get",status="miss"}[5m])) /
          sum(rate(cache_operations_total{operation="get"}[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High cache miss rate detected"
          description: "Cache miss rate is above 50% for the last 5 minutes"

      - alert: PodNotReady
        expr: |
          kube_pod_status_ready{namespace="cloud-cafe",condition="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod not ready"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is not ready"
' \
  -n monitoring

# View the ConfigMap
kubectl get configmap prometheus-rules -n monitoring

Note: the PodNotReady rule relies on kube_pod_status_ready, a metric exported by kube-state-metrics. This lesson does not deploy kube-state-metrics, so that rule will report no data until you install it.
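The HighErrorRate expression can be unpacked with a small sketch (not Prometheus internals): `rate()` computes the per-second increase of a counter over the window, and the alert compares the ratio of error rate to total rate against the 5% threshold.

```python
# samples: [(timestamp, counter_value), ...] within the window, oldest first.
# Counters only increase, so the per-second rate is the delta over elapsed time
# (counter resets are ignored in this sketch).
def rate(samples):
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

errors_500   = [(0, 10.0), (300, 40.0)]    # 30 errors over the 5m window
all_requests = [(0, 100.0), (300, 500.0)]  # 400 requests over the 5m window

error_rate = rate(errors_500) / rate(all_requests)
print(error_rate)          # 0.075, i.e. 7.5% of requests failed
print(error_rate > 0.05)   # True -> the alert enters "pending", firing after for: 5m
```

The `for: 5m` clause means the condition must hold continuously for 5 minutes before the alert actually fires, which filters out short spikes.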

4.2 Update the Prometheus configuration

bash
# Update the Prometheus configuration to load the alerting rules
kubectl create configmap prometheus-config \
  --from-literal=prometheus.yml='
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "/etc/prometheus/rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: []

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

  - job_name: "kubernetes-nodes"
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        regex: "(.*):10250"
        replacement: "${1}:9100"
        target_label: __address__
' \
  -n monitoring \
  --dry-run=client -o yaml | kubectl apply -f -

# Create the updated Prometheus Deployment YAML file (mounting the alerting rules)
cat > prometheus-deployment.yaml << 'EOF'
# Prometheus monitoring Deployment (with alerting rules)
# This change: mount the alerting rules
# What changed:
#   1. new prometheus-rules volume and volumeMount
#   2. rule_files is set in prometheus.yml (no extra startup args needed)
# Alerting rules notes:
#   - rule files are mounted at /etc/prometheus/rules/
#   - Prometheus loads and evaluates them automatically
# Prerequisites: the prometheus-config and prometheus-rules ConfigMaps must already exist
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus:latest
        args:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
          - '--web.console.libraries=/usr/share/prometheus/console_libraries'
          - '--web.console.templates=/usr/share/prometheus/consoles'
          - '--web.enable-lifecycle'
        ports:
        - containerPort: 9090
          name: web
        volumeMounts:
        - name: prometheus-config
          mountPath: /etc/prometheus
        # [added: start] mount the alerting rules
        - name: prometheus-rules
          mountPath: /etc/prometheus/rules    # alerting rules directory
        # [added: end]
        - name: prometheus-data
          mountPath: /prometheus
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: 9090
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /-/ready
            port: 9090
          initialDelaySeconds: 10
          periodSeconds: 5
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
      # [added: start] alerting rules volume
      - name: prometheus-rules
        configMap:
          name: prometheus-rules    # references the alerting rules ConfigMap
      # [added: end]
      - name: prometheus-data
        emptyDir: {}
EOF

# Apply the updated Deployment
kubectl apply -f prometheus-deployment.yaml

# Wait for the rollout to finish
kubectl rollout status deployment/prometheus -n monitoring

# View the Pods
kubectl get pods -n monitoring

4.3 View the alerting rules

In the Prometheus UI:

  1. Open http://$NODE_IP:$PROMETHEUS_PORT
  2. Click "Status" → "Rules"
  3. Review all alerting rules

Step 5: Log Collection

5.1 View application logs

bash
# View backend service logs
kubectl logs -l app=order-backend -n cloud-cafe --tail=50

# Follow logs in real time
kubectl logs -l app=order-backend -n cloud-cafe -f

# View logs for a specific Pod
kubectl logs <pod-name> -n cloud-cafe

# View the previous container's logs (after a restart)
kubectl logs <pod-name> -n cloud-cafe --previous

📌 About the kubectl logs flags

kubectl logs shows the logs of a Pod's containers and supports several ways to filter and follow them.

Common flags

  Flag                 Meaning                                Example
  --tail=N             show only the last N lines             --tail=50 for the last 50 lines
  -f / --follow        stream log output                      like tail -f
  --previous / -p      show the previous container's logs     debugging after a restart
  --since=DURATION     show logs from the last period         --since=10m for the last 10 minutes
  --since-time=TIME    show logs after a given time           ISO 8601 format
  -l / --selector      select Pods by label                   -l app=myapp

Usage scenarios

bash
# Debug CrashLoopBackOff (inspect the last failed run)
kubectl logs my-pod --previous

# Follow logs and filter for errors
kubectl logs -l app=myapp -f | grep ERROR

# Scan the last 24 hours for errors
kubectl logs -l app=myapp --since=24h | grep -i error

Tip: if a Pod has multiple containers, add -c <container-name> to pick one. Note that kubectl logs has no --all-namespaces flag; it works within a single namespace.

5.2 Searching logs with kubectl

bash
# Search logs for a keyword
kubectl logs -l app=order-backend -n cloud-cafe | grep "error"

# Logs from the last 10 minutes
kubectl logs -l app=order-backend -n cloud-cafe --since=10m

# Logs since a specific point in time
kubectl logs -l app=order-backend -n cloud-cafe --since-time="2024-01-01T00:00:00Z"

# Logs from all containers of the matching Pods
kubectl logs -l app=order-backend -n cloud-cafe --all-containers

Verification and Testing

1. Check the status of all resources

bash
# View the Deployments
kubectl get deployment -n monitoring

# View the Pods
kubectl get pods -n monitoring

# View the Services
kubectl get svc -n monitoring

# View the ConfigMaps
kubectl get configmap -n monitoring

2. Test Prometheus

bash
# Open the Prometheus UI
echo "Prometheus URL: http://$NODE_IP:$PROMETHEUS_PORT"

# Test a query
curl -s "http://$NODE_IP:$PROMETHEUS_PORT/api/v1/query?query=order_requests_total" | jq

# View the scrape targets
curl -s "http://$NODE_IP:$PROMETHEUS_PORT/api/v1/targets" | jq
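The query API wraps results in a JSON envelope. A short Python sketch, using a hypothetical captured response rather than a live cluster, shows how to pull the values out:

```python
import json

# Hypothetical /api/v1/query response for order_requests_total (instant vector).
response_text = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "order_requests_total", "endpoint": "/orders", "status": "200"},
       "value": [1700000000, "42"]}
    ]
  }
}
"""
response = json.loads(response_text)
assert response["status"] == "success"
for item in response["data"]["result"]:
    labels = item["metric"]
    ts, value = item["value"]                    # the sample value arrives as a string
    print(labels.get("endpoint"), float(value))  # /orders 42.0
```

Note that sample values are JSON strings, not numbers, so convert them with `float()` before doing arithmetic.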

3. Test Grafana

bash
# Open the Grafana UI
echo "Grafana URL: http://$NODE_IP:$GRAFANA_PORT"
echo "Username: admin"
echo "Password: admin"

In Grafana:

  1. Configure the Prometheus data source
  2. Create a dashboard
  3. Inspect the monitoring data

4. Test the alerting rules

bash
# View the alerting rules in the Prometheus UI
# Open: http://$NODE_IP:$PROMETHEUS_PORT
# Click: Status → Rules

# View the current alerts
curl -s "http://$NODE_IP:$PROMETHEUS_PORT/api/v1/alerts" | jq

5. Test log collection

bash
# Generate some log entries
for i in {1..10}; do
  curl -X POST -H "Host: cloudcafe.local" -H "Content-Type: application/json" \
    -d "{\"customer_name\":\"LogTest$i\",\"coffee_type\":\"Latte\",\"quantity\":1,\"total_price\":25.00}" \
    http://$NODE_IP:$INGRESS_PORT/api/orders
done

# View the logs
kubectl logs -l app=order-backend -n cloud-cafe --tail=20

📝 Summary and Reflection

What this lesson covered

  1. Prometheus: collecting and storing monitoring data
  2. Grafana: visualizing monitoring data
  3. Prometheus metrics: Counter, Histogram, and other metric types
  4. Alerting rules: configuring and managing alerts
  5. Log collection: viewing and analyzing application logs

Key concepts

  • Observability: metrics, logs, and traces working together
  • Metric types: Counter (monotonically increasing), Gauge (goes up and down), Histogram (distribution of observations)
  • Alerting rules: alerts triggered by metric thresholds
  • Log analysis: locating problems through logs
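The three metric types can be contrasted with minimal sketches (illustrative only; use prometheus_client in real code):

```python
class Counter:
    """Monotonically increasing; rate() over it gives throughput."""
    def __init__(self): self.value = 0.0
    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount

class Gauge:
    """A value that can go up and down, e.g. current queue length."""
    def __init__(self): self.value = 0.0
    def set(self, v): self.value = v

class Histogram:
    """Counts observations into cumulative buckets for quantile estimation."""
    def __init__(self, bounds=(0.1, 0.5, 1.0, float("inf"))):
        self.bounds = bounds
        self.counts = [0] * len(bounds)
    def observe(self, v):
        for i, b in enumerate(self.bounds):
            if v <= b:
                self.counts[i] += 1  # cumulative: every bucket with bound >= v

requests = Counter(); requests.inc()
queue = Gauge(); queue.set(3); queue.set(1)   # gauges move in both directions
latency = Histogram(); latency.observe(0.3)   # lands in the 0.5, 1.0 and +Inf buckets
print(requests.value, queue.value, latency.counts)  # 1.0 1 [0, 1, 1, 1]
```

Rule of thumb: counts of events are Counters, current states are Gauges, and anything you want percentiles of is a Histogram.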

Questions to think about

  1. What is the difference between Prometheus and Grafana? How do they work together?
  2. How do Counter, Gauge, and Histogram differ? When would you use each?
  3. How would you deliver alert notifications? (Hint: Alertmanager)
  4. How would you implement distributed tracing? (Hint: Jaeger, Zipkin)
  5. How would you aggregate and analyze logs? (Hint: ELK Stack, Loki)

Best practices

  1. Choose metrics deliberately: avoid collecting everything; excess metrics hurt performance
  2. Set sensible alert thresholds: avoid alert storms
  3. Review alerting rules regularly: make sure they still work
  4. Retain logs: define a sensible log retention policy
  5. Monitor the monitoring stack: make sure Prometheus and Grafana themselves are monitored

Next Steps

In this lesson you deployed a monitoring stack and made the system observable.

The next lesson covers Helm and CI/CD for automated deployments.

Next lesson: 07-自动化部署.md


Cleaning Up

If you want to remove the resources created in this lesson:

bash
# Delete the monitoring resources
kubectl delete namespace monitoring

# Delete the other resources
kubectl delete hpa order-backend frontend -n cloud-cafe
kubectl delete deployment redis order-backend frontend -n cloud-cafe
kubectl delete svc redis-svc order-backend-svc frontend-svc -n cloud-cafe
kubectl delete statefulset mysql -n cloud-cafe
kubectl delete svc mysql-service -n cloud-cafe
kubectl delete pvc redis-pvc mysql-pvc app-log-pvc -n cloud-cafe
kubectl delete configmap redis-config frontend-html mysql-config app-config order-backend-config -n cloud-cafe
kubectl delete secret mysql-secret mysql-secret-manual -n cloud-cafe
kubectl delete ingress cloud-cafe-ingress -n cloud-cafe

# Delete the namespace
kubectl delete namespace cloud-cafe

Tip: if you plan to continue with the next lesson, keep these resources — the next lesson builds on them.
